Update - 46/47 recoveries are complete. We are communicating directly with the affected customer on the final recovery effort; incremental status updates will not be posted here. We will leave this incident open until the damaged storage pool has been rebuilt and we begin moving data back to it.
Mar 24, 2023 - 08:54 MDT
Update - Recovery of PetaLibrary active allocations impacted by the March 1 event continues. We have copied 39 of the 47 impacted allocations to alternate storage. These allocations are available at their normal path ("/pl/active/") and may be used immediately.

Of the eight allocations that have not yet been recovered, we have strong evidence that three are unused and have deprioritized those. The remaining five present a more challenging recovery scenario: we cannot yet project when we will know whether the data is retrievable, or how long recovery will take. We are continuing to work with a filesystem developer on this effort and will communicate updates as we have them.

Some questions have been raised about future plans to address this issue. As an immediate protection, we have taken steps to detect the condition that caused this incident. Longer term, we are exploring options for providing backups for PetaLibrary; more details will be shared through our stakeholder process as that option develops.

Mar 13, 2023 - 16:18 MDT
Update - The restore of one PetaLibrary allocation to an alternate storage location finished last night, and spot checks by the customer indicate that their data is intact. A restore of a second allocation began as soon as the first finished.

The success of a single allocation is promising, but we still need to handle each recovery on a case-by-case basis. We will provide another update on Monday.

Mar 10, 2023 - 14:15 MST
Update - We are attempting to recover the first PetaLibrary allocation now. The underlying storage pool is sufficiently damaged that it will never again be mounted in read/write mode, meaning we must copy all data elsewhere. This will take time, given the amount of data involved (hundreds of TB). This first recovery effort is expected to finish late today or early tomorrow, and we will share the results of the initial attempt.
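For a rough sense of why a copy of this size takes so long, the arithmetic below sketches the timescale. All numbers are hypothetical: the update only says "hundreds of TB", and real throughput depends on the health of the damaged pool, the network, and the file count.

```python
# Back-of-the-envelope copy-time estimate (illustrative only; the 300 TB
# and 1 GB/s figures below are assumptions, not numbers from the incident).
def copy_days(total_tb, throughput_mb_s):
    """Days needed to copy total_tb terabytes at a sustained rate of
    throughput_mb_s megabytes per second (decimal units: 1 TB = 1e6 MB)."""
    total_mb = total_tb * 1_000_000
    seconds = total_mb / throughput_mb_s
    return seconds / 86_400  # seconds per day

# Even at a sustained 1 GB/s, 300 TB takes roughly three and a half days,
# and degraded read-only pools rarely sustain anything close to that.
```

This is consistent with a single allocation restore spanning most of a day, as described in the later updates.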

If this initial recovery succeeds, that does not imply that all data will be recoverable, but does suggest to us that more data _should_ be recoverable. We will not be sure that this process can recover all data until all data is recovered into a new location. We will continue to update as this process proceeds and provide timelines as we learn more.

Mar 09, 2023 - 14:56 MST
Update - After attempting the standard ZFS recovery procedure (rolling the pool back to prior checkpoints), the impacted allocations still could not be imported. Our next step is to work with the ZFS developer group to determine whether there is a path to identify which data has been damaged.

We suspect (but cannot yet prove) that any damaged data is most likely limited to snapshots taken while the system was in maintenance, which would make customer data less likely to be affected. The ZFS development community has been very helpful, but this process will likely be lengthy.

Given the length of this outage, we'd like to work with impacted customers to set up temporary allocations for storing any new data. We understand this outage impacts your workflows, and we want to do all we can to allow you to continue your work while the recovery proceeds. To facilitate this we will open a ticket for each impacted allocation owner and work with you to find the best way to restore your operations while the data recovery effort continues.

Mar 06, 2023 - 13:58 MST
Update - We are continuing to investigate this issue.
Mar 06, 2023 - 13:57 MST
Update - We are still working through the list of checkpoints. One of the filesystem developers is working on a patch that will let us bypass some sanity checks, which should allow us to import the storage pool and give us a clearer picture of where things stand. The next update will be on Monday.
Mar 03, 2023 - 16:22 MST
Update - The storage pool containing the affected filesystems still cannot be imported. We have isolated the affected disks to a single system, are running a newer version of the filesystem software that is more tolerant of certain errors, and are now attempting to import each of the roughly 75 checkpoints of the storage pool.
Mar 02, 2023 - 17:07 MST
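The checkpoint-by-checkpoint import attempts described in the update above can be sketched as generating a dry-run rewind-import command per candidate point. This is a hypothetical illustration: the pool name, altroot, and txg numbers are made up, and it assumes the "checkpoints" correspond to ZFS transaction groups reachable via OpenZFS rewind import (`zpool import -T <txg>`, which implies an extreme rewind; `-n` makes it a dry run, `-o readonly=on` avoids writing to the damaged pool, and `-N` skips mounting datasets).

```python
# Sketch of iterating rewind-import attempts over candidate transaction
# groups (txgs). Pool name "tank" and altroot are hypothetical; commands
# are built as strings rather than executed.
def rewind_import_commands(pool, txgs, altroot="/mnt/recovery"):
    """Yield dry-run zpool import command lines, newest txg first.

    Each line attempts a read-only rewind import at one txg:
      -o readonly=on  never write to the damaged pool
      -N              import without mounting datasets
      -n              dry run: only report whether the import would work
      -T <txg>        rewind to this transaction group
      -R <altroot>    keep any mounts out of the live namespace
    """
    for txg in sorted(txgs, reverse=True):
        yield (f"zpool import -o readonly=on -N -n -T {txg} "
               f"-R {altroot} {pool}")
```

Working from the newest txg backwards matches the goal of losing as little recent data as possible; older txgs are tried only if newer ones fail to import.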
Investigating - One of the storage pools that supports PetaLibrary/active allocations currently cannot be imported. Until this is resolved, the following PetaLibrary allocations are unavailable:

/pl/active/abha4861
/pl/active/BFALMC
/pl/active/BFSeqF
/pl/active/BioChem-NMR
/pl/active/BioCore
/pl/active/BMICC
/pl/active/brennan
/pl/active/CAL
/pl/active/ceko
/pl/active/CIEST
/pl/active/cmbmgem
/pl/active/COSINC
/pl/active/CU-TRaiL
/pl/active/CUBES-SIL
/pl/active/CUFEMM
/pl/active/EML
/pl/active/fanzhanglab
/pl/active/FCSC
/pl/active/Goodrich_Kugel_lab
/pl/active/GreenLab
/pl/active/ics
/pl/active/ics_affiliated_PI
/pl/active/ics_archive_PI
/pl/active/Instrument_Shop
/pl/active/iphy-sclab
/pl/active/JILLA-KECK
/pl/active/jimeno_amchertz
/pl/active/LangeLab
/pl/active/LMCF
/pl/active/LugerLab-EM
/pl/active/McGlaughlin_Lab-UNC
/pl/active/MIMIC
/pl/active/Morris_CSU
/pl/active/neu
/pl/active/NSO-IT
/pl/active/OGL
/pl/active/OIT_DLS_BBA
/pl/active/OIT_DLS_CCS
/pl/active/OIT_DLS_ECHO
/pl/active/Raman_Microspec
/pl/active/scepi_magthin
/pl/active/snag
/pl/active/STEMTECH
/pl/active/swygertlab
/pl/active/UCB-NMR
/pl/active/VoeltzLab

We are working with one of the filesystem developers to understand and resolve the issue. Because the developer is in the Eastern time zone, the next update on this issue will likely not come until tomorrow (Mar 2) morning.

Mar 01, 2023 - 17:12 MST