Some PetaLibrary/active allocations inaccessible
Incident Report for CU Boulder RC
Update
46/47 recoveries are complete. We are communicating directly with the affected customer on the final recovery effort; incremental status updates will not be posted here. We will leave this incident open until the damaged storage pool has been rebuilt and we begin moving data back to it.
Posted Mar 24, 2023 - 08:54 MDT
Update
Recovery of PetaLibrary/active allocations impacted by the March 1 event continues. We have copied 39 of the 47 impacted allocations to alternate storage. These allocations are available at their normal path ("/pl/active/") and may be used immediately.

Of the eight allocations that have not yet been recovered, we have strong evidence that three are unused and have deprioritized those. The remaining five allocations present a more challenging recovery scenario. We are unable to project when we will know whether the data is retrievable, or exactly how long recovery will take. We are continuing to work with a filesystem developer on this effort and will communicate updates as we have them.

Some questions have been raised about our future plans to address this issue. As an immediate protection, we have taken steps to detect the condition that triggered this recovery effort. Longer term, we are exploring options for providing backups for PetaLibrary; more details will be shared through our stakeholder process as those plans develop.
Posted Mar 13, 2023 - 16:18 MDT
Update
The restore of one PetaLibrary allocation to an alternate storage location finished last night, and spot checks by the customer indicate that their data is intact. The restore of a second allocation was started as soon as the first finished.
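
A spot check of the kind described above does not require special tooling. The sketch below illustrates one generic approach: sample files in the restored allocation, confirm each reads cleanly end to end, and compare against a pre-incident checksum manifest if one happens to exist. The allocation path and manifest name are hypothetical, and this is not the verification procedure RC or the customer actually used.

```python
# Minimal spot-check sketch (hypothetical paths; not RC's verification tooling).
# Samples files under a restored allocation, confirms each is readable end to
# end, and compares against a pre-incident checksum manifest if one was kept.
import hashlib
import random
from pathlib import Path

ALLOCATION = Path("/pl/active/example_lab")   # hypothetical allocation
MANIFEST = ALLOCATION / ".md5sums"            # hypothetical manifest, may not exist
SAMPLE_SIZE = 20

def md5(path: Path) -> str:
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Manifest lines in "checksum  relative/path" (md5sum) format, if present.
expected = {}
if MANIFEST.exists():
    for line in MANIFEST.read_text().splitlines():
        checksum, _, rel = line.partition("  ")
        expected[rel] = checksum

files = [p for p in ALLOCATION.rglob("*") if p.is_file()]
for path in random.sample(files, min(SAMPLE_SIZE, len(files))):
    rel = str(path.relative_to(ALLOCATION))
    checksum = md5(path)                      # proves the file reads cleanly
    if rel in expected and expected[rel] != checksum:
        print(f"MISMATCH:  {rel}")
    else:
        print(f"readable:  {rel}")
```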

The successful recovery of a single allocation is promising, but we still need to handle each recovery on a case-by-case basis. We will provide another update on Monday.
Posted Mar 10, 2023 - 14:15 MST
Update
We are attempting to recover the first PetaLibrary allocation now. The underlying storage pool is sufficiently damaged that it will never again be mounted in read/write mode, meaning we must copy all data elsewhere. This will take time, given the amount of data involved (hundreds of TB). This first recovery effort is expected to finish late today or early tomorrow, and we will share the results of the initial attempt.
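
To give a rough sense of what this copy involves (a sketch under assumed names, not our exact procedure): the damaged pool can only be imported read-only, and each allocation is then copied file by file to healthy storage with a tool such as rsync. The pool name, allocation names, and destination below are hypothetical.

```python
# Rough sketch of a read-only recovery copy (hypothetical pool/paths; not the
# exact procedure used here). The damaged pool is imported read-only so nothing
# more can be written to it, then each allocation is copied to healthy storage.
import subprocess

POOL = "tank"                                   # hypothetical pool name
ALLOCATIONS = ["example_lab", "another_lab"]    # hypothetical allocation names
TARGET = "/mnt/alternate_storage"               # hypothetical destination

# Import the damaged pool read-only; -f forces import of a pool that was not
# cleanly exported.
subprocess.run(["zpool", "import", "-o", "readonly=on", "-f", POOL], check=True)

for alloc in ALLOCATIONS:
    # rsync -aHAX preserves permissions, times, hard links, ACLs and xattrs;
    # --partial lets a copy of hundreds of TB be resumed if interrupted.
    subprocess.run(
        ["rsync", "-aHAX", "--partial",
         f"/{POOL}/active/{alloc}/", f"{TARGET}/{alloc}/"],
        check=True,
    )
```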

If this initial recovery succeeds, that does not imply that all data will be recoverable, but it does suggest that more data _should_ be recoverable. We will not be sure that this process can recover all data until all of it has been copied to a new location. We will continue to post updates as this process proceeds and provide timelines as we learn more.
Posted Mar 09, 2023 - 14:56 MST
Update
After attempting the standard ZFS recovery procedure (rolling the pool back to prior checkpoints), we were still unable to import the storage pool containing the impacted allocations. Our next step is to work with the ZFS developer group to determine whether there is a way to identify which data has been damaged.
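
For context, the standard procedure referred to here is ZFS's built-in import recovery, which offers to discard the most recent transactions in the hope of reaching a consistent state. The sketch below shows what that escalation typically looks like, with a hypothetical pool name; it is illustrative only, and on the affected system these kinds of attempts did not produce an importable pool.

```python
# Sketch of ZFS's built-in import recovery escalation (hypothetical pool name).
# -Fn reports whether discarding recent transactions would help, -F performs
# that recovery, and -X (with -F) allows rewinding much further.
import subprocess

POOL = "tank"  # hypothetical

attempts = [
    ["zpool", "import", "-Fn", POOL],        # dry run of the recovery rewind
    ["zpool", "import", "-F", POOL],         # discard the last few transactions
    ["zpool", "import", "-F", "-X", POOL],   # extreme rewind, last resort
]

for cmd in attempts:
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(" ".join(cmd), "->", "ok" if result.returncode == 0 else "failed")
    if result.returncode == 0:
        break
```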

We suspect (but cannot yet prove) that any damage is most likely limited to snapshots taken while the system was in maintenance, which would make impact to customer data less likely. The ZFS development community has been very helpful, but this process will likely be lengthy.

Given the length of this outage, we'd like to work with impacted customers to set up temporary allocations for storing any new data. We understand this outage impacts your workflows, and we would like to do all we can to allow you to continue your work while the recovery proceeds. To facilitate this, we will open a ticket for each impacted allocation owner and work with you to find the best way to restore your operations while the data recovery effort continues.
Posted Mar 06, 2023 - 13:58 MST
Update
We are continuing to investigate this issue.
Posted Mar 06, 2023 - 13:57 MST
Update
We are still working through the list of checkpoints. One of the filesystem developers has begun work on a patch that will permit us to bypass some sanity checks, which should allow us to import the storage pool and give us a clearer picture of where things stand. The next update will be on Monday.
Posted Mar 03, 2023 - 16:22 MST
Update
The storage pool containing the affected filesystems still cannot be imported. We have isolated the affected disks to a single system, are running a newer version of the filesystem software that is more tolerant of certain problems, and are now working on attempting to import the storage pool at each of its roughly 75 checkpoints.
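
To make "trying each checkpoint" more concrete: the checkpoints being iterated here are, in ZFS terms, most likely the uberblocks recorded in the pool's device labels, each of which points at a past transaction group (txg), and an import can be pinned to a specific txg. The sketch below shows the general shape of that iteration with hypothetical pool and device names; the real effort involves considerably more care than a simple loop.

```python
# Sketch of iterating over a damaged pool's recorded checkpoints (uberblock
# txgs). Pool and device names are hypothetical; -T is an advanced import
# option, and the actual recovery involves far more care than this loop.
import re
import subprocess

POOL = "tank"          # hypothetical pool name
DEVICE = "/dev/sda1"   # hypothetical member device of the pool

# zdb -ul lists the uberblocks stored in the device labels; each records a txg.
labels = subprocess.run(["zdb", "-ul", DEVICE],
                        capture_output=True, text=True).stdout
txgs = sorted({int(t) for t in re.findall(r"txg = (\d+)", labels)}, reverse=True)

for txg in txgs:
    # Read-only, no-mount import pinned to this txg, newest candidate first.
    cmd = ["zpool", "import", "-o", "readonly=on", "-f", "-N", "-T", str(txg), POOL]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"pool imported at txg {txg}")
        break
    print(f"txg {txg}: import failed")
```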
Posted Mar 02, 2023 - 17:07 MST
Investigating
One of the storage pools that supports PetaLibrary/active allocations currently cannot be imported. Until this is resolved, the following PetaLibrary allocations are not available (a quick way to check whether a path you use is reachable is sketched after the list):

/pl/active/abha4861
/pl/active/BFALMC
/pl/active/BFSeqF
/pl/active/BioChem-NMR
/pl/active/BioCore
/pl/active/BMICC
/pl/active/brennan
/pl/active/CAL
/pl/active/ceko
/pl/active/CIEST
/pl/active/cmbmgem
/pl/active/COSINC
/pl/active/CU-TRaiL
/pl/active/CUBES-SIL
/pl/active/CUFEMM
/pl/active/EML
/pl/active/fanzhanglab
/pl/active/FCSC
/pl/active/Goodrich_Kugel_lab
/pl/active/GreenLab
/pl/active/ics
/pl/active/ics_affiliated_PI
/pl/active/ics_archive_PI
/pl/active/Instrument_Shop
/pl/active/iphy-sclab
/pl/active/JILLA-KECK
/pl/active/jimeno_amchertz
/pl/active/LangeLab
/pl/active/LMCF
/pl/active/LugerLab-EM
/pl/active/McGlaughlin_Lab-UNC
/pl/active/MIMIC
/pl/active/Morris_CSU
/pl/active/neu
/pl/active/NSO-IT
/pl/active/OGL
/pl/active/OIT_DLS_BBA
/pl/active/OIT_DLS_CCS
/pl/active/OIT_DLS_ECHO
/pl/active/Raman_Microspec
/pl/active/scepi_magthin
/pl/active/snag
/pl/active/STEMTECH
/pl/active/swygertlab
/pl/active/UCB-NMR
/pl/active/VoeltzLab
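
As mentioned above, checking whether a path you rely on is currently reachable only requires attempting to list it. The sketch below is illustrative; the example path is a placeholder for your own allocation(s).

```python
# Quick check of whether the PetaLibrary/active paths you use are reachable.
# The example path is a placeholder; substitute your own allocation(s).
import os

my_allocations = [
    "/pl/active/example_lab",   # hypothetical; replace with your allocation
]

for path in my_allocations:
    try:
        os.listdir(path)
        print(f"available:   {path}")
    except OSError as exc:
        print(f"unavailable: {path} ({exc.strerror})")
```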

We are working with one of the filesystem developers to understand and resolve the issue. Because the developer is on Eastern time, the next update on this issue will likely not come until tomorrow (Mar 2) morning.
Posted Mar 01, 2023 - 17:12 MST
This incident affects: PetaLibrary.