Access to /scratch/summit and interim PetaLibrary allocations
Incident Report for CU Boulder RC
Resolved
This incident has been resolved.
Posted May 27, 2021 - 11:07 MDT
Monitoring
Summit has been returned to service, and access to Summit storage has been restored.

Significant work was required to complete the consistency checks mandated by safeguards in both the block storage and the clustered filesystem that make up the Summit storage stack. No errors were detected, and a great deal of data has been read from Summit storage as a proof-of-fitness test; significant writes have also been performed for the same purpose.

We believe that these failures are a result of a failure the SFA storage hardware encountered over the weekend; more specifically, a failure of the SFA platform to appropriately accommodate the recovery activity performed Monday morning under the direction of upstream vendor support.

We have passed all of this information on to the vendor for analysis.
Posted May 04, 2021 - 02:15 MDT
Identified
The problem with Summit storage experienced earlier has recurred. We are investigating the root cause of the issue and will resolve it as soon as possible; this may not be until Tuesday morning, however.

Summit Slurm has been paused, so no new jobs will attempt to start until this has been resolved.
Posted May 03, 2021 - 22:30 MDT
Monitoring
This outage may have affected all access to /scratch/summit (including from compute nodes), and is likely related to maintenance performed on Summit scratch hardware this morning. The underlying issue has been fixed, and we are working with the vendor to understand why an outage occurred.
Posted May 03, 2021 - 14:09 MDT
Investigating
Access to /scratch/summit from RC hosts that are NOT compute nodes (notably the login nodes) is currently down. This affects some PetaLibrary allocations as well.
Posted May 03, 2021 - 12:31 MDT
This incident affected: PetaLibrary and RMACC Summit.