PetaLibrary outage
Incident Report for CU Boulder RC
Resolved
The repeated PetaLibrary disk failures in a single slot have not produced enough data to justify replacing additional hardware. The service has been stable for over two weeks, so we are closing this issue. The vendor has committed to assisting us if the failure mode reappears and we can pinpoint the failed component.
Posted Sep 15, 2021 - 14:47 MDT
Monitoring
We have returned PetaLibrary to service and will be following up with our support vendor to understand the root cause of this issue.
Posted Sep 01, 2021 - 11:07 MDT
Investigating
While we were investigating this issue with a diagnostic script provided by our hardware supplier, one of the BeeGFS servers supporting PetaLibrary/active encountered a problem. We are working to restore service as soon as possible.
Posted Sep 01, 2021 - 10:34 MDT
Identified
PetaLibrary services are available again, with one node not in the cluster. Two disks have failed, one of which is the third failure in the same disk slot. Repeated failures in the same slot suggest an issue with a backplane or I/O module. We are working with the vendor to understand the cause of the failures.
Posted Aug 26, 2021 - 09:49 MDT
Investigating
Two PetaLibrary nodes are currently down, making most allocations inaccessible. We are working to restore functionality.
Posted Aug 26, 2021 - 08:50 MDT
This incident affected: PetaLibrary.