Partial PetaLibrary outage / interruption
Incident Report for CU Boulder RC
Resolved
Our BeeGFS storage pools are configured to panic a node if disk writes cannot be performed for 10 seconds. A disk held up writes to a storage pool, causing one of the PetaLibrary hosts to panic and reboot. We are working with the vendor to replace the disks that have reported errors in the past couple of days. The disk replacements should not interrupt PetaLibrary services.
Posted Jul 28, 2021 - 09:37 MDT
Update
We are continuing to monitor for any further issues.
Posted Jul 27, 2021 - 10:09 MDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 27, 2021 - 10:09 MDT
Investigating
We are investigating a PetaLibrary failure that occurred overnight, after which some services did not restart properly. Some PetaLibrary allocations may be inaccessible or interrupted while we investigate.
Posted Jul 27, 2021 - 09:05 MDT
This incident affected: PetaLibrary.