PetaLibrary storage pool 1 offline Feb 14 6pm-11pm
Incident Report for CU Boulder RC
Resolved
The disk replacement and rebuild finished at 10:30 this morning, and storage pool 1 has remained stable since it was brought back online last night. The root cause of the multiple disk errors that caused the pool to suspend all I/O operations is still unknown. As a precaution, we are running a scrub of the pool to validate all of its data, and we will continue to investigate the root cause of this outage. We are closing this incident because all PetaLibrary services are currently stable.
Posted Feb 15, 2024 - 11:54 MST
Monitoring
Multiple disk errors caused one PetaLibrary storage pool to suspend all I/O at 6pm on Feb 14 (this is a safety mechanism), resulting in an outage for 14 allocations. The only way to recover from pool I/O suspension is to reboot the host managing the storage pool; that reboot took place at 10pm. The reboot required interrupting service to a second storage pool, affecting an additional 30 allocations. Both pools were back online by 11pm. The storage pool that had I/O suspended suffered a disk failure on Feb 13, and the disk was replaced at 1pm on Feb 14. It is not known whether the disk replacement contributed to the pool I/O suspension; this incident will remain open until the disk rebuild is complete.
Posted Feb 14, 2024 - 23:18 MST
This incident affected: PetaLibrary.