Unplanned outage, PetaLibrary/active (BeeGFS)
Incident Report for CU Boulder RC
Resolved
We will continue to work with our filesystem support vendor to confirm that we are running with the correct configuration, but access to PetaLibrary/active has remained stable since we restored it on Sunday.
Posted Mar 12, 2019 - 09:10 MDT
Monitoring
Increasing the limits on the number of open files in the system appears to have resolved the issue, and PetaLibrary/active (BeeGFS) is now accessible. We'll continue to monitor this issue on Monday, and follow up with support to confirm the correct value for these limits going forward.

Jobs are once again starting on Blanca.

If you notice any further trouble, please contact rc-help@colorado.edu.

https://www.beegfs.io/wiki/FAQ#too_many_open_files
Posted Mar 10, 2019 - 18:14 MDT
Identified
This outage appears to be the result of the system exceeding a server-side configured limit on the number of open files. We are following a procedure to increase this limit, which should restore access.

It is our impression that this is a side effect of increased use of the system, and does not represent an actual system fault.
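
For readers who want to see how close a Linux host is to an open-file limit, the following is a minimal sketch. It assumes the limit in question is the kernel-wide file handle limit reported in /proc/sys/fs/file-nr. This report does not state which limit was actually raised on the BeeGFS servers, so treat this as an illustration rather than the procedure we followed. The function names are hypothetical.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr contents into (allocated, unused, maximum).

    The kernel reports three whitespace-separated integers: allocated file
    handles, allocated-but-unused handles, and the system-wide maximum.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def usage_fraction(text):
    """Return allocated handles as a fraction of the system-wide maximum."""
    allocated, _unused, maximum = parse_file_nr(text)
    return allocated / maximum


if __name__ == "__main__":
    # Assumption: a Linux host; this path does not exist on other systems.
    path = Path("/proc/sys/fs/file-nr")
    if path.exists():
        contents = path.read_text()
        allocated, unused, maximum = parse_file_nr(contents)
        print(f"allocated={allocated} unused={unused} max={maximum}")
        print(f"usage={usage_fraction(contents):.1%}")
    else:
        print("/proc/sys/fs/file-nr not found; is this a Linux host?")
```

A usage fraction that approaches 1.0 would be consistent with the kind of exhaustion described above, although per-process limits can also be the constraint and are not covered by this sketch.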
Posted Mar 10, 2019 - 17:59 MDT
Investigating
We are investigating an unplanned outage on PetaLibrary/active (BeeGFS) as exported via /pl/active/. Slurm has been stopped on Blanca while we investigate.
Posted Mar 10, 2019 - 17:28 MDT
This incident affected: Blanca and PetaLibrary.