We will continue to work with our filesystem support vendor to ensure that we are running with the correct configuration, but access to PetaLibrary/active has remained stable since our fix on Sunday.
Posted Mar 12, 2019 - 09:10 MDT
Monitoring
Increasing the limits on the number of open files in the system appears to have resolved the issue, and PetaLibrary/active (BeeGFS) is now accessible. We'll continue to monitor this issue on Monday and follow up with support to confirm the correct value for these limits going forward.
This outage appears to be the result of us exceeding a server-side configured limit on the number of open files in the system. We are following a procedure to increase this limit, which should restore access.
We believe this is a side effect of increased use of the system and does not represent an actual system fault.
Posted Mar 10, 2019 - 17:59 MDT
Investigating
We are investigating an unplanned outage on PetaLibrary/active (BeeGFS) as exported via /pl/active/. Slurm has been stopped on Blanca while we investigate.