Communication error when accessing /pl/active
Incident Report for CU Boulder RC
Update
When we last experienced this issue with BeeGFS, we determined that the error was caused by beegfs-mgmtd exceeding an internal limit. We did not know whether this limit needed to be increased given the size and complexity of our system, or whether it represented a problem (a bug) in the software.

In an effort to reduce the likelihood of a recurrence, we increased the limit from 10k to 20k. However, the issue occurred again today, hitting the new 20k limit.

Before returning the system to service once again, we gathered another detail from the system state that I believe indicates the problem does, in fact, lie with the file system software itself. This is most likely a regression introduced during our recent upgrade.

With this information, I expect the developers will be able to identify and fix the cause of our issue. Until then, we will continue to monitor and return the system to service as necessary, and we are deploying additional monitoring to help us respond more quickly.
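
As an illustration of the kind of check this additional monitoring performs (a minimal sketch, not our production tooling; the process name, 80% threshold, and output format are assumptions), the file descriptors held by beegfs-mgmtd can be compared against its open-files limit by reading /proc:

# Minimal sketch: warn when beegfs-mgmtd approaches its open-files limit.
# Run as root on the management host so /proc/<pid>/fd is readable.
import os
import sys

def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm matches the given process name."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return pids

def open_fds(pid):
    """Number of file descriptors currently open for a PID."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def soft_nofile_limit(pid):
    """Soft 'Max open files' limit from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None

if __name__ == "__main__":
    status = 0
    for pid in pids_by_name("beegfs-mgmtd"):
        used, limit = open_fds(pid), soft_nofile_limit(pid)
        print(f"beegfs-mgmtd pid {pid}: {used} of {limit} open files")
        if limit is not None and used >= 0.8 * limit:  # alert threshold is an assumption
            status = 1
    sys.exit(status)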
Posted Mar 25, 2019 - 10:57 MDT
Monitoring
We have identified the issue and will continue monitoring the PetaLibrary service.
Posted Mar 25, 2019 - 10:16 MDT
Update
We are continuing to investigate this issue.
Posted Mar 25, 2019 - 10:07 MDT
Investigating
We are currently investigating this issue.
Posted Mar 25, 2019 - 09:56 MDT
Update
I believe that we have found the root cause of this issue, or that we are at least close to it.

The BeeGFS cluster has a central management daemon that assists clients and servers in locating resources in the cluster. This daemon is currently configured with a 10,000 open-files limit, and it appears to be reaching this limit in some circumstances. Once this occurs, the file system becomes inaccessible until the management daemon is restarted.

Our BeeGFS installation is notably complex, so it's possible we just need to increase this limit; but it is also possible that this represents a bug in the management daemon, and that increasing the limit would not resolve the issue.
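
One way to start separating the two cases (an illustrative sketch, not our exact diagnostic; the PID in the example is hypothetical) is to look at what the daemon's descriptors actually point to: a descriptor table dominated by sockets that keeps growing under steady client load looks more like a leak than a limit that is simply too small.

# Illustrative only: classify a process's open descriptors by target type.
import collections
import os

def fd_types(pid):
    """Count open descriptors by what they point at (socket, pipe, file, ...)."""
    counts = collections.Counter()
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor closed between listdir and readlink
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("anon_inode:"):
            counts["anon_inode"] += 1
        else:
            counts["file"] += 1
    return counts

# Example with a hypothetical PID of the management daemon:
# print(fd_types(12345))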

We have presented these findings to the developers, and are awaiting their analysis.
Posted Mar 21, 2019 - 15:04 MDT
Monitoring
We have gathered a new set of detailed logs and other analytics and provided these to the filesystem support vendor for analysis. We have further identified the minimal action necessary (restarting a single backend service) to restore access to PetaLibrary/active.

Access to PetaLibrary/active has been restored, and jobs are once again being dispatched on Blanca. We regret that this has now happened three times, and are continuing to work with the support vendor to identify root cause and resolve this issue permanently.
Posted Mar 21, 2019 - 11:00 MDT
Investigating
We are aware of another instance of the same outage type on PetaLibrary/active (BeeGFS). This is the third such outage, all since our most recent upgrade.

While we investigate the cause of this outage further, we have stopped new Blanca jobs from starting. If you would like your partition to be re-activated, please let us know at rc-help@colorado.edu.

We suspect that some kind of bug introduced during the most recent FS upgrade is leading to this behavior, but it's difficult to track down because the symptoms also partially match what we would see if a user application were exhausting the number of file handles that can be open. We're going to take a bit more time today to gather log data to better track down the cause of this error, continuing the investigation that has been ongoing since the second such outage.

More information will be posted here as it becomes available.
Posted Mar 21, 2019 - 09:41 MDT
Monitoring
The limit for the number of open files was hit again, so we increased the limit even more. We have a ticket open with our vendor to identify the correct value and prevent this from happening again; this is being followed up.
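
For illustration only (an assumption about the mechanism, not a record of the exact change we applied), the open-files limit of an already-running daemon can be raised on Linux without a restart; a persistent change would instead go into the service's startup configuration:

# Illustrative sketch: raise RLIMIT_NOFILE for a running process
# (Linux, Python >= 3.4, requires root / CAP_SYS_RESOURCE).
# The PID and values below are placeholders, not the actual change made.
import resource

def raise_nofile(pid, soft, hard):
    """Set the soft and hard open-files limits of another process."""
    resource.prlimit(pid, resource.RLIMIT_NOFILE, (soft, hard))

# Example with a hypothetical PID:
# raise_nofile(12345, 20000, 20000)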

PetaLibrary is back up, but we'll continue to monitor.
New jobs are able to start again on Blanca.
Posted Mar 14, 2019 - 11:20 MDT
Update
To minimize the impact of PetaLibrary/active being inaccessible on queued jobs, I have stopped Slurm from starting new jobs on Blanca. If you would like your partition returned to service before we resolve the problems with PetaLibrary, please contact rc-help@colorado.edu.
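
For reference, a minimal sketch of how such a hold can be placed and released with Slurm's scontrol (the partition name is a placeholder, not an actual Blanca partition, and this is not necessarily the exact procedure we used):

# Illustrative sketch: mark a partition DOWN so Slurm stops starting new jobs
# on it (running jobs continue); set it back to UP to return it to service.
import subprocess

def set_partition_state(partition, state):
    """Use scontrol to set a Slurm partition's state ("DOWN" or "UP")."""
    subprocess.run(
        ["scontrol", "update", f"PartitionName={partition}", f"State={state}"],
        check=True,
    )

# Example with a placeholder partition name:
# set_partition_state("blanca-example", "DOWN")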
Posted Mar 14, 2019 - 09:21 MDT
Investigating
We have noted some BeeGFS communication errors that are preventing spaces at /pl/active from being used.
I'm opening this incident but have not yet been able to investigate further.
All BeeGFS servers are up, but error messages from the management service have been observed on some BeeGFS clients.
The incident will be updated as soon as it is understood.
Posted Mar 14, 2019 - 08:36 MDT
This incident affects: RMACC Summit, Blanca, and PetaLibrary.