Update - When we last experienced this issue with BeeGFS we determined that the error was the beegfs-mgmtd exceeding an internal limit. We did not know whether this limit needed to be increased given the size and complexity of our system, or if it represented a problem (a bug) in the software.
In an effort to reduce the likelihood of a reoccurrence we increased the limit from 10k to 20k. However, the issue occurred again today, hitting the new 20k limit.
Before returning the system to service once again we gathered another detail from the system state which I believe indicates that the problem does, in fact, lie with the file system software itself. This is most likely a regression introduced during our recent upgrade.
With this information I expect the developer should be able to identify and fix the cause of our issue. Until then, we will continue to monitor and return the system to service as necessary, and are deploying additional monitoring to assist us in responding more quickly.
Mar 25, 10:57 MDT
Monitoring - We have identified the issue and will continue monitoring the PetaLibrary service.
Mar 25, 10:16 MDT
Update - We are continuing to investigate this issue.
Mar 25, 10:07 MDT
Investigating - We are currently investigating this issue.
Mar 25, 09:56 MDT
Update - I believe that we have found the root cause for this issue, or at least we are approaching it.
The BeeGFS cluster has a central management daemon that assists clients and servers in locating resources in the cluster. This daemon is currently configured with a 10,000 open-files limit, and it appears to be reaching this limit in some circumstances. Once this occurs, the file system becomes inaccessible until the management daemon is restarted.
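As an illustration of the kind of check involved, the following is a minimal sketch of how a process's open-file count can be compared against its limit on Linux. It is not our production monitoring: the function names and the 80% alert threshold are assumptions made for this example, and it relies on the Linux `/proc` filesystem (inspecting another user's process typically requires root).

```python
import os
import sys


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' limit from the text of
    /proc/<pid>/limits, or None if it is absent or unlimited."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # fields: Max, open, files, soft, hard, units
            return None if soft == "unlimited" else int(soft)
    return None


def count_open_fds(pid):
    """Count open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def should_alert(open_count, limit, threshold=0.8):
    """True when usage reaches the given fraction of the limit.
    The 0.8 default is an arbitrary choice for this sketch."""
    return limit is not None and open_count >= threshold * limit


if __name__ == "__main__":
    # Usage: python fdcheck.py <pid>   (defaults to this script's own PID)
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_soft_nofile(f.read())
    used = count_open_fds(pid)
    print(f"pid {pid}: {used} open files, soft limit {limit}")
    if should_alert(used, limit):
        print("WARNING: approaching the open-files limit")
```

Run against the management daemon's PID, a check like this would warn before the limit is reached rather than after the file system has already become inaccessible.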
Our BeeGFS installation is notably complex, so it's possible we just need to increase this limit; but it is also possible that this represents a bug in the management daemon, and that increasing the limit would not resolve the issue.
We have presented these findings to the developers, and are awaiting their analysis.
Mar 21, 15:04 MDT
Monitoring - We have gathered a new set of detailed logs and other analytics and provided these to the filesystem support vendor for analysis. We have further identified the minimal action necessary (restarting a single backend service) to restore access to PetaLibrary/active.
Access to PetaLibrary/active has been restored, and jobs are once again being dispatched on Blanca. We regret that this has now happened three times, and are continuing to work with the support vendor to identify the root cause and resolve this issue permanently.
Mar 21, 11:00 MDT
- We are aware of another instance of the same outage type on PetaLibrary/active (BeeGFS). This is the third such outage, all since our most recent upgrade.
While we investigate the cause of this outage further, we have stopped new Blanca jobs from starting. If you would like your partition to be re-activated, please let us know at firstname.lastname@example.org
We suspect that there has been some kind of bug introduced during the most recent FS upgrade that is leading to this behavior; but it's difficult to track down because the symptoms also partially match what we would see if a user application were exhausting the number of file handles that can be open. We're going to take a bit more time today to try to gather log data to better track down the cause of this error, a continuation of the investigation that has been ongoing since the second such outage.
More information will be posted here as it becomes available.
Mar 21, 09:41 MDT
Monitoring - The limit on the number of open files was hit again, so we have increased it further. We have an open ticket with our vendor to identify the correct value and prevent this from happening again, and we are following up on it.
PetaLibrary is back up again but we'll continue to monitor.
New jobs can once again be started on Blanca.
Mar 14, 11:20 MDT
- To minimize the impact of PetaLibrary/active being inaccessible on queued jobs, I have stopped Slurm from starting new jobs on Blanca. If you would like your partition returned to service before we resolve the problems with PetaLibrary, please contact email@example.com
Mar 14, 09:21 MDT
Investigating - We have noted some BeeGFS communication errors that are preventing spaces at /pl/active from being used.
I'm opening this incident but have not yet been able to investigate further.
All BeeGFS servers are up, but we have observed error messages from the management service associated with some BeeGFS clients.
This incident will be updated as soon as the problem is understood.
Mar 14, 08:36 MDT