Monitoring - A fix has been implemented and we are monitoring the results.
May 16, 16:03 MDT
Update - After more than 3 hours of debugging and working with our support vendor to bring the cluster back online, we identified a configuration problem with our High Availability BeeGFS cluster that prevented one of the storage targets from stopping or starting properly. The configuration that led to the problem we encountered today is associated with how we provide ZFS-only allocations. We are therefore making a plan to change the way we provision ZFS-only spaces, in order to isolate those ZFS-only targets from the targets that BeeGFS uses.
The problem today started when we tried to apply the patch mentioned earlier, and BeeGFS resources were not failing over to a secondary node because the boss100 target was not stopping and starting as expected.
That patch adds debugging information to help the BeeGFS team diagnose the issue with the BeeGFS management daemon, which fails from time to time and leads to the communication errors seen by clients.
We applied the patch while the BeeGFS team was helping us recover the cluster today.
The BeeGFS filesystem is accessible again, so you should be able to read and write /pl/active spaces.
May 16, 15:55 MDT
Investigating - Application of a patch today has caused BeeGFS to go offline. We are working to bring it back up ASAP.
May 16, 10:33 MDT
Update - We experienced another instance of this problem today, starting at approximately 5pm; it was resolved at 8:30pm, as soon as we became aware of it.
We continue to regret this state of affairs. This failure occurred sooner after our remediation step than we had seen before. At this point, all we can do is continue to pass our experiences on to the developer, who is actively pursuing root cause on our behalf.
We are also investigating configuration changes that would allow I/O to block, rather than fail, in this situation. If we are able to do so, that should at least reduce the job failures that result from this and similar problems.
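One candidate for that kind of change (an assumption on our part, not a confirmed fix) is the client-side retry window in beegfs-client.conf, which controls how long a client retries communication before returning an error to the application:

```ini
# beegfs-client.conf -- sketch only; the value shown is an assumption.
# connCommRetrySecs: how long the client retries failed communication
# before reporting an I/O error to the application. A larger value makes
# I/O block (retry) through a short management-daemon outage rather than fail.
connCommRetrySecs = 1800
```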
May 10, 20:36 MDT
Update - Another instance of this problem occurred around 1:15am on April 27th.
As soon as the issue was identified, we collected the debug data requested by our vendor. So far they have not been able to reproduce the problem in their labs, and the debug data collected previously has not been sufficient to diagnose it. They have requested a new set of debug data, and we expect it to be enough for them to diagnose the problem this time.
The BeeGFS service has been running normally since the data was collected.
Apr 27, 11:10 MDT
Update - We experienced a reoccurrence of this issue today. Notably, our new monitoring _did_ catch this occurrence, including automatically preventing new Slurm jobs from starting until the issue was resolved.
While we had recently applied a patch to beegfs-mgmt, the developer did not, in fact, believe it would fix our issue. Because we had not seen an occurrence of this issue with that patch (and some other configuration fixes) in place, we had ceased taking certain preventative measures. Now that the issue has reoccurred with our current configuration, those preventative measures will resume.
Apr 13, 08:35 MDT
Update - When we last experienced this issue with BeeGFS we determined that the error was the beegfs-mgmtd exceeding an internal limit. We did not know whether this limit needed to be increased given the size and complexity of our system, or if it represented a problem (a bug) in the software.
In an effort to reduce the likelihood of a reoccurrence we increased the limit from 10k to 20k. However, the issue occurred again today, hitting the new 20k limit.
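For reference, the shape of this change (a sketch only; the parameter name and value are our assumptions about the BeeGFS server configuration, not a confirmed excerpt from our systems) is a one-line edit to the management daemon's config:

```ini
# beegfs-mgmtd.conf -- sketch; tuneProcessFDLimit is an assumed parameter
# name for the relevant knob. The intent is to lift the daemon's open-files
# limit (RLIMIT_NOFILE) at startup from the old 10k ceiling to 20k.
tuneProcessFDLimit = 20000
```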
Before returning the system to service once again we gathered another detail from the system state which I believe indicates that the problem does, in fact, lie with the file system software itself. This is most likely a regression introduced during our recent upgrade.
With this information I expect the developer should be able to identify and fix the cause of our issue. Until then, we will continue to monitor and return the system to service as necessary, and are deploying additional monitoring to assist us in responding more quickly.
Mar 25, 10:57 MDT
Monitoring - We have identified the issue and will continue monitoring the Petalibrary service.
Mar 25, 10:16 MDT
Update - We are continuing to investigate this issue.
Mar 25, 10:07 MDT
Investigating - We are currently investigating this issue.
Mar 25, 09:56 MDT
Update - I believe that we have found the root cause for this issue, or at least we are approaching it.
The BeeGFS cluster has a central management daemon that assists clients and servers in locating resources in the cluster. This daemon is currently configured with a 10,000 open-files limit, and it appears to be reaching this limit in some circumstances. Once this occurs, the file system becomes inaccessible until the management daemon is restarted.
Our BeeGFS installation is notably complex, so it's possible we just need to increase this limit; but it is also possible that this represents a bug in the management daemon, and that increasing the limit would not resolve the issue.
We have presented these findings to the developers, and are awaiting their analysis.
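For illustration, a hypothetical diagnostic (not part of BeeGFS or our monitoring stack) can compare a daemon's open-descriptor count against its soft limit by reading /proc on Linux:

```python
# Hypothetical diagnostic sketch (not a BeeGFS tool): compare a process's
# open file descriptors against its soft "Max open files" limit on Linux.
import os


def fd_usage(pid: int) -> tuple[int, int]:
    """Return (open_fds, soft_limit) for the given pid, via /proc."""
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: "Max open files  <soft>  <hard>  files"
                return open_fds, int(line.split()[3])
    raise RuntimeError("Max open files not found in /proc/<pid>/limits")


if __name__ == "__main__":
    pid = os.getpid()  # in practice: the beegfs-mgmtd pid
    used, limit = fd_usage(pid)
    print(f"pid {pid}: {used} of {limit} file descriptors in use")
```

Pointed at the management daemon's pid, a descriptor count climbing toward the configured cap (10,000 in our case) would corroborate this failure mode.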
Mar 21, 15:04 MDT
Monitoring - We have gathered a new set of detailed logs and other analytics and provided these to the filesystem support vendor for analysis. We have further identified the minimal action necessary (restarting a single backend service) to restore access to PetaLibrary/active.
Access to PetaLibrary/active has been restored, and jobs are once again being dispatched on Blanca. We regret that this has now happened three times, and are continuing to work with the support vendor to identify root cause and resolve this issue permanently.
Mar 21, 11:00 MDT
- We are aware of another instance of the same outage type on PetaLibrary/active (BeeGFS). This is the third such outage, all since our most recent upgrade.
While we investigate the cause of this outage further, we have stopped new Blanca jobs from starting. If you would like your partition to be re-activated, please let us know at firstname.lastname@example.org
We suspect that a bug introduced during the most recent FS upgrade is leading to this behavior, but it is difficult to track down because the symptoms also partially match what we would see if a user application were exhausting the number of file handles that can be open. We are going to take a bit more time today to gather log data to better track down the cause of this error, continuing the investigation that has been ongoing since the second such outage.
More information will be posted here as it becomes available.
Mar 21, 09:41 MDT
Monitoring - The limit on the number of open files was hit again, so we increased the limit further. We have a ticket open with our vendor to identify the correct value and prevent this from happening again, and are following up on it.
PetaLibrary is back up, but we will continue to monitor.
New jobs can be started on Blanca again.
Mar 14, 11:20 MDT
- To minimize the impact of PetaLibrary/active being inaccessible on queued jobs, I have stopped Slurm from starting new jobs on Blanca. If you would like your partition returned to service before we resolve the problems with PetaLibrary, please contact email@example.com
Mar 14, 09:21 MDT
Investigating - We have noted some BeeGFS communication errors that are preventing spaces under /pl/active from being used.
I'm opening this incident, but have not yet been able to investigate further.
All BeeGFS servers are up, but we observed error messages from the management service associated with some BeeGFS clients.
This incident will be updated as soon as the problem is understood.
Mar 14, 08:36 MDT