Communication error when accessing /pl/active
Incident Report for CU Boulder RC
Our workaround for this issue has been in place successfully for some time now. We plan to work with upstream development to find and resolve the root cause, but for now we're closing out this issue.
Posted Jun 25, 2019 - 13:40 MDT
A fix has been implemented and we are monitoring the results.
Posted May 16, 2019 - 16:03 MDT
After more than three hours of debugging and trying to bring the cluster back online with our support vendor, we identified a configuration problem with our high-availability BeeGFS cluster that prevented one of the storage targets from stopping or starting properly. The configuration that led to the problem we encountered today is associated with how we provide ZFS-only allocations, so we are making a plan to change how we provision ZFS-only spaces in order to isolate those ZFS-only targets from the ones that BeeGFS uses.

The problem we saw today started when we tried to apply the patch mentioned earlier: BeeGFS resources were not failing over to a secondary node because the boss100 target was not stopping and starting as expected.
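For reference, on a Pacemaker-managed HA cluster (an assumption about our stack; the resource name below is a placeholder, not our real configuration), failed stop/start actions like the ones described above are typically inspected and cleared like this:

```shell
# Show overall cluster state, including any failed resource actions.
pcs status

# Show the fail count for the storage-target resource.
# "beegfs-boss100" is a hypothetical resource name for illustration.
pcs resource failcount show beegfs-boss100

# After fixing the underlying configuration, clear the failure history
# so Pacemaker will attempt to manage the resource again.
pcs resource cleanup beegfs-boss100
```

Clearing fail counts matters because Pacemaker stops retrying a resource once it exceeds its migration threshold, which can leave a target stuck even after the root cause is fixed.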

That patch adds debugging information so that the BeeGFS team can diagnose the issue with the BeeGFS management daemon, which fails from time to time and leads to the communication errors seen by clients.

We applied the patch while the BeeGFS team was helping us recover the cluster today.

The BeeGFS filesystem is accessible now, so you should be able to read and write your /pl/active spaces.
Posted May 16, 2019 - 15:55 MDT
Application of a patch today has caused BeeGFS to go offline. We are working to bring it back up ASAP.
Posted May 16, 2019 - 10:33 MDT
We experienced another instance of this problem today, starting at approximately 5pm; it was resolved as soon as we became aware of it, at 8:30pm.

We continue to regret this state of affairs. This failure occurred sooner after our remediation step than we had seen before. At this point, all we can do is continue to pass our experiences on to the developer, who is actively pursuing root cause on our behalf.

We are also investigating configuration changes that would allow I/O to block, rather than fail, in this situation. If we are able to do so, that should at least reduce the job failures that result from this and similar problems.
Posted May 10, 2019 - 20:36 MDT
Another instance of this problem occurred around 1:15am on April 27th.

As soon as the issue was identified, debug data was collected as requested by our vendor. So far they have not been able to reproduce the problem in their labs, and the debug data collected previously has proven challenging to diagnose from. A new set of debug data was requested, and we expect it will be enough for them to diagnose the problem this time.

The BeeGFS service has been running normally since the data was collected.
Posted Apr 27, 2019 - 11:10 MDT
We experienced a reoccurrence of this issue today. Notably, our new monitoring _did_ catch this occurrence, including automatically preventing new Slurm jobs from starting until the issue was resolved.

While we had recently applied a patch to beegfs-mgmtd, the developer did not, in fact, believe it would fix our issue. Because we had not seen an occurrence of this issue (or some other configuration issues) with the patch in place, we had ceased taking certain preventative measures. Now that the issue has recurred with our current configuration, those preventative measures will resume.
Posted Apr 13, 2019 - 08:35 MDT
When we last experienced this issue with BeeGFS we determined that the error was the beegfs-mgmtd exceeding an internal limit. We did not know whether this limit needed to be increased given the size and complexity of our system, or if it represented a problem (a bug) in the software.

In an effort to reduce the likelihood of a recurrence, we increased the limit from 10k to 20k. However, the issue occurred again today, hitting the new 20k limit.
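For illustration only: if the limit in question were the process file-descriptor limit on a systemd-managed host, it could be raised with a drop-in override like the sketch below. BeeGFS also carries its own tuning parameters, so the change we actually made may have been elsewhere; the unit name assumes standard BeeGFS packaging.

```shell
# Hypothetical sketch: raise beegfs-mgmtd's fd limit via a systemd drop-in.
sudo mkdir -p /etc/systemd/system/beegfs-mgmtd.service.d
printf '[Service]\nLimitNOFILE=20000\n' | \
  sudo tee /etc/systemd/system/beegfs-mgmtd.service.d/limits.conf

# Reload unit files and restart the daemon so the new limit takes effect.
sudo systemctl daemon-reload
sudo systemctl restart beegfs-mgmtd
```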

Before returning the system to service once again we gathered another detail from the system state which I believe indicates that the problem does, in fact, lie with the file system software itself. This is most likely a regression introduced during our recent upgrade.

With this information I expect the developer will be able to identify and fix the cause of our issue. Until then, we will continue to monitor and return the system to service as necessary, and we are deploying additional monitoring to help us respond more quickly.
Posted Mar 25, 2019 - 10:57 MDT
We have identified the issue and will continue monitoring the PetaLibrary service.
Posted Mar 25, 2019 - 10:16 MDT
We are continuing to investigate this issue.
Posted Mar 25, 2019 - 10:07 MDT
We are currently investigating this issue.
Posted Mar 25, 2019 - 09:56 MDT
I believe we have found, or are at least closing in on, the root cause of this issue.

The BeeGFS cluster has a central management daemon that assists clients and servers in locating resources in the cluster. This daemon is currently configured with a 10,000 open-files limit, and it appears to be reaching this limit in some circumstances. Once this occurs, the file system becomes inaccessible until the management daemon is restarted.
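As a rough illustration (not our exact diagnostic procedure), a daemon's file-descriptor usage can be checked against its limit through /proc; the snippet falls back to the current shell's PID when beegfs-mgmtd isn't running, purely so the commands can be demonstrated anywhere:

```shell
# Find the management daemon's PID; fall back to this shell for illustration.
pid=$(pidof beegfs-mgmtd || echo $$)

# The soft and hard open-files limits for that process.
grep 'Max open files' "/proc/$pid/limits"

# How many file descriptors the process currently has open.
ls "/proc/$pid/fd" | wc -l
```

Comparing the second number against the first shows how close the daemon is to the 10,000-descriptor ceiling described above.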

Our BeeGFS installation is notably complex, so it's possible we just need to increase this limit; but it is also possible that this represents a bug in the management daemon, and that increasing the limit would not resolve the issue.

We have presented these findings to the developers, and are awaiting their analysis.
Posted Mar 21, 2019 - 15:04 MDT
We have gathered a new set of detailed logs and other analytics and provided these to the filesystem support vendor for analysis. We have further identified the minimal action necessary (restarting a single backend service) to restore access to PetaLibrary/active.
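The minimal restart step looks roughly like the following on the management node; service and tool names assume a standard BeeGFS installation, and this is a sketch rather than our exact runbook:

```shell
# Restart only the management daemon, leaving metadata and storage
# services untouched.
sudo systemctl restart beegfs-mgmtd

# Verify that the management node is registered and reachable again.
beegfs-ctl --listnodes --nodetype=management

# From a client, confirm connectivity to all server daemons.
beegfs-check-servers
```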

Access to PetaLibrary/active has been restored, and jobs are once again being dispatched on Blanca. We regret that this has now happened three times, and are continuing to work with the support vendor to identify root cause and resolve this issue permanently.
Posted Mar 21, 2019 - 11:00 MDT
We are aware of another instance of the same outage type on PetaLibrary/active (BeeGFS). This is the third such outage, all since our most recent upgrade.

While we investigate the cause of this outage further, we have stopped new Blanca jobs from starting. If you would like your partition to be re-activated, please let us know at

We suspect that there has been some kind of bug introduced during the most recent FS upgrade that is leading to this behavior; but it's difficult to track down because the symptoms also partially match what we would see if a user application were exhausting the number of file handles that can be open. We're going to take a bit more time today to try to gather log data to better track down the cause of this error, a continuation of the investigation that has been ongoing since the second such outage.

More information will be posted here as it becomes available.
Posted Mar 21, 2019 - 09:41 MDT
The open-files limit was hit again, so we increased it further. We have opened a ticket with our vendor to identify the correct value and prevent this from happening again; this is being followed up.

PetaLibrary is back up, but we will continue to monitor it.
New jobs can be started on Blanca again.
Posted Mar 14, 2019 - 11:20 MDT
To minimize the impact of PetaLibrary/active being inaccessible on queued jobs, I have stopped Slurm from starting new jobs on Blanca. If you would like your partition returned to service before we resolve the problems with PetaLibrary, please contact
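Pausing new jobs on a partition corresponds roughly to marking it down in Slurm; the partition name below is a placeholder, not necessarily how Blanca's partitions are named:

```shell
# Stop new jobs from being scheduled; running jobs continue.
# "blanca-example" is a hypothetical partition name.
scontrol update PartitionName=blanca-example State=DOWN

# Later, return the partition to service.
scontrol update PartitionName=blanca-example State=UP
```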
Posted Mar 14, 2019 - 09:21 MDT
We noticed some BeeGFS communication errors that are preventing spaces at /pl/active from being used.
I'm opening this incident but have not yet been able to investigate further.
All BeeGFS servers are up, but error messages from the management service were observed on some BeeGFS clients.
This incident will be updated as soon as the problem is understood.
Posted Mar 14, 2019 - 08:36 MDT
This incident affected: RMACC Summit, Blanca, and PetaLibrary.