Monitoring - A fix has been implemented and we are monitoring the results.
May 16, 16:03 MDT
Update - After more than 3 hours of debugging and trying to bring the cluster back online with our support vendor, we identified a configuration problem with our High Availability BeeGFS cluster which prevented one of the storage targets from stopping or starting properly. The configuration that led to today's problem is associated with how we provide ZFS-only allocations. We are now planning to change the way we provision ZFS-only spaces so that those ZFS-only targets are isolated from the targets that BeeGFS uses.

The problem we saw today started when we tried to apply the patch mentioned earlier: BeeGFS resources were not failing over to a secondary node because the boss100 target did not stop and start as expected.

That patch adds debugging information so that the BeeGFS team can diagnose the issue with the BeeGFS management daemon, which fails from time to time and leads to the communication errors seen by clients.

We applied the patch while BeeGFS support was helping us recover the cluster today.

The BeeGFS filesystem is accessible again, so you should be able to read and write to your /pl/active spaces.
May 16, 15:55 MDT
Investigating - Application of a patch today has caused BeeGFS to go offline. We are working to bring it back up ASAP.
May 16, 10:33 MDT
Update - We experienced another instance of this problem today, starting at approximately 5pm; it was resolved as soon as we became aware of it, at about 8:30pm.

We continue to regret this state of affairs. This failure occurred sooner after our remediation step than we had seen before. At this point, all we can do is continue to pass our experiences on to the developer, who is actively pursuing root cause on our behalf.

We are also investigating configuration changes that would allow I/O to block, rather than fail, in this situation. If we are able to do so, that should at least reduce the job failures that result from this and similar problems.
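
For illustration only (this is not the configuration change referred to above), a job can also protect itself against brief failures of this kind by retrying I/O that returns a transient error. A minimal Python sketch, with a hypothetical path and retry policy:

    # Illustrative only: a job-level retry wrapper around a write, showing what
    # "failing" I/O looks like to an application (OSError/EIO) and one way a job
    # could ride out a brief outage. The path and timings are placeholders.
    import errno
    import time

    RETRYABLE = {errno.EIO, errno.ENOTCONN}

    def write_with_retry(path, data, attempts=10, delay=30):
        """Append data to path, retrying on transient filesystem errors."""
        for attempt in range(1, attempts + 1):
            try:
                with open(path, "ab") as f:
                    f.write(data)
                return
            except OSError as exc:
                if exc.errno not in RETRYABLE or attempt == attempts:
                    raise
                print(f"write failed ({exc}); retry {attempt}/{attempts} in {delay}s")
                time.sleep(delay)

    write_with_retry("/pl/active/example/results.dat", b"checkpoint\n")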
May 10, 20:36 MDT
Update - Another instance of this problem occurred around 1:15am on April 27th.

As soon as the issue was identified, we collected the debug data requested by our vendor. So far they have not been able to reproduce the problem in their labs, and it has been difficult to diagnose from the debug data collected previously. They requested a new set of debug data, which we expect will be enough for them to diagnose the problem this time.

The BeeGFS service has been running normally since the data was collected.
Apr 27, 11:10 MDT
Update - We experienced a recurrence of this issue today. Notably, our new monitoring _did_ catch this occurrence, including automatically preventing new Slurm jobs from starting until the issue was resolved.

While we had recently applied a patch to beegfs-mgmtd, the developer did not, in fact, believe it would fix our issue. Since we had not seen an occurrence of this issue since the patch (and some other configuration changes) were put in place, we had ceased taking certain preventative measures. Now that the issue has recurred with our current configuration, those preventative measures will resume.
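
As an illustration of the kind of automation described above (the probe path, timeout, and partition name below are placeholders, not our actual configuration), a health check that pauses Slurm dispatch might look roughly like this:

    # Minimal sketch: probe PetaLibrary/active and, if the probe fails, mark the
    # (assumed) "blanca" partition DOWN so Slurm stops dispatching new jobs.
    import subprocess

    PROBE_PATH = "/pl/active"   # directory whose listing should succeed quickly
    PARTITION = "blanca"        # hypothetical partition name

    def filesystem_healthy(timeout=30):
        """Return True if a simple metadata operation completes within the timeout."""
        try:
            subprocess.run(["ls", PROBE_PATH], check=True, timeout=timeout,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            return False

    def set_partition_state(state):
        """Use scontrol to change the partition state (e.g. DOWN or UP)."""
        subprocess.run(["scontrol", "update",
                        f"PartitionName={PARTITION}", f"State={state}"], check=True)

    if not filesystem_healthy():
        set_partition_state("DOWN")   # stop new jobs until the issue is resolved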
Apr 13, 08:35 MDT
Update - When we last experienced this issue with BeeGFS, we determined that the error was caused by beegfs-mgmtd exceeding an internal limit. We did not know whether this limit needed to be increased given the size and complexity of our system, or if it represented a problem (a bug) in the software.

In an effort to reduce the likelihood of a recurrence we increased the limit from 10k to 20k. However, the issue occurred again today, hitting the new 20k limit.
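
How such a limit is raised depends on whether it is the ordinary per-process open-files limit or a BeeGFS-internal setting. Purely as an illustration of the former case, Linux allows the limit of an already-running process to be adjusted; the PID below is a placeholder, and a persistent change would belong in the daemon's service configuration rather than a one-off adjustment like this:

    # Illustrative sketch: raising the open-file (RLIMIT_NOFILE) limit of an
    # already-running process on Linux. The PID and the 20k value are placeholders.
    import resource

    MGMTD_PID = 12345          # placeholder PID of the daemon
    NEW_LIMIT = 20000

    soft, hard = resource.prlimit(MGMTD_PID, resource.RLIMIT_NOFILE)
    print(f"current limits: soft={soft} hard={hard}")

    # Raise both soft and hard limits (requires sufficient privileges).
    resource.prlimit(MGMTD_PID, resource.RLIMIT_NOFILE, (NEW_LIMIT, NEW_LIMIT))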

Before returning the system to service once again, we gathered another detail from the system state which I believe indicates that the problem does, in fact, lie with the file system software itself. This is most likely a regression introduced during our recent upgrade.

With this information I expect the developer will be able to identify and fix the cause of our issue. Until then, we will continue to monitor and return the system to service as necessary, and we are deploying additional monitoring to assist us in responding more quickly.
Mar 25, 10:57 MDT
Monitoring - We have identified the issue and will continue monitoring the Petalibrary service.
Mar 25, 10:16 MDT
Update - We are continuing to investigate this issue.
Mar 25, 10:07 MDT
Investigating - We are currently investigating this issue.
Mar 25, 09:56 MDT
Update - I believe that we have found the root cause of this issue, or that we are at least approaching it.

The BeeGFS cluster has a central management daemon that assists clients and servers in locating resources in the cluster. This daemon is currently configured with a 10,000 open-files limit, and it appears to be reaching this limit in some circumstances. Once this occurs, the file system becomes inaccessible until the management daemon is restarted.

Our BeeGFS installation is notably complex, so it's possible we just need to increase this limit; but it is also possible that this represents a bug in the management daemon, and that increasing the limit would not resolve the issue.
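
If the limit in question is the ordinary per-process file-descriptor limit, its usage can be watched from /proc. A rough Python sketch (the pgrep lookup and the "beegfs-mgmtd" process name are assumptions for illustration):

    # Watch the management daemon's file-descriptor usage against its limit.
    import os
    import re
    import subprocess

    def daemon_pid(name="beegfs-mgmtd"):
        """Return the PID of the oldest process matching the given name."""
        out = subprocess.run(["pgrep", "-o", name],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.split()[0])

    def open_fds(pid):
        """Count entries in /proc/<pid>/fd (one per open file descriptor)."""
        return len(os.listdir(f"/proc/{pid}/fd"))

    def nofile_limit(pid):
        """Parse the soft 'Max open files' value from /proc/<pid>/limits."""
        with open(f"/proc/{pid}/limits") as f:
            for line in f:
                if line.startswith("Max open files"):
                    return int(re.split(r"\s{2,}", line.strip())[1])
        raise RuntimeError("limit not found")

    pid = daemon_pid()
    used, limit = open_fds(pid), nofile_limit(pid)
    print(f"beegfs-mgmtd pid={pid}: {used}/{limit} file descriptors in use")
    if used > 0.9 * limit:
        print("WARNING: approaching the open-files limit")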

We have presented these findings to the developers, and are awaiting their analysis.
Mar 21, 15:04 MDT
Monitoring - We have gathered a new set of detailed logs and other analytics and provided these to the filesystem support vendor for analysis. We have further identified the minimal action necessary (restarting a single backend service) to restore access to PetaLibrary/active.

Access to PetaLibrary/active has been restored, and jobs are once again being dispatched on Blanca. We regret that this has now happened three times, and are continuing to work with the support vendor to identify root cause and resolve this issue permanently.
Mar 21, 11:00 MDT
Investigating - We are aware of another instance of the same outage type on PetaLibrary/active (BeeGFS). This is the third such outage, and all three have occurred since our most recent upgrade.

While we investigate the cause of this outage further, we have stopped new Blanca jobs from starting. If you would like your partition to be re-activated, please let us know at rc-help@colorado.edu.

We suspect that there has been some kind of bug introduced during the most recent FS upgrade that is leading to this behavior; but it's difficult to track down because the symptoms also partially match what we would see if a user application were exhausting the number of file handles that can be open. We're going to take a bit more time today to try to gather log data to better track down the cause of this error, a continuation of the investigation that has been ongoing since the second such outage.
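
One quick way to check the user-application hypothesis is to rank processes by open file descriptors. A rough sketch (it requires enough privilege to read other users' /proc entries, and the top-10 cutoff is arbitrary):

    # Scan /proc and list the processes holding the most open file descriptors.
    import os

    counts = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            nfds = len(os.listdir(f"/proc/{entry}/fd"))
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
            counts.append((nfds, int(entry), name))
        except (PermissionError, FileNotFoundError):
            continue  # process exited or is not readable by us

    for nfds, pid, name in sorted(counts, reverse=True)[:10]:
        print(f"{nfds:6d} fds  pid={pid:<7d} {name}")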

More information will be posted here as it comes available.
Mar 21, 09:41 MDT
Monitoring - The limit on the number of open files was hit again, so we increased the limit further. We have a ticket open with our vendor to identify the correct value and prevent this from happening again, and we are following up on it.

PetaLibrary is back up, and we will continue to monitor it.
New jobs can start again on Blanca.
Mar 14, 11:20 MDT
Update - To minimize the impact of PetaLibrary/active being inaccessible on queued jobs, I have stopped Slurm from starting new jobs on Blanca. If you would like your partition returned to service before we resolve the problems with PetaLibrary, please contact rc-help@colorado.edu.
Mar 14, 09:21 MDT
Investigating - We have noticed BeeGFS communication errors that are preventing spaces under /pl/active from being used.
I'm opening this incident now, but have not yet been able to investigate further.
All BeeGFS servers are up, but error messages from the management service have been observed on some BeeGFS clients.
The incident will be updated as soon as the problem is better understood.
Mar 14, 08:36 MDT
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Past Incidents
Jun 19, 2019

No incidents reported today.

Jun 18, 2019

No incidents reported.

Jun 17, 2019

No incidents reported.

Jun 16, 2019

No incidents reported.

Jun 15, 2019

No incidents reported.

Jun 14, 2019

No incidents reported.

Jun 13, 2019
Resolved - Tonight, an upgrade of the PGI license caused a conflict with the MATLAB license, leaving both temporarily unavailable. I am not certain what impact this had on running jobs, but jobs that attempted to start MATLAB during the outage will have been unable to start properly.

Both licenses are again online and operational.
Jun 13, 22:12 MDT
Jun 12, 2019

No incidents reported.

Jun 11, 2019

No incidents reported.

Jun 10, 2019

No incidents reported.

Jun 9, 2019

No incidents reported.

Jun 8, 2019

No incidents reported.

Jun 7, 2019

No incidents reported.

Jun 6, 2019

No incidents reported.

Jun 5, 2019
Completed - Today's maintenance activities have concluded, and PetaLibrary, Blanca, and Summit are back in production.

We were unable to provoke a failure in beegfs-mgmtd, so we will likely need to allow it to fail naturally at least once in the relatively near future so that we can capture a gdb backtrace of the error state.
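
Such a backtrace could be captured by attaching gdb to the daemon in batch mode; the sketch below is illustrative only, the pgrep lookup and output path are assumptions, and useful traces would also require debug symbols for beegfs-mgmtd to be installed.

    # Attach gdb in batch mode and dump backtraces for every thread of the daemon.
    import subprocess

    pid = subprocess.run(["pgrep", "-o", "beegfs-mgmtd"],
                         capture_output=True, text=True, check=True).stdout.split()[0]

    with open("/tmp/beegfs-mgmtd-backtrace.txt", "w") as out:
        subprocess.run(["gdb", "-p", pid, "-batch",
                        "-ex", "thread apply all bt"],
                       stdout=out, stderr=subprocess.STDOUT, check=False)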

We were also unable to determine the cause of our prototype PetaLibrary zfs-direct allocation's failure to mount. We are working with the ZFS development community to try to determine root cause.
Jun 5, 17:54 MDT
Update - Today's planned maintenance activities are largely complete. We are running standard performance validation of the Summit environment, and an additional metadata performance test of PetaLibrary/active from Blanca. The metadata performance test is largely an attempt to see if beegfs-mgmtd fails under the load.
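
Such a test is normally run with a dedicated tool (e.g. mdtest), but the idea can be sketched in a few lines of Python: create, stat, and remove many small files under a test directory and report the metadata operation rate. The path and file count below are placeholders.

    # Tiny metadata-load sketch: create, stat, and unlink many small files.
    import os
    import time

    TEST_DIR = "/pl/active/rc-test/mdload"   # placeholder test directory
    N_FILES = 10000

    os.makedirs(TEST_DIR, exist_ok=True)
    start = time.time()
    for i in range(N_FILES):
        path = os.path.join(TEST_DIR, f"f{i:06d}")
        with open(path, "w") as f:        # create
            f.write("x")
        os.stat(path)                     # stat
        os.unlink(path)                   # remove
    elapsed = time.time() - start
    print(f"{3 * N_FILES / elapsed:.0f} metadata ops/sec over {N_FILES} files")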

We should be able to return to service soon.
Jun 5, 16:24 MDT
Update - We are commencing PetaLibrary maintenance activities, which will include interruptions to I/O for PetaLibrary/active (/pl/active/).
Jun 5, 09:01 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jun 5, 07:00 MDT
Update - This is a reminder that planned maintenance activities are scheduled for tomorrow, Wednesday, 5 June 2019. Updates will be posted here as they are available.
Jun 4, 15:22 MDT
Scheduled - Research Computing will perform regularly-scheduled planned maintenance on Wednesday, 5 June 2019. June's activities include:

- Summit OPA switch firmware update
- Decommission Summit "debug" QoS
- Security updates on Internet-facing servers (including login nodes)
- Internal changes to RC DNS to better conform to public DNS
- PetaLibrary BeeGFS storage configuration testing
- PetaLibrary BeeGFS failure testing
- PetaLibrary BeeGFS OS and software updates
- PetaLibrary BeeGFS xattr and ACL support configuration
- PetaLibrary ZFS allocation incident investigation
- Summit performance validation

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. During the maintenance period no jobs will run on Summit or Blanca resources, and PetaLibrary/active (BeeGFS) will be intermittently offline due to testing and configuration changes.

Blanca partitions may be individually returned to service on request, particularly if you are unaffected by the scheduled PetaLibrary/active outage.

If you have any questions or concerns, please contact rc-help@colorado.edu.
May 30, 11:10 MDT