Update - When we last experienced this issue with BeeGFS, we determined that the error was caused by beegfs-mgmtd exceeding an internal limit. We did not know whether this limit needed to be increased given the size and complexity of our system, or whether it represented a problem (a bug) in the software.

In an effort to reduce the likelihood of a recurrence, we increased the limit from 10k to 20k. However, the issue occurred again today, hitting the new 20k limit.

Before returning the system to service once again we gathered another detail from the system state which I believe indicates that the problem does, in fact, lie with the file system software itself. This is most likely a regression introduced during our recent upgrade.

With this information I expect the developer should be able to identify and fix the cause of our issue. Until then, we will continue to monitor and return the system to service as necessary, and are deploying additional monitoring to assist us in responding more quickly.
Mar 25, 10:57 MDT
Monitoring - We have identified the issue and will continue monitoring the PetaLibrary service.
Mar 25, 10:16 MDT
Update - We are continuing to investigate this issue.
Mar 25, 10:07 MDT
Investigating - We are currently investigating this issue.
Mar 25, 09:56 MDT
Update - I believe that we have found the root cause for this issue, or at least we are approaching it.

The BeeGFS cluster has a central management daemon that assists clients and servers in locating resources in the cluster. This daemon is currently configured with a 10,000 open-files limit, and it appears to be reaching this limit in some circumstances. Once this occurs, the file system becomes inaccessible until the management daemon is restarted.
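
The condition described above can be checked for on a Linux host with a minimal Python sketch like the one below, which compares a process's open file descriptor count against its soft "Max open files" limit via /proc. This is an illustration under stated assumptions, not a definitive implementation: the function names and the 80% warning threshold are hypothetical, and reading another user's /proc/<pid>/fd normally requires root.

```python
from __future__ import annotations

import os
from pathlib import Path


def parse_max_open_files(limits_text: str) -> int | None:
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the soft limit is 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft = line[len(label):].split()[0]  # first column is the soft limit
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def open_file_usage(pid: int) -> tuple[int, int | None]:
    """Count open descriptors for pid and read its soft limit (Linux only)."""
    proc = Path("/proc") / str(pid)
    open_count = len(os.listdir(proc / "fd"))  # one entry per open descriptor
    limit = parse_max_open_files((proc / "limits").read_text())
    return open_count, limit


def is_near_limit(open_count: int, limit: int | None, threshold: float = 0.8) -> bool:
    """True when usage is at or above threshold * limit (assumed 80% default)."""
    return limit is not None and open_count >= threshold * limit
```

For example, open_file_usage could be called with the PID of the management daemon and the result passed to is_near_limit, so that an alert fires before the limit is actually reached.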

Our BeeGFS installation is notably complex, so it's possible we just need to increase this limit; but it is also possible that this represents a bug in the management daemon, and that increasing the limit would not resolve the issue.

We have presented these findings to the developers, and are awaiting their analysis.
Mar 21, 15:04 MDT
Monitoring - We have gathered a new set of detailed logs and other analytics and provided these to the filesystem support vendor for analysis. We have further identified the minimal action necessary (restarting a single backend service) to restore access to PetaLibrary/active.

Access to PetaLibrary/active has been restored, and jobs are once again being dispatched on Blanca. We regret that this has now happened three times, and are continuing to work with the support vendor to identify the root cause and resolve this issue permanently.
Mar 21, 11:00 MDT
Investigating - We are aware of another instance of the same outage type on PetaLibrary/active (BeeGFS). This is the third such outage, all since our most recent upgrade.

While we investigate the cause of this outage further, we have stopped new Blanca jobs from starting. If you would like your partition to be re-activated, please let us know at rc-help@colorado.edu.

We suspect that there has been some kind of bug introduced during the most recent FS upgrade that is leading to this behavior; but it's difficult to track down because the symptoms also partially match what we would see if a user application were exhausting the number of file handles that can be open. We're going to take a bit more time today to try to gather log data to better track down the cause of this error, a continuation of the investigation that has been ongoing since the second such outage.
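
One way to tell a single application exhausting file handles apart from a problem in a daemon is to rank the processes on a host by how many descriptors each holds open. The following Python sketch shows the idea for a Linux host. It is an assumption-laden illustration rather than the procedure we followed: the function names are hypothetical, and processes that cannot be inspected (because they have exited or we lack permission) are silently skipped.

```python
from __future__ import annotations

import os


def collect_fd_counts(proc_root: str = "/proc") -> dict[int, int]:
    """Map each readable PID to its number of open descriptors (Linux only)."""
    counts: dict[int, int] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            counts[int(entry)] = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
    return counts


def top_fd_holders(counts: dict[int, int], n: int = 5) -> list[tuple[int, int]]:
    """Return the n (pid, open_count) pairs with the most open descriptors."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
```

Calling top_fd_holders(collect_fd_counts()) as root would list the heaviest descriptor users on that host, which helps show whether one process dominates.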

More information will be posted here as it becomes available.
Mar 21, 09:41 MDT
Monitoring - The limit on the number of open files was hit again, so we have increased it further. We have a ticket open with our vendor to identify the correct value and prevent this from happening again, and we are following up on it.

PetaLibrary is back up, but we'll continue to monitor.
New jobs can once again start on Blanca.
Mar 14, 11:20 MDT
Update - To minimize the impact of PetaLibrary/active being inaccessible on queued jobs, I have stopped Slurm from starting new jobs on Blanca. If you would like your partition returned to service before we resolve the problems with PetaLibrary, please contact rc-help@colorado.edu.
Mar 14, 09:21 MDT
Investigating - We have noted BeeGFS communication errors that are preventing spaces under /pl/active from being used.
I'm opening this incident but have not yet been able to investigate further.
All BeeGFS servers are up, but the management service is reporting error messages associated with some BeeGFS clients.
This incident will be updated as soon as the problem is understood.
Mar 14, 08:36 MDT
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Scheduled Maintenance
Research Computing will perform regularly scheduled planned maintenance on Wednesday, 3 April 2019. April's activities include:

- Cooling tower cleaning at HPCF
- Performance validation of the Summit compute environment
- Configuration changes and bugfixes for PetaLibrary/active (BeeGFS)
- Acceptance testing of PetaLibrary/active (BeeGFS)

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. During the maintenance period no jobs will run on Summit resources, and all Summit resources (including Summit storage) will be unavailable. PetaLibrary/active may also experience periodic outages during testing and fixing. Because of the possibility of PetaLibrary/active outages we have also reserved Blanca compute resources for maintenance; if you would like to continue running on Blanca despite potential PetaLibrary outages, please contact rc-help@colorado.edu and let us know which partition to exempt from reservation.

We apologize for having two such high-impact maintenance outages in a row. We were unable to complete the planned HPCF cooling tower maintenance last week due to inclement weather, and we are finishing up BeeGFS testing and fixes at the same time.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Posted on Mar 22, 10:42 MDT
Past Incidents
Mar 26, 2019
Completed - The scheduled maintenance has been completed.
Mar 26, 07:16 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Mar 26, 05:45 MDT
Scheduled - Tomorrow morning hosts in the Blanca cluster and blogin01 may be unreachable while the campus firewall is rebooted/upgraded from 5:45am to 6:45am. All jobs will continue to run without issue and we expect no interruption to any of the hosts outside of network connectivity.
Mar 25, 17:03 MDT
Mar 24, 2019

No incidents reported.

Mar 23, 2019

No incidents reported.

Mar 22, 2019

No incidents reported.

Mar 20, 2019

No incidents reported.

Mar 19, 2019

No incidents reported.

Mar 18, 2019

No incidents reported.

Mar 17, 2019

No incidents reported.

Mar 16, 2019

No incidents reported.

Mar 15, 2019

No incidents reported.

Mar 13, 2019

No incidents reported.

Mar 12, 2019
Resolved - This incident has been resolved.
Mar 12, 09:10 MDT
Monitoring - As expected, the network switch serving blanca03 was found powered off. We've experienced this failure before (including in other similar chassis), and the supplier has not provided a root cause. We will follow up with support once again.

The switch has been powered on, and the nodes in Blanca 03 have been returned to service.
Mar 3, 20:22 MST
Investigating - This afternoon we received outage alerts from what appears to be all nodes in the Blanca 03 chassis (that is, nodes named bnode03*). Most likely there has been a transient error in the network switch, similar to those we have seen before. I will attempt to return this switch to service now; if we are unable to resolve the issue immediately, we will continue the investigation Monday morning.
Mar 3, 20:11 MST
Resolved - We will continue to work with our filesystem support vendor to ensure that we are running with the correct configuration, but access to PetaLibrary/active has remained stable following our resolution Sunday.
Mar 12, 09:10 MDT
Monitoring - Increasing the limits on the number of open files in the system appears to have resolved the issue, and PetaLibrary/active (BeeGFS) is now accessible. We'll continue to monitor this issue on Monday, and follow up with support to confirm the correct value for these limits going forward.

Jobs are once again starting on Blanca.

If you notice any further trouble, please contact rc-help@colorado.edu.

Mar 10, 18:14 MDT
Identified - This outage appears to be the result of us exceeding a server-side configured limit on the number of open files in the system. We are following a procedure to increase this limit, which should restore access.

It is our impression that this is a side-effect of increased use of the system, and does not represent an actual system fault.
Mar 10, 17:59 MDT
Investigating - We are investigating an unplanned outage on PetaLibrary/active (BeeGFS) as exported via /pl/active/. Slurm has been stopped on Blanca while we investigate.
Mar 10, 17:28 MDT