All Systems Operational
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
Feb 20, 2020

No incidents reported today.

Feb 19, 2020

No incidents reported.

Feb 18, 2020
Resolved - Tonight we experienced another beegfs-meta outage. As before, this appears to have created _hangs_ in IO accessibility, rather than returning errors; so running jobs should continue successfully, adding only a few minutes to their runtime.

As before, we are still pursuing a root-cause analysis with the developer, and we used this opportunity to take more log and telemetry readings than we have in the past. This incident also reinforces our ongoing plans to move away from this particular platform.

edit: We have received reports that this did, in fact, cause job failures. We apologize for this interruption, and for this misunderstanding.
Feb 18, 21:00 MST
Feb 17, 2020

No incidents reported.

Feb 16, 2020

No incidents reported.

Feb 15, 2020

No incidents reported.

Feb 14, 2020

No incidents reported.

Feb 13, 2020
Resolved - We have applied a configuration change that may prevent this issue from happening again, and we are gathering additional data for the vendor that will help with analysis should it recur. At the moment there are no known issues with PetaLibrary/active.
Feb 13, 15:13 MST
Monitoring - Another failure in beegfs-meta occurred at 6:30am this morning, and momentarily paused IO requests to PetaLibrary/active.

We are continuing to work with the file system developer to explain this behavior.
Feb 13, 08:36 MST
Resolved - This incident has been resolved.
Feb 13, 11:45 MST
Monitoring - A fix has been implemented and we are monitoring the results.
Feb 12, 21:32 MST
Identified - An upstream networking problem has been identified, and symptoms have been experienced in other CU services. The networking team is working to resolve the issue.
Feb 12, 17:38 MST
Investigating - We are investigating an issue preventing proper authentication to the RC login environment.
Feb 12, 16:47 MST
Resolved - IDL licenses remain available. Symptoms have not returned. Resolving this incident.
Feb 13, 10:48 MST
Monitoring - A fix has been implemented and we are monitoring the IDL license availability.
Feb 13, 08:46 MST
Update - Applied a fix to the server, as suggested by the Harris support team. They suspect the cause may be the speed at which jobs are being started.
Feb 12, 14:57 MST
Investigating - We have experienced a series of interruptions in the IDL license server. We have restored service for now, but it may go down again.

We are working with support to understand the cause of the issue.
Feb 12, 08:43 MST
Feb 12, 2020
Resolved - The same failure in beegfs-meta occurred again, and momentarily paused IO requests to PetaLibrary/active.

We are still working with the file system developer to explain this behavior.
Feb 12, 18:00 MST
Feb 11, 2020

No incidents reported.

Feb 10, 2020
Resolved - This incident has been resolved.
Feb 10, 15:24 MST
Monitoring - The downed scompile node has been returned to service.
Feb 10, 12:05 MST
Investigating - We are investigating a problem with one of the allocated scompile nodes. Until this issue is resolved, some attempts to access scompile (ssh scompile) will fail, due to the way connection attempts are balanced across the nodes. If you try again, you should be placed on an active node.
Feb 10, 11:34 MST
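The workaround above, simply retrying the connection until load balancing places you on an active node, can be sketched as a small retry helper. This is an illustrative sketch only; the helper name and the demo command are hypothetical, and in practice the retried command would be something like ssh scompile.

```shell
#!/bin/sh
# Hypothetical retry helper: run a command up to N times, pausing between
# attempts, until it succeeds. Because connection attempts to scompile are
# balanced across nodes, a later attempt will usually land on an active node.
retry() {
    tries=$1; shift
    n=1
    while [ "$n" -le "$tries" ]; do
        "$@" && return 0        # success: stop retrying
        n=$((n + 1))
        sleep 1                 # brief pause before the next attempt
    done
    return 1                    # all attempts failed
}

# Demo with a stand-in command that fails once, then succeeds
# (in practice: retry 3 ssh scompile hostname).
marker=$(mktemp -u)
flaky() {
    if [ -e "$marker" ]; then
        echo "connected"
    else
        touch "$marker"
        return 1
    fi
}
retry 3 flaky
```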
Resolved - Access to PetaLibrary/active was briefly interrupted due to one of the metadata servers becoming unresponsive. We restarted the affected daemon and access was restored.

We are providing log data to the developer in hopes of better understanding this failure scenario.
Feb 10, 13:00 MST
Feb 9, 2020

No incidents reported.

Feb 8, 2020

No incidents reported.

Feb 7, 2020
Resolved - This incident has been resolved.
Feb 7, 14:51 MST
Monitoring - A fix has been implemented and we are monitoring the results.
Feb 7, 12:02 MST
Identified - The issue has been identified and a fix is being implemented.
Feb 7, 11:35 MST
Investigating - We are investigating an issue following yesterday's Slurm upgrade that is creating the error "Requested node configuration is not available" when submitting to shas-testing.
Feb 6, 09:41 MST
Resolved - This incident has been resolved.
Feb 7, 14:21 MST
Monitoring - A fix has been implemented and we are monitoring the results.
Feb 7, 10:08 MST
Investigating - We are investigating an issue that is preventing Summit KnL (sknl) nodes from accepting jobs. The issue appears to be related to these nodes having not retained their desired operating mode settings, which affects visible memory. Slurm, having identified that the memory available is different than expected, is preventing jobs from starting.

More information is available at https://software.intel.com/en-us/articles/intel-xeon-phi-x200-processor-memory-modes-and-cluster-modes-configuration-and-use-cases
Feb 6, 16:06 MST
Feb 6, 2020
Resolved - We have identified the issue that prevented $SLURM_SCRATCH directories from being made writable during job start, and have updated our Slurm prolog scripts to resolve the issue. Please let us know if you still have trouble writing to $SLURM_SCRATCH.
Feb 6, 16:03 MST
Investigating - We are investigating an issue regarding $SLURM_SCRATCH not being writable during Slurm job execution.
Feb 6, 15:10 MST
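For context on the incident above: Slurm runs a site-defined prolog script at job start, and a missed step there can leave the per-job scratch directory read-only. The sketch below is a hypothetical illustration of that kind of prolog step; the paths, fallback values, and variable names are assumptions, not our actual prolog (which would also chown the directory to the job's user, omitted here since that requires root).

```shell
#!/bin/sh
# Hypothetical sketch of a Slurm prolog step that prepares a writable
# per-job scratch directory. SCRATCH_BASE and the "demo" fallback are
# illustrative; under Slurm, SLURM_JOB_ID is set in the prolog environment.
SCRATCH_BASE="${SCRATCH_BASE:-/tmp/scratch-demo}"
JOB_ID="${SLURM_JOB_ID:-demo}"
JOB_SCRATCH="$SCRATCH_BASE/job_$JOB_ID"

mkdir -p "$JOB_SCRATCH"
chmod 700 "$JOB_SCRATCH"   # make the directory writable by its owner;
                           # skipping a step like this leaves $SLURM_SCRATCH read-only
echo "$JOB_SCRATCH"
```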
Resolved - A fix has been put in place and we believe the issue of jobs reporting "launch failed requeued held" has been resolved. If your problem persists, please contact rc-help@colorado.edu.
Feb 6, 15:27 MST
Investigating - We are investigating the cause of "launch failed requeued held" messages that some users are seeing following the Slurm upgrade yesterday. We will provide updates here as we have them.
Feb 6, 09:23 MST
Resolved - The configuration mistake for PetaLibrary/active (BeeGFS) has been corrected and all nodes are again able to mount the file system.
Feb 6, 09:36 MST
Identified - We are responding to a misconfiguration that is preventing PetaLibrary/active (BeeGFS) from mounting on some compute nodes.
Feb 6, 09:24 MST