Investigating Slurm incident
Incident Report for CU Boulder RC
Resolved
No further issues were observed with Slurm following this incident and patch application. New policies will be implemented during the February planned maintenance to help prevent this kind of event in the future.
Posted about 2 months ago. Jan 31, 2019 - 10:18 MST
Monitoring
A patch from support has been applied, and Blanca appears to be operational once again. We will continue monitoring the system closely.
Posted 2 months ago. Jan 16, 2019 - 18:02 MST
Identified
Slurm support has identified the issue that caused the failure in Blanca Slurm. Unfortunately, updating Slurm, including applying a provided patch, has not yet resolved the issue.

We are continuing to work on this issue with upstream support.
Posted 2 months ago. Jan 16, 2019 - 17:34 MST
Update
We are continuing to investigate the cause of the Slurm events we are experiencing today. Blanca Slurm has become unresponsive, and we are currently unable to start it. We are reaching out to Slurm support for assistance.
Posted 2 months ago. Jan 16, 2019 - 12:34 MST
Investigating
We are investigating a Slurm incident that occurred today around 10:30 AM. This incident may have affected Blanca, Summit, and EnginFrame, up to and including the early termination of jobs.

We apologize for the interruption, and are working to determine the cause.
Posted 2 months ago. Jan 16, 2019 - 11:15 MST
This incident affected: RMACC Summit, Blanca, and EnginFrame.