Degraded service on Summit and "Viz"
Incident Report for CU Boulder RC
Resolved
This incident has been resolved.
Posted Nov 21, 2022 - 14:00 MST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 21, 2022 - 11:11 MST
Investigating
We are experiencing issues with the Slurm controller (the process that coordinates job scheduling and maintenance) on Summit and "Viz", the cluster that hosts Remote Desktop sessions.

The issues typically appear as errors when jobs are scheduled, for example messages indicating "MaxJobsPerUser" or "Batch job submission failed". They affect jobs scheduled from the command line with "sbatch" or "sinteractive" as well as jobs scheduled to Summit from the CURC gateways, JupyterHub and OnDemand.

We will provide updates as progress is made.

The Alpine and Blanca clusters are not affected.
Posted Nov 21, 2022 - 08:38 MST
This incident affected: Research Computing Core and RMACC Summit.