Slurm job start degradation on Alpine
Incident Report for CU Boulder RC
Resolved
We believe this incident is resolved. We will continue to monitor.
Posted Apr 17, 2024 - 17:00 MDT
Monitoring
Changes made to our Slurm topology file this morning appear to have significantly eased this issue. We will continue to monitor throughout the day.
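For context, Slurm's network topology is typically described in a topology.conf file that groups nodes under switches; the scheduler consults it when placing jobs. A minimal illustrative sketch of the file format (switch and node names here are hypothetical, not our actual configuration):

    # topology.conf (illustrative only)
    SwitchName=leaf1 Nodes=node[001-100]
    SwitchName=leaf2 Nodes=node[101-200]
    SwitchName=spine Switches=leaf1,leaf2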
Posted Apr 17, 2024 - 09:28 MDT
Update
We are continuing to assess. We have provided interim solutions to speed the start of interactive jobs; thus far these appear to have succeeded, though we will work to improve their user experience in the coming days.

SchedMD and RC continue to investigate batch jobs. Our best current understanding is that the cluster is consistently under high load or, on many occasions, is reserving resources for very large jobs, leading to longer waits for smaller jobs to start. We have not reached a firm conclusion and are continuing to monitor, but much of the evidence (including detailed analysis of log files relevant to backfill scheduling and priority) points in this direction.
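For those interested in the diagnostics involved, Slurm exposes backfill-scheduler statistics and per-job priority factors through standard commands; we have been reviewing this kind of data alongside the scheduler logs. A rough sketch of the sort of checks involved (not an exact record of our procedure):

    # Backfill cycle counts, depth, and timing since the last stats reset
    sdiag
    # Per-job priority components (age, fair-share, partition, QOS)
    sprio -l
    # Pending jobs with the scheduler's current reason for each
    squeue --state=PENDING -o "%.18i %.9P %.8u %.10r"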

As such, RC is discussing options to better support smaller jobs. These may include changing how priority is calculated or reconfiguring the cluster to dedicate resources to “short” jobs so that they start more quickly.
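To illustrate the kind of change under discussion (everything below is hypothetical, not a committed plan), a dedicated short-job partition and adjusted priority weights would be expressed in slurm.conf roughly as follows:

    # slurm.conf (hypothetical sketch only)
    # A partition reserved for short jobs, capped at four hours
    PartitionName=ashort Nodes=<subset-of-nodes> MaxTime=04:00:00 State=UP
    # Weight job age and fair-share more heavily relative to job size
    PriorityWeightAge=2000
    PriorityWeightFairshare=4000
    PriorityWeightJobSize=500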

We will leave this incident open for at least one more full day, after which the team will convene to make a final determination.
Posted Apr 15, 2024 - 13:20 MDT
Update
We have established a recommendation to ensure researchers can run interactive sessions while we continue our investigation.

Users seeking interactive resources with limited wait times should use the testing partitions (atesting, atesting_mi100, or atesting_a100) or acompile instead of amilan, aa100, or ami100. Please see our documentation for more information about interactive jobs: https://curc.readthedocs.io/en/latest/running-jobs/interactive-jobs.html?highlight=sinteractive#general-interactive-jobs
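As an example, an interactive session on a testing partition can generally be requested with something like the following (exact flags depend on your needs; the acompile command is an alternative for quick compile-and-test work):

    # Request a one-hour, single-task interactive session on atesting
    sinteractive --partition=atesting --ntasks=1 --time=01:00:00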

Our next update will be on Monday.
Posted Apr 12, 2024 - 15:36 MDT
Update
We are engaging the vendor regarding degraded job start times on Alpine. They have requested, and we have provided, additional information to help diagnose the cause. Internal troubleshooting continues in parallel. We expect our next update to be tomorrow.
Posted Apr 11, 2024 - 18:16 MDT
Update
In the interest of system consistency, we are awaiting guidance from SchedMD support before performing additional tests and troubleshooting. We will provide an update as soon as possible.
Posted Apr 11, 2024 - 09:55 MDT
Update
We are continuing to investigate. We have engaged SchedMD, the vendor who provides support for the Slurm scheduler. We expect to have our next update tomorrow morning.
Posted Apr 10, 2024 - 17:48 MDT
Investigating
The issue has persisted. We are continuing to investigate.
Posted Apr 10, 2024 - 14:55 MDT
Monitoring
The issue with delayed job starts on Alpine appears to have improved following troubleshooting this morning. We will monitor today for regression or continued improvement.
Posted Apr 10, 2024 - 11:16 MDT
Update
We are continuing to investigate. The acompile service on Alpine was affected and has been restored. Work continues on the primary Alpine partitions.
Posted Apr 10, 2024 - 10:25 MDT
Investigating
Queued jobs on Alpine are experiencing delayed starts. We are investigating the issue and will provide an update when more information is available.

Running jobs on Alpine are not impacted and are expected to complete successfully. Jobs on Blanca are not impacted.
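In the meantime, users can check their queued jobs and Slurm's current start-time estimates with standard commands, for example (the job ID below is a placeholder):

    # List your pending jobs with the scheduler's estimated start times
    squeue -u $USER --start
    # Show why a specific job is still pending
    scontrol show job 1234567 | grep -i Reason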
Posted Apr 09, 2024 - 19:18 MDT
This incident affected: Alpine.