Cooling failure at HPCF
Incident Report for CU Boulder RC
Resolved
Our remediation efforts, coupled with more favorable weather and temperatures, appear to have resolved the cooling issues at the HPCF. We are continuing to investigate more permanent solutions, including improving our regular maintenance schedule.
Posted Sep 29, 2021 - 23:02 MDT
Update
We have restored full production for Blanca and Summit. New jobs are again starting.

Our datacenter team has confirmed that this is a recurrence of the same failure that we experienced yesterday: the VFD overheated, which shut down the fan, which in turn caused a cooling outage for the facility. It appears that cleaning out the VFD, as we did yesterday, was insufficient to prevent the recurrence. As such, the following additional steps have been taken:

- The maximum frequency of the VFD was lowered from 60 Hz to 55 Hz.
- Additional interim cooling was added to the facility, directed at the overheating VFD.
- Additional staff will be monitoring the system during peak hours tomorrow, possibly preemptively on-site.
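
As a rough illustration of what the frequency change implies, the sketch below applies the ideal fan affinity laws, under which airflow scales with fan speed, static pressure with its square, and shaft power with its cube. It assumes that fan speed is proportional to the VFD output frequency and ignores real-world losses. The function name `affinity_scaling` is hypothetical and is not part of any RC tooling.

```python
def affinity_scaling(old_hz: float, new_hz: float) -> dict:
    """Return ideal fan affinity-law ratios for a change in drive frequency.

    Assumption: fan speed is directly proportional to VFD output frequency.
    """
    ratio = new_hz / old_hz
    return {
        "speed": ratio,          # fan speed relative to the old setting
        "airflow": ratio,        # airflow scales linearly with speed
        "pressure": ratio ** 2,  # static pressure scales with speed squared
        "power": ratio ** 3,     # shaft power scales with speed cubed
    }


# The change described above: maximum frequency lowered from 60 Hz to 55 Hz.
scaling = affinity_scaling(60.0, 55.0)
for name, value in scaling.items():
    print(f"{name}: {value:.1%}")
```

Under these idealized assumptions, the lower ceiling gives up roughly 8% of peak airflow while cutting peak fan power by roughly 23%. How much this reduces heat in the drive itself depends on the specific equipment.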

We will continue to discuss more permanent solutions on Monday.
Posted Sep 11, 2021 - 17:18 MDT
Monitoring
Power to the HPCF cooling tower has been restored. We are leaving Summit and Blanca in a "draining" state while we assess.
Posted Sep 11, 2021 - 15:58 MDT
Update
In response to the recurrence of this cooling failure, we have again requeued or canceled jobs on Summit where we were able. This has led to an appreciable drop in operating temperatures, so for now we will try to avoid canceling jobs on Blanca.

We are still waiting for HVAC support to arrive on-site. We will post further updates as we are able.
Posted Sep 11, 2021 - 15:49 MDT
Identified
We are responding to another cooling failure at the HPCF. Summit and Blanca are affected. More details to come.
Posted Sep 11, 2021 - 15:32 MDT
Update
We are continuing to monitor for any further issues.
Posted Sep 10, 2021 - 20:59 MDT
Monitoring
We have returned Summit and Blanca to production and are monitoring the environment for any recurrence.
Posted Sep 10, 2021 - 20:58 MDT
Identified
The variable frequency drive (VFD) that controls the fan in our evaporative cooling tower "tripped" due to an overheat condition. On inspection, the VFD was found to have been insufficiently maintained. Coupled with today's high temperatures, this led to a cooling failure in this one component which, in turn, disabled cooling for the facility.

This particular VFD has been cleaned, reset, and returned to service. In response to this event, we are scheduling maintenance for all VFDs in the HPCF to ensure that they are properly maintained, both now and in the future.

We have now been cleared to return to production. We regret the cancellation of the production workload on Summit. Running jobs were "requeued," so jobs that can be restarted should restart automatically. Jobs that cannot restart automatically will need to be requeued.
Posted Sep 10, 2021 - 20:28 MDT
Update
On advice from our datacenter team on-site, we started canceling jobs in preparation for a full-system shutdown. However, before we actually powered off any equipment, reports came in that cooling had been restored and temperatures had begun to come back down.

We are waiting for a full sign-off from datacenter and facilities management before restoring production.
Posted Sep 10, 2021 - 19:47 MDT
Investigating
We are investigating and responding to a thermal event at the HPCF, which houses Summit and much of Blanca. The situation calls for an emergency power-off of our compute infrastructure, which is underway.
Posted Sep 10, 2021 - 19:38 MDT
This incident affected: Blanca and RMACC Summit.