Update - We have restored full production for Blanca and Summit. New jobs are again starting.
Our datacenter team has confirmed that this is a reoccurrence of the same failure that we experienced yesterday: the VFD overheated, which shut down the fan and, in turn, caused a cooling outage for the facility. It appears that cleaning out the VFD, as we did yesterday, was insufficient to prevent the reoccurrence. As such, the following additional steps have been taken:
- The maximum frequency of the VFD was lowered from 60 Hz to 55 Hz.
- Additional interim cooling was added to the facility, directed at the overheating VFD.
- Additional staff will be monitoring the system during peak hours tomorrow, possibly on-site as a precaution.
We will continue to discuss more permanent solutions on Monday.
Sep 11, 17:18 MDT
Monitoring - Power to the HPCF cooling tower has been restored. We are leaving Summit and Blanca in a "draining" state while we assess.
Sep 11, 15:58 MDT
Update - In response to the reoccurrence of this cooling failure, we have again requeued or canceled jobs on Summit where we were able. This has led to an appreciable drop in operating temperatures, so for now we will try to avoid canceling jobs on Blanca.
We are still waiting for HVAC support to arrive on-site. We will post further updates as we are able.
Sep 11, 15:49 MDT
Identified - We are responding to another cooling failure at the HPCF. Summit and Blanca are affected. More details to come.
Sep 11, 15:32 MDT
Update - We are continuing to monitor for any further issues.
Sep 10, 20:59 MDT
Monitoring - We have returned Summit and Blanca to production and are monitoring the environment for any reoccurrence.
Sep 10, 20:58 MDT
Identified - The variable frequency drive (VFD) that controls the fan in our evaporative cooling tower "tripped" due to an overheat condition. On inspection, the VFD itself was found to have been insufficiently maintained. Coupled with today's high temperatures, this caused the VFD to fail, which, in turn, disabled cooling for the facility.
This particular VFD has been cleaned, reset, and returned to service. In response to this event, we are scheduling maintenance for all VFDs in the HPCF to ensure that they are properly maintained, both now and in the future.
We have now been cleared to return to production. We regret the cancellation of the production workload on Summit. Running jobs were "requeued," so jobs that can be restarted should restart automatically. Jobs that cannot be restarted automatically will need to be requeued manually.
Sep 10, 20:28 MDT
Update - On advice from our datacenter team on-site, we started canceling jobs in preparation for a full-system shutdown. However, before we actually powered off any equipment, reports came in that cooling had been restored and temperatures had begun to come back down.
We are waiting for a full sign-off from datacenter and facilities management before restoring production.
Sep 10, 19:47 MDT
Investigating - We are investigating and responding to a thermal event at the HPCF, which houses Summit and much of Blanca. The situation calls for an emergency power-off of our compute infrastructure, which is underway.
Sep 10, 19:38 MDT