Unplanned Summit outage due to datacenter cooler failure
Incident Report for CU Boulder RC
Resolved
The remaining Summit compute nodes have been returned to service, and production appears to have remained stable overnight. As always, please report any problems to rc-help@colorado.edu.
Posted Jul 12, 2018 - 09:44 MDT
Monitoring
The cooling system in the HPCF has been returned to production, and jobs are again running on Summit. Preliminary performance validation indicates that Summit compute nodes are operating normally, but we will continue to monitor the situation today and tomorrow. Please contact rc-help@colorado.edu if you have any trouble.

A few nodes have yet to be brought back into service, but the majority are already operational. The rest will be returned to service today and tomorrow.
Posted Jul 11, 2018 - 16:38 MDT
Identified
The cooling issue that prompted us to shut down Summit has been identified and is being rectified. We expect to be able to bring Summit back into production later today.

Recently, a maintenance issue was discovered in the HPCF affecting the reliability and longevity of the cooling system. We have been planning an outage to coincide with the 1 August planned maintenance to address this issue, and in the meantime we have been conducting daily maintenance to prevent it from getting worse. This morning, a portion of the cooling system was left in a maintenance state. As a result, all water was allowed to drain from the cooling tower, and the facility began to overheat.

We are restoring the cooling system to its production state, after which we will bring Summit back up and into production. Be advised, however, that we will be taking an extended outage starting 1 August to fully address the underlying maintenance issue. (The maintenance is currently planned to last 3 days.)

Thank you for your understanding while we address this issue. If you have any questions or concerns, please contact rc-help@colorado.edu.
Posted Jul 11, 2018 - 13:31 MDT
Investigating
Summit is being brought down in response to a critical failure in the cooling system of the high-performance computing facility (HPCF). More information will be provided soon.
Posted Jul 11, 2018 - 12:35 MDT
This incident affected: RMACC Summit.