Over-temperature at HPCF
Incident Report for CU Boulder RC
Resolved
A drain in the HPCF cooling system was found stuck open, which allowed the evaporative cooling tower to empty and reduced cooling efficiency. The problem has been corrected, and we have resumed normal operation on Summit compute and Blanca HPC resources.
Posted Sep 01, 2019 - 19:52 MDT
Investigating
We have been advised of higher-than-expected temperatures at the HPCF, which houses the Summit and Blanca HPC nodes. We have configured those compute nodes to drain until we receive word that the HVAC system has been inspected and the issue has been resolved.

We are not killing any running jobs at this time, and we have not observed any thermal alerts from the nodes themselves yet.
Posted Sep 01, 2019 - 16:29 MDT
This incident affected: Blanca and RMACC Summit.