Update - We have restored full production for Blanca and Summit. New jobs are again starting.

Our datacenter team has confirmed that this is a recurrence of the same failure that we experienced yesterday. (VFD overheat leading to fan shutdown, leading to a cooling outage for the facility.) It appears that cleaning out the VFD, as we did yesterday, was insufficient to prevent the recurrence. As such, the following additional steps have been taken:

- The maximum frequency of the VFD was lowered from 60 Hz to 55 Hz.
- Additional interim cooling was added to the facility, directed at the overheating VFD.
- Additional staff will be monitoring the system during peak hours tomorrow, possibly on-site as a preemptive measure.

We will continue to discuss more permanent solutions on Monday.
Sep 11, 17:18 MDT
Monitoring - Power to the HPCF cooling tower has been restored. We are leaving Summit and Blanca in a "draining" state while we assess.
Sep 11, 15:58 MDT
Update - In response to the recurrence of this cooling failure we have again requeued/cancelled jobs on Summit as we were able. This has led to an appreciable drop in operating temperatures, so, for now, we will try to avoid canceling jobs on Blanca.

We are still waiting for HVAC support to arrive on-site. Further updates as we are able.
Sep 11, 15:49 MDT
Identified - We are responding to another cooling failure at the HPCF. Summit and Blanca are affected. More details to come.
Sep 11, 15:32 MDT
Update - We are continuing to monitor for any further issues.
Sep 10, 20:59 MDT
Monitoring - We have returned Summit and Blanca to production and are monitoring the environment for any recurrence.
Sep 10, 20:58 MDT
Identified - The variable frequency drive (VFD) that controls the fan in our evaporative cooling tower "tripped" due to an overheat condition. On inspection, the VFD itself was discovered to have been insufficiently maintained. Coupled with today's high temperatures, this led to a cooling failure in this one component which, in turn, disabled cooling for the facility.

This particular VFD has been cleaned, reset, and returned to service. In response to this event, we are scheduling maintenance for all VFDs in the HPCF to ensure that they are properly maintained, both now and in the future.

We have now been cleared to return to production. We regret the cancellation of the production workload on Summit. Running jobs were "requeued," so jobs able to be restarted should restart automatically. Jobs unable to restart automatically will need to be resubmitted.
Sep 10, 20:28 MDT
Update - On advice from our datacenter team onsite, we started canceling jobs in preparation for a full-system shutdown. However, before we actually powered off equipment, reports came in that cooling had been restored and temperatures had begun to come back down.

We are waiting for a full sign-off from datacenter and facilities management before restoring production.
Sep 10, 19:47 MDT
Investigating - We are investigating and responding to a thermal event at the HPCF, which houses Summit and much of Blanca. The situation calls for an emergency power-off of our compute infrastructure, which is underway.
Sep 10, 19:38 MDT
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
Sep 22, 2021

No incidents reported today.

Sep 21, 2021

No incidents reported.

Sep 20, 2021

No incidents reported.

Sep 19, 2021

No incidents reported.

Sep 18, 2021

No incidents reported.

Sep 17, 2021

No incidents reported.

Sep 16, 2021

No incidents reported.

Sep 15, 2021
Resolved - The multiple PetaLibrary disk failures in a single slot have not produced enough data to justify additional hardware replacement. The service has been stable for over two weeks, so we are closing this issue. The vendor is committed to assisting us if the failure mode reappears and we can pinpoint the failed component.
Sep 15, 14:47 MDT
Monitoring - We have returned PetaLibrary to service and will be following up with our support vendor to understand the root cause of this issue.
Sep 1, 11:07 MDT
Investigating - While investigating this issue with a diagnostic script provided by our hardware supplier, one of the BeeGFS servers supporting PetaLibrary/active encountered an issue. We are working to restore service as soon as possible.
Sep 1, 10:34 MDT
Identified - PetaLibrary services are available again, with one node not in the cluster. Two disks have failed, one of which is the third failure in the same disk slot. This implies an issue with a backplane or I/O module. We are working with the vendor to understand the cause of the failures.
Aug 26, 09:49 MDT
Investigating - Two PetaLibrary nodes are currently down, making most allocations inaccessible. We are working to restore functionality.
Aug 26, 08:50 MDT
Resolved - A brief outage of GPFS (Summit scratch) occurred due to a restart of the opafm service on the standby fabric manager. The standby fabric manager attempted to seize the master role, causing a dual-head situation. The opafm service was stopped as soon as the situation was noticed. It appears the outage had minimal impact and was caught quickly enough to prevent more serious issues. We will continue to monitor the service and work with the network vendor to determine root cause.
Sep 15, 09:49 MDT
Sep 14, 2021

No incidents reported.

Sep 13, 2021

No incidents reported.

Sep 12, 2021

No incidents reported.

Sep 11, 2021

Unresolved incident: Cooling failure at HPCF.

Sep 10, 2021
Resolved - Our fix for the LDAP stability issue has been deployed more widely, and continues to successfully avoid the previous issue.
Sep 10, 20:59 MDT
Identified - We believe we have a workaround for this issue staged and ready. It is deployed on tlogin1 as a test right now, and we will evaluate deploying the updated configuration more widely in the morning.
Sep 3, 02:16 MDT
Investigating - We are investigating and triaging an issue in the RC directory services (LDAP) subsystem. This is part of the RC core, and may impact all other RC services, including running jobs.
Sep 2, 12:21 MDT
Resolved - Access to Summit scratch has remained stable, though we are continuing to investigate with our network support vendor.
Sep 10, 20:58 MDT
Monitoring - Access to Summit scratch has been restored, and some Summit compute nodes are still recovering (automatically). This appears to have been the result of a network disruption, possibly coinciding with some internal diagnostics. We are engaging with our network support provider for further clarification.
Sep 7, 16:33 MDT
Investigating - The GPFS scratch file system has become unavailable. We are researching the cause.
Sep 7, 15:46 MDT
Sep 9, 2021

No incidents reported.

Sep 8, 2021

No incidents reported.