HPCF interruption affecting Summit and Blanca HPC
Incident Report for CU Boulder RC
Resolved
We believe this issue has been resolved, though we will test both sides of our HPCF network connection during our upcoming planned maintenance period.
Posted Oct 29, 2020 - 20:09 MDT
Update
At 11:24 AM, the network between HPCF/Summit and the rest of the science network went down. The network appears to be functional again as of 11:34 AM, and the networking team is investigating the cause of the outage. Summit compute jobs that were running during the outage may have failed.
Posted Oct 20, 2020 - 11:40 MDT
Monitoring
A temporary workaround was implemented at approximately 4:18 PM today. Summit and Blanca HPC appear to be operating normally again.

The network team has identified a faulty link that has been disrupting traffic between the COMP and HPCF datacenters. RC core services (core storage, DNS, LDAP, etc.) are housed in COMP, while Summit and a portion of Blanca are housed in HPCF. A disruption in this link therefore affects not only our access to HPCF but also the ability of Summit and Blanca HPC to function, because both depend on these core services.

Our temporary fix has been to manually fail this connection path over to its secondary link; the primary link will be inspected Friday. We do not yet understand why the secondary link was not used automatically when the disruption occurred, as it was designed to be.
Posted Oct 08, 2020 - 22:51 MDT
Investigating
The network issue behind this incident has recurred. The OIT Network team is working to resolve the outage. Until it is resolved, Summit jobs will not start and most existing jobs will fail. Summit is functioning internally, but it cannot communicate beyond its own borders.
Posted Oct 08, 2020 - 16:02 MDT
Monitoring
A network issue has been identified and is being reviewed by OIT personnel. Summit remains operational.
Posted Oct 08, 2020 - 13:17 MDT
Investigating
At approximately 5:00 AM today, an unknown event caused Summit to become unavailable and jobs to be cancelled. Summit appears to have returned to normal at around 7:00 AM. We are investigating the cause of the incident and verifying that Summit is fully functional.
Posted Oct 08, 2020 - 07:56 MDT
This incident affected: Blanca and RMACC Summit.