Update - At 11:24 AM the network between HPCF/Summit and the rest of the science network went down. The network appears to have been functional again since 11:34 AM, and the networking team is investigating what caused the outage. Summit compute jobs running during the outage may have failed.
Oct 20, 11:40 MDT
Monitoring - A temporary workaround was implemented at approximately 4:18 PM today. Summit (and Blanca HPC) appear to be operating normally again.
The network team has identified a faulty link that has been causing disruption between the COMP and HPCF datacenters. RC core services (core storage, DNS, LDAP, etc.) are housed in COMP, while Summit and a portion of Blanca are housed in HPCF. A disruption in this link has therefore affected not only our access to HPCF but also the ability of Summit and Blanca HPC to function, because they depend on these core services.
Our temporary fix has been to manually fail this connection path over to its secondary link, and the primary link will be inspected on Friday. We do not currently understand why the secondary link was not used automatically when the network disruption occurred, as it was designed to be.
Oct 8, 22:51 MDT
Investigating - The network issue behind this incident has recurred. The OIT Network team is working on resolving the outage. Until it is resolved, Summit jobs will not start and most existing jobs will fail. Summit itself is functioning normally; it just cannot communicate beyond its own borders.
Oct 8, 16:02 MDT
Monitoring - A network issue has been identified and is being reviewed by OIT personnel. Summit remains operational.
Oct 8, 13:17 MDT
Investigating - At approximately 5:00 AM today, an unknown event caused Summit to become unavailable and jobs to be cancelled. Summit appears to have returned to normal at around 7:00 AM. We are investigating the cause of the incident and verifying that Summit is functioning correctly.
Oct 8, 07:56 MDT