__Summit Networking Outage__
Incident Report for CU Boulder RC
Resolved
The Summit Ethernet aggregation switch was replaced successfully, and service was restored. We happen to be in the middle of a planned maintenance outage right now, but Summit has otherwise been operating normally.

A second switch has been physically deployed alongside the current aggregation switch, and we intend to split production across the two switches in a redundant pair. This should prevent such a single-point-of-failure outage in the future.
Posted Mar 03, 2021 - 12:21 MST
Update
Summit appears to have remained up and stable since we replaced the aggregation switch with our shelf spare. We anticipate receipt of a replacement switch on Tuesday, at which point we intend to deploy the two switches as an active/active redundant pair ("stack"), hopefully obviating this risk in the future.
Posted Feb 13, 2021 - 22:26 MST
Update
All Summit partitions are now online and accepting jobs. We will be closely monitoring the operation of the new hardware. We believe the system is fully operational and stable at this time.
Posted Feb 13, 2021 - 13:26 MST
Monitoring
The new hardware is in place and we are verifying correct operation of the cluster.
Posted Feb 13, 2021 - 12:53 MST
Identified
Networking and Research Computing are on-site and working to repair the failed network connection with replacement hardware.
Posted Feb 13, 2021 - 11:33 MST
Investigating
Summit is offline again. We are working to determine the cause.
Posted Feb 13, 2021 - 08:10 MST
Update
All Summit partitions are once again accepting and running jobs. We believe the system to be stable at this time, but will continue monitoring it throughout the weekend. Thank you for your patience during this extended outage.
Posted Feb 12, 2021 - 20:07 MST
Monitoring
We believe that the network problem that led to today's outage has been addressed. We have allowed some jobs to start on Summit, and we are monitoring system stability for any further problems.

If the system remains stable, we intend to release the system for regular use in the next hour or so.

After this, we will begin preparations to eliminate this single point of failure so that this does not happen again.
Posted Feb 12, 2021 - 19:02 MST
Update
The networking team continues to work on the issue.
Posted Feb 12, 2021 - 15:58 MST
Identified
The switch has failed again, and Summit is not available at the present time. The CU networking team is heading in to troubleshoot the switch. If necessary, a spare is available to be swapped in.
Posted Feb 12, 2021 - 10:04 MST
Monitoring
A network aggregation switch became unresponsive around 04:44 this morning. The switch has been rebooted and is now operating correctly. The CU networking team is investigating the root cause of the switch failure.
Posted Feb 12, 2021 - 09:33 MST
Investigating
RMACC Summit experienced an outage sometime overnight and is presently offline. Some storage partitions, including PetaLibrary, are affected. We are investigating the issue and will provide an update as soon as possible.
Summit and PetaLibrary users will be unable to access these resources until the issue is resolved. Blanca nodes appear to be unaffected; however, Blanca jobs that use /pl/active for job I/O may be impacted.
Posted Feb 12, 2021 - 07:25 MST
This incident affected: PetaLibrary and RMACC Summit.