The Summit Ethernet aggregation switch was replaced successfully, and service was restored. We are currently in the middle of a planned maintenance outage, but Summit has otherwise been operating normally.
A second switch has been physically deployed alongside the current aggregation switch, and we intend to split production across the two switches as a redundant pair. This should prevent a similar single-point-of-failure outage in the future.
Mar 3, 12:21 MST
Summit appears to have remained up and stable since we replaced the aggregation switch with our shelf spare. We anticipate receipt of a replacement switch on Tuesday, at which point we intend to deploy the two switches as an active/active redundant pair ("stack"), which we hope will eliminate this risk in the future.
Feb 13, 22:26 MST
All Summit partitions are now online and accepting jobs. We will be closely monitoring the operation of the new hardware. We believe the system is fully operational and stable at this time.
Feb 13, 13:26 MST
The new hardware is in place and we are verifying correct operation of the cluster.
Feb 13, 12:53 MST
Networking and Research Computing are on-site and working to repair the failed network connection with replacement hardware.
Feb 13, 11:33 MST
Summit is offline again. We are working to determine the cause.
Feb 13, 08:10 MST
All Summit partitions are once again accepting and running jobs. We believe the system to be stable at this time, but will continue monitoring it throughout the weekend. Thank you for your patience during this extended outage.
Feb 12, 20:07 MST
We believe that the network problem that led to today's outage has been addressed. We have allowed some jobs to start on Summit, and we are monitoring system stability for any further problems.
If the system remains stable, we intend to release the system for regular use in the next hour or so.
After this, we will begin preparations to eliminate this single point of failure so that this outage does not recur.
Feb 12, 19:02 MST
The networking team continues to work on the issue.
Feb 12, 15:58 MST
The switch has failed again. Summit is not available at present. The CU network team is heading in to troubleshoot the switch. If necessary, a spare is available to be swapped in.
Feb 12, 10:04 MST
A network aggregation switch became unresponsive around 04:44 this morning. The switch has been rebooted and is now operating correctly. The CU networking team is investigating the root cause of the switch failure.
Feb 12, 09:33 MST
RMACC Summit experienced an outage sometime overnight and is presently offline. Some storage partitions, including PetaLibrary, are affected. We are investigating the issue and will provide an update as soon as possible.
Summit and PetaLibrary users will be unable to access these resources until the issue is resolved. Blanca nodes appear to be unaffected; however, Blanca jobs that use /pl/active for job I/O may be impacted.
Feb 12, 07:25 MST