Partial Summit storage outage
Incident Report for CU Boulder RC
Resolved
Summit storage and the Omni-Path (OPA) interconnect have remained stable for several weeks since we addressed faulty connections and a misbehaving switch. We therefore consider this incident resolved.
Posted Jul 29, 2021 - 15:31 MDT
Monitoring
Last week, after disconnecting a faulty network cable, we restored the affected Omni-Path switch and snsd4, a GPFS file server for Summit storage that is connected to this switch. During our planned maintenance yesterday we also restored service to a subset of compute nodes attached to this switch. We are continuing to monitor the network for any recurrence of symptoms.
Posted Jul 08, 2021 - 10:44 MDT
Identified
A single Omni-Path switch in the fabric has been shut down, and the nodes that connect to this switch have been marked unavailable. Access to Summit scratch appears to be stable at this time. We will continue to monitor the availability of Summit scratch and work with the vendor to determine the root cause of the issue.
Posted Jun 25, 2021 - 07:23 MDT
Update
We have correlated the start of our problem with errors in a certain log. These errors also correlate with nodes connected to a single Omni-Path switch in the Summit fabric. We intend to shut down this switch in hopes that it will resolve the errors we are seeing. If so, we may then need to replace the switch.

These nodes are currently running several jobs. Because this problem has been so pervasive and long-running, we intend to "requeue" these jobs to clear the nodes so that they can be disconnected. We will also reach out individually to affected users. The affected nodes are shas04[01-28],sgpu04[01,02],smem0401. Affected (running) jobs can be queried using squeue:

squeue --nodelist='shas04[01-28],sgpu04[01,02],smem0401' --user=$USER
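
For anyone who checks repeatedly, the query can be wrapped in a small shell function. This is only a sketch: the names AFFECTED_NODES and affected_jobs are hypothetical, and the output columns (job ID, job name, state, allocated nodes) are an illustrative choice rather than anything this notice prescribes. The node list is quoted so that the shell does not try to interpret the brackets as a filename pattern.

```shell
# Hypothetical variable name: the node list from this notice, quoted so the
# shell passes the brackets through to squeue unchanged.
AFFECTED_NODES='shas04[01-28],sgpu04[01,02],smem0401'

# Hypothetical helper: print one line per job belonging to the current user
# that is allocated to any of the affected nodes.
# Columns: job ID (%i), job name (%j), state (%T), allocated nodes (%N).
affected_jobs() {
    squeue --noheader \
           --nodelist="$AFFECTED_NODES" \
           --user="${USER:-$(id -un)}" \
           --format='%i %j %T %N'
}
```

Running affected_jobs after sourcing this snippet on a login node should print nothing if none of your jobs are on the affected nodes.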

We apologize for this interruption, and for the duration of this debugging effort. We are eager to return the system to full operation as soon as possible.
Posted Jun 24, 2021 - 16:15 MDT
Update
This issue also affects access to the subset of PetaLibrary allocations that use Summit storage.
Posted Jun 23, 2021 - 10:21 MDT
Update
In addition to issues accessing Summit storage from some compute nodes, we are also experiencing performance problems accessing Summit storage from login nodes and DTN. Investigation so far suggests that these issues are unrelated, but we will continue to investigate them together, as they may yet share a common underlying cause.
Posted Jun 22, 2021 - 14:32 MDT
Update
We continue to engage with the vendor to resolve this issue.
Posted Jun 21, 2021 - 17:17 MDT
Update
We have identified a possible issue in the Omni-Path network and are working with the vendor to troubleshoot.
Posted Jun 18, 2021 - 16:00 MDT
Investigating
Some Summit compute nodes are unable to access /scratch/summit. We are investigating the cause of the outage at this time.
Posted Jun 17, 2021 - 15:40 MDT
This incident affected: PetaLibrary and RMACC Summit.