Monitoring - Last week, after disconnecting a faulty network cable, we restored the affected Omni-Path switch and snsd4, a GPFS file server for Summit storage that is connected to this switch. During our planned maintenance yesterday we also restored service to a subset of the compute nodes attached to this switch. We are continuing to monitor the network for any recurrence of symptoms.
Jul 8, 10:44 MDT
Identified - A single Omni-Path switch in the fabric has been shut down, and the nodes that connect to this switch have been marked unavailable. Access to Summit scratch appears to be stable at this time. We will continue to monitor the availability of Summit scratch and work with the vendor to determine the root cause of the issue.
Jun 25, 07:23 MDT
Update - We have correlated the start of our problem with errors in a certain log. These errors also correlate with nodes connected to a single Omni-Path switch in the Summit fabric. We intend to shut down this switch in hopes that it will resolve the errors we are seeing. If so, we may then need to replace the switch.
These nodes are currently running several jobs. Because this problem has been so pervasive and long-running, we intend to requeue these jobs to clear the nodes so that they may be disconnected. We will also reach out individually to affected users. The affected nodes are shas04[01-28],sgpu04[01,02],smem0401. You can list your affected running jobs using squeue:
squeue --nodelist='shas04[01-28],sgpu04[01,02],smem0401' --user="$USER"
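If you would like a more compact listing, the following is a minimal sketch that assumes the standard Slurm squeue options (--states, --format, --noheader) behave on Summit as they do in stock Slurm. The variable name AFFECTED_NODES is ours, chosen for illustration. The sketch prints the job ID, job name, state, and allocated nodes for each of your running jobs on the affected nodes, and prints a hint instead if squeue is not available on the current host.

```shell
# Node list copied from this notice; quoted so the shell does not
# treat the square brackets as a filename pattern.
AFFECTED_NODES='shas04[01-28],sgpu04[01,02],smem0401'

if command -v squeue >/dev/null 2>&1; then
    # %i = job ID, %j = job name, %T = job state, %N = allocated nodes
    squeue --nodelist="$AFFECTED_NODES" \
           --user="${USER:-$(id -un)}" \
           --states=RUNNING \
           --format='%i %j %T %N' \
           --noheader
else
    # squeue is only expected to exist where Slurm is installed.
    echo "squeue not found; run this where Slurm is available" >&2
fi
```

No output means that none of your running jobs are on the affected nodes.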
We apologize for this interruption, and for the duration of this debugging effort. We are eager to return the system to full operation as soon as possible.
Jun 24, 16:15 MDT
Update - This issue also affects access to the subset of PetaLibrary allocations that use Summit storage.
Jun 23, 10:21 MDT
Update - In addition to issues accessing Summit storage from some compute nodes, we are also experiencing performance problems accessing Summit storage from the login nodes and DTN. Investigations so far suggest that these issues are unrelated, but we will continue to investigate them together because they may yet share a common underlying cause.
Jun 22, 14:32 MDT
Update - We continue to engage with the vendor to resolve this issue.
Jun 21, 17:17 MDT
Update - We have identified a possible issue in the Omni-Path network and are working with the vendor to troubleshoot.
Jun 18, 16:00 MDT
Investigating - Some Summit compute nodes are unable to access /scratch/summit. We are investigating the cause of the outage at this time.
Jun 17, 15:40 MDT