Monitoring - A fix has been made and we are monitoring the result.
Jul 29, 06:52 MDT
Investigating - The server that provides this endpoint is currently unreachable. We are investigating the cause of this issue. Other Globus endpoints are unaffected.
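
For users who want to check on their own, endpoint status can be queried with the Globus CLI; a minimal sketch, where ENDPOINT_ID is a placeholder since this notice does not include the endpoint's UUID:

globus endpoint show ENDPOINT_ID   # substitute the XSEDE endpoint's actual UUID (placeholder here)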
Jul 28, 09:21 MDT
Monitoring - Last week, after disconnecting a faulty network cable, we restored the affected Omni-Path switch and snsd4, a GPFS file server for Summit storage that is connected to this switch. During our planned maintenance yesterday we also restored service to a subset of compute nodes attached to this switch. We are continuing to monitor the network for any recurrence of symptoms.
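
For context, this kind of monitoring typically pairs a fabric error report with a GPFS daemon check; a minimal sketch, assuming the Intel Omni-Path FastFabric tools and standard GPFS commands are installed:

opareport -o errors   # summarize link error counters across the Omni-Path fabric
mmgetstate -N snsd4   # confirm the GPFS daemon is active on the restored file server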
Jul 8, 10:44 MDT
Identified - A single Omni-Path switch in the fabric has been shut down, and the nodes that connect to this switch have been marked unavailable. Access to Summit scratch appears to be stable at this time. We will continue to monitor the availability of Summit scratch and work with the vendor to determine the root cause of the issue.
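
For reference, a sketch of how the affected nodes would typically be marked unavailable in Slurm (node names are from the Jun 24 update below; the Reason string is an assumption):

scontrol update NodeName=shas04[01-28],sgpu04[01,02],smem0401 State=DRAIN Reason="Omni-Path switch shut down"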
Jun 25, 07:23 MDT
Update - We have correlated the start of our problem with errors in a certain log. These errors also correlate with nodes connected to a single Omni-Path switch in the Summit fabric. We intend to shut down this switch in the hope that doing so will resolve the errors we are seeing. If so, we may then need to replace the switch.

These nodes are currently running several jobs. Due to the pervasiveness and duration of this problem, we intend to "requeue" these jobs to clear the nodes so that they can be disconnected. We will also reach out to affected users individually; the affected nodes are shas04[01-28], sgpu04[01,02], and smem0401. Affected (running) jobs can be queried with squeue, as shown below.

squeue --nodelist="shas04[01-28],sgpu04[01,02],smem0401" --user=$USER
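
For completeness, a sketch of how the requeue itself might be carried out; scontrol requeue is the standard Slurm mechanism, but the exact procedure shown here is an assumption, and it is an administrative operation:

# Collect the IDs of running jobs on the affected nodes and requeue each one.
squeue --nodelist="shas04[01-28],sgpu04[01,02],smem0401" --states=RUNNING -h -o %i \
  | xargs -r -n1 scontrol requeue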

We apologize for this interruption, and for the duration of this debugging effort. We are eager to return the system to full operation as soon as possible.
Jun 24, 16:15 MDT
Update - This issue also affects access to the subset of PetaLibrary allocations that use Summit storage.
Jun 23, 10:21 MDT
Update - In addition to issues accessing Summit storage from some compute nodes, we are also experiencing performance problems accessing Summit storage from the login nodes and DTNs. Investigations so far suggest that these are unrelated issues, but we will continue to investigate them together, as there may yet be a common underlying cause.
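
As an illustration of how such performance problems can be probed, a simple timed write against Summit scratch; the path under /scratch/summit and the transfer size are assumptions, and this is a rough check rather than our actual diagnostic:

# Time a 64 MiB direct, synced write; an unusually long runtime points at storage latency.
dd if=/dev/zero of=/scratch/summit/$USER/dd-probe bs=1M count=64 oflag=direct conv=fsync
rm -f /scratch/summit/$USER/dd-probe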
Jun 22, 14:32 MDT
Update - We continue to engage with the vendor to resolve this issue.
Jun 21, 17:17 MDT
Update - We have identified a possible issue in the Omni-Path network and are working with the vendor to troubleshoot.
Jun 18, 16:00 MDT
Investigating - Some Summit compute nodes are unable to access /scratch/summit. We are investigating the cause of the outage at this time.
Jun 17, 15:40 MDT
Research Computing Core - Partial Outage
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Past Incidents
Jul 29, 2021

Unresolved incident: XSEDE Globus endpoint unavailable.

Jul 28, 2021
Resolved - Our BeeGFS storage pools are configured to panic a node if disk writes cannot be completed within 10 seconds. A disk held up writes to a storage pool, causing one of the PetaLibrary hosts to panic and reboot. We are working with the vendor to replace the disks that have reported errors over the past couple of days. The disk replacements should not interrupt PetaLibrary services.
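
For context, a sketch of one generic Linux mechanism that produces this kind of panic-on-stalled-I/O behavior, the hung-task watchdog; this is an illustration only, and the actual BeeGFS/PetaLibrary configuration may differ:

sysctl -w kernel.hung_task_timeout_secs=10   # flag tasks blocked on I/O for 10 seconds
sysctl -w kernel.hung_task_panic=1           # panic the node instead of only logging a warning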
Jul 28, 09:37 MDT
Update - We are continuing to monitor for any further issues.
Jul 27, 10:09 MDT
Monitoring - A fix has been implemented and we are monitoring the results.
Jul 27, 10:09 MDT
Investigating - We are investigating a PetaLibrary failure that occurred overnight, as well as the subsequent failure of some services to restart properly. Some PetaLibrary allocations may be inaccessible or interrupted while we investigate.
Jul 27, 09:05 MDT
Jul 27, 2021
Completed - The scheduled maintenance has been completed.
Jul 27, 16:00 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jul 27, 15:00 MDT
Scheduled - All of the login and admin2 nodes will be rebooted to apply a new kernel. This is being done to mitigate a recently discovered security issue in the current kernel; a quick post-reboot check is sketched after the node list below.

Nodes affected:
admin2
dtn[3-6]
login[10-13]
tlogin1
blogin01
blogin-ics2
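
After the reboot, users can confirm that a node picked up the new kernel with standard commands; note that the expected version string is site-specific and not listed here, and the rpm query assumes a RHEL-family system:

uname -r                         # kernel currently running
rpm -q kernel --last | head -1   # most recently installed kernel package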
Jul 26, 10:56 MDT
Jul 26, 2021

No incidents reported.

Jul 25, 2021

No incidents reported.

Jul 24, 2021

No incidents reported.

Jul 23, 2021

No incidents reported.

Jul 22, 2021

No incidents reported.

Jul 21, 2021

No incidents reported.

Jul 20, 2021

No incidents reported.

Jul 19, 2021

No incidents reported.

Jul 18, 2021

No incidents reported.

Jul 17, 2021

No incidents reported.

Jul 16, 2021

No incidents reported.

Jul 15, 2021

No incidents reported.