Monitoring - At 10:20am this morning, routine maintenance led to a brief (5 min) PL/active outage. The cause was poorly balanced storage targets, several of which filled to capacity. The service is now stable, and we are investigating what led to the target imbalance.
Oct 22, 10:57 MDT
Update - At 11:24am the network between HPCF/Summit and the rest of the science network went down. The network appears to be functional again as of 11:34, and the networking team is investigating what caused the outage. Summit compute jobs running during the outage may have failed.
Oct 20, 11:40 MDT
Monitoring - A temporary workaround was implemented at approximately 4:18 PM. Summit (and Blanca HPC) appear to be operating normally again.

The network team has identified a faulty link that has been disrupting traffic between the COMP and HPCF datacenters. RC core services (core storage, DNS, LDAP, etc.) are housed in COMP, while Summit and a portion of Blanca are housed in HPCF. A disruption in this link therefore affects not only our access to HPCF but also the ability of Summit and Blanca HPC to function, because they depend on these core services.

Our temporary fix has been to manually fail this connection path over to its secondary link; the primary link will be inspected Friday. We do not yet understand why the secondary link was not used automatically when the network disruption occurred, as it was designed to be.
Oct 8, 22:51 MDT
Investigating - The network issue behind this incident has recurred. The OIT Network team is working to resolve the outage. Until it is resolved, Summit jobs will not start and most existing jobs will fail. Summit is fine internally; it just can't communicate beyond its own borders.
Oct 8, 16:02 MDT
Monitoring - A network issue has been identified and is being reviewed by OIT personnel. Summit remains operational.
Oct 8, 13:17 MDT
Investigating - At approximately 5:00 AM, an unknown event caused Summit to become unavailable and jobs to be cancelled. Summit appears to have returned to normal at around 7:00 AM. We are investigating the cause of the incident and verifying that Summit is functioning correctly.
Oct 8, 07:56 MDT
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
Oct 23, 2020

No incidents reported today.

Oct 22, 2020

Unresolved incident: PL/active service interruption.

Oct 21, 2020

No incidents reported.

Oct 20, 2020

Unresolved incident: HPCF interruption affecting Summit and Blanca HPC.

Oct 19, 2020

No incidents reported.

Oct 18, 2020

No incidents reported.

Oct 17, 2020

No incidents reported.

Oct 16, 2020

No incidents reported.

Oct 15, 2020

No incidents reported.

Oct 14, 2020

No incidents reported.

Oct 13, 2020
Completed - The scheduled maintenance has been completed.
Oct 13, 15:00 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 13, 13:00 MDT
Scheduled - We will be running a load test on the Isilon to determine the root cause of performance issues due to misplaced writes in a workflow. Systems may slow dramatically, but only for a relatively short period of time (10-15 minutes).
Oct 12, 09:28 MDT
Oct 12, 2020

No incidents reported.

Oct 11, 2020

No incidents reported.

Oct 10, 2020

No incidents reported.

Oct 9, 2020

No incidents reported.