All Systems Operational
Research Computing Core Operational
Science Network Operational
Alpine Operational
99.95 % uptime over the past 90 days
RMACC Summit Operational
Blanca Operational
PetaLibrary Operational
EnginFrame Operational
JupyterHub Operational
CUmulus OpenStack Platform Operational
100.0 % uptime over the past 90 days
AWS ec2-us-west-2 Operational
AWS rds-us-west-2 Operational
AWS s3-us-west-2 Operational
Past Incidents
Aug 19, 2022

No incidents reported today.

Aug 18, 2022

No incidents reported.

Aug 17, 2022

No incidents reported.

Aug 16, 2022

No incidents reported.

Aug 15, 2022

No incidents reported.

Aug 14, 2022

No incidents reported.

Aug 13, 2022

No incidents reported.

Aug 12, 2022
Resolved - This incident has been resolved. A brief explanation follows.

After Wednesday's maintenance, we experienced unexpected issues on some Blanca GPU nodes relating to the OS images and to updates to our job scheduler. It presented itself on a small number of nodes late in the day Thursday, and it became apparent that it was a widespread issue Friday morning. Today (Friday) RC determined and applied the necessary fixes. The nodes are now in production again.

We believe the issues on these nodes have been fixed and that no further nodes will be affected. However, we will continue monitoring the situation. We are also preserving the affected groups' access to blanca-curc-gpu through the weekend in case of further related issues.

We will resume regular maintenance and troubleshooting next week.

Aug 12, 18:40 MDT
Monitoring - A fix has been implemented and we are monitoring results.
Aug 12, 18:05 MDT
Update - We continue to work on the Blanca GPU issue. To minimize weekend research disruption, we have temporarily granted access on blanca-curc-gpu to the labs owning the affected Blanca nodes (see the link for full list):

bgpu-bortz1
bgpu-kann1
bgpu-papp1
bgpu-casa1
bgpu-ivc

In job scripts, users in those groups may specify --account=blanca-curc-gpu, --qos=blanca-curc-gpu, and --partition=blanca-curc-gpu to receive higher-priority access.
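As an illustration of the three options named above, here is a minimal Python sketch that renders them as `#SBATCH` header lines for a batch script. The helper name `blanca_gpu_directives` is hypothetical and not part of any RC tooling; only the option names and the value `blanca-curc-gpu` come from the update itself.

```python
def blanca_gpu_directives(name="blanca-curc-gpu"):
    """Return #SBATCH header lines for the three options named in the update.

    Hypothetical helper for illustration; the same value is used for the
    account, QOS, and partition, as stated in the update.
    """
    options = ("account", "qos", "partition")
    return [f"#SBATCH --{opt}={name}" for opt in options]
```

Printing the returned lines directly beneath the `#!/bin/bash` line of a job script gives the header described in the update. The rest of the script (resources, commands) is unchanged.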

Aug 12, 17:08 MDT
Update - We are continuing to work on a fix for this issue.
Aug 12, 15:11 MDT
Identified - Most nodes are now restored. The Blanca GPU image, used by several nodes, has an additional issue and a fix is being developed now.
Aug 12, 13:10 MDT
Update - We have restored around half the downed nodes to service. Work continues on the others.
Aug 12, 12:11 MDT
Investigating - We are currently investigating this issue.
Aug 12, 09:03 MDT
Aug 11, 2022
Resolved - This incident has been resolved.
Aug 11, 18:00 MDT
Monitoring - A fix has been implemented and we are monitoring the results.
Aug 11, 17:44 MDT
Identified - The Summit GPU nodes (the "sgpu" partition) are presently out of service so that we can implement a new node image to address Slurm issues. We anticipate this work will be completed shortly.

No other Summit partitions are impacted by this outage.

Aug 11, 14:13 MDT
Aug 10, 2022
Resolved - This incident has been resolved.
Aug 10, 17:47 MDT
Monitoring - A fix has been implemented and we are monitoring the results.
Aug 10, 17:29 MDT
Investigating - We are currently investigating this issue.
Aug 10, 17:10 MDT
Completed - The scheduled maintenance has been completed.
Aug 10, 16:45 MDT
Update - We are continuing to verify the maintenance items.
Aug 10, 16:38 MDT
Verifying - Planned work has concluded and verification is currently underway for the maintenance items.
Aug 10, 16:37 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Aug 10, 08:00 MDT
Scheduled - We will perform our monthly planned maintenance during this time. Affected services will be unavailable.
Aug 2, 16:16 MDT
Aug 9, 2022

No incidents reported.

Aug 8, 2022

No incidents reported.

Aug 7, 2022

No incidents reported.

Aug 6, 2022

No incidents reported.

Aug 5, 2022

No incidents reported.