All Systems Operational
Research Computing Core Operational
Science Network Operational
Alpine Operational
100.0 % uptime over the past 90 days
RMACC Summit Operational
Blanca Operational
PetaLibrary Operational
EnginFrame Operational
JupyterHub Operational
CUmulus OpenStack Platform Operational
100.0 % uptime over the past 90 days
AWS ec2-us-west-2 Operational
AWS rds-us-west-2 Operational
AWS s3-us-west-2 Operational
Past Incidents
May 16, 2022

No incidents reported today.

May 15, 2022

No incidents reported.

May 14, 2022

No incidents reported.

May 13, 2022

No incidents reported.

May 12, 2022
Resolved - This incident has been resolved.
May 12, 08:58 MDT
Monitoring - A fix has been implemented and we are monitoring the results.
May 12, 08:23 MDT
Investigating - RMACC Summit is presently not accepting new jobs, and some running jobs may have failed. We are still troubleshooting the issue and will provide an update when the cluster is back online. This outage is not affecting the Blanca or Alpine clusters. The outage does affect JupyterHub users.
May 11, 18:23 MDT
May 11, 2022
May 10, 2022

No incidents reported.

May 9, 2022
Completed - The scheduled maintenance has been completed.
May 9, 11:39 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 9, 09:00 MDT
Scheduled - We must patch several critical vulnerabilities in the Slurm scheduler. This patch requires upgrading the shared Slurm database, which in turn requires us to schedule additional downtime on all clusters. We will perform the upgrades on Monday 5/9/2022 at 9am and expect them to take approximately one hour. Some existing jobs scheduled to conclude after Monday at 9am may be affected but will be requeued once the patch is completed.

Thank you for your patience while we perform this work. Keeping the research systems safe is of paramount importance.
May 6, 13:04 MDT
May 8, 2022

No incidents reported.

May 7, 2022

No incidents reported.

May 6, 2022

No incidents reported.

May 5, 2022

No incidents reported.

May 4, 2022
Completed - The scheduled maintenance has been completed.
May 4, 17:00 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 4, 07:00 MDT
Update - We will be undergoing scheduled maintenance during this time.
May 3, 08:37 MDT
Scheduled - We will be undergoing maintenance during this period, including preventative maintenance on the cooling system at HPCF, which will require a shutdown of all equipment in that facility.
Apr 25, 15:39 MDT
May 3, 2022
Resolved - The switch has been replaced and all attached nodes are now fully functioning.
May 3, 08:09 MDT
Identified - We are working with the network team to restore service.
Apr 27, 08:14 MDT
Investigating - We have lost a switch in a rack of nodes. The following nodes are unavailable at present:

sgpu05[01-02], shas05[01-28], and smem0501.

The remainder of Summit is unaffected.
Apr 26, 08:02 MDT
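The node names in the notice above use a compact bracket-range notation (for example, shas05[01-28]). As a rough illustration only, the sketch below expands such names into individual hostnames. The function name expand_hostlist is hypothetical, and it handles only a single zero-padded range per name, not every form the scheduler accepts. On a cluster, Slurm's own `scontrol show hostnames` command is the usual way to do this.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a name such as 'shas05[01-28]' into individual hostnames.

    Hypothetical helper: supports one 'prefix[start-end]' range only.
    """
    match = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, start, end = match.groups()
    width = len(start)  # preserve zero padding, e.g. '01' -> width 2
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]


# Node names as listed in the incident notice.
affected = []
for expr in ["sgpu05[01-02]", "shas05[01-28]", "smem0501"]:
    affected.extend(expand_hostlist(expr))

print(len(affected))  # 2 + 28 + 1 = 31 nodes
```

To check whether a job ran on an affected node, compare the job's node name against the expanded list.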
May 2, 2022

No incidents reported.