All Systems Operational
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
Dec 16, 2018

No incidents reported today.

Dec 15, 2018

No incidents reported.

Dec 14, 2018

No incidents reported.

Dec 13, 2018

No incidents reported.

Dec 12, 2018

No incidents reported.

Dec 11, 2018
Resolved - One of the RC login nodes, login12, was rebooted today after it became unresponsive to SSH and console access. Logs from our VM infrastructure indicate that the VM was under high CPU load, which is often the result of a computational workload being mistakenly run on a login node.

The RC login service is a group of four redundant login nodes (and one dedicated tutorial login node); however, SSH does not support failing over a session from one node to another. As a result, users who were connected to login12 will have seen their sessions fail, but a reconnection attempt should reach one of the remaining login nodes, even while login12 is rebooting.
Dec 11, 13:33 MST
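The reconnection behavior described in the login12 notice above can be illustrated from the client side. The following is only a minimal sketch, assuming a user who keeps their own list of login hosts; the host names are hypothetical placeholders, not actual RC node names, and the RC login service may well distribute connections differently (for example behind a single service name).

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def first_reachable(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = tcp_reachable,
) -> Optional[str]:
    """Return the first host for which probe() succeeds, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    candidates = ["login-a.example.org", "login-b.example.org"]
    print(first_reachable(candidates))
```

A new SSH session would then be opened against whichever host is returned; the failed session itself cannot be resumed, since SSH does not migrate sessions between nodes.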
Resolved - The HPCF has remained stable with the UPS in its default operating mode. The manufacturer is continuing to investigate the root cause of this issue, and has suggested the replacement of a component of the control system as part of this effort.
Dec 11, 13:29 MST
Monitoring - Summit has been returned to service. We will continue to monitor the status of the system, and expect to receive a root cause analysis regarding our observations of the UPS. In the meantime, we expect that our use of the default UPS mode will reduce the likelihood that the problem will recur.
Dec 2, 00:30 MST
Update - We have discovered an apparent anomaly in the operation of the UPS “ECOnversion” mode, a mode that allows the UPS to operate with greater power efficiency than in its default mode. This anomaly does not appear to affect the default mode, so we are gathering diagnostic reports from the UPS in each mode for further analysis, and plan to bring Summit back into production with the UPS in the default mode.
Dec 1, 22:28 MST
Investigating - The technician has arrived and is inspecting the UPS.
Dec 1, 20:55 MST
Identified - The UPS that supports the HPCF has experienced a fault which is preventing it from being able to supply power to the environment. A service technician has been dispatched and is en route.
Dec 1, 19:55 MST
Update - We have confirmed onsite that the HPCF has experienced a major power outage. All HPCF systems, including Summit and Summit storage (both scratch and interim PetaLibrary allocations), are offline.

We are investigating the cause and are working to restore service as soon as possible.
Dec 1, 17:18 MST
Investigating - We are aware of an HPCF outage affecting Summit and Summit scratch. We are investigating and will update here as more information becomes available.
Dec 1, 15:58 MST
Dec 10, 2018
Completed - After encountering issues applying the patch to lock down Blanca nodes, we have decided to revert the changes and will instead apply them at a later date.
Dec 10, 08:58 MST
Scheduled - During the maintenance period we will enact a change to the policy governing how users access Blanca nodes. After the maintenance period, only users who have a job running on a node will be able to SSH into that node. This change is intended to prevent users without a job on a node from accessing it, and to bring the Blanca Slurm policies in line with those on Summit.
Dec 4, 10:17 MST
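The access rule described in the scheduled maintenance notice above amounts to a simple check: a user may SSH into a node only if they have a job running there. The sketch below illustrates that rule only; it is not the patch that was applied to Blanca, and the function name, data structure, user names, and node names are all hypothetical.

```python
from typing import Dict, Set


def may_ssh(user: str, node: str, running_jobs: Dict[str, Set[str]]) -> bool:
    """Return True if `user` has at least one running job on `node`.

    `running_jobs` maps a node name to the set of users who currently
    have a job running on that node (an assumed, illustrative structure).
    """
    return user in running_jobs.get(node, set())


if __name__ == "__main__":
    # Hypothetical snapshot of running jobs, for illustration only.
    jobs = {"node-a": {"alice"}, "node-b": {"alice", "bob"}}
    print(may_ssh("bob", "node-a", jobs))    # bob has no job on node-a
    print(may_ssh("alice", "node-a", jobs))  # alice has a job on node-a
```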
Dec 9, 2018

No incidents reported.

Dec 8, 2018

No incidents reported.

Dec 7, 2018

No incidents reported.

Dec 6, 2018

No incidents reported.

Dec 5, 2018

No incidents reported.

Dec 4, 2018

No incidents reported.

Dec 3, 2018

No incidents reported.