All Systems Operational
Research Computing Core   Operational
Science Network   Operational
RMACC Summit   Operational
Blanca   Operational
PetaLibrary   Operational
EnginFrame   Operational
JupyterHub   Operational
Past Incidents
Nov 14, 2018

No incidents reported today.

Nov 13, 2018

No incidents reported.

Nov 12, 2018

No incidents reported.

Nov 11, 2018

No incidents reported.

Nov 10, 2018

No incidents reported.

Nov 9, 2018

No incidents reported.

Nov 8, 2018
Completed - The network configuration change on Blanca 05 has completed successfully, and all nodes have been returned to service.
Nov 8, 11:17 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Nov 8, 09:15 MST
Scheduled - The Blanca 05 chassis, which contains all nodes with hostname `bnode05*`, is an experiment in providing low-latency MPI via RoCE. However, the switch in this chassis was installed into the wrong slot, which is preventing the RoCE-capable interfaces from operating.

We have reserved the nodes in this chassis via Slurm and will reconfigure their networking with the aim of enabling RoCE.

Running jobs on Blanca, as well as general Blanca operation, will be unaffected.

https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
Nov 8, 09:15 MST
Nov 7, 2018
Completed - The scheduled maintenance has completed on Summit and we have released the compute nodes back to production.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Nov 7, 17:36 MST
Update - Scheduled maintenance is still in progress. We will provide updates as necessary.
Nov 7, 12:01 MST
Update - We are releasing Blanca partitions as maintenance work has concluded for Blanca nodes.
Nov 7, 12:00 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Nov 7, 07:00 MST
Scheduled - Research Computing will perform regularly scheduled planned maintenance on Wednesday, 7 November 2018. November's activities include:

- Install new breakers for mechanical components in the HPCF
- Upgrade Omni-Path software and firmware on Summit
- Rebuild and upgrade the Blanca Slurm controllers
- Upgrade Slurm on Blanca compute nodes
- Validate performance of Summit compute nodes

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. During the maintenance period no jobs will run on Summit or Blanca resources.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Oct 31, 12:51 MDT
Nov 6, 2018

No incidents reported.

Nov 5, 2018

No incidents reported.

Nov 4, 2018

No incidents reported.

Nov 3, 2018

No incidents reported.

Nov 2, 2018

No incidents reported.

Nov 1, 2018
Resolved - No further problems have been observed on Summit. The system has been monitored, and tests were run earlier today.
If you do encounter any problems, please write to rc-help@colorado.edu.
Nov 1, 14:44 MDT
Monitoring - We believe the problem is fixed, but we are still monitoring. A compute node in a bad state may have created a deadlock in the filesystem. After we removed this node, the deadlock cleared.

We are still working with our vendor to confirm the root cause. However, tests indicate the problem is solved: since the node was removed, we can access both scratch and the new PL spaces.

If you still find any problems, please contact us via rc-help@colorado.edu.
Oct 31, 16:37 MDT
Investigating - We have received reports and confirmed a problem affecting the Summit filesystem (scratch and the new PL spaces). You may observe slow responses to "cd" or "ls" in either scratch or the new PL spaces, and these commands sometimes hang. We are opening a ticket with our vendor and hope to have an update soon.
Oct 31, 14:31 MDT