All Systems Operational
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Past Incidents
Oct 16, 2019

No incidents reported today.

Oct 15, 2019

No incidents reported.

Oct 14, 2019

No incidents reported.

Oct 13, 2019

No incidents reported.

Oct 12, 2019

No incidents reported.

Oct 11, 2019

No incidents reported.

Oct 10, 2019

No incidents reported.

Oct 9, 2019

No incidents reported.

Oct 8, 2019

No incidents reported.

Oct 7, 2019

No incidents reported.

Oct 6, 2019

No incidents reported.

Oct 5, 2019

No incidents reported.

Oct 4, 2019

No incidents reported.

Oct 3, 2019
Completed - A redundant link connecting PetaLibrary BeeGFS (active) has been reactivated.

A redundant link between HPCF and the Science Network was reporting errors, so fail-over was not tested. The link errors will be addressed.

Changes to the Science Network core have been deferred to another time.
Oct 3, 09:49 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 2, 09:35 MDT
Scheduled - The Network Engineering and Operations team (NEO) intends to make the following changes to the Science Network today:

- Re-enable a redundant link connecting PetaLibrary BeeGFS (active) that is currently in an "error" state

- Test fail-over of the redundant links between HPCF and the Science Network core

- Enable a redundant link in the Science Network core

All activity should be transparent, but carries a minor risk of momentary interruption.
Oct 2, 09:34 MDT
Oct 2, 2019
Completed - Today's UPS planned maintenance has concluded successfully with no outage for Summit or Blanca.
Oct 2, 13:42 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 2, 09:00 MDT
Scheduled - Next week, on our regular planned maintenance day, the CU Boulder datacenter team will perform maintenance on the UPS that supports Summit and Blanca HPC in the HPCF. No outages are expected, and jobs will continue to be scheduled and run as normal.

While the UPS is under maintenance, both Summit and Blanca HPC will be running on bypass power. Bypass power carries an increased risk of disruption from instabilities in our utility power supply.
Sep 27, 10:34 MDT
Resolved - This incident has been resolved.
Oct 2, 09:58 MDT
Monitoring - All storage targets are now online.
While creating storage targets for new customers today, the restart of the storage daemon on boss1 failed, leaving the daemon in an endless loop of stopping and starting. Restarting this daemon has been done many times before without causing an outage. This time, however, it appears to have disrupted filesystem operation because some targets were full or nearly full. Quota was slightly increased for those targets, and once new targets were again allowed to join the system, both the old and the new targets that had been offline came back online. We will follow up with the PIs of the 2 storage pools whose quota was increased.

New monitoring will be added to PetaLibrary Active (PL/Active) to alert us when targets are nearly full, to help avoid this problem in the future.
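As a rough illustration of the kind of check such monitoring might perform, the sketch below flags targets whose usage is at or above a threshold. It is not the monitoring that will be deployed: the TargetUsage type, the 90% threshold, and the way usage figures are obtained are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetUsage:
    """Usage of one storage target (hypothetical structure for illustration)."""
    target_id: str
    used_bytes: int
    total_bytes: int


def nearly_full_targets(targets, threshold=0.90):
    """Return (target_id, fraction_used) pairs at or above threshold, fullest first.

    The 0.90 default is an assumed value, not a documented setting.
    """
    flagged = []
    for t in targets:
        if t.total_bytes <= 0:
            continue  # skip targets that report no capacity
        fraction = t.used_bytes / t.total_bytes
        if fraction >= threshold:
            flagged.append((t.target_id, fraction))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A scheduled job could call nearly_full_targets on current usage figures and send a notification whenever the returned list is not empty.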
Oct 1, 19:52 MDT
Investigating - We have verified that the targets attached to the boss1 storage server went offline. It is not yet clear why they went offline or why they remain offline, and we are investigating. This affects PetaLibrary Active spaces hosted on BeeGFS. We will post more information soon.
Oct 1, 18:47 MDT