Failure in RC Core Virtual Infrastructure
Incident Report for CU Boulder RC
Resolved
Hardware replacements in the RC core virtual infrastructure appear to have successfully addressed this issue. No further disruption is anticipated.
Posted Jun 10, 2021 - 09:22 MDT
Monitoring
Last night RC experienced a failure in its "Core Virtual Infrastructure", which hosts, among many other things, the login nodes and Slurm services. This is the second such failure recently, though the first passed without notable disruption. This time, neither the login nodes nor the Blanca Slurm service returned to service automatically, apparently because the network was involved in the disruption.

We have notified our upstream OIT support team, which administers the Core Virtual Infrastructure, of this failure and are awaiting their feedback. Meanwhile, we have restored the login nodes and the Blanca Slurm service.

We will continue to monitor the situation and follow up with upstream support staff on Monday.
Posted Apr 24, 2021 - 10:57 MDT
This incident affected: Research Computing Core and Blanca.