Blanca partial outage
Incident Report for CU Boulder RC
Resolved
This incident has been resolved. A brief explanation follows.

After Wednesday's maintenance, we experienced unexpected issues on some Blanca GPU nodes related to the OS images and to updates to our job scheduler. The problem first appeared on a small number of nodes late Thursday, and by Friday morning it was apparent that it was widespread. Today (Friday) RC identified and applied the necessary fixes. The nodes are now back in production.

We believe the issues on these nodes have been fixed and that no further nodes will be affected. However, we will continue monitoring the situation. We are also preserving the affected groups' access to blanca-curc-gpu through the weekend in case of further related issues.

We will resume regular maintenance and troubleshooting next week.
Posted Aug 12, 2022 - 18:40 MDT
Monitoring
A fix has been implemented and we are monitoring results.
Posted Aug 12, 2022 - 18:05 MDT
Update
We continue to work on the Blanca GPU issue. To minimize weekend research disruption, we have temporarily granted access to blanca-curc-gpu to the labs owning the affected Blanca nodes (see the link for the full list):

bgpu-bortz1
bgpu-kann1
bgpu-papp1
bgpu-casa1
bgpu-ivc

In job scripts, users in those groups may specify --account=blanca-curc-gpu, --qos=blanca-curc-gpu, and --partition=blanca-curc-gpu to receive higher-priority access.
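As a minimal sketch of how those three options might appear in a Slurm batch script: only the --account, --qos, and --partition values come from this notice. The GPU request, time limit, job name, and the echo command are illustrative assumptions and should be replaced with whatever your job actually needs.

```shell
#!/bin/bash
# Options named in this notice: temporary access to blanca-curc-gpu.
#SBATCH --account=blanca-curc-gpu
#SBATCH --qos=blanca-curc-gpu
#SBATCH --partition=blanca-curc-gpu
# Assumed values for illustration only; adjust for your own job.
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --job-name=gpu-test

# Placeholder workload: report where the job landed.
# SLURM_JOB_PARTITION is set by Slurm inside a job; it falls back to
# "unknown" if the script is run outside the scheduler.
msg="Running on $(uname -n) in partition ${SLURM_JOB_PARTITION:-unknown}"
echo "$msg"
```

The script would be submitted with sbatch as usual. The same three options can instead be passed on the sbatch command line if you prefer not to edit an existing script.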
Posted Aug 12, 2022 - 17:08 MDT
Update
We are continuing to work on a fix for this issue.
Posted Aug 12, 2022 - 15:11 MDT
Identified
Most nodes are now restored. The Blanca GPU image, used by several nodes, has an additional issue and a fix is being developed now.
Posted Aug 12, 2022 - 13:10 MDT
Update
We have restored around half the downed nodes to service. Work continues on the others.
Posted Aug 12, 2022 - 12:11 MDT
Investigating
We are currently investigating this issue.
Posted Aug 12, 2022 - 09:03 MDT
This incident affected: Blanca.