Resolved -
This incident has been resolved. A brief explanation follows.
After Wednesday's maintenance, some Blanca GPU nodes experienced unexpected issues related to the OS images and to updates to our job scheduler. The problem appeared on a small number of nodes late Thursday, and by Friday morning it was apparent that it was widespread. Today (Friday), RC identified and applied the necessary fixes, and the nodes are now back in production.
We believe the issues on these nodes have been fixed and that no further nodes will be affected. However, we will continue to monitor the situation. We are also preserving the affected groups' access to blanca-curc-gpu through the weekend in case further related issues arise.
We will resume regular maintenance and troubleshooting next week.
Aug 12, 18:40 MDT
Monitoring -
A fix has been implemented and we are monitoring results.
Aug 12, 18:05 MDT
Update -
We continue to work on the Blanca GPU issue. To minimize disruption to research over the weekend, we have temporarily granted access to blanca-curc-gpu to the labs that own the affected Blanca nodes (see the link for the full list):
bgpu-bortz1
bgpu-kann1
bgpu-papp1
bgpu-casa1
bgpu-ivc
In job scripts, users in those groups may specify --account=blanca-curc-gpu, --qos=blanca-curc-gpu, and --partition=blanca-curc-gpu to receive higher-priority access.
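As a rough illustration, a batch script using those three options might look like the sketch below. Only the --account, --qos, and --partition values come from this notice; the job name, GPU request, time limit, and the placeholder workload are assumptions to be replaced with your own settings.

```shell
#!/bin/bash
# Options named in this notice (temporary access for affected groups):
#SBATCH --account=blanca-curc-gpu
#SBATCH --qos=blanca-curc-gpu
#SBATCH --partition=blanca-curc-gpu
# Everything below is an assumption for illustration only:
#SBATCH --job-name=gpu-example   # hypothetical job name
#SBATCH --gres=gpu:1             # assumed request for one GPU
#SBATCH --time=01:00:00          # assumed one-hour time limit

# SLURM_JOB_ID is set by Slurm inside a running job; fall back to a
# marker value so the script also runs outside the scheduler.
job_id="${SLURM_JOB_ID:-not-in-slurm}"
echo "Job ${job_id} started"

# Placeholder workload: replace with your actual commands.
echo "Replace this line with your GPU workload"
```

The script would be submitted with sbatch in the usual way; the same three options can also be passed on the sbatch command line instead of as #SBATCH directives.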
Aug 12, 17:08 MDT
Update -
We are continuing to work on a fix for this issue.
Aug 12, 15:11 MDT
Identified -
Most nodes are now restored. The Blanca GPU image, used by several nodes, has an additional issue and a fix is being developed now.
Aug 12, 13:10 MDT
Update -
We have restored around half the downed nodes to service. Work continues on the others.
Aug 12, 12:11 MDT
Investigating -
We are currently investigating this issue.
Aug 12, 09:03 MDT