Problematic user workload leading to "Kill task failed" on Blanca
Incident Report for CU Boulder RC
Resolved
Most Blanca nodes have been returned to service. The remaining nodes will be returned to service as soon as they can be rebooted.
Posted May 28, 2019 - 10:27 MDT
Monitoring
A specific workload running preemptably on Blanca has been identified as leaving large numbers of Blanca compute nodes in the "Kill task failed" state. These nodes are automatically drained and rebooted, but new jobs cannot be scheduled on them while they drain for reboot.

We have held all future work in this workload until we are able to correct the behavior.
Posted May 26, 2019 - 22:45 MDT
This incident affected: Blanca.