Unresponsive login node login12 rebooted

Incident Report for CU Boulder RC

Resolved

One of the RC login nodes, login12, was rebooted today after it became unresponsive to both SSH and console access. Logs from our VM infrastructure indicate that the VM was under high CPU load, which is often the result of a computational workload being run on a login node by mistake.

The RC login service is a group of four redundant login nodes (plus one dedicated tutorial login node); however, SSH does not support failing over an established session from one node to another. As a result, users who were connected to login12 will have seen their sessions fail, but a new connection attempt should reach one of the remaining login nodes, even while login12 is rebooting.
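For users who script their connections, the reconnect behavior described above can be approximated on the client side. The following is a minimal Python sketch, not part of RC's tooling: it assumes the client holds a list of candidate login hostnames (the `example.org` names below are placeholders, not real RC hosts) and picks the first one that accepts a TCP connection on the SSH port. The function names `tcp_probe` and `first_reachable` are hypothetical.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def first_reachable(
    hosts: Iterable[str],
    port: int = 22,
    timeout: float = 5.0,
    probe: Optional[Callable[[str, int, float], bool]] = None,
) -> Optional[str]:
    """Return the first host that passes the probe, or None if none do."""
    check = probe if probe is not None else tcp_probe
    for host in hosts:
        if check(host, port, timeout):
            return host
    return None


if __name__ == "__main__":
    # Placeholder hostnames; substitute the names of the actual login nodes.
    candidates = ["login-a.example.org", "login-b.example.org"]
    print(first_reachable(candidates))
```

This only selects a node for a new connection. Any session that was open on the failed node is still lost, because SSH cannot migrate an established session.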

Posted Dec 11, 2018 - 13:33 MST

This incident affected: Research Computing Core.