Summit scratch expel event
Incident Report for CU Boulder RC
Resolved
Be advised: Summit experienced a significant GPFS expel event today between 17:09:33 and 17:11:19. During this period, 484 compute clients were expelled from the GPFS cluster, momentarily disrupting access to Summit storage, including Summit scratch.

Nodes reconnected to Summit scratch automatically, but compute jobs that were actively using Summit scratch at the time may have failed.

This issue has been under ongoing investigation with the manufacturers of Summit's storage and network subsystems. The timing of today's expel event appears to coincide with the execution of a (supposedly non-disruptive) diagnostic test for this specific issue. As such, I will be recommending that we no longer run this test in production until we hear back from Intel as to the root cause of this event.
Posted Aug 26, 2019 - 17:46 MDT
This incident affected: RMACC Summit.