Out-of-memory events in Summit scratch
Incident Report for CU Boulder RC
Resolved
We have completed the configuration change on Summit scratch servers and the gateway, and memory use has returned to normal. We have reported this to Intel, since an OPA configuration change appears to have caused the issue, and we expect to hear back on the actual root cause next week.

We have not yet applied this change to all compute nodes; however, the issue appears to have had the largest effect on our file servers, and does not appear to have had the same impact on pure filesystem clients. We may delay further changes on compute nodes until we have been able to get more information from Intel.
Posted 4 months ago. Aug 10, 2018 - 17:24 MDT
Update
After testing on a representative node, we believe that we have identified the source of the new memory utilization pattern that has been affecting Summit scratch. We are making an additional configuration change to the Summit scratch servers, but this should not impact production access to the filesystem. (Two of four servers have already been updated.)

Meanwhile, sgate1, which provides access to Summit scratch from beyond Summit, has crashed, likely due to an out-of-memory event from this same cause. We will be applying the fix there as well, and returning the node to service as soon as possible.
Posted 4 months ago. Aug 10, 2018 - 16:25 MDT
Monitoring
We have completed the revert of the recent Summit scratch configuration changes, and returned Summit scratch and Summit compute to service. We will continue to monitor memory use and see if the system has returned to its previous behavior.
Posted 4 months ago. Aug 09, 2018 - 17:27 MDT
Update
Unfortunately, an attempt to revert some of the recent configuration changes to Summit scratch has caused the entire subsystem to go offline. We are working to bring scratch back up as soon as possible.
Posted 4 months ago. Aug 09, 2018 - 16:17 MDT
Investigating
Following a recent (supplier-recommended) configuration change, we have started experiencing out-of-memory events in the Summit scratch subsystem. This has led to intermittent loss of access to Summit scratch from the login nodes, dtn, and other non-Summit systems. Access to Summit scratch from Summit itself, including from scompile, has been unaffected so far.

There may be further outages of Summit scratch access from outside of Summit while we continue to investigate, particularly if such an out-of-memory event occurs overnight.
Posted 4 months ago. Aug 09, 2018 - 16:14 MDT
This incident affected: RMACC Summit.