Post-maintenance issues with Summit sknl, MPI
Incident Report for CU Boulder RC
Resolved
We believe that issues with MPI on Summit have been resolved. If you are continuing to have trouble, please contact rc-help@colorado.edu.
Posted Jul 10, 2019 - 08:56 MDT
Monitoring
All production nodes have been configured to update to the latest compute image. So far, 74 nodes have rebooted into the new image, with 423 remaining. (Nodes automatically reboot when they have drained.)

Any new jobs that start at this point should be dispatched onto nodes with the updated image. As such, we expect MPI to be working on Summit now, though its effective capacity is reduced while we wait for the remaining nodes to drain and reboot.

If you have had trouble with MPI on Summit since our last maintenance period (3 July), please try again. We also recommend unsetting I_MPI_FABRICS if you set it to work around this problem.
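For example, if you added this workaround to your current environment, remove it before resubmitting (and delete the corresponding export line from any job scripts where you added it):

# clear the temporary fabric override so Intel MPI returns to its default fabric selection
unset I_MPI_FABRICS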
Posted Jul 07, 2019 - 17:48 MDT
Identified
We believe we have identified the root cause of the problem: a missing package. We have successfully restored correct behavior on two test compute nodes, as verified with osu_alltoall (a synthetic micro-benchmark) and a representative WRF job. We are still waiting on confirmation from some other test cases, but we are going to go ahead and start deploying this change (restoring the missing package) to production.
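If you would like to run a similar check against your own environment once the updated image is out, an invocation along the following lines exercises collective MPI communication (a sketch only; ./osu_alltoall is a placeholder for wherever a build of the OSU Micro-Benchmarks lives in your environment):

# run inside a job allocation with your MPI module loaded;
# osu_alltoall runs an all-to-all latency test across the launched ranks
mpirun -np 4 ./osu_alltoall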

We'll advise here once the new compute image is available to run jobs.
Posted Jul 07, 2019 - 15:24 MDT
Update
We appear to have succeeded in constructing an OPA environment that _does_ work for at least one of our tests. (Others will need to be tested as well.) This is a promising step towards determining the root cause of the MPI issue, as we can now compare a working environment against a faulty one to audit the differences.
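As a rough sketch of the kind of audit involved (assuming RPM-based node images and ssh access; good-node and bad-node are placeholder hostnames, not real node names):

# compare the package sets installed on a working node and a faulty node
diff <(ssh good-node 'rpm -qa | sort') <(ssh bad-node 'rpm -qa | sort')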

Further updates as we have them.
Posted Jul 07, 2019 - 14:12 MDT
Update
We are still investigating the cause of the issues with MPI on Summit. We have attempted to revert both the OPA and Slurm upgrades on sample hosts, with no effect. We have also replicated the issue with more recent MPI versions.

We are continuing to investigate and engage with Intel support.
Posted Jul 06, 2019 - 01:02 MDT
Update
We have now been able to replicate the issue with both WRF and NWChem, and with both Intel MPI and OpenMPI. We are pursuing a theory that the upgraded OPA software (now at version 10.9) has broken compatibility with our (admittedly quite old) MPI installations. We are repeating our tests with newer MPI implementations to see whether that resolves the issue.

Unfortunately, we have not yet succeeded in working around the issue with OpenMPI. What we expect to work is `mpirun -mca btl tcp,self`, but it has not worked in our tests.
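For concreteness, the full form of that attempted workaround would be something like the following, with ./my_mpi_app standing in for your own executable:

# restrict OpenMPI to the TCP and self byte-transfer layers instead of the OPA fabric;
# depending on how OpenMPI was built, forcing the ob1 PML (e.g., adding -mca pml ob1)
# may also be needed for the btl selection to take effect
mpirun -mca btl tcp,self -np 4 ./my_mpi_app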
Posted Jul 05, 2019 - 14:34 MDT
Update
If you are using Intel MPI (e.g., `module load impi`), then a workaround is to set

export I_MPI_FABRICS=tcp

This will likely give lower performance, but should allow you to run while we continue to investigate.
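In a batch job, the workaround would look something like this sketch, with ./my_mpi_app standing in for your own executable:

module load impi
export I_MPI_FABRICS=tcp

# run over TCP instead of the OPA fabric: slower, but unaffected by the current issue
mpirun -np $SLURM_NTASKS ./my_mpi_app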

https://software.intel.com/en-us/mpi-developer-guide-linux-selecting-fabrics
Posted Jul 05, 2019 - 13:29 MDT
Investigating
We have successfully replicated the reported MPI problems (using WRF as a test case), but we do not yet have a root cause or explanation. We are continuing to investigate and have opened a support case with Intel (which supports the fabric that MPI uses).
Posted Jul 05, 2019 - 13:13 MDT
Monitoring
A problem with the sknl health check has been resolved with a workaround, and a plan is in place to fix it permanently.

We have been unable to replicate problems with MPI, and we are interested to hear from users who are still having MPI trouble, particularly for jobs *submitted* today. Please let us know at rc-help@colorado.edu. (If you already have a relevant case open, feel free to update that case rather than open a new case.)
Posted Jul 05, 2019 - 10:52 MDT
Investigating
We have received reports of post-maintenance problems on Summit with MPI jobs and with sknl in general. We are investigating these reports and will provide more information here when we have it.
Posted Jul 05, 2019 - 08:44 MDT
This incident affected: RMACC Summit.