In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jul 15, 09:15 MDT
Scheduled - Next week (July 15th through 19th), we plan to disable the periodic restarts of the BeeGFS management daemon that we currently have in place to avoid disconnects between the BeeGFS management service and its clients.

Those disconnects led to the error many users have seen on PetaLibrary/Active: "Communication error on send".

We enabled the daemon restarts in May, and no failures associated with the disconnects have occurred since then. However, the reason for those failures remains unclear both to us and to the BeeGFS developers. We plan to debug the problem next week by collecting a dump of all running threads with GDB. The BeeGFS developers have been waiting on that debug data for some time, and it should let them narrow down the problem and identify the real cause.
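
For reference, the thread dump would be collected with something like the following (a rough sketch; the exact GDB invocation and output file may differ):

# attach GDB to the running management daemon, record a backtrace of every thread, then detach
gdb -p $(pidof beegfs-mgmtd) -batch -ex "thread apply all bt" > beegfs-mgmtd-threads.txt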

We don't know whether, or when, the failures will actually recur once the daemon restarts are disabled; this is an attempt to provoke the problem and collect the debug data BeeGFS support needs.

The daemon restarts will be re-enabled on the 19th (hopefully earlier, once the problem has been reproduced). The restarts will remain in place until the problem is fixed or another debugging step is required (we will communicate in either case). We suspect that the problem is correlated with load on the system, so please don't hesitate to submit jobs that use PL/Active next week. The sooner we can reproduce the problem and collect the debug data, the sooner we can keep it from recurring until it is fixed.
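
As an illustration, a job that exercises PL/Active might look something like this (the allocation path, resource requests, and workload are placeholders; substitute your own):

#!/bin/bash
#SBATCH --job-name=pl-active-io
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
# <your_allocation> is a placeholder for your own PetaLibrary/Active directory
cd /pl/active/<your_allocation>
# any normal read/write workload against PL/Active helps generate load on BeeGFS
dd if=/dev/zero of=load-test.tmp bs=1M count=1024
rm -f load-test.tmp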

Please note that this affects the PL/Active spaces in BeeGFS, not the interim PL/Active spaces that are still hosted on Summit.

(Moved from Incidents to Scheduled Maintenance)
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Under Maintenance
EnginFrame - Operational
JupyterHub - Operational
Scheduled Maintenance
Jupyterhub maintenance Jul 18, 08:00-08:15 MDT
We plan to reboot the jupyterhub instance to apply security updates Thursday morning at 8am and expect to have jupyterhub back up within 15 minutes of the reboot. Jupyterhub notebooks will become unavailable at that time.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Posted on Jul 16, 11:10 MDT
Past Incidents
Jul 17, 2019

No incidents reported today.

Jul 16, 2019

No incidents reported.

Jul 14, 2019

No incidents reported.

Jul 13, 2019

No incidents reported.

Jul 12, 2019

No incidents reported.

Jul 11, 2019

No incidents reported.

Jul 10, 2019
Completed - We have completed today's special planned maintenance period for the PetaLibrary (BeeGFS). All tasks were completed successfully, and reservations have been lifted on Blanca.

Thank you again.
Jul 10, 15:48 MDT
Verifying - We have completed all of our PetaLibrary/active (BeeGFS) changes and resiliency tests. Notably, all of our changes were completed without any problems, and all of our resiliency / fail-over tests performed precisely as expected. We even managed to sort out the problem with our prototype ZFS-native file system, which had caused us trouble in the past, including one of our significant outages.

Also notably, all of our activities were completed today without any IO failures. We did have one fail-over test that failed on its first attempt, causing a 5-minute pause to IO (2 minutes more than the expected 3-minute pause during the fail-over) but even this did not, according to our monitoring, cause any actual IO errors.

Our last activity for today is to re-run metadata benchmarks to see if they have been affected by the performance tuning we performed today. Benchmarks are more meaningful in a quiesced environment, so we still have Blanca partitions reserved for now; but if you need to resume work immediately, please let us know and we'll happily release your reservations. Otherwise, we should be done soon.

Thank you everyone for your patience while we work out the eccentricities of this system. With each maintenance session we have improved our understanding of the environment and fixed configuration problems that were causing us trouble; and this has been our most successful maintenance period for BeeGFS yet.
Jul 10, 14:13 MDT
Update - We are commencing maintenance activities for PetaLibrary/active (BeeGFS) today, including some performance tuning and further resiliency (e.g., fail-over) testing. We hope to minimize or prevent any actual BeeGFS outages (some momentary interruption/pauses are to be expected); but there is a potential for outages to /pl/active/ allocations.

We will do our best to communicate our current status and events here throughout the activities.
Jul 10, 08:58 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jul 10, 07:01 MDT
Scheduled - Research Computing will perform special PetaLibrary planned maintenance Wednesday, 10 July 2019. Activities include

- Performance optimizations for BeeGFS
- Failover testing for beegfs-storage
- ZFS mount testing and root cause investigation
- beegfs-mgmtd update

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. We will endeavor to perform these tasks while minimizing or preventing any PetaLibrary outages; but outages are possible. During the maintenance period we will have reservations in place on Blanca, but Blanca contributors may request that their reservation be released if they prefer.
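
If you would like to check whether a maintenance reservation currently covers your Blanca nodes, the standard Slurm query should show it (a sketch; reservation names shown will vary):

# list all active Slurm reservations, including any maintenance reservations on Blanca
scontrol show reservation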

If you have any questions or concerns, please contact rc-help@colorado.edu.
Jun 25, 13:51 MDT
Resolved - We believe that issues with MPI on Summit have been resolved. If you are continuing to have trouble, please contact rc-help@colorado.edu.
Jul 10, 08:56 MDT
Monitoring - All production nodes have been configured to update to the latest compute image. So far, 74 nodes have rebooted into the new image, with 423 remaining. (Nodes automatically reboot when they have drained.)

Any new jobs that start at this point should be dispatched onto nodes with the updated image. As such, we expect MPI to be working on Summit now, though its effective capacity is reduced while we wait for the remaining nodes to drain and reboot.

If you have had trouble with MPI on Summit since our last maintenance period (3 July), please try again. We also recommend unsetting I_MPI_FABRICS if you set it to work around this problem.
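
If your job script or login environment still carries the workaround, removing it is a one-liner (sketch):

# remove the temporary TCP-fabric workaround before resubmitting MPI jobs
unset I_MPI_FABRICS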
Jul 7, 17:48 MDT
Identified - We believe we have identified the root cause of the problem, a missing package. We have successfully restored correct behavior on two test compute nodes, as seen via osu_alltoall (a synthetic micro benchmark) and a representative WRF job. We're trying to get confirmation from some other test cases; but we're going to go ahead and start deploying this change (restoring a missing package) to production.

We'll advise here once the new compute image is available to run jobs.
Jul 7, 15:24 MDT
Update - We appear to have succeeded in constructing an OPA environment that _does_ work for at least one of our tests. (Others will need to be tested as well.) This is a promising step towards determining the root cause of the MPI issue, as we can now compare a working environment against a faulty environment to audit the differences.

Further updates as we have them.
Jul 7, 14:12 MDT
Update - We are still investigating the cause of the issues with MPI on Summit. We have attempted to revert both the OPA and Slurm upgrades on sample hosts, with no effect. We have also replicated the issue with more recent MPI versions.

We are continuing to investigate and engage with Intel support.
Jul 6, 01:02 MDT
Update - We have now been able to replicate the issue with both WRF and nwchem, and with both Intel MPI and OpenMPI. We are pursuing a theory that the upgraded OPA software (now at version 10.9) has broken compatibility with our (admittedly quite old) MPI installations. We are replicating our tests using newer MPI implementations to see if that resolves the issue.

Unfortunately, we have not yet succeeded in working around the issue with OpenMPI. What we expect to work is `mpirun -mca btl tcp,self`, but it has not worked in our tests.
Jul 5, 14:34 MDT
Update - If you are using Intel MPI (e.g., module load impi) then a workaround is to set

export I_MPI_FABRICS=tcp

This will likely give lower performance, but should allow you to run while we continue to investigate.

https://software.intel.com/en-us/mpi-developer-guide-linux-selecting-fabrics
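
In a batch job the workaround would look roughly like the following (the resource requests and application name are illustrative; adapt them to your own job):

#!/bin/bash
#SBATCH --ntasks=24
#SBATCH --time=04:00:00
module load impi
# force Intel MPI onto the TCP fabric as a temporary workaround (slower, but avoids the fabric issue)
export I_MPI_FABRICS=tcp
mpirun -np $SLURM_NTASKS ./my_mpi_app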
Jul 5, 13:29 MDT
Investigating - We have successfully replicated the reported MPI problems (using WRF as a test case), but we do not yet have a root cause or explanation. We are continuing to investigate, and have opened a support case with Intel (who support the fabric that MPI uses).
Jul 5, 13:13 MDT
Monitoring - A problem with the sknl health check has been resolved with a work-around, and a plan is in place to fix it permanently in the future.

We have been unable to replicate problems with MPI, and we are interested to hear from users who are still having MPI trouble, particularly for jobs *submitted* today. Please let us know at rc-help@colorado.edu. (If you already have a relevant case open, feel free to update that case rather than open a new case.)
Jul 5, 10:52 MDT
Investigating - We have received reports of post-maintenance problems on Summit with MPI jobs and with sknl in general. We are investigating these reports and will provide more information here when we have it.
Jul 5, 08:44 MDT
Jul 9, 2019

No incidents reported.

Jul 8, 2019

No incidents reported.

Jul 4, 2019

No incidents reported.

Jul 3, 2019
Completed - Today's planned maintenance activities have concluded, and Summit is once again in production.

Today, we accomplished

- Summit GPFS update
- Summit OPA update
- Summit kernel update
- Transitioned remaining Summit compute to stateless provisioning
- Slurmdbd and slurmctld (major/feature) update
- Summit Slurmd update
- blanca-nso slurmd update
- Summit performance validation
Jul 3, 18:06 MDT
Update - The issue with Blanca scheduling was traced to changed behavior surrounding the topology configuration. We have adjusted the configuration and Blanca appears to again be fully operational.
Jul 3, 16:28 MDT
Update - Much of the upgrade work occurring today has been successful. We had some trouble with the GRIDScaler upgrade (part of the Summit storage system) but we've engaged with upstream support and it looks like we may be making progress again.

An upgrade to slurmctld has caused an apparent compatibility issue with not-yet-upgraded slurmd running on Blanca compute nodes. We are investigating the cause of the issue, and have also opened a support case with Slurm support. This issue produces error messages of the form "Unable to allocate resources: Requested node configuration is not available."
Jul 3, 15:34 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jul 3, 07:00 MDT
Scheduled - Research Computing will perform regularly-scheduled planned maintenance Wednesday, 3 July 2019. July's activities include

- Summit scratch firmware updates
- Summit scratch filesystem (GPFS) updates
- Summit scratch disk repositioning
- Summit interconnect (OPA) software and firmware updates
- Summit kernel updates
- Slurm database server (slurmdbd) update (in preparation for later Summit, Blanca, and core Slurm updates)

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. During the maintenance period no jobs will run on Summit resources, and Summit scratch will likely be unavailable during firmware and filesystem updates.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Jun 25, 13:46 MDT