In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Aug 19, 09:15 MDT
Scheduled - Next week (August 19th to 24th), we plan to disable the restarts of the BeeGFS management daemon that we currently have in place to avoid disconnects between BeeGFS management and clients.

Those disconnects caused the error many users have seen on PetaLibrary/Active: "Communication error on send". We enabled the daemon restarts in May, and no failures associated with the disconnects have occurred since. However, neither we nor the BeeGFS developers yet understand the cause of those failures, even after a similar earlier debug week.

We plan to debug the problem next week with a different approach. The BeeGFS support team will attach GDB to the management daemon so that they can observe the problem while it is happening. We don't know if or when the failure will occur once the daemon restarts are disabled, so this is an attempt to provoke the problem and debug it.

The daemon restarts will be enabled again on the 24th at 8pm (hopefully earlier, once the problem is reproduced). The restarts will stay in place until the problem is fixed or another debug step is required (we will communicate either scenario).

We suspect that the problem is correlated with the load on the system, so please don't refrain from using PL/Active next week. The sooner we can reproduce the problem, the sooner we can debug it and keep it from recurring. Please note that this affects the PL/Active spaces in BeeGFS, not the interim PL/Active spaces that are still hosted on Summit.
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Under Maintenance
EnginFrame - Operational
JupyterHub - Operational
Scheduled Maintenance
Summit Emergency Upgrade Aug 21, 08:30-10:30 MDT
We are scheduling an emergency upgrade for Summit in response to a bulletin from our vendor asking us to upgrade our SFAOS to prevent a dual-controller crash. The upgrade should take about 2 hours and will begin on Aug 21st at 8:30am.

Because the upgrade requires downtime, it will affect Summit scratch, datasets, and all PetaLibrary allocations configured on the interim space.
Posted on Aug 12, 10:44 MDT
Past Incidents
Aug 20, 2019

No incidents reported today.

Aug 18, 2019

No incidents reported.

Aug 17, 2019

No incidents reported.

Aug 16, 2019

No incidents reported.

Aug 15, 2019
Completed - The scheduled maintenance has been completed.
Aug 15, 15:31 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Aug 15, 15:30 MDT
Scheduled - We are deploying a new version of RCAMP and the service will be unavailable for about 15 minutes.
Aug 15, 15:27 MDT
Aug 14, 2019

No incidents reported.

Aug 13, 2019

No incidents reported.

Aug 12, 2019

No incidents reported.

Aug 11, 2019

No incidents reported.

Aug 10, 2019

No incidents reported.

Aug 9, 2019

No incidents reported.

Aug 8, 2019
Completed - The scheduled maintenance has been completed.
Aug 8, 11:26 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Aug 8, 11:00 MDT
Scheduled - We are deploying a new version of RCAMP today and the service will be unavailable for a few hours.
Aug 8, 09:10 MDT
Aug 7, 2019
Completed - We have brought PetaLibrary/archive back into service successfully.

This concludes today's planned maintenance activities.
Aug 7, 13:31 MDT
Update - Work on Summit has concluded successfully. We were not able to perform the Summit storage update because we did not receive the update procedure from the supplier in time.

Work is continuing on PetaLibrary/archive.
Aug 7, 12:01 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Aug 7, 07:00 MDT
Update - We will also restart the TSM server tomorrow (Aug 7th) to clear many processes that did not complete as expected after we started the script that reconciles the GPFS filesystem with the external storage pool.

That will affect PetaLibrary Archive allocations under "/archive".
It won't affect the ones under "/pl/active".

The restart should be done by 9am tomorrow. We will post an update once it is completed.
Aug 6, 12:06 MDT
Scheduled - Research Computing will perform regularly scheduled planned maintenance on Wednesday, 7 August 2019. August's activities include:

- Reboot a Summit interconnect switch
- Reboot Summit compute nodes to test interconnect resiliency
- [Potentially] Update Summit storage infrastructure

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. We expect only a brief outage for the Summit interconnect reboot and testing, but we have been advised of a critical update for the Summit storage infrastructure. If we have the update procedure and time to prepare before Wednesday, we will likely update the SFA at this time as well, but we do not yet have an estimated duration for that operation. Otherwise we may need to take an off-cycle outage to perform this update.

During the maintenance period, no jobs will run on Summit resources, and access to Summit storage may be interrupted. If we update Summit storage, access will likely be disrupted for the duration of the update procedure.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Aug 2, 12:40 MDT
Aug 6, 2019

No incidents reported.