All Systems Operational
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Past Incidents
Dec 6, 2019

No incidents reported today.

Dec 5, 2019
Completed - The storage servers have been tuned, so this maintenance is complete.
The server statistics show an increase in the number of worker threads serving users in parallel.
Please report your experience via rc-help (a ticket you have already opened with us).

We will continue to monitor the system.
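
For anyone curious how we track this, a minimal sketch of how per-server request statistics can be watched from a node with the Beegfs utilities installed (flags may vary by Beegfs version, and ordinary users may not have permission to run these on our systems):

```
# Aggregate request statistics for the storage servers, refreshed every 5 seconds
beegfs-ctl --serverstats --nodetype=storage --interval=5

# Per-client breakdown, useful for spotting a single client dominating the request queues
beegfs-ctl --clientstats --nodetype=storage --interval=5
```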
Dec 5, 14:10 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Dec 5, 13:30 MST
Scheduled - As many PL users have noticed, performance of PL/Active Beegfs has been very poor even after our last tuning 12 days ago.

We have been closely monitoring the system and have been in contact with our support to understand the reason for this. Yesterday, we identified a problem with our configuration. While trying to increase parallelism to serve simultaneous resource requests, we didn't realize that we were doing so at the storage target level, where there shouldn't be many concurrent requests. After analyzing some server statistics and getting feedback from Beegfs support, it became clear what we should be tuning in order to improve performance.

So, we plan to tune the servers again today to allow parallelism across the different storage targets.
The change should be non-disruptive.

Please keep reporting your experience with the system, especially after today's change, as we are interested in verifying its results.
Dec 5, 12:18 MST
Dec 4, 2019

No incidents reported.

Dec 3, 2019

No incidents reported.

Dec 2, 2019

No incidents reported.

Dec 1, 2019

No incidents reported.

Nov 30, 2019

No incidents reported.

Nov 29, 2019

No incidents reported.

Nov 28, 2019

No incidents reported.

Nov 27, 2019

No incidents reported.

Nov 26, 2019

No incidents reported.

Nov 25, 2019
Completed - The configuration change on the Beegfs storage servers was applied and the filesystem has been monitored.
No disruptions were identified. Benchmark tests have shown that the change improved the balancing of the storage servers among users. More tests will be performed, and the filesystem will continue to be monitored beyond the completion of this maintenance.

If you observe any issues please report to rc-help@colorado.edu as usual.
Nov 25, 12:20 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Nov 25, 10:15 MST
Scheduled - Some of you may have noticed performance degradation when using PetaLibrary Active/Beegfs this week.
This is the result of unbalanced processing of storage requests across users under the current Beegfs configuration.

Some time ago we improved Beegfs performance on the metadata servers by increasing the number of worker threads to process incoming requests. However, we haven't yet tuned the storage servers, which process reading and writing of files.

Our monitoring shows that the performance issues on Beegfs now lie mostly with the storage servers.
We are therefore performing a change recommended by the Beegfs developers this coming Monday (November 25) in an attempt to balance the processing of storage requests across our users. The change consists of creating a storage message queue for each user instead of the single queue that we have today; under the present configuration, incoming read/write requests are processed in first-come, first-served order.
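
For reference, we believe this corresponds to the per-user message queue option in the storage server configuration; a minimal sketch of the change, with the file path shown for illustration (the exact option name and location may vary by Beegfs version):

```
# /etc/beegfs/beegfs-storage.conf (excerpt, illustrative)
# Queue incoming requests per user rather than in one shared first-come,
# first-served queue, so a single heavy user cannot starve everyone else.
tuneUsePerUserMsgQueues = true
```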

We contacted one of our users to request that they place the system under high load on Monday, and we will run a benchmark provided by another PL user to verify how the system behaves after the change.

We expect the change itself to be non-disruptive, so users shouldn't encounter any error messages when using the /pl/active filesystem while we apply the change.

Please note that the configuration change itself should be completed within 15 minutes. The two-hour maintenance window accounts for filesystem monitoring after the change.
Nov 22, 14:46 MST
Nov 24, 2019

No incidents reported.

Nov 23, 2019
Completed - shas-interactive is now configured with DefMemPerCPU=1212.
Nov 23, 21:40 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Nov 23, 21:22 MST
Scheduled - The `shas-interactive` partition is used to configure resources available to interactive jobs. In order to provide additional interactive slots (supporting more simultaneous interactive jobs), the `shas-interactive` partition over-subscribes CPUs at a 4:1 ratio. However, the default memory request `DefMemPerCPU=4848` was mistakenly retained from the `shas` partition, effectively blocking CPU oversubscription. This was discovered recently when JupyterHub was unable to start more than 24 simultaneous sessions.

In order to alleviate this, we are reducing the default memory allocation for `shas-interactive` to `DefMemPerCPU=1212`. For now we will leave `MaxMemPerCPU=4848` in place, but we may reduce this value as well in the future.
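
For context, 1212 is exactly the `shas` default divided by the oversubscription factor (4848 / 4 = 1212), so four default-sized jobs now fit in the memory previously reserved for one. A minimal sketch of how these partition settings fit together in a Slurm configuration (everything other than the values quoted above is illustrative, not our exact slurm.conf):

```
# slurm.conf (excerpt, illustrative): a partition that oversubscribes CPUs 4:1.
# With the old DefMemPerCPU=4848, each default 1-CPU job reserved a full core's
# worth of memory, so memory ran out after roughly one job per physical core
# even though CPUs were oversubscribed.
PartitionName=shas-interactive OverSubscribe=FORCE:4 DefMemPerCPU=1212 MaxMemPerCPU=4848

# Inspect the live partition settings:
scontrol show partition shas-interactive
```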

This change will be made immediately to support a class that is using JupyterHub, but we will report back here on whether the change is successful (and retained) or unsuccessful (and reverted).

https://curc.readthedocs.io/en/latest/running-jobs/interactive-jobs.html
Nov 23, 21:21 MST
Nov 22, 2019

No incidents reported.