Monitoring - A fix has been implemented and we are monitoring the results.
Jan 27, 17:33 MST
Update - While the database server is down, the user management portal "RCAMP" is also non-functional.
Jan 27, 15:45 MST
Investigating - We have suspended job submission to both Blanca and Summit while we work to add space to the database server that hosts our Slurm database. Jobs currently running will continue to run, but no new jobs will be started.
Jan 27, 13:19 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 21, 14:00 MST
Scheduled - Today at 2:00 PM an on-site technician is scheduled to replace local boot drives in the storage infrastructure that supports RC Core Storage. This maintenance is intended to resolve an issue that is preventing us from upgrading the software platform of this storage cluster which, in turn, is preventing us from refreshing the infrastructure with new hardware.

No disruption is anticipated as a result of this maintenance.
Update - The fix addressing this issue has been deployed to one of our two storage servers, but coincident problems prevented us from completing both servers on the scheduled day. We will reschedule the completion of this effort, possibly for next week.
Jan 17, 09:59 MST
Identified - A component of the PetaLibrary/active service (ZFS, providing storage for beegfs-storage, part of the BeeGFS parallel file system) is experiencing a load-induced race condition. When the race condition results in an error, a write fails with an error message like "Bad address".

This issue has previously been reported (and resolved) upstream.

This fix is available in the 0.8 branch of ZFS. We are planning an update from our currently-deployed ZFS 0.7.13 to resolve this issue. We will provide updates here as more information becomes available.
Jan 6, 13:58 MST
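For reference, the "Bad address" text reported above is the standard error string for EFAULT, the error code a failed write surfaces when this kind of race condition trips. A quick Python check (illustrative only) confirms the mapping between the errno and the message users see:

```python
import errno
import os

# "Bad address" is the standard strerror text for EFAULT, the error
# code surfaced when a write fails as described in this incident.
efault_message = os.strerror(errno.EFAULT)
print(errno.EFAULT, efault_message)
```

On Linux this prints `14 Bad address`, which is why affected writes report that message rather than a more descriptive one.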
Research Computing Core - Under Maintenance
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Scheduled Maintenance
Isilon upgrade Jan 28, 12:00-17:00 MST
We will be upgrading the Isilon cluster at this time. It will be a rolling upgrade and should result in no downtime. There may be brief lags in performance as each node completes its upgrade and resets itself.
Posted on Jan 27, 12:55 MST
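The rolling pattern described above can be sketched generically: nodes are upgraded one at a time, so at any moment every other node is still serving and the cluster stays available. This is an illustrative sketch with placeholder node names, not the Isilon upgrade procedure itself:

```python
# Illustrative sketch of a rolling upgrade (not the Isilon procedure):
# upgrade one node at a time so the rest of the cluster keeps serving.
def rolling_upgrade(nodes, upgrade_one):
    in_service = set(nodes)
    order = []
    for node in nodes:
        in_service.discard(node)   # node drops out briefly to upgrade
        upgrade_one(node)          # upgrade runs; node resets itself
        in_service.add(node)       # node rejoins before the next begins
        order.append((node, len(in_service)))
    return order

# Placeholder node names; after each step the full cluster is back in service.
result = rolling_upgrade(["node1", "node2", "node3"], lambda n: None)
print(result)  # [('node1', 3), ('node2', 3), ('node3', 3)]
```

Because only one node is ever out of service, clients see at most a brief per-node lag rather than an outage, matching the "brief lags in performance" expectation above.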
Past Incidents
Jan 28, 2020

No incidents reported today.

Jan 26, 2020

No incidents reported.

Jan 25, 2020

No incidents reported.

Jan 24, 2020

No incidents reported.

Jan 23, 2020
Completed - The update of beegfs-meta from 7.1.2 to 7.1.4 was a success, with virtually no problems during deployment. Our secondary metadata server is now re-syncing from the primary, and we expect this to complete successfully.

This update fixes other issues as well, so we are hoping to see more stability in beegfs-meta overall as a result.

We do not believe this update disrupted access to PetaLibrary, aside from a brief pause in I/O, but we apologize if it interrupted any running jobs or caused errors in them.
Jan 23, 09:49 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 23, 09:30 MST
Scheduled - We have identified that the BeeGFS metadata component of PetaLibrary/active is experiencing a race condition that is provoking errors in the storage cluster, interrupting access, and preventing the proper synchronization of metadata between the two metadata servers.

A workaround is available in a patch, but deployment of this patch will necessitate a (hopefully brief) interruption while we restart the beegfs-meta service on the primary metadata server.

Because this prevents proper synchronization of our metadata, impeding our ability to recover from a primary metadata server failure and putting the data at risk in general, we are working to deploy this patch as soon as possible.

We will patch the secondary server today, and have scheduled the patching of the primary to occur on or after 09:30 Thursday (tomorrow) morning.

We are hoping that this interruption will be only a brief pause in I/O and will not impact running jobs beyond that. Further updates will be provided here as we have them.
Jan 22, 13:57 MST
Jan 22, 2020

No incidents reported.

Jan 20, 2020

No incidents reported.

Jan 19, 2020

No incidents reported.

Jan 18, 2020

No incidents reported.

Jan 17, 2020
Completed - We are marking this first half of the activity as complete, and will schedule a new maintenance activity for the second half.
Jan 17, 09:58 MST
Update - We discovered a ZFS build problem prior to our fail-over operation, which has since been resolved; we subsequently moved beegfs-storage back to boss2 successfully, and PetaLibrary remains available.

Since this operation took longer than scheduled, we will conclude our activities for today and re-schedule the second half of this activity.
Jan 15, 17:21 MST
Update - We are continuing our beegfs-storage update simultaneously with the beegfs-meta outage reported elsewhere. Again, these two issues appear to be completely independent and coincidental.

The upgrade of boss2 is complete. Our next step is to fail beegfs-storage back from boss1 to boss2. We will proceed with this operation now, which will cause a momentary pause in PetaLibrary I/O; however, this is not expected to cause any outage, nor did it previously (during our initial failover from boss2 to boss1).
Jan 15, 15:59 MST
Update - The upgrade on boss2 is still in progress; but we have become aware that there may be disruption to beegfs from at least some access points (notably login nodes). We are investigating.
Jan 15, 14:35 MST
Update - Our first failover operation has completed successfully and without error. All PetaLibrary beegfs-storage load is currently being carried by the "boss1" server. We will proceed with upgrades on boss2, during which time there will likely be performance degradation but no loss of access to data.

We will continue to provide updates here as we make progress.
Jan 15, 13:48 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 15, 13:01 MST
Update - We intend to begin maintenance activities on this cluster at approximately 13:00 today, though this will include some transit time for us to relocate to the datacenter. We will continue to provide updates here as we progress.
Jan 15, 11:01 MST
Scheduled - Research Computing will be conducting off-cycle planned maintenance this Wednesday to address a known issue with the ZFS component of the PetaLibrary. During the maintenance period, access to PetaLibrary and compute on RMACC Summit and Blanca should continue. There will be momentary pauses in I/O as services are moved from one storage server to another, and a likely decrease in performance with one server carrying the entire load, but we will do everything we can to ensure that the service remains up and available throughout the maintenance.

This activity addresses the previously-reported incident.
Jan 13, 12:26 MST
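Since the maintenance plan above promises only momentary I/O pauses rather than an outage, client-side code that retries transient write failures can ride through the service moves. A hypothetical sketch follows; the retry count, delay, and path are illustrative, not RC guidance:

```python
import time

# Hypothetical client-side sketch: retry a write through a momentary
# I/O pause like the ones expected while services move between servers.
def write_with_retry(path, data, retries=5, delay=0.5):
    for attempt in range(1, retries + 1):
        try:
            with open(path, "ab") as f:
                f.write(data)
            return attempt          # attempts used (1 = no retry needed)
        except OSError:
            if attempt == retries:
                raise               # pause outlasted the retry budget
            time.sleep(delay)

attempts = write_with_retry("/tmp/petalibrary_demo.bin", b"payload")
print(attempts)  # 1 when the first write succeeds
```

Batch workloads that wrap their I/O this way would see the pause as a short stall rather than a failed job.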
Resolved - The PetaLibrary BeeGFS metadata service has remained stable without further incident. We are continuing to work with our support vendor (the file system developer) to understand root cause.
Jan 17, 09:54 MST
Update - The PetaLibrary has remained accessible after our previous administrative action, including during and after our separate maintenance activities.

We are still waiting to hear from support regarding why this happened in the first place.

In the meantime, we will keep this incident open and continue monitoring the system closely.
Jan 15, 17:23 MST
Monitoring - We have implemented a fix for the beegfs-meta outage. The fix immediately restored access to the file system, but we are monitoring the system to observe whether the problem recurs.

We are in contact with both our integrator and file system developer regarding this issue.
Jan 15, 15:55 MST
Investigating - We are currently experiencing a BeeGFS metadata outage affecting all services that use the PetaLibrary, including Summit and Blanca. We are actively investigating the incident and will provide any information as we get it.

We currently believe this issue is unrelated to the maintenance work, as its initial symptoms started before any action had been taken, and before the maintenance was scheduled.

We apologize for this inconvenience, and will be working to restore service as soon as possible.
Jan 15, 15:06 MST
Jan 16, 2020

No incidents reported.

Jan 14, 2020

No incidents reported.