Update - We have identified a backup process that correlated with the timing of most of these errors. Two days ago we suspended this backup process and the error has not reoccurred since. We are still working with Dell to understand why the failure was occurring, and plan to restore the backup process once a fix has been deployed.
Aug 13, 12:36 MDT
Identified - We have been following up with Dell on this problem with core storage, which is still present. We were also able to reproduce it by running MATLAB jobs and confirmed "permission denied" errors on some of those jobs.
According to Dell, the root cause has been identified and they are working on a solution, which may arrive over the weekend.

In the meantime, we’ve asked Dell whether there are any conditions that can be monitored to prevent the problem from happening. The problem can also affect NFS shares, as it did today for PetaLibrary Active spaces that use BeeGFS.
Aug 6, 11:59 MDT
Update - Dell/EMC has escalated the issue internally to their engineering department.
Jul 9, 16:28 MDT
Investigating - An issue has developed on the core storage (/home, /projects, /curc/*). Occasional "Permission Denied" errors are occurring when accessing files on core storage. We have not yet been able to reproduce the error condition. This seems to be related to an update made to the operating system of the Isilon cluster.

We have a support ticket open with Dell/EMC and are working with them to determine the cause of this issue and develop a resolution.
Jul 9, 10:18 MDT
Research Computing Core - Operational
Science Network - Operational
RMACC Summit - Operational
Blanca - Operational
PetaLibrary - Operational
EnginFrame - Operational
JupyterHub - Operational
Past Incidents
Sep 19, 2020

No incidents reported today.

Sep 18, 2020

No incidents reported.

Sep 17, 2020

No incidents reported.

Sep 16, 2020

No incidents reported.

Sep 15, 2020

No incidents reported.

Sep 14, 2020

No incidents reported.

Sep 13, 2020

No incidents reported.

Sep 12, 2020

No incidents reported.

Sep 11, 2020

No incidents reported.

Sep 10, 2020
Completed - The scheduled maintenance has been completed.
Sep 10, 10:37 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 10, 08:00 MDT
Scheduled - We ran into an issue while upgrading RCAMP to Python 3 today. We have since fixed the issue and will try the upgrade again.
Sep 9, 15:19 MDT
Resolved - This incident is resolved.
Sep 10, 09:05 MDT
Monitoring - A fix has been implemented and we are monitoring the results.
Aug 31, 11:33 MDT
Identified - The issue has been identified and a workaround has been put in place. We are working with Globus support on a permanent fix.
Aug 21, 13:37 MDT
Investigating - Users are reporting authentication issues when utilizing Globus. We are currently investigating this issue.
Aug 21, 10:15 MDT
Sep 9, 2020
Completed - The scheduled maintenance has been completed.
Sep 9, 16:59 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 9, 16:05 MDT
Scheduled - ** Maintenance is still ongoing. We are waiting for PetaLibrary to be in a stable state before we release compute nodes back to production. **

Research Computing will perform planned maintenance on Wednesday, 9 September 2020, relating to PetaLibrary.

Maintenance is scheduled to take place between 08:00 and 14:00, though service will be restored as soon as all activities have concluded. During the maintenance period, the Summit, Blanca, and Viz clusters will be unavailable for jobs due to the nature of the PetaLibrary work.

- Update configuration on PetaLibrary to increase the overall reliability of the service

If you have any questions or concerns, please contact rc-help@colorado.edu.
Sep 9, 16:04 MDT
Completed - The scheduled maintenance has been completed.
Sep 9, 16:00 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 9, 08:00 MDT
Scheduled - Research Computing will perform planned maintenance on Wednesday, 9 September 2020, relating to PetaLibrary.

Maintenance is scheduled to take place between 08:00 and 14:00, though service will be restored as soon as all activities have concluded. During the maintenance period, the Summit, Blanca, and Viz clusters will be unavailable for jobs due to the nature of the PetaLibrary work.

- Update configuration on PetaLibrary to increase the overall reliability of the service

If you have any questions or concerns, please contact rc-help@colorado.edu.
Aug 27, 14:52 MDT
Completed - The scheduled maintenance has been completed.
Sep 9, 11:00 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 9, 08:00 MDT
Scheduled - We will be upgrading RCAMP to Python 3.
Sep 8, 09:56 MDT
Sep 8, 2020

No incidents reported.

Sep 7, 2020
Resolved - This incident has been resolved.
Sep 7, 19:48 MDT
Monitoring - Datacenter operations and facilities management both report that the cooling systems inside the HPCF are operating normally again. We have enabled all the queues on Summit again and will be monitoring the systems into the evening.
Sep 7, 13:48 MDT
Investigating - Around 11:50 today we started receiving alerts about issues with the cooling systems inside the HPCF, which houses the Summit and Blanca HPC nodes. We have set all queues on Summit to a down state for the time being to help reduce the load on Summit while datacenter operations and facilities management work on addressing the cooling issues. We have not stopped the queues for Blanca HPC, since the temperatures for those nodes have not yet reached critical limits.
Sep 7, 12:38 MDT
Sep 6, 2020

No incidents reported.

Sep 5, 2020

No incidents reported.