Update - We have identified a backup process whose timing correlated with most of these errors. Two days ago we suspended this backup process, and the error has not recurred since. We are still working with Dell to understand why the failure was occurring, and plan to restore the backup process once a fix has been deployed.
Aug 13, 12:36 MDT
Identified - We have been following up with Dell on this core storage problem, which is still present. We were also able to reproduce it by running Matlab jobs, and confirmed the "permission denied" errors on some of those jobs.
According to Dell, the root cause has been identified and they are working on a solution, which may arrive over the weekend.

In the meantime, we’ve asked Dell whether there are any conditions that can be monitored to prevent the problem from happening. The problem can also affect NFS shares, as it did today for PetaLibrary Active spaces that use BeeGFS.
Aug 6, 11:59 MDT
Update - Dell/EMC has escalated the issue internally to their engineering department.
Jul 9, 16:28 MDT
Investigating - An issue has developed on the core storage (/home, /projects, /curc/*). Occasional "Permission Denied" errors are occurring when accessing files on core storage. We have so far been unable to reproduce the error condition. The issue appears to be related to an update made to the operating system of the Isilon cluster.

We have a support ticket open with Dell/EMC and are working with them to determine the cause of this issue and develop a resolution.
Jul 9, 10:18 MDT
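Intermittent errors like these are hard to catch on demand. A minimal probe sketch in Python (illustrative only, not RC's actual diagnostic; paths and parameters are assumptions) that repeatedly walks a directory, attempts to read each file, and counts "Permission Denied" occurrences could look like:

```python
import os
import time

def probe_permissions(root, passes=3, delay=0.0):
    """Walk `root` repeatedly, attempting to read each regular file.

    Returns the number of PermissionError occurrences observed, which
    should be zero on a healthy file system for files the caller owns.
    """
    denied = 0
    for _ in range(passes):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as fh:
                        fh.read(1)  # force an actual read through the file system
                except PermissionError:
                    denied += 1
        if delay:
            time.sleep(delay)
    return denied
```

Run against a directory of files the user owns (e.g. a test tree under /home), any nonzero count would indicate the intermittent failure.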
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
Aug 15, 2020

No incidents reported today.

Aug 14, 2020
Resolved - The upstream network incident was resolved yesterday and service appears to be restored and stable.
Aug 14, 09:10 MDT
Monitoring - Upstream reports that network service has been restored. All CU Boulder services should be available, including users' ability to connect to the VPN.
Aug 13, 10:57 MDT
Update - Some services are coming back online as access from off-campus is being restored. The network is still running in a degraded state, so those services may go back down. We will post additional updates as they become available.

The upstream incident is being reported at https://oit.colorado.edu/node/25451
Aug 13, 09:11 MDT
Investigating - After upstream firewall maintenance this morning, a service disruption occurred that impacts users' ability to access many CU Boulder services and websites. IdentiKey logins are also impacted, as is the VPN. This likely affects the ability to access RC services as well.

OIT is working to resolve this issue as quickly as possible. We will post more updates as they become available.
Aug 13, 08:31 MDT
Aug 13, 2020

Unresolved incident: Core storage "Permission Denied" errors.

Aug 12, 2020

No incidents reported.

Aug 11, 2020

No incidents reported.

Aug 10, 2020

No incidents reported.

Aug 9, 2020

No incidents reported.

Aug 8, 2020

No incidents reported.

Aug 7, 2020
Resolved - The issue affecting access to Summit storage from outside of Summit was resolved earlier today.
Aug 7, 16:17 MDT
Monitoring - A fix has been implemented and we are monitoring the results.
Aug 7, 10:25 MDT
Investigating - Summit storage, including a subset of PetaLibrary allocations backed by Summit storage, is not currently accessible outside of Summit. This appears to be a residual networking configuration problem following the networking changes performed during the planned maintenance.

This will be addressed as soon as possible on Friday.
Aug 7, 00:39 MDT
Completed - Summit has been returned to service, and today's planned maintenance activities have all completed successfully.
Aug 7, 00:19 MDT
Update - Scheduled maintenance is still in progress. We will provide updates as necessary.
Aug 6, 23:16 MDT
Update - We have restored network access to the HPCF and are slowly bringing compute resources back online. We hope to have Summit and Blanca HPC resources back in production in the next few hours.
Aug 6, 21:50 MDT
Update - We are still waiting for network access to be restored at the HPCF. Once connectivity has been restored we should be in a position to restore service.
Aug 6, 19:01 MDT
Update - Power has been restored at the HPCF, and we are starting to bring systems back up. A simultaneous network change at the HPCF "gateway" has presented some configuration challenges, and we are working through those now.
Aug 6, 16:25 MDT
Update - Today's planned maintenance activities in the HPCF are in progress and reportedly on-schedule. We are scheduled to have power again at 2:30 PM, and will do our best to restore service as soon as possible after that.
Aug 6, 09:05 MDT
Update - Today's maintenance is in progress. However, it has come to our attention that we neglected to announce that a subset of PetaLibrary allocations, still temporarily located on Summit storage, are also unavailable for the duration of this maintenance period. We regret omitting this detail in our prior announcement.
Aug 5, 12:40 MDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Aug 5, 07:00 MDT
Update - Because of the impact and breadth of the HPCF maintenance scheduled to start 5 August (Wednesday), I have so far emphasized it in our announcements; but I have neglected to mention that we also intend to add additional storage to the PetaLibrary/active cluster, in support of native ZFS allocations and an eventual migration away from BeeGFS. This should be non-disruptive, but we have experienced disruption from similar operations in the past. For this reason we are performing the operation during the maintenance period, though we do not intend to proactively halt jobs on the portion of Blanca that is otherwise unaffected by the HPCF work.
Aug 4, 23:17 MDT
Update - Be reminded that we have an extended planned maintenance outage for the HPCF, including Summit and a portion of Blanca, scheduled to start tomorrow, 5 August 2020 (Wednesday). This outage addresses an outstanding electrical health and safety issue at the datacenter.
Aug 4, 14:15 MDT
Update - Be reminded that we have an extended planned maintenance outage for the HPCF, including Summit and a portion of Blanca, scheduled 5 August 2020 (Wednesday). This outage addresses an outstanding electrical health and safety issue at the datacenter.
Jul 23, 10:25 MDT
Scheduled - The datacenter operations team that supports the RC environment has requested an extended 48-hour outage window to correct an electrical health and safety issue at the High Performance Computing Facility (HPCF). This outage is being scheduled to coincide with our August regular maintenance schedule.

During this maintenance, Summit compute, Summit scratch, and Blanca HPC will be entirely offline. This includes the following Blanca partitions:

- blanca-curc
- blanca-nso
- blanca-topopt
- blanca-ngpdl

While the maintenance window is scheduled for 48 hours, we will endeavor to complete the work as quickly as possible without compromising its quality.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Jul 7, 12:01 MDT
Aug 6, 2020
Resolved - The PetaLibrary/active outage was caused by a failure in our core storage service; the cause has already been identified, and a resolution from the vendor is pending. Our monitoring system is configured to be silent for most failures during our maintenance windows, which explains why we were not notified. PetaLibrary/active is stable, and we plan to implement workarounds for both issues during our maintenance period in September.
Aug 6, 11:33 MDT
Investigating - At approximately 5am, a failover event failed to migrate a process, leaving all PetaLibrary/active allocations inaccessible. Our monitoring system also failed to notify us of the event. The PL/active service is stable at the moment, and both issues are being investigated.
Aug 6, 08:16 MDT
Aug 5, 2020
Aug 4, 2020
Resolved - With the assistance of our user community, we were able to free up space on rc_scratch. Since then, we have enabled per-user quotas on rc_scratch; while not all users have had quotas implemented yet, this new functionality will give us greater ability to manage near-full events in the future.

We have also added additional monitoring for rc_scratch capacity and fullness.
Aug 4, 14:18 MDT
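Capacity monitoring of the kind mentioned above can be sketched with the Python standard library alone. The 85% threshold below matches the fill level reported in the earlier update, but the function and its parameters are illustrative assumptions, not RC's actual monitoring configuration:

```python
import shutil

def check_capacity(path, warn_pct=85.0):
    """Return (used_pct, warning) for the file system containing `path`.

    `warning` is True when usage meets or exceeds `warn_pct`, the point
    at which operators would start freeing space or tightening quotas.
    """
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return used_pct, used_pct >= warn_pct
```

A cron job or monitoring agent would call this periodically for each scratch mount and raise an alert when the warning flag is set.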
Identified - We have a plan for establishing per-user quotas in /rc_scratch, but it requires a configuration change for which the file system must be briefly unmounted. As such, we are scheduling a brief outage for our next PM date, 1 July, and will plan to make the change then.

In the meantime, at least some space has been freed, and rc_scratch is now showing 85% full with 16T available. Per-user quotas have been partially implemented for a few rc_scratch storage outliers, and we will work with those users individually so they can keep working until the quota configuration change is made.
Jun 19, 12:08 MDT
Investigating - The /rc_scratch file system, mostly used during Blanca computation, is full. We are working on a plan and hope to implement it today. Until then, individuals are being contacted to ask them to free up space.

Be advised: data is, as always, automatically removed from /rc_scratch after 90 days; but bursts of write activity within that time window can fill the file system.
Jun 19, 09:28 MDT
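The 90-day purge described above is a common scratch-space policy. A simplified sketch of such a policy (hypothetical; no handling of symlinks, open files, or empty directories, which a production purge would need) is:

```python
import os
import time

def purge_old_files(root, max_age_days=90, now=None):
    """Delete regular files under `root` not modified within `max_age_days`.

    Returns the list of paths removed. Directories are left in place.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

As the update notes, a policy like this bounds file age but not total volume: a burst of writes younger than the cutoff can still fill the file system, which is why quotas were added as well.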
Resolved - We applied a patch to the operating system that underlies core storage. This patch appears to have addressed the issues that led to uncontrolled node reboots. We have not experienced any further node reboots since the patch was applied, so we believe this issue is resolved.
Aug 4, 14:16 MDT
Investigating - We are investigating an issue that is causing uncontrolled node reboots in the core storage infrastructure that serves /home, /projects, and /curc. We have experienced two reboots so far. In both cases the nodes recovered on their own and rejoined the cluster after a few minutes; but, during each reboot, access to these file systems was blocked.

We have a case open with upstream support regarding this issue.
Jul 13, 15:13 MDT
Aug 3, 2020

No incidents reported.

Aug 2, 2020

No incidents reported.

Aug 1, 2020

No incidents reported.