Update - 46/47 recoveries are complete. We are communicating directly with the affected customer on the final recovery effort; incremental status updates will not be posted here. We will leave this incident open until the damaged storage pool has been rebuilt and we begin moving data back to it.
Mar 24, 2023 - 08:54 MDT
Update - Recovery of PetaLibrary active allocations impacted by the March 1 event continues. We have copied 39 of the 47 impacted allocations to alternate storage. These allocations are available at their normal path ("/pl/active/") and may be used immediately.

Of the eight allocations that have not yet been recovered, we have strong evidence that three are unused and have deprioritized those. The remaining five present a more challenging recovery scenario: we cannot yet project when we will know whether the data is retrievable, or exactly how long recovery will take. We are continuing to work with a filesystem developer on this effort and will communicate updates as we have them.

Some questions have been raised about our plans to address this issue going forward. As an immediate protection, we have taken steps to detect the condition that triggered this recovery effort. Longer term, we are exploring a backup option for PetaLibrary, with more details to be shared through our stakeholder process as that plan develops.
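
For illustration only, a detection check of this general shape periodically inspects pool health; the sketch below parses the output of zpool status -x, which reports only unhealthy pools. It is not our production monitoring, and the alerting hook is hypothetical.

    # Minimal sketch of a periodic pool-health check (illustrative only; this is
    # not RC's production monitoring). "zpool status -x" prints "all pools are
    # healthy" when nothing is wrong and otherwise lists only the troubled pools.
    import subprocess

    def pool_report() -> str:
        result = subprocess.run(
            ["zpool", "status", "-x"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        report = pool_report()
        if report != "all pools are healthy":
            # Hypothetical alerting hook; a real check would page the on-call staff.
            print("ALERT: ZFS reports a problem:\n" + report)
        else:
            print("OK: all pools are healthy")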

Mar 13, 2023 - 16:18 MDT
Update - The restore of one PetaLibrary allocation to an alternate storage location finished last night, and spot checks by the customer indicate that their data is intact. A restore of a second allocation was started as soon as the first finished.

The success of a single allocation is promising, but we still need to handle each recovery on a case-by-case basis. We will provide another update on Monday.

Mar 10, 2023 - 14:15 MST
Update - We are attempting to recover the first PetaLibrary allocation now. The underlying storage pool is sufficiently damaged that it will never again be mounted in read/write mode, meaning we must copy all data elsewhere. This will take time, given the amount of data involved (hundreds of TB). This first recovery effort is expected to finish late today or early tomorrow, and we will share the results of the initial attempt.
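
As rough, illustrative arithmetic only (the data size and sustained throughput below are assumptions, not measured figures for this system), the scale of the copy looks like this:

    # Back-of-the-envelope copy-time estimate (illustrative only; the size and
    # throughput below are assumptions, not measurements of this system).
    def copy_hours(size_tb: float, throughput_gb_per_s: float) -> float:
        """Hours needed to copy size_tb terabytes at a sustained rate in GB/s."""
        size_gb = size_tb * 1000            # 1 TB = 1000 GB (decimal units)
        return size_gb / throughput_gb_per_s / 3600

    # Example: 300 TB at a sustained 2 GB/s is roughly 42 hours of copying,
    # before accounting for small files, retries, or read errors.
    print(f"{copy_hours(300, 2.0):.0f} hours")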

If this initial recovery succeeds, that does not mean all data will be recoverable, but it does suggest to us that more data _should_ be recoverable. We will not be certain this process can recover everything until all data has been copied to a new location. We will continue to post updates as this process proceeds and provide timelines as we learn more.

Mar 09, 2023 - 14:56 MST
Update - After attempting the standard ZFS recovery procedure (rolling the pool back to prior checkpoints), the storage pool holding the impacted allocations still could not be imported. Our next step is to work with the ZFS developer group to determine whether there is a way to identify which data has been damaged.
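
For readers curious what the standard procedure looks like at a high level, the sketch below shows the kind of conservative, read-only import attempt that is typically tried first. It is a simplified illustration with a hypothetical pool name and mount root, not a transcript of the commands we ran.

    # Rough illustration of a conservative, read-only ZFS import attempt
    # (hypothetical pool name and alternate root; not the exact commands we ran).
    import subprocess

    POOL = "examplepool"       # hypothetical pool name
    ALTROOT = "/mnt/recovery"  # mount under an alternate root, away from /pl

    def zpool_import(*extra: str) -> bool:
        """Attempt a read-only import with the given extra flags; True on success."""
        cmd = ["zpool", "import", "-o", "readonly=on", "-R", ALTROOT, *extra, POOL]
        print("running:", " ".join(cmd))
        return subprocess.run(cmd).returncode == 0

    if zpool_import():
        print("pool imported read-only with no rewind needed")
    elif zpool_import("-F", "-n"):
        # Dry run: ZFS reports that discarding the last few transactions would
        # make the pool importable, without actually importing it yet.
        zpool_import("-F")
        print("pool imported read-only after discarding recent transactions")
    else:
        print("pool cannot be imported this way; deeper recovery work is needed")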

We suspect (but cannot yet prove) that any damaged data is most likely limited to snapshots taken while the system was in maintenance, which makes impact to customer data less likely. The ZFS development community has been very helpful, but this process will likely be lengthy.

Given the length of this outage, we would like to work with impacted customers to set up temporary allocations for storing any new data. We understand this outage affects your workflows, and we want to do all we can to let you continue your work while recovery is under way. To facilitate this, we will open a ticket for each impacted allocation owner and work with you to find the best way to restore your operations while the data recovery effort continues.

Mar 06, 2023 - 13:58 MST
Update - We are continuing to investigate this issue.
Mar 06, 2023 - 13:57 MST
Update - We are still working through the list of checkpoints. One of the filesystem developers has begun work on a patch that will let us bypass some sanity checks, which should allow us to import the storage pool and give us a clearer picture of where things stand. The next update will be on Monday.
Mar 03, 2023 - 16:22 MST
Update - The storage pool containing the affected filesystems still cannot be imported. We have isolated the affected disks to a single system, are running a newer version of the filesystem software that is more tolerant of certain errors, and are now attempting to import the storage pool at each of its roughly 75 checkpoints.
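
For context, the sketch below illustrates what that kind of iteration can look like in general. It is illustrative only: the pool name and checkpoint (transaction group) numbers are hypothetical, and the rollback flag shown is an assumption about the OpenZFS build in use rather than a description of our exact procedure.

    # Simplified sketch of walking back through earlier pool states when the
    # current state will not import. Illustrative only: the pool name and
    # checkpoint/transaction-group numbers are hypothetical, and the "-T <txg>"
    # rollback option is an assumption about the OpenZFS build in use.
    import subprocess

    POOL = "examplepool"                        # hypothetical pool name
    CANDIDATE_TXGS = [812345, 812301, 812257]   # hypothetical checkpoints, newest first

    def import_at(txg: int) -> bool:
        cmd = ["zpool", "import", "-o", "readonly=on", "-N", "-T", str(txg), POOL]
        return subprocess.run(cmd).returncode == 0

    for txg in CANDIDATE_TXGS:
        if import_at(txg):
            print(f"pool imported read-only at checkpoint {txg}; verify, then copy data off")
            break
    else:
        print("no candidate checkpoint produced an importable pool")
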
Mar 02, 2023 - 17:07 MST
Investigating - One of the storage pools that supports PetaLibrary/active allocations currently cannot be imported. Until this is resolved, the following PetaLibrary allocations are not available:

/pl/active/abha4861
/pl/active/BFALMC
/pl/active/BFSeqF
/pl/active/BioChem-NMR
/pl/active/BioCore
/pl/active/BMICC
/pl/active/brennan
/pl/active/CAL
/pl/active/ceko
/pl/active/CIEST
/pl/active/cmbmgem
/pl/active/COSINC
/pl/active/CU-TRaiL
/pl/active/CUBES-SIL
/pl/active/CUFEMM
/pl/active/EML
/pl/active/fanzhanglab
/pl/active/FCSC
/pl/active/Goodrich_Kugel_lab
/pl/active/GreenLab
/pl/active/ics
/pl/active/ics_affiliated_PI
/pl/active/ics_archive_PI
/pl/active/Instrument_Shop
/pl/active/iphy-sclab
/pl/active/JILLA-KECK
/pl/active/jimeno_amchertz
/pl/active/LangeLab
/pl/active/LMCF
/pl/active/LugerLab-EM
/pl/active/McGlaughlin_Lab-UNC
/pl/active/MIMIC
/pl/active/Morris_CSU
/pl/active/neu
/pl/active/NSO-IT
/pl/active/OGL
/pl/active/OIT_DLS_BBA
/pl/active/OIT_DLS_CCS
/pl/active/OIT_DLS_ECHO
/pl/active/Raman_Microspec
/pl/active/scepi_magthin
/pl/active/snag
/pl/active/STEMTECH
/pl/active/swygertlab
/pl/active/UCB-NMR
/pl/active/VoeltzLab

We are working with one of the filesystem developers to understand and resolve the issue. Because the developer is in the Eastern time zone, the next update on this issue will likely not come until tomorrow (Mar 2) morning.
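
Allocation owners who want to confirm whether a specific path is affected can run a quick check along these lines (illustrative only; the allocation name shown is hypothetical):

    # Quick check of whether a PetaLibrary allocation path is currently reachable
    # (illustrative only; "ExampleLab" is a hypothetical allocation name).
    import os
    import sys

    def allocation_available(name: str) -> bool:
        path = os.path.join("/pl/active", name)
        try:
            os.listdir(path)   # raises OSError if the path is missing or unreachable
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        name = sys.argv[1] if len(sys.argv) > 1 else "ExampleLab"
        state = "available" if allocation_available(name) else "NOT currently available"
        print(f"/pl/active/{name} is {state}")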

Mar 01, 2023 - 17:12 MST

Current Status
Research Computing Core: Operational
Alpine: Operational
Blanca: Operational
PetaLibrary: Operational
Open OnDemand: Operational
CUmulus OpenStack Platform: Operational
AWS ec2-us-west-2: Operational
AWS rds-us-west-2: Operational
AWS s3-us-west-2: Operational
RMACC Summit: Operational
Science Network: Operational
Scheduled Maintenance
Upcoming Monthly Planned Maintenance May 1, 2024 07:00 - May 2, 2024 07:00 MDT
We will perform our monthly scheduled planned maintenance during this time. Affected services will be unavailable.
Posted on Apr 23, 2024 - 14:28 MDT
Past Incidents
Apr 24, 2024

No incidents reported today.

Apr 23, 2024

No incidents reported.

Apr 22, 2024

No incidents reported.

Apr 21, 2024

No incidents reported.

Apr 20, 2024

No incidents reported.

Apr 19, 2024

No incidents reported.

Apr 18, 2024

No incidents reported.

Apr 17, 2024
Resolved - We believe this incident is resolved. We will continue to monitor.
Apr 17, 17:00 MDT
Monitoring - Changes to our topology file this morning appear to have significantly eased this issue. We will monitor through the day.
Apr 17, 09:28 MDT
Update - We are continuing to assess. We have provided interim solutions to speed the start of interactive jobs. Thus far, these appear to have succeeded, though we will work to improve the user experience for them in the coming days.

SchedMD and RC continue to investigate batch jobs. Our best current understanding is that the cluster is consistently under high load or, on many occasions, has resources reserved for very large jobs, either of which leads to longer waits for jobs to start. We have not reached a conclusion and are continuing to monitor, but much evidence (including detailed analysis of log files relevant to backfill scheduling and priority) points in this direction.

As such, RC is discussing options to better support smaller jobs. This may include changes to priority calculations or reconfiguring the cluster with dedicated resources for “short” jobs, to speed processing.
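
For users trying to understand why a particular queued job has not started, Slurm's pending-reason field is often informative. The sketch below is illustrative only (not an RC-provided tool); it summarizes the reasons Slurm reports for jobs that are currently pending:

    # Summarize why pending Alpine jobs are waiting, using Slurm's own reason
    # field (illustrative only; not an RC-provided tool).
    import subprocess
    from collections import Counter

    def pending_reasons() -> list[str]:
        # "%r" is the reason a job is pending (e.g. Priority, Resources).
        out = subprocess.run(
            ["squeue", "-t", "PD", "--noheader", "-o", "%r"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line.strip() for line in out.splitlines() if line.strip()]

    if __name__ == "__main__":
        counts = Counter(pending_reasons())
        for reason, n in counts.most_common():
            print(f"{n:6d} pending jobs: {reason}")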

We will leave this incident open for at least one more full day, at which point the team will convene to make a final determination.

Apr 15, 13:20 MDT
Update - We have established a recommendation to ensure researchers have the ability to run interactive sessions as we continue investigation.

Users seeking interactive resources with limited wait times should use the testing partitions (atesting, atesting_mi100, or atesting_a100) or acompile instead of amilan, aa100, or ami100. Please see our documentation for more information about interactive jobs: https://curc.readthedocs.io/en/latest/running-jobs/interactive-jobs.html?highlight=sinteractive#general-interactive-jobs

Our next update will be on Monday.

Apr 12, 15:36 MDT
Update - We are engaging the vendor regarding performance degradation in job submission times on Alpine. They have sought, and we have provided, additional information to better diagnose the cause. Internal troubleshooting continues at the same time. We expect our next update to be tomorrow.
Apr 11, 18:16 MDT
Update - In the interest of system consistency, we are awaiting guidance from support prior to performing additional tests and troubleshooting. We will provide an update as soon as possible.
Apr 11, 09:55 MDT
Update - We are continuing to investigate. We have engaged SchedMD, the vendor who provides support for the Slurm scheduler. We expect to have our next update tomorrow morning.
Apr 10, 17:48 MDT
Investigating - The issue has persisted. We are continuing to investigate.
Apr 10, 14:55 MDT
Monitoring - The issue with delayed starts to Alpine jobs appears to have improved following troubleshooting this morning. We will monitor today for regression or continued improvement.
Apr 10, 11:16 MDT
Update - We are continuing to investigate. The acompile service on Alpine was affected and has been restored. Work continues on the primary Alpine partitions.
Apr 10, 10:25 MDT
Investigating - Queued jobs on Alpine are experiencing delayed starts. We are investigating the issue and will provide an update when more information is available.

Running jobs on Alpine are not impacted and are expected to complete successfully. Jobs on Blanca are not impacted.

Apr 9, 19:18 MDT
Apr 16, 2024

No incidents reported.

Apr 15, 2024
Apr 14, 2024

No incidents reported.

Apr 13, 2024

No incidents reported.

Apr 12, 2024
Apr 11, 2024
Apr 10, 2024