Monitoring - Summit has been returned to service, and access to Summit storage has been restored.

Significant work was required to complete the consistency checks needed to satisfy safeguards in both the block storage and the clustered filesystem that make up the Summit storage stack. No errors have been detected; as a proof-of-fitness test, a great deal of data has been read from Summit storage, and significant writes have been performed as well.

We believe these failures stem from a failure the SFA storage hardware encountered over the weekend; more specifically, from a failure of the SFA platform to appropriately accommodate the recovery activity performed Monday morning under direction from upstream vendor support.

We have passed all of this information on to the vendor for analysis.
May 4, 02:15 MDT
Identified - The problem with Summit storage experienced earlier has recurred. We are investigating the root cause of the issue and will resolve it as soon as possible, though this may not be until Tuesday morning.

Summit Slurm has been paused, so no new jobs will attempt to start until this has been resolved.
May 3, 22:30 MDT
Monitoring - This outage may have affected all access to /scratch/summit (including from compute nodes), and is likely related to maintenance on Summit scratch hardware this morning. The underlying issue has been fixed, and we are working with the vendor to understand why an outage occurred.
May 3, 14:09 MDT
Investigating - Access to /scratch/summit from RC hosts that are NOT compute nodes (notably the login nodes) is down at the moment. This affects some PetaLibrary allocations as well.
May 3, 12:31 MDT
Monitoring - Last night RC experienced a failure in its "Core Virtual Infrastructure," which hosts, among many other things, the login nodes and Slurm services. This is the second such failure recently, though the first passed without notable disruption. This time neither the login nodes nor the Blanca Slurm service was automatically returned to service correctly, apparently due to network involvement in the disruption.

We have advised our upstream OIT support team, which administers the Core Virtual Infrastructure, of this failure and are awaiting their feedback. Meanwhile, we have restored the login nodes and the Blanca Slurm service.

We will continue to monitor the situation and follow up with upstream support staff on Monday.
Apr 24, 10:57 MDT
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
May 8, 2021

No incidents reported today.

May 7, 2021

No incidents reported.

May 6, 2021

No incidents reported.

May 5, 2021

No incidents reported.

May 4, 2021

Unresolved incident: Access to /scratch/summit and interim PetaLibrary allocations.

May 3, 2021
May 2, 2021

No incidents reported.

May 1, 2021

No incidents reported.

Apr 30, 2021

No incidents reported.

Apr 29, 2021

No incidents reported.

Apr 28, 2021

No incidents reported.

Apr 27, 2021

No incidents reported.

Apr 26, 2021

No incidents reported.

Apr 25, 2021

No incidents reported.

Apr 24, 2021
Resolved - This incident has been resolved.
Apr 24, 10:54 MDT
Monitoring - Access to the GitLab server has been restored, and configuration changes have been made. We are monitoring to see whether this is sufficient to keep the service available.

If your account is still locked, please follow the unlock instructions that were sent to you in email.
Mar 10, 14:27 MST
Update - Our security office has reviewed our logs and confirmed our understanding of the attack. There appears to have been no actual unauthorized access, aside from the fact that the attacker seems to have a list of at least some valid accounts on the GitLab server. We are investigating how this list may have been obtained, but it likely relates to the intentionally public access to some projects stored on the server.

We are about to restore access to this server and make two changes:

- New accounts will require admin approval in order to be created. (This should be handled using an existing automated workflow.)

- Unauthenticated connections will be rate-limited by IP address.

These changes, particularly the rate limiting, may require some tuning, so we will make an initial estimate and adjust as necessary.
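As a rough illustration of the second change (not necessarily the configuration used on this server), per-IP rate limiting of requests can be sketched at an nginx reverse proxy in front of a GitLab instance. The zone name, hostname, upstream, rate, and burst values below are all placeholder assumptions:

```nginx
# Hypothetical sketch: per-IP request rate limiting at the reverse proxy.
# The zone size (10m), rate (10 requests/second), and burst (20) are
# illustrative starting points, not the values used on this server.
limit_req_zone $binary_remote_addr zone=gitlab_limit:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name gitlab.example.edu;  # placeholder hostname

    location / {
        # Requests beyond the burst allowance are rejected with HTTP 429.
        limit_req zone=gitlab_limit burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://gitlab-workhorse;  # placeholder upstream
    }
}
```

In practice the rate and burst values would be tuned against observed legitimate traffic, which matches the note above that an initial estimate may need adjustment.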
Mar 10, 14:19 MST
Update - We have upgraded all packages relevant to this GitLab instance, but the attack is ongoing. To protect the server and its account credentials, we will leave the server offline while the security office reviews our logs and recommends a mitigation plan.
Mar 9, 12:11 MST
Investigating - We are investigating an apparently ongoing attack on our externally facing GitLab server. Details about the attack have been communicated to the IT security office. We are taking this opportunity to upgrade the OS and the GitLab software to ensure we have all the latest security updates. So far we have no indication of any actual unauthorized access or compromise.

You may have received a notification that your account was locked, with "Unlock instructions." Until this attack has been completely addressed, we recommend _not_ attempting to unlock or access your GitLab account.

More information will be provided here as it is available.
Mar 9, 10:41 MST