Investigating - After the IBM library was shut down last Wednesday for the planned UPS maintenance, our TSM server did not start correctly. After some debugging we brought the system up late on Thursday.

It turns out that the storage manager system has failed again, and we have been investigating the possible cause. Right now, allocations on PL Archive legacy (under /archive) are not accessible.

According to the library interface itself, no errors are reported and the drives are operational, but we need to understand and fix the TSM issue to make the PL allocations accessible again. To that end, we have contacted our support vendor and will also be engaging IBM soon.

We will provide further updates as we learn more.
Mar 2, 08:19 MST
Research Computing Core: Operational
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Partial Outage
EnginFrame: Operational
JupyterHub: Operational
Past Incidents
Mar 7, 2021

No incidents reported today.

Mar 6, 2021

No incidents reported.

Mar 5, 2021

No incidents reported.

Mar 4, 2021

No incidents reported.

Mar 3, 2021
Completed - Today's planned maintenance activities have concluded, and Summit, Blanca, and PetaLibrary have been returned to production.

- Legacy PetaLibrary/archive is still offline following a planned outage last week; we will resume our attempts to restore service there.

- We replaced a defective backplane in one of the Blanca HPC chassis that was impeding correct functioning of the cooling system and part of the InfiniBand interconnect. We also repaired compute nodes that had been damaged by the faulty backplane.

- We upgraded InfiniBand interconnect firmware in Blanca HPC to bring all chassis to the latest version.

- We improved the clustering fail-over configuration in PetaLibrary/active to prevent some erroneous failure conditions we have previously experienced.

- We migrated data within PetaLibrary/active BeeGFS to free up infrastructure for conversion to ZFS storage. This will likely cause a reduction in performance for allocations that remain in BeeGFS until they are migrated to ZFS.

- We performed further tests for SMB support (particularly fail-over) in PetaLibrary/active.
Mar 3, 16:50 MST
Update - Due to an oversight, not all Blanca compute resources were correctly reserved for today's planned maintenance activities. As a result, some jobs were running when the PM period started. We have requeued these jobs where possible, but some jobs may not be requeueable and will have been cancelled as a result.
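
If you want to verify what happened to a specific job, the sketch below is one way to list your recent job states. It is only a convenience sketch: it assumes Slurm's sacct command is available on the node where you run it (as on the login nodes), and the start date shown is the maintenance day.

```python
# Convenience sketch: list this user's recent Slurm jobs and their states so
# REQUEUED or CANCELLED entries from the PM window stand out.
# Assumption: sacct is available on the node where this runs.
import getpass
import subprocess

def job_states_since(start_date: str) -> str:
    """Return a table of this user's jobs and their states since start_date."""
    cmd = [
        "sacct",
        "-u", getpass.getuser(),          # only this user's jobs
        "-S", start_date,                 # jobs eligible/started since this date
        "-o", "JobID,JobName%30,State,Elapsed",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    # Look for REQUEUED or CANCELLED states around the maintenance window.
    print(job_states_since("2021-03-03"))
```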
Mar 3, 10:09 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Mar 3, 07:00 MST
Scheduled - Research Computing will perform regularly-scheduled planned maintenance on Wednesday, 3 Mar 2021. March's activities include:

- routine maintenance of the UPS for HPCF (Summit compute/storage). The HPCF UPS will be in "maintenance bypass" mode, meaning the HPCF will be dependent on utility power for 30-60 minutes.
- data relocation on PetaLibrary/active to support ongoing migrations to a new filesystem (BeeGFS to ZFS).

Maintenance is scheduled to take place between 07:00 and 19:00, though service will be restored as soon as all activities have concluded. During the maintenance period no jobs will run on Summit resources, and access to PetaLibrary/active allocations via /pl/active and /work will not be available.

If you have any questions or concerns, please contact rc-help@colorado.edu.
Feb 26, 11:02 MST
Resolved - All new production has been moved to the new DTN nodes, covering all of the names dtn.rc.colorado.edu, dtn.rc.int.colorado.edu, dtn-data.rc.int.colorado.edu, and dtn-new-data.rc.int.colorado.edu. As such, we are closing this incident. Some Globus shared endpoints are still using the legacy DTN nodes, but migrating shared endpoints to the new nodes requires coordination with the endpoint owners. We will be scheduling a date for shutdown of the legacy DTNs, after which those shared endpoints will stop working; however, we will still be able to migrate them after the fact.
Mar 3, 12:23 MST
Monitoring - We may have found the reason why the data interfaces of the older DTNs aren't working as expected.

We have announced that a workaround (until the new nodes are tested) is to use dtn02.rc.colorado.edu, which currently relies on other network interfaces. However, for reasons still unknown, the connection between dtn02.rc.colorado.edu and other RC nodes has been unstable, as noted today.

We are close to completing the transition of users to the new nodes, after testing FTP and sshfs services. A message with the current status of this transition will be sent very shortly via rc-news@colorado.edu.
Feb 10, 11:13 MST
Investigating - There is a problem affecting the data interfaces of our older data transfer nodes.
We cannot reach other hosts on the RC data network through those interfaces. As a result, BeeGFS PL/active, Summit scratch, and other filesystems fail to mount on those nodes.

Users who use sshfs and ftp via dtn.rc.colorado.edu or dtn-data.rc.int.colorado.edu are also affected by this issue.

Those nodes are specifically dtn01.rc.colorado.edu and dtn02.rc.colorado.edu.
Some services are still running on dtn02 because its data interface was brought down and the management interface is serving the traffic used by those services. For this reason, you may want to use dtn02.rc.colorado.edu specifically when using rsync, ftp, or sshfs.
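
A minimal sketch of this interim workaround is below. It simply wraps rsync over SSH against dtn02.rc.colorado.edu; it assumes rsync is installed locally and that you have SSH access to dtn02, and the username and paths are placeholders only.

```python
# Minimal sketch of the interim workaround: copy data through
# dtn02.rc.colorado.edu with rsync over SSH.
# Assumptions: rsync is installed locally and you have SSH access to dtn02;
# the username and paths below are placeholders only.
import subprocess

WORKAROUND_HOST = "dtn02.rc.colorado.edu"

def rsync_to_dtn02(local_path: str, remote_path: str, user: str) -> None:
    """Copy local_path to remote_path on RC storage via dtn02."""
    dest = f"{user}@{WORKAROUND_HOST}:{remote_path}"
    # -a preserves permissions and timestamps; -P shows progress and allows
    # resuming interrupted transfers.
    subprocess.run(["rsync", "-aP", local_path, dest], check=True)

if __name__ == "__main__":
    # Placeholder values; substitute your own username and paths.
    rsync_to_dtn02("./dataset/", "/path/on/rc/storage/", "your_username")
```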

We are prioritizing the work that still needs to be completed on the new data transfer nodes so that dtn01 and dtn02 can be decommissioned. That may take more than a week, but we should be able to move users to the new nodes within a week while we continue to finalize the remaining tasks we had identified for the new nodes. Users will be contacted by email.

If you have any questions, please write to us at rc-help@colorado.edu.
Feb 5, 10:34 MST
Resolved - The Summit Ethernet aggregation switch was replaced successfully, and service was restored. We happen to be in the middle of a planned maintenance outage right now; but Summit has been operating normally otherwise.

A second switch has been physically deployed alongside the current aggregation switch, and we intend to split production across the two switches in a redundant pair. This should prevent such a single-point-of-failure outage in the future.
Mar 3, 12:21 MST
Update - Summit appears to have remained up and stable since we replaced the aggregation switch with our shelf spare. We anticipate receipt of a replacement switch on Tuesday, at which point we intend to deploy the two switches as an active/active redundant pair ("stack"), hopefully obviating this risk in the future.
Feb 13, 22:26 MST
Update - All Summit partitions are now online and accepting jobs. We will be closely monitoring the operation of the new hardware. The system is believed fully operational and stable at this time.
Feb 13, 13:26 MST
Monitoring - The new hardware is in place and we are verifying correct operation of the cluster.
Feb 13, 12:53 MST
Identified - Networking and Research Computing are on-site and working to repair the failed network connection with replacement hardware.
Feb 13, 11:33 MST
Investigating - Summit is offline again. We are working to determine the cause.
Feb 13, 08:10 MST
Update - All Summit partitions are once again accepting and running jobs. We believe the system to be stable at this time, but will continue monitoring it throughout the weekend. Thank you for your patience during this extended outage.
Feb 12, 20:07 MST
Monitoring - We believe that the network problem that led to today's outage has been addressed. We have allowed some jobs to start on Summit, and we are monitoring the system stability to observe whether we have any further problems.

If the system remains stable, we intend to release the system for regular use in the next hour or so.

After this, we will begin preparations to fix this single-point-of-failure so that this doesn't happen again in the future.
Feb 12, 19:02 MST
Update - Networking team continues to work on the issue.
Feb 12, 15:58 MST
Identified - The switch has failed again and Summit is unavailable at present. The CU network team is heading on-site to troubleshoot the switch; if necessary, a spare is available to be swapped in.
Feb 12, 10:04 MST
Monitoring - A network aggregation switch became unresponsive around 04:44 this morning. The switch has been rebooted and is operating correctly now. The CU networking team is investigating the root cause of the switch failure.
Feb 12, 09:33 MST
Investigating - RMACC Summit experienced an outage sometime overnight and is presently offline. Some storage partitions, including PetaLibrary, are affected. We are investigating the issue and will provide an update as soon as possible.
Summit and PetaLibrary users will be unable to access these resources until the issue is resolved. Blanca nodes appear to be unaffected; however, Blanca jobs that use /pl/active for job I/O may be impacted.
Feb 12, 07:25 MST
Mar 2, 2021

Unresolved incident: PL/Archive Legacy Offline.

Mar 1, 2021

No incidents reported.

Feb 28, 2021

No incidents reported.

Feb 27, 2021

No incidents reported.

Feb 26, 2021
Completed - TSM is up now, as are the PL Archive legacy allocations (under /archive).

We had to replace a system board in one of the servers in the Archive GPFS cluster. After fixing that server's network as a follow-up to the system board swap, we restarted the entire Archive GPFS cluster, and everything looks good now.

If you have questions or concerns please write to rc-help@colorado.edu.
Feb 26, 01:24 MST
Verifying - Power was restored to the IBM tape library.
However, the TSM server that connects to the library did not start correctly. We will investigate tomorrow why it did not start as expected and provide further updates.
Feb 24, 19:43 MST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Feb 24, 06:30 MST
Scheduled - The Data Center team will perform critical maintenance on UPS A at the SPSC/N190 data center tomorrow (2/24). Among the RC resources hosted in SPSC/N190, only the IBM tape library is expected to be impacted by the maintenance and will require downtime.

We are planning to power off the library at 6:30am. As such, access to all /archive allocations (legacy archive allocations; not /pl/archive allocations) will be affected. Files in any /archive allocation that are hosted on disk will still be accessible, but reads of files stored only on tape are expected to fail while the library is down.
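
As a rough illustration only (not an authoritative check), one common heuristic on HSM-managed filesystems is that a file whose contents have been migrated to tape reports its full size but zero allocated disk blocks. The sketch below assumes that heuristic applies to the legacy /archive space and uses a placeholder path.

```python
# Rough heuristic sketch: on HSM-managed filesystems, a file migrated to tape
# often reports its full logical size while having zero blocks allocated on
# disk. This is an assumption, not an authoritative TSM query, and the path
# below is a placeholder.
import os

def looks_tape_only(path: str) -> bool:
    """Guess whether a file's data currently resides only on tape."""
    st = os.stat(path)
    return st.st_size > 0 and st.st_blocks == 0

if __name__ == "__main__":
    example = "/archive/your_allocation/some_file.dat"  # placeholder path
    if looks_tape_only(example):
        print(f"{example}: likely tape-only; reads may fail during the downtime")
    else:
        print(f"{example}: data appears to be on disk and should remain readable")
```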

The library should be back online by 5pm. We will post an update as soon as the IBM library is back up.
Feb 23, 14:06 MST
Feb 25, 2021

No incidents reported.

Feb 24, 2021
Feb 23, 2021

No incidents reported.

Feb 22, 2021
Completed - The VAX (vertical) robotic column in the library was realigned and is functional again.
We are asking Spectra whether this maintenance is expected to be needed frequently and, if not, why the problem keeps recurring.

All 5 PL/archive allocations listed previously are operational again.
Feb 22, 13:46 MST
Update - The Spectra engineer is arriving on site.
The StrongBox management software will be shut down shortly. At that point the following allocations will be inaccessible:

/curcZone/pl/archive/anschutz-test
/curcZone/pl/archive/calipso_zhu
/curcZone/pl/archive/ics
/curcZone/pl/archive/mfix
/curcZone/pl/archive/VertZoo
Feb 22, 10:53 MST
Update - We are waiting for the Spectra engineer to arrive on site.
The 5 allocations on /pl/archive remain operational. As soon as the engineer arrives, we will post an update here and shut down the StrongBox management software.
Feb 22, 08:48 MST
Scheduled - We are receiving alerts regarding move failures in our Spectra tape library. Those errors may again require a column alignment similar to the one performed in November 2020. Some tests confirm that files can still be accessed even in the presence of these errors (including files hosted only on tape), so there is a chance that we are experiencing something different this time.

We are scheduling a downtime from 8:00am to 12:00pm on Monday for a Spectra engineer to come on site, determine the cause of the recent events, and apply any needed corrections.

The allocations that will be affected by the downtime are:
/curcZone/pl/archive/anschutz-test
/curcZone/pl/archive/calipso_zhu
/curcZone/pl/archive/ics
/curcZone/pl/archive/mfix
/curcZone/pl/archive/VertZoo

We will post updates on Monday as we learn more about these errors and will announce when the allocations above are available again.
Feb 19, 18:13 MST
Feb 21, 2021

No incidents reported.