Update - Regarding the Globus Shared Endpoints, we are actively testing a procedure to get the endpoints operating again. We spent some time identifying a procedure that would be as simple as possible for the Shared Endpoint admins: it involves recreating the Shared Endpoints with the same permissions and rules defined today. We likely won't be able to recreate all the endpoints in a single day next week, mainly because the process requires working together with the admins, but we will try to complete it as early as possible.
Aug 10, 21:01 MDT
Update - Please note that only transfers that use a shared endpoint as either the source or destination are expected to fail.

Transfers that don't use a shared endpoint are expected to work, provided you deactivate your credentials for the host endpoint and then re-authenticate on that endpoint. (To deactivate your credentials for an endpoint, go to the Endpoints tab, click the three dots for the endpoint you want to use, and press "Deactivate Credentials". Then re-authenticate.)

Tomorrow we will contact the shared endpoint admins with instructions for fixing those endpoints as well.

Sorry for the trouble,
Aug 9, 19:01 MDT
Update - Login to Research Computing Globus endpoints now works.
A Certificate Authority (CA) certificate had expired, and because the host certificate on the data transfer nodes was issued using that CA certificate, the expired CA certificate invalidated the host certificate as well. We have replaced the host certificate, and this issue is now resolved.
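This failure mode can be reproduced locally: a host certificate passes chain verification only while its issuing CA certificate is still valid. A minimal sketch with short-lived self-signed test certificates (hypothetical /tmp paths and example names, not the actual Globus Connect CA or our host certificates):

```shell
# Create a short-lived test CA (example only, not the Globus Connect CA)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/test-ca.key -out /tmp/test-ca.pem \
  -subj "/C=US/O=Example/CN=Example Test CA"

# Issue a host certificate signed by that CA
openssl req -newkey rsa:2048 -nodes \
  -keyout /tmp/host.key -out /tmp/host.csr -subj "/CN=dtn.example.edu"
openssl x509 -req -in /tmp/host.csr -CA /tmp/test-ca.pem \
  -CAkey /tmp/test-ca.key -CAcreateserial -days 1 -out /tmp/host.pem

# Inspect the CA's expiration date; once its notAfter passes, the host
# certificate fails chain verification even if its own dates are still valid
openssl x509 -enddate -noout -in /tmp/test-ca.pem
openssl verify -CAfile /tmp/test-ca.pem /tmp/host.pem
```

While the CA is valid, the final command reports the host certificate as OK; after the CA expires, the same command reports a certificate-expired error like the one seen on our endpoints.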

However, the shared endpoints' certificates are now also invalid (just like the host certificate was), so transfers will probably fail.
We will send instructions tomorrow for fixing this and will provide assistance as needed.

- Patricia
Aug 9, 18:40 MDT
Investigating - Globus users,

If you try to log in to Globus RC endpoints, you will get: "The certificate has expired: Credential with subject: /C=US/O=Globus Consortium/CN=Globus Connect CA has expired."

This is something that appeared today and is being investigated.

I'll report back once we have more news.

- Patricia
Aug 9, 09:48 MDT
Research Computing Core: Major Outage
Science Network: Operational
RMACC Summit: Operational
Blanca: Operational
PetaLibrary: Operational
Sneffels (Viz) Cluster: Operational
JupyterHub: Operational
Sandstone HPC: Operational
Past Incidents
Aug 18, 2018

No incidents reported today.

Aug 17, 2018

No incidents reported.

Aug 16, 2018

No incidents reported.

Aug 15, 2018

No incidents reported.

Aug 14, 2018

No incidents reported.

Aug 13, 2018

No incidents reported.

Aug 12, 2018

No incidents reported.

Aug 11, 2018

No incidents reported.

Aug 10, 2018
Resolved - We have completed the configuration change on the Summit scratch servers and the gateway, and memory use has returned to normal. We have provided this feedback to Intel, as it appears that an OPA configuration change caused the issue, and we should hear back on the actual root cause next week.

We have not yet applied this change to all compute nodes; however, the issue appears to have had the largest effect on our file servers, and does not appear to have had the same impact on pure filesystem clients. We may delay further changes on compute nodes until we have been able to get more information from Intel.
Aug 10, 17:24 MDT
Update - After testing on a representative node, we believe that we have identified the source of the new memory utilization pattern that has been affecting Summit scratch. We are making an additional configuration change to the Summit scratch servers, but this should not impact production access to the filesystem. (Two of four servers have already been updated.)

Meanwhile, sgate1, which provides access to Summit scratch from beyond Summit, has crashed, likely due to an out-of-memory event from this same cause. We will be applying the fix there as well, and returning the node to service as soon as possible.
Aug 10, 16:25 MDT
Monitoring - We have completed the revert of the recent Summit scratch configuration changes, and returned Summit scratch and Summit compute to service. We will continue to monitor memory use and see if the system has returned to its previous behavior.
Aug 9, 17:27 MDT
Update - Unfortunately, an attempt to revert some of the recent configuration changes to Summit scratch has caused the entire subsystem to go offline. We are working to bring scratch back up as soon as possible.
Aug 9, 16:17 MDT
Investigating - Following a recent (supplier-recommended) configuration change, we have started experiencing out-of-memory events in the Summit scratch subsystem. This has led to intermittent loss of access to Summit scratch from the login nodes, dtn, and other non-Summit systems. Access to Summit scratch from Summit itself, including from scompile, has been unaffected so far.

There may be further outages of Summit scratch access from outside of Summit while we continue to investigate, particularly if such an out-of-memory event occurs overnight.
Aug 9, 16:14 MDT
Resolved - The datacenter operations team was able to address the HVAC issue at the HPCF without requiring an outage or affecting production. Scheduling of new jobs has been resumed.
Aug 10, 16:10 MDT
Investigating - We have been advised by the datacenter operations team that there may be an HVAC (cooling) issue at the HPCF that will impact Summit. The service is up right now, but we have temporarily prevented new jobs from starting in case we need to take an emergency outage.
Aug 10, 13:51 MDT
Resolved - One of the Summit compile nodes, shas0348, crashed unexpectedly, and was rebooted. The node has been returned to service.

Given the nature of access available on these nodes, some crashes like this are to be expected; however, it is possible that this was an out-of-memory event related to the ongoing issues with Summit scratch. We will continue to monitor this node and the situation in general.
Aug 10, 14:15 MDT
Aug 8, 2018

No incidents reported.

Aug 7, 2018

No incidents reported.

Aug 6, 2018
Resolved - Between Aug 5 16:37:39 and Aug 6 09:25:00, Summit scratch was inaccessible from login nodes, visualization nodes, and data-transfer nodes. This was the result of an out-of-memory event that caused the unexpected reboot of the gateway server that provides this access. Operations on Summit itself, including jobs running on Summit, were unaffected.

We will monitor this gateway server to better understand the cause of this outage. We apologize for any inconvenience.
Aug 6, 09:30 MDT
Aug 5, 2018

No incidents reported.

Aug 4, 2018

No incidents reported.