PL/Archive Legacy Offline
Incident Report for CU Boulder RC
Resolved
Nodes are stable and no failures have been observed for more than 24 hours. We are confident that the failures were identified and fixed. We will keep the MegaRAID daemon turned off. All PL/Archive legacy allocations are available now.
Posted Mar 30, 2021 - 14:14 MDT
Update
Following up on Friday's update, we have identified the reason why one of the CNFS nodes was unresponsive on Friday.
Sometimes ABRT does not act on a crashing process quickly enough. Other processes (including GPFS) were affected by the MegaRAID daemon crash, and ABRT only removed the MegaRAID daemon hours later.
It is likely feasible to configure ABRT to better fit the nodes' needs. Instead of doing that, though, we simply disabled the MegaRAID daemon, and it is no longer running on the CNFS nodes.
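For reference, a minimal sketch of the kind of check that can confirm the daemon stays off on both CNFS nodes (the hostnames and systemd unit name below are placeholders, not our actual configuration):

    #!/usr/bin/env python3
    # Hypothetical check: report whether the MegaRAID storage manager
    # service is still inactive on each CNFS node. Hostnames and the
    # systemd unit name are placeholders, not the real ones.
    import subprocess

    CNFS_NODES = ["cnfs1.example.edu", "cnfs2.example.edu"]  # placeholder hostnames
    UNIT = "megaraid-storage-manager.service"                # placeholder unit name

    for node in CNFS_NODES:
        # "systemctl is-active" prints the unit state and exits non-zero
        # when the unit is not running.
        result = subprocess.run(
            ["ssh", node, "systemctl", "is-active", UNIT],
            capture_output=True, text=True,
        )
        print(f"{node}: {UNIT} is {result.stdout.strip() or 'unknown'}")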

The nodes are stable and GPFS is healthy. If that remains the case tomorrow, PL/archive legacy will be returned to production.
Posted Mar 29, 2021 - 19:38 MDT
Update
We want to monitor PL/archive legacy for longer, given that one of the CNFS nodes went down again. Both nodes are up and running right now, since the system board was successfully replaced in one of them.

We will keep monitoring and report back on Monday about the system status.
Posted Mar 25, 2021 - 19:40 MDT
Monitoring
We fixed the TSM Server after installing an RSA key on the Cluster NFS (CNFS) nodes used by the GPFS file system. That key had apparently been removed by our configuration management system, leaving the CNFS nodes unable to communicate with each other, which caused messages from the TSM Server to pile up until the server became unresponsive. We also disabled the storage manager client running on the TSM Server (it is not required), which was consuming a lot of CPU through its Java process.
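As a rough illustration (not the exact procedure we used), a connectivity check of that kind, with placeholder hostnames, could look like this:

    #!/usr/bin/env python3
    # Hypothetical check that each CNFS node can reach the other over ssh
    # with key-based authentication. Hostnames are placeholders; BatchMode
    # makes ssh fail immediately instead of prompting if key auth is broken.
    import itertools
    import subprocess

    CNFS_NODES = ["cnfs1.example.edu", "cnfs2.example.edu"]  # placeholder hostnames

    for src, dst in itertools.permutations(CNFS_NODES, 2):
        result = subprocess.run(
            ["ssh", src, "ssh", "-o", "BatchMode=yes", dst, "true"],
            capture_output=True, text=True,
        )
        print(f"{src} -> {dst}: {'ok' if result.returncode == 0 else 'FAILED'}")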

Unfortunately, the same unresponsiveness that affected the TSM Server also manifested on the CNFS nodes, taking PL/archive legacy offline for a prolonged period of time. As we started debugging the possible cause of the failure on the CNFS nodes, the system board on one of them also failed. Since the beginning of this week we have had visits scheduled with our vendor to replace the board, but they are having trouble replacing it correctly and setting its serial number.

PL/archive legacy can tolerate one CNFS node failure, though. So, while we work on getting the board replaced for that node, we have continued debugging the original problem on the CNFS nodes. Yesterday we noticed that the MegaRAID storage manager running on the other node was crashing from time to time. That was affecting the CNFS monitor daemon used by GPFS, at which point both the machine and GPFS would become unreachable. We verified that ABRT (the Automatic Bug Reporting Tool used by Red Hat) was trying to recover from the MegaRAID daemon failure, but it could not because of a package key verification. We turned off that verification, and the next time the MegaRAID daemon crashed, ABRT killed the crashed MegaRAID process and both the node and GPFS remained alive afterwards.

We believe ABRT used to remove the MegaRAID process upon a crash in the past; however, the key it uses to verify the MegaRAID package may have expired, and ABRT can only kill the crashed process if the package verification succeeds (or if verification is disabled).
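For context, on Red Hat systems this behaviour is typically controlled by /etc/abrt/abrt-action-save-package-data.conf; the relevant settings look roughly like the excerpt below (shown as an illustration, not a copy of our configuration):

    # /etc/abrt/abrt-action-save-package-data.conf (illustrative excerpt)
    # With OpenGPGCheck = yes, ABRT only acts on crashes from packages whose
    # GPG signature it can verify; setting it to no disables that check.
    OpenGPGCheck = no
    # Whether crashes in binaries not owned by any package are processed.
    ProcessUnpackaged = no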

So, PL/Archive is recovering. We continue to monitor the system and are currently exercising it with many reads and writes to and from disk and tape. If the system looks healthy and responsive by 7pm today, we will announce that it is back in production.
Posted Mar 25, 2021 - 12:15 MDT
Investigating
After the IBM library was shut down last Wednesday for the planned UPS maintenance, our TSM Server was not starting correctly. After some debugging, we brought the system up late on Thursday.

It turns out that the storage manager system has failed again, and we have been debugging the possible cause. Right now, allocations on PL/Archive legacy (under /archive) are not accessible.

According to the library interface itself, no errors are reported and the drives are operational. But we need to understand and fix the TSM issue to make the PL allocations accessible again. To that end, we have contacted our support provider and will also be engaging IBM soon.

We will provide further updates as we learn more.
Posted Mar 02, 2021 - 08:19 MST
This incident affected: PetaLibrary.