Beegfs Management daemon debug week

Scheduled Maintenance Report for CU Boulder RC

Completed

The scheduled maintenance has been completed.

Posted Jul 20, 2019 - 09:15 MDT

Verifying

The Beegfs management problem occurred today, but it manifested differently this time. We have no indication that the number of file descriptors used by the management daemon was growing, as it did when the problem happened before. The daemon also recovered from the error on its own; previously, we had to restart it manually so that clients could connect to it again.
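
As an illustration of the file descriptor symptom described above, here is a minimal sketch of how a process's open descriptors could be counted on Linux and how a series of samples could be checked for steady growth. This is not the tooling we run; the function names count_open_fds and is_growing, the proc_root parameter, and the growth rule are assumptions made for this example.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count open file descriptors for a process (Linux only).

    Each entry under <proc_root>/<pid>/fd is one open descriptor.
    Reading another user's process usually requires root.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def is_growing(samples, min_increase=1):
    """Return True if every sample exceeds the previous one by at
    least min_increase. Hypothetical rule: a real check would likely
    tolerate noise and look at a longer time window.
    """
    if len(samples) < 2:
        return False
    return all(b - a >= min_increase for a, b in zip(samples, samples[1:]))
```

Sampling count_open_fds for the daemon's PID at a fixed interval and passing the collected values to is_growing would flag the earlier failure mode, though not the behavior seen today.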

We didn't realize the problem had occurred until we received a ticket from a user reporting that their jobs failed today.

We did capture debug data with GDB, though, and sent it to the vendor along with the log messages we collected this time.

Management daemon restarts will be enabled again, at least until we hear back from the vendor. We will announce next steps based on their feedback.

If you have any questions please write to rc-help@colorado.edu.

Posted Jul 19, 2019 - 16:39 MDT

Update

We are continuing to run beegfs-mgmtd without proactive restarts. So far we have not experienced a failure.

Posted Jul 17, 2019 - 10:13 MDT

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Jul 15, 2019 - 09:15 MDT

Scheduled

Next week (July 15th to 19th), we plan to disable the restarts of the Beegfs management daemon that we currently have in place to avoid disconnects between Beegfs management and clients.

Those disconnects led to the error many users have seen: "Communication error on send" for PetaLibrary/Active.

We enabled the daemon restarts in May, and no failures associated with the disconnects have occurred since then. However, the cause of those failures remains unclear to us and to the Beegfs developers. We plan to debug the problem next week by capturing a dump of all running threads with GDB. The Beegfs developers have been expecting that debug data from us for some time, and it should help them narrow down the problem and identify the real cause.
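
For reference, a dump of all running threads can be taken by attaching GDB to the process in batch mode and running "thread apply all bt". The sketch below shows one way to script that; it is an illustration only, not the exact procedure we will follow, and the helper names gdb_thread_dump_cmd and dump_threads are assumptions made for this example.

```python
import subprocess


def gdb_thread_dump_cmd(pid):
    """Build a GDB command line that attaches to `pid`, prints a
    backtrace for every thread, then detaches and exits.
    """
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_threads(pid, out_path):
    """Run GDB against `pid` and write the backtraces to out_path.

    Requires gdb to be installed and permission to attach to the
    process (typically root). The process is paused while attached.
    Returns GDB's exit code.
    """
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_thread_dump_cmd(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

Because the process is stopped while GDB is attached, a dump like this is best taken quickly and only when the problem is actually occurring.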

We don't know if or when failures will actually occur once the daemon restarts are disabled. This is an attempt to provoke the problem and get the debug data that Beegfs support needs.

The daemon restarts will be enabled again on the 19th (hopefully earlier, once the problem is reproduced). The restarts will remain in place until the problem is fixed or another debug step is required (we will communicate either scenario). We suspect that the problem is correlated with the load on the system, so please don't hesitate to submit jobs that use PL/Active next week. The sooner we can reproduce the problem and collect the debug data, the sooner we can prevent it from happening again until it is fixed.

Please note that this will affect the PL/Active spaces in Beegfs, not the interim PL/Active spaces that are still hosted on Summit.

(Moved from Incidents to Scheduled Maintenance)

Posted Jul 12, 2019 - 12:22 MDT

This scheduled maintenance affected: PetaLibrary.