BeeGFS management daemon debug week
Scheduled Maintenance Report for CU Boulder RC
Completed
The scheduled maintenance has been completed.
Posted Aug 24, 2019 - 20:15 MDT
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Aug 19, 2019 - 09:15 MDT
Scheduled
Next week (August 19th to 24th), we plan to disable the restarts of the BeeGFS management daemon that we currently have in place to avoid disconnects between the BeeGFS management daemon and its clients.

Those disconnects caused the error that many users have seen on PetaLibrary/Active: "Communication error on send". We enabled the daemon restarts in May, and no failures associated with the disconnects have occurred since then. However, the cause of those failures remains unclear both to us and to the BeeGFS developers, even after a similar debug week earlier.

We plan to debug the problem next week using a different approach. The BeeGFS support team will attach GDB to the management daemon so that they can observe the problem while it is happening. We don't know whether or when the failure will actually occur once the daemon restarts are disabled, so this is an attempt to provoke the problem and debug it.

The daemon restarts will be re-enabled on the 24th at 8 pm (hopefully earlier, once the problem has been reproduced). The restarts will then stay in place until the problem is fixed or until another debug step is required (we will communicate either scenario).

We suspect that the problem is correlated with the load on the system, so please don't refrain from using PL/Active next week. The earlier we can reproduce and debug the problem, the better, so that we can keep it from recurring until it is fixed. Please note that this affects only the PL/Active spaces in BeeGFS, not the interim PL/Active spaces that are still hosted on Summit.
Posted Aug 15, 2019 - 10:14 MDT
This scheduled maintenance affected: PetaLibrary.