Momentary outage in PL/active (BeeGFS)

Incident Report for CU Boulder RC

Resolved

Tonight we experienced another beegfs-meta outage. As before this appears to have created _hangs_ in IO accessibility, rather than returning errors; so running jobs should continue successfully, adding only a few minutes to their runtime.

As before, we are still pursuing a root cause analysis with the developer, and we used this opportunity to take a few more log and telemetry readings than we have in the past; but this also motivates our plans to move away from this particular platform, and those plans are ongoing.

edit: We have received reports that this did, in fact, cause job failures. We apologize for this interruption, and for this misunderstanding.

Posted Feb 18, 2020 - 21:00 MST