PetaLibrary race condition leading to "Bad address" write error
Incident Report for CU Boulder RC
Resolved
ZFS has been updated on both beegfs-storage servers, which is expected to resolve this issue.
Posted Feb 05, 2020 - 16:35 MST
Update
The fix for this issue has been deployed to one of our two storage servers, but coincident problems prevented us from completing both servers on the scheduled day. We will reschedule the update of the remaining server, possibly for next week.
Posted Jan 17, 2020 - 09:59 MST
Identified
A component of the PetaLibrary/active service (ZFS, providing storage for beegfs-storage, part of the BeeGFS parallel file system) is experiencing a load-induced race condition. When the race condition results in an error, a write fails with an error message like "Bad address".
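
For users who see this error in their own jobs, the following is a minimal sketch of how a failed write might be detected and retried. "Bad address" is the standard message for the `EFAULT` error code. The helper name `retry_on_efault`, the attempt count, and the delay are hypothetical choices for illustration, and the idea that retrying is a suitable workaround is an assumption, not a recommendation from this report.

```python
import errno
import time


def retry_on_efault(operation, attempts=3, delay=0.5):
    """Call operation(), retrying only when it fails with EFAULT ("Bad address").

    Hypothetical helper: any other OSError is re-raised immediately, and the
    EFAULT error itself is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            if exc.errno != errno.EFAULT or attempt == attempts:
                raise
            # Assumption: the failure is transient, so wait briefly and retry.
            time.sleep(delay)


def write_file(path, data):
    """Write bytes to path; the error surfaces here as OSError with errno EFAULT."""
    with open(path, "wb") as handle:
        handle.write(data)
```

A caller could wrap a write as `retry_on_efault(lambda: write_file(path, data))`, where `path` and `data` are supplied by the job.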

This issue has previously been reported (and resolved) upstream.

https://github.com/zfsonlinux/zfs/issues/8640

The fix is available in the 0.8 branch of ZFS. We are planning an update from our currently deployed ZFS 0.7.13 to a 0.8 release to resolve this issue. We will provide updates here as more information becomes available.
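
As an illustration of the version comparison involved, the sketch below checks whether a ZFS version string falls in the 0.8 branch or later. The function names are hypothetical, and the threshold of any 0.8 release is an assumption, since this report does not say which point release first carries the fix. Reading the version from `/sys/module/zfs/version` assumes a Linux host with the ZFS kernel module loaded.

```python
import re
from pathlib import Path

# Assumption: the loaded ZFS module exposes its version at this path.
ZFS_VERSION_FILE = Path("/sys/module/zfs/version")
# Assumption: any release in the 0.8 branch or later includes the fix.
FIXED_IN = (0, 8)


def parse_zfs_version(text):
    """Parse a string such as '0.7.13-1' into a tuple such as (0, 7, 13)."""
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized ZFS version string: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def has_fix(version_text):
    """Return True if the major.minor version is at least FIXED_IN."""
    return parse_zfs_version(version_text)[:2] >= FIXED_IN


if __name__ == "__main__":
    if ZFS_VERSION_FILE.exists():
        version = ZFS_VERSION_FILE.read_text().strip()
        print(version, "includes fix" if has_fix(version) else "predates fix")
    else:
        print("ZFS module version file not found")
```

Under these assumptions, `has_fix("0.7.13")` evaluates to false and `has_fix("0.8.0")` evaluates to true.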
Posted Jan 06, 2020 - 13:58 MST
This incident affected: PetaLibrary.