PetaLibrary race condition leading to "Bad address" write error
Incident Report for CU Boulder RC
Update
The fix for this issue has been deployed to one of our two storage servers, but coincident problems prevented us from updating both servers on the scheduled day. We will reschedule the remaining update, possibly for next week.
Posted Jan 17, 2020 - 09:59 MST
Identified
A component of the PetaLibrary/active service (ZFS, which provides storage for beegfs-storage, part of the BeeGFS parallel file system) is experiencing a load-induced race condition. When the race condition is triggered, a write fails with an error message like "Bad address".

This issue has previously been reported (and resolved) upstream.

https://github.com/zfsonlinux/zfs/issues/8640

The fix is available in the 0.8 branch of ZFS. We are planning an update from our currently deployed ZFS 0.7.13 to resolve this issue. We will provide updates here as more information becomes available.
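
Until the update is complete, applications that hit this error may be able to tolerate it by retrying the failed write. The following Python sketch is illustrative only, not an official workaround. It assumes the failure surfaces to the application as an OSError with errno EFAULT (the errno whose message is "Bad address" on Linux) and that a retry after a short pause is likely to succeed because the race is load-induced. The function name write_with_retry and its retries and delay parameters are hypothetical.

```python
import errno
import os
import time


def write_with_retry(fd, data, retries=3, delay=0.5):
    """Write all of `data` to `fd`, retrying writes that fail with EFAULT.

    Assumption: the race condition surfaces as OSError(errno.EFAULT).
    Any other error, or too many consecutive EFAULT failures, is re-raised.
    """
    view = memoryview(data)
    offset = 0
    attempt = 0
    while offset < len(view):
        try:
            # os.write may write fewer bytes than requested, so loop.
            offset += os.write(fd, view[offset:])
            attempt = 0  # progress was made; reset the retry counter
        except OSError as exc:
            if exc.errno != errno.EFAULT or attempt >= retries:
                raise
            attempt += 1
            time.sleep(delay * attempt)  # back off a little more each time
    return offset
```

A caller would open the file with os.open and pass the descriptor, for example write_with_retry(fd, b"data"). If retries are exhausted the original error is raised, so persistent failures are still visible.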
Posted Jan 06, 2020 - 13:58 MST
This incident affects: PetaLibrary.