PetaLibrary/active outage
Incident Report for CU Boulder RC
Resolved
On 7 May 2019 the Research Computing operations team performed
acceptance testing on a portion of our new PetaLibrary infrastructure,
"BeeGFS." Due to a sequence of unexpected events, the PetaLibrary
experienced two unplanned, production-impacting outages: the first
relatively brief, the second moderately long. We are taking further
steps to prevent such outages in the future.

The details of these events and our next steps are provided below.


## The plan

The two tests to be performed were:

- Remove a single disk from the environment; observe that full
read/write access to PetaLibrary/active is retained; reinsert the
disk; observe that the disk is properly re-incorporated
automatically.

- "Fail" an entire disk enclosure; observe the behavior of the cluster
with the disk enclosure offline; return the enclosure to service;
observe that the enclosure and its disks are properly
re-incorporated automatically.

We further hoped to co-schedule a configuration change on one BeeGFS
server, "bmds2," to switch it from 10-gigabit to 40-gigabit
ethernet. This was a configuration change only: its 40-gigabit link
had been operating at 10 gigabits due to a switch-side
misconfiguration.

Before testing began, we recognized that the enclosure test would be
unavoidably disruptive; so this test was postponed for rescheduling
during a planned maintenance outage.

The removal of a single disk was intended to be non-disruptive. A
prior instance of this test _had_ been disruptive, but a patch
provided by BeeGFS development was expected to have addressed that
issue.
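
Because each storage target is backed by a ZFS pool, the single-disk
test's "properly re-incorporated" check can be confirmed from the pool
side. The following is a minimal sketch of how that verification might
look; the pool name "boss104" is hypothetical.

```
# Confirm that no pool is reporting errors after the disk is reinserted:
zpool status -x

# Inspect the affected pool (name hypothetical) to verify that the
# reinserted disk has rejoined its vdev and any resilver has completed:
zpool status boss104
```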


## What happened

Application of the patch went as expected; however, restarting the
beegfs-storage daemon (necessary to activate the patch) exposed a
previously unknown fault in our quota configuration that caused
beegfs-storage to enter a "crash loop" and fail to start properly.

PetaLibrary BeeGFS allocations are implemented as BeeGFS storage pools
made of 16 ZFS storage targets, with each ZFS storage target assigned
1/16 of the allocation's total quota. However, when a storage target
reaches its quota it becomes unable to accept not only new user data
(as expected) but also new system data. In this case, BeeGFS was
unable to create the necessary "lock.pid" file during beegfs-storage
startup, which prevented beegfs-storage from starting.

```
(0) May07 11:21:47 Main [SessionStore (save)] >> Unable to create session file: /data/boss104/mbigdata/storage/sessions. SysErr: Disk quota exceeded
(0) May07 11:21:47 Main [App] >> Could not store all sessions to file /data/boss104/mbigdata/storage/sessions; targetID: 857
```
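
To illustrate the quota layout with a concrete (hypothetical) example:
a 160 TB allocation striped across 16 storage targets would carry a
10 TB ZFS quota on each target's dataset. The dataset name below is
inferred from the mount path in the log and is an assumption.

```
# Hypothetical: a 160 TB allocation divided across 16 storage targets
# gives each target-level dataset a 10 TB quota, e.g.:
zfs set quota=10T data/boss104/mbigdata   # repeated on each of the 16 targets
```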

Once beegfs-storage had been offline beyond the timeout limit (10
minutes by default), IO operations against BeeGFS began to fail
(usually with an error like "communication error on send").

We resolved this by temporarily increasing the quotas for the affected
storage pool, which allowed beegfs-storage to start, and restored IO
access.
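
The workaround amounted to raising the ZFS quota on the full storage
target just enough for beegfs-storage to create its internal files and
start. A minimal sketch, with a hypothetical dataset name and an
illustrative quota value:

```
# Check current quota and usage on the full storage target (dataset
# name hypothetical, taken from the mount path in the log above):
zfs get quota,used data/boss104/mbigdata

# Temporarily raise the quota so beegfs-storage can write its internal
# files (sessions, lock.pid) and start; value illustrative:
zfs set quota=11T data/boss104/mbigdata
systemctl start beegfs-storage
```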

With the patch applied, we proceeded with the acceptance test. Removal
and re-insertion of the disk proceeded as planned, with no additional
disruption to production.

We then proceeded to reconfigure the network on bmds2 for 40-gigabit
connectivity. Due to a miscommunication within the team, this change
was performed while the mirror server "bmds1" was out of sync. During
the change, "bmds2" experienced an expected, brief network
interruption. This interruption was long enough for the cluster
management software to detect the node as "failed," at which point
STONITH powered "bmds2" off.

```
May 07 12:29:58 [35022] boss1 crmd: notice: tengine_stonith_notify: Peer bmds2 was terminated (off) by bmds1 on behalf of crmd.123291: OK | initiator=boss2 ref=dc5ec3e6-cef6-47be-9b80-fd6953e7eeb1
```
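
The log above shows the fencing decision coming from Pacemaker. For
future planned network work of this kind, one possible precaution
(sketched below, assuming the cluster is administered with the pcs
shell) would be to drain the affected node first so that a brief,
expected interruption is not handled as a live-node failure; whether
this alone prevents fencing depends on the cluster configuration, so
it would need to be validated during a planned outage.

```
# Sketch only; assumes a Pacemaker cluster administered with pcs.
# Move resources off bmds2 before the planned network change:
pcs node standby bmds2

# ... perform the network reconfiguration on bmds2 ...

# Return the node to service once connectivity is confirmed:
pcs node unstandby bmds2
```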

With no good metadata node available, the filesystem once again became
unavailable. We restored "bmds2" to service, but the beegfs-meta
service did not recover automatically.

```
# beegfs-ctl --listtargets --nodetype=meta --longnodes --state
TargetID     Reachability  Consistency   NodeID
========     ============  ===========   ======
       1     Online        Needs-resync  beegfs-meta bmds1 [ID: 1]
       2     Online        Needs-resync  beegfs-meta bmds2 [ID: 2]
```

After research and consultation with upstream support, we were able to
force the known-good metadata server "bmds2" back into the "good"
state, after which service was restored.

```
# beegfs-ctl --setstate --nodetype=meta --nodeid=2 --state=good --force
# beegfs-ctl --listtargets --nodetype=meta --longnodes --state
TargetID     Reachability  Consistency   NodeID
========     ============  ===========   ======
       1     Online        resyncing     beegfs-meta bmds1 [ID: 1]
       2     Online        Good          beegfs-meta bmds2 [ID: 2]
```
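
The listing above shows "bmds1" resyncing from the restored "bmds2."
The sketch below shows how that resync might be watched until both
metadata nodes report a "Good" consistency state; the --resyncstats
mode and the mirror buddy group ID of 1 are assumptions about this
particular deployment.

```
# Watch metadata resync progress (mirror buddy group ID hypothetical):
beegfs-ctl --resyncstats --nodetype=meta --mirrorgroupid=1

# Confirm that both metadata nodes eventually report Consistency "Good":
beegfs-ctl --listtargets --nodetype=meta --longnodes --state
```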


## Next steps

In general, we are re-evaluating our comfort level with the BeeGFS
environment, and will likely move most or all such work to planned
maintenance outages for the foreseeable future. It is our hope that,
as we gain familiarity with BeeGFS and its configuration, we will be
able to do more of this work non-disruptively; but it seems apparent
that we should first build up more instances of non-disruptive success
in isolation before expecting the same in production.

To address the issue of quota status affecting core BeeGFS services,
we will need to adjust where our quota enforcement happens. Today we
enforce quotas on the entire ZFS storage target; we will need to
reconfigure this such that quotas are enforced only on the "chunks"
directory, which contains the actual user data objects. This will
prevent PetaLibrary allocation quotas from affecting internal BeeGFS
service functionality.
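
Because ZFS quotas apply per dataset rather than per directory, this
likely means promoting each target's "chunks" directory to its own
child dataset and moving the quota there, leaving the parent dataset
(which holds BeeGFS internal files such as lock.pid and the sessions
file) unquota'd. A sketch under those assumptions, with hypothetical
dataset names; in practice, existing chunk data would also need to be
migrated into the new dataset.

```
# Sketch only; dataset names hypothetical. ZFS quotas are per dataset,
# so "chunks" becomes its own child dataset carrying the quota, while
# the parent dataset (BeeGFS internal files) remains unquota'd:
zfs create data/boss104/mbigdata/chunks          # if not already a separate dataset
zfs set quota=10T data/boss104/mbigdata/chunks   # per-target share of the allocation
zfs inherit quota data/boss104/mbigdata          # drop the quota from the parent
```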

To address the issue of a double failure in our metadata environment,
we will adopt the practice of taking no maintenance actions on our
bmds servers without first confirming that both servers are in a
known-good state.
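
A minimal sketch of such a pre-maintenance check, using the same
beegfs-ctl query shown above plus the HA cluster status (use of the
pcs shell is an assumption):

```
# Proceed with bmds maintenance only if every metadata node reports
# Consistency "Good" and the HA cluster shows no failed resources:
beegfs-ctl --listtargets --nodetype=meta --longnodes --state
pcs status   # assumes the cluster is administered with pcs
```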

Finally, to address the issue with user-facing IO being so fragile
during backend interruptions, we are investigating the possibility of
reconfiguring the beegfs-client to block, rather than return an
error. This should be possible, but we will test this configuration
during a planned outage to validate our understanding of the
configuration parameters in question.
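
We believe the relevant client tunable may be connCommRetrySecs in
/etc/beegfs/beegfs-client.conf, whose default of 600 seconds matches
the 10-minute timeout observed above; this is a working assumption to
be validated during that planned test. An illustrative sketch of the
change:

```
# /etc/beegfs/beegfs-client.conf (sketch; value illustrative)
# Retry communication with an unreachable server for up to one hour
# before surfacing an IO error to applications, rather than the
# default 600 seconds:
connCommRetrySecs = 3600
```
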
Posted May 08, 2019 - 11:33 MDT
Update
We are continuing to monitor for any further issues.
Posted May 07, 2019 - 14:56 MDT
Update
We are continuing to monitor for any further issues.
Posted May 07, 2019 - 13:59 MDT
Monitoring
A fix has been implemented and BeeGFS appears to be up. We are monitoring the state to ensure everything is in order, and will do our best to return Blanca to service ASAP.

Do be advised that automatic health-check monitoring _also_ prevented jobs from starting on Summit during the outage. This may or may not be correct, and we'll be considering whether this should be the case going forward.

A full postmortem of events will follow.
Posted May 07, 2019 - 13:59 MDT
Identified
A series of unanticipated events has caused multiple outages on PetaLibrary/active. We have stopped Blanca queues while we work to resolve the issue.
Posted May 07, 2019 - 13:04 MDT
This incident affected: Blanca and PetaLibrary.