Nearly full metadata storage for PetaLibrary/active (BeeGFS)

Incident Report for CU Boulder RC

Resolved

Our now-secondary metadata server, bmds2, has also been updated and is resyncing from bmds1. No further intervention is expected in the immediate future, though additional metadata disks were proactively ordered and are likely to be scheduled for incorporation after they arrive.

No interruption to file system accessibility was observed at any point during this operation, though I/O did block (pause) for up to 3 minutes at a time during the beegfs-mgmtd restart and the beegfs-meta failover. No jobs are expected to have been significantly impacted by these events, except in the theoretical case where this small runtime extension caused a job to reach its runtime limit before completion.

Thank you for your understanding as we responded to this issue.
Posted May 25, 2019 - 23:12 MDT

Update

We have successfully reformatted our secondary metadata server with an appropriate block and inode size to more efficiently house our BeeGFS metadata, and have failed over to it as the primary. After a few experiments we settled on 512B inodes and 1024B blocks, bringing our typical use down from 4608B/file to 1536B/file. We are now at 37% block and 21% inode utilization, down from 96% and 51% respectively.
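
For anyone curious how these figures relate, here is a minimal sketch of the arithmetic; the assumption that the stripe-pattern extended attribute spills into exactly one filesystem block per file is a simplification on our part, and directory entries and other overhead are ignored, so treat it as a rough consistency check rather than an exact model.

```
# Per-file metadata footprint before and after the reformat, assuming the
# stripe-pattern xattr does not fit in a 512 B inode and spills into one block.
def per_file_bytes(inode_size, block_size):
    return inode_size + block_size

old = per_file_bytes(inode_size=512, block_size=4096)   # 4608 B/file
new = per_file_bytes(inode_size=512, block_size=1024)   # 1536 B/file

print(f"old {old} B/file, new {new} B/file, ratio {new / old:.0%}")  # ~33%
# A per-file footprint of roughly one third the previous value is broadly
# consistent with block utilization falling from 96% to 37%; directory
# metadata and other overhead account for the remainder.
```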

No outage was detected as a result of this work. There were brief pauses in I/O as services were restarted, but no job failures are expected to have occurred.

We will proceed with making this modification on bmds2 (now the secondary, but currently offline), likely over the weekend, after I have had some rest. bmds2 will require an additional kernel update, already present on bmds1, to overcome a previously diagnosed issue affecting XFS during BeeGFS metadata resync. (This update was otherwise scheduled to be applied during the upcoming planned outage, but is required for bmds2 to be reliably returned to service as the secondary MDS.)
Posted May 25, 2019 - 04:59 MDT

Update

We have successfully reformatted our secondary metadata server with 1024-byte inodes. We are checking our observations with development, and the secondary is currently resyncing from the primary.
Posted May 24, 2019 - 15:41 MDT

Update

Cleaning orphaned file data from BeeGFS has not reduced our primary metadata consumption below 96%. As such, we need to proceed with our next plan: reformatting our secondary metadata server with a 1024-byte inode size, rebuilding our metadata onto it, and then promoting it to primary.

This should be a zero-downtime operation (though there may be momentary pauses in I/O as services are stopped and started). We acknowledge that there have been unplanned outages recently when we have taken administrative actions on the production system. That said, we feel we must act because the risk to the file system, should we run out of metadata space, is too great, especially in light of the upcoming long weekend.

We have planned out our procedure; in the interest of being as transparent as possible, I am including it here:

- shut down beegfs-meta on our secondary bmds and confirm the fs remains accessible [we have done this before]
- back up beegfs config files from secondary bmds storage
- reformat secondary bmds storage
- restore beegfs config files to secondary bmds storage
- start beegfs-meta on secondary bmds
- observe that beegfs-meta is resyncing
- proactively restart beegfs-mgmtd and disable automatic restart [further restarts disrupt the resync process]
- wait for beegfs-meta resync to complete and observe occupancy reduction on the secondary (see the occupancy-watcher sketch after this list)
- shut down the primary bmds and confirm that the secondary bmds becomes primary and the fs remains accessible
- re-activate beegfs-mgmtd automatic restart
- reformat now-secondary (previously primary) bmds and return to service
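
For the resync-monitoring step above, this is the sort of minimal occupancy watcher we can leave running against the metadata backend on the secondary bmds; the mount point and polling interval below are placeholders rather than our actual configuration, and the percentages are only a close approximation of what df reports.

```
# Minimal block/inode occupancy watcher for the XFS backend that holds our
# BeeGFS metadata. The mount point is a placeholder; substitute the real
# beegfs-meta storage path on the secondary bmds.
import os
import time

MOUNT = "/data/beegfs/meta"   # placeholder path, not our actual layout
INTERVAL = 60                 # seconds between samples

while True:
    st = os.statvfs(MOUNT)
    block_used = 100.0 * (1 - st.f_bfree / st.f_blocks)   # ~df "Use%"
    inode_used = 100.0 * (1 - st.f_ffree / st.f_files)    # ~df -i "IUse%"
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp}  blocks {block_used:5.1f}%  inodes {inode_used:5.1f}%")
    time.sleep(INTERVAL)
```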
Posted May 24, 2019 - 14:25 MDT

Update

There appears to be someone actively creating a high volume of files on BeeGFS (/pl/active/). If you believe that might be you, please contact us immediately at rc-help@colorado.edu. We would like to try to temporarily move your workload to a different server, or postpone it if at all possible while we sort out the current state.
Posted May 24, 2019 - 11:07 MDT

Update

We have observed that over the last 48 hours our BeeGFS metadata storage utilization has jumped from 93% to 96%. This is a significant jump that we must respond to immediately, despite our desire to do no maintenance outside of a planned outage.
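
As a rough illustration of why we consider this urgent, here is the back-of-the-envelope extrapolation we are working from; it assumes the recent growth simply continues linearly, which it may not.

```
# Linear extrapolation of metadata block occupancy from the figures above.
used_then, used_now = 93.0, 96.0   # percent, measured 48 hours apart
hours_elapsed = 48.0

rate = (used_now - used_then) / hours_elapsed    # ~0.06 percentage points/hour
hours_to_full = (100.0 - used_now) / rate        # ~64 hours
print(f"~{hours_to_full:.0f} hours (~{hours_to_full / 24:.1f} days) to 100% "
      "at the current rate")
```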

We now understand why our metadata utilization is higher than expected, and details are below; but our first action will be to run a BeeGFS maintenance command to identify and remove orphaned files that are no longer resident in the file system but are nonetheless taking up space on the backend storage. This may allow the primary metadata server to return to a utilization ratio equivalent to what we see on the secondary (85%).

Should we need to take additional action once that's complete, we will advise here in a further message.

We are currently only considering actions that can be completed with zero downtime, and only after careful planning and consideration; but should we accidentally provoke an outage, we will advise here.

--

We believe we are seeing higher metadata storage consumption because, while we are using 512-byte inodes, the BeeGFS extended attributes on each inode require the allocation of an additional 4 KiB data block. As a result, the vast majority of our inodes are consuming 4,608 bytes rather than the expected 512 bytes. This is because our files have a stripe width of 16 (i.e., each file writes data in parallel to 16 storage targets), whereas a 512-byte inode can only internally accommodate a stripe width of 4.

We have been advised that a 1024-byte (1 KiB) inode would be able to accommodate a stripe width of 16 internally, bringing per-file metadata consumption to roughly 22% of its current effective value. This will require reformatting our metadata storage file systems, which, given our redundant "buddy mirror" configuration, should be possible to perform live (first on our secondary, then on our primary). Given our experiences so far trying to perform maintenance on BeeGFS live, we had hoped to do this during our upcoming planned outage; but if we cannot bring utilization down otherwise, we may have to proceed immediately.
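
To illustrate the arithmetic (treating the on-disk layout very simply: the stripe-pattern extended attribute either fits inside the inode or spills into exactly one data block, which is our simplifying assumption):

```
# Why each file currently costs about 4,608 bytes of metadata, and why a
# 1 KiB inode is expected to bring that to roughly 22% of the current figure.
INODE_CURRENT = 512    # bytes; the stripe-width-16 xattr does not fit, so it
FS_BLOCK = 4096        # spills into one full data block
INODE_PROPOSED = 1024  # bytes; large enough to hold the xattr internally

current_per_file = INODE_CURRENT + FS_BLOCK   # 4608 bytes/file
proposed_per_file = INODE_PROPOSED            # 1024 bytes/file, no spill block

print(f"{proposed_per_file / current_per_file:.0%} of current per-file use")  # ~22%
```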

We also intend to move our file system from a stripe width of 16 to a stripe width of 4; but this restripe is a heavy, long-term operation, and unlikely to resolve our issue in the short term.

Finally, we have already ordered additional metadata storage; but we do not know how long it will take for the order to clear CU purchasing, or how long it will take for the disks to arrive. Probably not long, but given our current situation, we should not wait.
Posted May 24, 2019 - 11:00 MDT

Monitoring

We are monitoring a potential near-full condition on our PetaLibrary/active BeeGFS file system. The metadata servers are reporting that block storage is nearly at capacity, while simultaneously reporting more than 50% free space for inode storage.

At this time, the file system remains accessible, but we will continue to monitor the reported free space while we investigate the discrepancy. We are also procuring additional metadata storage drives to mitigate any risk of an event should metadata storage use spike during our investigation.
Posted May 20, 2019 - 12:03 MDT
This incident affected: PetaLibrary.