Haystack aka needleheap, Facebook's object storage for photos

Based on this Facebook needleheap paper.

Numbers

Facebook stores 4 different sizes of each uploaded photo. They use a replication factor of 3, so every photo upload results in 4 × 3 = 12 stored files.

They noticed that file lookup in a directory with thousands of files is slow, so they limited directory size to 100.

Other numbers

The most recently written files have a high probability of being read. 25% of files are removed within a year. Lots of reads occur within the first few days, but there’s a very long tail.

System design

Write:
The directory maintains the mapping between logical volumes, machines, and file IDs. (Note the 3 arrows in the write diagram, signifying 3 synchronous writes.) The store consists of append-only volumes, with index files lying nearby.
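To make the write path concrete, here is a toy sketch of a directory that maps a logical volume to its machines and a write that goes to every replica before returning. All names (`Directory`, `assign`, `write_photo`, the `stores` dict) are made up for illustration, and how the real directory hands out IDs and picks volumes is not covered by these notes.

```python
class Directory:
    """Toy directory: logical volume -> machines, and photo ID -> logical volume."""

    def __init__(self, volume_to_machines):
        self.volume_to_machines = volume_to_machines
        self.photo_to_volume = {}
        self.next_photo_id = 1  # assumption: IDs are simple counters

    def assign(self, logical_volume):
        photo_id = self.next_photo_id
        self.next_photo_id += 1
        self.photo_to_volume[photo_id] = logical_volume
        return photo_id, self.volume_to_machines[logical_volume]


def write_photo(directory, stores, logical_volume, data):
    photo_id, machines = directory.assign(logical_volume)
    # One write per replica; the upload returns only after all of them are done.
    for machine in machines:
        stores[machine].setdefault(logical_volume, []).append((photo_id, data))
    return photo_id


directory = Directory({"v1": ["m1", "m2", "m3"]})
stores = {"m1": {}, "m2": {}, "m3": {}}
photo_id = write_photo(directory, stores, "v1", b"jpeg bytes")
```

With 3 machines per logical volume, one call ends up as 3 appended records, one per machine.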

A volume file consists of a header (no details given) and records that they call “needles” for no good reason. Each record consists of the file size, the file itself, a flag saying whether the file was deleted, and the ID.
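A minimal sketch of such a record, using only the four fields listed above. The field order, the widths (8-byte ID, 1-byte deleted flag, 4-byte size) and the function names are my assumptions, not the layout from the paper.

```python
import struct

# Hypothetical fixed-size header: 8-byte ID, 1-byte deleted flag, 4-byte data size.
HEADER = struct.Struct("<QBI")


def encode_needle(photo_id: int, data: bytes, deleted: bool = False) -> bytes:
    """Serialize one record: header followed by the file bytes."""
    return HEADER.pack(photo_id, int(deleted), len(data)) + data


def decode_needle(buf: bytes, offset: int = 0):
    """Read the record at `offset`; also return the offset of the next record."""
    photo_id, deleted, size = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    data = buf[start:start + size]
    return photo_id, bool(deleted), data, start + size
```

Because the size is stored in front of the data, a reader can walk a volume record by record by feeding the returned offset back into `decode_needle`.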

Writes to volumes are append-only, the files are preallocated thanks to XFS, and the server keeps a file handle open to each volume. Writes to volumes are strictly synchronous: the disk buffer is disabled.
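A rough sketch of an append-only, synchronous write. Here `os.fsync` stands in for the "disabled disk buffer", which is an approximation and not the same thing, and XFS preallocation is not shown. `open_volume` and `append_sync` are hypothetical names.

```python
import os


def open_volume(path: str) -> int:
    # O_APPEND makes every write land at the current end of the file.
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)


def append_sync(fd: int, record: bytes) -> int:
    """Append one record, wait until it is flushed, and return its start offset."""
    offset = os.lseek(fd, 0, os.SEEK_END)  # where this record will start
    view = memoryview(record)
    while view:  # os.write may write fewer bytes than requested
        written = os.write(fd, view)
        view = view[written:]
    os.fsync(fd)  # block until the OS reports the data as flushed
    return offset
```

The descriptor returned by `open_volume` would be kept open for the lifetime of the server, one per volume, and the returned offset is what would later go into the index file.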

Writes to the index file are asynchronous. The index file contains no delete info.

The API is: put, get, delete.

Rewrite

An object can be rewritten by a put operation.

If it’s in the same volume, the new version is appended, and the space taken by the old version is reclaimed by a future compaction.
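The same-volume case can be sketched with an in-memory stand-in for a volume and its index. The class and method names are hypothetical, and the compaction here simply copies live records into a new list, which is my guess at the idea and not the paper's procedure.

```python
class Volume:
    """In-memory stand-in for one append-only volume plus its index."""

    def __init__(self):
        self.log = []    # appended records: [photo_id, data, deleted]
        self.index = {}  # photo_id -> position of the latest record in self.log

    def put(self, photo_id, data):
        # A rewrite is just another append; the index moves to the new record.
        self.log.append([photo_id, data, False])
        self.index[photo_id] = len(self.log) - 1

    def get(self, photo_id):
        pos = self.index.get(photo_id)
        if pos is None or self.log[pos][2]:
            return None
        return self.log[pos][1]

    def delete(self, photo_id):
        pos = self.index.get(photo_id)
        if pos is not None:
            self.log[pos][2] = True  # flag in the volume; space is not freed yet

    def compact(self):
        # Keep only the latest, non-deleted record of each photo.
        live = [rec for pos, rec in enumerate(self.log)
                if self.index.get(rec[0]) == pos and not rec[2]]
        self.log = live
        self.index = {rec[0]: pos for pos, rec in enumerate(live)}
```

After `put(1, ...)`, `put(1, ...)` again and `delete(2)`, the log still holds every old record; only `compact()` shrinks it.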

If it’s on a different volume (say, if the file is old and resides on a read-only volume), it’s unclear whether the deleted photo’s space will ever be reclaimed. It seems that this paper doesn’t really go into details of GDPR compliance.

Recovery

Everything can be rebuilt from volume files.

You can restart from index files.

An elegant solution: when a server is restarted, the volume files need to be scanned for writes that didn’t make it into the index files, but only after the last offset recorded in the index file. Everything before that offset comes from the index files, which makes the restart fast.

The index file contains the ID and the offset, plus a couple of other fields.
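The restart logic can be sketched like this: replay the index file first, then scan the volume only past the last indexed record. Both binary layouts (`NEEDLE_HEADER`, `INDEX_ENTRY`) and the function name `recover` are assumptions made up for the example, and the volume header is ignored.

```python
import struct

NEEDLE_HEADER = struct.Struct("<QBI")  # hypothetical: ID, deleted flag, data size
INDEX_ENTRY = struct.Struct("<QQI")    # hypothetical: ID, offset, data size


def recover(volume: bytes, index: bytes) -> dict:
    """Rebuild the ID -> (offset, size) map after a restart."""
    mapping = {}
    resume_at = 0
    # Fast path: replay the index file.
    for pos in range(0, len(index) - INDEX_ENTRY.size + 1, INDEX_ENTRY.size):
        photo_id, offset, size = INDEX_ENTRY.unpack_from(index, pos)
        mapping[photo_id] = (offset, size)
        resume_at = max(resume_at, offset + NEEDLE_HEADER.size + size)
    # Slow path: scan only the records written after the last indexed one.
    offset = resume_at
    while offset + NEEDLE_HEADER.size <= len(volume):
        photo_id, deleted, size = NEEDLE_HEADER.unpack_from(volume, offset)
        end = offset + NEEDLE_HEADER.size + size
        if end > len(volume):
            break  # truncated last record, ignore it
        if deleted:
            mapping.pop(photo_id, None)
        else:
            mapping[photo_id] = (offset, size)
        offset = end
    return mapping
```

With an empty index the scan starts at offset 0, which is the "everything can be rebuilt from volume files" case. Since the index carries no delete info, a record that was indexed and deleted later still shows up in the map; the deleted flag would have to be checked in the volume when the record is read.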

Failure recovery is mostly manual.

Wrong name

Their haystack consists of needles only. It’s a needle heap.