MMS • RSS
Facebook open sourced their internal distributed log storage project called LogDevice. It offers high write availability using replication, durable log storage and recovery from failure.
Most of Facebook’s applications that perform logging require high write availability, durable storage of logs, and workloads that vary in terms of performance and latency requirements. Another important requirement was to be able to survive hardware failures. An older Facebook project called Scribe was more focused on aggregating logs to central storage, and there were cases where data loss could occur. Scribe now uses LogDevice as a log storage backend.
Facebook uses LogDevice internally in its datacenters for stream processing pipelines, distribution of database index updates, machine learning pipelines, replication pipelines, and durable task queues where it ingests over 1TB/sec of data. Although Facebook has built a lot of open source tools to manage LogDevice clusters, they are yet to open source any of those except a basic toolset at this point. The LDShell tool allows cluster management from the command line, and the complementary LDQuery command can be used to view cluster statistics.
LogDevice uses the abstraction of a “log record” to demarcate individual log events, with each record assigned a unique id called a Log Sequence Number (LSN). The LSNs are generated based on an epoch number by a component called “Sequencer”, and the epoch numbers are stored in ZooKeeper. The log store is append only, i.e., once written a record cannot be modified. Like most log storage systems, LogDevice performs “trimming” – log rotation based on a time or a space based policy. It can also trim logs on demand. Apart from this, there are no restrictions on how long logs can be stored.
LogDevice achieves high availability, especially write availability, by storing multiple copies of each log record on different machines. Each record can be replicated across 20-30 storage nodes. However, a spike in the number of writes to a single log would limit the throughput if some of the machines that have copies of that log are slow or unavailable. LogDevice can automatically detect which nodes are down and exclude them from writes for new records. It attempts to minimize the impact of hardware failures as much as possible by replication, and by “rebuilding” the lost copies as fast as possible. During such a rebuilding, restoration can be done for “the replication factor of all records affected by the failure at a rate of 5-10GB per second.” The underlying storage is based on RocksDB, a key value store also open sourced by Facebook.
LogDevice’s team also had to deal with challenges in which they discovered that users of LogDevice would perform backfills, where older data hours or days old would be requested. This was triggered by downstream services that consume logs from LogDevice, and backfilling would happen when these services recovered from failures and had to reprocess the logs. These read spikes are handled by spreading the read load across members of a “node set”, which is the set of nodes that will store a given record.
LogDevice has been compared with other log storage systems like Apache BookKeeper and Apache Kafka. The primary difference with Kafka seems to be the decoupling of computation and storage that LogDevice does to be able to handle Facebook’s scale. LogDevice is written in C++ and hosted on GitHub.