Scylla’s Compaction Strategies Series: Write Amplification in Leveled Compaction

Leveled Compaction Strategy

The Leveled Compaction Strategy was the second compaction strategy introduced in Apache Cassandra. It was first introduced in Cassandra 1.0 in 2011, and was based on ideas from Google’s LevelDB. As we will show below, it solves STCS’s space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request), which we will not further discuss in this post.

  • When we have enough (e.g., 4) sstables in L0, we compact them into L1.
    We do this by compacting all the sstables in L0 together with all the sstables in L1. The result of this compaction is a new run (large sstable split by our chosen size limit, by default 160 MB) which we put in L1, replacing the entire content of L1.
  • The new run in L1 may have more than the desired 10 sstables. If that happens, we pick one sstable from L1 and compact it into L2:
  • A single sstable in L1 is part of a run of 10 files. The whole run covers the entire token range, which means that the single sstable we chose covers roughly 1/10th of the token range. At the same time, each of the L2 sstables covers roughly 1/100th of the token range. So the single L1 sstable we are compacting will overlap around 10 of the L2 sstables.
  • So what we do is to take the single L1 sstable and the roughly 10 L2 sstables which overlap its token range, and compact those together — as usual splitting the result into small sstables. We replace the input sstables with the compaction results, putting the results in L2 (note that L2 remains a single run).
  • After we compacted a table from L1 into L2, now L2 may have more than the desired number of sstables, so we compact sstables from L2 into L3. Again, this involves compacting one sstable from L2 and about 10 sstables from L3.
  • And so on.

Space Amplification in LCS

Let’s explain now why LCS indeed fulfills its ambition to provide low space amplification and therefore indeed solves STCS’s main problem.

Write Amplification

So Leveled Compaction Strategy is the best invention since sliced bread, right?

  1. STCS picks several (e.g., 4) sstables of total size X bytes to compact, and writes the result, roughly X bytes (assuming no overwrites).
  2. LCS picks one sstable, with size X, to compact. It then finds the roughly 10 sstables in the next higher level which overlap with this sstable, and compacts them against the one input sstable. It writes the result, of size around 11*X, to the next level.

Why Does Write Amplification Matter?

For write-only or write-mostly workloads, the importance of not doing 11 times more write I/O is obvious. For such workloads, LCS will have terrible performance, and not be a reasonable choice at all (however, do note that above we saw that some specific types of workloads, those mostly overwriting recently-written data, have low write-amplification in LCS). In the next post, we will look at a different compaction strategy — Hybrid Compaction — which retains STCS’s low write amplification, and which should be considered instead.

  1. As more disk bandwidth is dedicated to writes, read requests are slowed down, which makes LCS’s superior read performance (90% of the read requests need to read from a single sstable) moot.
  2. If disk bandwidth needed by writes exceeds what we can do, LCS compaction can no longer keep up. This could cause LCS to leave too many sstables in L0, and result in unbounded space and read amplification. For this reason, Scylla (and Cassandra) have an emergency mode, where if we have too many sstables piling up in L0, STCS is done inside L0 (to at least replace the many small files thereby fewer, larger, files). When we fall back to STCS in this manner we lose all advantages that LCS had in space and read amplification over STCS.

What’s Next?

In this post, we saw that although the Leveled Compaction Strategy solves the serious space-amplification problem of the Size-Tiered Compaction Strategy, it introduces a new problem of write-amplification. We end up writing the same data to disk over-and-over many more times than STCS did, which in many mixed read-write workloads can cause the disk to not be able to keep up, and both write and read performance can suffer.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
ScyllaDB

ScyllaDB

The monstrously fast and scalable NoSQL database.