Implementing a New IO Scheduler Algorithm for Mixed Read/Write Workloads

Introduction

Being a large datastore, ScyllaDB runs many internal activities that compete for disk I/O. There are data writers (for instance, memtable flushing code or the commitlog) and data readers (most likely fetching data from SSTables to serve client requests). Beyond that, there are background activities, like compaction or repair, which do I/O in both directions. This competition is important to tackle: if I/O is simply issued to the disk without any fairness or balancing considerations, a query serving a read request can find itself buried in the middle of a pile of background writes. By the time it gets the opportunity to run, all that waiting will have translated into increased latency for the read.

Current State / Previous Big Change

Last time we touched this subject, it was about solving the problem of disk throughput and IOPS capacity being statically partitioned between shards. Back then, the challenge was to correctly account for and limit a shared resource within the constraints of the library's shard-per-core design.

Towards Modeling the Disk

When evaluating a disk, one would most likely look at its four parameters: read/write IOPS and read/write throughput (such as in MB/s). Comparing these numbers to one another is a popular way of claiming one disk is better than another, and of estimating the aforementioned "bandwidth capacity" of the drive by applying Little's Law. With that, the scheduler's job is to provide a certain level of concurrency inside the disk to get maximum bandwidth from it, but not to make this concurrency so high that the disk queues requests internally for longer than needed.
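To make this concrete, here is a worked example with hypothetical numbers (not measurements from this post). Little's Law ties the concurrency the scheduler should keep inside the disk to the throughput and latency it aims for:

$$\text{concurrency} = \text{IOPS} \times \text{latency}, \qquad \text{e.g.}\ 100{,}000 \times 0.0005\,\text{s} = 50 \text{ requests in flight}$$

If the scheduler keeps noticeably more than that in flight, the extra requests simply queue inside the disk and inflate latency without adding bandwidth.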

The Data

The model was elaborated by developing a tool, Diskplorer, to collect a full profile of how a disk behaves under all kinds of load. The tool loads the disk with reads and writes of different "intensities" (including pure workloads, where one of the dimensions is literally zero) and collects the resulting request latencies. The result is a five-dimensional space showing how disk latency depends on the { read, write } × { IOPS, bandwidth } values. Drawing such a complex thing on a screen is challenging in itself, so instead the tool renders a set of flat slices, some examples of which are shown below.

The Math

If we try to approximate the above plots with a linear equation, the result looks like this:
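In general form, such a linear approximation combines the four load dimensions, each normalized by the disk's maximum in that dimension, against the rate-limiting factor K that reappears later in this post. This is a sketch of the model's shape, not necessarily the exact notation from the original measurements:

$$\frac{\mathrm{bandwidth}_{read}}{B_{read}} + \frac{\mathrm{bandwidth}_{write}}{B_{write}} + \frac{\mathrm{iops}_{read}}{O_{read}} + \frac{\mathrm{iops}_{write}}{O_{write}} \le K$$

where B and O denote the disk's maximum bandwidth and IOPS in each direction, and K ≤ 1 acts as a safety factor: as long as the combined normalized load stays below K, the measured latencies stay within the goal.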

The Code

There’s a well developed algorithm called token bucket out there. It originates from telecommunications and networking and is used to make sure that a packet flow doesn’t exceed the configured rate or bandwidth.

Results

The results can be seen from two angles. First, whether the scheduler does its job and keeps the bandwidth and IOPS inside the "safety area" from the initial profile measurements. Second, whether this really helps, i.e., whether the disk shows good latencies or not. To check both, we ran a cassandra-stress test over two versions of ScyllaDB: 4.6 with its old scheduler and 5.0 with the new one. There were four runs in a row for each version: the first run populated ScyllaDB with data, and the subsequent ones queried this data back while "disturbed" by background compaction (because we know that an ordinary workload is too easy for a modern NVMe disk to handle). The querying was performed at different client request rates: 10k, 5k, and 7.5k requests per second.

Latency Implications

Seastar keeps track of two latencies: in-queue and in-disk. The scheduler's goal is to maintain the latter. The in-queue latency can grow as high as it wants; it's up to the upper layers to handle it. For example, if requests happen to wait too long in the queue, ScyllaDB activates a so-called "backpressure" mechanism, which involves several techniques and may end up canceling some requests sitting in the queue.
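A minimal sketch of the distinction (hypothetical types, not Seastar's code): a request's lifetime is split at the dispatch point, giving the two separately tracked intervals.

```cpp
#include <chrono>

// Illustrative sketch with hypothetical names (not Seastar's code):
// a request's lifetime is split at the dispatch point, giving the two
// separately tracked latencies mentioned above.
struct io_request_timing {
    using clock = std::chrono::steady_clock;

    clock::time_point submitted;  // request entered the scheduler queue
    clock::time_point dispatched; // scheduler sent it to the disk
    clock::time_point completed;  // disk reported completion

    clock::duration in_queue() const { return dispatched - submitted; } // upper layers' concern
    clock::duration in_disk() const { return completed - dispatched; }  // scheduler's target
};
```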

What’s Next

The renewed scheduler will come as part of a host of other great improvements in 5.0. However, some new features should be added on top. For example, the metrics list will be extended to report the accumulated costs described above that each scheduling class has collected so far. Another important feature is adjusting the rate-limiting factor (the K constant) at runtime, to address disks aging out and any potential mis-evaluations we might have made. And finally, to make the scheduler usable on slow drives, like persistent cloud disks or spinning HDDs, we'll need to tune the latency goal the scheduler aims at.

Discover More in ScyllaDB V

ScyllaDB V is our initiative to make ScyllaDB, already the most monstrously fast and scalable NoSQL database available, into even more of a monster to run your most demanding workloads. You can learn more about this in three upcoming webinars.
