Project Circe May Update
by Dor Laor
Project Circe is ScyllaDB’s year-long initiative to make Scylla, already the best NoSQL database, even better. We’re sharing our updates for the month of May 2021.
Better failure detection
Failure detection is now done directly by nodes pinging each other rather than through the gossip protocol. This is more reliable and the information is available more rapidly. Impact on networking is low, since Scylla implements a fully connected mesh in all clusters smaller than 256 nodes per datacenter, which is much larger than the typical cluster.
SLA per workload
Scylla supports multiple workloads, some with real-time guarantees, some have batch nature where latency does not matter. Prior work allows Scylla to prioritize workloads by definition of roles and map them to different scheduling groups. We recently researched how Scylla behaves when it is overwhelmed with requests. In such situations, a process called workload shedding comes into play. But which workload should we shed? Of course, it’s best to not take a completely random approach. A good step forward is the new timeout per role feature that allows the user to map workloads to timeouts so the system will choose them for automatic workload shedding.
Configuration is now possible to set timeouts based on a role, using the
SERVICE LEVEL infrastructure. These timeouts override the global timeouts in scylla.yaml, and can be overridden on a per-statement basis.
Virtual table enhancements
Infrastructure for a new style of virtual tables has been merged. While Scylla already supported virtual tables, it was hard to populate them with data. The new infrastructure reuses memtables as a simple way to populate a virtual table.
Off-strategy compaction is now enabled for repair. After repair completes, the SSTables generated by repair will first be merged together, then incorporated into the set of SSTables used for serving data. This reduces read amplification due to the large number of SSTables that repair can generate, especially for range queries where the bloom filter cannot exclude those SSTables.
Off strategy is also important when Repair Based Node Operations (RBNO) is used — this will soon be the default. RBNO pushes repair everywhere — to streaming, remove and node decommission. It’s important to tame compaction accordingly and automatically, so you as an end user wouldn’t even be aware of it.
Repair is now delayed until hints for that table are replayed. This reduces the amount of work that repair has to do, since hint replay can fill in the gaps that a downed node misses in the data set.
This month we took another step towards Raft. Currently Raft group-0 has reached a functional stage. It resembles the functionality etcd provides (you can read more about how we tested this in our April update). The team was required to answer an interesting question — how does a minimal group of cluster members find each other? Raft in Scylla, as opposed to eventual consistency, needs a minimal set of nodes. The seed mechanism will come into play but with a new definition the seed’s UUIDs which will be a must to boot an existing cluster, a better way than to have a split brain right from the get go.
- We fixed a performance problem with many range tombstones.
- Change Data Capture (CDC) uses a new internal table for maintaining the stream identifiers. The new table works better with large clusters.
- Authentication had a 15-second delay, working around dependency problems. But it is long unneeded and is now removed, speeding up node start.
- Repair allocates working memory for holding table rows, but did not consider memory bloat and could over-allocate memory. It is now more careful.
- Scylla uses a log-structured memory allocator (LSA) for memtable and cache. Recently, unintentional quadratic behavior in LSA was discovered, so as a workaround the memory reserve size is decreased. Since the quadratic cost is in terms of this reserve size, the bad behavior is eliminated. Note the reserves will automatically grow if the workload really needs them.
- We fixed a bug in the row cache that can cause large stalls on schemas with no clustering key.
- The setup scripts will now format the filesystem with 1024 byte blocks if possible. This reduces write amplification for lightweight transaction (LWT) workloads. Yes, the Scylla scripts take care of this for you too!
- SSTables will now automatically choose a buffer size that is compatible with achieving good latency, based on disk measurements by iotune.
- The ScyllaDB git repositories has lots of gems in it, one of them is a new project called Scylla Stress Orchestrator — https://github.com/scylladb/scylla-stress-orchestrator/ — which allows you to test Scylla with clients and monitoring using a single command line.
- Another, even competing method is the cloud-formation based container client setup that allows our cloud team to reach two million requests per second in a trivial way. Check out the blog post here, and the load test demo for Scylla Cloud in Github.
- The perf_simple_query benchmark now reports how many instructions were executed by the CPU per query. This is just a unit-benchmark but cool to track!
- The tarball installer now works correctly when SElinux is enabled.
- There is now rudimentary support for code-coverage reports in unit tests. (Coverage may not be the coolest kid in the block but it is not cool not to test properly!)
Scylla Operator for Kubernetes News
Scylla’s Operator 1.2 release was published with helm charts (find it on Github; plus read our blog and the Release Notes). Now 1.3 and 1.4 are in the making. In addition, our Kubernetes deployment can autoscale! An internal demonstration using https://github.com/scylladb/scylla-cluster-autoscaler was presented and you are welcome to play with it.
Scylla Enterprise News
As you may have heard, we released Scylla Enterprise 2021.1, bringing two long-anticipated features:
- Space Amplification Goal (SAG) for Incremental Compaction Strategy allows a user to strike the utilization balance the desire between write amplification (which is more CPU and IO intensive) and disk amplification (which is more storage intensive). SAG can push the ICS volume consumption lower, towards the domain of Level compaction which you still enjoy from size-tiered behavior. This is the default and the recommended compaction strategy.
- New Deployment Options for Scylla Enterprise now include our Scylla Unified Installer and allow you to install anywhere, also with an air-gap environment.
At ScyllaDB, we take security seriously. In this day and age it is vital to establish trust with any 3rd party handling your data, and reports of unsecured databases are still far too common in the news. With our customers’ vital needs foremost in our minds, we set to work ensuring that we met stringent goals established by the industry and confirmed by third party auditors. Scylla is proud to announce that we have successfully conducted a System and Organization Controls 2 (SOC2) Type II report for Scylla Cloud. This applies to all regions where Scylla Cloud can be deployed. (Learn more)