Project Circe June Update

by Peter Corless

Summer’s here! Which means that we’re getting ready for our Scylla University LIVE Summer School session. We hope to meet you all there. Meanwhile, behind the scenes we’ve been diligently working to deliver new software across our product set — the database itself, drivers (Rust 0.2), our k8s operator, Spark Migrator, and the list goes on.

Project Circe aims to make Scylla, already a kickass database, even better. With that goal in mind, here’s a look at our progress for the month of June.

Scylla Open Source 4.5 Coming Soon!

We’re on the verge of releasing Scylla Open Source 4.5 (following RC2, which went out in early June). Let’s have a look at the new features and capabilities you can look forward to in the coming release.

Load and Stream SSTables

This feature extends nodetool refresh to allow loading arbitrary SSTables, which will make restorations and migrations much easier. You can take an SSTable from one cluster and place it on any node in the new cluster. When you trigger the load and stream process, it distributes and streams the data across the nodes in the new cluster. Previously, you had to carefully place the SSTable on every node that owned any of the key ranges it contained. With 4.5, this feature does the job for you.

For example, you could take SSTables created on a cluster of 9 small nodes, then load and stream them across a cluster of 5 large nodes. Best of all, there’s no need to run nodetool cleanup afterwards to remove unused data.
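
To make the idea concrete, here is a toy Python model of what load and stream has to do conceptually: read each partition from an SSTable and send it to whichever nodes own its token. The ring layout, the hash function, and every name below are illustrative assumptions, not Scylla's actual partitioner or streaming code.

```python
import bisect
import hashlib

RING_SIZE = 2**32  # toy token space; Scylla's real token space differs


def token_for(key: str) -> int:
    """Map a partition key to a token (toy hash, not Scylla's partitioner)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


def build_ring(nodes):
    """Give each node one evenly spaced token and return a sorted ring."""
    step = RING_SIZE // len(nodes)
    return sorted((i * step, node) for i, node in enumerate(nodes))


def owners(ring, token, replication_factor=1):
    """Return the first node at or after the token, plus its successors."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


def load_and_stream(sstable_rows, ring, replication_factor=1):
    """Route every row of an 'SSTable' (a dict here) to the nodes that own it."""
    placement = {node: {} for _, node in ring}
    for key, value in sstable_rows.items():
        for node in owners(ring, token_for(key), replication_factor):
            placement[node][key] = value
    return placement


# Stand-in for SSTable contents written on some other cluster...
sstable = {f"user-{i}": {"visits": i} for i in range(1000)}
# ...loaded into a 5-node cluster, regardless of which node received the file.
new_ring = build_ring([f"node-{i}" for i in range(5)])
placement = load_and_stream(sstable, new_ring, replication_factor=3)
```

In this model every node ends up holding only the rows it owns, which is the intuition behind not needing a cleanup pass afterwards.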

Project Alternator

We’re making improvements to our DynamoDB-compatible API in a number of ways:

  • The sstableloader utility will work with Alternator tables, beginning with 4.5.
  • Cross-Origin Resource Sharing (CORS) will allow client browsers to access the database via JavaScript, avoiding a middle tier.
  • You will be able to limit maximum concurrency, with queries exceeding that concurrency returning a RequestLimitExceeded error.
  • Nested attribute paths will allow the modification of just an object’s attributes instead of the entire object.
  • Slow query logging will allow you to find queries that exceed a threshold and log them to system_traces.node_slow_log.
  • Attribute paths will be supported in ConditionExpression, FilterExpression, and ProjectionExpression.
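
As a rough illustration of nested attribute paths, the sketch below builds the parameters for a DynamoDB-style UpdateItem request that changes a single nested attribute rather than rewriting the whole object. The table name, key, and attribute names are hypothetical, and values use the low-level typed format (for example {"S": "..."}).

```python
def nested_update_params(table, key, path, value):
    """Build UpdateItem parameters that set one nested attribute.

    `path` is a list of attribute names, e.g. ["address", "city"].
    Each name is aliased with a #placeholder so reserved words are safe.
    """
    names = {f"#p{i}": part for i, part in enumerate(path)}
    expression_path = ".".join(names)  # e.g. "#p0.#p1"
    return {
        "TableName": table,
        "Key": key,
        "UpdateExpression": f"SET {expression_path} = :v",
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": {":v": value},
        # Only apply the update if the top-level attribute already exists.
        "ConditionExpression": "attribute_exists(#p0)",
    }


# Hypothetical table and item, used only for illustration.
params = nested_update_params(
    table="users",
    key={"user_id": {"S": "42"}},
    path=["address", "city"],
    value={"S": "Lisbon"},
)
```

With an SDK such as boto3, a dictionary like this could be passed as client.update_item(**params) to a client configured with an Alternator endpoint; the call itself is left out here because it needs a running cluster.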

Raft

The Raft implementation in Scylla is a core deliverable of Project Circe. While the changes made to the Scylla infrastructure in 4.5 will have no visible effect yet, we're adding the major building blocks upon which a number of future capabilities will be delivered.

  • Schema Tables on Shard 0: Until now, Scylla has stored the database schema in a set of tables that are sharded across all cores like ordinary user tables. With Scylla 4.5, this schema data will be maintained by shard 0 alone. This is the first step toward letting Raft manage the schema tables.
  • Log Data: Raft will now be able to store its log data in a system table, implemented in a modular fashion.
  • Joint Consensus: Now merged, this provides the ability to change a Raft group from one set of nodes to another, which is a prerequisite for cluster topology changes and for migrating data to different nodes.
  • Additional changes to the Raft implementation provide support for non-voting nodes, per-server timers, and leader step-down.
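
The core rule of joint consensus, as described in the Raft paper, is that while a group is moving from an old set of nodes to a new one, any decision needs a majority in both sets. The Python sketch below illustrates only that rule; the function and node names are made up for the example, and this is not Scylla's implementation.

```python
def has_majority(config, acks):
    """True if more than half of the nodes in `config` acknowledged."""
    return len(config & acks) > len(config) // 2


def joint_quorum(old_config, new_config, acks):
    """During a membership change, require a majority of BOTH configurations."""
    return has_majority(old_config, acks) and has_majority(new_config, acks)


old = {"a", "b", "c"}
new = {"c", "d", "e"}

# A majority of the old set alone is not enough...
print(joint_quorum(old, new, acks={"a", "b"}))       # False
# ...nor is a majority of the new set alone.
print(joint_quorum(old, new, acks={"d", "e"}))       # False
# Majorities in both sets (b, c from old; c, d from new) are sufficient.
print(joint_quorum(old, new, acks={"b", "c", "d"}))  # True
```

Requiring both majorities means the old and new node sets can never make conflicting decisions independently while the change is in flight.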

Change Data Capture (CDC)

We are thrilled that our users are eagerly looking for ways to leverage the new CDC capabilities in Scylla. (Have a look at our recent webinar with Confluent on how to build event streaming architectures using CDC with Kafka.) This month we optimized CDC enablement on large Scylla clusters with many partitions and streams in two ways. First, we limited the number of streams, though this does incur some loss of efficiency. Second, we adopted a new format that uses partitions and clustering rows.

CDC is also an official part of the Enterprise 2021 release, and in July it will be fully integrated with Scylla Cloud.

Other June Releases

Beyond Scylla Open Source, we also provided a new update to our Scylla Enterprise 2021 release, as well as updates to our supporting applications and utilities.

Velocity of Software Delivery

We often hear from Scylla users that the velocity of software delivery matters to them when deciding what infrastructure components to implement in their ecosystems. In the first half of this year we delivered Scylla Open Source 4.3 and 4.4, and early in the second half we will deliver 4.5. This strong, steady release cadence allows us to add new capabilities while also fixing bugs at a rapid and regular clip.

Meanwhile, Scylla Enterprise, offered as a separate deliverable, allows us to perform even more extensive testing for the resiliency and maturity needed for production-readiness.

If the frequency of software delivery is also a major concern of yours, here's an interesting way to compare our team's output to that of a few other well-known open source big data projects. All information here is for the month of June 2021 (30 May to 30 June, to be precise):

scylladb/scylla (500k lines of code)

  • 28 authors pushed
  • 383 commits for the month
  • 1,487 files were changed

apache/spark (2.1m lines of code)

  • 84 authors pushed
  • 393 commits for the month
  • 1,441 files were changed

apache/kafka (892k lines of code)

  • 52 authors pushed
  • 122 commits for the month
  • 737 files were changed

apache/cassandra (1m lines of code)

  • 20 authors pushed
  • 49 commits for the month
  • 140 files were changed
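
One way to put these figures on a comparable footing is to normalize commits by the number of authors. The short Python sketch below does that arithmetic using only the numbers listed above; commits per author is a crude measure chosen here for illustration, not a metric from the original reports.

```python
# Figures copied from the lists above (30 May to 30 June 2021).
projects = {
    "scylladb/scylla": {"authors": 28, "commits": 383, "files": 1487},
    "apache/spark": {"authors": 84, "commits": 393, "files": 1441},
    "apache/kafka": {"authors": 52, "commits": 122, "files": 737},
    "apache/cassandra": {"authors": 20, "commits": 49, "files": 140},
}


def commits_per_author(stats):
    """Average number of commits per author for the month."""
    return stats["commits"] / stats["authors"]


for name, stats in projects.items():
    print(f"{name}: {commits_per_author(stats):.2f} commits per author")
```

By this measure Scylla comes out at about 13.7 commits per author for the month, against about 4.7 for Spark, 2.3 for Kafka, and 2.45 for Cassandra.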

To break the progress in our code base down into some salient real-world examples, we recommend checking out our CTO Avi Kivity's series entitled “Last week in scylla.git master,” which included these notable changes over the past month:

  • June 06 — featuring a new process for making Docker images
  • June 13 — which enables off-strategy compaction for bootstrap and replace operations
  • June 20 — changes to how range tombstones are internally represented
  • June 27 — making the bootstrap process more robust

Sign Up for Scylla University LIVE!

We look forward to seeing you at Scylla University LIVE for our Summer Session. This is an event you won’t want to miss. Besides the tracks about Scylla operations and development, we’re also going to have sessions devoted to hooking up Scylla to the rest of your big data architecture, including integrating it with Apache Spark and Apache Kafka.

You can read more about the Scylla University LIVE agenda, as well as other new developments, at Scylla University. In the meantime, don't forget to reserve your seat in our live, online classes coming up July 28th and 29th. Until we next meet, enjoy your summer!

REGISTER FOR THE SCYLLA UNIVERSITY LIVE SUMMER SESSION
