Scylla Open Source Release 4.4

by Tzach Livyatan

The Scylla team is pleased to announce the release of Scylla Open Source 4.4, a production-ready release of our open source NoSQL database.

Scylla is an open source, NoSQL database with superior performance and consistently low latencies.

Scylla 4.4 includes performance and stability improvements, as well as bug fixes (listed below).

Find the Scylla Open Source 4.4 repository for your Linux distribution here. Scylla 4.4 is also available as a Docker image, an EC2 AMI, and a GCP image.

Please note that only the last two minor releases of the Scylla Open Source project are supported. Starting today, only Scylla Open Source 4.4 and Scylla 4.3 are supported, and Scylla 4.2 is retired.

New Features

Timeout per Operation

There is now a new syntax for setting timeouts for individual queries with “USING TIMEOUT”. #7777

This is particularly useful for queries that are known to take a long time. Until now, you could either increase the timeout value for the entire system (with request_timeout_in_ms), or keep it low and see many timeouts for the longer queries. The new Timeout per Operation allows you to define the timeout in a more granular way. Conversely, some queries might have tight latency requirements, in which case it makes sense to set their timeout to a small value. Such queries time out faster, which means that they won’t needlessly hold the server’s resources.

You can use the new TIMEOUT parameter for both queries (SELECT) and updates (INSERT, UPDATE, DELETE).

Examples:

SELECT * FROM t USING TIMEOUT 200ms;
INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;

Prepared statements work as usual; the timeout parameter can be explicitly defined or provided as a marker:

SELECT * FROM t USING TIMEOUT ?;
INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
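As a rough illustration of where the clause goes, the Python sketch below appends a timeout to SELECT and INSERT statement text, joining with AND when a USING clause (such as USING TIMESTAMP) is already present. The helper name with_timeout is hypothetical and not part of any Scylla driver, and the accepted duration units (ms and s) are an assumption; only the statement shapes shown in the examples above are handled.

```python
import re

# Hypothetical helper for illustration only: it builds CQL text and is not
# part of any Scylla driver. Accepted units (ms, s) are an assumption.
_DURATION = re.compile(r"^\d+(ms|s)$")


def with_timeout(statement: str, timeout: str) -> str:
    """Return the statement with a per-operation TIMEOUT appended.

    `timeout` is either a duration literal such as "200ms" or the bind
    marker "?" for prepared statements.
    """
    if timeout != "?" and not _DURATION.match(timeout):
        raise ValueError(f"unsupported timeout literal: {timeout!r}")
    body = statement.strip().rstrip(";")
    words = body.split()
    # Only SELECT and INSERT take the clause at the end of the statement,
    # which is the only placement this sketch knows about.
    if not words or words[0].upper() not in ("SELECT", "INSERT"):
        raise ValueError("this sketch only handles SELECT and INSERT")
    # Extend an existing USING clause with AND, otherwise start a new one.
    has_using = re.search(r"\bUSING\b", body, re.IGNORECASE) is not None
    joiner = " AND " if has_using else " USING "
    return f"{body}{joiner}TIMEOUT {timeout};"
```

Calling with_timeout("SELECT * FROM t", "200ms") reproduces the first example above. In real code the statement text would usually be written by hand; the sketch only makes the USING ... AND ... composition rule explicit.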

More Active Client Info

The system.clients table includes information on actively connected clients (drivers). Scylla 4.4 adds more fields to the table:

  • connection_stage
  • driver_name
  • driver_version
  • protocol_version

It also improves:

  • client_type – distinguishes CQL from Thrift just in case
  • username – now it displays the correct username if `PasswordAuthenticator` is configured.

#6946

Deployment and Packaging

  • Scylla is now available for Ubuntu 20
  • We now include node-exporter in rpm/deb/tar packages. For deb/rpm, this is the optional scylla-node-exporter subpackage. This simplifies installation, especially for air-gapped systems (without an Internet connection) #2190
  • Downgrading on Debian derivatives is now easier: installing the scylla metapackage with a specific version will pull in subpackages with the same version. #5514

Documentation

The project's developer-oriented, in-tree documentation is now published to a new doc site using the Scylla Sphinx Theme. Over time, we plan to move all Scylla documentation from the current doc site to the Scylla project and open source it.

CDC

Change Data Capture (CDC) has been production ready since Scylla 4.3.

The CDC API has changed between Scylla 4.3 and Scylla 4.4. If you are using CDC with Scylla 4.3, please refer to Scylla Docs for more info.

#8116

New reference example of a CDC consumer implementation see:

Note that only the Go consumer currently uses the end-of-record column.

Alternator

Alternator now supports nested attribute paths in all expressions. #8066

Additional Features

  • It is now possible to ALTER some properties of system tables, for example, to update speculative_retry for the system_auth.roles table. #7057
    Scylla will still prevent you from deleting columns that are needed for its operation, but will allow you to adjust other properties.
  • The removenode operation has been made safe. Scylla will now collect and merge data from the other nodes to preserve consistency with quorum queries.

Tools and APIs

  • API: There is now a new API to force removal of a node from gossip. This can be used if information about a node lingers in gossip after it is gone. Use with care as it can lead to data loss. #2134

Performance Optimizations — I/O Scheduler 2.0

The Seastar I/O scheduler is used to maximize the request throughput from all shards to the storage. Until now, the scheduler ran in a per-shard scope: each shard ran its own scheduler, balancing between its own I/O tasks, such as reads, updates and compactions. This works well when the workload is approximately balanced between shards; but when, as often happened, one shard was more loaded, it could not take more I/O, even if other shards were not using their share.

I/O Scheduler 2.0, included in Scylla 4.4, fixes this. Because storage bandwidth and IOPS are now shared, each shard can use the whole disk if required.
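The difference can be illustrated with a toy model. The Python sketch below is not Seastar's algorithm; it only compares a fixed equal slice per shard with a shared pool in which unused capacity is redistributed (max-min fairness is used here as a stand-in policy). The function names and the numbers are made up for illustration.

```python
def per_shard_throughput(demand, disk_capacity):
    """Old model: every shard is capped at an equal, static slice of the disk."""
    share = disk_capacity / len(demand)
    return [min(d, share) for d in demand]


def shared_throughput(demand, disk_capacity):
    """Shared model: capacity left unused by light shards goes to busy ones.

    Uses simple max-min fair sharing as a stand-in for a shared scheduler.
    """
    served = [0.0] * len(demand)
    remaining = float(disk_capacity)
    active = [i for i, d in enumerate(demand) if d > 0]
    while active and remaining > 1e-9:
        share = remaining / len(active)
        still_active = []
        for i in active:
            grant = min(share, demand[i] - served[i])
            served[i] += grant
            remaining -= grant
            if demand[i] - served[i] > 1e-9:
                still_active.append(i)
        if len(still_active) == len(active):
            break  # every active shard took a full share; the disk is saturated
        active = still_active
    return served


if __name__ == "__main__":
    demand = [900, 50, 50, 0]   # one hot shard, three mostly idle ones
    print(per_shard_throughput(demand, 1000))
    print(shared_throughput(demand, 1000))
```

In this model the hot shard is limited to a quarter of the disk under the per-shard scheme, while the shared scheme lets it use whatever the idle shards leave free.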

More Performance Optimizations

  • Repair works by having the repairing node (“repair master”) merge data from the other nodes (“repair followers”) and write the local differences to each node. Until now, the repair master calculated the difference with each follower independently and so wrote an SSTable corresponding to each follower. This created additional work for compaction, and is now eliminated as the repair master writes a single SSTable containing all the data. #7525
  • When aggregating (for example, SELECT count(*) FROM tab), Scylla internally fetches pages until it is able to compute a result (in this case, it will need to read the entire table). Previously, each page fetch had its own timeout, starting from the moment the page fetch was initiated. As a result, queries continued to be processed long after the client gave up on them, wasting CPU and I/O bandwidth. This was changed to have a single timeout for all internal page fetches for an aggregation query, limiting the amount of time it can be alive. #1175
  • Scylla maintains system tables tracking SSTables that have large cells, rows, or partitions for diagnostic purposes. When tracked SSTables were deleted, Scylla deleted the records from the system tables. Unfortunately, this uses range tombstones, which are not (yet) well supported in Scylla. A patch series was merged that reduces the number of range tombstones and so their impact on performance. #7668
  • Queries of Time Window Compaction Strategy tables now open SSTables in different windows as the query needs them, instead of all at once. This greatly reduces read amplification, as noted in the commit. #6418
  • For some time now, Scylla reshards (rearranges SSTables to contain data for one shard) on boot. This means we can stop considering multi-shard SSTables in compaction, as done here. This speeds up compaction a little. A similar change was done to cleanup compactions. #7748
  • During bootstrap, decommission, compaction, and reshape, Scylla separates data belonging to different windows (in Time Window Compaction Strategy) into different SSTables (to preserve the compaction strategy invariant). However, it did not do so for memtable flush, requiring a reshape if the node was restarted. It will now flush a single memtable into multiple SSTables, if needed. #4617
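The aggregation timeout change described above (#1175) can be sketched as follows. This is a minimal Python illustration of the idea, not Scylla's implementation: one deadline is computed when the aggregation starts, and every internal page fetch only receives the time still left. The names count_rows, fetch_page and QueryTimeout are hypothetical.

```python
import time


class QueryTimeout(Exception):
    """Raised when the aggregation as a whole runs out of time."""


def count_rows(fetch_page, timeout_s, clock=time.monotonic):
    """Count rows across pages under ONE deadline shared by all page fetches.

    fetch_page(paging_state, budget_s) is a hypothetical callback returning
    (rows, next_paging_state); next_paging_state is None on the last page.
    """
    deadline = clock() + timeout_s          # fixed once, not reset per page
    total, paging_state = 0, None
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise QueryTimeout("aggregation exceeded its overall timeout")
        rows, paging_state = fetch_page(paging_state, remaining)
        total += len(rows)
        if paging_state is None:
            return total
```

With a per-page timeout, the loop above would recompute the deadline on every iteration and could keep running indefinitely as long as each page returned in time; with a single deadline, the work stops once the overall budget is spent.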

The following updates eliminate latency spikes:

  • Large SSTables with many small partitions often require large bloom filters. The allocation for the bloom filters has been changed to allocate the memory in steps, reducing the chance of a reactor stall. #6974
  • The token_metadata structure describes how tokens are distributed across the cluster. Since each node has 256 tokens, this structure can grow quite large, and updating it can take time, causing reactor stalls and high latency. It is now updated by copying it in the background, performing the change, and swapping the new variant into the place of the old one atomically. #7220 #7313
  • A cleanup compaction removes data that the node no longer owns (after adding another node, for example) from SSTables and cache. Computing the token ranges that need to be removed from cache could take a long time and stall, causing latency spikes. This is now fixed. #7674
  • When deserializing values sent from the client, Scylla will no longer allocate contiguous space for this work. This reduces allocator pressure and latency when dealing with megabyte-class values. #6138
  • Continuing improvement of large blob support, we now no longer require contiguous memory when serializing values for the native transport, in response to client queries. Similarly, validating client bind parameters also no longer requires contiguous memory. #6138
  • The memtable flush process was made more preemptible, reducing the probability of a reactor stall. #7885
  • Fixed a large allocation in mutation_partition_view::rows(). #7918
  • Fixed a potential reactor stall on LCS compaction completion. #7758
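The copy-then-swap approach described for token_metadata above can be sketched in a few lines. The Python below is an illustrative analogy only, not Scylla's C++ code: the class name TokenMetadataHolder and the dict layout are hypothetical, and a lock stands in for whatever writer serialization the real code uses. Readers keep using the old snapshot until the new one is published with a single reference assignment.

```python
import threading


class TokenMetadataHolder:
    """Hypothetical holder: readers get an immutable-by-convention snapshot."""

    def __init__(self, token_to_node):
        self._current = dict(token_to_node)
        self._write_lock = threading.Lock()   # serializes writers only

    def get(self):
        # Readers never observe a half-applied update; they must not mutate
        # the returned mapping.
        return self._current

    def update(self, mutate):
        with self._write_lock:
            draft = dict(self._current)   # 1. copy (the slow part)
            mutate(draft)                 # 2. apply the change to the copy
            self._current = draft         # 3. publish with one assignment
```

The slow work (copying and modifying) happens on a private copy, so it can be spread out or done in the background, while the switch visible to readers is a single cheap step.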

Configuration

  • Hinted handoff configuration can now be changed without restart. #5634

Monitoring and Log Collection

  • Scylla Monitoring 3.6 adds support for log collection using Grafana Loki.
    A new setup utility in Scylla allows users to configure rsyslog to send logs to Loki or any other log collection server supporting the rsyslog format. #7589
  • New metrics report CQL errors #5859
  • There are now metrics for the different types of transport-level messages:
      • STARTUP
      • AUTH_RESPONSE
      • OPTIONS
      • QUERY
      • PREPARE
      • EXECUTE
      • BATCH
      • REGISTER
    and for overload indicators:
      • Reads_shed_due_to_overload – The number of reads shed because the admission queue reached its max capacity. When the queue is full, excessive reads are shed to avoid overload
      • Writes_failed_due_to_too_many_in_flight_hints – number of CQL write requests which failed because the hinted handoff mechanism is overloaded and cannot store any more in-flight hints

#4888

  • For all metric updates from Scylla 4.3 to Scylla 4.4, see here

Build

  • The compiler used to build Scylla has been changed from gcc to clang. With clang we get a working implementation of coroutines, which are required for our Raft implementation. #7531

Debugging

  • Scylla will now log a memory diagnostic report when it runs out of memory. Among other data, the report includes the free and used memory of the LSA, cache, memtable and system, as well as memory pools. See the example report in the commit log. #6365
  • A new tool, scylla-sstable-index, is now available to examine SSTable -Index.db files. The tool is not yet packaged.
  • Scylla uses a type called managed_bytes to store serialized data. The data is stored fragmented in cache and memtable, but contiguous outside them, with automatic conversion to contiguous storage on access. Because the conversion was automatic, it was difficult to detect where it happens (so that it can be eliminated). A large patch series made all the conversions explicit and removed the automatic conversion. #7490

Bugs Fixed in the Release

For a full list, use git log.

  • CQL: Invalid aggregation result on table with index: When using aggregates (COUNT(*)) on table with index on clustering key and filtering on clustering key, wrong results are returned #7355
  • CQL: Keyspaces have a durable_writes attribute that says whether to use the commitlog or not. Now, when changing it, the effects take place immediately. #3034
  • Stability: Cleanup compaction in KA/LA SSTables may crash the node in some cases #7553 (also part of Scylla 4.2.2)
  • Stability: ‘ascii’ type column isn’t validated, and one could create illegal ascii values by using CQL without bind variables #5421
  • Alternator AttributesToGet breaks QueryFilter #6951 (also part of Scylla 4.2.2)
  • Stability: From time to time, Scylla needs to move SSTables from one directory to another. If Scylla crashed at exactly the wrong moment, it could leave valid SSTables in both the old and the new places. #7429
  • Install: Scylla does not start when kernel inotify limits are exceeded #7700 (also part of Scylla 4.2.2)
  • Install: missing /etc/systemd/system/scylla-server.service.d/dependencies.conf on scylla-4.4 rpm #7703 (also part of Scylla 4.2.2)
  • Stability: Query with multiple indexes, one date and one boolean, fails with ServerError #7659
  • Thrift: The ability to write to non-thrift tables (tables which were created by CQL CREATE TABLE) from thrift has been removed, since it wasn’t supported correctly. Use CQL to write to CQL tables. #7568
  • Stability: SSTable reshape for Size-Tiered Compaction Strategy has been fixed. Reshape happens when the node starts up and finds its SSTables do not conform to the strategy invariants, or on import. It then compacts the data so the invariants are preserved, guaranteeing reasonable read amplification. A bug caused reshape to be ignored for this compaction strategy. #7774
  • Stability: When joining a node to a cluster, Scylla will now verify the correct snitch is used in the new node. #6832
  • Stability: Wild pointer dereferenced when try_flush_memtable_to_sstable on shutdown fails to create sstable in a dropped keyspace #7792
  • Stability: Scylla records large data cells in the system.large_rows table. If such a large cell was part of a static row, Scylla would crash #6780
  • CQL: CQL prepared statements incomplete support for “unset” values #7740
  • Stability: Crash in row cache if a partition key is larger than 13k, with a managed_bytes::do_linearize() const: Assertion `lc._nesting' failed error #7897
  • CQL: min/max aggregate functions are broken for timeuuid #7729
  • CQL: paxos_grace_seconds table option: prevent applying a negative value #7906
  • LWT: Provide default for serial_consistency even if not specified in a request #7850
  • Redis: redis: parse error message is broken #7861 #7114
  • CQL: Restrictions missing support for “IN” on tables with collections, added in Cassandra 4.0 #7743 #4251
  • Stability: Possible schema discrepancy while updating secondary index computed column #7857
  • Stability: filesystem error: open failed: Too many open files and coredump during resharding with 5000 tables #7439
  • Stability: leveled_compaction_strategy: fix boundary of maximum SSTable level #7833
  • Packaging: seastar-cpu-map.sh missing from PATH #6731
  • Stability: Scylla crashes on SELECT by indexed part of the pkey with WHERE like "key = X AND key = Y" #7772
  • Stability: a race condition can cause coredump in truncating a table after refresh #7732
  • Gossip ACK message in the wrong scheduling group causes latency spikes in some cases #7986
  • Stability: nodetool repair and nodetool removenode commands failed while a repair process was running #7965
  • Stability: [CDC] nodes aborting with coredump when writing large row #7994
  • Stability: Unexpected partition key when repairing with a different number of shards may result in an error similar to: “WARN 2020-11-04 19:06:41,168 [shard 1] repair - repair_writer: keyspace=ks, table=cf, multishard_writer failed: invalid_mutation_fragment_stream”
    This error may occur when running repair between nodes with a different number of cores. #7552 Seastar #867
  • Scylla on AWS: When psutil.disk_partitions() reports that / is /dev/root, aws_instance mistakenly reports that the root partition is part of the ephemeral disks, and RAID construction fails. #8055
  • Alternator: unexpected ValidationException in ConditionExpression and FilterExpression #8043
  • Stability: stream_session::prepare is missing a string format argument #8067
  • Scylla on GCE: scylla_io_setup did not configure pre-tuned GCE instances correctly #7341
  • install: dist/debian: node-exporter package does not install systemd unit #8054
  • Stability: Compaction Backlog controller will misbehave for time series use cases #6054
  • Monitoring: False 1 second latency reported in I/O queue delay metrics #8166
  • Stability: TWCS reshape was silently ignoring windows which contain at least min_threshold SSTables (can happen with data segregation). #8147
  • Packaging: dist/debian: node-exporter files mistakenly packaged in scylla-conf package #8163
  • scylla_setup failed to setup with error /etc/bashrc: line 99: TMOUT: readonly variable #8049
  • Performance: SSTables being compacted by Cleanup, Scrub, Upgrade can potentially be compacted by regular compaction in parallel #8155
  • Stability: Failing to start Scylla after reboot of machine “Startup failed: seastar::rpc::timeout_error (rpc call timed out)” #8187
  • Stability: a regression introduced in 4.4 in cdc generation query for Alternator (Stream API) #8210
  • Stability: Split CDC streams table partitions into clustered rows #8116
  • Install: scylla_raid_setup returns ‘mdX is already using’ even if it’s unused #8219
  • Stability: a regression on cleanup compaction’s space requirement introduced in Scylla 4.1, due to unlimited parallelism #8247
  • Stability: Scylla crashes when the IN marker is bound to null #8265
  • Stability: segfault due to corrupt _last_status_change map in cql_transport::cql_server::event_notifier::on_down during shutdown #8143
