An Interview with Pekka Enberg: Modern Hardware, Old APIs
by Avishai Ish Shalom
Pekka Enberg has been working on Scylla since before its inception in 2014. Prior to joining ScyllaDB he worked on a variety of technologies ranging from High Frequency Trading (HFT) backends through the JVM runtime to kernels, and also web applications. Pekka is currently working on a PhD in computer science exploring new kernel abstractions and interfaces suitable for modern computer architectures.
Being fascinated by the evolution of hardware and software and the mismatch that sometimes happens between them — or rather, their intricate co-evolution — I couldn’t resist the opportunity to ask Pekka for his perspective on how Kernel APIs and architecture is standing the test of time. After all, Linux is a 30 year old project; older if you trace it back to the UNIX API from the 1970s. Pekka’s unique experience on both sides of the API brings an interesting perspective which also sheds light on the rationale behind the Seastar framework and the Scylla database architecture.
Let’s dive in!
Avishai: Pekka, tell me a little bit about your PhD dissertation and how what drove you to work on new kernel interfaces? What’s wrong with the ones we have?
Pekka: I am obviously not the first one to consider this topic, but I have a bit of a personal story behind it, related to Scylla. Some time in Spring of 2013, Avi Kivity (ScyllaDB CTO) approached me, and wanted me to talk about maybe joining his newly founded company. I knew Avi from the Linux kernel virtualization scene, and had met him at Linux conferences. When he told me they were building an operating system from scratch to make things run faster in virtualized environments, I was immediately sold. So we built OSv, an operating system kernel purpose built for hypervisor-based virtual machines (VMs), and integrated with the Java virtual machine (JVM) so that you could run any Java application tightly coupled with the kernel with improved performance — or at least that was the assumption.
I was performance testing Apache Cassandra running on top of OSv, which was one of the big application targets for us and the idea was that we would optimize the operating system layer and co-optimize it with the JVM — we had people that had extensive JVM background and worked on jRockit JVM. However, we discovered that we couldn’t get a lot of gains despite the obvious architectural edge of OSv because of the way the application was structured.
Apache Cassandra was built with this traditional large thread pool architecture (staged event-driven architecture). With this kind of architecture, you break work into multiple stages. For example, you receive a request in one stage, then do some parsing in another stage, and so on. Each stage can run on a different thread, and so you have these large thread pools to be able to take advantage of the parallelism in hardware. But what we saw in performance testing was that in some scenarios, Cassandra was spending a lot of its time waiting for locks to be released. Each stage handing over work to another had to synchronize and we could see locking really high on CPU profiles.
Around that time (late 2014) Avi started to work on Seastar, which is sort of an antithesis to the thread pool architecture. You have just one kernel thread running on a CPU, and try to partition application work and data — just as you would in a distributed system so that threads don’t share anything, and never synchronize with each other. So Seastar is thinking about the machine as a distributed thing rather than a shared-memory machine which was really typical of building multi-threaded services at the time.
But back to your question, why did the kernel become such a problem? We have a lot of different abstractions like threading in the kernel and they require crossing from user space to kernel space (context switching). This cost has gone up quite a bit, partly because of the deep security issues that were uncovered in CPUs [more here] but also relative to other hardware speeds which have changed over the years.
For example, the network is getting faster all the time and we have very fast storage devices, which is why the relative cost of context switching is much higher than it was before. The other thing is the kernel has the same synchronization problem we saw in user-level applications. The Linux kernel is a monolithic shared-memory kernel, which means that all CPUs share many of the same data structures, which they have to synchronize. Of course, there’s tons of work going on in making those synchronization primitives very efficient, but fundamentally you still have the same synchronization problem. Take the Linux networking stack for example: you have some packet arriving on the NIC, which causes an interrupt or a poll event. The event is handled on some CPU, but it’s not necessarily the same CPU that actually is going to handle the network protocol processing of the packet, and the application-level message the packet contains might also be handled on another CPU. So not only do you have to synchronize and lock moving data around, you also invalidate caches, do context switches, etc.
Avishai: Suppose you are running JVM on Linux, you have the networking stack which has a packet on one kernel thread that is supposed to be handled by a JVM thread but they don’t actually know about each other, is that correct? It sounds like the JVM scheduler is “fighting” the kernel scheduler for control and they don’t actually coordinate.
Pekka: Yes, that’s also a problem for sure. This is a mismatch in what the application thinks it’s doing and what the kernel decides to do. A similar and perhaps more fundamental issue is the virtual memory abstraction: the application thinks it has some memory allocated for it, but unless you specifically tell the kernel to never ever take this memory away (the
mlock system call) then when you’re accessing some data structure it might not be in the memory, triggering a page fault which may result in unpredictable performance. And while that page fault is being serviced, the application’s kernel thread is blocked, and there is no way for the application to know this might happen.
The Seastar framework attempts to solve this issue by basically taking control over the machine, bypassing many OS abstractions. So Seastar is not just about eliminating context switches, synchronizations and such, it’s also very much about control. Many people ask if the choice of C++ as the programming language is the reason why Scylla has a performance advantage over Cassandra. I think it is, but not because of the language, but because C++ provides more control.
The JVM generates really efficient code which can be as fast as C++ in most cases, but when it comes to control and predictability the JVM is more limited. Also, when Scylla processes a query it handles caching itself, as opposed to many other databases which use the kernel controlled page cache. All the caching in Scylla is controlled by Scylla itself, and so you know there’s some predictability in what’s going to happen. This translates into request processing latency which is very predictable in Scylla.
Avishai: You said that Seastar is not only about control, can you elaborate more on that?
Pekka: Seastar is built around an idea of how to program multi-core machines efficiently: avoiding coordination and not blocking kernel threads. Seastar has this future/promise model which allows you to write application code efficiently to take advantage of both concurrency and parallelism. A basic example: you write to the disk which is a blocking operation because there is some delay until the data hits whatever storage; the same for networking as well. For a threaded application which uses blocking I/O semantics you would have thread pools because this operation would block a thread for some time, so other threads can use the CPU in the meantime, and this switching work is managed by the kernel. With a thread-per-core model if a thread blocks that’s it — nothing can run on that CPU, so Seastar uses non-blocking I/O and a future/promise model which is basically a way to make it very efficient to switch to some other work. So Seastar is moving these concurrency interfaces or abstractions into user space where it’s much more efficient.
Going back to your question about why the operating system became such a problem, the kernel provides concurrency and parallelism by kernel threads but sometimes you have to block the thread for whatever reason, perhaps wait for some event to happen. Often the application actually has to consider the fact that making the thread sleep can be much more expensive than just burning the CPU a little bit — for example, polling. Making threads sleep and wake up takes time because there’s a lot of things that the kernel has to do when the thread blocks — crossing between kernel and user space, some updating and checking the thread data structure so that the CPU scheduler can run the thread on some other core, etc, so it becomes really expensive. There’s a delay in the wake up when that event happened. Maybe your I/O completed or a packet arrived. For whatever reason your application thread doesn’t run immediately and this can become a problem for low latency applications.
Avishai: So it’s all about dealing with waiting for data and running concurrent computations in the meantime on the CPU?
Pekka: Yes, and is a problem that they already had and solved in the 1950s. The basic problem was that the storage device was significantly slower than the CPU, and they wanted to improve throughput of the machine by doing something useful while waiting for I/O to complete. So they invented something called “multi-programming”, which is more or less what we know as multithreading today. And this is what the POSIX programming model is to applications: you have a process and this process can have multiple threads performing sequential work. You do some stuff in a thread and maybe some I/O, and you have to wait for that I/O before you can proceed with the computation.
But as we already discussed, this blocking model is expensive. Another issue is that hardware has changed over the decades. For example, not all memory accesses have equal cost because of something called NUMA (non-uniform memory access), but this isn’t really visible in the POSIX model. Also, system calls are quite expensive because of the crossing between kernel and user space. Today, you can dispatch I/O operation on a fast NVMe storage device in the time it takes to switch between two threads on the CPU. So whenever you block you probably missed an opportunity to do I/O, so that’s an issue. The question is: how do I efficiently take advantage of the fact that I/O is quite fast but there is still some delay and I want to do some useful work? You need to be able to switch tasks very fast and this is exactly what Seastar aims to do. Seastar eliminates the CPU crossing cost as much as possible and context switching costs and instead of using kernel threads for tasks we use continuation chains or coroutines.
Avishai: It sounds like Seastar is quite a deviation from the POSIX model?
Pekka: The POSIX model really is something that was born in the 1960s and 1970s, to a large extent. It’s a simple CPU-centric model designed around a single CPU that does sequential computation (also known as the von Neumann model), which is easy to reason about. But CPUs internally haven’t worked that way since the beginning of the 1980s or even earlier. So it’s just an abstraction that programmers use — kind of a big lie in a sense, right?
POSIX tells programmers that you have this CPU that can run processes, which can have threads. It’s still the sequential computation model, but with multiple threads you need to remember to do some synchronization. But how things actually get executed is something completely different, and how you can efficiently take advantage of these capabilities is also something completely different. All abstractions are a lie to some degree, but that’s the point — it would be very difficult to program these machines if you didn’t have abstractions that everybody knows about.
Seastar is a different kind of programming model, and now you see these types of programming frameworks much more frequently. For example, you have async/await in Rust, which is very similar as well. When we started doing this in 2014, it was all a little bit new and a weird way of thinking about the whole problem, at least to me. Of course, if you write some application that is not so performance sensitive and you don’t care about latency very much, POSIX is more than fine, although you’ll want to use something even more high level.
Avishai: Can’t we just make existing models faster? Like using user-space POSIX threads running on top of something like Seastar?
Pekka: So user-space based threading is not a new idea. In the 1990s, for example, you had the Solaris operating system do this. They had something called M:N scheduling, where you have N number of kernel threads, and then you have M number of user-level threads, which are time-multiplexed on the kernel threads. So you could have the kernel set up a kernel thread per core and then in user space you run hundreds of threads on top of those kernel threads, for example.
Seastar has the concept of a continuation, which is it’s just a different incarnation of the same programming model. It’s just a different way of expressing the concurrency in your program. But yes, we could make thread context switching and maybe synchronization much faster in user space, but of course there are some additional problems that need to be solved too. There’s the issue of blocking system calls when doing I/O. There are known solutions to the problem, but they are not supported by Linux at the moment. In any case, this issue nicely ties back to my PhD topic: what kind of capabilities the O/S should expose so you could implement POSIX abstractions in user space.
Avishai: I think right now most backend programmers are not actually familiar with the POSIX abstraction, POSIX is more popular with system level programmers. Backend developers are mostly familiar with the model that is presented by the runtime they use — JVM or Python or Golang, etc. — which is not exactly the same as POSIX. It raises an interesting question, especially now that we’re getting sandboxes with WebAssembly, perhaps we want to replace POSIX with a different model?
Pekka: So hopefully no kernel developer reads this, but I tend to think that POSIX is mostly obsolete… Tracing back the history of POSIX, the effort was really about providing a portable interface for applications so you could write an application once and run it on different kinds of machine architectures and operating systems. For example, you had AT&T UNIX, SunOS, and BSD, and Linux later. If you wrote an application in the 1980s or 1990s, you probably wrote it in C or C++, and then POSIX was very relevant for portability because of all these capabilities that it provided. But with the emergence of runtimes like the Java Virtual Machine (JVM) and more recently Node.js and others, I think it’s a fair question to ask how relevant POSIX is. But in any case all of these runtimes are still largely built on top of the POSIX abstraction, but the integration is not perfect, right?
Take the virtual memory abstraction as an example. With memory-mapped files (
mmap), a virtual memory region is transparently backed by files or anonymous memory. But there’s this funny problem that if you use something like Go programming language and you use its goroutines to express concurrency in your application — guess what happens if you need to take a page fault? The page fault screws up everything because it will basically stop the Go runtime scheduler, which is a user space thing.
Another interesting problem that shouldn’t happen in theory, but actually does, is file systems calls that shouldn’t block but do. We use the asynchronous I/O interface of Linux but it’s a known fact that still, some operations which by specification of the interface should be non-blocking actually block and for some specific reasons.
For example, we recommend the XFS file system for Scylla, because it’s the best file system that implements non-blocking operations. However, In some rare cases even with XFS, when you’re writing and then the file system has to allocate a new block or whatever and then you hit a code path which has a lock. If you happen to have two threads doing that then now you are blocked. There’s a good blog post by Avi about the topic, actually.
Anyway, this is one of the reasons why Seastar attempts to bypass anything it can. Seastar tells the Linux kernel “give me all the memory and don’t touch it.” It has its own I/O scheduler, and so on. Avi has sometimes even referred to Seastar as an “operating system in user space,” and I think that’s a good way to think about it.
Avishai: It sounds like one of biggest problems here is that we have a lot of pre-existing libraries and runtimes that make a lot of assumptions about the underlying operating system, so if we change the abstraction to whatever, we basically create a new world and we would have to rewrite a lot of the libraries and runtimes.
Pekka: Yeah, when I started the work on my PhD I had this naive thinking to throw out POSIX. But when talking about my PhD work with Anil Madhavapeddy of MirageOS unikernel fame, he told me that he thought POSIX was a distraction, and we probably can never get rid of it. And although POSIX doesn’t matter as much as it once did, you have to consider that it’s not just POSIX specification or the interfaces but all the underlying stuff like CPUs which are highly optimized to run this type of sequential code.
A lot of work has been done to make this illusion of a shared memory system fast — and it’s amazingly fast. But for something like Scylla the question is can you squeeze more from the hardware by programming in a different way? I think that we might have reached a point where the cost of maintaining the old interfaces is too high because CPU performance is unlikely to get significantly better. The main reason being that CPU clock frequencies are not getting higher.
In the beginning of the 2000s, we were thinking that we’d have these 10 GHz CPUs soon, but of course that didn’t happen. By 2006 when we started to get multi-core chips, there was this thinking that soon we will have CPUs with hundreds of cores, but we actually don’t really have that today. We certainly have more cores, but not in the same order of magnitude as people thought we would.
But although CPU speeds aren’t growing as much, the network speeds are insane already, with 40GbE NICs that are commodity at least in cloud environments, and 100GbE and 200GbE NICs in the horizon.
It’s also interesting to see what’s happening in the storage space. With something like Intel Optane devices, which can connect directly to your memory controller, you have persistent memory that’s almost the same speed as DRAM.
I’m not an expert on storage, so I don’t know how far the performance can be improved, but it’s a big change nonetheless. This puts the idea of a memory hierarchy under pressure. You have this idea that as you move closer to the CPU, storage becomes faster, but smaller. So you start from the L1 cache, then move to DRAM, and finally to storage. But now we’re seeing the storage layer closing in on the DRAM layer in terms of speed.
This brings us back to the virtual memory abstraction, which is about providing an illusion of a memory space that is as large as storage space. But what do you gain from this abstraction, now that you can have this persistent storage which is almost as fast as memory?
Also, we have distributed systems so if you need more memory than you can fit in one machine, you can use more machines over the network. So I think we are at that point in time where the price of this legacy abstraction is high. But is it too high to justify rewriting the OS layer remains to be seen.
Want to Be Part of the Conversation?
If you want to contribute your own say on how the next generation of our NoSQL database gets designed, feel free to join our Slack community. Or, if you have the technical chops to contribute directly to the code base, check out the career opportunities at ScyllaDB.