Bare Metal Is For Massively Scalable Distributed Transaction Processing

InfiniSQL™ is a massively scalable relational database management system (RDBMS). It excels at performing distributed transactions--and is arguably the best at this kind of workload. This was demonstrated recently through repeatable benchmark testing performed on Internap's Bare Metal Cloud infrastructure. The testing validated the assumptions guiding InfiniSQL's design--and bare metal proved to be a requirement for the level of performance achieved. This article describes InfiniSQL with emphasis on the elements that benefit most from bare metal computing.

Background

I conceived InfiniSQL through the course of working in fast-growing, mission-critical enterprise transaction processing environments. For several years I was the capacity planner and systems performance engineer for the Internet's biggest card payments gateway. Traditional enterprise databases have architectural flaws that make them a poor fit for modern environments that need to be up 24x7, grow continuously, and respond rapidly to new business needs. InfiniSQL's design overcomes these flaws while supporting the types of transactional workloads that RDBMSs have always carried out. InfiniSQL is being written from scratch. It is still in its early stages, so not all planned production capabilities are available yet--but the scalable transaction processing core is functional and ready for early adopters. It is free software hosted on GitHub, and new developers are invited to contribute.

Transaction Scaling Challenge

Scaling transaction processing systems beyond a single server is a problem that few projects have solved. There are many examples of key-value and document databases that can scale across multiple nodes, and a handful of relational databases can do so as well. But in most cases, these systems' performance drops off drastically for transactions that touch records on multiple nodes--if they can perform that kind of work at all. Unless a transaction type is carefully crafted (or unless the designer is lucky), workloads running on single-server RDBMS systems become distributed transaction workloads when moved to a horizontally scalable RDBMS such as InfiniSQL. As anybody who has had to scale a transaction processing environment knows, a large amount of energy is often spent on application re-architecture, workarounds, and various point solutions. InfiniSQL was designed to support the kind of work that users already perform in their environments while minimizing this kind of re-architecture.

It is difficult for distributed transactions to perform well primarily because of the chatter between nodes. Traditional databases were designed with data and cache tightly coupled with the transaction processing logic on the same server host. When distributed, much of the communication that was local to a single node gets spread across the network between nodes. Even with fast LAN connections, the latency for this traffic is often a thousand-fold worse: intra-node memory accesses are measured in nanoseconds, while inter-node communication is measured in microseconds at best. No current technology lets networked hosts communicate with one another as fast as they communicate internally. Therefore, InfiniSQL was designed to minimize the effects of latency on transaction processing throughput. And part of this design is a dependency on fast underlying networking infrastructure such as is provided by bare metal servers.

Architecture

The rest of this article focuses primarily on the aspects of InfiniSQL that enable it to minimize the impact of network latency on transaction processing, which is the most critical design feature enabling massive scalability. More comprehensive documentation for InfiniSQL is available online, and further articles and discussions about InfiniSQL are linked from the main web site.

Actor Model

InfiniSQL implements a variation on the actor model of concurrent programming. C++ is the main language used to create InfiniSQL, and the actor model isn't natively supported in that language. Much of the work of implementing InfiniSQL has involved getting actors to work in C++. The actor model allows InfiniSQL to tolerate network latency while sustaining very high throughput and near-linear scalability as nodes increase. This is a radical departure from legacy RDBMS architectures.

The following diagram illustrates the difference between the traditional shared memory model of database design and InfiniSQL's actor model:

[Figure: concurrencymodels.png]

With actors, there's no locking. As more processing is necessary, more actors are added, with each actor corresponding roughly to a single CPU core or hardware thread. As cores are added, actor-based architectures keep up very well. The traditional locked shared memory model, however, suffers as cores are added, because lock contention only increases. Large monolithic databases have very complex lock management methods to minimize this problem.

The actor model allows scalability and mitigates the latency problem because processing logic is handled by one set of actors and data storage by another; their functions are loosely coupled in InfiniSQL. Messaging happens between actors regardless of the node they reside upon: the actor governing a particular transaction doesn't know or care whether the data resides locally or remotely, and the actor managing a particular data partition responds to messages regardless of origin. In theory, the actor model allows InfiniSQL to scale without limit. Each record is assigned to a specific data partition based on a hash of its first field, and each index record is assigned to a partition based on a hash of the indexed value. All data is accessible through inter-node messaging from every node, no matter how many servers are in use.
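The hash-based placement just described can be sketched in a few lines of C++. This is an illustrative sketch only--the type, method names, and partition count are assumptions for the example, not InfiniSQL's actual code--but it shows the key property: any node can compute a record's owning partition locally, with no lookup or coordination.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Illustrative sketch of hash-based data placement: each record is
// routed to a partition (owned by one data actor) by hashing its
// first field, so any node can compute the owner without asking anyone.
struct Cluster {
    int64_t partitions;  // total data partitions across all nodes

    // Partition that owns a record, derived from its first field.
    int64_t partitionForRecord(const std::string &firstField) const {
        return static_cast<int64_t>(
            std::hash<std::string>{}(firstField) % partitions);
    }

    // Index entries are placed the same way, by the indexed value.
    int64_t partitionForIndex(const std::string &indexValue) const {
        return static_cast<int64_t>(
            std::hash<std::string>{}(indexValue) % partitions);
    }
};
```

Because the mapping is deterministic, a transaction actor on any node sends each update message straight to the data actor that owns the partition, whether that actor is local or remote.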

To get a sense of how InfiniSQL tolerates latency better than traditional designs, consider the following:

Throughput = Concurrency ÷ Latency

That very simple formula is at the heart of managing transaction processing environments. For example, 1,000 concurrent transactions at an average of 0.2 seconds of latency per transaction yields 5,000 transactions per second of throughput. As latency increases, throughput remains steady only if concurrency also increases; if high concurrency cannot be tolerated, then neither can increased latency. In the example above, if average latency doubled to 0.4 seconds while concurrency stayed at 1,000, throughput would drop to 2,500 per second.

Nothing can eliminate latency between nodes for distributed transactions. However, InfiniSQL's implementation of the actor model allows it to tolerate latency better than traditional designs. In a legacy RDBMS architecture, a single worker process or thread handles one transaction at a time. Threads are relatively expensive to create in terms of system resources, so very high concurrency under this model causes resource starvation, and throughput becomes quite sensitive to latency. Each InfiniSQL actor thread, on the other hand, can handle a very large number of concurrent transactions, each of which requires only a small amount of memory. Transaction processing actors are always waiting for replies from data actors--but this is not a problem, because they work on other transactions while they wait. Since InfiniSQL can sustain very high concurrency, transaction latency has only a small impact on its throughput. This is what makes InfiniSQL the superior distributed transaction processing platform.
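The "transaction as a small state record" idea can be sketched as follows. This is not InfiniSQL's actual implementation--the types and names here are invented for the example--but it shows why one actor thread can keep thousands of transactions in flight: while replies from data actors are in transit, the only per-transaction cost is a tiny map entry, not a blocked thread.

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of a transaction actor's in-flight bookkeeping (illustrative
// only). Each transaction is a small state record; a reply from any
// data actor resumes just that transaction, so the thread never blocks.
struct Reply { int64_t transactionId; };

class TransactionActor {
    // Per-transaction state: how many data-actor replies are still due.
    std::unordered_map<int64_t, int> pendingReplies;
    int64_t completed = 0;
public:
    // Begin a transaction that must update `updates` partitions.
    void begin(int64_t txnId, int updates) { pendingReplies[txnId] = updates; }

    // A reply arrives from some data actor; resume that transaction.
    void onReply(const Reply &r) {
        auto it = pendingReplies.find(r.transactionId);
        if (it != pendingReplies.end() && --it->second == 0) {
            pendingReplies.erase(it);  // all updates acknowledged: commit
            ++completed;
        }
    }
    int64_t completedCount() const { return completed; }
    std::size_t inFlight() const { return pendingReplies.size(); }
};
```

Replies for different transactions can arrive interleaved in any order; the actor simply resumes whichever transaction each reply belongs to.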

Inter-Node Communication

The benchmark workload used for InfiniSQL testing entailed an average of 3 updates on random, distinct server nodes per transaction. This meant a very large share of processing time was spent sending and receiving messages over the network between cluster nodes. My own test environment contains two servers, each with 12 cores and 72GB of RAM, connected via InfiniBand. Since InfiniSQL uses regular TCP/IP for inter-node communication, I found that the best performance in this environment came from configuring the InfiniBand adapters as 10Gb Ethernet interfaces.

This two-node environment showed promising performance results, but two servers is not enough to tell a compelling scalability story. I needed more servers to benchmark. Since InfiniSQL is a self-funded open source project, buying many servers outright was out of the question, so I turned to the cloud. My first two attempts at benchmarking InfiniSQL in the cloud were with virtualization providers. I was unaware at the time of hourly bare metal pricing such as Internap offers; I thought the only way to get bare metal was to rent by the month or to buy servers outright, and both of those options were beyond my budget. So instead I sought out cloud providers whose virtualized environments seemed to offer high speed networking--one based on the KVM hypervisor, the other on Xen's.

I knew, of course, that hypervisors entail at least some overhead, which is tolerable in many circumstances, and I was hopeful that InfiniSQL would still perform well. What I found, however, was that InfiniSQL performed very poorly on the benchmark workload in both environments. Single node performance was very poor, bottlenecking at only a few processor cores, and multi-node scalability was even more disappointing: adding server nodes barely moved the needle. There was no way to tell a compelling story about InfiniSQL as a distributed transaction processing system.

The bottleneck in each environment was the Linux network drivers under these hypervisors. Latency wasn't much of a problem, for the reasons described above. However, the sheer volume of messages being passed between nodes caused inordinate CPU overhead--so severe that two nodes combined had less throughput than a single node! I tried various tuning mechanisms with little improvement. I made InfiniSQL's inter-node communication more efficient by re-writing serialization/deserialization, increasing the size of message batches, and compressing the payloads. These improvements were valuable but ultimately did not bring satisfactory results.
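The batching idea mentioned above--amortizing per-send CPU cost by coalescing many small messages into one network write--can be sketched as follows. This is a simplified illustration, not InfiniSQL's code; the class name and threshold are invented for the example, and the actual network write is stubbed out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of inter-node message batching (illustrative only): messages
// bound for the same remote node accumulate until a size threshold is
// reached, then go out as a single network write. Fewer sends per
// message means less per-send CPU overhead.
class MessageBatcher {
    std::vector<std::string> batch;
    std::size_t batchedBytes = 0;
    std::size_t flushThreshold;
    std::size_t flushes = 0;
public:
    explicit MessageBatcher(std::size_t threshold) : flushThreshold(threshold) {}

    // Queue a serialized message; flush once the batch is large enough.
    void send(const std::string &payload) {
        batch.push_back(payload);
        batchedBytes += payload.size();
        if (batchedBytes >= flushThreshold) flush();
    }

    // One network write for the whole batch (stubbed out here).
    void flush() {
        if (batch.empty()) return;
        batch.clear();
        batchedBytes = 0;
        ++flushes;  // stands in for a single send()/writev() call
    }
    std::size_t flushCount() const { return flushes; }
};
```

A real implementation would also flush on a short timer so a trickle of messages is never stranded, and could compress each batch before the write--the other optimization mentioned above.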

I knew that I needed to test on bare metal. However far hypervisor technologies have come, they are still not up to par with bare metal, especially for I/O intensive applications such as InfiniSQL. After a brief false start with a bare metal provider who could not deliver what they promised, I happened upon Internap, which was able to provide an hourly bare metal environment perfect for InfiniSQL benchmarking. As hoped, CPU overhead from network traffic decreased such that InfiniSQL scaled "up and to the right" very gracefully as nodes were added:

Median sustained (blue line) and max (red line) throughput by node quantity.

More details about the benchmark are on the blog, but the message about scalability is quite clear. Due to budget constraints, I did not push the environment any further. However, based on the direction of the line graph, it is very likely that InfiniSQL can scale gracefully to a much larger number of nodes. I look forward to further tests (on bare metal, of course) to really stress the limits of InfiniSQL. What the graph doesn't convey, but is true nonetheless, is that bare metal is a requirement for infinitely scalable distributed transaction processing.