Scale-out with PostgreSQL

Load Up

Open Source

Just like PostgreSQL, YugabyteDB does not have an Enterprise edition: YugabyteDB with all its database features is open source, giving you all the freedom you need and allowing for community contributions. For production, the company Yugabyte offers support and managed services, but the database and all of its features remain under the Apache 2 open source license. The decision to use free software was explained in detail by the founder and CTO of Yugabyte, Karthik Ranganathan [2].

Because YugabyteDB is open source, it is easy to test. It runs on any Linux platform – whether server, virtual machine, or container. You only need local storage and network access. The easiest way to test it on a laptop is to use Docker. A single node is all you need. If you add three or more nodes, you can test for resilience (by simulating a failure) and elasticity (by simply adding more nodes). The simple web console lets you monitor automatic data balancing and processing.

The commands

$ docker network create -d bridge yb
$ docker run -d --name yb1 --hostname yb1 --net=yb -p5431:5433 -p7001:7000 yugabytedb/yugabyte:latest yugabyted start --daemon=false --listen yb1

create a network, start an initial node, and map container port 5433 (the PostgreSQL endpoint) to host port 5431 and container port 7000 (the web console, which reveals the tablet servers and tablets) to host port 7001. The web console is then reachable at http://localhost:7001 . After startup, any PostgreSQL client can connect to the database through host port 5431.


To add more nodes, use the --join option. You can build a three-node cluster with:

$ docker run -d --name yb2 --hostname yb2 --net=yb -p5432:5433 -p7002:7000 yugabytedb/yugabyte:latest yugabyted start --daemon=false --listen yb2 --join yb1
$ docker run -d --name yb3 --hostname yb3 --net=yb -p5433:5433 -p7003:7000 yugabytedb/yugabyte:latest yugabyted start --daemon=false --listen yb3 --join yb1

The user can then connect to any available node and read from and write to it, even if another node is down. The same database can also be created in YugabyteDB Managed, a fully managed database as a service (DBaaS) that runs on AWS and Google Cloud. A small machine can be used free of charge.


The query layer is based on PostgreSQL code; the rows and index entries of YugabyteDB tables (YSQL) are sharded. Distribution is handled by a range or hash function, which can be specified with CREATE TABLE and CREATE INDEX: ASC or DESC for range sharding, or HASH for sharding with a hash function. By default, the hash function is applied to the first column of the primary key and of secondary indexes. These defaults facilitate migration from PostgreSQL without changes to the data definition language (DDL).
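The idea behind hash sharding can be pictured with a short sketch. The hash function, the 16-bit hash space, and the range-to-tablet mapping below are simplified stand-ins for illustration, not YugabyteDB's actual implementation:

```python
import hashlib

HASH_SPACE = 1 << 16  # assume a 16-bit hash space, 0..65535

def hash_code(key: str) -> int:
    # Illustrative stand-in for the real hash function: map the
    # first primary-key column to a bucket in the hash space.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:2], "big")

def tablet_for(key: str, num_tablets: int) -> int:
    # Each tablet owns a contiguous slice of the hash space.
    return hash_code(key) * num_tablets // HASH_SPACE

# Rows are spread across tablets regardless of key order.
print(tablet_for("user-1", 8), tablet_for("user-2", 8))
```

Range sharding (ASC/DESC) instead keeps keys in sort order, so consecutive keys land on the same tablet, which favors range scans over uniform distribution.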

Sharding occurs automatically by splitting the shards (or tablets) as the table grows. Additional syntax or management commands can split tables and indexes manually (e.g., before a bulk load). Table rows and index entries are all distributed independently by default, but for performance reasons, it is possible to group small tables and indexes together into a single tablet. This decision depends on your operations. A cluster with only one region can accept a latency of one millisecond between availability zones; a cluster with multiple regions needs to reduce cross-node remote procedure calls (RPCs).
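Automatic splitting can be sketched as follows; the size threshold and the hash-range bookkeeping are hypothetical simplifications of what the cluster does in the background:

```python
# Sketch: automatic tablet splitting. A tablet owns a slice of the
# hash space; when it grows past a size threshold, it is split into
# two tablets, each covering half of the original range.

SPLIT_THRESHOLD = 100  # stand-in for a size-based limit

def maybe_split(tablets):
    # tablets: list of (low, high, size) hash-range slices
    out = []
    for low, high, size in tablets:
        if size > SPLIT_THRESHOLD and high - low > 1:
            mid = (low + high) // 2
            out.append((low, mid, size // 2))
            out.append((mid, high, size - size // 2))
        else:
            out.append((low, high, size))
    return out

print(maybe_split([(0, 65536, 250)]))
# -> [(0, 32768, 125), (32768, 65536, 125)]
```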

Yugabyte Tablet Server (YB-TServer) stores and manages data for client applications, which can connect to any node. Each node provides a PostgreSQL endpoint that forks a back-end process for each connection. Parsing and execution of the query rely on PostgreSQL code, which has been extended and optimized for cluster use. However, instead of reading or writing tuples from or to the shared buffer cache and local data files, as would be the case with PostgreSQL, the back end sends reads and writes in batches to tablet leaders distributed across nodes (TServers) in the cluster. The PostgreSQL back end is a client for DocDB.

Each tablet is replicated for high availability, with one of the replicas (tablet peers) acting as the tablet leader to which reads and writes are sent. The name "leader" comes from the Raft consensus algorithm, which guarantees exactly one leader per tablet, even in the event of a failure, as long as the quorum is present. The leader lease algorithm prevents split-brain situations. If a leader is unavailable, the quorum of followers elects a new leader, guaranteeing that no data is lost and that recovery time remains minimal (the recovery time objective, RTO, is measured in seconds and depends on the configuration and expected network latency).

All SQL layer reads and writes are sent to the tablet leader according to the tuple distribution method (by splitting on the primary key or index entry) and are optimized to reduce the effect of latency on response time. A single leader guarantees consistency of reads and writes. Data is distributed and processed by automatically balancing the leader and followers. Writes are synchronized with followers and wait for quorum confirmation. In a cluster with RF=3, the change is usually executed locally and waits for confirmation from one of the two followers. If the leader is no longer available (e.g., because of a server failure), at least one of the followers will have received the last change and can be elected as the new leader. In terms of the consistency, availability, and partition tolerance (CAP) theorem, YugabyteDB satisfies the requirements for C and P: It always remains consistent with the best possible availability.
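The quorum rule for an RF=3 write can be sketched in a few lines. This is a simplification for illustration; the real protocol is Raft log replication, not a bare acknowledgment counter:

```python
# Sketch: a write on a replication-factor-3 tablet is acknowledged
# once a majority of tablet peers (leader included) has persisted it.

def quorum(replication_factor: int) -> int:
    # Majority: more than half of the peers.
    return replication_factor // 2 + 1

def write_acknowledged(acks: int, replication_factor: int = 3) -> bool:
    # acks counts peers that confirmed the write, including the leader.
    return acks >= quorum(replication_factor)

print(write_acknowledged(2))  # leader + one follower -> True
print(write_acknowledged(1))  # leader alone -> False
```

Because a majority has the change, any new leader elected from the quorum is guaranteed to hold the last acknowledged write.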

Each tablet peer is a log-structured merge (LSM) tree based on RocksDB that stores all changes as new versions of rows or columns. All isolation levels are supported with pessimistic or optimistic locking. The first level of the LSM tree, the MemTable, is located in RAM. It is flushed to sorted string table (SST) files as soon as it reaches a certain size (128MB by default), creating many SST files with all intermediate versions. To limit the amount of data, these files are compacted in the background: The data from several SST files is merged into a new SST file, deprecated entries are removed, and the original SST files are then deleted.
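A minimal sketch of the MemTable/SST/compaction cycle, with a tiny entry limit standing in for the 128MB default (illustrative only, not the RocksDB implementation):

```python
# Minimal LSM-tree sketch: writes go to an in-memory MemTable; when it
# reaches a size limit it is flushed to an immutable SST "file";
# compaction merges SSTs, keeping only the latest version of each key.

MEMTABLE_LIMIT = 4  # stand-in for the 128MB default

memtable: dict[str, str] = {}
ssts: list[dict[str, str]] = []  # newest last

def put(key: str, value: str) -> None:
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        ssts.append(dict(memtable))  # flush to an immutable SST
        memtable.clear()

def compact() -> None:
    # Merge all SSTs; newer values win, deprecated versions are dropped.
    merged: dict[str, str] = {}
    for sst in ssts:  # oldest first, so later updates overwrite
        merged.update(sst)
    ssts[:] = [merged]

def get(key: str):
    # Reads check the MemTable first, then SSTs from newest to oldest.
    if key in memtable:
        return memtable[key]
    for sst in reversed(ssts):
        if key in sst:
            return sst[key]
    return None
```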

All changes are sequenced by a hybrid logical clock (HLC) to ensure the serializability of transactions despite the possible offsets between the physical clocks of each server [3]. The title of that article is a reference to Google Spanner, which solved the clock-skew problem with atomic clocks but thereby limited its use to Google's own data centers. YugabyteDB solves this problem by synchronizing the hybrid logical clock with Lamport's algorithm. It only has to wait out the maximum clock skew in the very rare cases in which no messages are exchanged between servers.
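A simplified hybrid logical clock can be sketched as follows. The timestamp layout and update rules are a textbook-style simplification, not YugabyteDB's exact encoding:

```python
# Sketch of a hybrid logical clock (HLC): timestamps combine physical
# time with a logical counter, advanced Lamport-style on message
# exchange, so events order consistently despite clock skew.

class HLC:
    def __init__(self):
        self.physical = 0  # highest physical time seen so far
        self.logical = 0   # tie-breaking counter

    def now(self, wall_clock: int):
        # Local event: take the wall clock if it moved forward,
        # otherwise bump the logical counter.
        if wall_clock > self.physical:
            self.physical, self.logical = wall_clock, 0
        else:
            self.logical += 1
        return (self.physical, self.logical)

    def update(self, wall_clock: int, remote):
        # Message received: advance past both the local clock and the
        # sender's timestamp, so causality is never violated.
        self.now(wall_clock)
        if remote >= (self.physical, self.logical):
            self.physical, self.logical = remote[0], remote[1] + 1
        return (self.physical, self.logical)
```

Even if the receiver's wall clock lags behind the sender's, the received timestamp pushes the local HLC forward, so the reply is always ordered after the request.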


YugabyteDB is designed for high-performance OLTP workloads that require the full set of SQL features developed over the 30-year history of monolithic databases, which is the reason for reusing PostgreSQL. Only a few features are not supported in a distributed context [4].

OLTP applications also run some analytical queries, for which YugabyteDB is not optimized. In cloud-native applications, these queries are usually handed off to a separate, specialized database service, which requires more code and components to extract, transform, and load the data into that database. YugabyteDB has implemented change data capture, which can stream data to other data services with the support of Debezium or Kafka. Event streaming, however, will never match the simplicity and performance of a converged database. Where possible, real-time analysis is easier to perform in the operational database itself.

SQL provides many features for running complex queries with window functions, secondary indexes, and materialized views. At this point, work is being done on YugabyteDB to optimize it in a distributed context. The main approach is to move the filters (WHERE), aggregations (GROUP BY), and sorts (ORDER BY) to the DocDB storage layer. The filters are processed on each database node instead of having to send a full set of rows to the SQL layer to perform these actions.
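The pushdown idea can be sketched as follows, with hypothetical in-memory "nodes" standing in for TServers. The point is that each node evaluates the predicate and returns only a partial result, rather than shipping every row to the SQL layer:

```python
# Sketch: predicate and aggregate pushdown. Each storage node applies
# the WHERE filter and a partial COUNT locally; the SQL layer only
# combines the partial results.

nodes = [
    [("a", 10), ("b", 25)],   # rows stored on node 1
    [("c", 5), ("d", 40)],    # rows stored on node 2
]

def pushed_down_count(predicate):
    # Each node filters its own rows and returns a partial count ...
    partials = [sum(1 for _, v in rows if predicate(v)) for rows in nodes]
    # ... so only one small number per node crosses the network.
    return sum(partials)

print(pushed_down_count(lambda v: v > 8))  # counts 10, 25, 40 -> 3
```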

As already mentioned, besides the main PostgreSQL API (referred to as YSQL), another API is compatible with Cassandra (YCQL). With time series, for example, you can choose the interface to suit your needs. YCQL supports high-speed data ingestion by taking advantage of Cassandra drivers, which are multithreaded and cluster-aware and avoid the overhead of complex transactional semantics. YSQL is better suited for queries because it offers many functions for processing data and many index types for optimizing access.

YugabyteDB is constantly being improved, and open source is not just about the code. The architecture design and roadmap are published on GitHub [5]. Many new features are designed to improve the user experience; support new cloud-native applications; or help migrate from PostgreSQL, Oracle, or other traditional databases to Yugabyte.

Although all benchmarks show good performance for OLTP, some complex queries, especially for analytics, require further development in YugabyteDB's query planner and executor. This work is ongoing, and the developers welcome community input.


  1. Bacon, D. F., N. Bales, N. Bruno, et al. "Spanner: Becoming a SQL System." In: ACM SIGMOD, ed. SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data (Association for Computing Machinery, New York, 2017), pp. 331-343
  2. "Why We Changed YugabyteDB Licensing to 100 percent Open Source" by Karthik Ranganathan, YugabyteDB blog, July 2019:
  3. Distributed transactions without atomic clocks:
  4. PostgreSQL compatibility:
  5. Design and roadmap:
