Coordinating distributed systems with ZooKeeper

Relaxed in the Zoo

Limits

Even if it seems tempting to use one system for everything, you might run into some obstacles if you try to replace existing filesystems with ZooKeeper. The first obstacle is the jute.maxbuffer setting, which defines a 1MB size limit for single znodes by default. The developers recommend not changing this value, because ZooKeeper is not a large data repository.

I found an exception to this rule: If a client using many watchers loses the connection to ZooKeeper, the client library (Curator [4] in this scenario) tries to re-register all of the watchers as soon as the client reconnects successfully.

Because the jute.maxbuffer limit applies to every message ZooKeeper sends or receives, not only to znode contents, a request that re-registers many watchers can exceed it. The admin then has to increase the size limit so that Curator can successfully reconnect the clients to ZooKeeper. So, what is Curator? The software helps you build a solid ZooKeeper client implementation that correctly handles the exceptions and special cases that arise on the network [5].
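To get a feeling for when a client might hit the limit, the following Python sketch estimates the size of a message that lists every watched path. It assumes that all watchers are re-registered in one message and that each path is encoded as a 4-byte length prefix followed by its UTF-8 bytes; the function names and the example paths are hypothetical, and the result is only a lower bound because real messages carry additional header fields.

```python
import struct

# Default jute.maxbuffer: 0xfffff bytes, that is, just under 1MB.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF

# Size of one 4-byte big-endian integer (used as length or count prefix).
INT_SIZE = struct.calcsize(">i")


def encoded_string_size(path: str) -> int:
    """Bytes for one path: length prefix plus UTF-8 payload (assumed encoding)."""
    return INT_SIZE + len(path.encode("utf-8"))


def estimate_rewatch_size(paths: list[str]) -> int:
    """Lower bound for a single message that lists every watched path."""
    # One element count, followed by the encoded paths.
    return INT_SIZE + sum(encoded_string_size(p) for p in paths)


def fits_in_buffer(paths: list[str], limit: int = DEFAULT_JUTE_MAXBUFFER) -> bool:
    """True if the estimated message stays within the buffer limit."""
    return estimate_rewatch_size(paths) <= limit


if __name__ == "__main__":
    # Hypothetical workload: 20,000 watched znodes with fairly long paths.
    watched = [f"/services/search/clusters/{i:06d}/instances/config"
               for i in range(20_000)]
    size = estimate_rewatch_size(watched)
    print(f"estimated size: {size} bytes, fits: {fits_in_buffer(watched)}")
```

If the estimate comes close to the limit, remember that jute.maxbuffer is a Java system property (for example, -Djute.maxbuffer=<bytes> on the JVM command line) and should be set to the same value on all servers and clients.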

The second obstacle is data throughput, which you can expect to be limited if you use ZooKeeper as a message service. Because the software prioritizes correctness and consistency, speed and availability are secondary (see the "Zab vs. Paxos" and "The CAP Theorem" boxes).

The CAP Theorem

The CAP theorem considers three properties (consistency, availability, and partition tolerance) and states that a distributed system can guarantee only two of the three at the same time. In this light, ZooKeeper is a CP system: It maintains consistency and tolerates partitions, so it still works if portions of the network fail. ZooKeeper sacrifices availability for this: If it cannot guarantee correct behavior, it does not respond to requests.
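The trade-off can be illustrated with a few lines of Python. The sketch assumes a simple majority quorum, which is ZooKeeper's default behavior; the function names are made up for this example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def can_serve(reachable: int, ensemble_size: int) -> bool:
    """A partition keeps answering only if it still contains a majority."""
    return reachable >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # A five-server ensemble split into groups of three and two servers.
    for group in (3, 2):
        print(f"{group} of 5 reachable, can serve: {can_serve(group, 5)}")
```

With five servers split three to two, only the larger group continues to respond. The smaller group gives up availability rather than risk returning inconsistent data.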

Zab vs. Paxos

Although ZooKeeper provides functionality similar to the Paxos algorithm, quorum building on the network does not rely on Paxos. The algorithm used by ZooKeeper is Zab (ZooKeeper atomic broadcast). Like Paxos, it relies on a quorum to achieve durability of the stored data.

The difference is that Zab only uses one proposer, whereas Paxos runs different proposers in parallel (Figure 2). Running proposers in parallel, however, can compromise the strict ordering of changes that Zab values. This is one reason why a synchronization phase follows each election of a new leader before Zab accepts new changes. A Stanford University paper contains more details [6].

Figure 2: A Stanford University paper discusses the advantages and disadvantages of Zab and Paxos.
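The following toy model in Python shows why a single proposer makes ordering straightforward. It is a strongly simplified sketch, not the Zab protocol itself: There is no network, no quorum acknowledgment, and the Leader class is hypothetical. The only detail taken from ZooKeeper is the layout of the transaction ID (zxid), which is a 64-bit number with the leader epoch in the upper 32 bits and a counter in the lower 32 bits.

```python
from dataclasses import dataclass, field


def make_zxid(epoch: int, counter: int) -> int:
    """Pack the leader epoch (high 32 bits) and a counter (low 32 bits)."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)


@dataclass
class Leader:
    """Toy single proposer: it alone assigns transaction IDs."""
    epoch: int
    counter: int = 0
    synced: bool = False
    log: list = field(default_factory=list)

    def synchronize(self, follower_logs: list) -> None:
        """Adopt the most advanced known history before accepting changes."""
        # Simplification: pick the log whose last entry has the highest zxid.
        best = max(follower_logs,
                   key=lambda entries: entries[-1][0] if entries else -1,
                   default=[])
        self.log = list(best)
        self.synced = True

    def propose(self, change: str) -> int:
        """Assign the next zxid; refuse to do so before synchronization."""
        if not self.synced:
            raise RuntimeError("leader must synchronize before proposing")
        self.counter += 1
        zxid = make_zxid(self.epoch, self.counter)
        self.log.append((zxid, change))
        return zxid
```

Because a newly elected leader uses a higher epoch, every zxid it assigns sorts after all zxids of its predecessors, and the synchronization step ensures that it continues from the existing history instead of starting a competing one.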

In the Wild

Found [7], which offers Elasticsearch (ES) instances [8] and is also my employer, uses ZooKeeper intensively to discover services, allocate resources, carry out leader elections, and send high-priority messages. The complete service consists of many systems that have read and write access to ZooKeeper.

At Found, ZooKeeper is specifically used in combination with the client console: The web application opens a window for customers into the world of ZooKeeper and lets them manage the ES clusters hosted by Found. If a customer creates a new cluster, or changes something on an existing one, this step ends up as a scheduled change request in ZooKeeper.

Last Instance

A constructor looks for new jobs in ZooKeeper and implements them by calculating how many ES instances it needs and whether it can reuse the existing instances. Based on this information, it updates the instance list for each ES server and waits for the new instances to start.
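The planning step might look something like the following Python sketch. It is purely illustrative and not Found's actual code: The function name, the returned dictionary, and the instance names are assumptions, and a real planner would also have to consider which server each instance runs on.

```python
def plan_instances(desired: int, existing: list[str]) -> dict:
    """Decide which instances to reuse, how many to create, and which to stop."""
    reuse = existing[:desired]      # keep as many running instances as needed
    surplus = existing[desired:]    # anything beyond the target is stopped
    return {
        "reuse": reuse,
        "create": desired - len(reuse),
        "stop": surplus,
    }


if __name__ == "__main__":
    # Hypothetical change request: grow a cluster from two to three instances.
    print(plan_instances(3, ["es-1", "es-2"]))
```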

A small application managed by ZooKeeper monitors the instance list on each server running Elasticsearch instances. It starts and stops LXC containers with ES instances on demand. When a search instance starts, a small ES plugin reports its IP address and port to ZooKeeper and discovers further ES instances in order to form a cluster.

The constructor waits for the address information provided by ZooKeeper so it can connect to the instances and verify that the cluster has formed. If no feedback arrives within a certain time, the constructor cancels the changes. ES plugins that are misconfigured or too memory-hungry are typical problems that prevent new nodes from starting.
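The wait-and-cancel pattern described here can be sketched in Python as follows. To keep the example self-contained, an in-memory class stands in for the znodes in which the instances publish their addresses; the class, the function, and all parameter names are hypothetical and do not reflect Found's implementation.

```python
import time
from typing import Callable, Dict, Optional, Set, Tuple

Address = Tuple[str, int]


class InMemoryRegistry:
    """Stand-in for the znodes where instances publish host and port."""

    def __init__(self) -> None:
        self._addresses: Dict[str, Address] = {}

    def register(self, instance_id: str, host: str, port: int) -> None:
        """Called by an instance (or its plugin) once it has started."""
        self._addresses[instance_id] = (host, port)

    def lookup(self, instance_ids: Set[str]) -> Dict[str, Address]:
        """Return the addresses of those requested instances that registered."""
        return {i: a for i, a in self._addresses.items() if i in instance_ids}


def wait_for_instances(
    registry: InMemoryRegistry,
    expected: Set[str],
    timeout: float,
    on_timeout: Callable[[], None] = lambda: None,
    poll_interval: float = 0.05,
) -> Optional[Dict[str, Address]]:
    """Poll until every expected instance has registered or the time is up."""
    deadline = time.monotonic() + timeout
    while True:
        found = registry.lookup(expected)
        if set(found) == expected:
            return found
        if time.monotonic() >= deadline:
            on_timeout()  # for example, roll back the change request
            return None
        time.sleep(poll_interval)
```

Against a real ZooKeeper ensemble, a client would typically set a watch on the relevant znodes instead of polling, but the deadline and the rollback on timeout would stay the same.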
