In this second article on scalable storage in cloud environments, we cover the inner workings of RADOS and how to avoid pitfalls.

The RADOS object store and Ceph filesystem: Part 2

In an earlier article, ADMIN magazine introduced RADOS and Ceph  [1] and explained what these tools are all about. In this second article, I will take a closer look and explain the basic concepts that play a role in their development. How does the cluster take care of internal redundancy of the stored objects, for example, and what possibilities exist besides Ceph for accessing the data in the object store?

The first part of this workshop demonstrated how an additional node can be added to an existing cluster using

ceph osd crush add 4 osd.4 1.0 pool=default host=daisy

assuming that daisy is the hostname of the new server. This command integrates the host daisy in the cluster and gives it the same weight (1,0) as all the other nodes. Removing a node from the cluster configuration is as easy as adding one. The command for that is:

ceph osd crush remove osd.4

The pool=default parameter for adding the node already refers to an important feature: pools.

Working with Pools

RADOS offers the option of dividing the storage of the entire object store into individual fragments called pools. One pool, however, does not correspond to a contiguous storage area, as with a partition; rather, it is a logical layer consisting of binary data tagged as belonging to the corresponding pool. Pools allow configurations in which individual users can only access specific pools, for example. The pools metadata, data, and rbd are available in the default configuration. A list of existing pools can be called up with rados lspools (Figure 1).

Figure 1: Listing all current pools in RADOS.

If you want to add another pool to the configuration, you can use the

rados mkpool <Name>

command, replacing <Name> with a unique name. If an existing pool is no longer needed,

rados rmpool <Name>

removes it.

Inside RADOS

One of the explicit design goals for RADOS is to provide seamlessly scalable storage. Admins should be able to add any number of storage nodes to a RADOS object store at any time. Redundancy is important, and RADOS takes this into account by managing replication of the data automatically, without the admin of the RADOS cluster having to intervene manually. Combined with scalability, however, this process results in a problem for RADOS that other replication solutions don’t have: How to distribute data optimally in a large cluster.

Conventional storage solutions generally “only” make sure data is copied from one server to another, so in the worst case, a failover can be executed. Usually, such solutions only run within one cluster with two, or at most three, nodes. The possibility of adding more nodes is ruled out from the beginning.

With RADOS, theoretically, any number of nodes could be added to the cluster, and for each one, the object store must ensure that its contents are available redundantly within the whole cluster. Not the least of developers’ problems is dealing with “rack awareness.” If you have a 20-node cluster with RADOS in your data center, you will ideally have it distributed over different multiple compartments or buildings for additional security. For that setup to work properly, the storage solution must know where each node is and where and how which data can be accessed. The solution RADOS developers – above all RADOS guru Sage A. Weil – came up with consists of two parts: placement groups and the Crush map.

Placement Groups

Three different maps exist within a RADOS cluster: the MONmap, which is a list of all monitoring servers; the OSDmap, in which all physical Object Storage Devices (OSDs) are found; and the Crush map. OSDs themselves contain the binary objects – that is, the data actually saved in the object store.

Here is where placement groups (PGs) come into play: From the outside, it seems as if the allocation of storage objects to specific OSDs occurs randomly. In reality, however, the allocation is done by means of the placement groups. Each object belongs to such a group. Simply speaking, a placement group is a list of different objects that are placed in the RADOS store. RADOS computes which placement group an object belongs to by using the name of the object, the desired replication level, and a bitmask that determines the sum of all PGs in the RADOS cluster.

The Crush Map

The Crush map, which is the second part of this system, contains information about where each placement group is found in the cluster – that is, on which OSD (Figure 2). Replication is always executed at the PG level: All objects of a placement group are replicated between different OSDs in the RADOS cluster.

Figure 2: The elements of the RADOS universe, whose interaction is controlled by the Crush map.

The Crush map got its name from the algorithm it uses: Controlled Replication under Scalable Hashing. The algorithm was developed by Weil specifically for such tasks in RADOS. Weil highlights one feature of Crush in particular: In contrast to hash algorithms, Crush remains stable when many storage devices leave or join the Cluster simultaneously. The rebalancing that other storage solutions require creates a lot of traffic with correspondingly long waiting periods. Crush-based clusters, on the other hand, transfer just enough data between storage nodes to achieve a balance.

Manipulating Data Placement

Of course, the admin also has a word to say about what data lands where. Practically all parameters that pertain to replication in RADOS can be configured by the admin, including, for example, how often an object should exist within an RADOS cluster (i.e., how many replicas of it should be made). Of course, the admin is also free to manipulate the allocation of the replicas to the OSDs. Thus, RADOS can take into account where specific racks are. Rules for replication specified by the admin control distribution of the replicas from placement groups to different OSD groups. A group can, for example, include all servers in the same data center room, another group in another room, and so on.

Administrators can define how the replication of data in RADOS is done by manipulating the corresponding Crush rules. The basic principle is the neither the allocation of the placement groups nor the results of the Crush calculations can be influenced directly. Instead, replication rules are set for each pool; subsequently, distribution of the placement groups and their positioning by means of the Crush map are done by RADOS.

By default, RADOS includes a rule that two replicas of each object must exist per pool. The number of replicas is always set for each pool; a typical example would be to raise this number to three, as is done here:

ceph osd pool set data size 3

To make the same change for the test pool, data would be replaced by test. Whether the cluster subsequently actually does what the admin expects can be investigated with ceph -v.

The use ceph in this kind of operation is certainly comfortable, but it does not provide the full range of functions. In the example, the pool continues to use the default Crush map, which does not consider properties such as the rack where the server is. If you want to distribute replicas according to such properties, you have to create your own Crush map.

A Crush Map of Your Own

With your own Crush map, you as administrator can have the objects in your pools distributed any way you want. crushtool is a valuable utility for this purpose because it creates corresponding templates. The following example creates a Crush map for a setup consisting of six OSDs (i.e., individual storage devices in servers) distributed over three racks:

crushtool --num_osds 6 -o --build host straw 1 rack straw 2 root straw 0

The --num_osds 6 parameter specifies that the cluster has six individual storage devices at its disposal. The --build option introduces a statement with three three-part parameters that follow the <Name> <Internal Crush Algorithm> <Number> syntax.

<Name> can be chosen freely; however, it is wise to choose something meaningful. host straw 1 specifies that one replica is allowed per host, and rack straw 2 tells RADOS that two servers exist per rack. root straw 0 refers to the number of racks and determines that the replicas should be distributed equally on all available racks.

Subsequently, if you want to see the resulting Crush map in plain text, you can enter

crushtool -d -o crush.example

The file with the map in plain text will then be called crush.example (Listing 1).

Listing 1: Crush Map for Six Servers in Two Racks

001 # begin crush map
003 # devices
004 device 0 device0
005 device 1 device1
006 device 2 device2
007 device 3 device3
008 device 4 device4
009 device 5 device5
011 # types
012 type 0 device
013 type 1 host
014 type 2 rack
015 type 3 root
017 # buckets
018 host host0 {
019         id -1            # do not change unnecessarily
020         # weight 1.000
021         alg straw
022         hash 0  # rjenkins1
023         item device0 weight 1.000
024 }
025 host host1 {
026         id -2            # do not change unnecessarily
027         # weight 1.000
028         alg straw
029         hash 0  # rjenkins1
030         item device1 weight 1.000
031 }
032 host host2 {
033         id -3            # do not change unnecessarily
034         # weight 1.000
035         alg straw
036         hash 0  # rjenkins1
037         item device2 weight 1.000
038 }
039 host host3 {
040         id -4            # do not change unnecessarily
041         # weight 1.000
042         alg straw
043         hash 0  # rjenkins1
044         item device3 weight 1.000
045 }
046 host host4 {
047         id -5            # do not change unnecessarily
048         # weight 1.000
049         alg straw
050         hash 0  # rjenkins1
051         item device4 weight 1.000
052 }
053 host host5 {
054         id -6            # do not change unnecessarily
055         # weight 1.000
056         alg straw
057         hash 0  # rjenkins1
058         item device5 weight 1.000
059 }
060 rack rack0 {
061         id -7            # do not change unnecessarily
062         # weight 2.000
063         alg straw
064         hash 0  # rjenkins1
065         item host0 weight 1.000
066         item host1 weight 1.000
067 }
068 rack rack1 {
069         id -8            # do not change unnecessarily
070         # weight 2.000
071         alg straw
072         hash 0  # rjenkins1
073         item host2 weight 1.000
074         item host3 weight 1.000
075 }
076 rack rack2 {
077         id -9            # do not change unnecessarily
078         # weight 2.000
079         alg straw
080         hash 0  # rjenkins1
081         item host4 weight 1.000
082         item host5 weight 1.000
083 }
084 root root {
085         id -10           # do not change unnecessarily
086         # weight 6.000
087         alg straw
088         hash 0  # rjenkins1
089         item rack0 weight 2.000
090         item rack1 weight 2.000
091         item rack2 weight 2.000
092 }
094 # rules
095 rule data {
096         ruleset 1
097         type replicated
098         min_size 2
099         max_size 2
100         step take root
101         step chooseleaf firstn 0 type rack
102         step emit
103 }
105 # end crush map

Of course, the names of the devices and hosts (device0, device2, … and host1, host2, …) must be adapted to the local conditions. Thus, the name of the device should correspond to the device name in ceph.conf (in this example: osd.0), and the hostname should agree with the hostname of the server.

Distributing Replicas on the Racks

The replication is defined in line 101 of Listing 1 by step chooseleaf firstn 0 type rack, which determines that replicas are to be distributed on the racks. To achieve a distribution per host, rack would be replaced by host. The min_size and max_size parameters (lines 98 and 99) seem inconspicuous; however, especially in combination with ruleset 1 (line 96), they are very important for RADOS to use the rule created. Which ruleset from the Crush map RADOS will use is specified for each pool; for this, RADOS not only matches names, but also the min_size and max_size parameters, which refer to the number of replicas.

In concrete terms, this means: If RADOS is supposed to process a pool according to ruleset 1, which uses two replicas, then the rule in the example would apply. However, if the admin has used the command as explained above to specify that three replicas should exist for the objects in the pool, then RADOS would not apply the rule. To be less specific, it is recommended to set min_size to 1 and max_size to 10 – this rule would then apply for all pools that use ruleset 1 and require from one to 10 replicas.

On the basis of this example, admins will be able to create their own Crush maps, which must subsequently find their way back into RADOS.

Extending the Existing Crush Map

Practice has shown that it is useful to extend an existing Crush map with new entries instead of building a new one from scratch. The following command will give access to the Crush map currently in use:

ceph osd getcrushmap -o

This command will save the Crush map in binary format in This file can be decoded with

crushtool -d -o

which transfers the plain text version to After editing, the Crush map must be encoded again with

crushtool -c -o

before it can be sent back to the RADOS cluster with

ceph osd setcrushmap -i

Ideally, for a newly created pool to use the new rule, it would be set accordingly in ceph.conf; the new rule would simply become the standard value. If the new ruleset from the example has the ID 4 in the Crush map (ruleset 4), then the line

osd pool default crush rule = 4

would help configuration block [osd]. Additionally,

osd pool default size = 3

can be used to determine that new pools should always be created with three replicas.

The 4K Limit for Ext3 and Ext4

If you complete a RADOS installation and use ext3 or ext4 as the filesystem, you might run into a snare: In these filesystems, the XATTR attributes (i.e., the extended file attributes) are limited to a maximum of 4KB, and the XATTR entries from RADOS regularly take up just that. If no preventive measures are taken, this setup could, in the worst case, cause ceph-osd to crash or spit out cryptic error messages like (Operation not supported).

This problem can be remedied using an external file to save the XATTR entries. To do so, the line

filestore xattr use omap = true ; for ext3/4 filesystem

must be inserted into the [osd] entry in ceph.conf (Figure 3). In this way, you can protect the cluster against possible problems.

Figure 3: Special care is needed when operating RADOS with ext3 or ext4.

Alternative: The RADOS Block Device

In the first part of this workshop [1], I took an in-depth look at the Ceph filesystem, which is a front end for RADOS. Ceph is not, however, the only way to access the data deposited in a RADOS store – the RADOS block driver (RBD) is an alternative. With rbd, objects in RADOS can be addressed as if they were on a hard disk.

This functionality is especially helpful in the context of virtualization because presenting a block device to KVM as a hard disk saves the virtualization a detour over container formats like qcow2. Additionally, using rbd is very easy – an rbd pool is already included that admins can take advantage of. For example, to create an rbd drive with a size of 1GB, you only need to use the command

rbd create test --size 1024

Subsequently, the block device can be used on any host with an rbd kernel module, which is now part of the mainline kernel and is found on practically every system. Because ceph.conf from the first part of this workshop has already specified that users must authenticate themselves to gain access to the RADOS services, that is also true for rbd. Note that login credentials can be called with:

ceph-authtool -l /etc/ceph/admin.keyring

Then, on a machine that has loaded rbd, you can type

echo "IP addresses of the MONs, separated by commas name=admin,secret=Authkeyrbd test" > /sys/bus/rbd/add

to activate the RBD drive.


RADOS lets administrators replace two-node storage with seamlessly scalable RADOS-based storage. Currently, the project developers plan to turn version 0.48 into version 1.0, which will then receive the official “Ready for Enterprise” stamp. Because the previous version 0.47 has already been released, the enterprise version may be expected soon. Next, I turn to security with CephX.


[1] “RADOS and Ceph” by Martin Loschwitz, ADMIN, Issue 09, pg. 28

The Author

Martin Gerhard Loschwitz is Principal Consultant at hastexo, where he is intensively involved with high-availability solutions. In his spare time, he maintains the Linux cluster stack for Debian GNU/Linux.