Lead Image © maridav, 123RF.com

Lead Image © maridav, 123RF.com

The SDFS deduplicating filesystem

Slimming System

Article from ADMIN 30/2015
Deduplicating filesystems like SDFS store redundant data, such as that created by backups, only once, thereby saving valuable disk space. Additionally, the filesystem can distribute the data to be stored across multiple computer nodes.

The SDFS [1] filesystem developed in the scope of the Opendedup project first breaks a file to be stored down into individual data blocks. It then stores only those blocks that do not already exist on disk. In this way, SDFS can also deduplicate only partially identical files. From the outside, users do not see anything of this slimming process: They still see a backup copy although only the original exists on the disk. Of course, SDFS also ensures that the backup copy is not modified when the original is edited.

Block-Based Storage

SDFS optionally stores the data blocks locally, on up to 126 computer nodes on a network, or in the cloud. A built-in load balancer ensures even distribution of the load across nodes. This means that SDFS can handle large volumes of data quickly – given a suitably fast network connection. Whatever the case, SDFS installs itself as a layer on top of the existing filesystem. SDFS will work either with fixed or variable block sizes.

In this way, both structured and unstructured data can be efficiently deduplicated. Additionally, the filesystem can handle files with a block size of 4KB. This is necessary to be able to deduplicate virtual machines efficiently. SDFS discovers identical data blocks by creating a fingerprint in the form of a hash for each block and then comparing the values.

The risk of failure increases of course because each data block exists only once on disk. If a block is defective, all files with this content are too. SDFS can thus redundantly store each data block on up to seven storage nodes. Finally, the filesystem lets you create snapshots of files and directories. SDFS is licensed under the GNU GPLv2 and can thus be used for free in the enterprise. You can view the source code on GitHub [2].

SDFS exclusively supports systems with 64-bit Linux on an x86 architecture. Although a Windows version [3] is under development, it was still in beta when this issue went to press and is pretty much untested. Additionally, the SDFS developers provide a ready-made appliance in the OVA exchange format [4]. This virtual machine offers a NAS that when launched deduplicates the supplied data with SDFS and stores it on its (virtual) disk. A look under the hood reveals Ubuntu underpinnings.

Installing SDFS

To install SDFS on an existing Linux system, start by using your package manager to install the Java Runtime Environment (JRE), Version 7 or newer. The free OpenJDK is fine as well. On Ubuntu, it resides in the openjdk-7-jre-headless package, whereas Red Hat and CentOS users need to install the java-1.7.0-openjdk package.

If you have Ubuntu Linux version 14.04 or newer, Red Hat 7, or CentOS 7, you can now download the package for your distribution [4]. The Red Hat package is also designed for CentOS. Then, you just need to install the package; on Ubuntu, for example, type:

sudo dpkg -i sdfs-2.0.11_amd64.deb

The Red Hat and CentOS counterpart is:

rpm -iv --force SDFS-2.0.11-2.x86_64.rpm

Users on other distributions need to go to the Linux (Intel 64bit) section on the SDFS download page [4] and look for SDFS Binaries . Download and unpack the tarball. The tools here can then be called directly, or you can copy all the files manually to a matching system directory. In any case, the files in the subdirectory etc/sdfs belong in the directory /etc/sdfs.

After completing the SDFS install, you need to increase the maximum number of simultaneously open files. To do so, open a terminal window, become root (use sudo su on Ubuntu), then run the two following commands:

echo "* hardnofile 65535" > /etc/security/limits.conf
echo "* soft nofile 65535" > /etc/security/limits.conf

On Red Hat and CentOS systems, you also need to disable the firewall:

service iptables save
service iptables stop
chkconfig iptables off

SDFS uses a modified version of FUSE, which the filesystem already includes. Any FUSE components you installed previously are not affected. FUSE lets you run filesystem drivers in user mode (i.e., like normal programs) [5].

Management via Volumes

SDFS only deduplicates files that reside on volumes. Volumes are virtual, deduplicate drives. You can create a new volume with the following command:

mkfs.sdfs --volume-name=pool0 --volume-capacity=256GB

In this example, it goes by the name pool0 and can store a maximum of 256GB (Figure 1). You need to be root to run this command and all the following SDFS commands; Ubuntu users thus need to prepend sudo. For the volume you just created, SDFS uses a fixed block size of 4KB.

Figure 1: After creating the pool0 volume, you can mount it in /media/pool0.

If you want the filesystem to use a variable block size for deduplication instead, you need to add the --hash-type= VARIABLE_MURMUR3 parameter to the previous command. On the physical disk, the volume only occupies the amount of space that the deduplicated data it contains actually occupies. SDFS volumes can also be exported via iSCSI and NFSv3.

To be able to populate the volume with files, you first need to mount it. The following commands create the /media/pool0 directory and mount the volume with the name pool0:

mkdir /media/pool0
mount.sdfs pool0 /media/pool0/ &

All files copied into the /media/pool0 directory from this point on are automatically deduplicated by SDFS in the background. When done, you can unmount the volume like any other:

umount /media/ pool0

The & that follows mount.sdfs really is necessary, by the way: The mount.sdfs script launches a Java program that monitors the directory and handles the actual dedup. On terminating, the volume automatically unmounts. The & sends the Java program to the background. It continues running there until the administrator unmounts the volume again. In our lab, the Java program did not always launch reliably. If you add an &, you will thus want to check that the volume mounted correctly to be on the safe side. For information on a mounted volume, type, enter

sdfscli --volume-info

(Figure 2). The sdfscli command lets you retroactively grow a volume. The following command expands the volume to 512GB:

sdfscli --expandvolume 512GB
Figure 2: Information about a mounted volume courtesy of the sdfscli command.

SDFS saves the data stored on pool0 in the /opt/sdfs/volumes/pool0/ subdirectory. The subdirectories at this point contain the actual data blocks and the metadata that SDFS relies on to reconstruct the original files.

Do not make any changes here unless you want to lose all your stored files. The storage location for the data blocks can only be changed to another directory on creating the volume. You need to pass in the complete new path to mkfs.sdfs with the --base-path= parameter. For example, the command

mkfs.sdfs --volume-name=pool0 --volume-capacity=256GB \

would store the data blocks in /var/pool0.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus