Lead Image © maridav, 123RF.com

The SDFS deduplicating filesystem

Slimming System

Article from ADMIN 30/2015

By Tim Schürmann

Deduplicating filesystems like SDFS store redundant data, such as that created by backups, only once, thereby saving valuable disk space. Additionally, the filesystem can distribute the data to be stored across multiple computer nodes.

The SDFS [1] filesystem developed in the scope of the Opendedup project first breaks a file to be stored down into individual data blocks. It then stores only those blocks that do not already exist on disk. In this way, SDFS can also deduplicate only partially identical files. From the outside, users do not see anything of this slimming process: They still see a backup copy although only the original exists on the disk. Of course, SDFS also ensures that the backup copy is not modified when the original is edited.

Block-Based Storage

SDFS optionally stores the data blocks locally, on up to 126 computer nodes on a network, or in the cloud. A built-in load balancer ensures even distribution of the load across nodes. This means that SDFS can handle large volumes of data quickly – given a suitably fast network connection. Whatever the case, SDFS installs itself as a layer on top of the existing filesystem. SDFS will work either with fixed or variable block sizes.

In this way, both structured and unstructured data can be efficiently deduplicated. Additionally, the filesystem can handle files with a block size of 4KB. This is necessary to be able to deduplicate virtual machines efficiently. SDFS discovers identical data blocks by creating a fingerprint in the form of a hash for each block and then comparing the values.

The risk of failure increases of course because each data block exists only once on disk. If a block is defective, all files with this content are too. SDFS can thus redundantly store each data block on up to seven storage nodes. Finally, the filesystem lets you create snapshots of files and directories. SDFS is licensed under the GNU GPLv2 and can thus be used for free in the enterprise. You can view the source code on GitHub [2].

SDFS exclusively supports

...

Use one of the options below to read the full article