If you are an intensive, or even a typical, computer user, you store an amazing amount of data on your personal systems, servers, and HPC systems that you rarely touch. SquashFS is an underestimated filesystem that can help you keep that needed, but rarely used, data.

SquashFS

As part of my life experience, I have discovered that people like to keep pretty much every piece of data that’s crossed their hard drive. That is, the rm command is rarely, if ever, used. I am no exception. I have lots of data that I want to keep available, yet rarely touch.

Even though you can now get 10TB hard drives and HPC systems routinely have more than 1PB of storage, it is still fairly easy to run out of space. Users don’t take the time to compress their data files to save space, possibly with good reason: A compressed file has to be uncompressed and then examined to discover its contents, and if the data is still needed, the file has to be compressed again. Several commands must be run just to examine the data.
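
For example, examining a compressed data file typically means a round trip like the following (results.csv.gz is a hypothetical file):

$ gunzip results.csv.gz    # uncompress it
$ head results.csv         # examine the contents
$ gzip -9 results.csv      # compress it again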

More useful still would be a compressed filesystem. Linux has several options, including the definitely underestimated SquashFS.

Compressing Data

The concept behind data compression, which has been around for a long time, is to encode data by using various techniques that save storage space. Data compression also reduces the size of data that is to be transmitted. The most common example in Linux is gzip, which is used to compress data files. Here’s a quick example illustrating the change in file size:

$ ls -lsah FS_scan.csv 
3.2M -rw-r--r-- 1 laytonjb laytonjb 3.2M 2014-06-09 20:31 FS_scan.csv
$ gzip -9 FS_scan.csv 
$ ls -lsah FS_scan.csv.gz 
268K -rw-r--r-- 1 laytonjb laytonjb 261K 2014-06-09 20:31 FS_scan.csv.gz

The original file is 3.2MB, but after using gzip with the -9 option (i.e., maximum compression), the resulting file is 268KB. The .gz extension indicates that the file has been compressed with gzip. The compression ratio, which is the ratio of the original size to the compressed size, is 11.9:1. This ratio depends strongly on the uncompressed data (i.e., how compressible it is) and on the compression algorithm.
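
Incidentally, gzip can report these numbers itself: gzip -l lists the compressed size, the uncompressed size, and the percentage of space saved for a .gz file:

$ gzip -l FS_scan.csv.gz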

Not all data compresses as well as a CSV file, which is pure text and highly compressible. Binary data files usually cannot be compressed as much because their encoding is already fairly compact.

A huge number of compression algorithms can take data and find a new encoding that results in much smaller files. Classically, this involves looking for patterns in the data. The algorithms vary in how they search for patterns and how they create the encoding. Some require a great deal of memory to store the various patterns, and some require lots of CPU time, so you have to find a balance between the compression level, the amount of memory required, and the time it takes to complete the compression. However, the goal of all compression programs remains the same: reduce the size of data to save space.
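
One quick way to see these trade-offs on your own data is to time the same file at different compression levels (a sketch; bigfile.dat is a hypothetical input):

$ time gzip -1 -c bigfile.dat > fast.gz    # fastest, least compression
$ time gzip -9 -c bigfile.dat > best.gz    # slowest, most compression
$ ls -lsh fast.gz best.gz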

The primary trade-off for compressed filesystems is the CPU cycles and time spent compressing and uncompressing data in return for reduced storage space. If you have the cycles and don’t need a fast filesystem, you can save space. Alternatively, if your memory or storage space is severely constrained, a compressed filesystem may be your only choice. Severely constrained storage is most common in embedded systems.

Compressed Filesystems in Linux

Through the growth and development of Linux, a fair number of filesystems have focused on compressing data, including:

  • cramfs
  • cloop
  • e2compr
  • JFFS2
  • UBIFS
  • SquashFS

Other filesystems such as Btrfs, ZFS, and ReiserFS have compression capability, but they have had to make some serious compromises. So that filesystem performance is not overly penalized, they cannot use compression methods that take a great deal of time, and they can only compress the data chunks they are given. Although they achieve some level of compression, they cannot reach the levels of filesystems that focus on compression, such as those in the list above.

SquashFS

SquashFS is a compressed read-only filesystem for Linux. It takes data and creates something like a compressed “archive” that can be mounted on Linux systems. You can then read data from the filesystem as needed, but you can’t write to it.
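
As a minimal sketch (the paths and image name here are hypothetical), you create the compressed image with mksquashfs and then mount it like any other filesystem:

$ mksquashfs /data/olddata olddata.sqsh        # build the compressed image
$ sudo mount -t squashfs -o loop olddata.sqsh /mnt/olddata
$ ls /mnt/olddata                              # browse it read-only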

The exciting thing about SquashFS is that it has a wide variety of compression algorithms available:

  • gzip (zlib)
  • LZMA
  • LZO
  • LZ4
  • XZ
  • Zstandard (zstd)

You can experiment with all of them to find the one that compresses the most, the fastest, or according to whatever metric you value.
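
For example, mksquashfs selects the algorithm with the -comp option, so you can build the same tree several times and compare sizes and build times (a sketch with hypothetical paths; which compressors are available depends on how your kernel and squashfs-tools were built):

$ mksquashfs /data data-gzip.sqsh -comp gzip
$ mksquashfs /data data-xz.sqsh -comp xz
$ mksquashfs /data data-zstd.sqsh -comp zstd
$ ls -lsh data-*.sqsh    # compare image sizes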

SquashFS has been in the kernel for a long time (since 2.6.29), and the tools for managing SquashFS are available in almost all Linux distributions. Some of its many features are summarized here:

  • Maximum filesystem size is 2^64 bytes.
  • Maximum file size is 2TiB.
  • Can be NFS-exported (read-only).
  • Compresses metadata.
  • Has extended file attribute (xattr) support (but not ACL support).
  • Supports an unlimited number of directories, files, and entries per directory.
  • Can support sparse files.
  • Can use a 1MiB block size (default is 128KiB).

The larger block sizes can produce greater compression ratios (i.e., smaller filesystem images), although using block sizes other than the typical 4KiB has created some difficulties for SquashFS.
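
To experiment with larger blocks, pass the block size to mksquashfs with the -b option (a sketch with hypothetical paths; older versions of squashfs-tools want the size in bytes, i.e., -b 1048576):

$ mksquashfs /data data-1M.sqsh -comp xz -b 1M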

SquashFS does not decompress metadata and fragment blocks into the kernel pagecache; instead, it maintains two small caches of its own, one for metadata and one for fragments. (File data blocks are decompressed and cached in the kernel pagecache in the typical fashion.) SquashFS packs metadata and fragments together into blocks for maximum compression, so a read of one piece of metadata or one fragment also retrieves neighboring metadata and fragment data. Rather than discard this additional information, SquashFS decompresses it and places it in its temporary caches so it does not have to be retrieved and decompressed again on a near-future access.