Incremental Backups on Linux
A common system task is backing up files – that is, copying files with the ability to go back in time and restore them. For example, if someone erases or overwrites a file but needs the original version, then a backup allows you to go back to a previous version of the file and restore it. In a similar case, if someone is editing code and discovers they need to go back to a version of the program from four days earlier, a backup allows you to do so. The important thing to remember is that backups are all about copies of the data at a certain point in time.
In contrast to backing up is “replication.” A replica is simply a copy of the data when the replication took place. Replication by itself does not allow you to go back in time to retrieve an earlier version of a file. However, if you have a number of replicas of your data created over time, you can sort of go back and retrieve an earlier version of a file, but you need to know when the replica was made, then you can copy the file from that replica.
By definition, replicas can use a great deal of space, because each time a replica is made, the entire filesystem as it exists is copied. If you keep several replicas, you are going to have multiple copies of the same data. For example, if you have a 100TB filesystem that is 20% full, (20TB) and you create a copy, you copy 20TB of data. If you make another copy later when the filesystem uses about 25TB (5TB of data has changed), you copy 25TB of data. If you make a third copy of the data at 30TB (another 5TB of changed data), you copy another 30TB. Through time, then, you have had to use 75TB of space (20 + 25 + 30), with no doubt lots of duplicated data wasting space and money.
The backup world uses a few techniques that differentiate it from replication. The first is called a “full backup” and really is a copy of the data as it existed when the copy was made (point in time). It is a true copy, so if you create a full backup of 20TB of data, you will need another 20TB to store the copy. It’s fairly obvious that making frequent full backups and keeping them around can be very expensive in terms of disk space and money, because it’s exactly the same as replication.
The second backup technique is called an “incremental backup,” which only stores the data that has changed since some point in time (typically the previous backup). This technique can save a large amount of storage capacity because files that have not changed are not backed up. In the simple example of 20TB of data, then, a full backup consumes 20TB, but the incremental backup will only copy 5TB of data rather than the 25TB the replication consumed. The third incremental backup is made relative to the previous incremental backup, so it consumes only 5TB more of space (only 5TB of data has changed since the last backup). This means with a full and incremental backups, I only need 30TB of space rather than the 75TB the replication needed.
A third backup technique, and one I’m not really going to discuss, is a “differential backup.” The incremental backup only copies those files that have changed since the last backup, whether it was full or incremental; however, a differential backup copies all of the files that have changed but only does so against the last full backup, not against an incremental backup.
Backups have some fundamental limitations. For example, if a file is open and being actively written during a backup operation, you will get a version of the file as it existed when it was copied; that is, you will not have a backup of the final data because it was in the process of being written when the backup occurred. Also, any file created and destroyed between backups will not be covered by any backup: If a user creates a file and then erases it before a backup occurs, you can’t restore that file.
Backups in the Linux World
The Linux world has many backup tools: some commercial, some open source, and some homegrown. Although I haven’t tried all of them, or even a large cross section of them, I have used a number backup tools in the past and present. One of my sensitivities is I want to understand how the tools work and how much effort they require. In general, the quality of the tool should also be good enough to allow me to schedule a backup, either full or incremental, easily or restore a file from backup without having to learn a new language or attend a multiday class.
In this article, I’m going to focus on open source tools rather than commercial tools. I have nothing against commercial tools, but I’m more familiar with open source tools. Plus, I want to discuss how you can use one open source tool for backups, even if you didn’t realize you could. That tool is rsync.
Rsync is an administrator’s best friend, and it can be a wonderful tool for doing all kinds of things that admins need to do. If you look at the man pages for rsync, you will see a rather generic description of what it does:
rsync – a fast, versatile, remote (and local) file-copying tool
Although the description is accurate, rsync is much more than a simple copy tool. It can copy files both locally, to/from a remote host using a remote shell (e.g., ssh), or to/from a remote rsync daemon. One of the things that makes rsync unique is that it has a “delta transfer” capability that reduces the amount of data actually transferred. This capability lets it make replicas, copies, and even backups.
Using rsync is fairly easy, but it has a large number of options that make it very flexible and sometimes difficult. A number of tutorial or introductory articles on the web can help you get started with rsync, so I won't try to reproduce what they have done; however in the interest of completeness, I will mention a few things about rsync that are germane to this article. The basic rsync command is:
% rsync <options> <source directory> <target directory>
The source directory is on the system where you execute the rsync command. The target directory can be on a different system (host). A wide range of options can be used, but I’m only going to present a few of them since this isn’t a complete introduction. The options initially most useful for servers are:
- -r – This option is the classic option that recursively copies all files and subdirectories within the source target directory.
– The archive option triggers a number of other rsync options: (archive mode; equals -rlptgoD
- -r – (as mentioned above)
- -l – copy symlinks as symlinks
- -p – preserve permissions
- -t – preserve modification times
- -g – preserve group
- -o – preserve owner
- -D – preserve device files and special files
- -v – increase verbosity
- -z – compress file data during the transfer
- -X – preserve extended attributes
- -A – preserve ACLs (implies -p option)
- -S – handle sparse files efficiently
- --delete – if the file is deleted from the source directory, delete it from the target directory
One additional option that you might want to try is --dry-run , which will do everything but copy the data (great for testing); the --stats option , which outputs some file transfer statistics; and the --progress option, which indicates the progress of the copy operation.
Something else to pay attention to is mapping the UID and GIDs on the host system (the source) to the backup system (the target). With rsync, you either keep the numeric UID and GIDs from the source system or you can have rsync try to use the UID and GIDs from the target system. Personally, I like to keep using the numeric UID and GIDs from the host system, because if I restore a file, I don't run the risk of getting the wrong UID and GIDs. The -p , -g , and -o options previously mentioned do this.
The following output is a quick rsync example from my home system. On my desktop I have the directory /home/laytonjb/TEST/SOURCE that I want to copy to my central server. I’ll use rsync to perform the file copy. (Note that the command line spans two lines because it is fairly long.)
[laytonjb@home4 TEST]$ rsync -ravzX --delete /home/laytonjb/TEST/SOURCE/ \ laytonjb@test8:/home/laytonjb/TEST/ email@example.com's password: sending incremental file list ./ HPCTutorial.pdf Open-MPI-SC13-BOF.pdf PrintnFly_Denver_SC13.pdf easybuild_Python-BoF-SC12-lightning-talk.pdf sent 10651918 bytes received 91 bytes 1638770.62 bytes/sec total size is 12847115 speedup is 1.21 [laytonjb@test8 ~]$ cd TEST [laytonjb@test8 TEST]$ ls -s total 12556 1592 easybuild_Python-BoF-SC12-lightning-talk.pdf 532 HPCTutorial.pdf 7784 Open-MPI-SC13-BOF.pdf 2648 PrintnFly_Denver_SC13.pdf
The rsync command copied over the files in the directory using a number of the options previously discussed. I used SSH to copy the files to the remote server, since I specified the user login and machine name as the destination. However, given the current climate around accessibility to data on the network, I like to make it as difficult as possible to capture my data. To this end, you might want to read the sidebar on hardening SSH.
Rsync Incremental Backups
Typically people limit the number of full backups because of the extra space, and therefore the extra cost, needed for storage and the time it takes – an incremental backup is faster than doing a full backup. The time to make the backups can also affect your costs because you have to buy faster backup devices so that the backup can complete before the next one starts.
As an example of an incremental backup, assume you have a directory named SOURCE against which you do a full backup to directory SOURCE.FULL . The next backup will be an incremental backup that only contains those files from the directory SOURCE that are different from the files in SOURCE.FULL . Call this backup SOURCE.1 . The size of SOURCE.1 should always be smaller than SOURCE.FULL – much smaller unless a great deal of data has changed.
The next backup can also be another incremental backup; call it SOURCE.2 . This backup contains the files in the SOURCE directory that have changed relative to the previous incremental backup, SOURCE.1 , not relative to the full backup SOURCE.FULL (which would be a differential backup and would use more space). You can repeat this process with incremental backups ad nauseam.
A number of articles around the web discuss how to do full and incremental backups using rsync, but as an admin I think it’s important that you understand the process. Therefore, I will go through one of the first and best examples of using rsync for incremental backups. From what I can tell the first article to use rsync for full and incremental backups used hard links to create a very simple full and incremental backup solution. The use of hard links allows the backups to save space. This article by Mike Rubel is definitely worth a read, but I’m going to walk through the basic concepts and the sample backup script.
The approach used by Rubel has several advantages. The first is that the most recent backup, backup.0 , always contains the full backup and backup.1 through backup. < N >, where <N> is the last incremental backup kept , are the incremental backups. Therefore, if you want the latest version of a file, you go to backup.0 , and if you need an earlier version of the file, you just work your way through the other backups.
The second advantage is that rsync uses a filesystem as a backup target. Ideally you would like a backup to be user accessible so, as an admin, you don’t have to spend a great deal of time responding to restore requests (I remember doing a great deal of this, and it was very tedious). If the backups are mounted as regular read-only filesystems, then the user can copy the file they need with little concern for damaging the backup. I think over time, they would actually appreciate having the backups online so they can grab the files they need, but it will be a little bit of a bumpy road to get there. You can even “sell” the approach to users as a self-service file restore.
The backup process described by Rubel allows each backup to appear like a full backup, even though it’s only incremental. The key is the use of hard links. In case you've forgotten or don’t know about hard links, see the sidebar titled “Hard Link Review.”