Incremental Backups on Linux

Although commercial Linux backup tools are available, many people prefer open source to better understand and control the backup process. One open source tool that can do both full and incremental backups is rsync.

A common system task is backing up files – that is, copying files with the ability to go back in time and restore them. For example, if someone erases or overwrites a file but needs the original version, then a backup allows you to go back to a previous version of the file and restore it. In a similar case, if someone is editing code and discovers they need to go back to a version of the program from four days earlier, a backup allows you to do so. The important thing to remember is that backups are all about copies of the data at a certain point in time.

The counterpart to a backup is “replication.” A replica is simply a copy of the data as it existed when the replication took place. Replication by itself does not allow you to go back in time to retrieve an earlier version of a file. However, if you have a number of replicas of your data created over time, you can approximate this: If you know when each replica was made, you can copy the earlier version of the file from the appropriate replica.

By definition, replicas can use a great deal of space, because each time a replica is made, the entire filesystem as it exists is copied. If you keep several replicas, you are going to have multiple copies of the same data. For example, if you have a 100TB filesystem that is 20% full (20TB) and you create a copy, you copy 20TB of data. If you make another copy later when the filesystem uses about 25TB (5TB of data has changed), you copy 25TB of data. If you make a third copy of the data at 30TB (another 5TB of changed data), you copy another 30TB. Through time, then, you have had to use 75TB of space (20 + 25 + 30), with no doubt lots of duplicated data wasting space and money.

The backup world uses a few techniques that differentiate it from replication. The first is called a “full backup” and really is a copy of the data as it existed when the copy was made (point in time). It is a true copy, so if you create a full backup of 20TB of data, you will need another 20TB to store the copy. It’s fairly obvious that making frequent full backups and keeping them around can be very expensive in terms of disk space and money, because it’s exactly the same as replication.

The second backup technique is called an “incremental backup,” which only stores the data that has changed since some point in time (typically the previous backup). This technique can save a large amount of storage capacity because files that have not changed are not backed up. In the simple example of 20TB of data, then, a full backup consumes 20TB, but the first incremental backup will only copy 5TB of data rather than the 25TB the second replica consumed. The second incremental backup is made relative to the first incremental backup, so it consumes only 5TB more of space (only 5TB of data has changed since the last backup). This means that with one full backup and two incremental backups, I only need 30TB of space rather than the 75TB the replication needed.
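
The arithmetic behind the two totals can be checked with a few lines of shell. This is only a sketch of the example above; the variable names are made up for illustration, and the sizes are the ones quoted in the text.

```shell
#!/bin/sh
# Sizes in TB, taken from the example in the text.
full=20        # data present when the first copy is made
change1=5      # data added or changed before the second copy
change2=5      # data added or changed before the third copy

# Replication: every copy stores everything that exists at that moment.
replica_total=$(( full + (full + change1) + (full + change1 + change2) ))

# One full backup plus incrementals: later backups store only the changes.
incremental_total=$(( full + change1 + change2 ))

echo "three replicas:          ${replica_total}TB"
echo "full + two incrementals: ${incremental_total}TB"
echo "space saved:             $(( replica_total - incremental_total ))TB"
```

With the numbers from the text, the script reports 75TB for the replicas and 30TB for the full plus incremental backups.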

A third backup technique, and one I’m not really going to discuss, is a “differential backup.” The incremental backup only copies those files that have changed since the last backup, whether it was full or incremental; however, a differential backup copies all of the files that have changed but only does so against the last full backup, not against an incremental backup.

Backups have some fundamental limitations. For example, if a file is open and being actively written during a backup operation, you will get a version of the file as it existed when it was copied; that is, you will not have a backup of the final data because it was in the process of being written when the backup occurred. Also, any file created and destroyed between backups will not be covered by any backup: If a user creates a file and then erases it before a backup occurs, you can’t restore that file.

Backups in the Linux World

The Linux world has many backup tools: some commercial, some open source, and some homegrown. Although I haven’t tried all of them, or even a large cross section of them, I have used a number of backup tools in the past and present. One of my sensitivities is that I want to understand how the tools work and how much effort they require. In general, the tool should also be good enough that I can easily schedule a backup, either full or incremental, or restore a file from backup without having to learn a new language or attend a multiday class.

In this article, I’m going to focus on open source tools rather than commercial tools. I have nothing against commercial tools, but I’m more familiar with open source tools. Plus, I want to discuss how you can use one open source tool for backups, even if you didn’t realize you could. That tool is rsync.

Rsync

Rsync is an administrator’s best friend, and it can be a wonderful tool for doing all kinds of things that admins need to do. If you look at the man pages for rsync, you will see a rather generic description of what it does:

rsync – a fast, versatile, remote (and local) file-copying tool

Although the description is accurate, rsync is much more than a simple copy tool. It can copy files locally, to/from a remote host using a remote shell (e.g., ssh), or to/from a remote rsync daemon. One of the things that makes rsync unique is that it has a “delta transfer” capability that reduces the amount of data actually transferred. This capability lets it make replicas, copies, and even backups.

Using rsync is fairly easy, but it has a large number of options that make it very flexible and sometimes difficult. A number of tutorial or introductory articles on the web can help you get started with rsync, so I won't try to reproduce what they have done; however, in the interest of completeness, I will mention a few things about rsync that are germane to this article. The basic rsync command is:

% rsync [options] <source directory> <target directory>

The source directory is on the system where you execute the rsync command. The target directory can be on a different system (host). A wide range of options can be used, but I’m only going to present a few of them since this isn’t a complete introduction. The options initially most useful for servers are:

  • -r – This option is the classic option that recursively copies all files and subdirectories within the source directory.
  • -a – The archive option is shorthand for a number of other rsync options (archive mode; equals -rlptgoD):
    • -r – (as mentioned above)
    • -l – copy symlinks as symlinks
    • -p – preserve permissions
    • -t – preserve modification times
    • -g – preserve group
    • -o – preserve owner
    • -D – preserve device files and special files
  • -v – increase verbosity
  • -z – compress file data during the transfer
  • -X – preserve extended attributes
  • -A – preserve ACLs (implies -p option)
  • -S – handle sparse files efficiently
  • --delete – if the file is deleted from the source directory, delete it from the target directory

Additional options that you might want to try are --dry-run, which does everything but copy the data (great for testing); --stats, which outputs some file transfer statistics; and --progress, which indicates the progress of the copy operation.

Something else to pay attention to is mapping the UIDs and GIDs on the host system (the source) to the backup system (the target). With rsync, you can either keep the numeric UIDs and GIDs from the source system or have rsync map user and group names to the UIDs and GIDs of the target system. Personally, I like to keep the numeric UIDs and GIDs from the host system, because if I restore a file, I don't run the risk of getting the wrong UIDs and GIDs. The -g and -o options previously mentioned preserve group and owner; by default rsync matches them by name where it can, and adding the --numeric-ids option makes it keep the numeric values instead.

The following output is a quick rsync example from my home system. On my desktop I have the directory /home/laytonjb/TEST/SOURCE that I want to copy to my central server. I’ll use rsync to perform the file copy. (Note that the command line spans two lines because it is fairly long.)

[laytonjb@home4 TEST]$ rsync -ravzX --delete /home/laytonjb/TEST/SOURCE/ \
 laytonjb@test8:/home/laytonjb/TEST/
laytonjb@192.168.1.250's password: 
sending incremental file list
./
HPCTutorial.pdf
Open-MPI-SC13-BOF.pdf
PrintnFly_Denver_SC13.pdf
easybuild_Python-BoF-SC12-lightning-talk.pdf

sent 10651918 bytes received 91 bytes 1638770.62 bytes/sec
total size is 12847115 speedup is 1.21

[laytonjb@test8 ~]$ cd TEST
[laytonjb@test8 TEST]$ ls -s
total 12556
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
 532 HPCTutorial.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

The rsync command copied over the files in the directory using a number of the options previously discussed. I used SSH to copy the files to the remote server, since I specified the user login and machine name as the destination. However, given the current climate around accessibility to data on the network, I like to make it as difficult as possible to capture my data. To this end, you might want to read the sidebar on hardening SSH.

Sidebar: Hardening SSH

Although not strictly a part of rsync, ssh can be a part of any rsync backup solution to a separate host. Given today’s environment, in which data can be acquired through various means without your knowledge, it behooves you to pay attention to the security of the data transmission and use a method to “harden” or improve the security of SSH.

If you google for articles about hardening SSH, you will find quite a few. From these articles here are a few tips I have collected:

  • Disable SSH protocol 1
  • Reduce the grace time (time to login)
  • Use TCP wrappers (always good to check)
  • Increase key strength (maybe go to 2048-bit keys)
  • Check the defaults and disable a few options (there are several, so read a few of the articles to generate your own list)

These changes should help harden SSH – but remember, I’m not a security expert.

Rsync Incremental Backups

Typically people limit the number of full backups because of the extra space, and therefore the extra cost, needed for storage and the time it takes – an incremental backup is faster than doing a full backup. The time to make the backups can also affect your costs because you have to buy faster backup devices so that the backup can complete before the next one starts.

As an example of an incremental backup, assume you have a directory named SOURCE against which you do a full backup to directory SOURCE.FULL. The next backup will be an incremental backup that only contains those files from the directory SOURCE that are different from the files in SOURCE.FULL. Call this backup SOURCE.1. The size of SOURCE.1 should always be smaller than SOURCE.FULL (much smaller, unless a great deal of data has changed).

The next backup can also be another incremental backup; call it SOURCE.2. This backup contains the files in the SOURCE directory that have changed relative to the previous incremental backup, SOURCE.1, not relative to the full backup SOURCE.FULL (which would be a differential backup and would use more space). You can repeat this process with incremental backups ad nauseam.

A number of articles around the web discuss how to do full and incremental backups using rsync, but as an admin I think it’s important that you understand the process. Therefore, I will go through one of the first and best examples of using rsync for incremental backups. From what I can tell the first article to use rsync for full and incremental backups used hard links to create a very simple full and incremental backup solution. The use of hard links allows the backups to save space. This article by Mike Rubel is definitely worth a read, but I’m going to walk through the basic concepts and the sample backup script.

The approach used by Rubel has several advantages. The first is that the most recent backup, backup.0, always contains the full backup and backup.1 through backup.<N>, where <N> is the last incremental backup kept, are the incremental backups. Therefore, if you want the latest version of a file, you go to backup.0, and if you need an earlier version of the file, you just work your way through the other backups.

The second advantage is that rsync uses a filesystem as a backup target. Ideally you would like a backup to be user accessible so, as an admin, you don’t have to spend a great deal of time responding to restore requests (I remember doing a great deal of this, and it was very tedious). If the backups are mounted as regular read-only filesystems, then the user can copy the file they need with little concern for damaging the backup. I think over time, they would actually appreciate having the backups online so they can grab the files they need, but it will be a little bit of a bumpy road to get there. You can even “sell” the approach to users as a self-service file restore.

The backup process described by Rubel allows each backup to appear like a full backup, even though it’s only incremental. The key is the use of hard links. In case you've forgotten or don’t know about hard links, see the sidebar titled “Hard Link Review.”

Sidebar: Hard Link Review

Hard links are an easy way to “link” or associate directory entries with an inode (a file). You can have any number of hard links to a specific file. Here is a quick example of creating a hard link:

[laytonjb@home4 TEST]$ echo "foo" > a
[laytonjb@home4 TEST]$ ln a b

You can check that files a and b are hard linked by using the -i option with the ls command (the inode associated with the file):

[laytonjb@home4 TEST]$ ls -i a
45220043 a
[laytonjb@home4 TEST]$ ls -i b
45220043 b

Notice that the two files have the same inode number, so they are really the same file. You can even remove file a, and file b will still exist with the same contents:

[laytonjb@home4 TEST]$ rm a
[laytonjb@home4 TEST]$ ls -i a
ls: cannot access a: No such file or directory
[laytonjb@home4 TEST]$ ls -i b
45220043 b

For more details about what is happening, you can use the stat command.
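
As a small sketch of what stat can show here, the following repeats the a/b example in a scratch directory and prints the link count alongside the inode number. It assumes GNU stat, whose -c option takes a format string; other platforms use different options.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work"

echo "foo" > a
ln a b                           # second directory entry for the same inode

# %h = number of hard links, %i = inode number, %n = file name (GNU stat)
stat -c '%h %i %n' a b

links_before=$(stat -c '%h' b)   # 2 while both names exist
rm a
links_after=$(stat -c '%h' b)    # 1 once a is gone; the data is still there
echo "links: $links_before -> $links_after, contents: $(cat b)"
```

The data is released only when the link count drops to zero, which is why removing a leaves b intact.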

Sample Script

Rubel’s article posted a sample script for creating three incremental backups, as well as a full backup. The basic script is very simple, yet it has a great deal of power in a few lines:

rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
cp -al backup.0 backup.1
rsync -a --delete source_directory/  backup.0/

To better understand the script, I’ll use it on a simple example in which I create a sample directory in my account /home/laytonjb/TEST/SOURCE that I want to back up. To begin, I’ll put a single file in the directory and then run through the script. Next, I will add files to the directory, simulating the creation of new data, and keep running through the script, tracking what happens with the backups. I will also delete a file so you can see what happens in the backups.

For the first pass through the script, only one file is in the directory to be backed up. The output from the script is:

[laytonjb@home4 TEST]$ ls -s SOURCE/
total 7784
7784 Open-MPI-SC13-BOF.pdf
[laytonjb@home4 TEST]$ du -sh
7.7M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
mv: cannot stat `backup.2': No such file or directory
[laytonjb@home4 TEST]$ mv backup.1 backup.2
mv: cannot stat `backup.1': No such file or directory
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
cp: cannot stat `backup.0': No such file or directory
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0
[laytonjb@home4 TEST]$ ls -s
total 8
4 backup.0/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
16M .
[laytonjb@home4 TEST]$ du -sh SOURCE
7.7M SOURCE
[laytonjb@home4 TEST]$ du -sh backup.0
7.7M backup.0

This is the first pass through the script, so only the directory backup.0 is created and it is the same size as the SOURCE directory. You can think of this as a full backup of the directory.

To better understand what is happening with the backups, I’ll track the inode number of the files in the SOURCE subdirectory as well as the files in the four backup directories using ls -i (Table 1).

Table 1: inode Numbers After First Rsync

File                    SOURCE     backup.0   backup.1   backup.2   backup.3
Open-MPI-SC13-BOF.pdf   45220041   45220196   NA         NA         NA

Notice that the file has two different inode numbers, one in the SOURCE subdirectory and one in the first backup subdirectory backup.0. This indicates that they are two different files (one is a copy of the other). The directory backup.0 is a “snapshot” of the SOURCE subdirectory, and the file in that directory is real, not a hard link. You can confirm this by running the command

stat Open-MPI-SC13-BOF.pdf

in the backup.0 directory and looking for the Links: output, which should be 1. Also notice that if the backups don’t exist or if the file doesn’t exist in the backup directory, the inode number will be listed as NA.

Before executing the script a second time, I copied a second file into the SOURCE directory to serve the purpose of a new file being created. The output from running the backup script is:

[laytonjb@home4 TEST]$ cd SOURCE
[laytonjb@home4 SOURCE]$ ls -s
total 9376
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
7784 Open-MPI-SC13-BOF.pdf
[laytonjb@home4 SOURCE]$ du -sh
9.2M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
mv: cannot stat `backup.2': No such file or directory
[laytonjb@home4 TEST]$ mv backup.1 backup.2
mv: cannot stat `backup.1': No such file or directory
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 12
4 backup.0/ 4 backup.1/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
19M .
[laytonjb@home4 TEST]$ du -sh SOURCE/
9.2M SOURCE/
[laytonjb@home4 TEST]$ du -sh backup.0
9.2M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
7.7M backup.1

Notice that I now have two subdirectories: backup.0 and backup.1. The cp -al command copies the files in backup.0 to backup.1 using hard links instead of actually copying the files, then the rsync command copies only the new files into backup.0 and deletes any files from backup.0 that have been deleted from SOURCE. Therefore, you will see one file in backup.1 (the oldest backup), and all of the current files in backup.0. The directory backup.0 becomes the most recent snapshot (full backup) of the SOURCE directory, and backup.1 becomes the incremental backup relative to backup.0.

The command ls -i is used to examine the inodes of the files in the two backup directories and the SOURCE directory (Table 2).

Table 2: inode Numbers After Second Rsync

File                                          SOURCE     backup.0   backup.1   backup.2   backup.3
Open-MPI-SC13-BOF.pdf                         45220041   45220196   45220196   NA         NA
easybuild_Python-BoF-SC12-lightning-talk.pdf  45220199   45220206   NA         NA         NA

This is the first time an incremental backup has been made. In Table 2, notice that the inode number of the first file is the same in both backups, which means the file is really only stored once with a hard link to it, saving time, space, and money. Because of the hard link, no extra data is required. To better understand this, you can run the stat command against the files in the two backup directories.
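
A sketch of that stat check, using a small stand-in file in a scratch directory rather than the PDFs from my example, might look like the following. It assumes GNU cp and GNU stat, and the file name talk.pdf is a placeholder.

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir "$work/backup.0"
echo "slides" > "$work/backup.0/talk.pdf"     # stand-in for a real file

# Hard-link copy, as in the rotation script (GNU cp).
cp -al "$work/backup.0" "$work/backup.1"

# %h = link count, %i = inode, %n = name; both lines show the same inode.
stat -c '%h %i %n' "$work/backup.0/talk.pdf" "$work/backup.1/talk.pdf"

inode0=$(stat -c '%i' "$work/backup.0/talk.pdf")
inode1=$(stat -c '%i' "$work/backup.1/talk.pdf")
links=$(stat -c '%h' "$work/backup.0/talk.pdf")
[ "$inode0" = "$inode1" ] && echo "same inode, $links links"

# Caution: a write through one name is visible through the other as well.
echo "edited" >> "$work/backup.1/talk.pdf"
cat "$work/backup.0/talk.pdf"
```

The last two commands show the flip side of sharing an inode: appending to the file in backup.1 also changes what backup.0 sees, which is a good reason to expose backups to users read-only.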

To see whether I am actually saving space, I can examine the space used in the two backup directories and the SOURCE directory. The SOURCE directory reports using 9.2MB; backup.0, the most recent snapshot, also reports using 9.2MB (as it should), and backup.1, the previous backup, reports using only 7.7MB (as it should). This is a total of 26.1MB. However, when I ran the du -sh command in the root of the tree, it reported only 19MB. The difference is a result of using hard links in the backup process, saving storage space.
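
The difference between the per-directory numbers and the total comes from how du treats hard links: within a single invocation it counts a hard-linked file only once. The sketch below reproduces this in a scratch directory with one 1MiB file; it assumes GNU du (for the -l option, which counts every link) and GNU cp, and the file name data.bin is a placeholder.

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir "$work/backup.0"
# 1MiB of random data so the file occupies real blocks.
head -c 1048576 /dev/urandom > "$work/backup.0/data.bin"
cp -al "$work/backup.0" "$work/backup.1"

# Separate invocations: each directory is charged the full file size.
du -sk "$work/backup.0"
du -sk "$work/backup.1"

# One invocation over the whole tree: the hard-linked file is counted once.
once=$(du -sk "$work" | cut -f1)
# -l (--count-links, GNU du) counts the file again for every link.
per_link=$(du -skl "$work" | cut -f1)
echo "counted once: ${once}K, counted per link: ${per_link}K"
```

The first figure corresponds to the space actually consumed on disk; the second is what you get by adding up the per-directory reports.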

Now I’ll add a third file to the SOURCE directory and run the backup script for a third time.

[laytonjb@home4 TEST]$ cd SOURCE
[laytonjb@home4 SOURCE]$ ls -s
total 12024
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf 
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf
[laytonjb@home4 SOURCE]$ du -sh
12M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
mv: cannot stat `backup.2': No such file or directory
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0
[laytonjb@home4 TEST]$ ls -s
total 16
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
24M .
[laytonjb@home4 TEST]$ du -sh SOURCE/
12M SOURCE/
[laytonjb@home4 TEST]$ du -sh backup.0
12M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
9.2M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
7.7M backup.2

Notice how the sizes of the backup directories decrease as the “backup count” increases, indicating which are the incremental backup directories. (Remember that backup.0 is the full backup at any time.)

The size of the SOURCE directory is 12MB, as is the reported size of backup.0. Also notice that the reported size of backup.1 is 9.2MB, and the reported size of backup.2 is 7.7MB, as expected. Added together, these come to 40.9MB, but du -sh reports the actual space used is 24MB (about 59% of that total). I love hard links!

To understand what is happening with the backups, I’ll again tabulate the inodes of the files in the various backups (Table 3).

Table 3: inode Numbers After Third Rsync

File                                          SOURCE     backup.0   backup.1   backup.2   backup.3
Open-MPI-SC13-BOF.pdf                         45220041   45220196   45220196   45220196   NA
easybuild_Python-BoF-SC12-lightning-talk.pdf  45220199   45220206   45220206   NA         NA
PrintnFly_Denver_SC13.pdf                     45220217   45220219   NA         NA         NA

Notice how the file Open-MPI-SC13-BOF.pdf has the same inode in all three backup directories. This indicates the file is only stored once, with three hard links (one directory entry per backup) pointing to it. You can verify this by using the stat command against the file in backup.0; you should see that the Links: value is 3. You can also check the stat output for the file easybuild_Python-BoF-SC12-lightning-talk.pdf, which should have a Links: value of 2.

After adding a fourth file to SOURCE, I run the backup script again:

[laytonjb@home4 TEST]$ cd SOURCE
[laytonjb@home4 SOURCE]$ ls -s
total 14116
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf 
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf 
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf
[laytonjb@home4 SOURCE]$ du -sh
14M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 20
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 backup.3/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
28M .
[laytonjb@home4 TEST]$ du -sh backup.0
14M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
12M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
9.2M backup.2
[laytonjb@home4 TEST]$ du -sh backup.3
7.7M backup.3

Notice how I now have the final directory backup.3, but I haven’t eliminated any backups yet. Table 4 lists the inode numbers for the various files in the backup and source directories.

Table 4: inode Numbers After Fourth Rsync

File                                          SOURCE     backup.0   backup.1   backup.2   backup.3
Open-MPI-SC13-BOF.pdf                         45220041   45220196   45220196   45220196   45220196
easybuild_Python-BoF-SC12-lightning-talk.pdf  45220199   45220206   45220206   45220206   NA
PrintnFly_Denver_SC13.pdf                     45220217   45220219   45220219   NA         NA
IL-ARG-CaseStudy-13-01_HighLift.pdf           45220266   45220268   NA         NA         NA

To show what happens the first time a backup is eliminated, I’ll add a fifth file and run the script again:

[laytonjb@home4 SOURCE]$ ls -s
total 14648
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
2648 PrintnFly_Denver_SC13.pdf
7784 Open-MPI-SC13-BOF.pdf
[laytonjb@home4 SOURCE]$ du -sh
15M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 20
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 backup.3/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
29M .
[laytonjb@home4 TEST]$ du -sh SOURCE
15M SOURCE/
[laytonjb@home4 TEST]$ du -sh backup.0
15M backup.0/
[laytonjb@home4 TEST]$ du -sh backup.1
14M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
12M backup.2
[laytonjb@home4 TEST]$ du -sh backup.3
9.2M backup.3

[laytonjb@home4 TEST]$ ls -i SOURCE/
45220199 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220220 HPCTutorial.pdf
45220266 IL-ARG-CaseStudy-13-01_HighLift.pdf
45220041 Open-MPI-SC13-BOF.pdf
45220217 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.0/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220221 HPCTutorial.pdf
45220268 IL-ARG-CaseStudy-13-01_HighLift.pdf
45220196 Open-MPI-SC13-BOF.pdf
45220219 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.1/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220268 IL-ARG-CaseStudy-13-01_HighLift.pdf
45220196 Open-MPI-SC13-BOF.pdf
45220219 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.2/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220196 Open-MPI-SC13-BOF.pdf
45220219 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.3/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220196 Open-MPI-SC13-BOF.pdf

I included the inode discovery output (ls -i) because I wanted to point out that the last backup, backup.3, has two files in it because the previous backup.3, which only had one file, was erased (rm -rf backup.3).

Table 5 shows the inode numbers for the various files in the backup and source directories for this fifth pass through the script.

Table 5: inode Numbers After Fifth Rsync

File                                          SOURCE     backup.0   backup.1   backup.2   backup.3
Open-MPI-SC13-BOF.pdf                         45220041   45220196   45220196   45220196   45220196
easybuild_Python-BoF-SC12-lightning-talk.pdf  45220199   45220206   45220206   45220206   45220206
PrintnFly_Denver_SC13.pdf                     45220217   45220219   45220219   45220219   NA
IL-ARG-CaseStudy-13-01_HighLift.pdf           45220266   45220268   45220268   NA         NA
HPCTutorial.pdf                               45220220   45220221   NA         NA         NA

If you compare Tables 4 and 5, you can see the progression of the hard links in the various backup directories.
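
Tables like these can be produced with a short loop instead of by hand. The following is a sketch under a few assumptions: the function name inode_table is made up, the directories are flat (no subdirectories), file names contain no spaces, and GNU stat and GNU cp are available. The demonstration uses two tiny stand-in files rather than the PDFs.

```shell
#!/bin/sh
# inode_table ROOT: hypothetical helper that prints one row per file name,
# with its inode in SOURCE and in backup.0 through backup.3 (NA if absent).
inode_table() {
    root=$1
    dirs="SOURCE backup.0 backup.1 backup.2 backup.3"
    # Collect every file name that appears in any of the directories.
    for dir in $dirs; do
        ls "$root/$dir" 2>/dev/null
    done | sort -u | while IFS= read -r name; do
        row=$name
        for dir in $dirs; do
            if [ -e "$root/$dir/$name" ]; then
                row="$row $(stat -c '%i' "$root/$dir/$name")"   # GNU stat
            else
                row="$row NA"
            fi
        done
        echo "$row"
    done
}

# Scratch layout: f1 is in both backups (hard linked), f2 only in backup.0.
work=$(mktemp -d)
mkdir "$work/SOURCE" "$work/backup.0"
echo "one" > "$work/SOURCE/f1"
cp -p "$work/SOURCE/f1" "$work/backup.0/f1"
cp -al "$work/backup.0" "$work/backup.1"
echo "two" > "$work/SOURCE/f2"
cp -p "$work/SOURCE/f2" "$work/backup.0/f2"

table=$(inode_table "$work")
echo "$table"
```

Because the file names are gathered from all of the directories, a file that has been deleted from SOURCE but still exists in an older backup gets a row too.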

I want to do one last experiment with the script, in which I erase a file from the SOURCE directory and see how it propagates into the backups.

[laytonjb@home4 SOURCE]$ rm easybuild_Python-BoF-SC12-lightning-talk.pdf 
[laytonjb@home4 SOURCE]$ ls -s
total 13056
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 20
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 backup.3/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
28M .
[laytonjb@home4 TEST]$ du -sh backup.0
13M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
15M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
14M backup.2
[laytonjb@home4 TEST]$ du -sh backup.3
12M backup.3

[laytonjb@home4 TEST]$ ls -s backup.0
total 13056
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -s backup.1
total 14648
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -s backup.2
total 14116
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -s backup.3
total 12024
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

Notice how the file easybuild_Python-BoF-SC12-lightning-talk.pdf doesn't appear in backup.0, but it does appear from backup.1 onward. This is expected behavior because I used the --delete option with rsync. However, it also illustrates one of the limitations of backups: You can’t back up everything, all of the time, because you would use too much space. To paraphrase Steven Wright: “You can’t back up everything. Where would you put it?” There will come a point in time when you won’t be able to recover an erased file. It’s just outside the scope of backups. The length of a backup period is a business- and process-based decision and not a technology-driven decision.

You can also see the deleted file in Table 6.

Table 6: inode Numbers After a File is Deleted

File                                          SOURCE     backup.0   backup.1   backup.2   backup.3
Open-MPI-SC13-BOF.pdf                         45220041   45220196   45220196   45220196   45220196
easybuild_Python-BoF-SC12-lightning-talk.pdf  NA         NA         45220206   45220206   45220206
PrintnFly_Denver_SC13.pdf                     45220217   45220219   45220219   45220219   45220219
IL-ARG-CaseStudy-13-01_HighLift.pdf           45220266   45220268   45220268   45220268   NA
HPCTutorial.pdf                               45220220   45220221   45220221   NA         NA

Notice that the file easybuild_Python-BoF-SC12-lightning-talk.pdf is no longer listed in SOURCE or backup.0. The file has been erased from those directories but still exists in the others. If you did a stat on that file in backup.1, you should see the Links: number decrease from 4 to 3.
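
Restoring a deleted file from this layout amounts to finding the newest backup that still contains it and copying it back. The sketch below illustrates the idea with placeholder names (find_newest_copy, kept.txt, deleted.txt) in a scratch directory; it is not a tool from Rubel's article.

```shell
#!/bin/sh
# find_newest_copy ROOT NAME: hypothetical helper that prints the path of
# NAME in the most recent backup that still contains it.
find_newest_copy() {
    root=$1
    name=$2
    for n in 0 1 2 3; do
        if [ -e "$root/backup.$n/$name" ]; then
            echo "$root/backup.$n/$name"
            return 0
        fi
    done
    return 1     # not in any retained backup: the file cannot be restored
}

# Scratch layout imitating the state after the file was deleted.
work=$(mktemp -d)
mkdir "$work/SOURCE" "$work/backup.0" "$work/backup.1"
echo "kept" > "$work/backup.0/kept.txt"
echo "old data" > "$work/backup.1/deleted.txt"

found=$(find_newest_copy "$work" deleted.txt)
echo "restoring from $found"
# cp -p copies the data and keeps mode and timestamps; the backup is untouched.
cp -p "$found" "$work/SOURCE/deleted.txt"
```

Once the file has aged out of the oldest retained backup, the function returns a nonzero status, which is the scripted version of “you can’t restore that file.”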

Rsync Can Do Hard Links

Although I personally like the method of using rsync with hard links, rsync has an option that does the hard links for you, so you don't have to create them manually. Most modern Linux distributions have a fairly recent rsync that includes the very useful option --link-dest=. This option tells rsync to compare the source against an existing directory (typically the previous backup): Files that have changed are copied to the destination (an incremental backup), and files that are unchanged are hard linked to their counterparts in the stated directory.

I’ll look at a quick example of using this option. Assume you have a source directory, SOURCE, and you do a full copy of the directory to SOURCE.full:

[laytonjb@home4 TEST]$ rsync -avh --delete \
/home/laytonjb/TEST/SOURCE/ /home/laytonjb/TEST/SOURCE.full
sending incremental file list
created directory /home/laytonjb/TEST/SOURCE.full
./
Open-MPI-SC13-BOF.pdf
PrintnFly_Denver_SC13.pdf
easybuild_Python-BoF-SC12-lightning-talk.pdf

sent 12.31M bytes received 72 bytes 24.61M bytes/sec
total size is 12.31M speedup is 1.00

You can then create an incremental backup based on that full copy using the following command:

[laytonjb@home4 TEST]$ rsync -avh --delete --link-dest=/home/laytonjb/TEST/SOURCE.full /home/laytonjb/TEST/SOURCE/ /home/laytonjb/TEST/SOURCE.1

When it creates SOURCE.1, rsync copies only the files that differ from SOURCE.full and hard links the rest to the files in SOURCE.full, which produces the incremental copy.

To better use this approach, you would want to implement the backup rotation scheme discussed in the previous section. The script might look something like this:

rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
mv backup.0 backup.1
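# A relative --link-dest path is resolved against the destination directory (backup.0/), hence ../backup.1
rsync -avh --delete --link-dest=../backup.1 source_directory/ backup.0/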

Summary

Although I’m sure the many commercial offerings to back up data work fine, I tend to like simple open source solutions to problems. Recently, I’ve been revisiting the question of how best to do backups, and I found that it’s possible to just use rsync to make both full and incremental backups. Over time, rsync has gained the ability to copy files that have changed relative to a directory other than the source directory and use hard links for common files, effectively allowing rsync to make incremental backups. I use rsync for a variety of tasks, but using it for backups was one that I had never considered, although I’m probably behind the times in this regard.

One advantage of using rsync for backups is that it will create a backup using a filesystem. You can then mount this filesystem as read-only on users’ systems or a central file server, and users can then restore files themselves (a self-service file restore). Furthermore, you can also use SSH to make the backups to a remote system and then use NFS to export the backups to the appropriate systems. This allows you to isolate the backups from a central server so that, in the event that the server dies, you still have the user data in a different location.

Using rsync for full backups is pretty simple, but making incremental backups requires a little more work. Combined with hard links, rsync creates incremental backups, with the added benefit that the most recent backup is a full backup. I reviewed a simple script from likely the first article that discussed using rsync and hard links for incremental backups in this way.

If you are looking for a backup tool, take a look at rsync. It has a very large feature set and is capable of making file copies to remote systems. Before you implement a backup that you, and possibly other users, rely on, be sure you test the process, and be sure you create very good logs of the backup process – then check those logs after the backup. The last suggestion I want to leave with you is that you test your backups, particularly your full backup. A backup that cannot restore data is worthless. Performing restores ensures that the backup works.
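
One simple way to test a backup of this kind is to compare it with the live data and to practice a restore into a scratch location. The sketch below shows both checks with standard tools (diff and cmp); the directory layout and the file report.txt are placeholders, and a plain cp stands in for the actual backup run.

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir "$work/SOURCE" "$work/backup.0" "$work/restore"
echo "payload" > "$work/SOURCE/report.txt"
# Stand-in for the backup run.
cp -p "$work/SOURCE/report.txt" "$work/backup.0/report.txt"

# 1. Compare the backup with the live data; diff exits 0 only if they match.
if diff -r "$work/SOURCE" "$work/backup.0"; then
    verify=ok
else
    verify=mismatch
fi
echo "backup.0 vs SOURCE: $verify"

# 2. Practice a restore into a scratch location and compare byte for byte.
cp -p "$work/backup.0/report.txt" "$work/restore/report.txt"
if cmp -s "$work/SOURCE/report.txt" "$work/restore/report.txt"; then
    restore=ok
else
    restore=mismatch
fi
echo "test restore: $restore"
```

On a live system, files that change between the backup and the comparison will show up as differences, so a check like this is most meaningful right after the backup completes or against data that is known to be static.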