Moving Your Data – It's Not Always Pretty

Moving Day

2. dcp

If you're an HPC person, you're used to harnessing many cores and many nodes, so why not use those cores to copy data? There is a project that does just this: DCP [5] is a simple code that uses MPI [6] and a library called libcircle [7] to copy files in parallel. This sounds exactly like what an HPC admin would do, right?

DCP uses a fixed block size when doing the copy. For larger files, you have to change the block size in the dcp.h file and rebuild the code, but you could easily build the code with several different block sizes and give each executable a different name (e.g., dcp_1KB, dcp_10KB, dcp_1MB, dcp_10MB, dcp_1GB). Then, in a script, you check the size of the file and call the version of DCP that splits the file into a reasonable number of chunks.

For example, don't use the 1MB version of DCP to copy a 1MB file; if you do, only one process will be copying the file. Instead, divide the file size by the number of MPI processes, and then divide that result by 10 or more (the number of blocks each process transfers) to choose the block size. This should get you reasonable performance. Also note that with dcp you don't have to run parallel cp streams; the scripting can be strictly serial, as in the sketch below.
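Here is a minimal sketch of that selection logic. It assumes you have built dcp_1MB, dcp_10MB, and dcp_1GB executables as suggested above, that dcp takes source and destination arguments like cp, and that your MPI stack launches jobs with mpirun; adjust all three to match your site.

    #!/bin/bash
    # Pick a dcp build whose block size yields >= 10 blocks per MPI process.
    NP=16                              # number of MPI processes
    SIZE=$(stat -c %s "$1")            # file size in bytes
    CHUNK=$(( SIZE / (NP * 10) ))      # target block size

    if   [ "$CHUNK" -ge $((1024*1024*1024)) ]; then DCP=dcp_1GB
    elif [ "$CHUNK" -ge $((10*1024*1024))   ]; then DCP=dcp_10MB
    else                                            DCP=dcp_1MB
    fi

    mpirun -np "$NP" "$DCP" "$1" "$2"  # e.g., ./move.sh /old/file /new/file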

On the plus side, dcp is MPI based, so you can take advantage of InfiniBand networks for more performance. On the down side, dcp doesn't transfer all attributes: the documentation says it preserves ownership, permissions, and timestamps, but in Marc Stearman's talk, he noted that these attributes are not yet implemented. Still, DCP might be worth testing to see whether it meets your needs. You might also contribute something to the project, perhaps the missing file attribute support.

3. tar

Another tool you might not have thought too much about is tar. Although it's not a data copy tool in itself, you can use it to gather up lots of small files into a single (.tar) file, copy it to the new storage system, and then untar the file once there. This can make data migration much easier, but you still have to pay attention to the details (see the "Recommendations" section below).

If you want to preserve all of the file attributes when using tar, you have to pay attention to which options [8] you use. The options you might want to consider are shown in Table 1; a combined example appears after the table. Many of these options can be used together, and I highly recommend you experiment with them to make sure they do everything you want, especially the options that affect xattr data. One note of caution: if you write the tar command's output to the old storage, be sure you have enough space to hold the .tar file. For the same reason, I recommend you don't run tar on the entire contents of the old storage at once, because you could run out of space.

Table 1: tar Options

Option               Description
-c                   Create the archive (starts writing at the beginning of the file).
-f <filename>        Specify the archive file name.
-v                   Give verbose output that shows the files being processed.
-p                   Preserve the file permission information (used when files are extracted from the archive).
--xattrs             Save the user/root xattr data to the archive.
--gzip               Use gzip in the stream to compress the archive.
--bzip2              Use bzip2 to compress the archive.
--lzma               Use lzma to compress the archive.
--sparse             Handle sparse files efficiently (use if sparse files will go into the archive).
--no-overwrite-dir   Preserve the metadata of existing directories (the default is to overwrite directory metadata).
--selinux            Save the SELinux context of the files to the archive.
--preserve           Equivalent to both -p and -s (preserve permissions and member order).
--same-owner         Try to extract files with the same ownership as in the archive (the default for the superuser).
--acls               Save the ACLs to the archive.
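Here is one example that combines the attribute-preserving options from Table 1. Note that --xattrs, --acls, and --selinux require a reasonably recent GNU or distribution-patched tar, so check the man page on both storage systems before relying on them; the path here is just an example.

    # Archive one project directory, preserving permissions, xattrs, ACLs,
    # SELinux contexts, and sparse files.
    tar -c -v -p --xattrs --acls --selinux --sparse \
        -f /old_storage/project1.tar /old_storage/project1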

Once you have tar working the way you expect, you can simply copy the .tar file from the old storage to the new storage using cp or something similar. You can also make tar part of a script that tars a smaller part of the overall directory tree, copies it to the new storage, untars it, and then checksums the new files, repeating the process for each part; a sketch of such a loop follows. Note that this is a serial process and might not give you the performance you want, although you can run multiple tar commands at the same time from a script. Even so, this process may not do everything you expect, so please read the "Recommendations" section for some ideas about how to use tar.
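The sketch below shows one way to script that loop. The paths, the per-directory granularity, and the use of md5sum for verification are all assumptions for illustration; adapt them to your own directory tree and policies.

    #!/bin/bash
    # Tar, copy, untar, and verify one subdirectory at a time.
    SRC=/old_storage/project1
    DST=/new_storage/project1
    mkdir -p "$DST"

    for DIR in "$SRC"/*/ ; do
        NAME=$(basename "$DIR")
        tar -C "$SRC" -c -p --xattrs --sparse -f "$SRC/$NAME.tar" "$NAME"
        cp "$SRC/$NAME.tar" "$DST/"
        tar -C "$DST" -x -p --xattrs -f "$DST/$NAME.tar"
        # Checksum the old files and verify the copies before cleaning up.
        ( cd "$SRC/$NAME" && find . -type f -exec md5sum {} + ) > "/tmp/$NAME.md5"
        ( cd "$DST/$NAME" && md5sum -c --quiet "/tmp/$NAME.md5" ) \
            || echo "Checksum mismatch in $NAME" >&2
        rm "$SRC/$NAME.tar" "$DST/$NAME.tar" "/tmp/$NAME.md5"
    done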

4. Mutil

An additional idea that some people have pursued is to multithread cp. This can be done by patching the standard Linux tools (GNU coreutils) [9], and the Mutil [10] project does exactly that – it modifies cp and md5sum so that they are multithreaded. In the words of the Mutil authors, it is intended to be used for local copies. This fits with the model in Figure 1, in which a data mover has both filesystems mounted so any data copying between mountpoints is essentially a local copy.

The current version of Mutil, version 1.76.6, is built against coreutils version 7.6. However, more recent Linux distributions, such as RHEL 6.x, ship a newer version of coreutils, so building the older, patched version on the data mover might take some work.
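If you want to try it, the build looks roughly like the sketch below. The archive URL, patch name, and install names are assumptions here; check the Mutil README for the exact versions and steps.

    # Fetch and unpack the coreutils release that Mutil patches.
    wget http://ftp.gnu.org/gnu/coreutils/coreutils-7.6.tar.gz
    tar -xzf coreutils-7.6.tar.gz
    cd coreutils-7.6

    # Apply the Mutil patch (hypothetical path; see the Mutil README).
    patch -p1 < ../mutil-1.76.6/coreutils-7.6.patch

    ./configure && make

    # Install the patched cp and md5sum under different names (e.g., mcp
    # and msum) so they don't shadow the distribution's own coreutils.
    cp src/cp /usr/local/bin/mcp
    cp src/md5sum /usr/local/bin/msum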
