Moving your data – It's not always pretty

Moving Day

Recommendations

I've covered a number of tools and added comments about how the tools might be used to migrate data from an old storage solution to a new one. However, the most important thing is not necessarily the tool(s), but the planning and execution of the migration. I think it's worthwhile to review some of the steps you should consider when migrating data.

The first, extremely important step – and I cannot emphasize this enough – is that you need to understand your data; that is, you need to understand the state of your data. Answering, or being able to answer, the questions in Table 3 will make your life infinitely easier as you plan for data migration.

Table 3

Understanding Your Data

How many files?
What is the average number of files per user?
How old is the oldest file based on the three timestamps (atime, mtime, ctime)?
What is the average file age?
How many files are added every day?
How many files are deleted every day?
How much capacity is added every day?
How big is the largest file?
What is the size of the smallest file?
Which users have the most data and the largest number of files?
Which user is creating the most data?
What is the deepest subtree?
How many levels does it have?
How many users are on the system?
Do files have xattr data or other associated attributes?
How many directories are there?
Do directories have associated xattr data?
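
A few of these questions can be answered with one pass over the tree. The following is a minimal sketch, not a finished survey tool: the function name `survey_tree` and the dictionary keys are made up for illustration, it looks only at mtime (not atime or ctime), and it does not inspect xattr data.

```python
import os
import stat
from collections import Counter


def survey_tree(root):
    """Walk root once and collect a handful of the Table 3 statistics."""
    stats = {
        "files": 0,            # regular files only
        "dirs": 0,             # directories below root
        "bytes": 0,            # total size of regular files
        "largest": 0,
        "smallest": None,
        "oldest_mtime": None,  # seconds since the epoch
        "max_depth": 0,        # directory levels below root
        "files_per_uid": Counter(),
        "bytes_per_uid": Counter(),
    }
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        stats["dirs"] += len(dirnames)
        depth = dirpath.count(os.sep) - base_depth
        stats["max_depth"] = max(stats["max_depth"], depth)
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices
            size = st.st_size
            stats["files"] += 1
            stats["bytes"] += size
            stats["largest"] = max(stats["largest"], size)
            if stats["smallest"] is None or size < stats["smallest"]:
                stats["smallest"] = size
            if stats["oldest_mtime"] is None or st.st_mtime < stats["oldest_mtime"]:
                stats["oldest_mtime"] = st.st_mtime
            stats["files_per_uid"][st.st_uid] += 1
            stats["bytes_per_uid"][st.st_uid] += size
    return stats
```

Running this daily and comparing the totals would give a rough answer to the "added every day" questions; on a very large filesystem, a single-threaded walk like this is likely to be slow.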

Once you have this information, you can develop a plan for migrating data. In general, I would start with user data that hasn't changed in a while but still needs to be migrated. You need to understand how the data is distributed through a user's directory tree. Is most of the data near the bottom (it usually is), or is the data distributed throughout the tree? Understanding this allows you to plan how best to migrate the data.

For example, if most of the data is at the bottom of the tree, you will want to think about how to migrate the directory structure first and then use tar to capture the data at the lowest levels. The reason is that tar typically records paths relative to the point at which the archive was created and knows nothing about where that point sits in the larger tree, so if you untar it at the wrong level, the data ends up in the wrong place. Also, you need to migrate the directories themselves; otherwise, you can lose permissions, timestamps, links, owners, and xattr data. Don't think you can just re-create the directory structure by hand on the new storage.

As part of this process, you will also likely need a list of files and directories in the user's account. It's just a simple list of the fully qualified name of each file and directory.

This list will be very important because many of the tools I discussed can be fed a file name for data migration, and a list of files will allow you to parallelize the operations, improving throughput and reducing migration time.
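
As a rough sketch of that idea, the following builds the list of fully qualified file names for one tree and splits it into several chunk files that separate transfer processes could work through in parallel. The function names and the chunk file naming scheme are assumptions for illustration, and file names containing newlines would break the one-name-per-line format.

```python
import os


def list_files(root):
    """Yield the absolute path of every non-directory entry under root."""
    for dirpath, _dirnames, filenames in os.walk(os.path.abspath(root)):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def split_round_robin(paths, n_chunks):
    """Deal paths into n_chunks lists, one path at a time."""
    chunks = [[] for _ in range(n_chunks)]
    for i, path in enumerate(paths):
        chunks[i % n_chunks].append(path)
    return chunks


def write_chunks(chunks, prefix):
    """Write each chunk to '<prefix>.NNN', one path per line."""
    names = []
    for i, chunk in enumerate(chunks):
        name = f"{prefix}.{i:03d}"
        # surrogateescape lets non-UTF-8 file names round-trip unchanged
        with open(name, "w", encoding="utf-8", errors="surrogateescape") as fh:
            for path in chunk:
                fh.write(path + "\n")
        names.append(name)
    return names
```

Round-robin splitting balances the number of files per chunk, not the number of bytes; if a few files dominate the capacity, splitting by size would probably balance the transfers better.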

Once you have identified the user with the least active data, you can then work through the list of users until you arrive at the most active users in terms of data manipulation. For each user, you need to go through the same steps you did with the least active user. Be sure to understand their tree structure and create a list of all files.
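
One way to approximate "least active" is to look at the most recent modification time in each user's tree. The sketch below assumes that the newest mtime is a reasonable proxy for activity, which might not hold at every site; both function names are hypothetical.

```python
import os


def latest_mtime(root):
    """Return the newest file mtime under root, or 0.0 if it has no files."""
    newest = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            newest = max(newest, st.st_mtime)
    return newest


def rank_least_active_first(user_dirs):
    """Order user directories from least to most recently modified."""
    return sorted(user_dirs, key=latest_mtime)
```

The resulting order is only a starting point for the migration sequence; a user whose last change was long ago can still become active the day the migration starts.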

At this point, you should have a pretty good time sequence for migrating the data, starting with the user with the least active data and progressing to the most active user; additionally, you should have lists of files for each user that needs to be migrated. Now, you still need to determine just how to do the transfer, and you need to understand which tool(s) to use for the data migration to get the most throughput between the storage solutions and reduce migration time. This step could mean writing scripts, but it definitely means lots of testing.

To perhaps help you in your testing, I have two fundamental recommendations at this point: (1) Transfer the directories first, including any directory attributes; (2) create a dummy account and put in some dummy data for testing.

The first recommendation is somewhat counterintuitive, but Stearman also mentioned it in his talk. By migrating the directory structure first, you can migrate files farther down the tree without having to transfer everything at once.

As I described earlier, some of the tools don't understand directory trees very well. For example, suppose you create a recursive tar file of the path /home/jones/project1/data1/case1. If you then copy the tar file to the new storage and untar it in /home/jones, the data will be in the wrong place. To prevent this from happening, it greatly behooves you to migrate the directory structure first. Just be sure to capture all the attributes for the directories.
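
To illustrate what "directories first" could look like, here is a minimal sketch that re-creates a directory tree without any files and then copies the directory attributes. It relies on Python's `shutil.copystat`, which copies permission bits, access and modification times, and flags, and on Linux also extended attributes where possible. The function name and the `preserve_owner` parameter are assumptions for illustration; changing ownership normally requires root, and symbolic links to directories are not handled.

```python
import os
import shutil


def copy_dir_tree(src_root, dst_root, preserve_owner=False):
    """Re-create the directories under src_root below dst_root (no files)."""
    src_root = os.path.abspath(src_root)
    dst_root = os.path.abspath(dst_root)
    pairs = []
    # Pass 1: create every directory so the whole skeleton exists.
    for dirpath, _dirnames, _filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target, exist_ok=True)
        pairs.append((dirpath, target))
    # Pass 2: copy attributes, children before parents.
    for src, dst in reversed(pairs):
        if preserve_owner:
            st = os.stat(src)
            os.chown(dst, st.st_uid, st.st_gid)  # usually needs root
        shutil.copystat(src, dst)  # mode, times, flags, xattrs (Linux)
    return len(pairs)
```

Writing files into a directory later updates that directory's modification time, so the attribute pass probably needs to be repeated once the file data has landed.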

My second recommendation – testing everything on a dummy account – seems obvious, but it gives you more feedback on the tools and processes than simply trying a single file. Also, don't forget to test all of your procedures. Although this might seem obvious, sometimes the little things bite the worst.

Once you have tested with the dummy account, I recommend testing with a user who doesn't have much data or hasn't used their account much. From there, I would start migrating other users' data, increasing the rate of transfer until you hit the maximum throughput.

Another observation that is obvious but easily forgotten: Don't allow users to modify data while it is migrating or once it has been migrated. In practice, this can be a struggle. You might disable the user's account during the migration or simply ask them to refrain from using the system. Once the data is migrated, you can then point them to the new location and not the old one (you could simply rename their top-level directory, but be careful because this changes the file attributes on that directory).

Regardless of the tool(s) you use, be sure to checksum the file before and after you migrate it to make sure the checksums are the same, and you haven't lost any data. Even if the tool does the checksum check for you, I recommend doing it yourself. Only when you are satisfied that the checksums match should you mark the file as migrated or erase or otherwise modify the file on the old storage.

I also recommend using something a little more sophisticated than md5sum. I personally like to use the SHA-2 family of hash functions because there is less chance of a collision, that is, of two different files producing the same checksum. (I realize the chance is very small with MD5, but I just feel better using SHA-2. This is a personal preference.) Some of the tools use MD5 for their own file checksum checks.
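
A before-and-after check with SHA-256 (one member of the SHA-2 family) might look like the following sketch, which uses Python's standard `hashlib` module. The function names and the 1 MiB block size are arbitrary choices for illustration.

```python
import hashlib


def sha256_of(path, block_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in blocks so large files never have to fit in memory.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_copy(src_path, dst_path):
    """Return True only if both files have the same SHA-256 digest."""
    return sha256_of(src_path) == sha256_of(dst_path)
```

Recording the digest of the source file before the transfer, and not just comparing afterward, also lets you detect a source file that changed during the migration.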

My last suggestion is to check, check, check. Make sure the data on the new storage matches the data on the old storage. It's easy to write a simple script that walks the directories and gathers the file and directory attributes. Run the script on the old storage and the new storage and compare the results. If they don't match, go back and fix it. This includes file checksums as well.
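
Such a script could be sketched as follows: build a manifest of each tree, keyed by relative path, and then compare the two. The function names and the set of recorded attributes are assumptions for illustration; xattr data and atime are not recorded here, and checksumming every file on a large filesystem will take a long time.

```python
import hashlib
import os
import stat


def _sha256(path, block_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root, with_checksums=True):
    """Map each relative path under root to a dict of its attributes."""
    root = os.path.abspath(root)
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            entry = {
                "mode": st.st_mode,  # includes file type and permissions
                "uid": st.st_uid,
                "gid": st.st_gid,
                "mtime": int(st.st_mtime),
            }
            if stat.S_ISREG(st.st_mode):
                entry["size"] = st.st_size
                if with_checksums:
                    entry["sha256"] = _sha256(path)
            manifest[os.path.relpath(path, root)] = entry
    return manifest


def compare_manifests(old, new):
    """Return a sorted list of (relative_path, problem) tuples."""
    problems = []
    for rel in sorted(old.keys() | new.keys()):
        if rel not in new:
            problems.append((rel, "missing on new storage"))
        elif rel not in old:
            problems.append((rel, "only on new storage"))
        elif old[rel] != new[rel]:
            keys = old[rel].keys() | new[rel].keys()
            changed = sorted(k for k in keys if old[rel].get(k) != new[rel].get(k))
            problems.append((rel, "differs: " + ", ".join(changed)))
    return problems
```

An empty list from `compare_manifests` means the recorded attributes and checksums agree; anything else is a list of paths to go back and fix.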

Summary

Storage is growing so fast that it's like looking at your daughter when she's five years old, blinking, and suddenly she's 15. You need to keep your eye on the ball if you hope to manage everything. I'm not just talking about large systems, but medium-sized systems, small systems, workstations, and even your home systems.

At some point during this growth spurt, you will have to think about migrating your data from an old storage solution to a new one, but copying the data over isn't as easy as it sounds. You want to preserve the attributes of the data during the migration, including xattr (extended attribute) information, because losing information such as file ownership or timestamps can cause havoc with projects. Plus, you have to pay attention to the same things for directories; they are just as important as the files themselves (remember that everything is a file in Linux).

In this article, I wanted to present some possible tools for helping with data migration, and I covered just a few of them. However, I also wanted to emphasize the need to plan your data migration if you really want to succeed.

One thing I learned in writing this article is that there is no one tool that seems to do everything you want. You end up having to use several tools to achieve everything in an efficient manner. Plus, these tools are fairly generic, which is good because they can be used in many situations. At the same time, they might not take advantage of the hardware.

A simple example is that all of these tools but one are TCP based, yet InfiniBand offers much greater bandwidth, which TCP-based tools can't fully utilize for data migration. You can definitely use IP over InfiniBand, and you will get greater bandwidth than without it, but you still won't be using the entire capability of the network.

Infos

  1. Lustre User Group conference: http://www.opensfs.org/events/lug13/
  2. Sequoia Data Migration Experiences: http://www.opensfs.org/wp-content/uploads/2013/04/LUG-2013-Sequoia-Data-Migration-Experiences.pdf
  3. ZFS and Lustre: http://wiki.lustre.org/index.php/ZFS_and_Lustre
  4. cp command: http://www.gnu.org/software/coreutils/manual/html_node/cp-invocation.html
  5. DCP: http://filecopy.org/
  6. MPI: http://en.wikipedia.org/wiki/Message_Passing_Interface
  7. libcircle: https://github.com/hpc/libcircle
  8. tar options: http://linux.die.net/man/1/tar
  9. GNU coreutils: http://www.gnu.org/software/coreutils/
  10. Mutil: http://people.nas.nasa.gov/~kolano/projects/mutil.html
  11. Rsync: http://rsync.samba.org/
  12. Rsync options: http://rsync.samba.org/ftp/rsync/rsync.html
  13. Lustre: http://wiki.lustre.org/index.php/Main_Page
  14. Parallel rsync'ing a huge directory tree: http://blog.ciberterminal.net/2012/10/16/parallel-rsyncing-a-huge-directory-tree/
  15. Parallelizing RSYNC Processes: http://sun3.org/archives/280
  16. Parallelizing rsync: http://superuser.com/questions/353383/parallelizing-rsync
  17. BitTorrent: http://en.wikipedia.org/wiki/Bittorrent
  18. BitTorrent offers file sync tool for PCs and NAS: http://www.theregister.co.uk/2013/04/24/bit_torrent_sync_for_nas/
  19. BitTorrent Sync: http://labs.bittorrent.com/experiments/sync.html
  20. BitTorrent Takes Its File-Syncing Service Public: http://mashable.com/2013/04/23/bittorrent-sync/
  21. BBCP: http://www.slac.stanford.edu/~abh/bbcp/
  22. Using BBCP: http://pcbunn.cithep.caltech.edu/bbcp/using_bbcp.htm
  23. bbFTP: http://doc.in2p3.fr/bbftp/
  24. SLAC: http://www.slac.stanford.edu/
  25. IN2P3: http://www.in2p3.fr/
  26. BBftpPRO: http://www.weizmann.ac.il/physics/linux_farm/bbftp_PRO.html
  27. GridFTP: http://www.globus.org/toolkit/docs/latest-stable/gridftp/
  28. Aspera: http://asperasoft.com/
  29. Aspera Sync: http://asperasoft.com/software/synchronization/
