How Old Is That Data?

The explosion of data is a storage burden that all system administrators bear. The agedu tool lets you discover what data is being used.

One of the fundamental theorems of system administration is: “Users will find a way to use space faster than new space can be added to systems.” The corollary to this theorem is: “Users will always insist that all of their data is critical and must be retained online.” To get their arms around the data boom, system administrators can use tools that scan filesystems to determine how much data is being used and to “age” data.

In this article, I want to introduce an essential admin application named agedu that you can use to get a snapshot of the age of files and directories. From this information, you can get a general sense of what directories have older data that hasn't been accessed (or modified) in a while. It can also be used in scripts to create reports about systems or simply to understand what’s going on with your storage, even on your desktop or laptop.

Studies of Data Age

A few years ago, a study from the University of California, Santa Cruz, and NetApp examined CIFS storage within the NetApp company. Part of the storage was deployed in the corporate data center where the hosts were used by more than 1,000 marketing, sales, and finance employees. The second part of the storage was a high-end file server deployed in the engineering data center and used by more than 500 engineering employees. From this study, a few observations can be made:

  • Workloads are more write oriented
  • Files are 10x bigger than in previous studies
  • Files are rarely reopened: >66 percent are reopened just once and 95 percent are opened fewer than five times
  • <1 percent of clients account for 50 percent of requests
  • >76 percent of files are opened by just one client
  • Only 5 percent of files are opened by multiple clients, and 90 percent of those are read just once

The big Vegas finish for the study was that more than 90 percent of the active storage was untouched during the study.

If you combine these results with those of other studies, it becomes apparent that users are creating more data than before, they are keeping it around, and they are not reusing much of it. However, if you ask the users, they will naturally tell you that all of the data is important and cannot be erased. If the data hasn't been touched in two years, is it still needed? To answer that question, you need to be able to scan a user’s filesystem(s) to determine the age of files.

Data Comes in All Ages

On *nix systems, “age” is not a single number. Age of a file or directory is measured in three ways:

  • change time (ctime)
  • access time (atime)
  • modify time (mtime)

The first, ctime, is the time that changes were last made to the file or directory’s inode. This can include changes to the data, file or directory permissions, file or directory ownership, and so on. The ctime can be viewed with the command, ls -lc. The second time, atime, is the time the file was last accessed. The access times can be found with the command ls -lu. The third time, mtime, is the modify time, or the time the actual file contents were changed. You can view the modify time with the command ls -l. To get all of the information in a quasi-readable format, you can use the stat command in Linux.

When talking about the "age" of a file, you need a precise definition. Does the discussion concern ctime, atime, or mtime? Do you need to take into consideration a combination of the three metrics? Are you interested in the oldest time, regardless of the metric (i.e., max[ctime, atime, mtime])?
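To make the three definitions concrete, the following is a minimal Python sketch (not part of agedu) that reads the timestamps described above with os.stat and converts them to ages in days. The function names file_ages and most_recent_activity are hypothetical, chosen only for this illustration. Note that taking the maximum of the raw timestamps yields the most recent event, so the age derived from it is the smallest of the three.

```python
import os
import time

SECONDS_PER_DAY = 86400


def file_ages(path, now=None):
    """Return the age in days of a file by ctime, atime, and mtime."""
    if now is None:
        now = time.time()
    st = os.stat(path)
    return {
        "ctime": (now - st.st_ctime) / SECONDS_PER_DAY,
        "atime": (now - st.st_atime) / SECONDS_PER_DAY,
        "mtime": (now - st.st_mtime) / SECONDS_PER_DAY,
    }


def most_recent_activity(path):
    """Return (metric, epoch_seconds) for the latest of the three timestamps."""
    st = os.stat(path)
    stamps = {"ctime": st.st_ctime, "atime": st.st_atime, "mtime": st.st_mtime}
    metric = max(stamps, key=stamps.get)
    return metric, stamps[metric]
```

Calling file_ages on a file returns a dictionary with one age per metric, which makes it easy to see how far apart the three "ages" of the same file can be.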

As an aside, many users want to know when a file was first created (i.e., its “birth” time), regardless of whether the data or the inode information has changed. Some discussion has taken place about adding this time to various filesystems, but no real standard has developed around it.

Every time a file is accessed, the atime changes, forcing the filesystem to read the inode, modify the atime, and write the inode back to storage, even if the data in the file has not changed. Across a busy filesystem, this adds up to a large number of very small I/O operations, which adds to the IOPS (I/O operations per second) load.

Many filesystems allow you to turn off atime to reduce the IOPS load, which can increase performance at the price of not being able to track when a file was last accessed. Although a lot of people don't care about the last access time, particularly for local systems such as a laptop or a desktop, for HPC systems, atime can be a very important number because it allows you to track when the file was last accessed.
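Before trusting atime-based reports, it is worth checking how the filesystems in question are mounted. The sketch below assumes a Linux system, where /proc/mounts lists the mount options for each filesystem; the function name atime_mount_options is hypothetical. A filesystem mounted with noatime will not record accesses at all, and one mounted with relatime (the usual Linux default) updates atime only under certain conditions, so the recorded access times are approximate.

```python
def atime_mount_options(mounts_text):
    """Map each mount point to the atime-related options it was mounted with.

    mounts_text is expected to look like /proc/mounts on Linux:
    device mountpoint fstype options dump pass
    """
    atime_flags = {"noatime", "relatime", "strictatime", "nodiratime"}
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount entry
        mount_point, options = fields[1], fields[3].split(",")
        result[mount_point] = sorted(set(options) & atime_flags)
    return result
```

On a live system, you could pass in the contents of /proc/mounts, for example atime_mount_options(open("/proc/mounts").read()), and look up the mount point that holds the directories you plan to scan.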

Before plowing into a very large filesystem with millions (or even billions) of files, it would be good to get a glimpse at the distribution of the three ages of the files and directories, so you can focus on where most of the older files are located. A simple tool for this task is agedu.

agedu

Filesystems can contain thousands, millions, or even billions of files. Tracking how they are being used is very difficult and time consuming. Fortunately, a simple tool named agedu can give you a quick glimpse into the “age” of the data on a directory basis. In the case of HPC systems, you can use it to scan directories quickly for old applications or user directories.

Agedu is likely to be in your distribution repository (e.g., the CentOS 6 and CentOS 7 EPEL repositories). However, if it isn't there, it is simple to build, install, and run. The familiar ./configure command followed by make builds the code, after which you install it into /usr/local as root. The other option is to build the code and install the tool in your user account; for example:

./configure --prefix=/home/laytonjb/bin/agedu

Then you just create an alias to the executable. For example, in your .bashrc file, add the line:

 alias agedu=/home/laytonjb/bin/agedu/bin/agedu

After you have installed agedu, whether built from source or installed from your package manager, you can proceed in a number of ways. The first thing you should do is create an index of all of the files in the directory tree and their sizes. All subsequent queries can be made against the index, which is much faster than continually scanning the filesystem. Note that agedu sums the used storage for all directories and files below the current directory (e.g., like du -s). Once the index is built, you can then “query” for a variety of information. Agedu comes with a basic web server, so it can produce a graphical display of the results.
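To illustrate the kind of summary the index makes possible, here is a rough Python sketch that walks a directory tree once and accumulates, for every directory, the bytes in files that have not been accessed for a given number of days. This is only an illustration of the idea of du-style cumulative totals filtered by age; it is not how agedu is implemented, it uses the apparent file size rather than allocated blocks, and the function name old_bytes_by_directory is hypothetical.

```python
import os
import time


def old_bytes_by_directory(root, min_age_days, now=None):
    """Return {directory: bytes in files under it not accessed for min_age_days}.

    Totals are cumulative: a file counts toward its own directory and every
    ancestor up to root, similar in spirit to du.
    """
    if now is None:
        now = time.time()
    cutoff = now - min_age_days * 86400
    root = os.path.abspath(root)
    totals = {}
    # Walk bottom-up so each directory's children are totalled first.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable
            if st.st_atime < cutoff:
                total += st.st_size
        for name in dirnames:
            # Symlinked directories are not walked, so they contribute 0.
            total += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = total
    return totals
```

A walk like this has to touch every inode each time it runs, which is exactly the cost that building an index once and querying it repeatedly avoids.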

To create an index of the directory tree, you just run the command,

[laytonjb@home4 ~]$ agedu -s /home/laytonjb
Built pathname index, 748917 entries, 67182982 bytes of index
Faking directory atimes
Building index
Final index file size = 162381160 bytes
[laytonjb@home4 ~]$ ls -s agedu*
158580 agedu.dat

where -s <directory> produces an index file named agedu.dat in the current directory. (Note: If the index file is in a directory being scanned, agedu will ignore it.)

Once the index is created, you can query it. A great way to get started is to use the HTML display capabilities. Agedu will print out a URL that you can then copy into your browser. For example,

[laytonjb@home4 ~]$ agedu -w
Using Linux /proc/net magic authentication
URL: http://127.0.0.1:42821/

A screenshot of the web browser output is shown in Figure 1.

Figure 1: Agedu screenshot using access time (atime).

The web graphics display the age of the files in a specific directory, with red being the oldest and green being the newest. (For this example, notice that the oldest file is four years old.) The web page orders the directories by total space used. For this specific example, the first directory has the vast majority of the used space, as well as a large number of fairly new files. The second and third directories have some older files, as does the fifth directory.

The image also indicates the total space used by a directory to the far left and the percentage of the total space the directory uses (listed to the far right). When you are finished with the web page, just close agedu by pressing Ctrl+C.

In Figure 1, notice at the very top of the page that it states the data age is based on access time (atime), which is the default setting. However, you can easily perform the same analysis using mtime if you like (at this time ctime is not an option), with:

[laytonjb@home4 ~]$ agedu --mtime -s /home/laytonjb
Built pathname index, 751666 entries, 67486718 bytes of index
Faking directory atimes
Building index
Final index file size = 174723288 bytes
[laytonjb@home4 ~]$ ls -s agedu*
170632 agedu.dat
[laytonjb@home4 ~]$ agedu -w
Using Linux /proc/net magic authentication
URL: http://127.0.0.1:51579/

Remember that the first command produces the index; you then need either to display the graphical output, as in the agedu -w command, or to query the index.

Figure 2 shows the resulting web page when mtime is used as the metric.

Figure 2: Agedu screenshot using modify time (mtime), even though it says “access time.”

Note that the top of the web page still says last-access time, even though mtime was used.

The range of dates is from six years to present. The oldest end of the spectrum (around six years) is fairly small, with a fairly long spread of “new” files in terms of mtime. The subdirectory AWS has the largest percentage of file capacity and some of the youngest files (lots of PDF files).

In addition to the HTML output, you can query the database to get text information (which is great for scripting). For example,

[laytonjb@home4 ~]$ agedu -s /home/laytonjb
[laytonjb@home4 ~]$ agedu -t /home/laytonjb

sends a summary of space usage (including subdirectories) as text to stdout.

By default, agedu looks for the oldest file when creating the scale, as displayed in the web output. With the text option, you can instead query the index for data older than a threshold of your choosing. For example, you can scan for the amount of space in each directory that is older than six months:

[laytonjb@home4 ~]$ agedu -s /home/laytonjb
[laytonjb@home4 ~]$ agedu -a 6m -t /home/laytonjb
4           /home/laytonjb/.abrt
7344        /home/laytonjb/.adobe
8           /home/laytonjb/.atom
14764       /home/laytonjb/.cache
8           /home/laytonjb/.cfncluster
16188       /home/laytonjb/.config
4           /home/laytonjb/.dbus
8           /home/laytonjb/.distlib
3524        /home/laytonjb/.e
8           /home/laytonjb/.emacs.d
44          /home/laytonjb/.fontconfig
148         /home/laytonjb/.gconf
712         /home/laytonjb/.gimp-2.2
76          /home/laytonjb/.gimp-2.6
20          /home/laytonjb/.gkrellm2
60          /home/laytonjb/.gnome2
16          /home/laytonjb/.gnote
24          /home/laytonjb/.gnupg
1704        /home/laytonjb/.icewm
2600        /home/laytonjb/.kde
19260       /home/laytonjb/.komodoedit
1852        /home/laytonjb/.libreoffice
1044        /home/laytonjb/.local
104         /home/laytonjb/.lyx
...
15305060    /home/laytonjb/src
622704576   /home/laytonjb

The output shows the space usage summary for each subdirectory below the main directory that has data older than six months. This capability can be extremely useful when searching for directories that have very old data. From a system administrator’s perspective, a prime example would be to use agedu to scan user directories for really old data after examining all home directories for the oldest data. This can also be run as part of a script that is run either daily, weekly, or monthly and creates a report of the directories with the oldest data, from which you can decide what to do (e.g., archiving).
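As a starting point for such a report, the following Python sketch parses text in the "size path" layout shown in the listing above and picks out the directories holding the most old data. It assumes the output format is exactly as shown (a number, whitespace, then a path); the function names parse_agedu_text and largest_old_directories are hypothetical, and the sizes are reported in whatever units agedu printed.

```python
def parse_agedu_text(output):
    """Parse 'SIZE PATH' lines into a list of (size, path) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines, elisions, or anything unexpected
        rows.append((int(parts[0]), parts[1].strip()))
    return rows


def largest_old_directories(output, top=5, exclude=()):
    """Return the `top` largest (size, path) rows, skipping paths in `exclude`."""
    rows = [row for row in parse_agedu_text(output) if row[1] not in exclude]
    return sorted(rows, key=lambda row: row[0], reverse=True)[:top]
```

The text to parse could be captured from the same command used above (agedu -a 6m -t followed by the directory), for example with Python's subprocess.run and capture_output=True. Excluding the top-level directory is usually sensible, because its total includes all of its subdirectories.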

The savvy administrators reading this article will be quick to realize that users could simply use the touch command to update the atime and mtime of their data, obscuring the real access and modify times of the data. However, one could use agedu to run reports fairly often to catch users doing this. It doesn't stop them, but at least you have a record of the users employing this method, and if they become abusers of space, you can at least talk to them and show them the reports. Despite the tone of this article, users are not evil in any sense, but having data to explain why they should compress or delete data is much more effective than simply demanding that they delete data. These reports can also help you identify users that need more space and then work with them to understand how they are using space, and they can help you when requesting more space because you can explain how space is being used, who is using it, and how fast data is growing.
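One way to use frequent reports for this purpose is to compare two of them and flag directories whose amount of old data dropped sharply from one run to the next. The Python sketch below assumes each report has already been reduced to a dictionary mapping a directory path to its old-data total; the function name sudden_drops and the 50 percent threshold are hypothetical choices for illustration. A flagged directory is only a hint: the user might have legitimately deleted, archived, or actually used the data.

```python
def sudden_drops(previous, current, min_drop_fraction=0.5):
    """Return [(path, before, after)] where old data shrank by at least the given fraction.

    previous and current map directory paths to the amount of old data
    reported in two successive runs.
    """
    flagged = []
    for path, before in previous.items():
        after = current.get(path, 0)  # a missing directory counts as zero
        if before > 0 and (before - after) / before >= min_drop_fraction:
            flagged.append((path, before, after))
    # Largest absolute drop first
    return sorted(flagged, key=lambda item: item[1] - item[2], reverse=True)
```

Comparing a flagged directory's total space usage between the same two runs helps separate the cases: if the total is unchanged but the old data has vanished, the timestamps were refreshed rather than the data removed.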

Summary – Check Your Space Usage Today!

As noted in the introduction, studies have shown that most stored data has not been accessed in a very long time or is not accessed often. Although there are legitimate reasons for keeping data online, at least knowing how much data has not been accessed in quite some time provides evidence for either adding storage or adding archiving capabilities. Knowing how data is used can also help identify the users that are using the most space, so they can be asked to delete or compress data that they are not using or have not touched in some time. (It's fairly convincing to ask the user to compress files they have not used in some time when you can show them how much disk space they are using and how long it's been since they last accessed the files.) This information can also be used to provide justification for more space, or perhaps even more importantly, it can be used to track trends in data usage.

Agedu is a tool that can give you a quick overview of your disk usage as a function of time. The tool is remarkably easy to build and use and has a great deal of flexibility, including the ability to be scripted. As pointed out, the scripting capability can be used in a variety of ways to help administrators. Even for the casual home user, this information can be very useful in understanding why your disks are getting so full, and it can be even more useful when asking the household finance committee to flip for a new 10TB SATA drive (or two).