Lead Image © Lucy Baldwin, 123RF.com

Lead Image © Lucy Baldwin, 123RF.com

Filesystem Murder Mystery

Hunting down a performance problem

Article from ADMIN 50/2019
By
Slow file rename prompts a cloud exploration.

As I was generating results for one of last Fall's columns, James Saryerwinnie of the Amazon AWS team was chasing an interesting performance problem and relaying his progress live on Twitter (Figure 1) [1], providing me yet another unnecessary opportunity to procrastinate. What followed was one of the most interesting Twitter threads in performance-tuning in recent memory and a great practical demonstration of the investigative nature of performance problems.

Figure 1: The original description of the problem.

Saryerwinnie was observing occasional unexplained delays of a few seconds while renaming some gigabyte-sized files [2] on a fast cloud instance [3]. I was able to reproduce his result in an AWS EC2 m5d.large instance running Ubuntu Server 18.04 (Figure 2). I used an EBS class gp2 volume of 8GB showing remarkable performance in this particular test (including a throughput of 175MBps), but similar results are achievable at other scales with some lucky timing.

Figure 2: Testing against the baseline: running 4707 times slower!

My Figure 2 result shows a file rename taking 4,707 times longer than it did originally on the same system when it happens to rename over an existing file that was recently manipulated; the slowdown otherwise disappeared if I waited between operations. Just like Saryerwinnie's original result, the slowdown is measured in whole seconds, spanning four orders of magnitude from the ideal performance level, which appears to indicate that the rename system call [4] is for some reason waiting for recent I/O to be written to disk before completing.

This observation seemed to point the finger at some unexpected queuing of operations at the filesystem level, as if some part of the previous operation, still latent somewhere in the kernel, had to complete before the rename could take place. Given the symbolic nature of file entries in a directory structure, this appears architecturally unnecessary: The completion of the file operation should not interfere with a directory entry's renaming, even when the entry is its own.

Or should it? Perhaps an unexpected guarantee of in-order operation is in play, which is why I, and presumably Saryerwinnie himself before me, found this observation interesting. A similar slowdown is observed when the move command is targeting an existing empty file (Figure 3), seemingly indicating that the rename syscall itself and not actual data movement through caches [5] is the culprit.

Figure 3: Using a zero-length target file does not affect the results.

AWS Storage Basics

AWS block storage comes in two types: ephemeral , which is locally attached and gets its name from being lost when an instance is de-provisioned, and network-based , as delivered by the Elastic Block Storage service (EBS). EBS volumes are highly available and come with a multitude of options too large to detail here [6]. In broad strokes, EBS volumes come in four types: General Purpose SSD (gp2), Provisioned IOPs SSD (io1), Throughput Optimized HDD (st1), and Cold HDD (sc1).

EBS volumes can be used to boot EC2 instances and can be as large as 16TiB. They support encryption, snapshots, and most other accoutrements generally associated with enterprise data storage but stand out for their built-in monitoring delivered via AWS CloudWatch. Figure 4 shows the spike in write bandwidth caused by my earlier tests.

Figure 4: CloudWatch monitoring the original write test.

Strace to the Rescue

Unrivaled as a debugging tool, strace [7] is also a great way to unpack an executable's performance when used with the -t option to include the time of each call. As shown in Listing 1, nearly the entirety of the mv execution time was spent in an unusually long (nine seconds in this case!) rename system call, confirming the suspicion that it is the key to this enigma.

Listing 1

Timing the Execution of Calls in mv

ubuntu@aws:~/slow-mv$ strace -t mv 3GB.copy 3GB
19:00:09 execve("/bin/mv", ["mv", "3GB.copy", "3GB"], 0x7ffd0e7dddf8 /* 21 vars */) = 0
19:00:09 brk(NULL)                      = 0x55cd7d1ce000
19:00:09 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
19:00:09 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
...
19:00:09 getgid()                       = 1000
19:00:09 access("3GB", W_OK)            = 0
19:00:09 rename("3GB.copy", "3GB")      = 0
19:00:18 lseek(0, 0, SEEK_CUR)          = -1 ESPIPE (Illegal seek)
19:00:18 close(0)                       = 0
19:00:18 close(1)                       = 0
19:00:18 close(2)                       = 0
19:00:18 exit_group(0)                  = ?
19:00:18 +++ exited with 0 +++
ubuntu@aws:~/slow-mv$

The next step of analysis would take me into the kernel, were it not for Matt Wilson delivering a timely resolution [8]: The combination of the data=ordered and auto_da_alloc ext4 options results in a behavior akin to invoking sync [9] before renaming over an existing file. I can validate this result by running a few tests in the test instance, after remounting the root partition with modified options, as shown in Listing 2. I could not observe a delay longer than a third of a second with this configuration, as opposed to a maximum of nine seconds with the original (Figure 5). Of course, there is a reason for this setting's trade-off.

Listing 2

Disabling auto_da_alloc

ubuntu@aws grep ext4 < /proc/mounts
/dev/nvme1n1p1 / ext4 rw,relatime,discard,data=ordered 0 0
ubuntu@aws sudo mount -o remount,rw,relatime,discard,noauto_da_alloc,data=ordered /dev/nvme1n1p1 /
ubuntu@aws  grep ext4 < /proc/mounts
/dev/nvme1n1p1 / ext4 rw,relatime,discard,noauto_da_alloc,data=ordered 0 0
Figure 5: Disabling auto_da_alloc resulted in smaller delays.

Puzzling the Riddle

As is explained in the kernel docs [10], setting auto_da_alloc in the default data=ordered mode will result in the replace-via-rename and replace-via-truncate file access patterns having their data blocks flushed to disk before a rename is persisted to disk, consistent with what was observed. The solution to the puzzle lies in filesystem designers attempting to prevent data loss during system crashes for those applications designed with the assumptions of the older ext3 filesystem environment, as detailed by Ted Ts'o in a vintage 2009 blog [11]. As the distribution was turning from ext3 to ext4, some Ubuntu users encountered unexpected data loss because of buggy applications in the Jaunty Jackalope (v9.04) alpha installs. In Ted's own words, "Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds".

Infos

  1. mv performance problem: https://twitter.com/jsaryer/status/1069775822880333824
  2. Unexplained rename() delay: https://twitter.com/jsaryer/status/1069831718389960704
  3. GitHub, James Saryerwinnie: https://gist.github.com/jamesls/9d52c7b3a19491f4d14c2d487515115f
  4. rename(2) man page: https://linux.die.net/man/2/rename
  5. "Assess USB performance while exploring storage caching" by Federico Lucifredi, ADMIN , issue 48, 2018, pg. 94, http://www.admin-magazine.com/Archive/2018/48/Assess-USB-performance-while-exploring-storage-caching
  6. Amazon EBS user guide: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html
  7. strace(1) man page: https://linux.die.net/man/1/strace
  8. ext4 options problem: https://twitter.com/_msw_/status/1069853156534669312
  9. sync(8) man page: https://linux.die.net/man/8/sync
  10. auto_da_alloc in kernel ext4 documentation: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
  11. "Delayed allocation and the zero-length file problem" by Theodore Ts'o, March 12, 2009: http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

The Author

Federico Lucifredi http://@0xf2 is the Product Management Director for Ceph Storage at Red Hat, formerly the Ubuntu Server PM at Canonical and the Linux "Systems Management Czar" at SUSE. He enjoys arcane hardware issues and shell scripting mysteries and takes his McFlurry shaken, not stirred. You can read more from him in the new O'Reilly title AWS System Administration .

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

comments powered by Disqus