The Lustre open source distributed, parallel filesystem scales to high-performance computing environments.

Working with the Lustre Filesystem

What do you do when you need to deploy a large filesystem that is scalable to the exabyte level and supports a large-client, simultaneous-access workload? You find a parallel distributed filesystem such as Lustre. In this article, I build the high-performance Lustre filesystem from source, install it on multiple machines, mount it from clients, and access it in parallel.

Lustre Filesystems

A distributed filesystem allows access to files from multiple hosts sharing the files within a computer network, which makes it possible for multiple users on multiple client machines to share files and storage resources. The client machines do not have direct access to the underlying block storage sharing those files; instead, they communicate with a set or cluster of server machines hosting those files and the filesystem to which they are written.

Lustre (a portmanteau of Linux and cluster) [1]-[3] is one such distributed filesystem, usually deployed for large-scale cluster high-performance computing (HPC). Licensed under the GNU General Public License (GPL), Lustre provides a solution in which high performance and scalability to tens of thousands of nodes (including the clients) and exabytes of storage become a reality and are relatively simple to deploy and configure. As of this writing, the Lustre project is at version 2.14, nearing the official release of 2.15 (currently under development), which will be the next long-term support (LTS) release.

Lustre has a somewhat unique architecture, with four major functional units: (1) a single Management Service (MGS), which can be hosted on its own machine or on one of the metadata machines; (2) the Metadata Service (MDS), which contains Metadata Targets (MDTs); (3) Object Storage Services (OSS), which store file data on one or more Object Storage Targets (OSTs); and (4) the clients that access and use the file data.

For each Lustre filesystem, MDTs store namespace metadata, which include file names, directories, access permissions, and file layouts. The MDT data is stored in a single-disk dedicated filesystem that maps locally to the serving node, controls file access, and informs the client nodes which objects make up a file. One or more MDS nodes can exist on a single Lustre filesystem with one or more MDTs each.

An OST is a dedicated object-based filesystem exported for read and write operations. The capacity of a Lustre filesystem is determined by the sum of the total capacities of the OSTs.

Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, which allows concurrent and coherent read and write access to the files in the filesystem. When a client accesses a file, it completes a file name lookup on the MDS, and either a new file is created or the layout of an existing file is returned to the client.

Locking the file on the OST, the client then runs one or more read or write operations to the file but does not directly modify the objects on the OST. Instead, it delegates tasks to the OSS. This approach ensures scalability and improved security and reliability, because it does not allow direct access to the underlying storage, which would increase the risk of filesystem corruption from misbehaving or defective clients.

Although all four components (MGS, MDT, OST, and client) can run on the same node, they are typically configured on separate nodes communicating over a network.


In this article, I use eight nodes, four of which will be configured as client machines and the rest as the servers hosting the Lustre filesystem. Although not required, all eight systems will run CentOS 8.5.2111. As the names imply, the servers will host the target Lustre filesystem; the clients will not only mount it, but also write to it.

For the configuration, you need to build the filesystem packages for both the clients and the servers, which means you will need to install package dependencies from the package repositories:

$ sudo dnf install wget git make gcc kernel-devel \
  epel-release automake binutils libtool bison byacc \
  kernel-headers elfutils-libelf-devel elfutils-libelf \
  kernel-rpm-macros kernel-abi-whitelists keyutils-libs \
  keyutils-libs-devel libnl3 libnl3-devel rpm-build

Next, enable the powertools repository and install the following packages:

$ sudo dnf config-manager --set-enabled powertools
$ sudo dnf install dkms libyaml-devel

To build Lustre from source, you need to grab the updated e2fsprogs packages for your respective distribution and version hosted on the Whamcloud project website. In this case, I downloaded and installed the necessary packages for my system:


An RPM build environment needs to be created next, which will only be used once to grab, install, and extract the source kernel packages:

$ echo '%_topdir %(echo $HOME)/rpmbuild' > ~/.rpmmacros

The Lustre filesystem relies on a local filesystem to store local objects. The project supports ZFS and a patched version of ext4 called LDISKFS, which I use for the build with the ext4 source from a running kernel. To grab the correct kernel source, you need to make a note of your distribution and its version,

$ cat /etc/redhat-release
CentOS Linux release 8.5.2111

as well as the kernel version:

$ uname -r

The location of the kernel source package differs depending on the information output above. Listing 1 shows the commands for my setup to grab the kernel's source, install the source RPM, change into the directory containing the source objects, and extract the kernel tarball. The final three lines change to the kernel/fs source directory (which should mostly be empty) of the currently installed kernel source, rename the existing ext4 directory, and copy the extracted ext4 source into the current directory.

Listing 1: Kernel Source

$ wget
$ sudo rpm -ivh kernel-4.18.0-348.7.1.el8_5.src.rpm
$ cd ~/rpmbuild/SOURCES
$ tar xJf linux-4.18.0-348.7.1.el8_5.tar.xz
$ cd /usr/src/kernels/4.18.0-348.7.1.el8_5.x86_64/fs/
$ sudo mv ext4/ ext4.orig
$ sudo cp -r /home/pkoutoupis/rpmbuild/SOURCES/linux-4.18.0-348.7.1.el8_5/fs/ext4 .
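
The source RPM name used in Listing 1 follows from the kernel release string. As a minimal sketch, the architecture suffix can be stripped with shell parameter expansion. The helper name kernel_srpm is hypothetical, and the sample release string is an assumption chosen to be consistent with the package name in Listing 1, not captured uname output:

```shell
# Hypothetical helper: derive the kernel source RPM file name from a
# kernel release string such as the one printed by `uname -r`.
kernel_srpm() {
    release=$1
    # Drop the trailing architecture component (e.g., ".x86_64").
    version=${release%.*}
    printf 'kernel-%s.src.rpm\n' "$version"
}

# Assumed sample value; on a real system, pass "$(uname -r)" instead.
kernel_srpm "4.18.0-348.7.1.el8_5.x86_64"
```

If the name printed here does not match the package you downloaded, the ext4 source you copy will not correspond to the running kernel.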

Building Lustre from Source

The next steps check out the Lustre source code in your home directory, change into the source directory, check out the desired branch, and set the version string:

$ cd ~
$ git clone git://
$ cd lustre-release
$ git branch
$ git checkout master

To build the client packages, type:

$ sh && ./configure --disable-server && make rpms

When the build completes without error, the RPMs shown in Listing 2 will be listed in the root of the source directory.

Listing 2: RPMs After the Build

$ ls *.rpm

Now you need to install the client packages on the client nodes and verify that the packages and the version have been installed:

$ sudo dnf install {kmod-,}lustre-client-2.14.56_111_gf8747a8-1.el8.x86_64.rpm
$ rpm -qa|grep lustre

To build the server packages, type:

$ sh && ./configure && make rpms

When the build completes, you will find the RPMs shown in Listing 3 in the root of the source directory:

Listing 3: Source Root RPMs

$ ls *.rpm

To install the packages on the nodes designated as servers, type:

$ sudo dnf install *.rpm

Then, verify that the packages and the version have been installed. I installed the packages shown in Listing 4. Before proceeding, please read the “Configuring the Servers” box.

Listing 4: Packages on Nodes

$ rpm -qa|grep lustre

Configuring the Servers

For the sole purpose of convenience, I have deployed virtual machines to host this entire tutorial. I will also be limited to a 1 Gigabit Ethernet (GigE) network. On each of the virtual machines designated to host the Lustre filesystem, a secondary, approximately 50GB drive is attached.

Preparing the Metadata Servers

You now have Lustre builds for both the client and server setups, so I will switch the focus to using those builds to configure both. Although a separate node could have been used to host the management service (i.e., the MGS), I instead opted to use the first MDS hosting the first MDT as the management service. To do this, add the --mgs option when formatting the device for Lustre. A Lustre deployment can host a single MDT or many (64 or more). However, in this example, I will format just one (Listing 5). If you do choose to format additional MDTs, be sure to increment the value of the index parameter by one each time and specify the network identifier (NID) for the MGS node with --mgsnode=<NID> (shown in the “Preparing the Object Storage Servers” section).

Listing 5: Formatting the MDT

$ sudo mkfs.lustre --fsname=testfs --index=0 --mgs --mdt /dev/sdb
   Permanent disk data:
Target:     testfs:MDT0000
Index:      0
Lustre FS:  testfs
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
checking for existing Lustre data: not found
device size = 48128MB
formatting backing filesystem ldiskfs on /dev/sdb
        target name   testfs:MDT0000
        kilobytes     49283072
        options       -I 512 -i 1024 -J size=1925 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,project,huge_file,ea_inode,large_dir,flex_bg -E lazy_journal_init="0",lazy_itable_init="0" -F
mkfs_cmd = mke2fs -j -b 4096 -L testfs:MDT0000 -I 512 -i 1024 -J size=1925 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,project,huge_file,ea_inode,large_dir,flex_bg -E lazy_journal_init="0",lazy_itable_init="0" -F /dev/sdb 49283072k
Writing CONFIGS/mountdata
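
The Flags value in the mkfs.lustre output is a bitmask, and the text in parentheses is its decoded form. The sketch below decodes such a value. The bit assignments are an assumption inferred from the two flag values shown in this article's listings (0x65 for the combined MDT/MGS and 0x62 for an OST) and should be confirmed against lustre_disk.h in your source tree; the function name decode_flags is hypothetical:

```shell
# Decode the "Flags" bitmask printed by mkfs.lustre.
# Bit values are assumed (inferred from 0x65 and 0x62 in the listings).
decode_flags() {
    flags=$(( $1 ))
    out=""
    if [ $(( flags & 0x01 )) -ne 0 ]; then out="$out MDT"; fi
    if [ $(( flags & 0x02 )) -ne 0 ]; then out="$out OST"; fi
    if [ $(( flags & 0x04 )) -ne 0 ]; then out="$out MGS"; fi
    if [ $(( flags & 0x20 )) -ne 0 ]; then out="$out first_time"; fi
    if [ $(( flags & 0x40 )) -ne 0 ]; then out="$out update"; fi
    # Trim the leading space before printing.
    printf '%s\n' "${out# }"
}

decode_flags 0x65
decode_flags 0x62
```

The first call corresponds to the target formatted with both --mgs and --mdt, which is why it reports both roles.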

Now create a mountpoint to host the MDT and then mount it:

$ sudo mkdir /mnt/mdt
$ sudo mount -t lustre /dev/sdb /mnt/mdt/

Because I am not using LDAP and just trusting my clients (and their users) for this example, I need to execute the following on the same MGS node:

$ sudo lctl set_param mdt.*.identity_upcall=NONE

Note that the above command should NOT be deployed in production because it could potentially lead to security concerns and issues.

Make note of the management server's IP address (Listing 6). This output will be the Lustre Networking (LNET) NID, which can be verified by:

$ sudo lctl list_nids
10.0.0.2@tcp

Listing 6: Management Server

$ sudo ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
        inet  netmask  broadcast
        inet6 fe80::bfd3:1a4b:f76b:872a  prefixlen 64  scopeid 0x20<link>
        ether 42:01:0a:80:00:02  txqueuelen 1000  (Ethernet)
        RX packets 11919  bytes 61663030 (58.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10455  bytes 973590 (950.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

LNET is Lustre's network communication protocol, which is designed to be lightweight and efficient. It supports message passing for remote procedure call (RPC) request processes and remote direct memory access (RDMA) for bulk data movement. All metadata and file data I/O are managed through LNET.

Preparing the Object Storage Servers

On the next server, I format the secondary storage volume to be the first OST with an index of 0, while pointing to the MGS node with --mgsnode=<NID> (Listing 7). Then, I create a mountpoint to host the OST and mount it:

$ sudo mkdir /mnt/ost
$ sudo mount -t lustre /dev/sdb /mnt/ost/

Listing 7: Format the OST

$ sudo mkfs.lustre --reformat --index=0 --fsname=testfs --ost --mgsnode= /dev/sdb
   Permanent disk data:
Target:     testfs:OST0000
Index:      0
Lustre FS:  testfs
Mount type: ldiskfs
Flags:      0x62
              (OST first_time update )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=
device size = 48128MB
formatting backing filesystem ldiskfs on /dev/sdb
        target name   testfs:OST0000
        kilobytes     49283072
        options       -I 512 -i 1024 -J size=1024 -q -O extents,uninit_bg,dir_nlink,quota,project,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init="0",lazy_itable_init="0" -F
mkfs_cmd = mke2fs -j -b 4096 -L testfs:OST0000 -I 512 -i 1024 -J size=1024 -q -O extents,uninit_bg,dir_nlink,quota,project,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init="0",lazy_itable_init="0" -F /dev/sdb 49283072k
Writing CONFIGS/mountdata

On the rest of the nodes, I follow the same procedure, again incrementing the index parameter value by one each time (Listing 8). Be sure to create the local mountpoint to host the OST and then mount it.

Listing 8: The Rest of the Nodes

$ sudo mkfs.lustre --reformat --index=1 --fsname=testfs --ost --mgsnode= /dev/sdb
   Permanent disk data:
Target:     testfs:OST0001
Index:      1
Lustre FS:  testfs
Mount type: ldiskfs
Flags:      0x62
[ ... ]
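
Because only the index changes from one OST to the next, the format commands are easy to generate. The following sketch prints (but does not run) the command for each index along with the target label I would expect mkfs.lustre to report. The function names are hypothetical, and MGS_NID is an assumption: It is the value that lctl list_nids reported on my MGS node, so substitute your own:

```shell
# Print (not run) the mkfs.lustre command for each OST index.
# FSNAME and DEVICE come from the listings; MGS_NID is assumed.
FSNAME=testfs
MGS_NID=10.0.0.2@tcp
DEVICE=/dev/sdb

ost_format_cmd() {
    printf 'mkfs.lustre --reformat --index=%d --fsname=%s --ost --mgsnode=%s %s\n' \
        "$1" "$FSNAME" "$MGS_NID" "$DEVICE"
}

# Target labels carry the index as four hexadecimal digits.
ost_label() {
    printf '%s:OST%04x\n' "$FSNAME" "$1"
}

for i in 0 1 2; do
    ost_label "$i"
    ost_format_cmd "$i"
done
```

Printing the commands first makes it harder to reuse an index by accident when you format many targets.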

Using the Clients

To mount the filesystem on a client, you need to specify the filesystem type, the NID of the MGS, the filesystem’s name, and the mountpoint on which to mount it. The template for the command and the command I used are:

mount -t lustre <MGS NID>:/<fsname> <mountpoint>
mount -t lustre /lustre
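
To make the template concrete, the sketch below assembles the mount command from its parts and prints it rather than running it. MGS_NID is an assumption (the value lctl list_nids reported on my MGS node), and FSNAME is the --fsname value used when formatting the targets; the variable names are only illustrative:

```shell
# Compose the client mount command from its parts without running it.
MGS_NID=10.0.0.2@tcp     # assumed: NID reported on the MGS node
FSNAME=testfs            # --fsname used when formatting the targets
MOUNTPOINT=/lustre

# A NID has the form <address>@<network type>.
nid_address=${MGS_NID%@*}
nid_network=${MGS_NID#*@}

echo "MGS address: $nid_address, LNET network: $nid_network"
echo "mount -t lustre ${MGS_NID}:/${FSNAME} ${MOUNTPOINT}"
```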

In the examples below, I will be relying on pdsh to run commands on multiple remote hosts simultaneously. All four clients will need a local directory to mount the remote filesystem,

$ sudo pdsh -w 10.0.0.[3-6] mkdir -pv /lustre

after which, you can mount the remote filesystem on all clients:

$ sudo pdsh -w 10.0.0.[3-6] mount -t lustre /lustre

Each client now has access to the remote Lustre filesystem. The filesystem is currently empty:

$ sudo ls /lustre/

As a quick test, create an empty file and verify that it has been created:

$ sudo touch /lustre/test.txt
$ sudo ls /lustre/

All four clients should be able to see the same file:

$ sudo pdsh -w 10.0.0.[3-6] ls /lustre
test.txt
test.txt
test.txt
test.txt

You can clean up the output so that you do not see the same instance repeated over and over again:

$ sudo pdsh -w 10.0.0.[3-6] ls /lustre | dshbak -c
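
If you are curious what that collapsing step amounts to, the following sketch approximates it with awk by grouping hosts that returned identical output. It is not a replacement for dshbak: It handles only single-line output per host, the function name collapse_output is hypothetical, and the sample lines are assumed pdsh-style output ("host: text") rather than output captured from my cluster:

```shell
# Approximate what `dshbak -c` achieves: group hosts that returned
# identical single-line output. Input format assumed: "host: text".
collapse_output() {
    awk -F': ' '
        { hosts[$2] = (hosts[$2] == "" ? $1 : hosts[$2] "," $1) }
        END { for (text in hosts) printf "%s\t%s\n", hosts[text], text }
    '
}

printf '%s\n' \
    '10.0.0.3: test.txt' \
    '10.0.0.4: test.txt' \
    '10.0.0.5: test.txt' \
    '10.0.0.6: test.txt' | collapse_output
```

A host whose view of the filesystem differs would show up on a line of its own, which is exactly what you want to spot when verifying mounts.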

I/O and Performance Benchmarking

MDTest is an MPI-based metadata performance testing application designed to test parallel filesystems, and IOR is a benchmarking utility also designed to test the performance of distributed filesystems. To put it more simply: With MDTest, you would typically test the metadata operations involved in creating, removing, and reading objects such as directories, files, and so on, whereas IOR is more straightforward and just focuses on benchmarking buffered or direct sequential or random write-read throughput to the filesystem. Both are maintained and distributed together under the IOR GitHub project [4]. To build the latest IOR package from source, you need to install a Message Passing Interface (MPI) framework, then clone, build, and install the test utilities:

$ sudo dnf install mpich mpich-devel
$ git clone
$ cd ior
$ MPICC=/usr/lib64/mpich/bin/mpicc ./configure
$ cd src/
$ make && sudo make install

You are now ready to run a simple benchmark of your filesystem.


The benchmark will give you a general idea of how the filesystem performs in its current environment. I rely on mpirun to dispatch the I/O generated by IOR in parallel across the clients; in the end, I get an aggregated result of the entire job execution.

The filesystem is currently empty, with the exception of the file created earlier to test the filesystem. Both the MDT and OSTs are empty with no real file data (Listing 9, executed from the client).

Listing 9: Current Environment

$ sudo lfs df
UUID                    1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID      22419556       10784    19944620   1% /lustre[MDT:0]
testfs-OST0000_UUID      23335208        1764    20852908   1% /lustre[OST:0]
testfs-OST0001_UUID      23335208        1768    20852904   1% /lustre[OST:1]
testfs-OST0002_UUID      23335208        1768    20852904   1% /lustre[OST:2]
filesystem_summary:      70005624        5300    62558716   1% /lustre
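
Because the capacity of a Lustre filesystem is the sum of its OSTs, the filesystem_summary line can be cross-checked by adding up the OST rows. The sketch below does this with awk on the rows copied from Listing 9; on a live client, you could pipe the output of lfs df into the same function. The function name sum_osts is hypothetical:

```shell
# Sum the size, used, and available columns of the OST rows only;
# the MDT row does not count toward the filesystem summary.
sum_osts() {
    awk '/OST[0-9a-f]+_UUID/ { size += $2; used += $3; avail += $4 }
         END { printf "%d %d %d\n", size, used, avail }'
}

sum_osts <<'EOF'
testfs-MDT0000_UUID      22419556       10784    19944620   1% /lustre[MDT:0]
testfs-OST0000_UUID      23335208        1764    20852908   1% /lustre[OST:0]
testfs-OST0001_UUID      23335208        1768    20852904   1% /lustre[OST:1]
testfs-OST0002_UUID      23335208        1768    20852904   1% /lustre[OST:2]
EOF
```

The three totals printed (70005624, 5300, and 62558716) agree with the summary line in the listing.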

Now, run a write-only instance of IOR from the four clients simultaneously to benchmark the performance of the HPC setup. Each client will initiate a single process to write 64MB transfers to a 5GB file (Listing 10).

Listing 10: IOR Write-Only

$ sudo /usr/lib64/mpich/bin/mpirun --host,,, /usr/local/bin/ior -F -w -t 64m -k --posix.odirect -D 60 -u -b 5g -o /lustre/test.01
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Tue Jan 25 20:02:21 2022
Command line        : /usr/local/bin/ior -F -w -t 64m -k --posix.odirect -D 60 -u -b 5g -o /lustre/test.01
Machine             : Linux lustre-client1
TestID              : 0
StartTime           : Tue Jan 25 20:02:21 2022
Path                : /lustre/0/test.01.00000000
FS                  : 66.8 GiB   Used FS: 35.9%   Inodes: 47.0 Mi   Used Inodes: 0.0%
api                 : POSIX
apiVersion          :
test filename       : /lustre/test.01
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : no tasks offsets
nodes               : 4
tasks               : 4
clients per node    : 1
repetitions         : 1
xfersize            : 64 MiB
blocksize           : 5 GiB
aggregate filesize  : 20 GiB
stonewallingTime    : 60
stoneWallingWearOut : 0
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     1835.22    28.68      0.120209    5242880    65536      0.000934   11.16      2.50       11.16      0
Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write        1835.22    1835.22    1835.22       0.00      28.68      28.68      28.68       0.00   11.15941         NA            NA     0      4   1    1   1     0        1         0    0      1 5368709120 67108864   20480.0 POSIX      0
Finished            : Tue Jan 25 20:02:32 2022
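
The headline figures in the IOR output follow directly from the run parameters. As a quick sanity check, the sketch below recomputes them from values taken from Listing 10 (four tasks, a 5GiB block per task, 64MiB transfers, and the mean run time from the summary line). The function name ior_summary is hypothetical:

```shell
# Recompute the aggregate size, transfer count, bandwidth, and IOPS
# of the write run from its parameters (all taken from the listing).
tasks=4
block_mib=5120        # -b 5g per task
xfer_mib=64           # -t 64m
seconds=11.15941      # Mean(s) from the summary line

ior_summary() {
    awk -v n="$1" -v b="$2" -v x="$3" -v s="$4" 'BEGIN {
        total = n * b                 # aggregate MiB written
        xfers = total / x             # number of transfers issued
        printf "aggregate=%dMiB transfers=%d bw=%.2fMiB/s iops=%.2f\n",
               total, xfers, total / s, xfers / s
    }'
}

ior_summary "$tasks" "$block_mib" "$xfer_mib" "$seconds"
```

The bandwidth and IOPS columns are therefore two views of the same measurement: total data moved, or total transfers issued, divided by the elapsed time.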

Notice the aggregate write throughput of a little more than 1.8GiBps. Considering that each client is writing to the target filesystem in a single process and that you probably did not hit the limit of the GigE backend, this isn't a bad result. You will also start to see the OSTs fill up with data (Listing 11).

Listing 11: Writing to OST Targets

$ lfs df
UUID                    1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID      22419556       10800    19944604   1% /lustre[MDT:0]
testfs-OST0000_UUID      23335208     5244648    15577064  26% /lustre[OST:0]
testfs-OST0001_UUID      23335208     5244652    15577060  26% /lustre[OST:1]
testfs-OST0002_UUID      23335208    10487544    10301208  51% /lustre[OST:2]
filesystem_summary:      70005624    20976844    41455332  34% /lustre
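
The uneven fill levels have a simple explanation: four 5GiB files were written to three OSTs, so one OST had to take two of them. The sketch below estimates the number of files on each OST from the change in the Used column between the two lfs df listings. It assumes that each file is stored whole on a single OST, which is what these numbers suggest for my setup; the function name files_on_ost is hypothetical:

```shell
# Estimate how many 5 GiB IOR files landed on each OST from the
# "Used" column before and after the write run (values in KiB).
file_kib=5242880   # 5 GiB in KiB

files_on_ost() {
    before=$1
    after=$2
    # Round to the nearest whole file; small overhead is ignored.
    echo $(( (after - before + file_kib / 2) / file_kib ))
}

echo "OST0000: $(files_on_ost 1764 5244648)"
echo "OST0001: $(files_on_ost 1768 5244652)"
echo "OST0002: $(files_on_ost 1768 10487544)"
```

With more clients or more processes per client, the files would spread more evenly across the available targets.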

This time, rerun IOR, but in read-only mode. The command uses the same number of clients, processes, and transfer size, but each process reads only 1GB (Listing 12).

Listing 12: IOR Read-Only

$ sudo /usr/lib64/mpich/bin/mpirun --host,,, /usr/local/bin/ior -F -r -t 64m -k --posix.odirect -D 15 -u -b 1g -o /lustre/test.01
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Tue Jan 25 20:04:11 2022
Command line        : /usr/local/bin/ior -F -r -t 64m -k --posix.odirect -D 15 -u -b 1g -o /lustre/test.01
Machine             : Linux lustre-client1
TestID              : 0
StartTime           : Tue Jan 25 20:04:11 2022
Path                : /lustre/0/test.01.00000000
FS                  : 66.8 GiB   Used FS: 30.0%   Inodes: 47.0 Mi   Used Inodes: 0.0%
api                 : POSIX
apiVersion          :
test filename       : /lustre/test.01
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : no tasks offsets
nodes               : 4
tasks               : 4
clients per node    : 1
repetitions         : 1
xfersize            : 64 MiB
blocksize           : 1 GiB
aggregate filesize  : 4 GiB
stonewallingTime    : 15
stoneWallingWearOut : 0
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
WARNING: Expected aggregate file size       = 4294967296
WARNING: Stat() of aggregate file size      = 21474836480
WARNING: Using actual aggregate bytes moved = 4294967296
WARNING: Maybe caused by deadlineForStonewalling
read      2199.66    34.40      0.108532    1048576    65536      0.002245   1.86       0.278201   1.86       0
Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
read         2199.66    2199.66    2199.66       0.00      34.37      34.37      34.37       0.00    1.86211         NA            NA     0      4   1    1   1     0        1         0    0      1 1073741824 67108864    4096.0 POSIX      0
Finished            : Tue Jan 25 20:04:13 2022

For a virtual machine deployment on a 1GigE network, I get roughly 2.2GiBps reads, which again, if you think about it, is not bad at all. Imagine this on a much larger configuration with better compute, storage, and network capabilities; more processes per client; and more clients. This cluster would scream with speed.


That is the Lustre high-performance filesystem in a nutshell. To unmount the filesystem from the client, use the umount command, just like you would unmount any other device from a system:

$ sudo pdsh -w 10.0.0.[3-6] umount /lustre

Lustre is not the only distributed filesystem of its kind; others include IBM's GPFS, BeeGFS, and plenty more. Despite the competition, Lustre is both stable and reliable and has cemented itself in the HPC space for nearly two decades; it is not going anywhere.

For Further Reading

[1] The Lustre Project:
[2] The Lustre Project Wiki:
[3] The Lustre Documentation:
[4] The IOR (and MDtest) GitHub Project:

About the Author

Petros Koutoupis is currently a senior performance software engineer at Cray (now HPE) for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for well over a decade and has helped to pioneer the many technologies unleashed in the wild today.