The Quest For The Fastest Linux Filesystem

What’s this thing about?

This post has a few main points:

1. Speeding up a filesystem’s performance by setting it up on a tuned RAID0/5 array.

2. Picking the fastest filesystem.

3. The fastest format options for Ext3/4 or XFS filesystems.

4. Tuning an Ext3/4 filesystem’s journal and directory index for speed.

5. Filesystem mount options that increase performance, such as noatime and barrier=0.

6. Setting up LILO to boot from a RAID1 /boot partition.

The title is a bit of an oversimplification 😉 The article is intended to remain a work in progress as “we” learn and as new, faster tools become available. It is not intended to cover the fastest hardware (yet); the goal is the “fastest” filesystem possible on whatever device you have. Basically, “we” want to set up and tweak whatever is possible to get our IO writes and reads to happen quicker. Which IO reads? Random or sequential? Long or short? The primary goal is a quick Linux root filesystem, which is slightly different from, let’s say, a database-only filesystem or a /home partition for user files. Oh, and by the way, do not use this on your production machines, people. Seriously.

RAID

WTF is RAID?!

The first question is, how many devices would you like your filesystem to span? The simple and correct answer is: the more, the faster. To use one filesystem across multiple devices, a single “virtual” device can be created from multiple partitions with RAID. (Recently developed filesystems, like BTRFS and ZFS, are capable of splitting themselves intelligently across partitions to optimize performance on their own, without RAID.) Linux uses a software RAID tool which comes free with every major distribution – mdadm. Read about mdadm here, and read about using it here. There’s also a quick 10-step guide I wrote here which will give you an idea of the general procedure for setting up an mdadm RAID array.
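
If you already have arrays assembled and want a quick look at what the kernel sees before changing anything, the kernel’s md status file and mdadm itself will report the basics (the /dev/md0 name below is just an example):

# cat /proc/mdstat
# mdadm --detail /dev/md0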

Plan your array, and then think about it for a while before you execute – you can’t change the array’s geometry (which is the performance-sensitive part) after it’s created, and it’s a real pain to migrate a filesystem between arrays, not to mention a Linux root filesystem.

Deciding on a performance oriented type of RAID ( RAID0 vs. RAID5 )

The rule of thumb is to use 3 or more drives in a RAID5 array to gain redundancy at the cost of a slight performance loss over a RAID0 array (about 10% CPU load at peak times on my 2.8 GHz Athlon X2 with a 3-disk RAID5 array). If you only have 2 drives, you cannot use RAID5. Whatever your situation is, RAID0 will always be the fastest, but less responsible, choice.
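
For comparison with the RAID0 command shown later on, a 3-disk RAID5 array with the same 32kb chunk size could be created along these lines (the /dev/sd[abc]1 partitions are placeholders for whatever you actually have):

# mdadm --create /dev/md0 --metadata=0.90 --level=5 --chunk=32 --raid-devices=3 /dev/sd[abc]1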

RAID0 provides no redundancy and will fail irrecoverably when one of the drives in the array fails. Some would say you should avoid putting your root filesystem on a non-redundant array, but we’ll do it anyway! RAID0 is, well, the *fastest* (I threw that caution to the wind and I’m typing this from a RAID0 root partition, for what it’s worth). If you are going to use, or have been using, a RAID0 array, please comment about your experiences. Oh, and do back up often. At least weekly. To an *external* drive. If you only have one drive you can skip to the filesystem tuning part. If you are going to use RAID0/5, remember to leave room for a RAID1 array, or a regular partition, for /boot. Today, LILO cannot yet boot from a RAID0/5 array.

Deciding on a RAID stripe size ( 4 / 8 / 16 / 32 / 64 / 128 / 256 … )

You will need to decide, for both RAID0 and RAID5, on the size of the stripe (chunk) you will use. See how such decisions affect performance here. I find the best results for my personal desktop with 32kb chunks; 64 does not feel much different. I would not recommend going below 32 or above 128 for a general desktop’s root partition. I surf, play games, stream UPnP, run virtual machines, and use a small MySQL database. If I were doing video editing, for example, a significantly bigger stripe size would be faster. Such special-purpose filesystems should be set up for their own needs and not on the root filesystem, if possible. Comments?
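
If you want to double-check the chunk size an existing array was created with, mdadm will report it (again assuming /dev/md0):

# mdadm --detail /dev/md0 | grep "Chunk Size"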

RAID 5 – deciding on a parity algorithm ( Symmetric vs. Asymmetric )

For RAID5, the parity algorithm (layout) can be set to 4 different types: left-symmetric, right-symmetric, left-asymmetric, and right-asymmetric. They are explained here, but they appear to affect performance only to a small degree for desktop usage, as one thread summarized.
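
The layout is picked at creation time with mdadm’s --layout option; left-symmetric is the default, so you only need the flag if you want to try one of the others. A sketch, with placeholder partitions:

# mdadm --create /dev/md0 --metadata=0.90 --level=5 --chunk=32 --layout=left-symmetric --raid-devices=3 /dev/sd[abc]1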

Creating a RAID0 array

Using the suggestions above, the command to create a 2-disk RAID0 array for a root partition on /dev/md0 using the partitions /dev/sda1 and /dev/sdb1 should look like this:

# mdadm --create /dev/md0 --metadata=0.90 --level=0 --chunk=32 --raid-devices=2 /dev/sd[ab]1

Note the --metadata option: 0.90 specifies the older mdadm metadata format. If you use anything other than 0.90, you will find LILO failing to boot.
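
To verify which metadata version an array actually ended up with before going any further, ask mdadm (assuming /dev/md0):

# mdadm --detail /dev/md0 | grep Version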

The Fastest Filesystem – Setup and Tuning

Deciding on a Filesystem ( Ext3 vs. Ext4 vs. XFS vs. BTRFS )

The Ext4 filesystem does seem to outperform Ext3, XFS and BTRFS, and it can be optimized for striping on RAID arrays. I recommend Ext4 until BTRFS catches up in performance, becomes compatible with LILO/GRUB, and gets an fsck tool.

Deciding on a Filesystem Block Size ( 1 vs. 2 vs. 4 )

It is impossible to overstate how important this part is. Luckily, if you don’t know what this is and just don’t touch it, most mkfs tools default to the fastest choice: 4kb blocks. Why you would not want to use 1 or 2 is neatly shown in the benchmark results of RAID performance at those block sizes. Even if you are not using RAID, you will find 4kb blocks to perform faster. Much like the RAID geometry, this is permanent and cannot be changed.

Creating a RAID-optimized Ext4 filesystem ( stride and stripe-width )

Use the following guidelines to calculate these values:

stride = RAID chunk size / filesystem block-size
stripe-width = stride * number of data-bearing drives in the array ( all drives for RAID0, total minus one for RAID5 )
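
For example, with the 2-disk, 32kb-chunk RAID0 array created above and a 4kb filesystem block size:

stride = 32 / 4 = 8
stripe-width = 8 * 2 = 16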

Pass the stride and the stripe-width to mkfs.ext4, along with the block size in bytes, like this:

# mkfs.ext4 -b 4096 -E stride=8,stripe-width=16 /dev/md0

A handy tool to calculate those things for you can be found here.

Creating an optimized XFS filesystem ( sunit and swidth )

The XFS options for RAID optimization are sunit and swidth. A good explanation of those two options can be found in this post. A quick and dirty formula to calculate these parameters was taken from here:

sunit = RAID chunk size in bytes / 512
swidth = sunit * number of data-bearing drives in the array ( all drives for RAID0, total minus one for RAID5 )

The sunit for an array with 32kb (32768 byte) chunks would be 32768 / 512 = 64.

The command to create such a filesystem on a 2-drive RAID0 array with a 32kb chunk size and a 4096 byte (4kb) block size will look something like this:

# mkfs.xfs -b size=4096 -d sunit=64,swidth=128 /dev/md0
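
Once the filesystem is mounted, xfs_info will echo the geometry back; note that it reports sunit and swidth in filesystem blocks rather than 512-byte sectors, so for this example you should see sunit=8 and swidth=16 blks (the /mnt mount point is just an example):

# xfs_info /mnt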

Tuning the Ext3 / Ext4 Filesystem ( Journal )

There’s a good explanation of the 3 modes in which a filesystem’s journal can operate on the openSUSE wiki. That same guide rightly recommends avoiding writing actual data to the journal to improve performance. On a newly created but unmounted filesystem, disable the writing of actual data to the journal:

# tune2fs -O has_journal -o journal_data_writeback /dev/md0
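
You can confirm the change stuck by listing the superblock settings; the “Default mount options” line should now include journal_data_writeback:

# tune2fs -l /dev/md0 | grep "Default mount options"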

Turning on Ext3 / Ext4 Directory Indexing:

Your filesystem will perform faster if the directories are indexed:

# tune2fs -O dir_index /dev/md0
# e2fsck -D /dev/md0

Filesystem Mounting Options ( noatime, nodiratime, barrier, data and errors options ):

Some options should be passed to the filesystem on mount to increase its performance:

noatime, nodiratime – Do not record access times for files and directories.

barrier=0 – Disable write barriers (only safe if you can assure uninterrupted power to the drives, such as with a UPS battery).

errors=remount-ro – When the filesystem hits errors, remount the root filesystem read-only (and generally panic).

data=writeback – For Ext3 / Ext4. If your journal is in writeback mode (as we previously advised), set this option too.

My fstab looks like this:

/dev/md0         /                ext4        noatime,nodiratime,data=writeback,stripe=16,barrier=0,errors=remount-ro      1   1

And my manual mount command will look like this:

# mount /dev/md0 /mnt -o noatime,nodiratime,data=writeback,stripe=16,barrier=0,errors=remount-ro

Did I mention to NEVER do this on a production machine?

Installing your Linux

Install as usual, but do not format the root partition you’ve set up! If you are using RAID0/5, you have to set up a separate RAID1 or plain /boot partition. In my experience, leaving the boot partition untuned does not affect regular performance, but if you are keen on shaving a few milliseconds off your boot time you can go ahead and tune that filesystem yourself as well.
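
A small RAID1 /boot array can be created the same way as the root array. A sketch, assuming /dev/sda2 and /dev/sdb2 are the partitions you set aside for it and that it becomes /dev/md1 as in the LILO section below:

# mdadm --create /dev/md1 --metadata=0.90 --level=1 --raid-devices=2 /dev/sd[ab]2
# mkfs.ext4 /dev/md1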

Making sure LILO boots

If you are using RAID0/5 for your root partition, you must set up a separate non-RAID or RAID1 partition as /boot. If you do set up your /boot partition on a RAID1 array, you have to make sure to point LILO to the right device by editing /etc/lilo.conf :

boot = /dev/md1

and make sure LILO knows about the mirroring of the /boot partitions by adding the line:

raid-extra-boot = mbr-only
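
Put together, the relevant part of a minimal /etc/lilo.conf might look something like this (the kernel image path and label are placeholders; adjust them to your system):

boot = /dev/md1
raid-extra-boot = mbr-only
image = /boot/vmlinuz
  root = /dev/md0
  label = Linux
  read-only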

Then, LILO must be reinstalled to the Master Boot Record while the /boot partition is mounted inside the root partition. From a system rescue CD, with a properly edited lilo.conf file, this will look something like this:

# mount /dev/md0 /mnt
# mount /dev/md1 /mnt/boot
# /mnt/sbin/lilo -C /mnt/etc/lilo.conf

… and reboot.

Experience and thoughts:

I’ve been following my own advice for the last couple of weeks. The system is stable and, best of all, *fast*. May those not be “famous last words”, but I’ll update this post as I go. The only thing we all really need is comments and input. If you use something else that works faster for you, let us know. If something downgraded your stability to the level of Win98, please let us know. Most importantly, if you see any errors, you got it: let us know.

TO DO:

Test this interesting post about Aligning Partitions

Test BTRFS on 2 drives without RAID/LVM


22 responses to “The Quest For The Fastest Linux Filesystem”

  1. Hi, thank you for a perceptive post. I actually found your blog by mistake while searching on Google for something closely related; in any case, before I ramble on too much I would just like to say how much I enjoyed your post. I have bookmarked your site and subscribed to your RSS feed. Once again, thank you very much for the blog post and keep up the good work.

  2. John Everett

    Thanks for the nice info. A couple questions, however:

    1) You create a RAID with a 32k chunk:
    mdadm --create /dev/md0 --metadata=0.90 --level=0 --chunk=32 --raid-devices=2 /dev/sd[ab]1

    Then define stride as:
    stride = filesystem block-size / RAID chunk

    But, then make your 4k block-size file system with a stride of 8:
    mkfs.ext4 -b 4096 -E stride=8,stripe-width=16 /dev/md0

    Is stride then actually:
    stride = RAID chunk / filesystem block-size = 32/4 = 8

    2) You write:
    “…if the [mdadm metadata format] is anything other than 0.90, you will find Lilo failing to boot.”
    and follow that advice creating your RAID 0 root filesystem device:
    mdadm --create /dev/md0 --metadata=0.90 --level=0 --chunk=32 --raid-devices=2 /dev/sd[ab]1

    However, you also say:
    “If you are using RAID0/5 for your root partition, you must setup a separate non-RAID or RAID1 partition as /boot.”

    Is the 0.90 metadata format necessary on the (non-booting) root partition device (/dev/md0) or is it only necessary on a (booting) RAID 1 /boot partition device? Or on Both?

    Thanks again.

    • Ernest Kugel

      Yes you are correct, typo fixed!

      I believe you have to specify *both* ROOT and BOOT filesystems as 0.90 because Lilo needs to use the Root filesystem to load the kernel, but cannot recognize it before the kernel is loaded if it is not “0.90”

      Hope this helps… Let us know how it’s going!

      • John Everett

        My understanding is that the kernel and the initial ram disk (with extra driver support) is in the /boot directory, so the boot loader should only need to be able to read the /boot partition. Once in the boot partition, the kernel, etc. should be able to access more complex configurations (RAID 0 or 5, VLM partitions, etc.) for the root partition. That’s why I figured only the /boot partition should need any special metadata concession.

        I indeed tested this out, making a .90 format RAID 1 /boot and a (default) 1.2 format RAID 0 / — and booting the system worked fine. However, when I experimented making the RAID 1 /boot version 1.2, it didn’t boot.

        I also tried using the CentOS 6.0 installer to create a similar structure. It made the /boot partition 1.0 metadata format and the / partition version 1.1, and all booted fine. All of these tests were using grub 0.97.

      • Ernest Kugel

        Grub support for 1.2 metadata is great news, last time when I tried it with Lilo on Slackware 13.1 it seemed to pose a problem…

      • Ernest Kugel

        I think LILO as a boot loader has a hard time with metadata > 0.90. It is not the kernel that is limited, but only LILO’s mdadm support, AFAIK. GRUB might have no such limitations, as it looks from your findings!

  3. alfonso

    Hi,
    great post!
    I noticed the same thing as John:
    shouldn’t stride be:
    stride = filesystem block-size / RAID chunk
    thanks,

  4. alfonso

    Hi,
    also the formula for stripe-width = stride * number of drives in RAID array is not in sync with the tool you link:
    http://busybox.net/~aldot/mkfs_stride.html
    The tool, with input
    RAID level 6
    Number of physical disks 4
    RAID chunk size (in KiB) 64
    number of filesystem blocks (in KiB) 4
    suggests
    mkfs.ext3 -b 4096 -E stride=16,stripe-width=32
    but with your formula stripe-width should be 64.

    any idea what’s the correct value?
    thanks!

  5. alfonso

    Ok,
    so I looked at the HTML code of the page. These are the formulas:
    stride = chunk size / fs blocksize
    stripe width = stride * number of EFFECTIVE drives in RAID array (so total – 2 for RAID 6, total – 1 for RAID 5 and so on)

  6. masuch

    Hi,

    I have a problem with speed of read/write of RAID 0.
    to create RAID 0 by:
    mdadm --create /dev/md0 --chunk=64 --level=0 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdd2 /dev/sde7
    mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/md0
    tune2fs -O has_journal -o journal_data_writeback /dev/md0
    tune2fs -O dir_index /dev/md0
    e2fsck -D /dev/md0
    sudo nano /etc/fstab
    /dev/md0 /mnt/md0 ext4 noatime,nodiratime,data=writeback,stripe=16,barrier=0,errors=remount-ro 1 1
    /etc/mdadm/mdadm.conf:
    ARRAY /dev/md0 metadata=1.2 UUID=f5671042:3c222058:dfff829f:7016466f

    measured speed:
    sudo hdparm -tT /dev/sda
    sudo hdparm -tT /dev/sdb
    sudo hdparm -tT /dev/sdc
    sudo hdparm -tT /dev/sdd

    it is always aproximately like following:
    Timing cached reads: 35184 MB in 2.00 seconds = 17613.44 MB/sec
    Timing buffered disk reads: 404 MB in 3.01 seconds = 134.06 MB/sec

    but for RAID 0 I have got:
    sudo hdparm -tT /dev/md0

    /dev/md0:
    Timing cached reads: 33210 MB in 2.00 seconds = 16623.39 MB/sec
    Timing buffered disk reads: 786 MB in 3.00 seconds = 261.94 MB/sec

    which is only 2 times faster … .

    —————–
    dd if=/dev/zero of=~/test.img bs=8k count=256k
    262144+0 records in
    262144+0 records out
    2147483648 bytes (2.1 GB) copied, 21.0776 s, 102 MB/s

    dd if=/dev/zero of=/mnt/md0/test.img bs=8k count=256k
    262144+0 records in
    262144+0 records out
    2147483648 bytes (2.1 GB) copied, 7.61225 s, 282 MB/s

    I have only 2 times faster read/write operations instead of 4 times ???

    Could anybody please help me what I messed up ?
    Thank you for any clue.
    M.

    • Ernest Kugel

      Hi, this is an interesting problem. I’m glad to see you are getting double the performance, but you are right to expect quadruple the performance with four drives in RAID0. Could you kindly post your motherboard’s make and model, and if you have an easy opportunity, set up a RAID0 array with 3 drives to see if you get triple performance or not?

      • John Everett

        I don’t get remotely close to X times performance from my RAID arrays either. I have six 1TB drives, partitioned, and with the partitions assembled into a RAID 10 for my system volume and a RAID 5 for a storage volume. hdparm benchmarks are as follows:

        /dev/sda: (single drive)
        Timing buffered disk reads: 304 MB in 3.00 seconds = 101.17 MB/sec
        /dev/md1 (RAID 10 built from 6 32GB partitions):
        Timing buffered disk reads: 586 MB in 3.01 seconds = 194.76 MB/sec
        /dev/md2 (RAID 5 built from 6 ~950GB partitions):
        Timing buffered disk reads: 670 MB in 3.01 seconds = 222.84 MB/sec

        This is on an old Pentium 4 Intel server motherboard.

      • Ernest Kugel

        These benchmarks are not great, I wonder if mdadm is the problem, or a lack of other resources? Have you been testing this from a LiveCD? Any other processes accessing the drives would slow it down if it is being used by an OS, so LiveCDs are the closest you can get to testing the drives for speed.

      • masuch

        OK,
        I have created RAID 0 from 3 partitions located on three different hard disks (all running on SATA II) with small optimizations.

        sudo mount /dev/md0 /mnt/md0 -o noatime,nodiratime,data=writeback,stripe=16,barrier=0,errors=remount-ro

        dd if=/dev/zero of=/mnt/md0/test.img bs=8k count=1024k
        1048576+0 records in
        1048576+0 records out
        8589934592 bytes (8.6 GB) copied, 40.1627 s, 214 MB/s

        not even 2 times faster.

        motherboard:
        maximus iv extreme, Revision 3, sandy bridge, BIOS version 2105

  7. John Everett

    Also, I notice this guide involves building filesystems directly on top of the software RAIDs. Are there any additional concerns that would need to be addressed if, on top of the software RAID devices, one first creates LVM abstractions (Logical Volume Manager Physical Volumes, Volume Groups, and Logical Volumes), and then creates the filesystems on top of those Logical Volumes?

    • Ernest Kugel

      Regarding LVMs, many distributions do create LVMs on top of mdadm (as well as hardware RAID arrays). This is useful for breaking up partitions over raid arrays. However, since I haven’t found anything about LVM increasing performance, it was left out.

  8. Hi, I noticed I had to use tune4fs and e4fsck (instead of tune2fs and e2fsck) while using ext4 on my CentOS 5.5 server. Just thought others might appreciate that as well.

    • Ernest Kugel

      Thanks for reporting from CentOS land. tune4fs and friends must be a Red Hat/Fedora/CentOS thing, because my Slackware distribution has no such thing. If I had to bet I’d put my money on CentOS making wrappers for these tools, much like “# mount.cifs” is really just “# mount -t cifs”. Thanks for the heads up re CentOS!

  9. TOM

    Awesome tutorial

    • Here’s what I did that worked well:

      mdadm --create --level=0 --name=royal --raid-devices=2 --metadata=1.2 --chunk=128 /dev/md/royal /dev/sdb1 /dev/sdc1

      mkfs.ext4 -m .1 -L ROYAL -b 4096 -E stride=32,stripe-width=64 /dev/md/royal

      /etc/fstab options:
      defaults,noatime,data=writeback,errors=remount-ro 0 2

      Works great! Careful with data=writeback!!! If you can live with 90% of the speed, use data=ordered.

  10. Pingback: Tune ext4 for the best performance | Arun's blog

  11. Pingback: Easy RAIDer « Coding Fit
