
The Quest For The Fastest Linux Filesystem

What’s this thing about?

This post has a few main points:

1. Speeding up a filesystem’s performance by setting it up on a tuned RAID0/5 array.

2. Picking the fastest filesystem.

3. The fastest format options for Ext3/4 or XFS filesystems.

4. Tuning an Ext3/4 filesystem’s journal and directory index for speed.

5. Filesystem mount options that increase performance, such as noatime and barrier=0.

6. Setting up LILO to boot from a RAID1 /boot partition.

The title is a bit of an oversimplification 😉 The article is intended to remain a work in progress as “we” learn and as new, faster tools become available. It is not (yet) intended to cover the fastest hardware; the goal is the “fastest” filesystem possible on whatever device you have. Basically, “we” want to set up and tweak whatever is possible to get our IO reads and writes to happen quicker. Which IO reads? Random or sequential? Long or short? The primary goal is a quick Linux root filesystem, which is slightly different from, let’s say, a database-only filesystem or a /home partition for user files. Oh, and by the way, do not use this on your production machines, people. Seriously.

RAID

WTF is RAID?!

The first question is, how many devices would you like your filesystem to span? The simple and correct answer is – the more the faster. To use one filesystem across multiple devices, a single “virtual” device can be created from multiple partitions with RAID. (Recently developed filesystems, like BTRFS and ZFS, are capable of splitting themselves intelligently across partitions to optimize performance on their own, without RAID.) Linux has a software RAID tool which comes free with every major distribution – mdadm. Read about mdadm here, and read about using it here. There’s also a quick 10-step guide I wrote here which will give you an idea of the general procedure of setting up an mdadm RAID array.
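Before planning anything, it can help to see what mdadm and the kernel already know about. A quick, non-destructive sanity check might look like this:

# mdadm --version
# cat /proc/mdstat        # arrays the kernel is currently running
# mdadm --detail --scan   # existing arrays and their UUIDs, if any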

Plan your array, and then think about it for a while before you execute – you can’t change the array’s geometry (which is the performance-sensitive part) after it’s created, and it’s a real pain to migrate a filesystem between arrays. Not to mention a Linux root filesystem.

Deciding on a performance oriented type of RAID ( RAID0 vs. RAID5 )

The rule of thumb is to use 3 or more drives in a RAID5 array to gain redundancy at the cost of a slight performance loss over a RAID0 array (10% CPU load at peak times on my 2.8 GHz AthlonX2 with a 3 disk RAID5 array). If you only have 2 drives, you cannot use RAID5. Whatever your situation is, RAID0 will always be the fastest, but less responsible, choice.

RAID0 provides no redundancy and will fail irrecoverably when one of the drives in the array fails. Some would say you should avoid putting your root filesystem on a non-redundant array, but we’ll do it anyways! RAID0 is, well, the *fastest* (I threw caution to the wind and I’m typing this from a RAID0 root partition, for what it’s worth). If you are going to be or have been using a RAID0 array, please comment about your experiences. Oh, and do backup often. At least weekly. To an *external* drive. If you only have one drive you can skip to the filesystem tuning part. If you are going to use RAID0/5, remember to leave room for a RAID1 array, or a regular partition, for /boot. Today, LILO cannot yet boot a RAID0/5 array.
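If you do go the RAID0/5 route for root, here is a minimal sketch of creating that small RAID1 /boot array. The partitions /dev/sda2 and /dev/sdb2 are only an assumption for illustration (use whatever small partitions you set aside), and the 0.90 metadata matches what LILO needs, as discussed below:

# mdadm --create /dev/md1 --metadata=0.90 --level=1 --raid-devices=2 /dev/sd[ab]2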

Deciding on a RAID stripe size ( 4 / 8 / 16 / 32 / 64 / 128 / 256 … )

You will need to decide, for both RAID0 and RAID5, on the size of the stripe (chunk) you will use. See how such decisions affect performance here. I find the best results for my personal desktop to be 32kb chunks; 64 does not feel much different. I would not recommend going below 32 or above 128 for a general desktop’s root partition. I surf, play games, stream UPnP, run virtual machines, and use a small MySQL database. If I were doing video editing, for example, a significantly bigger stripe size would be faster. Such specific-usage filesystems should be set up for their own needs and not on the root filesystem, if possible. Comments?
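If you want to get a rough feel for the difference on your own hardware, one crude approach (it only exercises sequential reads, so treat the numbers as a hint, not a verdict) is to create a throw-away array with one candidate chunk size, time it with hdparm, stop it, and repeat with another. The partitions below are placeholders for scratch partitions you do not mind wiping:

# mdadm --create /dev/md0 --level=0 --chunk=32 --raid-devices=2 /dev/sd[ab]1
# hdparm -t /dev/md0     # buffered sequential read test
# mdadm --stop /dev/md0  # tear it down, then repeat with --chunk=64, 128, ...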

RAID 5 – deciding on a parity algorithm ( Symmetric vs. Asymmetric )

For RAID5, the parity algorithm can be set to 4 different layouts: Symmetric-Left, Symmetric-Right, Asymmetric-Left, and Asymmetric-Right (mdadm calls these left-symmetric, right-symmetric, left-asymmetric and right-asymmetric). They are explained here, but they appear to affect performance only to a small degree for desktop usage, as one thread summarized.
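The layout is chosen at creation time with mdadm’s --layout option (left-symmetric is the default). A sketch for a 3-disk RAID5, mirroring the options used for the RAID0 example below and assuming the partitions /dev/sda1, /dev/sdb1 and /dev/sdc1:

# mdadm --create /dev/md0 --metadata=0.90 --level=5 --layout=left-symmetric --chunk=32 --raid-devices=3 /dev/sd[abc]1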

Creating a RAID0 array

Using the suggestions above, the command to create a 2-disk RAID0 array for a root partition on /dev/md0 using the partitions /dev/sda1 and /dev/sdb1 should look like this:

# mdadm --create /dev/md0 --metadata=0.90 --level=0 --chunk=32 --raid-devices=2 /dev/sd[ab]1

Note the --metadata option: 0.90 specifies the older mdadm metadata format. If you use anything other than 0.90, you will find LILO failing to boot.

The Fastest Filesystem – Setup and Tuning

Deciding on a Filesystem ( Ext3 vs. Ext4 vs. XFS vs. BTRFS )

The Ext4 filesystem does seem to outperform Ext3, XFS and BTRFS, and it can be optimized for striping on RAID arrays. I recommend Ext4 until BTRFS catches up in performance, becomes compatible with LILO/GRUB, and gets an FSCK tool.

Deciding on a Filesystem Block Size ( 1 vs. 2 vs. 4 )

It is hard to overstate how important this part is. Luckily, if you don’t know what this is and just don’t touch it, most mkfs tools default to the fastest choice – 4kb. Why you would not want to use 1kb or 2kb blocks is neatly shown in the benchmarking results of RAID performance on those block sizes. Even if you are not using RAID, you will find 4kb blocks to perform faster. Much like the RAID geometry, this is permanent and cannot be changed.
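If you ever want to double-check what block size an existing Ext filesystem ended up with, tune2fs will report it (assuming the /dev/md0 device used throughout this article); it should say 4096:

# tune2fs -l /dev/md0 | grep "Block size"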

Creating a RAID-optimized Ext4 filesystem ( stride and stripe-width )

Use these guidelines to calculate the values:

stride = RAID chunk size / filesystem block size
stripe-width = stride * number of data drives in the RAID array ( all the drives for RAID0, and that minus one for RAID5 )

Pass the stride and the stripe-width to mkfs.ext4, along with the block size in bytes, like this:

# mkfs.ext4 -b 4096 -E stride=8,stripe-width=16 /dev/md0

A handy tool to calculate those things for you can be found here.
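If you would rather not trust mental arithmetic, the same calculation can be done in the shell. This little sketch simply reproduces the numbers used above (32kb chunk, 4kb block, 2-drive RAID0):

# CHUNK_KB=32; BLOCK_KB=4; DRIVES=2
# echo "stride=$((CHUNK_KB / BLOCK_KB)) stripe-width=$((CHUNK_KB / BLOCK_KB * DRIVES))"
stride=8 stripe-width=16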

Creating an optimized XFS filesystem ( sunit and swidth )

The XFS options for RAID optimization are sunit and swidth. A good explanation about those two options can be found in this post. A quick and dirty formula to calculate those parameters was taken from here:

sunit = RAID chunk in bytes / 512
swidth = sunit * number of data drives in the RAID array ( all the drives for RAID0, and that minus one for RAID5 )

The sunit for an array with a 32kb (32768 byte) chunk would be 32768 / 512 = 64

The command to create such a filesystem for a 32kb chunk size RAID0 array with 2 drives and a 4096 (4kb) block size will look something like this:

# mkfs.xfs -b size=4096 -d sunit=64,swidth=128 /dev/md0

Tuning the Ext3 / Ext4 Filesystem ( Journal )

There’s a good explanation of the 3 modes in which a filesystem’s journal can be used on the OpenSUSE Wiki. That same guide rightly recommends avoiding writing actual data to the journal to improve performance. On a newly created but unmounted filesystem, disable the writing of actual data to the journal:

# tune2fs -O has_journal -o journal_data_writeback /dev/md0
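To confirm the flag took, it should now appear in the filesystem’s default mount options:

# tune2fs -l /dev/md0 | grep "Default mount options"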

Turning on Ext3 / Ext4 Directory Indexing:

Your filesystem will perform faster if the directories are indexed:

# tune2fs -O dir_index /dev/md0
# e2fsck -D /dev/md0

Filesystem Mounting Options ( noatime, nodiratime, barrier, data and errors options ):

Some options should be passed to the filesystem on mount to increase its performance:

noatime, nodiratime – Do not record access times of files and directories (on recent kernels noatime already implies nodiratime, but listing both does no harm).

barrier=0 – Disable write barriers (only safe if you can assure uninterrupted power to the drives, such as with a UPS battery).

errors=remount-ro – When the filesystem hits errors, remount the root filesystem read-only (and generally panic).

data=writeback – For Ext3 / Ext4. If your journal is in writeback mode (as we previously advised), set this option.

My fstab looks like this:

/dev/md0         /                ext4        noatime,nodiratime,data=writeback,stripe=16,barrier=0,errors=remount-ro      1   1

And my manual mount command will look like this:

# mount /dev/md0 /mnt -o noatime,nodiratime,data=writeback,stripe=16,barrier=0,errors=remount-ro

Did I mention to NEVER do this on a production machine?

Installing your Linux

Install as usual, but do not format the root partition you’ve set up! If you are using RAID0/5, you have to set up a separate RAID1 or regular (non-RAID) /boot partition. In my experience, leaving the boot partition unoptimized does not affect regular performance, but if you are keen on shaving a few milliseconds off your boot time you can go ahead and tune that filesystem yourself as well.
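For what it’s worth, here is a minimal sketch of such a /boot filesystem, assuming your RAID1 /boot array is /dev/md1 as in the LILO section below. A plain, journal-less ext2 is plenty for a partition that is only read at boot and written on kernel upgrades:

# mkfs.ext2 -b 4096 /dev/md1
# tune2fs -c 0 -i 0 /dev/md1     # optional: skip the periodic fsck based on mount count and time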

Making sure LILO boots

If you are using RAID0/5 for your root partition, you must set up a separate non-RAID or RAID1 partition as /boot. If you do set up your /boot partition on a RAID1 array, you have to make sure to point LILO to the right device by editing /etc/lilo.conf :

boot = /dev/md1

and make sure LILO knows about the mirroring of the /boot partitions by adding the line:

raid-extra-boot = mbr-only

Then, LILO must be reinstalled to the Master Boot Record while the /boot partition is mounted under the root filesystem. From a system rescue CD, with a properly edited lilo.conf file, this will look something like this:

# mount /dev/md0 /mnt
# mount /dev/md1 /mnt/boot
# /mnt/sbin/lilo -C /mnt/etc/lilo.conf

… and reboot.

Experience and thoughts:

I’ve been following my own advice for the last couple of weeks. The system is stable and, best of all, *fast*. May those not be “famous last words”; I’ll update this post as I go. The only thing we all really need is comments and input. If you use something else that works faster for you, let us know. If something downgraded your stability to the level of Win98, please let us know. More importantly, if you see any errors, you got it, let us know.

TO DO:

Test this interesting post about Aligning Partitions

Test BTRFS on 2 drives without RAID/LVM


RAID for Everyone – Faster Linux with mdadm and XFS

I crashed my Linux. It took a lot of skill and root access, but I’ve accidentally hosed my desktop, and backtracking will be more time consuming than running through a quick Slackware install. If you find yourself in this situation and have more than one drive in your machine, it makes sense to RAID the drives. RAID will either greatly increase the performance of the drives, which are the bottleneck of any desktop, or mirror them for protection against disk failure. To read more about RAID, which is becoming more and more popular, try The Linux Software RAID How To.

This quick how-to will try to cover the basics (but all the basics) needed to install any Linux desktop distribution on any machine with 2 or more drives. It begins with installing a Linux system on a RAID1 partition, and continues with adding a RAID0 home partition after the install. For the home partition, XFS will be used as the filesystem and tweaked to illustrate some of its strengths with RAID. Finally, it’ll cover replacing a failed drive in an array. Bits of it should be relevant to other scenarios as well. Mostly, it will attempt to demonstrate how simple it is to administer RAID arrays with mdadm.

Why software RAID (mdadm)? Chances are, your motherboard already comes with an on-board RAID controller; those are present on motherboards as cheap as $60. I won’t be using mine, however, and this tutorial will not cover that part. I had the most miserable experience with my ATI on-board RAID, a proprietary chipset which worked out of the box only with SuSE and failed drives left, right and center. Even if you’re lucky enough to have a decent Linux-supported controller, you will still have a hard time finding a decent interface for the firmware, the options will be lacking at best, and you will not find RAID 5 and 6 options on motherboards or low-end cards. You will also have no cheap way to recover data from a failed controller, short of buying the same hardware again. Proprietary software cards are not even worth mentioning. Since the CPU penalty for software RAID is fairly low on modern chips, and all Linux distributions support mdadm out of the box, that’s what I recommend.

Why Slackware? I’ll be using Slackware 13 because that’s what I like, and because the Slackware install CD gives the most partitioning freedom (read: a console with all the console tools) before the install. But this will work on anything Linux. Here it goes:

1.  What you need:

Get a Slackware CD/DVD ready, or any Linux installation CD from which you can access a console before the installation starts. Live CDs are great! Back up your data. No, really, do it now. We also assume you have at least 2 drives installed. They do not have to be identical in size, but the partition layout will be limited to what is possible on the smallest of the drives.

2. BIOS:

Reboot with the Linux installation CD in the drive. In your BIOS, make sure the RAID controller is off (and that you can boot from CD/DVD).

3. Partition Drives:

Boot the CD/DVD and get to a console. Slackware just takes you to one on its own; from a Debian DVD you can access an alternative console with Alt+F2, etc.; and on a LiveCD there’s probably a terminal program. Pull one up. Log in as root, or do everything with sudo. Here comes the destructive part. Identify the drives you’ll use with:

# fdisk -l

This example uses /dev/sda and /dev/sdb. We also assume you have no active RAID arrays on the drives you are about to manipulate. If you do, stop them:

# mdadm --manage --stop /dev/md0

Create partition tables on the drives:

# cfdisk /dev/sda

Delete all partitions. Create a swap type partition and a Linux RAID Autodetect type partition (the first of two RAID partitions we will eventually create on each drive). Write and exit. Repeat on the other drive(s) with the same sizes for both partitions. Note that you do not have to use the entire space right away. In this case, we will set up a root filesystem on a mirrored partition, for redundancy and for ease with bootloaders, as most guides recommend. Only later will we attach a striped (for size and speed) home partition.
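Instead of repeating the cfdisk steps by hand on every drive, one shortcut (the same trick used later when replacing a failed drive) is to clone the partition table from the first drive. This assumes /dev/sdb should end up with exactly the same layout as /dev/sda and that anything on it may be overwritten:

# sfdisk -d /dev/sda | sfdisk /dev/sdb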

4. RAID the drives:

mdadm is our weapon of choice. It’s mighty but simple. Here’s how to create a RAID1 (mirrored) device /dev/md0, using /dev/sda2 and /dev/sdb2 (assuming /dev/sda1 and /dev/sdb1 were used for swap):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

Now, RAID1 arrays take some time to rebuild (sync the mirror), depending on your partition size. I’ve seen 20GB partitions rebuilt in 15 minutes and 500GB partitions go for almost 2 hours over two 7,200 RPM SATA drives. You can tell what the status is by glancing at /proc/mdstat:

# cat /proc/mdstat
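If you would rather not keep retyping that, watch (part of procps, present on most install media) will refresh it for you every few seconds:

# watch -n 5 cat /proc/mdstat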

Many corners of the internet, including this one in the past, suggested you had to wait for an mdadm array to finish rebuilding before using it. However logical that sounds, in reality “the reconstruction process is transparent, so you can actually use the device even though the mirror is currently under reconstruction”, and we can move right along regardless of the type of RAID we chose:

5. Start Linux install and choose /dev/md0 as your root partition. Install the OS.

6. Setting up a RAID home partition, or any other partition, is not much more complicated. We’ll use RAID 0 for home, because of the volume it provides as well as the speed. We’ll be using /dev/sda3 and /dev/sdb3 for a striped RAID. So head over to the terminal:

# cfdisk /dev/sda

Create the new partition as you wish, set its type to Linux RAID Autodetect, and repeat on /dev/sdb so that /dev/sda3 and /dev/sdb3 exist. Make sure the partition sizes match.

7. Set up RAID:

This is a simple striped setup across 2 partitions:

# mdadm --create /dev/md1 --level=0 --chunk=256 --raid-devices=2 /dev/sda3 /dev/sdb3

Here, we had to specify the RAID 0 chunk, which is the stripe size, in addition to the options we used with RAID 1. The optimal chunk size depends entirely on your primary usage.
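You can always verify what chunk size (and geometry in general) an array ended up with:

# mdadm --detail /dev/md1 | grep -i chunk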

8. Setup a file-system:

You can use any filesystem and just skip this section, but XFS has special tweaks for RAID, and it’s worth taking advantage of them for performance. XFS allows specifying the RAID dimensions of the partition to the filesystem, which takes them into consideration when laying out reads and writes to match the array. Two parameters are used with XFS at creation and mount time: sunit, which is the size of each chunk in 512-byte blocks, and swidth, which is sunit * number-of-drives (…for RAID 0 and 1; that minus 1 for RAID 5; or that minus 2 for RAID 6). More about RAID and XFS can be found here. To create a matching XFS filesystem:

# mkfs.xfs -d sunit=512,swidth=1024 /dev/md1

9. Move home to its new home.

To quickly move the contents of the old /home directory to the new RAID partition, simply rename the old home, create a new home, mount, and copy the stuff over. We’ll put an entry in fstab to mount the filesystem properly, and with no access-time logging, to get the performance boost. All of this must be done as root with all other users logged out. (If home was already on a separate partition, you must unmount it and remount it somewhere else rather than moving it):

# mv /home /home.old
# mkdir /home
# echo "/dev/md1     /home     xfs     noatime,sunit=512,swidth=1024    0    2" >> /etc/fstab
# mount /home
# cp -aR --preserve=all /home.old/* /home
# rm -rf /home.old

10. That’s it for the setup. Now, let’s give our new RAID array a real test drive.

We can check the status of all our arrays with:

# cat /proc/mdstat

We can monitor RAID1 arrays (but not RAID0) with:

# mdadm --monitor --oneshot /dev/md0
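For something more hands-off than a one-shot check, mdadm can also run as a monitoring daemon and mail you when an array degrades. A sketch, assuming local mail delivery to root actually works on your machine:

# mdadm --monitor --scan --daemonise --mail=root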

But the most rewarding bit will be performing some speed tests with hdparm. Let’s check the read speed of a single drive:

# hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads:  366 MB in  3.01 seconds = 121.45 MB/sec

Compare this to the speed of our RAID 0 array:

# hdparm -t /dev/md1
/dev/md1:
 Timing buffered disk reads:  622 MB in  3.01 seconds = 206.71 MB/sec

Yup, that’s right folks – the read speed on a 2-drive RAID 0 array is nearly twice as high. That being expected, it is by no means less satisfying 😉

Bonus: Recovering from a failed drive in a RAID1 array. This will be handy for your root partition. Needless to say, you will not be able to recover anything from your RAID0, because it has zero redundancy. With RAID1, however, the machine just keeps humming along after a drive gives up. How will you know you have a failed drive, then? If the drive failed partially (repeatedly failing on some seeks but not all), you will notice your performance degrade. You can test for performance degradation even before it becomes severe, with hdparm as explained above. If the drive failed totally, you might not even notice. It’s good to occasionally peek at /proc/mdstat to see that the array is up. In that case, the fix is easy – just pop in a new drive while the system is off. However, if you have a partially failed drive in a RAID1 array and you do not wish to wait for a reboot (very reasonable on a server that keeps working, if you could just avoid the horrible seek delays of the failing drive), you can drop it yourself with 2 commands.
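A quick way to see whether an array is degraded, and which member (if any) has been kicked out, assuming the same /dev/md0:

# mdadm --detail /dev/md0
# cat /proc/mdstat     # a failed member shows up with an (F) next to it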

My /dev/md0 is a RAID1 array made of /dev/sda2 and /dev/sdb2. In my case, it was easy to see the drive access light throwing fits, and the desktop freezing occasionally, indicating a problem with the drives. A quick run of hdparm revealed that /dev/sdb was the failing drive, as it showed much slower reads. It made the filesystem on /dev/sdb2 barely accessible, which slowed my RAID1 array during writes (reads were still fast because they could be completed from the good drive alone, but writes needed to happen on both drives). So as soon as I got my desktop back from an occasional freeze, I fired up a terminal, marked the drive as failed in the array, and removed it from the array:

# mdadm --manage /dev/md0 --fail /dev/sdb2
# mdadm --manage /dev/md0 --remove /dev/sdb2

Past that point, it’s just a matter of powering off and replacing the drive at your earliest convenience. Once you’ve got a new drive, pop it in, boot the system up, clone the partition table, and add the new partition to the array:

# sfdisk -d /dev/sda | sfdisk /dev/sdb
# mdadm --manage /dev/md0 --add /dev/sdb2

… watch the array rebuild itself by looking at /proc/mdstat, and you’re done. Phew. 🙂

I hope the minimal amount of code and steps demonstrates how easy it is for anyone with 2 hard drives to enjoy the benefits of RAID, making your Linux desktop even faster or safer without investing any significant amount of time or money.
