Tuning Ubuntu mdadm RAID5/6

If you are using mdadm RAID 5 or 6 on Ubuntu, you might notice that performance is not always stellar. The reason is that Ubuntu's default tuning values are rather modest. Luckily, these can easily be tuned. In this article I will increase some settings step by step until the read and write performance of my RAID 6 has improved considerably.

My setup:
CPU: Intel(R) Core(TM)2 Quad CPU Q9300
RAM: 16G
Drives: 11 drives in one RAID6, split over two cheap PCI-E x4 controllers and the motherboard's internal controller.

I will test my system between each tuning step by using dd for read and write testing. Since I have a nice amount of RAM available, I will use a test file of 36G (bs=16k) so the tests cannot be served from the cache. Between each test (both read and write), I clear the OS disk cache with the command:

sync;echo 3 > /proc/sys/vm/drop_caches
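For reference, a complete write and read test along those lines could look something like this (the mount point and file name are just examples, adjust them to wherever your RAID is mounted; adding conv=fdatasync makes dd include the final flush to disk in its timing):

# Write test: 2359296 blocks of 16k = 36G
dd if=/dev/zero of=/mnt/raid/ddtest bs=16k count=2359296 conv=fdatasync

# Clear the cache before reading the file back
sync;echo 3 > /proc/sys/vm/drop_caches

# Read test: read the file back and throw the data away
dd if=/mnt/raid/ddtest of=/dev/null bs=16k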

Tuning stripe_cache_size

stripe_cache_size controls how much RAM the md driver uses for caching stripes while writing data. Ubuntu's default value is 256; you can verify your current value with:

cat /sys/block/md0/md/stripe_cache_size

And changing it with:

echo *number* > /sys/block/md0/md/stripe_cache_size
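If you want to step through several values in one go, a small loop like the one below does the trick (again, the test file path is just an example):

for size in 256 512 1024 2048 4096 8192 16384 32768; do
  echo $size > /sys/block/md0/md/stripe_cache_size
  sync;echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/zero of=/mnt/raid/ddtest bs=16k count=2359296 conv=fdatasync
done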

Test results with stripe_cache_size=256
– Write performance: 174 MB/s

Not too good, so I increased it in several steps; each step and its result is listed below:

Test results with stripe_cache_size=512
– Write performance: 212 MB/s

Test results with stripe_cache_size=1024
– Write performance: 237 MB/s

Test results with stripe_cache_size=2048
– Write performance: 254 MB/s

Test results with stripe_cache_size=4096
– Write performance: 295 MB/s

Test results with stripe_cache_size=8192
– Write performance: 362 MB/s

Test results with stripe_cache_size=16384
– Write performance: 293 MB/s

Test results with stripe_cache_size=32768
– Write performance: 326 MB/s

So tuning stripe_cache_size roughly doubled my write performance, with the best result at 8192 (174 MB/s vs. 362 MB/s). Not bad! 🙂

Tuning Read Ahead

Time to change the read ahead a bit, which should impact read performance. The default read ahead value is 1536, and you can change it with the command:

blockdev --setra *number* /dev/md0
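You can check the current value with --getra before changing it, for example:

blockdev --getra /dev/md0
blockdev --setra 4096 /dev/md0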

Test results with Read Ahead @ 1536
– Read performance: 717 MB/s

Test results with Read Ahead @ 4096
– Read performance: 746 MB/s

Test results with Read Ahead @ 32768
– Read performance: 731 MB/s

Test results with Read Ahead @ 262144
– Read performance: 697 MB/s

Test results with Read Ahead @ 524288
– Read performance: 630 MB/s

So, opposite of the write performance tuning, most of these changes actually made things worse. 4096 turned out to be the best value for my system.

In conclusion

This is just an example of how different settings can have a rather large impact on a system, both for the better and for the worse. If you are going to tune your system, you have to test different settings yourself and see what works best for your setup. Higher values do not automatically mean better results. I ended up with “stripe_cache_size=8192” and “Read Ahead @ 4096” for my system.

If you want to make sure that your changes survive a reboot, remember to add these commands (with your values) to /etc/rc.local.
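With my values, the entries in /etc/rc.local would look like this (placed before the final exit 0 line):

echo 8192 > /sys/block/md0/md/stripe_cache_size
blockdev --setra 4096 /dev/md0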

howto: Create a RAID6 with mdadm

Note: This guide discusses RAID6, but the same procedure works for RAID5; the only difference is that RAID6 uses two disks for parity data while RAID5 only uses one.

1) Make sure that all the hard drives you are going to use are connected and available.

I will use the drives /dev/sdb to /dev/sdi in this example.

root@ubuntu:/home/thu# ls -l /dev/sd*
brw-rw---- 1 root disk 8,   0 2011-01-05 21:47 /dev/sda
brw-rw---- 1 root disk 8,   1 2011-01-05 21:47 /dev/sda1
brw-rw---- 1 root disk 8,   2 2011-01-05 21:47 /dev/sda2
brw-rw---- 1 root disk 8,   5 2011-01-05 21:47 /dev/sda5
brw-rw---- 1 root disk 8,  16 2011-01-05 21:47 /dev/sdb
brw-rw---- 1 root disk 8,  32 2011-01-05 21:47 /dev/sdc
brw-rw---- 1 root disk 8,  48 2011-01-05 21:47 /dev/sdd
brw-rw---- 1 root disk 8,  64 2011-01-05 21:47 /dev/sde
brw-rw---- 1 root disk 8,  80 2011-01-05 21:47 /dev/sdf
brw-rw---- 1 root disk 8,  96 2011-01-05 21:47 /dev/sdg
brw-rw---- 1 root disk 8, 112 2011-01-05 21:47 /dev/sdh
brw-rw---- 1 root disk 8, 128 2011-01-05 21:47 /dev/sdi

2) Ask mdadm to create the RAID

root@ubuntu:/home/thu# mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

mdadm: array /dev/md0 started.

Note --level and --raid-devices followed by a list of all the drives to use in the RAID.

3) Create a file system for the RAID device

root@ubuntu:/home/thu# mkfs.ext3 /dev/md0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=96 blocks
1966080 inodes, 7864224 blocks
393211 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
240 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

And you will now be able to mount the device and start using it 🙂
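For example, using a temporary mount point (pick whatever suits your setup):

mkdir -p /mnt/raid
mount /dev/md0 /mnt/raid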

4) Check that the RAID is being built in the background.

Note that the build will restart if you reboot your server before it has completed, so please do not reboot unless necessary 🙂

You can check the status of the RAID in two ways:

root@ubuntu:/home/thu# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
31456896 blocks level 6, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
[====>................]  resync = 20.0% (1050880/5242816) finish=4.5min speed=15392K/sec
root@ubuntu:/home/thu# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90
Creation Time : Wed Jan  5 22:03:27 2011
Raid Level : raid6
Array Size : 31456896 (30.00 GiB 32.21 GB)
Used Dev Size : 5242816 (5.00 GiB 5.37 GB)
Raid Devices : 8
Total Devices : 8
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Wed Jan  5 22:06:15 2011
State : active, resyncing
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0
Chunk Size : 64K
Rebuild Status : 54% complete
UUID : d09dd686:b5f57b0b:e368bf24:bd0fce41 (local to host ubuntu)
Events : 0.10
Number   Major   Minor   RaidDevice State
0       8       16        0      active sync   /dev/sdb
1       8       32        1      active sync   /dev/sdc
2       8       48        2      active sync   /dev/sdd
3       8       64        3      active sync   /dev/sde
4       8       80        4      active sync   /dev/sdf
5       8       96        5      active sync   /dev/sdg
6       8      112        6      active sync   /dev/sdh
7       8      128        7      active sync   /dev/sdi
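If you want to follow the progress continuously instead of re-running the commands, watch works nicely:

watch -n 10 cat /proc/mdstat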

Howto: Increase disk space in an mdadm RAID

I currently have an Ubuntu Linux server running two mdadm RAIDs. One of the RAID sets consists of 6 x 500 GB SATA drives. I have now purchased 6 x 1500 GB SATA drives that will replace the old disks, but the challenge is to grow the RAID and the filesystem without losing any data or having downtime. (Note: avoiding downtime is possible since I use a system that supports hot swapping of drives.)

In summary, this can be achieved by doing the following:
1) Replace all disks in the RAID (one by one)
2) Grow the RAID
3) Expand the filesystem

In this guide I will be working on /dev/md1

Now, let's get to work!

Part one: Replace the disks

PS: If your system does not support hot swap, you have to turn off/restart your machine for each disk you are replacing.

Remove a disk from the RAID, then insert a new (bigger) drive.
Check dmesg (or similar) to get the name of the newly inserted drive.

[14522870.380610] scsi 15:0:0:0: Direct-Access ATA WDC WD15EARS-00Z 80.0 PQ: 0 ANSI: 5
[14522870.381589] sd 15:0:0:0: [sdm] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB)
[14522870.381622] sd 15:0:0:0: [sdm] Write Protect is off
[14522870.381626] sd 15:0:0:0: [sdm] Mode Sense: 00 3a 00 00
[14522870.381673] sd 15:0:0:0: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[14522870.381845] sd 15:0:0:0: [sdm] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB)
[14522870.381870] sd 15:0:0:0: [sdm] Write Protect is off
[14522870.381875] sd 15:0:0:0: [sdm] Mode Sense: 00 3a 00 00
[14522870.381918] sd 15:0:0:0: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[14522870.381926] sdm: unknown partition table
[14522870.397752] sd 15:0:0:0: [sdm] Attached SCSI disk
[14522870.397878] sd 15:0:0:0: Attached scsi generic sg9 type 0

Now, tell mdadm to add your new drive to the RAID you removed a drive from by doing:

mdadm --manage /dev/md1 --add /dev/sdm

Mdadm will then start syncing data to your new drive. To get an ETA of when it's done (and when you can replace the next drive), check the mdadm status:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdl[2] sdi[4] sdf[3] sde[1] sdd[0]
5860553728 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]

md1 : active raid5 sdm[6] sdg[1] sdk[5] sdj[7](F) sdh[2] sdc[3] sda[0]
2441932480 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
[==>.................] recovery = 14.2% (69439012/488386496) finish=155.8min speed=44805K/sec

unused devices: <none>

So after around 155 minutes the drive is active (and the next one can be replaced).

Repeat this process for each disk in the RAID.

When you have replaced all the disks, run the command “mdadm --manage /dev/mdX --remove failed” to remove any devices listed as failed from the given RAID.
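For the RAID in this guide that would be:

mdadm --manage /dev/md1 --remove failed

After that, cat /proc/mdstat should show all slots as active again, i.e. [6/6] [UUUUUU] for a six-drive array.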

Part two: Increase the space available for the RAID

This is done by simply issuing the command:

mdadm --grow /dev/md1 --size=max

And the RAID size is increased. Note that this causes the RAID to start a resync (again):

~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdl[2] sdi[4] sdf[3] sde[1] sdd[0]
5860553728 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]

md1 : active raid5 sdc[0] sdj[3] sdh[5] sdg[2] sdn[1] sdm[4]
7325692480 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
[======>.............]  resync = 34.6% (508002752/1465138496) finish=247.0min speed=64561K/sec

PS: note that the resync speed has increased by around 20 MB/s now that all the drives have been replaced 🙂

You will now also notice that the RAID reports its new size:

~# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90
Creation Time : Sat Jun 13 01:55:27 2009
Raid Level : raid5
Array Size : 7325692480 (6986.32 GiB 7501.51 GB)
Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
Raid Devices : 6
Total Devices : 6
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Fri Mar  5 08:03:47 2010
State : active, resyncing
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Rebuild Status : 35% complete

UUID : ed415534:2925f54a:352a6ad4:582f9bd3 (local to host)
Events : 0.247

Number   Major   Minor   RaidDevice State
0       8       32        0      active sync   /dev/sdc
1       8      208        1      active sync   /dev/sdn
2       8       96        2      active sync   /dev/sdg
3       8      144        3      active sync   /dev/sdj
4       8      192        4      active sync   /dev/sdm
5       8      112        5      active sync   /dev/sdh

Part three: resize file system

Start off by unmounting the file system in question and perform a file system check to make sure everything is a-ok:

# umount /home/samba/raid1
# fsck /dev/md1
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
/dev/md1 has gone 188 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes

Be warned: the fsck CAN take quite some time to finish.

When it's complete, you are ready for the last step, which is to resize the filesystem:

# resize2fs /dev/md1 6986G
resize2fs 1.41.4 (27-Jan-2009)
Resizing the filesystem on /dev/md1 to 1831337984 (4k) blocks.

And voila! Mount up the filesystem again and you are finished! 🙂
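For example, with the mount point used earlier, plus df to confirm the new size:

mount /dev/md1 /home/samba/raid1
df -h /home/samba/raid1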