
September 17, 2011

Struggling with Advanced Format during a LVM to RAID migration

Recently I decided to invest in another hard disk for my Atom system. That system, which I built almost two years ago, has become the central system in my home network: it serves as a file server hosting my personal data, some git repositories etc., as a streaming server and, since I switched to a cable internet connection, as a router/firewall. Originally I bought the disk to back up data from the other systems in the network, but I realized that all data on this system was hosted on a single 320GB 2.5" disk, and it became clear to me that, in the absence of a proper backup strategy, I should at least provide some redundancy.

So I decided that, once the disk was in place, the whole system should move to a RAID1 across the two disks. Basically this is not as hard as it may seem at first glance, but I had some problems due to a new sector size in some recent hard disks, which is called Advanced Format.

But let's start from the beginning. The basic idea of such a migration is:

  1. Install mdadm with apt-get. Make sure to answer 'all' to the question which devices need to be activated in order to boot the system.

  2. Partition the new disk (almost) identically. Because the new drive is somewhat bigger, a fully identical layout wouldn't make sense, but at least the two partitions which should be mirrored on the second disk need to be identical. Usually this is achieved easily by using
    sfdisk -d /dev/sda | sfdisk /dev/sdb
    In this case, it wasn't that easy. But I will come to that in a minute.

  3. Change the type of the partitions to 'FD' (Linux RAID autodetect) with fdisk

  4. Erase evidence of a possible old RAID from the partitions, which is probably pointless on a brand-new disk, but we want to be sure:
    mdadm --zero-superblock /dev/sdb1
    mdadm --zero-superblock /dev/sdb2
  5. Create two DEGRADED raid1 arrays from the partitions:
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb2 missing
  6. Create filesystem on the first raid device, which will become /boot.

  7. Mount that filesystem somewhere temporary and move the contents of /boot to it:
    mount /dev/md0 /mnt/somewhere
  8. Unmount /boot, edit fstab to mount /boot from /dev/md0 and re-mount /boot (from md0)

  9. Create mdadm configuration with mdadm and append it to /etc/mdadm/mdadm.conf:
    mdadm --examine --scan >> /etc/mdadm/mdadm.conf
  10. Update the initramfs and grub (no manual modification needed with grub2 on my system) and install grub into the MBR of the second disk.
    update-initramfs -u
    update-grub
    grub-install /dev/sdb
  11. The first point to pray: Reboot the system to verify it can boot from the new /boot.

  12. Create a physical volume on /dev/md1:
    pvcreate /dev/md1
  13. Extend the volume group to contain that device (VGNAME stands for the name of your volume group):
    vgextend VGNAME /dev/md1
  14. Move all data of the volume group physically from the first disk to the degraded RAID:
    pvmove /dev/sda2 /dev/md1
    (Wait for it to complete... takes some time ;)

  15. Remove the first disk from the VG:
    vgreduce VGNAME /dev/sda2
  16. Prepare it for addition to the RAID (see steps 3 and 4) and add it:
    mdadm --add /dev/md0 /dev/sda1
    mdadm --add /dev/md1 /dev/sda2
  17. Hooray! Look at /proc/mdstat. You should see that the RAID is recovering.

  18. When recovery is finished, pray another time and hope that the system still boots while running entirely from the RAID. If it does: Finished :-)
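
Waiting for the rebuild in steps 17 and 18 can be scripted. This is only a minimal sketch: the function names md_recovering and wait_for_md are made up for illustration, and the pattern assumes the usual "recovery = ..." or "resync = ..." progress lines in /proc/mdstat.

```shell
# md_recovering [MDSTAT_FILE]
# Succeeds (exit 0) if any array listed in MDSTAT_FILE (default /proc/mdstat)
# shows a "recovery" or "resync" progress line, i.e. is still rebuilding.
md_recovering() {
    mdstat=${1:-/proc/mdstat}
    grep -Eq '(recovery|resync) *=' "$mdstat"
}

# wait_for_md [MDSTAT_FILE]
# Polls once a minute and returns as soon as no rebuild is in progress.
wait_for_md() {
    while md_recovering "$@"; do
        sleep 60
    done
}
```

Something like `wait_for_md && echo "rebuild finished"` could then be left running in a terminal before attempting the final reboot.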

Now to the problem with Advanced Format: hardware vendors are moving to a new sector size. Physically, my new device has 4096 bytes per sector, rather different from the 512 bytes per sector that disks have had for decades.

Logically it still has 512 bytes per sector. As far as I understand, this is achieved by placing 8 logical sectors into one physical sector, so when partitioning a new disk, the partitions have to be aligned so that each one starts at a sector number that is a multiple of 8.
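
The multiple-of-8 rule is easy to check in the shell. A minimal sketch, in which the function name is_aligned is made up for illustration and the defaults assume a disk with 4096-byte physical and 512-byte logical sectors:

```shell
# is_aligned START_SECTOR [PHYSICAL_BYTES] [LOGICAL_BYTES]
# Succeeds (exit 0) if START_SECTOR, counted in logical sectors, falls on a
# physical sector boundary. Defaults assume a 4096/512 Advanced Format disk.
is_aligned() {
    start=$1
    physical=${2:-4096}
    logical=${3:-512}
    ratio=$((physical / logical))   # 8 on a 4096/512 disk
    [ $((start % ratio)) -eq 0 ]
}
```

For example, `is_aligned 2048 && echo aligned` reports a match, while the traditional start sector 63 fails the check. On Linux the real sizes of a disk can be read from /sys/block/sdb/queue/physical_block_size and /sys/block/sdb/queue/logical_block_size and passed in as the second and third argument.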

That, obviously, wasn't the case with the old partitioning on my first disk. So I had to create the partitions manually, specifying the start points myself and making sure they are divisible by 8. Otherwise fdisk would complain about the layout on the disk. This does not work with cfdisk, because it does not accept manual alignment parameters and unfortunately the partitions it creates are misaligned. So good old fdisk, plus some calculations of how many sectors are needed and where to start, came to the rescue.
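
The calculation itself is just rounding up to the next multiple of 8. A minimal sketch, with align_up being a made-up helper name and 8 logical sectors per physical sector assumed as the default:

```shell
# align_up SECTOR [ALIGNMENT]
# Prints the smallest sector number >= SECTOR that is a multiple of ALIGNMENT
# (default 8, i.e. 4096-byte physical sectors over 512-byte logical ones).
align_up() {
    sector=$1
    alignment=${2:-8}
    echo $(( (sector + alignment - 1) / alignment * alignment ))
}
```

If the first partition ends at sector 291154, the next free sector is 291155, and `align_up 291155` prints 291160 as the first usable aligned start for the second partition.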

So the layout is now:

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048      291154      144553+  fd  Linux raid autodetect
/dev/sdb2          291160   625139334   312424087+  fd  Linux raid autodetect


Hello Patrick,

thank you for the article. The Advanced Format of newer drives is indeed causing huge problems with performance, and I appreciate your efforts to shed light on the issue.

Have you done performance testing since you aligned the partitions?

I am particularly wondering what happens when you stack layers on top of the drives, e.g. md, then LVM. Each of those needs to be aligned properly too. Have you looked into this?

Hi madduck,

no, I haven't done performance tests. Mostly because performance is not really a focus on this system and, secondly, because I wouldn't have comparison values for that system anyway.

With respect to alignment of md and LVM: I must admit I'm no expert in that area and didn't even know that these need "proper" alignment. Usually the defaults have worked well enough for me. Can you follow up on this in more detail?

Any idea why my (long) comment is not showing up?

Well, the idea is really the same on all layers. The reason why you want your partitions aligned is so that when you write a block at a higher level, at most one (group of) block(s) gets written on a lower level.

Advanced Format means that the drive exposes 512b blocks, but whenever you write one, it actually updates a 4k block, rewriting the other 7 512b blocks unchanged. This is nothing new, memory has done this forever (e.g. writing a boolean (1 bit) has pretty much always meant writing a whole word (16, 32, or nowadays often 64 bits)).
Compilers have grown really good at aligning memory writes, and one has to go to great lengths to trick them into doing something else (__unaligned etc.).

With disks, we are only now really starting to get into this issue, and it will get worse. The layer on top of the device usually works with larger blocks than 512b. For instance, a filesystem might have 4k blocks, while LVM might even use 4M blocks ("PE size").

Now imagine you have a file of size 33 KB, sitting on a standard ext4 filesystem. This file will occupy ceil(33/4)=9 filesystem blocks. If you write this file, the filesystem will successively write these blocks to the underlying device. With 512b blocks, there's hardly a problem, but if the disk works with 4k blocks, then imagine what happens if the first filesystem block starts at sector 7: to write the first filesystem block, the disk needs to write sectors 7–14, which means writing sector groups 0–7 and 8–15. To write the second filesystem block (sectors 15–22), the disk needs to write sector groups 8–15 and 16–23, and so on. The problem gets worse when the filesystem cannot allocate sequential blocks, because then no driver or device can be smart enough to consolidate the writes. In short: twice the amount of work is needed. And if this is not enough, consider LVM writing a 4M extent (for every filesystem block updated) to an unaligned RAID6, meaning that all four disks have to be written with data and parity data, …
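
The sector arithmetic of that example can be sketched in a few lines of shell. The function name physical_blocks_touched is made up for illustration, and the default of 8 logical sectors per physical block assumes a 4096/512 drive:

```shell
# physical_blocks_touched START_SECTOR SECTOR_COUNT [RATIO]
# Prints how many physical blocks a write of SECTOR_COUNT logical sectors
# beginning at START_SECTOR touches. RATIO is the number of logical sectors
# per physical block (default 8).
physical_blocks_touched() {
    start=$1
    count=$2
    ratio=${3:-8}
    first=$((start / ratio))                 # physical block of first sector
    last=$(( (start + count - 1) / ratio ))  # physical block of last sector
    echo $((last - first + 1))
}
```

A 4k filesystem block is 8 logical sectors, so `physical_blocks_touched 7 8` prints 2 for the misaligned case, while `physical_blocks_touched 8 8` prints 1 for an aligned one.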

Therein lies the importance of ensuring that the 4k filesystem block is exactly aligned with the 4k device block. Aligning partitions is the first step, but now you need to ensure that every layer plays nice with the one underneath. The problem here is the metadata. As far as I can tell, filesystems squash their metadata into blocks as well, so there should not be a problem once you've aligned the filesystems like you did, using the partition table.

However, layers like MD and LVM are more obscure. For instance, MD uses 32k superblocks, but then adds an additional 2 bytes for each device. In addition, the different metadata versions put the superblocks at different positions on the device (see md_superblock_formats.txt in the mdadm doc directory). And then there is LVM…

I have to find a solution to all of this (WD drives + GPT + MD + LVM + dmcrypt + ext4) within the next 10 days, so watch http://madduck.net/blog/…
