So you lost a disk on an MD RAID5 array, now what?

It happens: you just lost a disk on your RAID5 MD array, or things don’t look the way they should… How do we troubleshoot this?

First things first: what’s the name of your MD device? You can find out by issuing:

cat /proc/mdstat

This should output something similar to:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[1] sda[3] sdb[2]
2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/4] [.AAA]
bitmap: 0/8 pages [0KB], 65536KB chunk

Here we have an MD device, /dev/md0. (If no md device is listed here at all, you might have lost your MD device, which could be a bigger issue!)

The other thing we see (or rather don’t see) here is that sda, sdb and sdd are in the array but sdc is nowhere to be found! This is our problem.

For some reason /dev/sdc is not in the RAID group anymore. Let’s see what’s going on with /dev/sdc:
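To make the “who is missing?” check less dependent on eyeballing, here is a minimal sketch that lists the members of one array from mdstat-style text. The name md_members is my own hypothetical helper, not something mdadm ships, and it assumes member fields look like sdd[1] as in the output above. It only reads the member list and ignores the status field at the end of the next line. Faulty or spare members, if the kernel still lists them, would be printed too.

```shell
# List the member devices of one MD array, one per line.
# Usage: md_members md0 [file]   (file defaults to /proc/mdstat)
# md_members is a hypothetical helper name, not an mdadm command.
md_members() {
    awk -v dev="$1" '
        $1 == dev && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\[[0-9]+\]/) {    # member fields look like sdd[1]
                    sub(/\[.*/, "", $i)     # drop the [slot] suffix and any flags
                    print $i
                }
        }' "${2:-/proc/mdstat}"
}
```

Run against the output above, md_members md0 prints sdd, sda and sdb, so the missing sdc stands out immediately.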

mdadm --examine /dev/sdc

In my case this command hung for a long time. When I ran dmesg on another console, I saw a lot of I/O errors for this disk, which told me the disk was malfunctioning.
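If the dmesg output is long, a rough count of the I/O errors for one disk is easier to read than the raw log. This is only a sketch: count_io_errors is a hypothetical helper, and it assumes the kernel words these messages with “I/O error” and “dev sdc” on the same line, which can differ between kernel versions.

```shell
# Count kernel log lines on stdin that report an I/O error for one disk.
# Usage (may need root): dmesg | count_io_errors sdc
# Assumption: messages contain both "I/O error" and "dev <disk>".
count_io_errors() {
    grep -i 'I/O error' | grep -c "dev $1"
}
```

A count that keeps growing while the disk is idle is a good hint that the drive, the cable or the slot is the problem.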

I shut down the server and wiggled (reseated) the disk. After a reboot the disk was back online. The system now sees all four disks, but only three of them are “functioning” in the array, because after the reboot MD kicked /dev/sdc out.
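MD leaves a returning disk out when its superblock is older than the others, and the event counter that mdadm --examine prints is one way to see that. The sketch below pulls that counter out of the examine output. md_events is a hypothetical helper name, and it assumes the output has a line of the form “Events : <number>”.

```shell
# Print the event counter from `mdadm --examine` output read on stdin.
# Assumption: the output contains a line like "Events : 4211".
md_events() {
    awk '$1 == "Events" && $2 == ":" { print $3 }'
}

# Usage sketch (run as root; shown as a comment, not executed here):
#   for d in /dev/sd[abcd]; do
#       printf '%s ' "$d"; mdadm --examine "$d" | md_events
#   done
```

A disk whose counter is behind the other three is the stale one.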

We need to reassemble the array and let RAID5 do its magic. First stop the MD device /dev/md0. (Unmount any filesystem on it first, otherwise mdadm will refuse to stop it.)

mdadm --stop /dev/md0

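Then we need to add /dev/sdc back into the array. mdadm can only add a disk to a running array, so if this command complains that /dev/md0 is not active, run the assemble command further down first and then come back to this one: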

mdadm /dev/md0 -a /dev/sdc

Then depending on the situation we might need to reassemble the array:

mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force

Hopefully /dev/sdc is now back in your array. This starts a long(er) process that rebuilds the array state onto the returning disk. You can follow it with cat /proc/mdstat, and once it finishes you should have your full array back!

After the sync completes, I would still run fsck on the /dev/md0 filesystem, while it is unmounted:
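While the rebuild runs, /proc/mdstat gains a progress line for the array. As a small sketch, the helper below prints just the operation and the percentage from that line. md_sync_progress is a hypothetical name, and it assumes the kernel prints the progress as “recovery = 1.2%” or “resync = 1.2%” with spaces around the equals sign.

```shell
# Print the rebuild operation and its percentage, e.g. "recovery 1.2%".
# Usage: md_sync_progress [file]   (file defaults to /proc/mdstat)
# Prints nothing when no rebuild is in progress.
md_sync_progress() {
    awk '{
        for (i = 1; i < NF; i++)
            if (($i == "recovery" || $i == "resync") && $(i + 1) == "=")
                print $i, $(i + 2)
    }' "${1:-/proc/mdstat}"
}
```

Combined with something like watch, this gives a one-line view of how far along the array is.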

fsck.ext4 /dev/md0

e2fsck 1.45.5 (07-Jan-2020)
data: recovering journal
JBD2: Invalid checksum recovering block 185073680 in log
JBD2: Invalid checksum recovering block 89 in log
Journal checksum error found in data
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (704006059, counted=696320594).
Fix? yes
Free inodes count wrong (182701042, counted=182694547).
Fix? yes

data: FILE SYSTEM WAS MODIFIED
data: 429421/183123968 files (0.2% non-contiguous), 36152110/732472704 blocks
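fsck also reports its result through the exit status, which is a sum of flag values documented in the fsck man page. If you script the check, a helper like this hypothetical describe_fsck_status can turn the number into words. It is a sketch of how to read the status, not something fsck provides.

```shell
# Translate an fsck exit status (a sum of flag values) into text.
# Usage: fsck.ext4 /dev/md0; describe_fsck_status $?
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    if [ $((status & 1)) -ne 0 ]; then echo "filesystem errors corrected"; fi
    if [ $((status & 2)) -ne 0 ]; then echo "system should be rebooted"; fi
    if [ $((status & 4)) -ne 0 ]; then echo "filesystem errors left uncorrected"; fi
    if [ $((status & 8)) -ne 0 ]; then echo "operational error"; fi
    if [ $((status & 16)) -ne 0 ]; then echo "usage or syntax error"; fi
    if [ $((status & 32)) -ne 0 ]; then echo "checking canceled by user request"; fi
    if [ $((status & 128)) -ne 0 ]; then echo "shared-library error"; fi
    return 0
}
```

Anything that includes “left uncorrected” means the filesystem still needs attention before you mount it again.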

You can use these same steps (or similar ones) to remove /dev/sdc and replace it with a brand new hard drive. In my case wiggling solved the problem for now. (I will probably need a new drive in the near future.)

I hope this helped someone. It will surely help me when I have to do this again 😛
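For the replacement case, the usual mdadm sequence is to mark the old disk as failed, remove it, swap the hardware and add the new disk. The sketch below only prints those commands so you can review them before running anything. replacement_plan is a hypothetical helper, and /dev/sde in the usage line is a made-up name for the new drive; yours may well show up under the old name.

```shell
# Print (do not run) the mdadm commands for swapping a member disk.
# Usage: replacement_plan /dev/md0 /dev/sdc /dev/sde
# /dev/sde is a placeholder for whatever name the new drive gets.
replacement_plan() {
    array=$1
    old=$2
    new=$3
    echo "mdadm $array --fail $old"
    echo "mdadm $array --remove $old"
    echo "# swap the physical drive, then:"
    echo "mdadm $array --add $new"
}
```

If the old disk has already been kicked out of the array, as in my case, the --fail and --remove steps may report that the device is not there, and only the --add step matters.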