It happens: you just lost a disk in your RAID5 MD array, or things don’t look the way they should… How do we troubleshoot this?
First things first: what’s the name of your MD device? You can find out by issuing:
cat /proc/mdstat
This should output something similar to:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[1] sda[3] sdb[2]
2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/4] [.AAA]
bitmap: 0/8 pages [0KB], 65536KB chunk
Here we have an MD device, /dev/md0. (If you don’t get any output like this, you might have lost your MD device, which could be a bigger issue!)
Another thing that we see (or rather don’t see) here is that sda, sdb and sdd are in the array, but sdc is nowhere to be found! This is our problem.
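If you don't want to eyeball the member list, here is a small sketch that does the comparison for you. The function names md_members and missing_members are my own (hypothetical) helpers, not part of mdadm, and the sketch assumes members appear in /proc/mdstat as name[index], as in the output above.

```shell
# List the member devices of one array from mdstat-formatted text on stdin.
# $1: array name as shown in /proc/mdstat (e.g. md0)
md_members() {
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\[[0-9]+\]/) {      # only fields like sdd[1] or sdc[4](F)
                dev = $i
                sub(/\[.*/, "", dev)      # drop the "[index]" suffix and any flags
                print dev
            }
        }
    }' | sort
}

# Print every expected disk that is NOT listed as a member of the array.
# $1: array name, remaining arguments: the disks you expect to be there
missing_members() {
    md="$1"
    shift
    present=$(md_members "$md")
    for d in "$@"; do
        printf '%s\n' "$present" | grep -qx "$d" || printf '%s\n' "$d"
    done
}

# Usage on a live system:
#   missing_members md0 sda sdb sdc sdd < /proc/mdstat
```

With the mdstat output shown above, this would report sdc as the missing member.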
For some reason /dev/sdc is not in the RAID group anymore. Let’s see what’s going on with /dev/sdc:
mdadm --examine /dev/sdc
In my case this hung for a long time. When I ran dmesg on another console, I saw a lot of I/O errors for this disk, which told me that the disk was malfunctioning.
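To get a quick number instead of scrolling through dmesg, something like the sketch below can count the error lines for one disk. The helper name count_io_errors is hypothetical, and the pattern "I/O error, dev sdc," is an assumption on my part: the exact kernel wording differs between kernel versions, so adjust it to whatever your dmesg actually prints.

```shell
# Count kernel log lines that report an I/O error for one disk.
# $1: short device name (e.g. sdc); the kernel log is read from stdin.
# NOTE: the matched string is an assumption, check it against your own dmesg.
count_io_errors() {
    grep -Fc "I/O error, dev $1," || true   # grep exits 1 on zero matches; still prints 0
}

# Usage on a live system (reading the kernel log may need root):
#   dmesg | count_io_errors sdc
```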
I shut down the server and wiggled the disk. After a reboot it was back online. My array now has four disks again, but only three of them are “functioning”, since after the reboot MD kicked /dev/sdc out of the array.
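One way to see how far a kicked-out disk has fallen behind is to compare the Events counter that mdadm --examine prints for each member. As far as I understand it, a disk with a lower count than the others has stale metadata. The sketch below just pulls those numbers out; examine_events is a hypothetical helper name, and it assumes the usual --examine layout where each device starts with a "/dev/...:" line followed by an "Events : N" line.

```shell
# Print "device events" pairs from `mdadm --examine` output on stdin.
examine_events() {
    awk '/^\/dev\// { dev = $1; sub(/:$/, "", dev) }        # remember the current device
         $1 == "Events" && $2 == ":" { print dev, $3 }'     # emit its event counter
}

# Usage on a live system (needs root):
#   mdadm --examine /dev/sd[abcd] | examine_events
```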
We need to reassemble the array and let RAID5 do its magic. First, stop the MD device /dev/md0:
mdadm --stop /dev/md0
Then we need to add /dev/sdc back into the array:
mdadm /dev/md0 -a /dev/sdc
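Then, depending on the situation, we might need to reassemble the array. Keep in mind that adding a disk with -a works on a running array, so if the add above is refused because /dev/md0 is stopped, run the assemble first and then repeat the add: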
mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force
Hopefully /dev/sdc is back in your array now. This should start a long(er) process to sync the array state across all disks, and hopefully you now have your array back!
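While the rebuild runs, /proc/mdstat shows a progress line such as "recovery = 12.6% (...) finish=...". Here is a rough sketch that extracts just that part so you can poll it. The helper name sync_status is hypothetical, and it relies on grep -o, which GNU and BSD grep both support.

```shell
# Print the current sync progress (e.g. "recovery = 12.6%"), or "idle" if none.
# $1: path to an mdstat-format file (defaults to /proc/mdstat)
sync_status() {
    grep -Eo '(recovery|resync|reshape|check) *= *[0-9.]+%' "${1:-/proc/mdstat}" \
        || echo "idle"
}

# Usage on a live system: poll once a minute until the rebuild is done.
#   while [ "$(sync_status)" != "idle" ]; do sync_status; sleep 60; done
```

If you would rather just block until the rebuild finishes, mdadm --wait /dev/md0 should do that as well.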
After the sync completes, I would still run fsck on the /dev/md0 filesystem:
fsck.ext4 /dev/md0
e2fsck 1.45.5 (07-Jan-2020)
data: recovering journal
JBD2: Invalid checksum recovering block 185073680 in log
JBD2: Invalid checksum recovering block 89 in log
Journal checksum error found in data
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (704006059, counted=696320594).
Fix? yes
Free inodes count wrong (182701042, counted=182694547).
Fix? yes
data: FILE SYSTEM WAS MODIFIED
data: 429421/183123968 files (0.2% non-contiguous), 36152110/732472704 blocks
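One caution before running fsck like this: the filesystem should not be mounted while it is being repaired. Below is a minimal sketch of a pre-check. The helper name is_mounted is hypothetical, and it simply looks for the device in the first column of /proc/mounts, so it will not notice a filesystem mounted under a different device name (a UUID or LVM path, for example).

```shell
# Succeed (exit 0) if the device appears in the mounts table, fail otherwise.
# $1: device path (e.g. /dev/md0); $2: mounts table (defaults to /proc/mounts)
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage on a live system: do a read-only dry run first (-n answers "no" to everything).
#   if is_mounted /dev/md0; then
#       echo "/dev/md0 is mounted, unmount it first"
#   else
#       fsck.ext4 -n /dev/md0
#   fi
```

The -n run only reports problems, so you can look at the output before letting fsck.ext4 actually fix anything.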
You can use these same steps (or similar ones) to remove /dev/sdc and replace it with a brand new hard drive. In my case wiggling solved the problem for now. (I will probably need a new drive in the near future.)
I hope this helped someone. It surely will help me when I have to do this again 😛
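For the full replacement case, the usual mdadm sequence is to mark the old disk as failed, remove it, and add the new one. The sketch below only prints the commands so you can review them before running anything. replacement_plan is a hypothetical helper, and /dev/sde is a made-up name for the new drive (after a physical swap the new disk may well show up under the old name). If MD has already kicked the old disk out, the --fail and --remove steps may be unnecessary or may just report an error.

```shell
# Print (do not run) the mdadm commands to swap one array member for another.
# $1: array device, $2: old member, $3: new member
replacement_plan() {
    md="$1"
    old="$2"
    new="$3"
    printf 'mdadm %s --fail %s\n' "$md" "$old"     # mark the old disk as faulty
    printf 'mdadm %s --remove %s\n' "$md" "$old"   # detach it from the array
    printf 'mdadm %s --add %s\n' "$md" "$new"      # add the new disk; the rebuild starts
}

# Usage: review the plan, then run the lines by hand as root.
#   replacement_plan /dev/md0 /dev/sdc /dev/sde
```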