For the sake of this post, let's assume that you have a RAID 1 array set up on Linux.
First, identify whether your RAID is healthy. If it is, this is what it looks like:
[root@node ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda2[0] sdb2[1]
3650885632 blocks super 1.2 [2/2] [UU]
bitmap: 0/28 pages [0KB], 65536KB chunk
md127 : active raid1 sda1[0] sdb1[1]
255868928 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk
unused devices: <none>
If the RAID device is NOT healthy, the output looks like this:
[root@node ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda2[1]
3650885632 blocks super 1.2 [2/1] [_U]
bitmap: 4/28 pages [16KB], 65536KB chunk
md127 : active raid1 sda1[1]
255868928 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk
unused devices: <none>
On both RAID devices, only block device sda is still active; sdb has disappeared, which the [2/1] [_U] status also reflects (one of the two mirror members is missing). Let's check whether sdb is indeed dead.
[root@node ~]# smartctl -i /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HP
Product: SOMEPRODUCTID
Revision: HPD5
Logical block provisioning type unreported, LBPME=-1, LBPRZ=0
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500846bacf7
Serial number: SOMESERIAL
Device type: disk
Transport protocol: SAS
Local Time is: Wed Feb 10 10:25:11 2016 CST
device is NOT READY (e.g. spun down, busy)
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
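Before removing anything, it can also help to confirm how md itself sees the failed member. mdadm --detail lists each member device and its state (active sync, faulty, or removed) at the bottom of its output:
[root@node ~]# mdadm --detail /dev/md126
[root@node ~]# mdadm --detail /dev/md127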
Let's proceed to remove this device from the RAID arrays, if it has not already been removed automatically.
[root@node ~]# mdadm --manage /dev/md126 --fail /dev/sdb2
[root@node ~]# mdadm --manage /dev/md126 --remove /dev/sdb2
[root@node ~]# mdadm --manage /dev/md127 --fail /dev/sdb1
[root@node ~]# mdadm --manage /dev/md127 --remove /dev/sdb1
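A quick re-check should now show sdb gone from both arrays:
[root@node ~]# cat /proc/mdstat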
Now, go to the server, physically remove the old, dead drive, and replace it with a healthy new one. You can physically identify the dead drive (and its serial number) from the smartctl -i output above.
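If the serial number on the drive tray is hard to read, /dev/disk/by-id maps serial numbers to device nodes for drives that still register with the kernel, and, if your backplane supports it, the ledctl tool can blink the bay LED. (This assumes the ledmon package is available for your distribution.)
[root@node ~]# ls -l /dev/disk/by-id/ | grep -v part
[root@node ~]# ledctl locate=/dev/sdb   # assumes ledmon is installed and the enclosure supports LED control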
Once the server is booted back up, set up the new drive. To do so, we need a package called gdisk: install it on the server and clone the GPT partition table from the surviving drive.
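On an EL7 system like the one in this example, installing it is a one-liner:
[root@c125 ~]# yum install -y gdisk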
[root@c125 ~]# sgdisk -R /dev/sdb /dev/sda
The operation has completed successfully.
[root@c125 ~]# sgdisk -G /dev/sdb
The operation has completed successfully.
The first command replicates the partition table from the good RAID drive onto the newly added drive. The second command randomizes the GPT GUIDs (disk and partition) on the second drive, so that it is not an exact clone of the first.
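To verify the copy, sgdisk can print both partition tables: the partition sizes and type codes should match, while the GUIDs should now differ.
[root@c125 ~]# sgdisk -p /dev/sda
[root@c125 ~]# sgdisk -p /dev/sdb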
Add the newly readied drive into the RAID.
[root@c125 ~]# mdadm --manage /dev/md126 --add /dev/sdb2
mdadm: added /dev/sdb2
[root@c125 ~]# mdadm --manage /dev/md127 --add /dev/sdb1
mdadm: added /dev/sdb1
Now wait until the arrays are synced; you can monitor the rebuild as shown below, and then release the server back into the wild. Well, not literally.
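You can watch the recovery progress in /proc/mdstat, or have mdadm block until both arrays finish resyncing:
[root@c125 ~]# watch cat /proc/mdstat
[root@c125 ~]# mdadm --wait /dev/md126 /dev/md127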