software raid

Last updated: 2019-10-03




The software RAID implementation on Linux is called md.

It manages block devices through the Linux kernel driver "md_mod".


Linux software RAID supports the following RAID levels:

RAID configurations

  • RAID 1 – Mirror
  • RAID 5 - parity distributed across all devices
  • RAID 10 – Take a number of RAID 1 mirrorsets and stripe across them RAID 0 style.


  • Linear
  • Multipath
  • Faulty
  • Container


Difference between MD (RAID) and multipath devices


  • RAID: /dev/md0                   <-- managed by the process md0_raid1
  • MULTIPATH: /dev/dm-0





apt-get install mdadm


update-initramfs: Generating /boot/initrd.img-2.6.32-5-686


  • /etc/cron.d/mdadm
  • /sbin/mdadm
  • /sbin/mdmon


  • /etc/mdadm/mdadm.conf
  • /etc/default/mdadm


mdadm -V

mdadm - v3.1.4 - 31st August 2010


Config File




  • /etc/mdadm.conf               # CentOS 7 (create it by hand if it does not exist)
  • /etc/mdadm/mdadm.conf   # Debian


# Where to look for RAID members
# DEVICE partitions               scan the devices listed in /proc/partitions
# DEVICE /dev/hda* /dev/hdc*       set by hand
DEVICE partitions

# Automatically create /dev/md* and set its permissions
# "auto=yes" to indicate how missing device entries should be created.
CREATE owner=root group=disk mode=0660 auto=yes

# Default value used whenever the --homehost option is needed.
# Used by auto-assemble and by create (stored in the metadata).
# <system> is equivalent to gethostname
HOMEHOST <system>

#### Mail Setting ####
MAILFROM root@server
# who is mailed when something goes wrong with an array (the MAILADDR keyword)

metadata type:

0.90           # common format (superblock: 4K, 64K aligned block)

1.x             # superblock that is normally 1K long, but can be longer (Default: 1.2)

superblock on different locations on the device

  • At the end (for 1.0),
  • At the start (for 1.1)
  • 4K from the start (for 1.2)

ddf              # Use the "Industry  Standard" DDF (Disk  Data Format) format defined by SNIA.

 # When creating a DDF array a CONTAINER will be created, and normal arrays can be created in that container.

imsm         # Use  the Intel(R) Matrix Storage Manager metadata format.


PROGRAM: a program to be run when mdadm --monitor detects potentially interesting events on any of the arrays that it is monitoring.




The ARRAY lines identify actual arrays.


ls -l /dev/md/500g

lrwxrwxrwx 1 root root 8 Jul 30 23:59 /dev/md/500g -> ../md127

Configuring an array

An array is assembled only when ALL of the given identities match:

  • uuid=           # 128 bit uuid in hexadecimal
  • name=         # name stored in the superblock
  • level=          # The value is a raid level
  • devices=      # listed there must also be listed on a DEVICE line

e.g. identified by devices and RAID level

ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1 level=1

e.g. identified by metadata version and UUID

If the name does not  start with a slash ('/'), it is treated as being in /dev/md/

ARRAY 500g metadata=1.2 UUID=938cab94:2fd92126:e1d6bfdf:cf91677e

# After changing the config file, reload the daemon

service mdadm reload


AUTO controls whether auto-assembly takes place. Its value is a list, evaluated as follows:


  1. first match wins
  2. plus sign: the auto assembly is  allowed
  3. minus sign: the auto assembly is disallowed
  4. no match:  the auto assembly is allowed

metadata types: 0.90, 1.x, ddf, imsm

# all is usually last
AUTO +1.x homehost -all

 * AUTO is consulted when auto-assembly is triggered by "mdadm -As"
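
The matching rules above can be sketched as a small shell function. This is only an illustration of "first match wins", not how mdadm is implemented: auto_allowed and its three arguments are hypothetical names, and it assumes that "1.x" stands for every 1.* metadata version and that "homehost" only matches arrays created for this host.

```shell
#!/bin/sh
# Hypothetical helper: decide whether auto-assembly would be allowed.
#   $1 = words of an AUTO line, e.g. "+1.x homehost -all"
#   $2 = metadata type of the array, e.g. 1.2
#   $3 = "yes" if the array was created for this host (default: no)
auto_allowed() {
    list=$1
    meta=$2
    local_host=${3:-no}
    for word in $list; do
        if [ "$word" = homehost ]; then
            # homehost only matches arrays that belong to this host
            if [ "$local_host" = yes ]; then
                echo allowed
                return 0
            fi
            continue
        fi
        name=${word#[+-]}          # strip the leading sign
        case $name in
            all) matched=yes ;;
            1.x) case $meta in 1.*) matched=yes ;; *) matched=no ;; esac ;;
            *)   if [ "$name" = "$meta" ]; then matched=yes; else matched=no; fi ;;
        esac
        if [ "$matched" = yes ]; then
            # first match wins: the sign of this word decides
            case $word in
                -*) echo disallowed ;;
                *)  echo allowed ;;
            esac
            return 0
        fi
    done
    echo allowed                   # no match at all: allowed
}

auto_allowed "+1.x homehost -all" 1.2        # prints: allowed
auto_allowed "+1.x homehost -all" 0.90       # prints: disallowed
auto_allowed "+1.x homehost -all" 0.90 yes   # prints: allowed
```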

Disable mdadm automatic setup ARRAY


AUTO -all


# <ignore>: matches the rest of the line will never be automatically  assembled.
ARRAY <ignore> UUID=?:?:?:?


# Prevent "touch /dev/sd{e,f}" from triggering RAID assembly

# Backup

mv /lib/udev/rules.d/64-md-raid-assembly.rules /root

udevadm control --reload-rules


Using the mdadm command



mdadm [mode] <raiddevice> [options] <component-devices>


  • (-A) Assemble: Assemble the components of a previously created array into an active array.
  • (-B) Build: Build an array that doesn't have per-device metadata(similar to --create)
  • (-C) Create: A 'resync' process is started to make sure that the array is consistent


  • (default) Manage: adding new spares and removing faulty devices.
  • (-F) Monitor:
  • (--auto-detect) Auto-detect, requests the Linux Kernel to activate any auto-detected arrays


# Scan the config file or /proc/mdstat for missing information, e.g. which md? and sd? devices to fill in

-s, --scan


Getting started: creating a RAID1


Prepare the hard disks:


fdisk /dev/sde

Set the partition's system id

Press t

Then enter fd    <-- Linux raid auto



parted /dev/sde

(parted) mklabel msdos

(parted) mkpart primary 0% 100%

(parted) set 1 raid on

(parted) print

Number  Start   End    Size   Type     File system  Flags
 1      33.6MB  500GB  500GB  primary               raid


# To be safe, check what is already on the partitions

blkid /dev/{sde,sdf}1

# Create the array

mdadm -C /dev/md/500g -l 1 -n 2 -N 500g /dev/sde1 /dev/sdf1

# Opts

-C, --create                  # Create a new array.

-n, --raid-devices         # number of active devices in the array

                                  #  Setting a value of 1 is probably  a mistake and so requires that --force be specified first.

-l, --level                     # RAID level of the new array (1 here)

-N, --name=                # name of the array, stored in the superblock

-n, --raid-devices=2:   # 2 active devices in the array


  • -x, --spare-devices=


cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sde1[0] sdf1[1]
      488202944 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  5.0% (24566016/488202944) finish=416.1min speed=18566K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk


# Keep watching the RAID rebuild status

while true; do
    awk '/resync/{print $0}' /proc/mdstat
    sleep 1
done
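
For scripting, the progress line can be reduced to just the array name, the action and the percentage. The following is a rough sketch: mdstat_progress is a made-up name, and it assumes the /proc/mdstat layout shown in the sample output above.

```shell
#!/bin/sh
# Sketch: print "<array> <action> <percent>" for each array that is busy.
# Reads /proc/mdstat-formatted text on stdin.
mdstat_progress() {
    awk '
        /^md[^ ]* :/ { array = $1 }       # remember the current array name
        /(resync|recovery|reshape|check) *= *[0-9.]+%/ {
            action = ""; pct = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(resync|recovery|reshape|check)$/) action = $i
                if ($i ~ /%$/) { pct = $i; sub(/^=/, "", pct) }
            }
            print array, action, pct
        }
    '
}

# Demo on the sample output from these notes
mdstat_progress <<'EOF'
md127 : active raid1 sde1[0] sdf1[1]
      488202944 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  5.0% (24566016/488202944) finish=416.1min speed=18566K/sec
EOF
# prints: md127 resync 5.0%
```

On a live system it would be fed with: mdstat_progress < /proc/mdstat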


mdadm --detail --scan >> /etc/mdadm/mdadm.conf


Starting and stopping md


Start md:

Method 1: automatic

mdadm -As

mdadm: /dev/md0 has been started with 2 drives.

-A, --assemble       # Assemble a pre-existing array.

-s, --scan              # Scan config file or /proc/mdstat for missing information.

Method 2: manual

# -A, --assemble              # Assemble a pre-existing array.

mdadm -A /dev/md0 /dev/usbdisk/WD4T-K3G5AD6B /dev/usbdisk/WD4T-NHGNXBVY

mdadm: /dev/md0 has been started with 2 drives.

Use UUID For Assemble:


-u, --uuid=      # uuid of array to assemble.  Devices which don't have this uuid are excluded

If a device's UUID does not match the one given when assembling, the following error is shown:

mdadm: /dev/usbdisk/WD4T-K3G5AD6B has wrong uuid.
mdadm: /dev/usbdisk/WD4T-NHGNXBVY has wrong uuid.


blkid /dev/sdd1

/dev/sdd1: UUID="dffe9762-x-x-x-x" UUID_SUB="569c0941-x-x-x-x" LABEL="home:V4T" TYPE="linux_raid_member"

mdadm -E /dev/sdd1

Array UUID (mdadm -E) = UUID (blkid) = dffe9762...

Device UUID (mdadm -E) = UUID_SUB (blkid) = 569c0941...

Use Name For Assemble

-N, --name=

This must be the name that was specified when  creating the  array.  

It must either match the name stored in the superblock exactly,

or it must match with the current homehost prefixed to the start of the given name.

Stop md:

-S, --stop              # deactivate array, releasing all resources


mdadm -S /dev/md0

mdadm: stopped /dev/md0

 * After stopping, both the entry in "/proc/mdstat" and /dev/md0 disappear




cat /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      2095415 blocks super 1.2 [2/2] [UU]

unused devices: <none>
# Devices that are not part of any array are listed here
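
The bracketed status string at the end of the "blocks" line has one character per member: "U" for a member that is up, "_" for one that is missing or failed. A minimal sketch that lists degraded arrays from that string follows. degraded_arrays is a hypothetical helper name, and the md1 entry in the demo is made-up sample data.

```shell
#!/bin/sh
# Sketch: print the name of every array whose status string contains "_".
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* :/        { array = $1 }      # remember the current array name
        /\[[U_]*_[U_]*\]/   { print array }     # e.g. [U_] or [_U]
    '
}

# Demo: md0 is healthy, md1 (invented for this example) has lost one member
degraded_arrays <<'EOF'
md0 : active raid1 sdc1[1] sdb1[0]
      2095415 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdd1[0]
      1048576 blocks super 1.2 [2/1] [U_]
EOF
# prints: md1
```

On a live system it would be fed with: degraded_arrays < /proc/mdstat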


Checking RAID status


Misc mode:

  • -Q, --query
  • -D, --detail                 # Print details of one or more md devices.
  • -E, --examine             # Print contents of the metadata stored on the named device


mdadm -Q /dev/sda1

/dev/sda1: is not an md array

mdadm -Q /dev/sdb1

/dev/sdb1: is not an md array
/dev/sdb1: device 0 in 2 device unknown raid1 array.  Use mdadm --examine for more detail.

mdadm -Q /dev/md0

/dev/md0: 2046.30MiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.

Detail & Examine(-D):

mdadm -D /dev/md0

        Version : 1.2
  Creation Time : Thu Feb 14 17:38:31 2013
     Raid Level : raid1
     Array Size : 2095415 (2046.65 MiB 2145.70 MB)
  Used Dev Size : 2095415 (2046.65 MiB 2145.70 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu Feb 14 17:52:03 2013
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 50% complete                       # the rebuild is much slower while the device is mounted

           Name : debian3:0  (local to host debian3)
           UUID : 938cab94:2fd92126:e1d6bfdf:cf91677e
         Events : 17

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1


# Print contents of the metadata stored on the named device(s).

mdadm -E /dev/sdb1

          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 938cab94:2fd92126:e1d6bfdf:cf91677e
           Name : debian3:0  (local to host debian3)
  Creation Time : Thu Feb 14 17:38:31 2013
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 4190854 (2046.66 MiB 2145.72 MB)
     Array Size : 4190830 (2046.65 MiB 2145.70 MB)
  Used Dev Size : 4190830 (2046.65 MiB 2145.70 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 833f1527:7789fd72:1c8feb2f:2f175385

    Update Time : Thu Feb 14 17:53:47 2013
       Checksum : 57c1b402 - correct
         Events : 19

   Device Role : Active device 0                         # may also be "Device Role : spare"
   Array State : AA ('A' == active, '.' == missing)

 --examine applies to devices which are components of an array, while

--detail applies to a whole array  which  is  currently active.




-o, --readonly

It works  with Create, Assemble, Manage and Misc mode.

-w, --readwrite



Clearing metadata


mdadm --zero-superblock /dev/sdd1


Rebuild tuning


Tip #1: speed_limit_min and speed_limit_max


# speed_limit_min reflects the current "goal" rebuild speed. The default is 1000. Unit: KiB/s
# USB throughput:
#   USB2.0 ~ 30 MB/s
#   USB3.0 ~ 107 MB/s

echo 70000 > /proc/sys/dev/raid/speed_limit_min


# speed_limit_max default: 200000





Check the current rebuild speed

iostat sda sdb 1

      sda             sdb             cpu
  kps tps svc_t   kps tps svc_t  us  sy  wt  id
16498  71  11.5 16493  71  11.5  18  65   3  13
13117  53   9.8 13307  54  10.2   3  20   0  77
26451  80  11.5 26515  80  11.6   7  73  11   9

Tip #2: Bitmap Option

Bitmaps optimize rebuild time after a crash, or after removing and re-adding a device.

# Turn it on by typing the following command:

# internal: the bitmap is stored with the metadata on the array

# When creating  an array on devices which are 100G or larger,

# mdadm automatically adds an internal bitmap as it will usually be beneficial.

mdadm --grow --bitmap=internal /dev/md0

# Once the array has been rebuilt or is fully synced, the bitmap can be removed again:

mdadm --grow --bitmap=none /dev/md0

When an array has a write-intent bitmap, a spindle (a device, often a hard drive) can be removed and re-added,

and then only the blocks changed since the removal (as recorded in the bitmap) will be resynced.

Therefore a write-intent bitmap reduces rebuild/recovery (md sync) time if:

  • the machine crashes (unclean shutdown)
  • one spindle is disconnected, then reconnected

A write-intent bitmap:

    * does not improve performance
    * can be removed/added at any time
    * may cause a degradation in write performance, which varies with:

        - the size of the chunk of data (on the RAID device) mapped to each bit in the bitmap, as shown by cat /proc/mdstat
        - the ratio (bitmap size / RAID device size)
        - the workload profile
          (long sequences of writes are more impacted, as spindle heads go back and forth between the data zone and the bitmap zone)

--bitmap-chunk= sets the chunk size of the bitmap. Each bit corresponds to that many kilobytes of storage.

When using an internal bitmap, the chunk size defaults to 64 MiB.

# A bitmap cannot be added while a rebuild is in progress

mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
mdadm: failed to set internal bitmap.

When you first create a RAID1 (mirrored) array from two drives,

mdadm still mirrors the contents of the first drive to the second, even if both drives are entirely blank (unless --assume-clean is given):


md0 : active raid1 sdd1[0] sde1[1]
      3906885632 blocks super 1.2 [2/2] [UU]
      [=====>...............]  resync = 27.6% (1080482688/3906885632) finish=492.6min speed=95609K/sec
      bitmap: 22/30 pages [88KB], 65536KB chunk

If it's 22/30 that means there are 22 of 30 pages allocated in the in-memory bitmap.

The pages are allocated on demand, and get freed when they're empty (all zeroes).

in-memory bitmap allows bitmap operations to be more efficient

The in-memory bitmap uses 16 bits for each bitmap chunk to count all ongoing writes to the chunk,

so it's actually up to 16 times larger than the on-disk bitmap.
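
The "22/30 pages" figure can be cross-checked with a little arithmetic. The sketch below assumes one 16-bit counter per bitmap chunk (as stated above) and 4096-byte pages, which is consistent with 22 pages being shown as 88KB. bitmap_pages is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: estimate the total number of in-memory bitmap pages.
#   $1 = array size in 1 KiB blocks (the "blocks" figure in /proc/mdstat)
#   $2 = bitmap chunk size in KiB (the "...KB chunk" figure)
# Assumptions: 16-bit counter per chunk, 4096-byte pages.
bitmap_pages() {
    blocks=$1
    chunk_kib=$2
    chunks=$(( (blocks + chunk_kib - 1) / chunk_kib ))   # round up
    bytes=$(( chunks * 2 ))                               # 16 bits per chunk
    echo $(( (bytes + 4095) / 4096 ))                     # round up to pages
}

bitmap_pages 3906885632 65536   # the md0 sample above: prints 30
bitmap_pages 488202944 65536    # the md127 sample (4/4 pages): prints 4
```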

Tip #3: Set read-ahead option

# Set readahead (in 512-byte sectors) to 32 MiB per raid device.

blockdev --setra 65536 /dev/mdX
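
The value passed to --setra is counted in 512-byte sectors, so a size in MiB has to be converted first. A tiny sketch of that conversion (mib_to_sectors is a made-up helper name):

```shell
#!/bin/sh
# Sketch: convert a read-ahead size in MiB to 512-byte sectors.
mib_to_sectors() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

mib_to_sectors 32    # prints 65536, the value used above
```

The current setting can be read back with: blockdev --getra /dev/mdX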




mdadm --monitor options... devices..

By default a daemon runs in the background:

/sbin/mdadm --monitor --pid-file /var/run/mdadm/ --daemonise --scan --syslog

start / stop:

/etc/init.d/mdadm start | stop

  • --syslog               Cause  all  events  to be reported through 'syslog'
  • -f, --daemonise

Only the following events cause an e-mail to be sent:

  • Fail
  • FailSpare
  • DegradedArray
  • SparesMissing
  • TestMessage


  • -t, --test             Send a test e-mail


mdadm -F -1 -t -s

  • -F => --monitor
  • -1 => --oneshot     # Check arrays only once.
  • -t  => --test           # Generate a TestMessage alert for every array found at startup.(mail)
  • -s  => --scan          #



#   should mdadm run periodic redundancy checks over your arrays? See
#   /etc/cron.d/mdadm.

#   should mdadm start the MD monitoring daemon during boot?


Manage mode


  • -t, --test
  • -a, --add
  • -r, --remove
  • -f, --fail
  • --re-add        just updates the blocks that  have  changed  since  the  device  was removed.




specify what automatic behavior is allowed on devices newly appearing in the system and

provides a  way  of  marking  spares that  can  be  moved to other arrays as well as the migration domains.


Domain can be defined through policy line by specifying a domain  name for a number of paths from /dev/disk/by-path/. 

A device may belong to several domains.


Routine maintenance (Manage mode)


Adding a spare hard disk

-a, --add              # (Case 1) If a device appears to have recently been part of the array (possibly it failed or was removed),

                           # the device is re-added (as with --re-add).

                           # (Case 2) If that fails or the device was never part of the array, the device is added as a hot-spare.

                           # If the array is degraded, it will immediately start to rebuild data onto that spare.

--add-spare          # Add a device as a spare.  This is similar to --add except that it does not attempt --re-add first.


mdadm /dev/md0 -a /dev/sdd1

mdadm: added /dev/sdd1

cat /proc/mdstat

md0 : active raid1 sdd1[2](S) sdb1[0] sdc1[1]
      2095415 blocks super 1.2 [2/2] [UU]

unused devices: <none>


Replacing a hard disk

# -f, --fail             # Mark listed devices as faulty

# -r, --remove      # They must not be active. (failed or spare)

# --re-add             # re-add a device that was previously removed from an array.

# --replace            # Mark listed devices as requiring replacement.  

                            # As soon as a spare is available,  it will be rebuilt and will replace the  marked device.

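e.g. replacing sdc1 with sdd1

# When the newly added sdd1 carries no metadata, mdadm adds it as a spare; because the array is degraded at that point, the rebuild onto it starts immediately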

mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1 -a /dev/sdd1

Rescuing an array

-R, --run
              start a partially assembled array




RAID1 speeds

The Linux implementation of RAID1 can double read throughput,

provided that two separate read operations are in flight at the same time.

A single stream of sequential input will not be accelerated (e.g. a single dd),

but multiple sequential streams or a random workload will use more than one spindle.

In theory, having an N-disk RAID1 will allow N sequential threads to read from all disks.


#1: A single stream

sync && echo 3 > /proc/sys/vm/drop_caches


dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &

#2: multiple sequential streams

# bonnie++ is a benchmarking tool that does not perform two separate reads at the same time,

# so two dd processes are used for the test instead

sync && echo 3 > /proc/sys/vm/drop_caches


dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &

dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT skip=$COUNT &

wait    # wait for both background dd processes to finish


man 4 md

RAID 10 instead of RAID 1

mdadm --create /dev/md64 --level=10 --metadata=1.2  --raid-devices=4 \
  /dev/sda4 /dev/sdb4 \
  /dev/sda5 /dev/sdb5

Even in test #1 (a single stream), read performance is still better than with RAID1, because RAID10 contains RAID0-style striping.


Health check



md: Autodetecting RAID arrays.
md: created md1
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
md: kicking non-fresh sdb3 from array!
md: unbind<sdb3>
md: export_rdev(sdb3)

This can happen after an unclean shutdown (like a power fail).

Usually removing and re-adding the problem devices will correct the situation:


mdadm /dev/md0 -a /dev/sdb3


/sbin/mdadm /dev/md0 --fail /dev/sdb3 --remove /dev/sdb3

/sbin/mdadm /dev/md0 --add /dev/sdb3
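
Before re-adding, it can help to see how far a member has fallen behind. As far as I understand it, a "non-fresh" member is one whose Events counter (shown by mdadm -E) lags behind the other members. The sketch below only parses that counter from text. events_of and is_behind are hypothetical helper names, and the counters in the demo are made up.

```shell
#!/bin/sh
# Sketch: print the Events counter found in "mdadm -E" output read on stdin.
events_of() {
    awk -F' *: *' '/^ *Events *:/ { print $2; exit }'
}

# Succeeds when the first counter is lower than the second.
is_behind() {
    [ "$1" -lt "$2" ]
}

# Demo with made-up counters for two members of the same array
good=$(printf '         Events : 19\n' | events_of)
suspect=$(printf '         Events : 17\n' | events_of)
if is_behind "$suspect" "$good"; then
    echo "suspect member is $((good - suspect)) events behind"
fi
# prints: suspect member is 2 events behind
```

On a live system the counter would come from: mdadm -E /dev/sdb3 | events_of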

Trigger a full check of md0 with

echo check > /sys/block/md0/md/sync_action


md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 4192896 blocks.
... some time later ...
md: md0: sync done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2

cat /proc/mdstat

md0 : active raid1 sdb2[1] sda2[0]
      4192896 blocks [2/2] [UU]
      [=========>...........]  resync = 48.4% (2034624/4192896) finish=0.5min speed=61875K/sec
... some time later ...
md0 : active raid1 sdb2[1] sda2[0]
      4192896 blocks [2/2] [UU]


If a read error is encountered, the block in error is calculated and written back.

If the array is a mirror, as it can't calculate the correct data,

it will take the data from the first (available) drive and write it back to the dodgy drive.

# This will stop any check or repair that is currently in progress.

echo idle > /sys/block/mdX/md/sync_action

# Only valid for parity raids - this will also check the integrity of the data as it reads it,

# and rewrite a corrupt stripe.

# It will terminate immediately without doing anything if the array is degraded, as it cannot recalculate the faulty data.

# DO NOT run this on raid-6 without making sure that it is the correct thing to do.

# There is a utility "raid6check" that you should use if "check" flags data errors on a raid-6.

echo repair > /sys/block/mdX/md/sync_action





Normally there is no need to run mdmon by hand.


mdmon polls the sysfs namespace looking for changes in array_state.


mdmon [--all] [--takeover] CONTAINER

CONTAINER: The container device to monitor.(/dev/md/container)


--all: tells mdmon to find any active containers and start monitoring each of them.


--takeover: instructs mdmon to replace any active mdmon which is currently monitoring the array.


Disable Auto Assemble MD


At boot


GRUB_CMDLINE_LINUX_DEFAULT=" ... raid=noautodetect"


grep noautodetect /boot/grub/grub.cfg

On hotplug

mkdir /lib/udev/rules.d_disable

mv /lib/udev/rules.d/64-md-*.rules /lib/udev/rules.d_disable

udevadm control --reload




Scenario: moving a disk from RAID group 1 to group 2

# --zero-superblock: You can make drives forget they were in a RAID by zeroing out their md superblocks.

mdadm --zero-superblock /dev/sdd1


Scrubbing the drives



For RAID1 and RAID10

It compares the corresponding blocks of each disk in the array.


For RAID4, RAID5 and RAID6, this means checking that the parity block is (or blocks are) correct.

Read error

If a read error is detected during this process,

the normal read-error handling causes correct data to be found from other devices

and to be written back to the faulty device.

mismatch (not read-error)

If all blocks read successfully but are found to not be consistent, then this is regarded as a mismatch.


The mismatch count (exposed as the sysfs file md/mismatch_cnt) is set to zero when a scrub starts and is incremented whenever a sector is found that is a mismatch.

A value of 128 could simply mean that a single 64 KiB check found an error (128 x 512 bytes = 64 KiB).

The count does not determine exactly how many sectors were actually affected, because md normally works in units much larger than a single 512-byte sector.
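
The sector arithmetic above can be written down as a one-line helper. This is only a sketch: mismatch_kib is a made-up name, and it assumes the counter is reported in 512-byte sectors, as described above.

```shell
#!/bin/sh
# Sketch: convert a mismatch count (in 512-byte sectors) to KiB.
mismatch_kib() {
    echo $(( $1 * 512 / 1024 ))
}

mismatch_kib 128    # prints 64
```

On a live system the counter is, to my knowledge, readable from sysfs: mismatch_kib "$(cat /sys/block/md0/md/mismatch_cnt)"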

If check was used,

then no action is taken to handle the mismatch, it is simply recorded.

If repair was used,

then a mismatch will be repaired in the same way that resync repairs arrays.

For RAID5/RAID6 new parity blocks are written.

On a truly clean RAID5 or RAID6 array, any mismatches should indicate a hardware problem at some level

For RAID1/RAID10, all but one block are overwritten with the content of that one block.

On RAID1 and RAID10 it is possible for software issues to cause a mismatch to be reported.

1. if there's a power outage or

2. if you have memory-mapped files like swap files

3. If an array is created with "--assume-clean" then a subsequent check could be expected to find some mismatches. (unused space)

Check vs. Repair

As opposed to check a repair also includes a resync.

The difference from Resync is, that no bitmap is used to optimize the process.

Track mismatch on RAID1

Fill up the entire disk (cat /dev/zero > bigfile)

Free the space again (rm bigfile)

Re-run a data check (echo check > /sys/block/mdX/md/sync_action)

# Checking

cat /sys/block/md0/md/sync_action

idle / check

cat /proc/mdstat


When only one disk of a RAID 1 is left



md126 : inactive sdd5[0](S)
      2925435456 blocks super 1.2

mdadm -D /dev/md126

           Version : 1.2
        Raid Level : raid0
     Total Devices : 1
       Persistence : Superblock is persistent

             State : inactive
   Working Devices : 1

              Name : DiskStation:2
              UUID : ?:?:?:?
            Events : 15689956

    Number   Major   Minor   RaidDevice

       -       8       53        -        /dev/sdd5

mdadm -E /dev/sdd5

          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : ?:?:?:?
           Name : DiskStation:2
  Creation Time : Mon Dec 17 18:38:53 2018
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 5850870912 (2789.91 GiB 2995.65 GB)
     Array Size : 2925435456 (2789.91 GiB 2995.65 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=0 sectors
          State : clean
    Device UUID : ?:?:?:?

    Update Time : Tue Apr  7 10:59:16 2020
       Checksum : 42566548 - correct
         Events : 15689956

   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)


mdadm -o -R /dev/md125

mdadm: started array /dev/md/DiskStation:2


[491604.441644] md/raid1:md125: active with 1 out of 2 mirrors
[491604.441695] md125: detected capacity change from 0 to 2995645906944

mdadm -S md125