software raid

Last updated: 2021-03-23

 

Introduction

Software RAID on Linux is called md.

It manages block devices through the Linux kernel driver "md_mod".

Features:

  • combine several physical disks into one larger virtual device (RAID0, RAID5)
  • performance improvements (RAID1, RAID10)
  • redundancy (RAID1, RAID5 ...)

Linux software RAID supports the following RAID levels:

RAID configurations

  • RAID 1 – Mirror
  • RAID 5 - Parity distributed across all devices
  • RAID 10 – Take a number of RAID 1 mirrorsets and stripe across them RAID 0 style.

Non-RAID

  • Linear
  • Multipath
  • Faulty
  • Container


 


Difference between MD and Multipath devices

 

  • RAID: /dev/md0                   <-- Process: md0_raid1 manages it
  • MULTIPATH: /dev/dm-0

 


Installation

 

Debian 6:

apt-get install mdadm

One of the important steps here is that update-initramfs gets run:

update-initramfs: Generating /boot/initrd.img-2.6.32-5-686
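
If /etc/mdadm/mdadm.conf is changed later, the initramfs should be regenerated so the boot environment picks up the new config; a minimal sketch (Debian-style, current kernel only):

update-initramfs -u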

Files installed:

  • /etc/cron.d/mdadm
  • /sbin/mdadm
  • /sbin/mdmon

Configuration:

  • /etc/mdadm/mdadm.conf
  • /etc/default/mdadm

Version

mdadm -V

mdadm - v3.1.4 - 31st August 2010

 


Config File

 

Location

  • /etc/mdadm.conf               # CentOS 7 (create it manually if it does not exist)
  • /etc/mdadm/mdadm.conf   # Debian

mdadm.conf

[1] DEVICE

Specifies where to look for RAID member devices (e.g. /dev/sd*1) when trying to assemble arrays

DEVICE partitions

The default setting: the devices to scan are taken from /proc/partitions

Devices that contain MD superblocks will then be assembled

DEVICE /dev/sda* /dev/sdb1

Scan all partitions of the specified devices

DEVICE /dev/disk/by-path/pci-*

This approach is the best, because the exact port/controller can be specified (an example DEVICE line follows the listing below)

ll /dev/disk/by-path

total 0
lrwxrwxrwx 1 root root  9 Dec 28 15:41 pci-0000:00:1f.2-ata-1 -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 28 15:41 pci-0000:00:1f.2-ata-1.0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 28 15:41 pci-0000:00:1f.2-ata-1.0-part1 -> ../../sdd1
...
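
A hedged example of such a DEVICE line, pinning scanning to the controller shown in the listing above (the path pattern is illustrative and must be adapted to the actual hardware):

DEVICE /dev/disk/by-path/pci-0000:00:1f.2-ata-*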

[2]

# automatically create /dev/md* and set its permissions
# "auto=yes" is equivalent to mdadm's "--auto" option.
CREATE owner=root group=disk mode=0660 auto=yes

If this is not set, arrays have to be listed explicitly with "ARRAY /dev/mdN metadata=1.2 name=HOST:NAME UUID=..."

[3]

# default value for the "--homehost" option of the mdadm CLI
# (considered the home for any arrays)
# it is used during both auto-assembly and creation (stored in the metadata)
# <system> is equivalent to gethostname
HOMEHOST <system>

[4]

#### Mail Setting ####
MAILFROM root@server
# who gets mailed when something goes wrong with an array
MAILADDR [email protected]

PROGRAM

A program to be run when "mdadm --monitor" detects potentially interesting events on any of the arrays it is monitoring.
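
A hedged sketch of how this might be used; the script path and its contents are purely illustrative (mdadm passes the event name, the md device, and possibly a component device as arguments):

PROGRAM /usr/local/sbin/md-event-handler.sh

#!/bin/bash
# /usr/local/sbin/md-event-handler.sh (hypothetical)
# $1 = event name (e.g. Fail, DegradedArray), $2 = md device, $3 = component device (may be empty)
logger -t md-event "event=$1 array=$2 device=${3:-none}"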

 


Differences between metadata (superblock) versions

 

 * Default: v1.2

0.90            # common format (superblock: 4K, 64K aligned block)

superblock location: At the end of the device

 * Putting the superblock at the end of the device is dangerous

    if you have any kind of auto-mounting/auto-detection/auto-activation of the raid contents;

1.x              # superblock that is normally 1K long, but can be longer

# 1.x superblock on different locations on the device

  • 1.0: near the end (at least 8K, and less than 12K, from the end)
  • 1.1: At the start
  • 1.2: 4K after the start                 # advantage: the device can still have a partition table

hexdump -C /dev/sdb3 | less

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
...

ddf              # Use the "Industry  Standard" DDF (Disk  Data Format) format defined by SNIA.

# When creating a DDF array a CONTAINER will be created, and normal arrays can be created in that container.

imsm         # Use  the Intel(R) Matrix Storage Manager metadata format.

 


Array

 

An assembled array

ls -l /dev/md/500g

lrwxrwxrwx 1 root root 8 Jul 30 23:59 /dev/md/500g -> ../md127

The ARRAY lines identify actual arrays.

/etc/mdadm.conf

ARRAY /dev/md/root level=raid1 num-devices=2 UUID=...

Configuring ARRAY lines

An array is only assembled when ALL of the listed identities match

  • uuid=           # 128 bit uuid in hexadecimal
  • name=         # name stored in the superblock
  • level=          # The value is a raid level
  • devices=      # the devices listed here must also be listed on a DEVICE line

e.g.

# identify the array by its devices (sda1, sdb1) and RAID level (1)

ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1 level=1

# identify the array by metadata version and UUID

# If the name does not start with a slash ('/'), it is treated as being in /dev/md/

# the UUID is not the one from blkid; it is obtained with "mdadm -E /dev/sd?"

ARRAY 500g metadata=1.2 UUID=938cab94:2fd92126:e1d6bfdf:cf91677e

# after modifying the config file, reload the daemon

service mdadm reload

Auto-assembly of arrays (AUTO):

AUTO controls whether auto-assembly is performed; its value is a list of metadata types

The rules are:

  1. first match wins
  2. plus sign: the auto assembly is allowed
  3. minus sign: the auto assembly is disallowed
  4. no match: the auto assembly is allowed

metadata types: 0.90, 1.x, ddf, imsm

e.g.

# all is usually last
AUTO +1.x homehost -all

 * AUTO is applied when "mdadm -As" is run. -A = --assemble; -s = --scan

Disabling mdadm automatic assembly of ARRAYs

/etc/mdadm.conf

AUTO -all

OR

# <ignore>: arrays matching the rest of the line will never be automatically assembled.
ARRAY <ignore> UUID=?:?:?:?

* if the superblock is tagged as belonging to the given home host,

   it will automatically choose a device name and try to assemble the array.

udev

# prevent actions like "touch /dev/sd{e,f}" from triggering RAID assembly

# Backup

mv /lib/udev/rules.d/64-md-raid-assembly.rules /root

udevadm control --reload-rules

Auto-assembling arrays at boot

When md is compiled into the kernel (not as module),

partitions of type 0xfd are scanned and automatically assembled into RAID arrays.

(suppressed with the kernel parameter "raid=noautodetect")

Checking: kernel & module

grep CONFIG_MD /boot/config-*

CONFIG_MD=y
CONFIG_MD_AUTODETECT=y
...

 * CONFIG_MD_AUTODETECT only works for version 0.90 superblocks

 * init scripts are needed to start any arrays that aren't started by auto-detect

Partitionable

The kernel parameter raid=partitionable (or raid=part) means that

  all auto-detected arrays are assembled as partitionable.

The standard names for non-partitioned arrays

/dev/mdN
/dev/md/N

The standard names for partitionable arrays

/dev/md/dNpM
/dev/md_dNpM

 


Command: mdadm usage

 

usage:

mdadm [mode] <raiddevice> [options] <component-devices>

mode:

  • Assemble(-A): Assemble the components of a previously created array into an active array.
  • Build(-B): Build an array that doesn't have per-device metadata(similar to --create)
  • Create(-C): A 'resync' process is started to make sure that the array is consistent
  • Monitor mode(--monitor, -F): sends alerts when the state of the RAID changes
  • Manage: adding new spares and removing faulty devices
  • Grow (or shrink): 
    1) active size of component devices
    2) number of  active devices
    3) RAID level
  • Auto-detect(--auto-detect): requests the Linux Kernel to activate any auto-detected arrays
  • Misc mode (-Q, --query, -D, --detail, -E, --examine, -R, --run, -S, --stop)

Common commands:

-A

# re-assemble sdi1 and sdj1 (which already have RAID metadata) into md0

mdadm -A /dev/md0 /dev/sdi1 /dev/sdj1

mdadm: /dev/md0 has been started with 2 drives.

-s, --scan                       # not mode-specific

-F, --follow, --monitor     # Select Monitor mode.

# scan /proc/mdstat for the required information, e.g. which md? and sd? devices to monitor

mdadm -F --scan

Other options

-o, --readonly        #  Create, Assemble, Manage and Misc mode

Start the array read only rather than read-write as normal.
(no resync, recovery, or reshape)

-w, --readwrite

--zero-superblock

Use case: wiping the metadata

mdadm --zero-superblock /dev/sdd1

 


Use case: creating a two-disk RAID1

 

Preparing the hard disks:

[Method A]

fdisk /dev/sde

Set the partition's system id:

press t

then enter fd    <-- Linux raid auto

[Method B]

parted /dev/sde

(parted) mklabel msdos

(parted) mkpart pri 0% 100%

(parted) set 1 raid                                # parted /dev/sde set 1 raid

(parted) print                                       # parted /dev/sde print

Number  Start   End    Size   Type     File system  Flags
 1      33.6MB  500GB  500GB  primary               raid

Creating the RAID1 array:

# to be safe, double-check the disks first

blkid /dev/{sde,sdf}1

# create the RAID1

mdadm -C /dev/md0 -l 1 -n 2 /dev/sde1 /dev/sdf1

 * the two disks start syncing immediately after creation

Opts

-C, --create                  # Create a new array

-n N, --raid-devices=N  # number of active devices in the array
                                   # Setting a value of 1 is probably a mistake and so requires that --force be specified first.
                                   # omitting it gives the error "mdadm: no raid-devices specified."

-l, --level                     # 1 = RAID1

-N, --name=                # optional: a name to store in the superblock (if omitted, the name is derived from the md device)

Other options:

  • -x, --spare-devices=

Checking:

cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sde1[0] sdf1[1]
      488202944 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  5.0% (24566016/488202944) finish=416.1min speed=18566K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

P.S.

# keep monitoring the RAID rebuild status - monrebuild.sh

#!/bin/bash
# print the resync progress line from /proc/mdstat once per second
while true
do
    awk '/resync/' /proc/mdstat
    sleep 1
done

Saving the RAID configuration to the config file (mdadm.conf):

mdadm --detail --scan >> /etc/mdadm/mdadm.conf

Lines added:

ARRAY /dev/md0 metadata=1.2 name=home:0 UUID=55b256aa:ebec745e:0c0a2ab7:c92d4fb1

Notes

The following command has the same effect:

mdadm --examine --scan >> /etc/mdadm/mdadm.conf

 


Checking RAID status

 

cat /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      2095415 blocks super 1.2 [2/2] [UU]

unused devices: <none>
#^ devices that are not part of any assembled array are listed here

 


Create RAID1 With "missing"

 

# keyword "missing" is specified for the first device: this will be added later.

mdadm --create /dev/md0 --name softraid2t --level=1 --raid-devices=2 missing /dev/sdg1

# Verify that the RAID array was created

cat /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sdg1[1]
      7813893120 blocks super 1.2 [2/1] [_U]
      bitmap: 0/59 pages [0KB], 65536KB chunk

# Add disk partition to array

mdadm /dev/md0 --add /dev/sdf1

mdadm: added /dev/sdf1

 * If  a device is given before any options, or
    if the first option is one of --add, --re-add, --add-spare, --fail, --remove, or --replace,
    then the MANAGE mode is assumed.

# Verify that the RAID array is being rebuilt

cat /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sdf1[2] sdg1[1]
      7813893120 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (4175872/7813893120) finish=623.3min speed=208793K/sec
      bitmap: 28/59 pages [112KB], 65536KB chunk

unused devices: <none>

 


Starting and stopping md

 

Start md:

Method 1: auto-assemble the RAID

mdadm -As

mdadm: /dev/md0 has been started with 2 drives.

-A, --assemble       # Assemble a pre-existing array.

-s, --scan              # Scan config file or /proc/mdstat for missing information.

Method 2: manual

# -A, --assemble              # Assemble a pre-existing array.

mdadm -A /dev/md0 /dev/usbdisk/WD4T-K3G5AD6B /dev/usbdisk/WD4T-NHGNXBVY

mdadm: /dev/md0 has been started with 2 drives.

Assembling by UUID:

Assemble

-u, --uuid=      # uuid of array to assemble.  Devices which don't have this uuid are excluded

If the devices do not carry the specified uuid, assembly fails with the following error:

mdadm: /dev/usbdisk/WD4T-K3G5AD6B has wrong uuid.
mdadm: /dev/usbdisk/WD4T-NHGNXBVY has wrong uuid.
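
A hedged example of assembling by UUID (the UUID and device names are taken from other examples on this page; only members whose superblock carries that Array UUID are included):

mdadm -A /dev/md0 --uuid=938cab94:2fd92126:e1d6bfdf:cf91677e /dev/usbdisk/WD4T-K3G5AD6B /dev/usbdisk/WD4T-NHGNXBVY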

Checking

blkid /dev/sda2

/dev/sda2: 
 UUID="2c58db70-dca0-d769-7127-8d9cdcfe0d9b" 
 UUID_SUB="ace8c810-31ee-ce5f-ec4b-4ce874f091de" 
 LABEL="localhost:boot" TYPE="linux_raid_member" PARTUUID="4f2c6ff1-02"

mdadm -E /dev/sda2

Array UUID : 2c58db70:dca0d769:71278d9c:dcfe0d9b
Name : localhost:boot
...
Device UUID : ace8c810:31eece5f:ec4b4ce8:74f091de

UUID = Array UUID

UUID_SUB = Device UUID

blkid /dev/md125

/dev/md125: UUID="f721d037-2c40-4ce6-91a2-4eb39c00cd8b" BLOCK_SIZE="1024" TYPE="ext4"

Assembling by name

-N, --name=

This must be the name that was specified when creating the array.  

It must either match the name stored in the superblock exactly,

or it must match with the current homehost prefixed to the start of the given name.
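
A hedged example, using the array created earlier on this page (name "home:0", members sde1/sdf1):

mdadm -A /dev/md0 --name=home:0 /dev/sde1 /dev/sdf1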

Stop md:

-S, --stop              # deactivate array, releasing all resources

e.g.

mdadm -S /dev/md0

mdadm: stopped /dev/md0

 * after stopping, both the entry in "/proc/mdstat" and /dev/md0 disappear

 


Viewing RAID information

 

Misc mode:

  • -Q, --query           # Examine a device to see if it is an md device and if it is a component of an md array.
  • -D, --detail           # Print details of one or more md devices.
  • -E, --examine       # Print contents of the metadata stored on the named device

Query(-Q):

mdadm -Q /dev/sda1

/dev/sda1: is not an md array

mdadm -Q /dev/sdb1

/dev/sdb1: is not an md array
/dev/sdb1: device 0 in 2 device unknown raid1 array.  Use mdadm --examine for more detail.

mdadm -Q /dev/md0

/dev/md0: 2046.30MiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.

Detail(-D):

mdadm -D /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Thu Feb 14 17:38:31 2013
     Raid Level : raid1
     Array Size : 2095415 (2046.65 MiB 2145.70 MB)
  Used Dev Size : 2095415 (2046.65 MiB 2145.70 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu Feb 14 17:52:03 2013
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 50% complete                       # the rebuild is much slower while the device is mounted

           Name : debian3:0  (local to host debian3)
           UUID : 938cab94:2fd92126:e1d6bfdf:cf91677e
         Events : 17

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

Examine(-E):

# Print contents of the metadata stored on the named device(s).

mdadm -E /dev/sdb1

/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 938cab94:2fd92126:e1d6bfdf:cf91677e
           Name : debian3:0  (local to host debian3)
  Creation Time : Thu Feb 14 17:38:31 2013
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 4190854 (2046.66 MiB 2145.72 MB)
     Array Size : 4190830 (2046.65 MiB 2145.70 MB)
  Used Dev Size : 4190830 (2046.65 MiB 2145.70 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 833f1527:7789fd72:1c8feb2f:2f175385

    Update Time : Thu Feb 14 17:53:47 2013
       Checksum : 57c1b402 - correct
         Events : 19


   Device Role : Active device 0                         # there is also "Device Role : spare"
   Array State : AA ('A' == active, '.' == missing)

Difference between Detail and Examine

--examine applies to devices which are components of an array

--detail applies to a whole array which is currently active.

 


Rebuild Speed Tuning

 

Tip #1: speed_limit_min and speed_limit_max

Unit: KiB/s

speed_limit_min

# reflects the current "goal" rebuild speed. The default is 1000.
# typical USB speeds:
#   USB 2.0 ~ 30 MB/s
#   USB 3.0 ~ 107 MB/s

echo 70000 > /proc/sys/dev/raid/speed_limit_min

speed_limit_max

# Default: 200000

/proc/sys/dev/raid/speed_limit_max

Checking the values with sysctl (to make them persistent, write them to a sysctl config file; see the sketch below)

sysctl dev.raid.speed_limit_min

sysctl dev.raid.speed_limit_max
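
A hedged sketch of persisting the values across reboots; the file name is illustrative and the values are the ones used above:

cat > /etc/sysctl.d/90-md-rebuild-speed.conf <<'EOF'
dev.raid.speed_limit_min = 70000
dev.raid.speed_limit_max = 200000
EOF

sysctl -p /etc/sysctl.d/90-md-rebuild-speed.conf        # apply immediately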

Check the current rebuild speed

iostat sda sdb 1

      sda             sdb             cpu
  kps tps svc_t   kps tps svc_t  us  sy  wt  id
16498  71  11.5 16493  71  11.5  18  65   3  13
13117  53   9.8 13307  54  10.2   3  20   0  77
26451  80  11.5 26515  80  11.6   7  73  11   9
....

Tip #2: Bitmap Option

Bitmaps optimize rebuild time after a crash, or after removing and re-adding a device.

# Turn it on by typing the following command: "--bitmap=?"

# Type=internal: the bitmap is stored with the metadata on the array

# When creating an array on devices which are 100G or larger,
#     mdadm automatically adds an internal bitmap as it will usually be beneficial.

mdadm --grow --bitmap=internal /dev/md0

# Once array rebuild or fully synced, disable bitmaps:

mdadm --grow --bitmap=none /dev/md0

write-intent bitmap

When an array has a write-intent bitmap, a spindle (a device, often a hard drive) can be removed and re-added,

    then only blocks changed since the removal (as recorded in the bitmap) will be resynced.

Therefore a write-intent bitmap reduces rebuild/recovery (md sync) time if:

  • the machine crashes (unclean shutdown)
  • one spindle is disconnected, then reconnected

A write-intent bitmap:

    * does not improve performance

    * can be removed/added at any time

    * may cause a degradation in write performance; it depends on:

       - the size of the chunk of data (on the RAID device) mapped to each bit in the bitmap, as shown by cat /proc/mdstat.

       - the ratio (bitmap size / RAID device size) and the workload profile
        (long sequences of writes are more impacted, as the spindle heads go back and forth between the data zone and the bitmap zone)
    
--bitmap-chunk=

Set the chunk size of the bitmap. Each bit corresponds to that many kilobytes of storage.

When using an internal bitmap, the chunk size defaults to 64M
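
A hedged example of adding an internal bitmap with an explicit chunk size (the 128M value is illustrative; a larger chunk means a smaller bitmap but coarser resync granularity):

mdadm --grow --bitmap=internal --bitmap-chunk=128M /dev/md0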

A bitmap cannot be added while a rebuild is in progress:

mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
mdadm: failed to set internal bitmap.

When you first create a RAID1 (mirrored) array from two drives, mdadm insists on mirroring the contents of the first drive to the second even though the drives are entirely blank, e.g.

md0 : active raid1 sdd1[0] sde1[1]
      3906885632 blocks super 1.2 [2/2] [UU]
      [=====>...............]  resync = 27.6% (1080482688/3906885632) finish=492.6min speed=95609K/sec
      bitmap: 22/30 pages [88KB], 65536KB chunk

If it's 22/30 that means there are 22 of 30 pages allocated in the in-memory bitmap.

The pages are allocated on demand, and get freed when they're empty (all zeroes).

The in-memory bitmap allows bitmap operations to be more efficient.

The in-memory bitmap uses 16 bits for each bitmap chunk to count all ongoing writes to the chunk,

so it's actually up to 16 times larger than the on-disk bitmap.

Tip #3: Set read-ahead option

In testing, this did not make things faster

Unit: N sectors (512 bytes per sector)

blockdev --getra /dev/md127

256                             # 128 KB

blockdev --setra 65536 /dev/md127  # 32 MB

 



Monitor Mode

 

mdadm --monitor options... devices..

By default a daemon runs in the background:

/sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog

start / stop:

/etc/init.d/mdadm start | stop

  • --syslog               Cause  all  events  to be reported through 'syslog'
  • -f, --daemonise

Only the following events trigger an email:

  • Fail
  • FailSpare,
  • DegradedArray
  • SparesMissing
  • TestMessage

Other options:

  • -t, --test             Send a test e-mail

Example:

mdadm -F -1 -t -s

  • -F => --monitor
  • -1 => --oneshot     # Check arrays only once.
  • -t  => --test           # Generate a TestMessage alert for every array found at startup.(mail)
  • -s  => --scan          # Scan config file or /proc/mdstat for missing information

Config

/etc/default/mdadm

#   should mdadm run periodic redundancy checks over your arrays? See
#   /etc/cron.d/mdadm.
AUTOCHECK=false

# START_DAEMON:
#   should mdadm start the MD monitoring daemon during boot?
START_DAEMON=false

 


Manage Mode

 

  • -t, --test
  • -a, --add
  • -r, --remove
  • -f, --fail
  • --re-add        just updates the blocks that have changed since the device was removed.

 


Policy

 

Policy lines specify what automatic behavior is allowed on devices newly appearing in the system and

provide a way of marking spares that can be moved to other arrays, as well as the migration domains.

A domain can be defined through a policy line by specifying a domain name for a number of paths from /dev/disk/by-path/.

A device may belong to several domains.
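
A hedged sketch of what policy lines in mdadm.conf might look like; the domain name and path pattern are illustrative only:

# any bare device appearing on this controller may be used as a spare for arrays in this domain
POLICY domain=onboard-sata path=pci-0000:00:1f.2-ata-* action=spare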

 



Routine maintenance (Manage Mode)

 

  • Adding a spare disk
  • Rescuing an array

Adding a spare disk

-a, --add              # (case 1) If a device appears to have recently been part of the array (possibly it failed or was removed),

                           # in that case --add behaves like --re-add

                           # (case 2) If that fails, or the device was never part of the array, the device is added as a hot-spare

                           # If the array is degraded, it will immediately start to rebuild data onto that spare.

--add-spare          # Add a device as a spare. This is similar to --add except that it does not attempt --re-add first.

e.g.

mdadm /dev/md0 -a /dev/sdd1

mdadm: added /dev/sdd1

cat /proc/mdstat

md0 : active raid1 sdd1[2](S) sdb1[0] sdc1[1]
      2095415 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Replacing a disk

-f, --fail             # Mark listed devices as faulty

-r, --remove      # They must not be active. (failed or spare)

--re-add            # re-add a device that was previously removed from an array.

                        # The recovery may only require sections that are flagged in a write-intent bitmap to be recovered

--replace           # Mark listed devices as requiring replacement.  

                        # As soon as a spare is available,  it will be rebuilt and will replace the marked device.

e.g. putting the offlined/failed sda1 back into the RAID

cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sda1[1](F)
      1953381376 blocks super 1.2 [2/1] [U_]
      bitmap: 5/15 pages [20KB], 65536KB chunk

mdadm /dev/md0 --re-add /dev/sda1

mdadm: re-add /dev/sda1 to md0 succeed

cat /proc/mdstat              # "--re-add" is much faster than "--replace"

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sda1[1]
      1953381376 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  3.2% (63370624/1953381376) finish=275.4min speed=114352K/sec
      bitmap: 5/15 pages [20KB], 65536KB chunk

e.g. replacing sdc1 with sdd1

# since the newly added sdd1 has no metadata, mdadm adds it to the array as a new member (not a real "re-add")

mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1 -a /dev/sdd1

Rescuing an array

-R, --run              start a partially assembled array

e.g.

mdadm -A /dev/md0 /dev/loop0p3 --run

mdadm: /dev/md0 has been started with 1 drive (out of 2).

 


Health Check

 

At boot time

md: Autodetecting RAID arrays.
....
md: created md1
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
md: kicking non-fresh sdb3 from array!
md: unbind<sdb3>
md: export_rdev(sdb3)

This can happen after an unclean shutdown (like a power fail).

Usually removing and re-adding the problem devices will correct the situation:

Fix

mdadm /dev/md0 -a /dev/sdb3

Trigger a full check of md0 with

echo check > /sys/block/md0/md/sync_action

dmesg

md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 4192896 blocks.
... some time later ...
md: md0: sync done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2

cat /proc/mdstat

md0 : active raid1 sdb2[1] sda2[0]
      4192896 blocks [2/2] [UU]
      [=========>...........]  resync = 48.4% (2034624/4192896) finish=0.5min speed=61875K/sec
... some time later ...
md0 : active raid1 sdb2[1] sda2[0]
      4192896 blocks [2/2] [UU]

P.S.

If a read error is encountered, the block in error is recalculated and written back.

If the array is a mirror, as it can't calculate the correct data,

it will take the data from the first (available) drive and write it back to the dodgy drive.

# This will stop any check or repair that is currently in progress.

echo idle > /sys/block/mdX/md/sync_action

# Only valid for parity raids - this will also check the integrity of the data as it reads it,

# and rewrite a corrupt stripe.

# It will terminate immediately without doing anything if the array is degraded, as it cannot recalculate the faulty data.

# DO NOT run this on raid-6 without making sure that it is the correct thing to do.

# There is a utility "raid6check" that you should use if "check" flags data errors on a raid-6.

echo repair > /sys/block/mdX/md/sync_action

 



Performance

 

single & multiple speed

The Linux implementation of RAID1 speeds up disk read operations up to twofold,

as long as two separate disk read operations are performed at a time.

A single stream of sequential input will not be accelerated (e.g. a single dd),

but multiple sequential streams or a random workload will use more than one spindle.

In theory, having an N-disk RAID1 will allow N sequential threads to read from all disks.

Test:

#1: A single stream

sync && echo 3 > /proc/sys/vm/drop_caches

COUNT=1000;

dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &

#2: multiple sequential streams

# the bonnie++ benchmarking tool doesn't perform two separate reads at a time,

# so two dd processes are used for the test instead

sync && echo 3 > /proc/sys/vm/drop_caches

COUNT=1000;

dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &

dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT skip=$COUNT &

Doc

man 4 md

RAID 10 instead of RAID 1

mdadm --create /dev/md64 --level=10 --metadata=1.2  --raid-devices=4 \
  /dev/sda4 /dev/sdb4 \
  /dev/sda5 /dev/sdb5

In the test-1 (single stream) case, RAID10 still reads faster than RAID1, because two RAID1 pairs are combined RAID0-style,

and RAID0 speeds up reads.

RAID 1 Speed Test

# T1 (terminal 1): monitor

dstat -d -D sda,sdb

# T2: write speed: 20?M

echo 1 > /proc/sys/vm/drop_caches

pv /dev/zero > test.bin    # both disks are written to at the same time

# T2: read speed: 20?M

echo 1 > /proc/sys/vm/drop_caches

pv test.bin > /dev/null    # data is read from only one of the disks

# T2: read & write at same time by cp, rsync: 5?M

echo 1 > /proc/sys/vm/drop_caches

cp test.bin test2.bin

rsync test.bin test2.bin       # no difference

Summary: in these tests (reading and writing at the same time), an ordinary HDD only delivers about 30~40% of its performance

# T2: read & write at same time by dd

# prepare the file

pv /dev/zero > test.bin

# test

echo 1 > /proc/sys/vm/drop_caches

pv /dev/zero > test1.bin

pv test.bin > /dev/null

Attempts to improve the copy speed

Readahead

# 8192 sectors x 512 bytes / 1024 = 4096 KB (4 MB)

blockdev --getra /dev/md127

# 8388608 sectors x 512 bytes = 4 GB

blockdev --setra 8388608 /dev/md127

Conclusion: no improvement

deadline scheduler

# sda and sdb (their partitions sda1 and sdb1 are the RAID1 members)

echo deadline > /sys/block/sda/queue/scheduler

echo deadline > /sys/block/sdb/queue/scheduler

Conclusion: no improvement

Summary

An HDD only reaches its top speed with sequential R/W ("hdparm -t /dev/sda")

 



mdmon

 

Introduction:

Normally it is not run by hand.

Function:

mdmon polls the sysfs namespace looking for changes in array_state.

Usage:

mdmon [--all] [--takeover] CONTAINER

CONTAINER: The container device to monitor.(/dev/md/container)

--all 

This tells mdmon to find any active containers and start monitoring each of them

--takeover

instructs mdmon to replace any active mdmon which is currently monitoring the array.
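
A hedged example (mdmon is normally launched automatically by mdadm/udev; running it manually is mainly useful to take over monitoring of all active containers, e.g. after switching root):

mdmon --all --takeover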

 


Disable Auto Assemble MD

 

At boot

/etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT=" ... raid=noautodetect"

update-grub

grep noautodetect /boot/grub/grub.cfg

On hotplug

mkdir /lib/udev/rules.d_disable

mv /lib/udev/rules.d/64-md-*.rules /lib/udev/rules.d_disable

udevadm control --reload

P.S.

Adding or removing partitions on a HDD will also trigger it

 


Resetting the RAID

 

Scenario: moving a disk from RAID group 1 to RAID group 2

# --zero-superblock: You can make drives forget they were in a RAID by zeroing out their md superblocks.

mdadm --zero-superblock /dev/sdd1

 


Scrubbing the drives

 

Checks

For RAID1 and RAID10

It compares the corresponding blocks of each disk in the array.

For RAID4, RAID5, RAID6

this means checking that the parity block is (or blocks are) correct.

If a read error

If a read error is detected during this process,

the normal read-error handling causes correct data to be found from other devices

and to be written back to the faulty device.

mismatch (not read-error)

If all blocks read successfully but are found to not be consistent, then this is regarded as a mismatch.

/sys/block/mdX/md/mismatch_cnt

This is set to zero when a scrub starts and is incremented whenever a sector is found that is a mismatch.

A value of 128 could simply mean that a single 64KB check found an error (128 x 512bytes = 64KB).

(128 => it does not determine exactly how many actual sectors were affected)

(md normally works in units(128) much larger than a single sector (512byte))

If check was used,

then no action is taken to handle the mismatch, it is simply recorded.

If repair was used,

then a mismatch will be repaired in the same way that resync repairs arrays.

For RAID5/RAID6 new parity blocks are written.

On a truly clean RAID5 or RAID6 array, any mismatches should indicate a hardware problem at some level

For RAID1/RAID10, all but one block are overwritten with the content of that one block.

On RAID1 and RAID10 it is possible for software issues to cause a mismatch to be reported.

1. if there's a power outage or

2. if you have memory-mapped files like swap files

3. If an array is created with "--assume-clean" (avoid  the initial resync) # not recommended

    then a subsequent check could be expected to find some mismatches. (unused space)

    (--assume-clean => Tell mdadm that the array pre-existed and is known to be clean.)

Check vs. Repair

As opposed to check, repair also includes a resync.

The difference from a normal resync is that no bitmap is used to optimize the process.

Track mismatch on RAID1

Fill up the entire disk (cat /dev/zero > bigfile)

Free the space again (rm bigfile)

Re-run a data check

echo check > /sys/block/mdX/md/sync_action

Checking

cat /sys/block/mdX/md/sync_action

cat /proc/mdstat
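
A hedged sketch tying these steps together; mdX is a placeholder for the real array:

echo check > /sys/block/mdX/md/sync_action

# wait until sync_action goes back to "idle", then read the mismatch count
while grep -q check /sys/block/mdX/md/sync_action; do sleep 60; done

cat /sys/block/mdX/md/mismatch_cnt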

 


When a RAID1 is down to a single disk (one HDD has died)

 

One day, after a reboot, the RAID1 partition was gone.

cat /proc/mdstat

...
md127 : inactive sde4[1](S)
      1919397888 blocks super 1.2

mdadm -D /dev/md127

/dev/md125:
           Version : 1.2
        Raid Level : raid0
     Total Devices : 1
       Persistence : Superblock is persistent

             State : inactive
   Working Devices : 1

              Name : kvm2:home
              UUID : 158378ca:3565a69f:431c5909:06143914
            Events : 234045

    Number   Major   Minor   RaidDevice

       -       8       68        -        /dev/sde4

mdadm -E /dev/sde4

/dev/sde4:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 158378ca:3565a69f:431c5909:06143914
           Name : kvm2:home
  Creation Time : Tue Sep  8 15:55:48 2020
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 3838795776 (1830.48 GiB 1965.46 GB)
     Array Size : 1919397888 (1830.48 GiB 1965.46 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=0 sectors
          State : clean
    Device UUID : fce7e1ec:153294b2:e1d93634:3f5b58e3

Internal Bitmap : 8 sectors from superblock
    Update Time : Fri May 14 10:06:57 2021
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 116304fa - correct
         Events : 234045


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)

Note: the situation above is the same as if the following command had been run:

mdadm -A /dev/md127 /dev/sdc1

 

====================================================

Solution 1: start the array read-only

# stop the array (mdX) first, to release its device (sdX)

mdadm -S /dev/md127

mdadm -A /dev/md127 /dev/sdc1

cat /proc/mdstat

md127 : inactive sdc1[2](S)
      1953381376 blocks super 1.2

# -o --readonly ; -R --run

mdadm --readonly --run /dev/md127

cat /proc/mdstat

md127 : active (auto-read-only) raid1 sdc1[2]
      1953381376 blocks super 1.2 [2/1] [U_]
      bitmap: 0/15 pages [0KB], 65536KB chunk

mdadm -D /dev/md127

/dev/md124:
           Version : 1.2
     Creation Time : Tue Sep  8 12:07:23 2020
        Raid Level : raid1
        Array Size : 31456256 (30.00 GiB 32.21 GB)
     Used Dev Size : 31456256 (30.00 GiB 32.21 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri May 14 10:06:58 2021
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : localhost:root
              UUID : b72004f7:f466e30b:0d134b02:eb29ec27
            Events : 561024

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       67        1      active sync   /dev/sde3

Since the RAID is read-only, mount it with ro:

mount -o ro /dev/md127 /mnt/raid

 

====================================================

Solution 2: add a new disk to the array

 

mdadm -A /dev/md127 /dev/sdc1

mdadm -R /dev/md127

mdadm /dev/md127 --add /dev/sda1

 


Hybrid HDD + SSD RAID1

 

Example: Create new Hybrid RAID

  • Internal SATA SSD drive: /dev/sda
  • External USB-connected HDD drive: /dev/sdb

mdadm --create --assume-clean /dev/md0 --level=1 --raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1

# --assume-clean

Since an SSD is used, --assume-clean is used to avoid an initial full write pass

# -W, --write-mostly

This is valid for RAID1 only and means that the 'md' driver will avoid reading from these devices(sdb) if at all possible.

This can be useful if mirroring over a slow link. (Hybrid HDD + SSD)

# --write-behind=N

valid for RAID1 only. write-behind is only attempted on drives marked as 'write-mostly'

set the maximum number of outstanding writes allowed. The default value is 256.
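
A hedged variant of the create command above that also enables write-behind (it requires a write-intent bitmap; the value 1024 is illustrative):

mdadm --create --assume-clean /dev/md0 --level=1 --raid-devices=2 \
  --bitmap=internal /dev/sda1 --write-mostly --write-behind=1024 /dev/sdb1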

Applying "--write-mostly" to an already-created RAID

ls -1d /sys/block/md124/md/dev-*

/sys/block/md124/md/dev-sda4
/sys/block/md124/md/dev-sdc4

cat /sys/block/md124/md/dev-*/state

in_sync
in_sync

# set a device to be write mostly with

echo writemostly > /sys/block/md124/md/dev-sda4/state

cat /proc/mdstat

should show a "(W)" after the write-mostly components.

# clear the write-mostly status with

echo -writemostly > /sys/block/md124/md/dev-sda4/state

 


Rename Array

 

-U, --update=

Update the superblock on each device while assembling the array.

Attributes that can be updated: name, uuid, ...

Procedure:

[Step 1] Check the metadata version

mdadm -D /dev/md124

Version : 1.2
...
              Name : kvm2.local:2T

[Step 2] Rename

mdadm --stop /dev/md124

# for metadata versions 1.0 or higher

mdadm -A --name=kvm2:home -U name /dev/md124

 


Creating a RAID5 with a missing disk

 

# a 3-disk RAID5 with 1 disk missing

mdadm --create --assume-clean \
  --level=5 --raid-devices=3 --verbose \
  --metadata=1.0 --chunk=512K --layout=left-symmetric \
  /dev/md0 /dev/loop0 /dev/loop1 missing

 * the order of loop0, loop1, missing must be respected

 

 


Help

 

  • man 4 md

 


Other

 

https://datahunter.org/synology_md