Last updated: 2021-03-23
Introduction
The software RAID implementation on Linux is called md.
It manages block devices through the Linux kernel driver "md_mod".
Features:
- combine several physical disks into one larger virtual device (RAID0, RAID5)
- performance improvements (RAID1, RAID10)
- redundancy (RAID1, RAID5 ...)
Linux software RAID supports the following RAID levels:
RAID configurations
- RAID 0 – Striping
- RAID 1 – Mirror
- RAID 5 – Parity distributed across all devices
- RAID 10 – Take a number of RAID 1 mirror sets and stripe across them RAID 0 style.
Non-RAID
- Linear
- Multipath
- Faulty
- Container
Contents
- Difference between DM and Multipath devices
- Installation
- Config File
- Array
- Command: mdadm usage
- Usage: creating a two-disk RAID1
- Checking RAID status
- Create RAID1 With "missing"
- Starting and stopping md
- Viewing RAID information
- Rebuild Speed Tuning
- Monitor Mode
- Manage Mode
- Routine maintenance (Manage Mode)
- Adding a spare disk
- Replacing a disk
- Health Check
- Performance
- mdmon
- Disable Auto Assemble MD
- Resetting a RAID
- Scrubbing the drives
- When RAID1 is down to one disk
- Hybrid HDD + SSD RAID1
- Rename Array
- Other
Difference between DM and Multipath devices
- RAID: /dev/md0 <-- managed by the kernel process md0_raid1
- MULTIPATH: /dev/dm-0
Installation
Debian 6:
apt-get install mdadm
One of the important steps here is running update-initramfs:
update-initramfs: Generating /boot/initrd.img-2.6.32-5-686
Files installed:
- /etc/cron.d/mdadm
- /sbin/mdadm
- /sbin/mdmon
Configuration files:
- /etc/mdadm/mdadm.conf
- /etc/default/mdadm
Version
mdadm -V
mdadm - v3.1.4 - 31st August 2010
Config File
Location
- /etc/mdadm.conf # CentOS 7 (create it by hand if it does not exist)
- /etc/mdadm/mdadm.conf # Debian
mdadm.conf
[1] DEVICE
Sets where to look for RAID member devices (i.e. /dev/sd*1) when attempting assembly
DEVICE partitions
Default setting: devices to scan are taken from /proc/partitions.
Any device found with an MD superblock will be assembled.
DEVICE /dev/sda* /dev/sdb1
Scan every partition of the specified devices
DEVICE /dev/disk/by-path/pci-*
This form is the best, because it pins members to specific ports
ll /dev/disk/by-path
total 0
lrwxrwxrwx 1 root root  9 Dec 28 15:41 pci-0000:00:1f.2-ata-1 -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 28 15:41 pci-0000:00:1f.2-ata-1.0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 28 15:41 pci-0000:00:1f.2-ata-1.0-part1 -> ../../sdd1
...
[2]
# Automatically create /dev/md* and set its permissions.
# "auto=yes" is equivalent to mdadm's "--auto".
CREATE owner=root group=disk mode=0660 auto=yes
Without this setting, every array must be listed explicitly with "ARRAY /dev/mdN metadata=1.2 name=HOST:NAME UUID=..."
[3]
# Default value for the "--homehost" option of the mdadm CLI
# (considered the home for any arrays)
# Used during auto-assembly and during create (stored in the metadata)
# <system> means: use gethostname
HOMEHOST <system>
[4]
#### Mail Setting ####
MAILFROM root@server
MAILADDR [email protected]    # who gets mail when an array has problems
PROGRAM
a program to be run when mdadm --monitor detects potentially interesting events on any of the arrays that it is monitoring.
Differences between metadata (superblock) versions
* Default: v1.2
0.90 # common format (superblock: 4K, 64K aligned block)
superblock location: At the end of the device
* Putting the superblock at the end of the device is dangerous
if you have any kind of auto-mounting/auto-detection/auto-activation of the raid contents;
1.x # superblock that is normally 1K long, but can be longer
# 1.x superblock on different locations on the device
- 1.0: near the end (at least 8K, and less than 12K, from the end)
- 1.1: At the start
- 1.2: 4K after the start (for 1.2) # advantage: the device can still have partitions
hexdump -C /dev/sdb3 | less
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
...
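The superblock magic 0xa92b4efc is stored little-endian on disk, which is why the hexdump above shows "fc 4e 2b a9" at offset 0x1000 (4K, as expected for metadata v1.2). A quick byte-order sanity check, a sketch using only coreutils:

```shell
# The v1.2 superblock magic 0xa92b4efc is written little-endian,
# so on disk the bytes appear reversed.
magic="a92b4efc"
# split into bytes (2 hex digits each), reverse their order, rejoin
le=$(echo "$magic" | fold -w2 | tac | tr -d '\n')
echo "$le"   # -> fc4e2ba9, matching the hexdump bytes fc 4e 2b a9
```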
ddf # Use the "Industry Standard" DDF (Disk Data Format) format defined by SNIA.
# When creating a DDF array a CONTAINER will be created, and normal arrays can be created in that container.
imsm # Use the Intel(R) Matrix Storage Manager metadata format.
Array
An assembled array:
ls -l /dev/md/500g
lrwxrwxrwx 1 root root 8 Jul 30 23:59 /dev/md/500g -> ../md127
The ARRAY lines identify actual arrays.
/etc/mdadm.conf
ARRAY /dev/md/root level=raid1 num-devices=2 UUID=...
Configuring ARRAY entries
An array is only assembled when ALL the listed identities match
- uuid= # 128 bit uuid in hexadecimal
- name= # name stored in the superblock
- level= # The value is a raid level
- devices= # listed there must also be listed on a DEVICE line
i.e.
# identify by devices (sda1, sdb1) and raid level (1)
ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1 level=1
# identify by metadata version and UUID
# If the name does not start with a slash ('/'), it is treated as being in /dev/md/
# The UUID is not the one obtained from blkid; it comes from "mdadm -E /dev/sd?"
ARRAY 500g metadata=1.2 UUID=938cab94:2fd92126:e1d6bfdf:cf91677e
# After modifying the config file, reload the daemon
service mdadm reload
Auto-assembling arrays (AUTO):
The AUTO line controls whether auto-assembly is performed; it is a list of entries.
The rules are:
- first match wins
- plus sign: the auto assembly is allowed
- minus sign: the auto assembly is disallowed
- no match: the auto assembly is allowed
metadata types: 0.90, 1.x, ddf, imsm
i.e.
# "all" is usually last
AUTO +1.x homehost -all
* AUTO takes effect when "mdadm -As" runs. -A=--assemble; -s=--scan
Disable mdadm automatic setup ARRAY
/etc/mdadm.conf
AUTO -all
OR
# <ignore>: an array matching the rest of the line will never be automatically assembled.
ARRAY <ignore> UUID=?:?:?:?
* if the superblock is tagged as belonging to the given home host,
it will automatically choose a device name and try to assemble the array.
udev
# Prevent events on /dev/sd{e,f} (e.g. a "touch") from triggering RAID assembly
# Back up the rule file instead of deleting it
mv /lib/udev/rules.d/64-md-raid-assembly.rules /root
udevadm control --reload-rules
Auto-assembling arrays at boot
When md is compiled into the kernel (not as module),
partitions of type 0xfd are scanned and automatically assembled into RAID arrays.
(suppressed with the kernel parameter "raid=noautodetect")
Checking: kernel & module
grep CONFIG_MD /boot/config-*
CONFIG_MD=y
CONFIG_MD_AUTODETECT=y
...
* CONFIG_MD_AUTODETECT only works for version 0.90 superblocks
* arrays that aren't started by auto-detect must be started by init scripts
Partitionable
The kernel parameter raid=partitionable (or raid=part) means that
all auto-detected arrays are assembled as partitionable.
The standard names for non-partitioned arrays
/dev/mdN /dev/md/N
The standard names for partitionable arrays
/dev/md/dNpM /dev/md_dNpM
Command: mdadm 的用法
usage:
mdadm [mode] <raiddevice> [options] <component-devices>
mode:
- Assemble(-A): Assemble the components of a previously created array into an active array.
- Build(-B): Build an array that doesn't have per-device metadata(similar to --create)
- Create(-C): A 'resync' process is started to make sure that the array is consistent
- Monitor mode(--monitor, -F): sends notifications when the state of a RAID changes
- Manage: adding new spares and removing faulty devices
- Grow(-G): grow (or shrink):
  1) active size of component devices
  2) number of active devices
  3) RAID level
- Auto-detect(--auto-detect): requests the Linux kernel to activate any auto-detected arrays
- Misc mode (-Q, --query, -D, --detail, -E, --examine, -R, --run, -S, --stop)
Common commands:
-A
# Re-assemble sdi1 and sdj1 (already RAID members) into md0
mdadm -A /dev/md0 /dev/sdi1 /dev/sdj1
mdadm: /dev/md0 has been started with 2 drives.
-s, --scan # not mode-specific
-F, --follow, --monitor # Select Monitor mode.
# scans /proc/mdstat for the required settings, e.g. which md? and sd? devices to monitor
mdadm -F --scan
Other options
-o, --readonly # Create, Assemble, Manage and Misc mode
Start the array read only rather than read-write as normal.
(no resync, recovery, or reshape)
-w, --readwrite
--zero-superblock
Usage: clearing metadata
mdadm --zero-superblock /dev/sdd1
Usage: creating a two-disk RAID1
Preparing the hard disks:
[Method A]
fdisk /dev/sde
Set the partition's system id:
press t
then enter fd <-- Linux raid auto
[Method B]
parted /dev/sde
(parted) mklabel msdos
(parted) mkpart pri 0% 100%
(parted) set 1 raid # parted /dev/sde set 1 raid
(parted) print # parted /dev/sde print
Number  Start   End    Size   Type     File system  Flags
 1      33.6MB  500GB  500GB  primary               raid
Creating the RAID1 array:
# To be safe, double-check the disks first
blkid /dev/{sde,sdf}1
# Create the RAID1
mdadm -C /dev/md0 -l 1 -n 2 /dev/sde1 /dev/sdf1
* Both disks start syncing immediately after creation
Opts
-C, --create # Create a new array
-n N, --raid-devices=N # number of active devices in the array
# Setting a value of 1 is probably a mistake and so requires that --force be specified first.
# omitting it gives the error "mdadm: no raid-devices specified."
-l, --level # 1 = RAID1
-N, --name= # give the array a name (stored in the superblock)
Other options:
- -x, --spare-devices=
Checking:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sde1[0] sdf1[1]
      488202944 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  5.0% (24566016/488202944) finish=416.1min speed=18566K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk
P.S.
# keep monitoring the raid rebuild status - monrebuild.sh
while true; do
    awk '/resync/{print $0}' /proc/mdstat
    sleep 1
done
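The finish estimate in an mdstat line is essentially (remaining blocks) / speed. A sketch of the arithmetic, using the numbers from the sample resync output above:

```shell
# finish ~= (total - done) / speed, taken from the sample mdstat line:
# resync = 5.0% (24566016/488202944) finish=416.1min speed=18566K/sec
done_blocks=24566016        # 1K blocks already resynced
total_blocks=488202944      # total 1K blocks
speed=18566                 # K/sec
remain=$(( total_blocks - done_blocks ))
echo "$(( remain / speed / 60 )) min"   # -> 416 min, matching finish=416.1min
```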
Saving the RAID configuration to mdadm.conf:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
Content appended:
ARRAY /dev/md0 metadata=1.2 name=home:0 UUID=55b256aa:ebec745e:0c0a2ab7:c92d4fb1
Notes
The following command has the same effect:
mdadm --examine --scan >> /etc/mdadm/mdadm.conf
Checking RAID status
cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      2095415 blocks super 1.2 [2/2] [UU]

unused devices: <none>    # <-- devices that are not part of any assembled array end up here
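A degraded array shows a `_` in the `[UU]` status field. A minimal health-check sketch; it is run here against a sample string, since a real `/proc/mdstat` needs a live array:

```shell
# A "_" inside the [UU] field means a member is missing/failed.
# Sample data stands in for a real /proc/mdstat here.
mdstat_sample='md0 : active raid1 sdc1[1] sdb1[0]
      2095415 blocks super 1.2 [2/1] [U_]'
echo "$mdstat_sample" | grep -q '\[U*_U*\]' && echo degraded || echo healthy
# -> degraded
```

Against a healthy `[UU]` line the same pipeline prints "healthy".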
Create RAID1 With "missing"
# keyword "missing" is specified for the first device: this will be added later.
mdadm --create /dev/md0 --name softraid2t --level=1 --raid-devices=2 missing /dev/sdg1
# Verify that the RAID array
cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdg1[1]
7813893120 blocks super 1.2 [2/1] [_U]
bitmap: 0/59 pages [0KB], 65536KB chunk
# Add disk partition to array
mdadm /dev/md0 --add /dev/sdf1
mdadm: added /dev/sdf1
* If a device is given before any options, or
if the first option is one of --add, --re-add, --add-spare, --fail, --remove, or --replace,
then the MANAGE mode is assumed.
# Verify that the RAID array is being rebuilt
cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdf1[2] sdg1[1]
      7813893120 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (4175872/7813893120) finish=623.3min speed=208793K/sec
      bitmap: 28/59 pages [112KB], 65536KB chunk

unused devices: <none>
Starting and stopping md
Start md:
Method 1: start the RAID automatically
mdadm -As
mdadm: /dev/md0 has been started with 2 drives.
-A, --assemble # Assemble a pre-existing array.
-s, --scan # Scan config file or /proc/mdstat for missing information.
Method 2: manually
# -A, --assemble # Assemble a pre-existing array.
mdadm -A /dev/md0 /dev/usbdisk/WD4T-K3G5AD6B /dev/usbdisk/WD4T-NHGNXBVY
mdadm: /dev/md0 has been started with 2 drives.
Use UUID For Assemble:
Assemble
-u, --uuid= # uuid of array to assemble. Devices which don't have this uuid are excluded
When a device does not carry the specified uuid, assembly reports the following error:
mdadm: /dev/usbdisk/WD4T-K3G5AD6B has wrong uuid.
mdadm: /dev/usbdisk/WD4T-NHGNXBVY has wrong uuid.
Checking
blkid /dev/sda2
/dev/sda2: UUID="2c58db70-dca0-d769-7127-8d9cdcfe0d9b" UUID_SUB="ace8c810-31ee-ce5f-ec4b-4ce874f091de" LABEL="localhost:boot" TYPE="linux_raid_member" PARTUUID="4f2c6ff1-02"
mdadm -E /dev/sda2
Array UUID : 2c58db70:dca0d769:71278d9c:dcfe0d9b
      Name : localhost:boot
...
Device UUID : ace8c810:31eece5f:ec4b4ce8:74f091de
UUID = Array UUID
UUID_SUB = Device UUID
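blkid and mdadm print the same 128-bit Array UUID, just grouped differently (8-4-4-4-12 with dashes vs. 4x8 with colons). A conversion sketch; the tr/sed pipeline is just an illustration, not an mdadm feature:

```shell
# Convert the blkid form of a linux_raid_member UUID to the mdadm form:
# strip dashes, then regroup into four 8-hex-digit fields joined by ":".
blkid_uuid="2c58db70-dca0-d769-7127-8d9cdcfe0d9b"
mdadm_uuid=$(echo "$blkid_uuid" | tr -d '-' | sed 's/.\{8\}/&:/g; s/:$//')
echo "$mdadm_uuid"   # -> 2c58db70:dca0d769:71278d9c:dcfe0d9b
```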
blkid /dev/md125
/dev/md125: UUID="f721d037-2c40-4ce6-91a2-4eb39c00cd8b" BLOCK_SIZE="1024" TYPE="ext4"
Use Name For Assemble
-N, --name=
This must be the name that was specified when creating the array.
It must either match the name stored in the superblock exactly,
or it must match with the current homehost prefixed to the start of the given name.
-S, --stop # deactivate array, releasing all resources
i.e.
mdadm -S /dev/md0
mdadm: stopped /dev/md0
* After stopping, the entry in /proc/mdstat and /dev/md0 both disappear
Viewing RAID information
Misc mode:
- -Q, --query # Examine a device to see if it is an md device and if it is a component of an md array.
- -D, --detail # Print details of one or more md devices.
- -E, --examine # Print contents of the metadata stored on the named device
Query(-Q):
mdadm -Q /dev/sda1
/dev/sda1: is not an md array
mdadm -Q /dev/sdb1
/dev/sdb1: is not an md array
/dev/sdb1: device 0 in 2 device unknown raid1 array. Use mdadm --examine for more detail.
mdadm -Q /dev/md0
/dev/md0: 2046.30MiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.
Detail(-D):
mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Feb 14 17:38:31 2013
     Raid Level : raid1
     Array Size : 2095415 (2046.65 MiB 2145.70 MB)
  Used Dev Size : 2095415 (2046.65 MiB 2145.70 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
    Update Time : Thu Feb 14 17:52:03 2013
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
 Rebuild Status : 50% complete    # the rebuild is much slower while the device is mounted
           Name : debian3:0  (local to host debian3)
           UUID : 938cab94:2fd92126:e1d6bfdf:cf91677e
         Events : 17

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
Examine(-E):
# Print contents of the metadata stored on the named device(s).
mdadm -E /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 938cab94:2fd92126:e1d6bfdf:cf91677e
           Name : debian3:0  (local to host debian3)
  Creation Time : Thu Feb 14 17:38:31 2013
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 4190854 (2046.66 MiB 2145.72 MB)
     Array Size : 4190830 (2046.65 MiB 2145.70 MB)
  Used Dev Size : 4190830 (2046.65 MiB 2145.70 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 833f1527:7789fd72:1c8feb2f:2f175385
    Update Time : Thu Feb 14 17:53:47 2013
       Checksum : 57c1b402 - correct
         Events : 19
    Device Role : Active device 0    # can also be "Device Role : spare"
    Array State : AA ('A' == active, '.' == missing)
Difference between Detail and Examine
--examine applies to devices which are components of an array
--detail applies to a whole array which is currently active.
Rebuild Speed Tuning
Tip #1: speed_limit_min and speed_limit_max
Unit: KiB per second
speed_limit_min
# reflects the current "goal" rebuild speed. The default is 1000.
# Typical USB speeds:
#   USB2.0 ~ 30 MByte/s
#   USB3.0 ~ 107 MByte/s
echo 70000 > /proc/sys/dev/raid/speed_limit_min
speed_limit_max
# Default: 200000
/proc/sys/dev/raid/speed_limit_max
Checking the values:
sysctl dev.raid.speed_limit_min
sysctl dev.raid.speed_limit_max
# to persist across reboots, put e.g. "dev.raid.speed_limit_min = 70000" in /etc/sysctl.conf
Checking the current rebuild speed
iostat sda sdb 1
          sda            sdb            cpu
 kps  tps svc_t   kps  tps svc_t   us sy wt id
16498  71  11.5  16493  71  11.5   18 65  3 13
13117  53   9.8  13307  54  10.2    3 20  0 77
26451  80  11.5  26515  80  11.6    7 73 11  9
....
Tip #2: Bitmap Option
Bitmaps optimize rebuild time after a crash, or after removing and re-adding a device.
# Turn it on by typing the following command: "--bitmap=?"
# Type=internal: the bitmap is stored with the metadata on the array
# When creating an array on devices which are 100G or larger,
# mdadm automatically adds an internal bitmap as it will usually be beneficial.
mdadm --grow --bitmap=internal /dev/md0
# Once array rebuild or fully synced, disable bitmaps:
mdadm --grow --bitmap=none /dev/md0
write-intent bitmap
When an array has a write-intent bitmap, a spindle (a device, often a hard drive) can be removed and re-added;
then only the blocks that changed since the removal (as recorded in the bitmap) will be resynced.
Therefore a write-intent bitmap reduces rebuild/recovery (md sync) time if:
- the machine crashes (unclean shutdown)
- one spindle is disconnected, then reconnected
A write-intent bitmap:
* does not improve performance
* can be removed/added at any time
* may cause a degradation in write performance; it varies with:
  - the size of the chunk of data (on the RAID device) mapped to each bit in the bitmap, as shown by cat /proc/mdstat
  - the ratio (bitmap size / RAID device size)
  - the workload profile (long sequences of writes are more impacted, as spindle heads go back and forth between the data zone and the bitmap zone)
--bitmap-chunk=
Set the chunk size of the bitmap. Each bit corresponds to that many kilobytes of storage.
When using an internal bitmap, the chunk size defaults to 64M
A bitmap cannot be added while a rebuild is running:
mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
mdadm: failed to set internal bitmap.
When you first create a raid1 (mirrored) array from two drives,
mdadm insists on mirroring the contents of the first drive to the second, even if the drives are entirely blank.
i.e.
md0 : active raid1 sdd1[0] sde1[1]
3906885632 blocks super 1.2 [2/2] [UU]
[=====>...............] resync = 27.6% (1080482688/3906885632) finish=492.6min speed=95609K/sec
bitmap: 22/30 pages [88KB], 65536KB chunk
If it's 22/30 that means there are 22 of 30 pages allocated in the in-memory bitmap.
The pages are allocated on demand, and get freed when they're empty (all zeroes).
in-memory bitmap allows bitmap operations to be more efficient
The in-memory bitmap uses 16 bits for each bitmap chunk to count all ongoing writes to the chunk,
so it's actually up to 16 times larger than the on-disk bitmap.
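As a worked example (assuming the usual 4 KiB page size), the "bitmap: 4/4 pages [16KB], 65536KB chunk" line from the earlier 488202944-block sample array can be reproduced from these rules:

```shell
# 488202944 1K-blocks with a 65536KB bitmap chunk -> number of chunks;
# the in-memory bitmap keeps a 16-bit (2-byte) counter per chunk,
# rounded up to whole 4KB pages.
array_kb=488202944
chunk_kb=65536
chunks=$(( (array_kb + chunk_kb - 1) / chunk_kb ))
bytes=$(( chunks * 2 ))
pages=$(( (bytes + 4095) / 4096 ))
echo "$chunks chunks -> $pages pages"   # -> 7450 chunks -> 4 pages (= 16KB)
```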
Tip #3: Set read-ahead option
Testing showed no speed-up from this
Unit: N sectors (512-byte per sectors)
blockdev --getra /dev/md127
256 # 128 KB
blockdev --setra 65536 /dev/md127 # 32 MB
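blockdev read-ahead values are counted in 512-byte sectors, so converting the two values above is plain arithmetic:

```shell
# blockdev --getra/--setra values are in 512-byte sectors.
ra=256
echo "$(( ra * 512 / 1024 )) KB"          # -> 128 KB (the default shown above)
ra=65536
echo "$(( ra * 512 / 1024 / 1024 )) MB"   # -> 32 MB
```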
Monitor Mode
mdadm --monitor options... devices..
By default a monitor daemon runs in the background:
/sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
start / stop:
/etc/init.d/mdadm start | stop
- --syslog Cause all events to be reported through 'syslog'
- -f, --daemonise
Only the following events cause an e-mail to be sent:
- Fail
- FailSpare
- DegradedArray
- SparesMissing
- TestMessage
Other options:
- -t, --test Send a test e-mail
Example:
mdadm -F -1 -t -s
- -F => --monitor
- -1 => --oneshot # Check arrays only once.
- -t => --test # Generate a TestMessage alert for every array found at startup.(mail)
- -s => --scan #
Config
/etc/default/mdadm
# AUTOCHECK:
# should mdadm run periodic redundancy checks over your arrays?
# See /etc/cron.d/mdadm.
AUTOCHECK=false

# START_DAEMON:
# should mdadm start the MD monitoring daemon during boot?
START_DAEMON=false
Manage Mode
- -t, --test
- -a, --add
- -r, --remove
- -f, --fail
- --re-add just updates the blocks that have changed since the device was removed.
Policy
specify what automatic behavior is allowed on devices newly appearing in the system and
provides a way of marking spares that can be moved to other arrays as well as the migration domains.
Domain can be defined through policy line by specifying a domain name for a number of paths from /dev/disk/by-path/.
A device may belong to several domains.
Routine maintenance (Manage Mode)
- Adding a spare disk
- Rescuing an array
Adding a spare disk
-a, --add # (case 1) If a device appears to have recently been part of the array (possibly it failed or was removed),
# --add behaves like --re-add
# (case 2) If that fails, or the device was never part of the array, the device is added as a hot-spare.
# If the array is degraded, it will immediately start to rebuild data onto that spare.
--add-spare # Add a device as a spare. This is similar to --add except that it does not attempt --re-add first.
i.e.
mdadm /dev/md0 -a /dev/sdd1
mdadm: added /dev/sdd1
cat /proc/mdstat
md0 : active raid1 sdd1[2](S) sdb1[0] sdc1[1]
2095415 blocks super 1.2 [2/2] [UU]
unused devices: <none>
Replacing a disk
-f, --fail # Mark listed devices as faulty
-r, --remove # They must not be active. (failed or spare)
--re-add # re-add a device that was previously removed from an array.
# The recovery may only need the sections flagged in the write-intent bitmap to be recovered
--replace # Mark listed devices as requiring replacement.
# As soon as a spare is available, it will be rebuilt and will replace the marked device.
i.e. putting a failed/offline sda1 back into the RAID
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sda1[1](F)
1953381376 blocks super 1.2 [2/1] [U_]
bitmap: 5/15 pages [20KB], 65536KB chunk
mdadm /dev/md0 --re-add /dev/sda1
mdadm: re-add /dev/sda1 to md0 succeed
cat /proc/mdstat # "--re-add" is much faster than "--replace"
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sda1[1]
      1953381376 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  3.2% (63370624/1953381376) finish=275.4min speed=114352K/sec
      bitmap: 5/15 pages [20KB], 65536KB chunk
i.e. replacing sdc1 with sdd1
# When the newly added sdd1 carries no metadata, mdadm adds it to the array and rebuilds onto it
mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1 -a /dev/sdd1
Rescuing an array
-R, --run start a partially assembled array
i.e.
mdadm -A /dev/md0 /dev/loop0p3 --run
mdadm: /dev/md0 has been started with 1 drive (out of 2).
Health Check
At boot time:
md: Autodetecting RAID arrays.
....
md: created md1
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
md: kicking non-fresh sdb3 from array!
md: unbind<sdb3>
md: export_rdev(sdb3)
This can happen after an unclean shutdown (like a power fail).
Usually removing and re-adding the problem devices will correct the situation:
Fix
mdadm /dev/md0 -a /dev/sdb3
Trigger a full check of md0 with
echo check > /sys/block/md0/md/sync_action
dmesg
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 4192896 blocks.
... some time later ...
md: md0: sync done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
cat /proc/mdstat
md0 : active raid1 sdb2[1] sda2[0]
      4192896 blocks [2/2] [UU]
      [=========>...........]  resync = 48.4% (2034624/4192896) finish=0.5min speed=61875K/sec
... some time later ...
md0 : active raid1 sdb2[1] sda2[0]
      4192896 blocks [2/2] [UU]
P.S.
If a read error is encountered, the block in error is calculated and written back.
If the array is a mirror, as it can't calculate the correct data,
it will take the data from the first (available) drive and write it back to the dodgy drive.
# This will stop any check or repair that is currently in progress.
echo idle > /sys/block/mdX/md/sync_action
# Only valid for parity raids - this will also check the integrity of the data as it reads it,
# and rewrite a corrupt stripe.
# It will terminate immediately without doing anything if the array is degraded, as it cannot recalculate the faulty data.
# DO NOT run this on raid-6 without making sure that it is the correct thing to do.
# There is a utility "raid6check" that you should use if "check" flags data errors on a raid-6.
echo repair > /sys/block/mdX/md/sync_action
Performance
single & multiple speed
The Linux implementation of RAID1 can double disk read throughput,
as long as two separate disk read operations are performed at a time.
A single stream of sequential input will not be accelerated (e.g. a single dd),
but multiple sequential streams or a random workload will use more than one spindle.
In theory, having an N-disk RAID1 will allow N sequential threads to read from all disks.
Test:
#1: A single stream
sync && echo 3 > /proc/sys/vm/drop_caches
COUNT=1000;
dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &
#2: multiple sequential streams
# the bonnie++ benchmarking tool doesn't perform two separate reads at one time,
# so two concurrent dd streams are used instead
sync && echo 3 > /proc/sys/vm/drop_caches
COUNT=1000
dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &
dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT skip=$COUNT &
Doc
man 4 md
RAID 10 instead of RAID 1
mdadm --create /dev/md64 --level=10 --metadata=1.2 --raid-devices=4 \
    /dev/sda4 /dev/sdb4 \
    /dev/sda5 /dev/sdb5
Even in the test-1 (single stream) case this reads better than RAID1,
because two RAID1 pairs are striped RAID0-style, and RAID0 speeds up sequential reads.
RAID 1 Speed Test
# T1: monitor
dstat -d -D sda,sdb
# T2: write speed: 20?M
echo 1 > /proc/sys/vm/drop_caches
pv /dev/zero > test.bin # data is written to both disks at the same time
# T2: read speed: 20?M
echo 1 > /proc/sys/vm/drop_caches
pv test.bin > /dev/null # data is read from only one of the disks
# T2: read & write at same time by cp, rsync: 5?M
echo 1 > /proc/sys/vm/drop_caches
cp test.bin test2.bin
rsync test.bin test2.bin # no difference
Summary: in this test, an ordinary HDD delivers only 30~40% of its normal performance when reading and writing at the same time
# T2: read & write at same time by dd
# prepare the file
pv /dev/zero > test.bin
# run the test
echo 1 > /proc/sys/vm/drop_caches
pv /dev/zero > test1.bin
pv test.bin > /dev/null
Attempts to improve the copy speed
Readahead
# e.g. 8192 sectors x 512 bytes = 4 MiB
blockdev --getra /dev/md127
# 8388608 sectors x 512 bytes = 4 GiB
blockdev --setra 8388608 /dev/md127
Conclusion: no improvement
deadline scheduler
# sda and sdb are the RAID1 member disks
echo deadline > /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
Conclusion: no improvement
Conclusion
An HDD only reaches its top speed with sequential R/W ("hdparm -t /dev/sda")
mdmon
Introduction:
In general it is never run by hand
Function:
mdmon polls the sysfs namespace looking for changes in array_state
Usage:
mdmon [--all] [--takeover] CONTAINER
CONTAINER: The container device to monitor. (/dev/md/container)
--all
This tells mdmon to find any active containers and start monitoring each of them
--takeover
instructs mdmon to replace any active mdmon which is currently monitoring the array.
Disable Auto Assemble MD
At boot:
/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT=" ... raid=noautodetect"
update-grub
grep noautodetect /boot/grub/grub.cfg
At hotplug:
mkdir /lib/udev/rules.d_disable
mv /lib/udev/rules.d/64-md-*.rules /lib/udev/rules.d_disable
udevadm control --reload
P.S.
Adding or removing partitions on an HDD also triggers these rules
Resetting a RAID
Scenario: moving the disks of RAID group 1 to group 2
# --zero-superblock: You can make drives forget they were in a RAID by zeroing out their md superblocks.
mdadm --zero-superblock /dev/sdd1
Scrubbing the drives
Checks
For RAID1 and RAID10
It compares the corresponding blocks of each disk in the array.
For RAID4, RAID5, RAID6
this means checking that the parity block is (or blocks are) correct.
If a read error
If a read error is detected during this process,
the normal read-error handling causes correct data to be found from other devices
and to be written back to the faulty device.
mismatch (not read-error)
If all blocks read successfully but are found to not be consistent, then this is regarded as a mismatch.
/sys/block/mdX/md/mismatch_cnt
This is set to zero when a scrub starts and is incremented whenever a sector is found that is a mismatch.
A value of 128 could simply mean that a single 64KB check found an error (128 x 512bytes = 64KB).
(128 => it does not determine exactly how many actual sectors were affected)
(md normally works in units(128) much larger than a single sector (512byte))
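The 128 -> 64KB relation above is plain sector arithmetic:

```shell
# mismatch_cnt counts 512-byte sectors, but md checks in larger windows,
# so a value of 128 may come from a single 64KB check window.
cnt=128
echo "$(( cnt * 512 / 1024 )) KB"   # -> 64 KB
```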
If check was used,
then no action is taken to handle the mismatch, it is simply recorded.
If repair was used,
then a mismatch will be repaired in the same way that resync repairs arrays.
For RAID5/RAID6 new parity blocks are written.
On a truly clean RAID5 or RAID6 array, any mismatches should indicate a hardware problem at some level
For RAID1/RAID10, all but one block are overwritten with the content of that one block.
On RAID1 and RAID10 it is possible for software issues to cause a mismatch to be reported.
1. if there's a power outage or
2. if you have memory-mapped files like swap files
3. If an array is created with "--assume-clean" (avoid the initial resync) # not recommended
then a subsequent check could be expected to find some mismatches. (unused space)
(--assume-clean => Tell mdadm that the array pre-existed and is known to be clean.)
Check vs. Repair
As opposed to check, repair also includes a resync.
The difference from a resync is that no bitmap is used to optimize the process.
Track mismatch on RAID1
Fill up the entire disk (cat /dev/zero > bigfile)
Free the space again (rm bigfile)
Re-run a data check
echo check > /sys/block/mdX/md/sync_action
Checking
cat /sys/block/mdX/md/sync_action
cat /proc/mdstat
When RAID1 is down to one disk (one HDD has died)
One day, after a reboot, the RAID1 partition was gone.
cat /proc/mdstat
...
md127 : inactive sde4[1](S)
1919397888 blocks super 1.2
mdadm -D /dev/md127
/dev/md127:
        Version : 1.2
     Raid Level : raid0
  Total Devices : 1
    Persistence : Superblock is persistent
          State : inactive
Working Devices : 1
           Name : kvm2:home
           UUID : 158378ca:3565a69f:431c5909:06143914
         Events : 234045

    Number   Major   Minor   RaidDevice
       -       8       68        -        /dev/sde4
mdadm -E /dev/sde4
/dev/sde4:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 158378ca:3565a69f:431c5909:06143914
           Name : kvm2:home
  Creation Time : Tue Sep  8 15:55:48 2020
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 3838795776 (1830.48 GiB 1965.46 GB)
     Array Size : 1919397888 (1830.48 GiB 1965.46 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=0 sectors
          State : clean
    Device UUID : fce7e1ec:153294b2:e1d93634:3f5b58e3
Internal Bitmap : 8 sectors from superblock
    Update Time : Fri May 14 10:06:57 2021
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 116304fa - correct
         Events : 234045
    Device Role : Active device 1
    Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
Notes: the situation above is equivalent to having run the following command
mdadm -A /dev/md127 /dev/sdc1
====================================================
Solution 1: start the array read-only
# First stop the array (mdX) to release its device (sdX)
mdadm -S /dev/md127
mdadm -A /dev/md127 /dev/sdc1
cat /proc/mdstat
md127 : inactive sdc1[2](S)
      1953381376 blocks super 1.2
# -o --readonly ; -R --run
mdadm --readonly --run /dev/md127
cat /proc/mdstat
md127 : active (auto-read-only) raid1 sdc1[2]
      1953381376 blocks super 1.2 [2/1] [U_]
      bitmap: 0/15 pages [0KB], 65536KB chunk
mdadm -D /dev/md127
/dev/md124:
           Version : 1.2
     Creation Time : Tue Sep  8 12:07:23 2020
        Raid Level : raid1
        Array Size : 31456256 (30.00 GiB 32.21 GB)
     Used Dev Size : 31456256 (30.00 GiB 32.21 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent
     Intent Bitmap : Internal
       Update Time : Fri May 14 10:06:58 2021
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : bitmap
              Name : localhost:root
              UUID : b72004f7:f466e30b:0d134b02:eb29ec27
            Events : 561024

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       67        1      active sync   /dev/sde3
Since the RAID is read-only, mount it with ro
mount -o ro /dev/md127 /mnt/raid
====================================================
Solution 2: add a new disk to the array
mdadm -A /dev/md127 /dev/sdc1
mdadm -R /dev/md127
mdadm /dev/md127 --add /dev/sda1
Hybrid HDD + SSD RAID1
Example: Create new Hybrid RAID
- Internal SATA SSD drive: /dev/sda
- External USB-connected HDD drive: /dev/sdb
mdadm --create --assume-clean /dev/md0 --level=1 --raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1
# --assume-clean
Since an SSD is used, --assume-clean avoids writing the whole device once during the initial sync
# -W, --write-mostly
This is valid for RAID1 only and means that the 'md' driver will avoid reading from these devices(sdb) if at all possible.
This can be useful if mirroring over a slow link. (Hybrid HDD + SSD)
# --write-behind=N
valid for RAID1 only. write-behind is only attempted on drives marked as 'write-mostly'
set the maximum number of outstanding writes allowed. The default value is 256.
Applying "--write-mostly" to an already-created RAID
ls -1d /sys/block/md124/md/dev-*
/sys/block/md124/md/dev-sda4
/sys/block/md124/md/dev-sdc4
cat /sys/block/md124/md/dev-*/state
in_sync
in_sync
# set a device to be write mostly with
echo writemostly > /sys/block/md124/md/dev-sda4/state
cat /proc/mdstat
should show a "(W)" after the HDD components.
# clear the write-mostly status with
echo -writemostly > /sys/block/md124/md/dev-sda4/state
Rename Array
-U, --update=
Update the superblock on each device while assembling the array.
Attributes that can be updated: name, uuid, ...
Steps:
[Step 1] Check the metadata version
mdadm -D /dev/md124
Version : 1.2
...
Name : kvm2.local:2T
[Step 2] Rename
mdadm --stop /dev/md124
# For metadata version 1.0 or higher
mdadm -A --name=kvm2:home -U name /dev/md124
Re-creating a RAID5 with one disk missing
# a 3-disk RAID5 with 1 disk missing
mdadm --create --assume-clean \
--level=5 --raid-devices=3 --verbose \
--metadata=1.0 --chunk=512K --layout=left-symmetric \
/dev/md0 /dev/loop0 /dev/loop1 missing
* The order of loop0, loop1, missing must match the original array exactly
Help
- man 4 md
Other
https://datahunter.org/synology_md