Last updated: 2021-02-10
Table of Contents
- Network Setting (Firewall)
- Internal meta data
- Basic Setup
- Creating the Resource
- Checking Status
- Enabling and Disabling Resource
- Change the Primary node(Manual Failover)
- Reconfiguring Setting / Resource
- Online Device Verification
- Sync data speed
- Checksum-based Synchronization
- Replication traffic integrity checking (For debug)
- I/O error handling strategies
- Growing on-line
- Disabling backing device flushes
- Dealing with hard drive failure
- Using Truck based replication(Disk shipping)
Network Setting (Firewall)
* Each resource needs its own dedicated TCP port (7788, 7789, 7790 ...)
* The connection still comes up even if only one side opens the firewall port (i.e. drbd-a has the port open, drbd-b does not), but it is best to open it on both sides!
My Setup
drbd-a(eth0) --(eth1)--> | Switch-Router | <--(eth1)-- drbd-b(eth0)
drbd-a
- eth0: 192.168.88.31
- eth1: 10.0.0.1/30
drbd-b
- eth0: 192.168.88.32
- eth1: 10.0.0.2/30
On both nodes
firewall-cmd --permanent --add-rich-rule='rule family="ipv4"
source address="10.0.0.0/30" port port="7789" protocol="tcp" accept'
firewall-cmd --reload
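Since each resource listens on its own TCP port (see above), multiple resources need multiple ports. A hedged sketch for opening a whole port range in one rule; the range 7788-7790 is only an example for three hypothetical resources:
# open ports for three resources (e.g. r0..r2 on 7788-7790); adjust the range to your setup
firewall-cmd --permanent --add-rich-rule='rule family="ipv4"
source address="10.0.0.0/30" port port="7788-7790" protocol="tcp" accept'
firewall-cmd --reload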
Remark
The LISTEN socket is only visible while the status shows "peer connection:Connecting"; once the connection is established it no longer shows up.
netstat -ntlp | grep 7789
tcp 0 0 10.0.0.1:7789 0.0.0.0:* LISTEN -
After replication:Established
watch -n1 'netstat -ntp'
tcp 0 0 10.0.0.2:53550 10.0.0.1:7789 ESTABLISHED -
tcp 0 0 10.0.0.2:36675 10.0.0.1:7789 ESTABLISHED -
OS Preparation
/etc/sysconfig/network-scripts/ifcfg-eth1
...
IPADDR=10.0.0.1
NETMASK=255.255.255.252
MTU=9000
/etc/hostname
drbd-a
/etc/hosts
# drbd host
192.168.88.31 drbd-a
192.168.88.32 drbd-b
/etc/sysconfig/grub
GRUB_CMDLINE_LINUX=" ... ipv6.disable=1 net.ifnames=0 biosdevname=0"
grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
reboot
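Before configuring DRBD it is worth confirming that the dedicated eth1 link really carries jumbo frames end to end; a quick sanity check from drbd-a, assuming the MTU=9000 setup above (8972 = 9000 minus 28 bytes of IP/ICMP headers):
# confirm the MTU actually took effect
ip link show eth1 | grep mtu
# ping drbd-b's eth1 with a non-fragmentable, jumbo-sized packet
ping -M do -s 8972 -c 3 10.0.0.2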
Basic Setup (Single-primary mode)
The file format was designed to allow a verbatim copy of the file on both nodes of the cluster.
Create the configuration files:
/etc/drbd.conf
include "drbd.d/global_common.conf";
include "drbd.d/*.res";
global_common.conf
# V8
global {
# usage statistics for the DRBD project
usage-count no;
}
common {
net {
protocol B;
}
}
MyResource.res
# resource "r0" with a single volume 0
resource r0 {
on drbd-a {
# the device name can also be given in another form, drbd_X
device /dev/drbd1;
disk /dev/sdb1;
address 10.0.0.1:7789;
meta-disk internal;
}
on drbd-b {
device /dev/drbd1;
disk /dev/sdb1;
address 10.0.0.2:7789;
meta-disk internal;
}
}
Notes
Resource
Any resource is a replication group consisting of one or more volumes that share a common replication stream.
In DRBD, every resource has a role, which may be Primary or Secondary.
Hostname
The hostnames of the two machines must match these entries, as returned by uname -n
address
the IP address (and port) used to wait for incoming connections from the partner, and to reach the partner
Internal meta data
meta data at the end of the device
Drawback:
write performance suffers (two additional movements of the hard disk's read/write head)
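For comparison, external meta data avoids those extra head movements by keeping the meta data on a separate device, at the cost of managing one more partition. A minimal sketch, assuming a dedicated /dev/sdc1 exists on both nodes (hypothetical device name):
resource r0 {
on drbd-a {
device /dev/drbd1;
disk /dev/sdb1;
address 10.0.0.1:7789;
# external meta data on its own partition
meta-disk /dev/sdc1;
}
...
}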
Protocol
Protocol = replication modes
Protocol A - Asynchronous replication protocol
Local write operations on the primary node are considered completed as soon as the local disk write has finished
and the replication packet has been placed in the local TCP send buffer.
Most often used in long-distance replication scenarios.
Protocol B - Memory synchronous replication protocol (semi-synchronous)
Local write operations on the primary node are considered completed as soon as the local disk write has occurred
and the replication packet has reached the peer node.
Data is lost only if both nodes suffer a simultaneous power failure and, concurrently, the primary's storage is irreversibly destroyed.
Protocol C - Synchronous replication protocol
Local write operations are considered completed only after both the local and the remote disk write(s) have been confirmed.
Remark
1. Setting the protocol for an individual resource
resource another-resource-name {
net {
protocol C;
...
}
...
}
2. Configuring multiple volumes in a single resource
MyResource.res
resource r0 {
volume 0 {
device /dev/drbd1;
disk /dev/sdb1;
meta-disk internal;
}
volume 1 {
device /dev/drbd2;
disk /dev/sdc1;
meta-disk internal;
}
on drbd-a {
address 10.0.0.1:7789;
}
on drbd-b {
address 10.0.0.2:7789;
}
}
Creating the Resource
Steps:
- Create device(sdb1) metadata (On Both Node)
- Attach to backing device (On Both Node)
- Start the initial full synchronization (On Primary Node)
1. Create the metadata on the device (sdb1) # On both nodes
# CLI: drbdadm create-md <resource_name> # md = metadata
i.e.
drbdadm create-md r0
initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
blkid /dev/sdb1 # no output at all !!
drbdadm status
# No currently configured DRBD found.
drbdadm create-md r0
You want me to create a v09 style flexible-size internal meta data block.
There appears to be a v09 flexible-size internal meta data block
already in place on /dev/vdb1 at byte offset 10736365568
Do you really want to overwrite the existing meta-data?
[need to type 'yes' to confirm]
2. Enable the resource: # On both nodes
drbdadm up <resource>
# roughly equivalent to running the following 2 steps:
- drbdadm attach <resource> # cannot be run directly on a brand-new resource
- drbdadm connect <resource>
See also: drbd_cli
i.e. run on drbd-a
drbdadm up r0
lsblk
vdb           252:16   0   10G  0 disk
└─vdb1        252:17   0   10G  0 part
  └─drbd1     147:1    0   10G  0 disk
drbdadm status
r0 role:Secondary
disk:Inconsistent
drbd-b connection:Connecting
connection: StandAlone -> Unconnected -> Connecting
If it reaches the peer but the connection fails because the settings do not match -> StandAlone
After fixing the settings, to retry you must run down and then up once on both nodes
After a successful connection:
r0 role:Secondary
disk:Inconsistent
cs2 role:Secondary
peer-disk:Inconsistent
3. Start the Initial full synchronization # On ONE node only
* On initial resource configuration
* On the node you selected as the synchronization source
drbdadm primary --force <resource>
i.e.
# On primary node (drbd-a)
drbdadm primary --force r0
drbdadm status
r0 role:Primary
disk:UpToDate
drbd-b role:Secondary
replication:SyncSource peer-disk:Inconsistent done:2.39
After a while
drbdadm status
r0 role:Primary
disk:UpToDate
drbd-b role:Secondary
peer-disk:UpToDate
Disabling the resource:
drbdadm down resource # this also has the effect of demoting the resource
# equivalent to
- drbdadm disconnect resource
- drbdadm detach resource
ie.
drbdadm down r0
lsblk
vdb           252:16   0   10G  0 disk
└─vdb1        252:17   0   10G  0 part
Checking Status
drbdadm status [resource]
Or
cat /proc/drbd
# the following information is only shown in V8
version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2020-04-05 02:58:18
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r-----
    ns:55908 nr:0 dw:0 dr:31456284 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
- cs (connection state)    # drbdadm cstate r0
- ro (roles)               # drbdadm role r0
- ds (disk states)         # drbdadm dstate r0
- ns (network send)        # Volume of net data sent to the partner via the network connection; in Kibyte.
- nr (network receive)     # Volume of net data received by the partner via the network connection; in Kibyte.
- dw (disk write)          # Net data written on local hard disk; in Kibyte.
- dr (disk read)           # Net data read from local hard disk; in Kibyte.
- al (activity log)        # Number of updates of the activity log area of the meta data.
- bm (bit map)             # Number of updates of the bitmap area of the meta data.
- lo (local count)         # Number of open requests to the local I/O sub-system issued by DRBD.
- pe (pending)             # Number of requests sent to the partner, but that have not yet been answered by the latter.
- ua (unacknowledged)      # Number of requests received by the partner via the network connection, but that have not yet been answered.
- ap (application pending) # Number of block I/O requests forwarded to DRBD, but not yet answered by DRBD.
- ep (epochs)              # Number of epoch objects. Usually 1.
                           # Might increase under I/O load when using either the barrier or the none write ordering method.
- wo (write order)         # Currently used write ordering method: b (barrier), f (flush), d (drain) or n (none).
- oos (out of sync)        # Amount of storage currently out of sync; in Kibibytes.
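On DRBD 9 the same information is available without parsing /proc/drbd; a hedged sketch using drbdsetup from drbd-utils 9 (the output field names differ slightly from the V8 counters above):
# per-resource state plus performance counters
drbdsetup status r0 --verbose --statistics
# or simply keep an eye on the summary
watch -n1 'drbdadm status r0'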
State
Connection state
drbdadm cstate <resource>
- StandAlone # No network configuration available. (drbdadm disconnect, failed authentication, split brain)
- Connected
- Timeout
- Unconnected
- WFConnection # This node is waiting until the peer node becomes visible on the network.
- WFReportParams # TCP connection has been established, this node waits for the first network packet from the peer.
- StartingSyncS # Full synchronization, initiated by the administrator, is just starting
- WFBitMapS     # Partial synchronization is just starting
- SyncSource    # the local node being the source of synchronization
                # e.g. replication:SyncSource peer-disk:Inconsistent done:7.02
- SyncTarget    # the local node being the target of synchronization
- PausedSyncS   # drbdadm pause-sync
- VerifyS # On-line device verification is currently running (source of verification)
- VerifyT # On-line device verification is currently running (target of verification)
Disk states
drbdadm dstate <resource>
- UpToDate
- Consistent   # Consistent data of a node without connection.
- Negotiating
- DUnknown     # This state is used for the peer disk if no network connection is available.
- Diskless     # detached using drbdadm detach
- Attaching
- Failed
- Inconsistent
- Outdated
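A tiny monitoring sketch built only on the sub-commands above, assuming the V8-style local/peer output (e.g. "UpToDate/UpToDate"); the resource name and the alert action are placeholders:
#!/bin/bash
RES=r0
CSTATE=$(drbdadm cstate "$RES")   # e.g. Connected
DSTATE=$(drbdadm dstate "$RES")   # e.g. UpToDate/UpToDate
if [ "$CSTATE" != "Connected" ] || [ "$DSTATE" != "UpToDate/UpToDate" ]; then
    echo "DRBD $RES degraded: cstate=$CSTATE dstate=$DSTATE" | logger -t drbd-check
fi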
Enabling and Disabling Resource
up
drbdadm up <resource>
down
drbdadm down <resource>
Remark
all can be used to refer to every resource
i.e.
drbdadm down all
Change the Primary node(Manual Failover)
[Lab 1] Change Over (V8)
# On node 1 (orig: primary node)
drbdadm role r0 # Shows the current roles of the devices (e.g. "Primary/Secondary")
umount /dev/drbd0
drbdadm secondary r0
drbdadm status
r0 role:Secondary
disk:UpToDate
drbd-b role:Secondary
peer-disk:UpToDate
# On node 2 (secondary)
drbdadm primary r0
mount /dev/drbd0 /home/drbd
...
P.S.
* While the "connection state" is "Connected", running "drbdadm primary <resource>" on node 2 returns an error
  because "drbdadm secondary r0" has not been run on node 1
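The demotion half of Lab 1 can be wrapped in a small helper on the old primary; a hedged sketch using the lab's resource r0 and device /dev/drbd0 (set -e stops the script if the umount fails, so the resource is never demoted underneath a mounted file system):
#!/bin/bash
set -e
umount /dev/drbd0        # must succeed before demotion
drbdadm secondary r0
drbdadm status r0        # should now show role:Secondary
# then, on the peer: drbdadm primary r0 && mount /dev/drbd0 /home/drbd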
[Lab 2] cs2 failover to cs1 (V8)
# On cs2 (Orig: primary node)
reboot
# On cs1 (Orig: secondary)
drbdadm status
r0 role:Secondary disk:UpToDate peer connection:Connecting
drbdadm primary r0 # not strictly necessary, since mounting will promote it to Primary directly
Remark
# without having run "drbdadm primary r0"
mount /dev/drbd1 /var/lib/lxc
mount: /dev/drbd1 is write-protected, mounting read-only
mount: mount /dev/drbd1 on /var/lib/lxc failed: Wrong medium type
# On cs2 (Secondary)
drbdadm up r0
Marked additional 12 MB as out-of-sync based on AL.
drbdadm status
r0 role:Secondary disk:UpToDate peer connection:Connecting
...
r0 role:Secondary disk:UpToDate
dmesg
... block drbd1: Split-Brain detected but unresolved, dropping connection!
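Once dmesg reports "Split-Brain detected but unresolved", the nodes stay disconnected until one side's changes are discarded. A hedged recovery sketch following the standard manual procedure from the DRBD user guide; run the first block on the node whose local changes you are willing to lose:
# on the split-brain victim (its diverging writes will be overwritten)
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# on the surviving node (only needed if it is also StandAlone)
drbdadm connect r0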
Reconfiguring Setting / Resource
On both nodes:
- synchronize your /etc/drbd.conf
- drbdadm adjust <resource>
P.S.
# always dry-run (-d) first to check for mistakes
drbdadm -d adjust <resource>
i.e.
drbdadm adjust all
Remark
When the common section of /etc/drbd.conf is updated, run "drbdadm adjust all"
Online device verification
Diagram:
VerifyS -sha1-> VerifyT
Settings:
global_common.conf
common {
net {
# algorithm used for the check. Default off. sha1 | md5
verify-alg <algorithm>;
}
}
OR
# MyResource.res
resource <resource> {
net {
verify-alg <algorithm>;
}
...
}
CPU hardware acceleration
...
Run the verification:
# when blocks are out of sync, a kernel log message is produced
drbdadm verify <resource | all>
ie.
drbdadm verify all
# when "verify-alg" is not set locally
r0: State change failed: (-14) Need a verify algorithm to start online verify Command 'drbdsetup verify r0 0 0' terminated with exit code 11
# when "verify-alg" is not set on the remote node
r0: State change failed: (-10) State change was refused by peer node Command 'drbdsetup verify r0 0 0' terminated with exit code 11
dmesg
... drbd r0: Preparing cluster-wide state change 999491511 (1->0 496/288)
... drbd r0: State change 999491511: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
... drbd r0: Committing cluster-wide state change 999491511 (1ms)
... drbd r0/0 drbd1 drbd-a: repl( Established -> VerifyS )
... drbd r0/0 drbd1 drbd-a: Starting Online Verify from sector 0
... drbd r0/0 drbd1 drbd-a: Starting Online Verify from sector 48
...
(after a while)
... drbd r0/0 drbd1 drbd-a: Online verify done (total 140 sec; paused 0 sec; 74888 K/sec)
... drbd r0/0 drbd1 drbd-a: repl( VerifyS -> Established )
Remark:
Regardless of whether a node is Primary or Secondary, the side that runs the CLI becomes VerifyT and the other side becomes VerifyS
dstat -c -n -N eth1 -d -D vdb
----total-cpu-usage---- --net/eth1- --dsk/vdb--
usr sys idl wai hiq siq| recv  send| read  writ
  1  15  85   0   0   0|1696k 3063k|  87M     0
  0   9  91   0   0   1|1161k 2052k|  61M     0
  0   9  91   0   0   0|1486k 2547k|  72M     0
...
Test 2
On the Secondary
dd if=/dev/zero of=/dev/vdb1 bs=1M count=1
drbdadm status
r0 role:Secondary disk:UpToDate drbd-b role:Primary peer-disk:UpToDate
drbdadm verify all
dmesg
... drbd r0/0 drbd1 drbd-b: Out of sync: start=0, size=6240 (sectors)
... drbd r0/0 drbd1 drbd-b: Out of sync: start=7248, size=184 (sectors)
...
... drbd r0/0 drbd1 drbd-b: Online verify done (total 149 sec; paused 0 sec; 76912 K/sec)
... drbd r0/0 drbd1 drbd-b: Online verify found 244000 4k blocks out of sync!
drbdadm status
r0 role:Secondary
disk:UpToDate <- ?!
drbd-b role:Primary
peer-disk:UpToDate
Resynchronize the marked blocks:
drbdadm disconnect r0
drbdadm connect r0
drbdadm status
r0 role:Secondary
disk:Outdated
drbd-b connection:Connecting
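To make the check routine, the verify run can be scheduled from cron (the DRBD user guide suggests a weekly run); the timing and path below are only an example:
# /etc/cron.d/drbd-verify - weekly on-line verification of all resources
42 0 * * 0 root /sbin/drbdadm verify all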
Sync data speed
Synchronization is distinct from device replication
Synchronization is necessary if the replication link has been interrupted for any reason
These parameters control the bandwidth used during resynchronization (not normal replication)
(ie. in the SyncTarget state, on the (inconsistent) receiver side)
V8.4 uses a variable rate by default
Setting
- Fixed rate
- Temporary speed-up
- Variable sync rate
Fixed rate
# 100MByte/second of IO bandwidth
common {
disk {
c-plan-ahead 0;
resync-rate 100M;
}
}
P.S.
"resync-rate 100M;" only takes effect when "c-plan-ahead 0;" is also set
a 1G network gives roughly 110 MiByte/s of IO bandwidth
Temporary speed-up:
drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M <resource>
Restore the original settings:
drbdadm adjust resource
Variable Rate:
dynamic resync speed controller setting: c-plan-ahead, c-fill-target, c-delay-target, c-max-rate, c-min-rate
resource <resource> {
disk {
# agility of the controller (higher values => slower responses => larger deviation)
# Default: 0. Unit: 0.1 seconds. It should be at least 5 times RTT.
c-plan-ahead 20;
# The most bandwidth that can be used by a resync.
c-max-rate 100M;
# c-fill-target and c-delay-target are mutually exclusive: configure one or the other
# Only when fill_target is set to 0 the controller will use delay_target
# c-fill-target = a constant amount of data fill_target
# c-delay-target = constant delay time of delay_target along the path, unit: 0.1s
c-fill-target 50M;
# tells DRBD use only up to min_rate for resync IO and
# to dedicate all other available IO bandwidth to application requests.
# 0 = disables the limitation
c-min-rate 10M;
}
}
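When falling back to a fixed rate, a common rule of thumb from the DRBD docs is to give resync roughly 30% of the smaller of the network and disk bandwidth, so application IO is not starved. A worked example for the 1 Gbit/s link above, with an assumed disk throughput of about 180 MiByte/s:
# MIN(network, disk) * 0.3 = MIN(110, 180) * 0.3 ≈ 33M
drbdadm disk-options --c-plan-ahead=0 --resync-rate=33M r0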
Checksum-based Synchronization
Purpose
Reduces re-writes of sectors that were re-written with identical contents while DRBD was in disconnected mode
How it works
When using checksum-based synchronization, rather than performing a brute-force overwrite of blocks marked out of sync,
DRBD reads blocks before synchronizing them and computes a hash of the contents currently found on disk.
It then compares that hash with one computed from the same block on the peer, and skips re-writing the block when the hashes match.
Settings
# Default off, algorithm: sha1, md5, and crc32c
resource <resource>
net {
csums-alg <algorithm>;
}
...
}
Replication traffic integrity checking (For debug)
* This feature is not intended for production use. Enable only if you need to diagnose data corruption problems.
# default off
resource <resource> {
net {
data-integrity-alg <algorithm>;
}
...
}
I/O Errors (Disk error handling strategies)
Config
resource <resource> {
disk {
on-io-error <strategy>;
...
}
...
}
Handling approaches
- Passing on I/O errors (it is left to upper layers to deal with such errors)
(may result in a file system being remounted read-only)
- Masking I/O errors (V8.4 Default)
DRBD transparently fetches the affected block from the peer node, over the network.
(carries out all subsequent I/O operations, read and write, on the peer node.)
=> service continues without interruption
strategy
detach (Default)
the node drops its backing device, and continues in diskless mode.
pass_on
This causes DRBD to report the I/O error to the upper layers.
On the primary node, it is reported to the mounted file system.
On the secondary node, it is ignored (because the secondary has no upper layer to report to).
call-local-io-error
Invokes the command defined as the local I/O error handler.
(defined in the resource’s handlers section)
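For call-local-io-error the handler command lives in the resource's handlers section; a minimal sketch, where the notify script path is hypothetical:
resource <resource> {
disk {
on-io-error call-local-io-error;
}
handlers {
# run when DRBD reports a local I/O error
local-io-error "/usr/local/sbin/drbd-io-error-notify.sh";
}
...
}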
Growing on-line
backing block devices can be grown while in operation (online)
With the node in primary state:
drbdadm resize resource
* This triggers a synchronization of the new section.
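A typical on-line grow, assuming the backing device is an LVM logical volume with an ext4 file system on top (device and VG names are hypothetical):
# on both nodes: grow the backing device
lvextend -L +10G /dev/vg0/drbd-backing
# on the primary: let DRBD pick up the new size (syncs the new section)
drbdadm resize r0
# on the primary: grow the file system living on the DRBD device
resize2fs /dev/drbd1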
Growing off-line:
"internal meta data" must be moved to the end of the grown device
on both nodes:
drbdadm dump-md resource > /tmp/metadata
# Adjust the size information (la-size-sect[sectors]) in the file /tmp/metadata
# Re-initialize the metadata area
drbdadm create-md resource
# Re-import the corrected meta data
drbdmeta_cmd=$(drbdadm -d dump-md test-disk)
${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
# Re-enable
drbdadm up resource
Disabling backing device flushes
Only enable these settings when the device has a battery-backed write cache (BBWC)
(while the controller has battery power, it will still write the cached data to disk)
Once data has been written to the cache it already counts as valid data, so a power failure is not a problem
If the device operates in write-through mode, this can also be enabled
resource <resource> {
disk {
# replicated data set
disk-flushes no;
# meta data
md-flushes no;
...
}
...
}
Dealing with hard drive failure
# Replacing a failed disk when using internal meta data
# Full synchronization of the new hard disk starts instantaneously and automatically.
drbdadm create-md <resource>
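A fuller replacement sequence on the node with the failed disk, assuming DRBD already dropped that node to Diskless (partitioning details depend on your layout):
drbdadm dstate r0       # the failed node should show Diskless
# physically replace the disk and recreate /dev/sdb1 at the same (or larger) size
drbdadm create-md r0
drbdadm attach r0       # or drbdadm up r0 if the resource was taken down
drbdadm status r0       # full sync from the peer starts automatically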
Using Truck based replication (Disk shipping)
Use case:
Sync a large amount of data to a remote site in one go
Method:
By removing a hot-swappable drive from a RAID-1 mirror
Local node:
# after this cmd, the hard disk can be removed
# Create a consistent, verbatim copy of the resource’s data and its metadata.
drbdadm new-current-uuid --clear-bitmap <resource>
# after the hard disk has been removed
drbdadm new-current-uuid resource
Disk to remote node:
# after inserting the shipped hard disk, run the following cmd
drbdadm up <resource>