Last updated: 2021-02-10
Table of Contents
- More Options
- invalidate
- Login
- DRBD meta data
- Generation Identifier (GI)
- Activity Log
- Measuring
- Cluster File System: GFS2 or OCFS2
- Solving Split Brain
- Using LVM with DRBD
- Tuning Recommendations
- Keep Alive Checking
More Options
Global Section
udev-always-use-vnr
drbdadm will always add the .../VNR part,
and will not care whether the volume definition was implicit or explicit.
# explicit volume definition: volume VNR { }
DEVICE=drbd<minor> SYMLINK_BY_RES=drbd/by-res/<resource-name>/VNR
minor-count
A sizing hint for DRBD; it helps to right-size various memory pools.
It should be set to the same order of magnitude as the actual number of minors you use.
dialog-refresh
The default value is 1.
The user dialog redraws the second count every N seconds, where N is the value of this option.
disable-ip-verification
Use this option if, for some obscure reason, drbdadm cannot use ip or ifconfig to do a sanity check for the IP address.
cmd-timeout-short / cmd-timeout-medium / cmd-timeout-long
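A minimal global-section sketch with the options above (the values are illustrative, not tuned recommendations; the cmd-timeout-* options are left out):
global {
    # same order of magnitude as the real number of minors in use
    minor-count 64;
    # redraw the wait-for-connection countdown every second
    dialog-refresh 1;
    # always create the /dev/drbd/by-res/<resource-name>/VNR symlinks
    udev-always-use-vnr;
    # only if ip/ifconfig sanity checks cause problems
    # disable-ip-verification;
}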
Startup Section
wfc-timeout
The init script drbd(8) blocks the boot process until the DRBD resources are connected.
Default is 0, which means unlimited. The unit is seconds.
When the cluster manager starts later, it does not see a resource with internal split-brain.
In case you want to limit the wait time, do it here.
degr-wfc-timeout
Wait for connection timeout, if this node was a degraded cluster.
This timeout value is used instead of wfc-timeout, because the peer is less likely to show up in time.
outdated-wfc-timeout
Wait for connection timeout, if the peer was outdated.
wait-after-sb
Make the init script continue to wait even if the device pair had a split-brain situation
and therefore refuses to connect.
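A startup-section sketch combining the timeouts above (numbers are placeholders chosen only to show the relationship between the options):
startup {
    # wait at most 2 minutes for the peer at boot; 0 would mean wait forever
    wfc-timeout 120;
    # shorter wait when this node comes up out of a degraded cluster
    degr-wfc-timeout 60;
    # shorter still when the peer is known to be outdated
    outdated-wfc-timeout 30;
    # keep waiting even if the pair is in a split-brain situation
    wait-after-sb;
}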
Options Section
cpu-mask
Sets the cpu-affinity-mask for DRBD's kernel threads of this device.
Default: 0 => spread over all CPUs of the machine
on-no-data-accessible ond-policy
This setting controls what happens to IO requests on a degraded, diskless node.
The available policies are io-error (default) and suspend-io.
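An options-section sketch for the two parameters above (placement follows these notes; the values simply restate the defaults):
options {
    # 0 = spread DRBD kernel threads over all CPUs
    cpu-mask 0;
    # on a degraded, diskless node either fail IO (io-error) or suspend it (suspend-io)
    on-no-data-accessible io-error;
}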
invalidate
invalidate
This forces the local device of a pair of connected DRBD devices into SyncTarget state,
which means that all data blocks of the device are copied over from the peer.
On Primary
# this node becomes cs:SyncSource, the peer becomes cs:SyncTarget
drbdadm invalidate-remote r0
On Secondary
drbdadm invalidate r0
Note: a Secondary can also run invalidate-remote against the Primary!
Login
net {
    # The HMAC algorithm will be used for the challenge response authentication of the peer.
    # ALG: cat /proc/crypto
    # peer authentication is disabled as long as no cram-hmac-alg is specified.
    cram-hmac-alg ALG;
    shared-secret STRING;
}
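A filled-in example (the algorithm and secret are placeholders of my own; the algorithm must appear in /proc/crypto and the secret may be up to 64 characters):
net {
    cram-hmac-alg sha1;
    shared-secret "FooFunFactory";
}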
DRBD meta data
- the size of the DRBD device
- Generation Identifier (GI)
- Activity Log (AL)
- quick-sync bitmap
Generation Identifier (GI)
* 8-byte values, 4 sets: Current UUID, Bitmap UUID, and two Historical UUIDs
Current UUID (same => Connected and fully synchronized)
Used to:
determine the direction of synchronization.
The following situations change the GI:
- The initial device full sync,
- a disconnected resource switching to the primary role,
- a resource in the primary role disconnecting.
When a connection is established, DRBD compares the GIs of both nodes:
- Current UUIDs empty on both nodes
- Current UUIDs empty on one node
- Equal current UUIDs
- Bitmap UUID matches peer's current UUID (normal background re-synchronization)
- Current UUID matches peer's historical UUID (full background re-synchronization)
- Bitmap UUIDs match, current UUIDs do not (split brain: auto-recovery strategies)
- Neither current nor bitmap UUIDs match (split brain: waits for manual resolution)
- No UUIDs match
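To inspect the GI tuple on a node, drbdadm can print it (a quick check of my own, not from these notes; output differs slightly between DRBD 8.4 and 9):
# compact form: current:bitmap:hist1:hist2 UUIDs plus flags
drbdadm get-gi r0
# verbose form with explanatory text
drbdadm show-gi r0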
Activity Log
The activity log (AL), stored in the meta data area
keeps track of those blocks that have “recently” been written to.
If a temporarily failed node that was in active mode at the time of failure is synchronized,
only those hot extents highlighted in the AL need to be synchronized, rather than the full device.
Configurable parameter: al-extents (the number of active extents)
# Many active extents => improves write throughput
# Few active extents => reduces synchronization time after active node failure and subsequent recovery
suitable Activity Log size
E=(R x T)/4
R = 30 MiByte/s
T = 240 sec
Then take the next prime number above the result.
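A quick worked example with the numbers above (the arithmetic is mine; each AL extent covers 4 MiB):
E = (30 MiB/s x 240 s) / 4 MiB = 1800
next prime above 1800 = 1801  =>  al-extents 1801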
quick-sync bitmap:
one bit represents a 4-KiB chunk of on-disk data
if the bit is set => needs to be re-synchronized
Measuring
Measuring throughput
for i in $(seq 5); do dd if=/dev/zero of=$TEST_LL_DEVICE bs=512M count=1 oflag=direct; done
Measuring latency
dd if=/dev/zero of=$TEST_LL_DEVICE bs=512 count=1000 oflag=direct
# typical latency over Gigabit Ethernet links: 100us - 200us
Cluster File System: GFS2 or OCFS2
OCFS2
DLM(Distributed Lock Manager): ocfs2_dlmfs (separate from the actual OCFS2 file systems)
Infrastructure: Pacemaker
* Since the resource is always primary on both nodes, any interruption in the network between them will result in a split brain.
Configuration: Dual-primary mode
resource r0 {
    startup {
        # at startup both nodes automatically become Primary
        become-primary-on both;
    }
    net {
        # Dual-primary mode requires "protocol C"
        protocol C;
        # method to prevent split brain
        fencing resource-and-stonith;
        # you should do so only after the initial resource synchronization has completed
        allow-two-primaries;
        #
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ...
    }
}
fencing
dont-care (the worst option)
This is the default policy. No fencing actions are taken.
resource-only
If a node becomes a disconnected primary, it tries to fence the peer's disk.
This is done by calling the fence-peer handler.
The handler is supposed to reach the other node over alternative communication paths and
call 'drbdadm outdate res' there.
resource-and-stonith
If a node becomes a disconnected primary, it freezes all its IO operations and calls its fence-peer handler.
In case it cannot reach the peer it should stonith the peer. IO is resumed as soon as the situation is resolved.
In case your handler fails, you can resume IO with the "resume-io" command.
On both nodes:
- drbdadm disconnect resource
- drbdadm connect resource
- drbdadm primary resource
Create the file system:
# filesystem label = ocfs2_drbd0
mkfs -t ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0
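Once both nodes are Primary and the OCFS2 cluster stack (or the Pacemaker-managed equivalent) is running, the file system can be mounted on both nodes; the mount point is the one used in the labs below:
mount -t ocfs2 /dev/drbd0 /home/cluster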
P.S.
* All cluster file systems require fencing
disk {
    fencing resource-and-stonith;
}
handlers {
    outdate-peer "/sbin/make-sure-the-other-node-is-confirmed-dead.sh";
}
* DRBD does not promote the surviving node to the primary role;
(this is the responsibility of the cluster manager)
Solving Split Brain
Configure notification:
common {
    handlers {
        # send e-mail notification
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        ...
    }
    ...
}
Automatic recovery:
common {
    net {
        # Split brain has just been detected,
        # but at this time the resource is not in the Primary role on any host.
        # When one side has changes and the other does not, the side without changes is updated.
        after-sb-0pri discard-zero-changes;
        # The resource is in the Primary role on one host (1pri)
        after-sb-1pri discard-secondary;
        # The resource is in the Primary role on both hosts:
        # drop the connection and continue in disconnected mode
        after-sb-2pri disconnect;
        ...
    }
    ...
}
Manual split-brain recovery:
[Lab 1]
With cs1 (originally Primary) and cs2 (originally Secondary), after cs1 reboots:
drbdadm status
r0 role:Secondary
disk:UpToDate
drbd-b connection:StandAlone
dmesg
... drbd r0/0 drbd1: Split-Brain detected but unresolved, dropping connection!
Cause: cs2 has become Primary
Resolution
On cs2 (Primary), run:
drbdadm connect r0
r0 role:Primary disk:UpToDate
becomes
r0 role:Primary disk:UpToDate peer connection:Connecting
On cs1 (Secondary), run:
drbdadm connect --discard-my-data r0
Afterwards data is read from cs2 and written to cs1 (resync cs2 -> cs1).
[Lab 2] link down
# On Hypervisor
domif-setlink cs2 --interface cs2-eth1 --state down
# On cs1
drbdadm primary r0 # no error
=> only with the network link disconnected can the Secondary be promoted to Primary
=> this is the start of a split brain
[Lab 3] dual primary
# On cs1
mount /dev/drbd0 /home/cluster
echo test2 > /home/cluster/test.txt
# On Hypervisor
domif-setlink cs2 --interface cs2-eth1 --state up
Since both cs1 and cs2 are Primary, they each operate independently.
[Lab 4] Decide to treat the data on cs2 as authoritative
# On cs1
umount /home/cluster
drbdadm secondary r0
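To finish the recovery, the same discard/connect sequence as in Lab 1 would presumably follow (a sketch, assuming cs1 is the node whose changes are thrown away):
# On cs1 (split-brain victim)
drbdadm connect --discard-my-data r0
# On cs2 (survivor), only needed if it dropped to StandAlone
drbdadm connect r0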
Using LVM with DRBD
- using LVM Logical Volumes as backing devices for DRBD;
- using DRBD devices as Physical Volumes for LVM;
Benefit of DRBD over LVM:
automated LVM snapshots during DRBD synchronization
resource r0 {
    handlers {
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
        after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
    }
}
lvm over drbd
Disable the LVM cache and config filter setting:
# make sure that LVM detects PV signatures on stacked resources
filter = [ "a|drbd1[0-9]|", "r|.*|" ]
# disable writing out of the persistent filter cache file
write_cache_state = 0
rm /etc/lvm/cache/.cache
update-initramfs -u
P.S.
write_cache_state:
LVM labels and volume group (VG) metadata are stored inside the physical volumes.
On system startup, vgscan scans the block devices of the system looking for LVM labels.
The names of the PVs found are stored in "/etc/lvm/.cache",
which improves performance because the block devices do not need to be re-scanned.
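Back to the lvm-over-drbd steps: with the filter in place, the DRBD device can be initialized as a PV like any other block device (generic LVM commands; the VG/LV names and size are made up for illustration):
pvcreate /dev/drbd0
vgcreate vg_drbd /dev/drbd0
lvcreate --name lv_data --size 10G vg_drbd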
Tuning Recommendations
Disk Buffer (max-buffers & max-epoch-size)
# Affect write performance on the secondary node
# The default for both is 2048
- max-buffers
  Limits the memory usage per DRBD minor device on the receiving side,
  or for internal buffers during resync or online-verify.
  Unit is PAGE_SIZE; 8k = 8000
- max-epoch-size
  # number of write requests permitted between two write barriers
resource <resource> {
    net {
        # allocates for writing data to disk
        max-buffers 8000;
        # requests permitted between two write barriers
        max-epoch-size 8000;
        ...
    }
    ...
}
TCP send buffer auto-tuning (sndbuf-size)
Default: 128 KiB; 0 = auto-tuning
For Gigabit Ethernet, a value above 2M is reasonable.
resource <resource> {
    net {
        sndbuf-size 4M;
        rcvbuf-size 4M;
    }
    ...
}
Fairly large activity log
For better performance by reducing the amount of metadata disk write operations.
The trade-off is an extended resynchronization time after a primary node crash.
resource <resource> {
    disk {
        al-extents 3389;
        ...
    }
    ...
}
Disabling barriers and disk flushes
Only for systems equipped with a battery-backed write cache (BBWC).
resource <resource> {
    disk {
        disk-barrier no;
        disk-flushes no;
        ...
    }
    ...
}
CPU mask
CPU mask of 1 (00000001) means DRBD may use the first CPU only.
A mask of 12 (00001100) implies DRBD may use the third and fourth CPU.
resource <resource> {
    options {
        # 12 => 00001100 (use third and fourth CPU)
        cpu-mask 12;
        ...
    }
    ...
}
P.S.
Pin applications to only those CPUs which DRBD does not use
("taskset" command in an application init script)
Enabling Jumbo frames
ifconfig <interface> mtu <size>
Or
ip link set <interface> mtu <size>
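A concrete example (the interface name and the 9000-byte MTU are assumptions; the switch and the peer's NIC must support jumbo frames as well):
ip link set eth1 mtu 9000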
Enabling the deadline scheduler
Guarantees a start service time for each request.
Deadline queues are basically sorted by their deadline
Read queues are given a higher priority
echo deadline > /sys/block/<device>/queue/scheduler
echo 0 > /sys/block/<device>/queue/iosched/front_merges
echo 150 > /sys/block/<device>/queue/iosched/read_expire
echo 1500 > /sys/block/<device>/queue/iosched/write_expire
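To keep this selection across reboots, the elevator kernel parameter can be used instead of writing to sysfs at every boot (a sketch assuming a Debian-style GRUB2 setup; run update-grub afterwards):
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... elevator=deadline"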
Keep Alive Checking
timeout N
If the peer does not respond to a request within N/10 seconds, it is considered dead.
Default 60 (6 seconds). The unit is 0.1 seconds.
This must be lower than connect-int and ping-int.
connect-int N
When the connection is down, DRBD keeps trying to reconnect every N seconds.
The default value is 10 seconds.
ping-int N
If the TCP/IP connection linking a DRBD device pair is idle for more than N seconds,
DRBD will generate a keep-alive packet to check if its partner is still alive.
The default is 10 seconds.
ping-timeout N
The peer must answer a keep-alive packet within N/10 seconds, otherwise it is considered dead.
Default 5 (500 ms). The unit is 0.1 seconds.
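A net-section sketch restating the four options above with their default values (shown only to make the units explicit; the defaults apply anyway if the options are omitted):
net {
    timeout 60;       # unit 0.1 s => 6 seconds
    connect-int 10;   # seconds
    ping-int 10;      # seconds
    ping-timeout 5;   # unit 0.1 s => 0.5 seconds
}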