drbd - Advanced

Last updated: 2021-02-10

 

Contents

  • More Options
  • invalidate
  • Login
  • DRBD meta data
  • Generation Identifier (GI)
  • Activity Log
  • Measuring
  • Cluster File System: GFS2 or OCFS2
    OCFS2
    GFS2
  • Resolving Split Brain
  • Using LVM with DRBD
  • Tuning Recommendations
  • Keep Alive Checking

 


More Options

 

Global Section

udev-always-use-vnr

drbdadm will always add the .../VNR part to the udev symlink,

regardless of whether the volume definition was implicit or explicit.

# explicit volume definition: volume VNR { }

DEVICE=drbd<minor>
SYMLINK_BY_RES=drbd/by-res/<resource-name>/VNR

minor-count

A sizing hint for DRBD. It helps to right-size various memory pools.

It should be set to the same order of magnitude as the actual number of minors you use.

dialog-refresh time

The user dialog redraws the second count every time seconds (or does no redraw if time is 0).

The default value is 1.

disable-ip-verification

Use this if, for some obscure reason, drbdadm cannot use ip or ifconfig to do a sanity check for the IP address.

cmd-timeout-short / cmd-timeout-medium / cmd-timeout-long

Startup Section

wfc-timeout

The init script drbd(8) blocks the boot process until the DRBD resources are connected.

Default is 0, which means unlimited. The unit is seconds.

When the cluster manager starts later, it does not see a resource with internal split-brain.

In case you want to limit the wait time, do it here.

degr-wfc-timeout

Wait for connection timeout, if this node was a degraded cluster.

This timeout value is used instead of wfc-timeout, because the peer is less likely to show up in time.

outdated-wfc-timeout

Wait for connection timeout, if the peer was outdated.

wait-after-sb

Makes the init script continue to wait even if the device pair had a split brain situation and

therefore refuses to connect.
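
A minimal sketch of a startup section combining the options above; the timeout values here are illustrative assumptions, not defaults:

startup {
    # wait at most 120 s at boot for the peer (0 = wait forever)
    wfc-timeout 120;

    # wait only 60 s if this node was part of a degraded cluster
    degr-wfc-timeout 60;

    # keep waiting even after a detected split brain
    wait-after-sb;
}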

Options Section

cpu-mask

Sets the cpu-affinity-mask for DRBD's kernel threads of this device.

Default: 0 => spread over all CPUs of the machine

on-no-data-accessible ond-policy

This setting controls what happens to IO requests on a degraded, diskless node.

The available policies are io-error(default) and suspend-io.
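
A minimal sketch of a resource options section using the two settings above; the values shown are simply the documented defaults:

options {
    # 0 (default) = spread DRBD's kernel threads over all CPUs
    cpu-mask 0;

    # io-error (default) or suspend-io
    on-no-data-accessible io-error;
}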

 


invalidate

 

invalidate

This forces the local device of a pair of connected DRBD devices into SyncTarget state,

which means that all data blocks of the device are copied over from the peer.

On Primary

# this node becomes cs:SyncSource, the peer becomes cs:SyncTarget

drbdadm invalidate-remote r0

On Secondary

drbdadm invalidate r0

Note !! A Secondary can also invalidate-remote the Primary !!
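
The resynchronization triggered by invalidate can be watched on either node; drbdadm status works on both DRBD 8.4 and 9, while /proc/drbd is available on 8.x:

drbdadm status r0

# on DRBD 8.x, /proc/drbd also shows the SyncSource/SyncTarget state and progress
cat /proc/drbd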

 


Login

 

net {
    # The HMAC algorithm will be used for the challenge response authentication of the peer.
    # ALG: cat /proc/crypto
    # peer authentication is disabled as long as no cram-hmac-alg is specified.
    cram-hmac-alg ALG;

    shared-secret STRING;
}
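
A concrete example, assuming the sha1 HMAC is listed in /proc/crypto; the secret string is an arbitrary placeholder:

net {
    cram-hmac-alg sha1;
    shared-secret "FooFunFactory";
}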

 


DRBD meta data

 

  • the size of the DRBD device
  • Generation Identifier (GI)
  • Activity Log (AL)
  • quick-sync bitmap
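
The on-disk meta data of a resource can be inspected with drbdadm; a sketch assuming resource r0 and that the resource can be taken down first (drbdmeta usually refuses while the backing device is in use):

drbdadm down r0
drbdadm dump-md r0      # dumps GI, activity log and bitmap information as text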

 


Generation Identifier (GI)

 

* 4 sets of 8-byte UUIDs (Current, Bitmap, and two Historical UUIDs)

Current UUID (same => Connected and fully synchronized)

Used to:

determine the direction of synchronization.

The following situations change the GI:

  • The initial device full sync,
  • a disconnected resource switching to the primary role,
  • a resource in the primary role disconnecting.
  1. Current UUIDs empty on both nodes
  2. Current UUIDs empty on one node
  3. Equal current UUIDs
  4. Bitmap UUID matches peer's current UUID (normal background re-synchronization)
  5. Current UUID matches peer's historical UUID (full background re-synchronization)
  6. Bitmap UUIDs match, current UUIDs do not (split brain: auto-recovery strategies)
  7. Neither current nor bitmap UUIDs match (split brain: waits for manual recovery)
  8. No UUIDs match (unrelated data: connection dropped)
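
The GI tuple of a resource can be displayed on each node to see which of the above cases applies, e.g.:

drbdadm get-gi r0       # terse: the UUID set plus flags
drbdadm show-gi r0      # same data with explanatory text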

 


Activity Log

 

The activity log (AL), stored in the meta data area

keeps track of those blocks that have “recently” been written to.

If a temporarily failed node that was in active mode at the time of failure is synchronized,

only those hot extents highlighted in the AL need to be synchronized, rather than the full device.

configurable parameter

# Many active extents => improves write throughput
# Few active extents => reduces synchronization time after active node failure and subsequent recovery

suitable Activity Log size

E=(R x T)/4

R = 30 MiByte/s

T = 240 sec

then round up to the next prime number.
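
Plugging in the numbers above gives a worked example; the result is used as the al-extents value in the disk section:

# E = (R x T) / 4 MiB-per-extent = (30 x 240) / 4 = 1800
# next prime number: 1801
disk {
    al-extents 1801;
}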

quick-sync bitmap:

one bit represents a 4-KiB chunk of on-disk data
if the bit is set => needs to be re-synchronized

 


Measuring

 

Measuring throughput

for i in $(seq 5); do
  dd if=/dev/zero of=$TEST_LL_DEVICE bs=512M count=1 oflag=direct
done

Measuring latency

dd if=/dev/zero of=$TEST_LL_DEVICE bs=512 count=1000 oflag=direct

# Gigabit Ethernet links 100us-200us
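
To see DRBD's own overhead, the same two tests are usually repeated against the DRBD device and compared with the backing-device numbers; a sketch assuming the resource's device is /dev/drbd0 (WARNING: this overwrites data on the device):

TEST_DEVICE=/dev/drbd0

# throughput through DRBD (run on the current Primary)
dd if=/dev/zero of=$TEST_DEVICE bs=512M count=1 oflag=direct

# latency through DRBD
dd if=/dev/zero of=$TEST_DEVICE bs=512 count=1000 oflag=direct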

 


Cluster File System: GFS2 or OCFS2

 

OCFS2

 

DLM(Distributed Lock Manager): ocfs2_dlmfs (separate from the actual OCFS2 file systems)

Infrastructure: Pacemaker

 * With both nodes always Primary, any interruption in the network between the nodes will result in a split brain.

Configuration: Dual-primary mode

resource r0 {
  startup {
    # at startup, both nodes automatically become Primary
    become-primary-on both;
  }
  net {

    # Dual-primary mode requires "protocol C"
    protocol C;

    # method of guarding against split brain
    fencing resource-and-stonith;

    # enable this only after the initial resource synchronization has completed.
    allow-two-primaries;

    # split-brain auto-recovery policies
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
}

fencing

dont-care (the weakest option)

This is the default policy. No fencing actions are taken.

resource-only

If a node becomes a disconnected primary, it tries to fence the peer's disk.

This is done by calling the fence-peer handler.

The handler is supposed to reach the other node over alternative communication paths and

call 'drbdadm outdate res' there.

resource-and-stonith

If a node becomes a disconnected primary, it freezes all its IO operations and calls its fence-peer handler.

In case it cannot reach the peer it should stonith the peer. IO is resumed as soon as the situation is resolved.

In case your handler fails, you can resume IO with the "resume-io" command.

On both nodes:

  1. drbdadm disconnect resource
  2. drbdadm connect resource
  3. drbdadm primary resource

Create the filesystem:

# filesystem label = ocfs2_drbd0

mkfs -t ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0
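
Once the OCFS2 cluster stack (o2cb, or the Pacemaker-managed DLM mentioned above) is running on both nodes, the file system can be mounted on both; the mount point is an arbitrary example:

# run on BOTH nodes
mkdir -p /mnt/ocfs2
mount -t ocfs2 /dev/drbd0 /mnt/ocfs2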

P.S.

 * All cluster file systems require fencing

disk {
        fencing resource-and-stonith;
}
handlers {
        outdate-peer "/sbin/make-sure-the-other-node-is-confirmed-dead.sh"
}

* DRBD does not promote the surviving node to the primary role;

  (that is the cluster manager's responsibility)

 


Resolving Split Brain

 

Configure notification:

common {
  handlers {
    # send e-mail notification
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }
  ...
}

Automatic recovery:

common {
  net {

    # Split brain has just been detected, 
    # but at this time the resource is not in the Primary role on any host
    # when one node has changes and the other has none, the node without changes gets updated
    after-sb-0pri      discard-zero-changes;
    
    # The resource is in the Primary role on one host(1pri)
    after-sb-1pri     discard-secondary;
    
    # The resource is in the Primary role on both hosts
    # drop the connection and continue in disconnected
    after-sb-2pri     disconnect;
    
    ...
  }
  ...
}

Manual split-brain recovery:

[Lab 1]

Nodes: cs1 (originally Primary) and cs2 (originally Secondary). After cs1 reboots:

drbdadm status

r0 role:Secondary
  disk:UpToDate
  drbd-b connection:StandAlone

dmesg

... drbd r0/0 drbd1: Split-Brain detected but unresolved, dropping connection!

Cause: cs2 has become Primary

Resolution

On cs2 (Primary), run

drbdadm connect r0

r0 role:Primary
  disk:UpToDate

becomes

r0 role:Primary
  disk:UpToDate
  peer connection:Connecting

On cs1 (Secondary), run

drbdadm connect --discard-my-data r0

After that, data is read from cs2 and written to cs1 (cs2 is the sync source, cs1 the sync target).

[Lab 2] link down

# On Hypervisor

domif-setlink cs2 --interface cs2-eth1 --state down

# On cs1

drbdadm primary r0                   # no error

=> only with the network link disconnected can the Secondary be promoted to Primary

=> this is where the split brain begins

[Lab 3] dual primary

# On cs1

mount /dev/drbd0 /home/cluster

echo test2 > /home/cluster/test.txt

# On Hypervisor

domif-setlink cs2 --interface cs2-eth1 --state up

Since cs1 and cs2 are both Primary, each runs independently on its own diverged data.

[Lab 4] Decide to treat the data on cs2 as authoritative

# On cs1

umount /home/cluster

drbdadm secondary r0
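
To finish the recovery, the same reconnect steps as in Lab 1 apply; a sketch:

# On cs1 (its data will be discarded and overwritten from cs2)
drbdadm connect --discard-my-data r0

# On cs2, if its connection ended up in StandAlone state
drbdadm connect r0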

 


Using LVM with DRBD

 

  • using LVM Logical Volumes as backing devices for DRBD;
  • using DRBD devices as Physical Volumes for LVM;

Benefit of DRBD over LVM:

automated LVM snapshots during DRBD synchronization

resource r0 {
  handlers {
    before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
    after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
  }
}

lvm over drbd

Disable the LVM cache and config filter setting:

# make sure that LVM detects PV signatures on stacked resources
filter = [ "a|drbd1[0-9]|", "r|.*|" ]

# disable writing out of the persistent filter cache file
write_cache_state = 0

rm /etc/lvm/cache/.cache

update-initramfs -u     # rebuild the initrd so it uses the adjusted lvm.conf
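
With the filter in place, the DRBD device can be initialized as a PV on the current Primary; a sketch using a device name that matches the filter above and hypothetical VG/LV names:

pvcreate /dev/drbd10
vgcreate vg_drbd /dev/drbd10
lvcreate -n lv_data -L 10G vg_drbd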

P.S.

write_cache_state:

LVM labels and volume group (VG) metadata are stored inside the physical volume.
At system startup, "vgscan" scans the system's block devices looking for LVM labels.

The names of the PVs found are cached in "/etc/lvm/.cache",

which improves performance because the block devices do not need to be re-scanned.

 


Tuning Recommendations

 

Disk Buffer (max-buffers & max-epoch-size)

# Affect write performance on the secondary node

# The default for both is 2048

  • max-buffers
    Limits the memory usage per DRBD minor device on the receiving side,
    or for internal buffers during resync or online-verify. Unit is PAGE_SIZE (usually 4 KiB), so the 8000 used below is roughly 32 MB.
  • max-epoch-size        # number of write requests permitted between two write barriers
resource <resource> {
  net {
    # allocates for writing data to disk
    max-buffers 8000;

    # requests permitted between two write barriers
    max-epoch-size 8000;
    ...
  }
  ...
}

TCP send buffer auto-tuning (sndbuf-size)

Default: 128 KiB; 0 = auto-tuning

For Gigabit Ethernet links, values larger than 2M may be worth trying.

resource <resource> {
  net {
    sndbuf-size 4M;
    rcvbuf-size 4M;
  }
  ...
}

Fairly large activity log

gives better write performance by reducing the amount of metadata disk write operations.

The trade-off is an extended resynchronization time after a primary node crash.

resource <resource> {
  disk {
    al-extents 3389;
    ...
  }
  ...
}

Disabling barriers and disk flushes

Only for systems equipped with a battery-backed write cache (BBWC).

resource <resource> {
  disk {
    disk-barrier no;
    disk-flushes no;
    ...
  }
  ...
}

CPU mask

CPU mask of 1 (00000001) means DRBD may use the first CPU only.

A mask of 12 (00001100) implies DRBD may use the third and fourth CPU.

resource <resource> {
  options {
    # 12 => 00001100 (use third and fourth CPU)
    cpu-mask 12;
    ...
  }
  ...
}

P.S.

Configure applications to use only those CPUs which DRBD does not use

(e.g. with the "taskset" command in an application init script).

Enabling Jumbo frames

ifconfig <interface> mtu <size>

Or

ip link set <interface> mtu <size>

Enabling the deadline scheduler

Guarantees a start service time for each request.

Deadline queues are basically sorted by their deadline

Read queues are given a higher priority

echo deadline > /sys/block/<device>/queue/scheduler
echo 0 > /sys/block/<device>/queue/iosched/front_merges
echo 150 > /sys/block/<device>/queue/iosched/read_expire
echo 1500 > /sys/block/<device>/queue/iosched/write_expire
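
The active scheduler can be verified afterwards (it is shown in square brackets):

cat /sys/block/<device>/queue/scheduler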

 


Keep Alive Checking

 

timeout N

If the peer does not respond to a request within N/10 seconds, it is considered dead.

Default: 60 (6 seconds). The unit is 0.1 seconds.

This must be lower than connect-int and ping-int.

connect-int N

When the connection is lost, DRBD keeps retrying the connection every N seconds.

The default value is 10 seconds.

ping-int N

If the TCP/IP connection linking a DRBD device pair is idle for more than N seconds,

DRBD will generate a keep-alive packet to check if its partner is still alive.

The default is 10 seconds.

ping-timeout N

The peer must answer the keep-alive packet within N/10 seconds, otherwise it is considered dead.

Default: 5 (500 ms). The unit is 0.1 seconds.
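
A sketch of a net section spelling these out explicitly, using the default values quoted above:

net {
    timeout       60;   # 6 s; unit 0.1 s; must be lower than connect-int and ping-int
    connect-int   10;   # seconds
    ping-int      10;   # seconds
    ping-timeout   5;   # 500 ms; unit 0.1 s
}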

 


 

 
