drbd - Advanced

Last updated: 2021-02-10

 

Table of Contents

  • DRBD meta data
  • Generation Identifier (GI)
  • Activity Log
  • Measuring
  • Cluster File System: GFS2 or OCFS2
    OCFS2
    GFS2
  • Resolving Split Brain
  • Using LVM with DRBD
  • Tuning Recommendations
  • "Net" Setting

 


DRBD meta data

 

  • the size of the DRBD device
  • Generation Identifier (GI)
  • Activity Log (AL)
  • quick-sync bitmap
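
The meta data can be inspected with the standard drbd-utils commands. A minimal sketch, assuming a resource named r0 (dump-md requires the meta data not to be in use, i.e. the resource down or detached):

# initialize the meta data area on a fresh backing device (destructive, only for a new resource)
drbdadm create-md r0

# print the on-disk meta data (GIs, activity log, bitmap info) as text;
# bring the resource down first, e.g. "drbdadm down r0"
drbdadm dump-md r0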

 


Generation Identifier (GI)

 

* 4 sets of 8-byte UUIDs (Current UUID, Bitmap UUID, and two Historical UUIDs)

Current UUID (identical on both nodes => Connected and fully synchronized)

Purpose: determine the direction of synchronization when two nodes re-connect.

The following situations change the GI:

  • the initial device full sync,
  • a disconnected resource switching to the primary role,
  • a resource in the primary role disconnecting.

When two nodes (re-)connect, they compare their GIs and handle one of the following cases:

  1. Current UUIDs empty on both nodes
  2. Current UUID empty on one node
  3. Equal current UUIDs
  4. Bitmap UUID matches the peer's current UUID (normal background re-synchronization)
  5. Current UUID matches the peer's historical UUID (full background re-synchronization)
  6. Bitmap UUIDs match, current UUIDs do not (split brain: auto-recovery strategies)
  7. Neither current nor bitmap UUIDs match (split brain: waits for manual recovery)
  8. No UUIDs match
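
The GI tuple of a node can be printed with drbdadm. A sketch, assuming a resource named r0:

# print the data generation identifiers (current, bitmap and historical UUIDs)
drbdadm get-gi r0

# the same information together with a short explanation
drbdadm show-gi r0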

 


Activity Log

 

The activity log (AL), stored in the meta data area, keeps track of those blocks that have "recently" been written to.

If a temporarily failed node that was in active mode at the time of failure is synchronized,
only those hot extents highlighted in the AL need to be synchronized, rather than the full device.

The number of active extents (al-extents) is a configurable parameter:

# Many active extents => improves write throughput
# Few active extents => reduces synchronization time after active node failure and subsequent recovery

Suitable Activity Log size:

E = (R x T) / 4

R = sync rate, e.g. 30 MiByte/s
T = target resync time, e.g. 240 sec
4 = size of one AL extent in MiByte

E = (30 x 240) / 4 = 1800, then choose the next prime number => al-extents 1801

Quick-sync bitmap:

one bit represents a 4-KiB chunk of on-disk data;
if the bit is set => the block needs to be re-synchronized
(e.g. a 1-TiB device needs 2^28 bits = 32 MiB of bitmap)

 


Measuring

 

Measuring throughput

# TEST_LL_DEVICE = the backing (lower-level) block device of the DRBD resource
# WARNING: this overwrites data on that device
for i in $(seq 5); do
  dd if=/dev/zero of=$TEST_LL_DEVICE bs=512M count=1 oflag=direct
done

Measuring latency

dd if=/dev/zero of=$TEST_LL_DEVICE bs=512 count=1000 oflag=direct

# typical network latency on Gigabit Ethernet links: 100us - 200us
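
To estimate how much latency replication itself adds, the same write test can be run against both the backing device and the DRBD device on top of it. A minimal sketch; the device paths are assumptions and must match your setup, and the raw-device run should only be done while the resource is down (or against a scratch device), since writing underneath a live DRBD resource corrupts it:

TEST_LL_DEVICE=/dev/sdb1     # backing (lower-level) device  -- assumption
TEST_DEVICE=/dev/drbd0       # DRBD device on top of it      -- assumption

# compare the times reported by dd; the difference is roughly the latency added by DRBD
for dev in "$TEST_LL_DEVICE" "$TEST_DEVICE"; do
  echo "== $dev =="
  dd if=/dev/zero of="$dev" bs=512 count=1000 oflag=direct
done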

 


Cluster File System: GFS2 or OCFS2

 

OCFS2

 

DLM (Distributed Lock Manager): ocfs2_dlmfs (separate from the actual OCFS2 file systems)

Infrastructure: Pacemaker

 * The resource is always primary on both nodes, so any interruption in the network between them will result in a split brain.

Configuration: Dual-primary mode

resource r0 {
  startup {
    # on startup, both nodes automatically become Primary
    become-primary-on both;
  }
  net {

    # Dual-primary mode requires "protocol C"
    protocol C;

    # fencing policy used to guard against split brain
    fencing resource-and-stonith;

    # enable this only after the initial resource synchronization has completed
    allow-two-primaries;

    # automatic split brain recovery policies (explained in the split brain section below)
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
}

fencing

dont-care (the worst choice)

This is the default policy. No fencing actions are taken.

resource-only

If a node becomes a disconnected primary, it tries to fence the peer's disk.

This is done by calling the fence-peer handler.

The handler is supposed to reach the other node over alternative communication paths and

call 'drbdadm outdate res' there.

resource-and-stonith

If a node becomes a disconnected primary, it freezes all its IO operations and calls its fence-peer handler.

In case it cannot reach the peer it should stonith the peer. IO is resumed as soon as the situation is resolved.

In case your handler fails, you can resume IO with the "resume-io" command.

Then bring the resource up in dual-primary mode; on both nodes:

  1. drbdadm disconnect resource
  2. drbdadm connect resource
  3. drbdadm primary resource
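
Both nodes should now report the Primary role. A quick check (the resource name r0 is an assumption; /proc/drbd applies to DRBD 8.x):

drbdadm role r0      # expected: Primary/Primary
cat /proc/drbd       # expected: ro:Primary/Primary ds:UpToDate/UpToDate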

Create the file system:

# filesystem label = ocfs2_drbd0

mkfs -t ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0
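
For a quick manual test outside Pacemaker control, the file system can then be mounted on both nodes. This is only a sketch and assumes the OCFS2 cluster stack (o2cb) is already online and the mount point exists; in production the mount is normally managed by a Pacemaker Filesystem resource:

# run on both nodes
mkdir -p /mnt/ocfs2
mount -t ocfs2 /dev/drbd0 /mnt/ocfs2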

P.S.

 * All cluster file systems require fencing

disk {
        fencing resource-and-stonith;
}
handlers {
        outdate-peer "/sbin/make-sure-the-other-node-is-confirmed-dead.sh"
}

* DRBD does not promote the surviving node to the primary role;

  ( this is the responsibility of the cluster manager )

 


Resolving Split Brain

 

Configure notification:

  handlers {
    # send e-mail notification
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }

Automatic resolution:

  net {
    # Automatic split brain recovery policies
    
    # Split brain has just been detected, 
    # but at this time the resource is not in the Primary role on any host
    # if one side has changes and the other has none, the side without changes is updated (it adopts the other side's changes)
    after-sb-0pri      discard-zero-changes;
    
    # The resource is in the Primary role on one host(1pri)
    after-sb-1pri     discard-secondary;
    
    # The resource is in the Primary role on both hosts
    # drop the connection and continue in disconnected
    after-sb-2pri     disconnect;
    
    ...
  }

Manual resolution:

# on victim node

drbdadm secondary resource
drbdadm -- --discard-my-data connect resource

# on survivor node

# if cs is WFConnection there is nothing to do; if it is StandAlone, run the following command
drbdadm connect resource
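
The connection state (cs) mentioned above can be checked on either node:

drbdadm cstate resource    # e.g. StandAlone, WFConnection, Connected
cat /proc/drbd             # DRBD 8.x: shows cs:, ro: and ds: for every resource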

 


Using LVM with DRBD

 

  • using LVM Logical Volumes as backing devices for DRBD;
  • using DRBD devices as Physical Volumes for LVM;

Benefit of DRBD over LVM:

automated LVM snapshots during DRBD synchronization

resource r0 {
  handlers {
    before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
    after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
  }
}

LVM over DRBD

Disable the LVM cache and configure the filter setting:

# make sure that LVM detects PV signatures on stacked resources
filter = [ "a|drbd1[0-9]|", "r|.*|" ]

# disable writing out of the persistent filter cache file
write_cache_state = 0

rm /etc/lvm/cache/.cache

update-initramfs -u    # rebuild the initial ramdisk so it picks up the new lvm.conf
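
With the filter in place, the DRBD device can be initialized as a physical volume. A minimal sketch; the device name /dev/drbd10 (matching the filter above) and the VG/LV names are assumptions:

# run on the node where the resource is currently Primary
pvcreate /dev/drbd10
vgcreate vg_replicated /dev/drbd10
lvcreate -L 10G -n lv_data vg_replicated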

P.S.

write_cache_state:

LVM labels and volume group (VG) metadata are stored inside the physical volumes.
On system startup, "vgscan" scans the system's block devices looking for LVM labels.

The names of the PVs found are cached in "/etc/lvm/cache/.cache",
which improves performance because the block devices do not have to be re-scanned every time.

 


Tuning Recommendations

 

Disk Buffer (max-buffers & max-epoch-size)

# Affect write performance on the secondary node

# The default for both is 2048

  • max-buffers           # the buffers DRBD allocates for writing data to disk (unit: PAGE_SIZE)
  • max-epoch-size        # the number of write requests permitted between two write barriers

resource <resource> {
  net {
    # allocates for writing data to disk
    max-buffers 8000;

    # requests permitted between two write barriers
    max-epoch-size 8000;
    ...
  }
  ...
}

TCP send buffer auto-tuning (sndbuf-size)

Default: 128 KiB; 0 = auto-tuning

Gigabit Ethernet -> 2M

resource <resource> {
  net {
    sndbuf-size 2M;
    ...
  }
  ...
}

Fairly large activity log

Improves performance by reducing the amount of metadata disk write operations.

The trade-off is an extended resynchronization time after a primary node crash.

resource <resource> {
  disk {
    al-extents 3389;
    ...
  }
  ...
}

Disabling barriers and disk flushes

Only for systems equipped with a battery-backed write cache (BBWC).

resource <resource> {
  disk {
    disk-barrier no;
    disk-flushes no;
    ...
  }
  ...
}

CPU mask

A CPU mask of 1 (00000001) means DRBD may use the first CPU only.

A mask of 12 (00001100) implies DRBD may use the third and fourth CPU.

resource <resource> {
  options {
    # 12 => 00001100 (use third and fourth CPU)
    cpu-mask 12;
    ...
  }
  ...
}

P.S.

Pin applications to only those CPUs which DRBD does not use.

(use the "taskset" command in an application init script)

Enabling Jumbo frames

ifconfig <interface> mtu <size>

Or

ip link set <interface> mtu <size>
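
For example, assuming the dedicated replication link is eth1 and the switch supports jumbo frames:

ip link set eth1 mtu 9000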

Enabling the deadline scheduler

The deadline scheduler guarantees a start service time for each request.

Deadline queues are basically sorted by their deadline (expiration time).

Read queues are given a higher priority, because processes usually block on read operations.

echo deadline > /sys/block/<device>/queue/scheduler
echo 0 > /sys/block/<device>/queue/iosched/front_merges
echo 150 > /sys/block/<device>/queue/iosched/read_expire
echo 1500 > /sys/block/<device>/queue/iosched/write_expire
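
The active scheduler can be verified afterwards (the entry shown in square brackets):

cat /sys/block/<device>/queue/scheduler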

 


"net" Setting

 

# The default value is 60 (6 seconds); the unit is 0.1 seconds.

# If the partner node fails to send an expected response packet within this time,
# it is considered dead and therefore the TCP/IP connection is abandoned.

# This must be lower than connect-int and ping-int.

timeout       (unit: 0.1 sec)

# The default value is 10 seconds.

# Sets the time between two connection retries.

connect-int   (unit: sec)

# The default is 10 seconds.

# If the connection is idle for longer than this, DRBD will generate a keep-alive
# packet to check if its partner is still alive.

ping-int      (unit: sec)

# The default value is 5 (500 ms); the unit is 0.1 seconds.

# The time the peer has to answer a keep-alive packet.

ping-timeout  (unit: 0.1 sec)
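
The values actually in effect can be double-checked on a running system; a sketch, assuming a resource named r0:

drbdadm dump r0      # the configuration as parsed from the config files
drbdsetup show       # the configuration currently active in the kernel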