drbd - Configuration

Last updated: 2021-02-10

 


 


Network Setting (Firewall)

 

* Each resource needs its own dedicated TCP port (7788, 7789, 7790 ...)

* The connection still comes up if only one side opens the firewall port (i.e. drbd-a opens the port, drbd-b does not), but it is best to open it on both sides!

My Setup

drbd-a(eth0)\
 (eth1)      \
   |          \
   |        Switch-Router
   |          /
 (eth1)      /
drbd-b(eth0)/

drbd-a

  • eth0: 192.168.88.31
  • eth1: 10.0.0.1/30

drbd-b

  • eth0: 192.168.88.32
  • eth1: 10.0.0.2/30

On both nodes

firewall-cmd --permanent --add-rich-rule='rule family="ipv4"
  source address="10.0.0.0/30" port port="7789" protocol="tcp" accept'

firewall-cmd --reload
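
Since each resource needs its own TCP port, a multi-resource setup can open the whole range in one rule; a sketch, assuming three resources on ports 7788-7790:

firewall-cmd --permanent --add-rich-rule='rule family="ipv4"
  source address="10.0.0.0/30" port port="7788-7790" protocol="tcp" accept'

firewall-cmd --reload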

Remark

status "peer connection:Connecting" 時才見到 LISTEN, 當 Connect 後是見唔到的.

netstat -ntlp | grep 7789

tcp        0      0 10.0.0.1:7789           0.0.0.0:*               LISTEN      -

replication:Established 後

watch -n1 'netstat -ntp'

tcp        0      0 10.0.0.2:53550          10.0.0.1:7789           ESTABLISHED -
tcp        0      0 10.0.0.2:36675          10.0.0.1:7789           ESTABLISHED -

OS Preparation

 

/etc/sysconfig/network-scripts/ifcfg-eth1

...
IPADDR=10.0.0.1
NETMASK=255.255.255.252
MTU=9000
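
With MTU=9000 on eth1, it is worth checking that jumbo frames really pass between the two nodes; a quick sketch from drbd-a (8972 = 9000 minus 28 bytes of IP/ICMP header):

ping -M do -s 8972 -c 3 10.0.0.2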

/etc/hostname

drbd-a

/etc/hosts

# drbd host
192.168.88.31    drbd-a
192.168.88.32    drbd-b

/etc/sysconfig/grub

GRUB_CMDLINE_LINUX=" ... ipv6.disable=1 net.ifnames=0 biosdevname=0"

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

reboot

 


Basic Setup (Single-primary mode)

 

The file format was designed to allow a verbatim copy of the file to be kept on both nodes of the cluster.

Create the configuration files:

/etc/drbd.conf

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

global_common.conf

# V8
global {
  # usage statistics for the DRBD project
  usage-count no;
}
common {
  net {
    protocol B;
  }
}

MyResource.res

# resource "r0" with a single volume 0

resource r0 {
  on drbd-a {
    # the device name can also be given in the form drbd_X
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on drbd-b {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}

Notes

Resource

Any resource is a replication group consisting of one or more volumes that share a common replication stream.

In DRBD, every resource has a role, which may be Primary or Secondary.

Hostname

The hostname of each machine must match the name used here, as returned by uname -n
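
A quick sketch for checking and, if needed, fixing the hostname (run on each node with its own name):

uname -n                          # must print drbd-a / drbd-b exactly as written in the .res file
hostnamectl set-hostname drbd-a   # example for drbd-a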

address

The IP address and port used to wait for incoming connections from the partner and, respectively, to reach the partner.

Internal meta data

Meta data is stored at the end of the same backing device.

Drawback:

Write performance suffers (two additional movements of the read/write head of the hard disk)
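
If the extra head movement matters, the meta data can instead be kept on a separate device (external meta data); a minimal sketch, assuming a dedicated partition /dev/sdc1 exists on both nodes:

on drbd-a {
  device    /dev/drbd1;
  disk      /dev/sdb1;
  address   10.0.0.1:7789;
  meta-disk /dev/sdc1;     # external meta data instead of "internal"
}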

Protocol

Protocol = replication modes

Protocol A - Asynchronous replication protocol

Local write operations on the primary node are considered completed as soon as the local disk write has finished

Suited to long-distance replication scenarios.

Protocol B - Memory synchronous replication protocol(semi-synchronous)

Local write operations on the primary node are considered completed as soon as the local disk write has occurred

and the replication packet has reached the peer node.

Data loss is possible only in the event of simultaneous power failure on both nodes and concurrent, irreversible destruction of the primary's data store.

Protocol C - Synchronous replication protocol

Local write operations are considered completed only after both the local and the remote disk write(s) have been confirmed.

Remark

1. Setting the protocol for an individual resource

resource another-resource-name {
  net {
    protocol C;
    ...
  }
  ...
}

2. Single resource with multiple volumes

MyResource.res

resource r0 {
  volume 0 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    meta-disk internal;
  }
  volume 1 {
    device    /dev/drbd2;
    disk      /dev/sdc1;
    meta-disk internal;
  }
  on drbd-a {
    address   10.0.0.1:7789;
  }
  on drbd-b {
    address   10.0.0.2:7789;
  }
}

 


Creating a Resource

 

Steps:

  1. Create device(sdb1) metadata (On Both Node)
  2. Attach to backing device (On Both Node)
  3. Start the initial full synchronization (On Primary Node)

1. Create the metadata for device (sdb1)        # On both nodes

# CLI: drbdadm create-md <resource_name>   # md = metadata

i.e.

drbdadm create-md r0

initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.

blkid /dev/sdb1        # produces no output !!

drbdadm status

# No currently configured DRBD found.

drbdadm create-md r0

You want me to create a v09 style flexible-size internal meta data block.
There appears to be a v09 flexible-size internal meta data block
already in place on /dev/vdb1 at byte offset 10736365568

Do you really want to overwrite the existing meta-data?
[need to type 'yes' to confirm]

2. Bring up the resource:                 # On both nodes

drbdadm up <resource>

# roughly equivalent to running the following 2 steps:

  1. drbdadm attach <resource>      # a brand-new resource cannot run this cmd directly
  2. drbdadm connect <resource>

See: drbd_cli

i.e. run on drbd-a

drbdadm up r0

lsblk

vdb             252:16   0   10G  0 disk
└─vdb1          252:17   0   10G  0 part
  └─drbd1       147:1    0   10G  0 disk

drbdadm status

r0 role:Secondary
  disk:Inconsistent
  drbd-b connection:Connecting

connection: StandAlone -> Unconnected -> Connecting

If the peer is reached but the connection fails because the settings do not match -> StandAlone

After fixing the settings, to retry you must run down and then up once on both nodes

After a successful connection:

r0 role:Secondary
  disk:Inconsistent
  cs2 role:Secondary
    peer-disk:Inconsistent

3. Start the Initial full synchronization       # On ONE node only

 * On initial resource configuration

 * On the node you selected as the synchronization source

drbdadm primary --force <resource>

i.e.

# On primary node (drbd-a)

drbdadm primary --force r0

drbdadm status

r0 role:Primary
  disk:UpToDate
  drbd-b role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:2.39

After some time

drbdadm status

r0 role:Primary
  disk:UpToDate
  drbd-b role:Secondary
    peer-disk:UpToDate
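
Once both sides report UpToDate, the Primary can put a filesystem on the DRBD device and mount it; a sketch, assuming XFS and a /mnt/drbd mount point (both are assumptions):

# On primary node (drbd-a)

mkfs.xfs /dev/drbd1

mkdir -p /mnt/drbd
mount /dev/drbd1 /mnt/drbd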

Take down the resource:

drbdadm down resource   # this also demotes the resource

# equivalent to

  1. drbdadm disconnect resource
  2. drbdadm detach resource

i.e.

drbdadm down r0

lsblk

vdb             252:16   0   10G  0 disk
└─vdb1          252:17   0   10G  0 part

 


Checking Status

 

drbdadm status [resource]

Or

cat /proc/drbd

# the following information is only available in V8

version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2020-04-05 02:58:18

 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r-----
    ns:55908 nr:0 dw:0 dr:31456284 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
  • ap (application pending).     Number of block I/O requests forwarded to DRBD, but not yet answered by DRBD.
  • ep (epochs).                       Number of epoch objects. Usually 1.
                                             Might increase under I/O load when using either the barrier or the none write ordering method.
    ----
  • cs (connection state)          # drbdadm cstate r0
  • ro (roles)                           # drbdadm role r0
  • ds (disk states).                 # drbdadm dstate r0
    -----
  • ns (network send).      # Volume of net data sent to the partner via the network connection; in Kibyte.
  • nr (network receive).   # Volume of net data received by the partner via the network connection; in Kibyte.
    ----
  • dw (disk write).      # Net data written on local hard disk; in Kibyte.
  • dr (disk read).       # Net data read from local hard disk; in Kibyte.
    ----
  • al (activity log).         # Number of updates of the activity log area of the meta data.
  • bm (bit map).           # Number of updates of the bitmap area of the meta data.
  • lo (local count).        # Number of open requests to the local I/O sub-system issued by DRBD.
    ----
  • pe (pending).               # Number of requests sent to the partner, but that have not yet been answered by the latter.
  • ua (unacknowledged).  # Number of requests received by the partner via the network connection, but that have not yet been answered.
  • wo (write order).         # Currently used write ordering method: b (barrier), f (flush), d (drain) or n (none).
  • oos (out of sync).        # Amount of storage currently out of sync; in Kibibytes.

State

Connection state

drbdadm cstate <resource>

  • StandAlone          # No network configuration available.  (drbdadm disconnect, failed authentication, split brain)
  • Connected
  • Timeout
  • Unconnected
  • WFConnection      # This node is waiting until the peer node becomes visible on the network.
  • WFReportParams  # TCP connection has been established, this node waits for the first network packet from the peer.
  • StartingSyncS      # Full synchronization, initiated by the administrator, is just starting
  • WFBitMapS          # Partial synchronization is just starting
    ------------
  • SyncSource        # the local node being the source of synchronization.
  • SyncTarget         # the local node being the target of synchronization
    replication:SyncSource peer-disk:Inconsistent done:7.02
  • PausedSyncS      # drbdadm pause-sync
  • VerifyS               # On-line device verification is currently running (source of verification)
  • VerifyT               # On-line device verification is currently running (target of verification)

Disk states

drbdadm dstate <resource>

  • UpToDate
  • Negotiating
  • DUnknown: This state is used for the peer disk if no network connection is available.
  • Diskless (detached using drbdadm detach)
  • Attaching
  • Failed
  • Inconsistent
  • Outdated
  • Consistent: Consistent data of a node without connection.

 


Enabling and Disabling Resources

 

up

drbdadm up <resource>

down

drbdadm down <resource>

Remark

You can use all to refer to all resources

i.e.

drbdadm down all

 


Change the Primary Node (Manual Failover)

 

[Lab 1] Change Over (V8)

# On node 1 (orig: primary node)

drbdadm role r0           # Shows the current roles of the devices (e.g. "Primary/Secondary")

umount /dev/drbd0

drbdadm secondary r0

drbdadm status

r0 role:Secondary
  disk:UpToDate
  drbd-b role:Secondary
    peer-disk:UpToDate

# On node 2 (secondary)

drbdadm primary r0

mount /dev/drbd0 /home/drbd

...

P.S.

* When the "connection state" is "Connected", running "drbdadm primary <resource>" on node 2 returns an error

   because node 1 has not yet run "drbdadm secondary r0"

[Lab 2] cs2 failover to cs1 (V8)

# On cs2 (Orig: primary node)

reboot

# On cs1 (Orig: secondary)

drbdadm status

r0 role:Secondary
  disk:UpToDate
  peer connection:Connecting

drbdadm primary r0                 # not strictly required, since mounting promotes it to Primary directly

Remark

# without running "drbdadm primary r0" first

mount /dev/drbd1 /var/lib/lxc

mount: /dev/drbd1 is write-protected, mounting read-only
mount: mount /dev/drbd1 on /var/lib/lxc failed: Wrong medium type

# On cs2 (Secondary)

drbdadm up r0

Marked additional 12 MB as out-of-sync based on AL.

drbdadm status

r0 role:Secondary
  disk:UpToDate
  peer connection:Connecting

...

r0 role:Secondary
  disk:UpToDate

dmesg

... block drbd1: Split-Brain detected but unresolved, dropping connection!
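
When dmesg reports split brain like this, one manual way out is to pick a survivor and discard the other node's changes; a sketch, assuming cs1's data is kept and cs2's changes are thrown away:

# On cs2 (split-brain victim, its changes are discarded)

drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On cs1 (survivor, only needed if it is StandAlone)

drbdadm connect r0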

 


Reconfiguring Settings / Resources

 

On both nodes:

  1. synchronize your /etc/drbd.conf
  2. drbdadm adjust <resource>

P.S.

# always dry-run (-d) first to check for errors

drbdadm -d adjust <resource>

i.e.

drbdadm adjust all

Remark

When the common section of /etc/drbd.conf is updated, run "drbdadm adjust all" from the CLI
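
A sketch of the full round trip, assuming the edit was made on drbd-a and the resource files live in /etc/drbd.d/:

# On drbd-a

scp /etc/drbd.d/*.res /etc/drbd.d/global_common.conf drbd-b:/etc/drbd.d/

# On both nodes

drbdadm -d adjust all      # dry-run first
drbdadm adjust all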

 


Online device verification

 

Diagram:

VerifyS -sha1-> VerifyT

Configuration:

global_common.conf

common {
    net {
        # hash algorithm used for the check. Default off. sha1 | md5
        verify-alg <algorithm>;
    }
}

OR

# MyResource.res

resource <resource> {
  net {
    verify-alg <algorithm>;
  }
  ...
}

The hashing is done by the CPU (hardware)

...

Run the check:

# when blocks are out of sync, a kernel log message is written

drbdadm verify <resource | all>

i.e.

drbdadm verify all

# when the local node has no "verify-alg" set

r0: State change failed: (-14) Need a verify algorithm to start online verify
Command 'drbdsetup verify r0 0 0' terminated with exit code 11

# when the remote node has no "verify-alg" set

r0: State change failed: (-10) State change was refused by peer node
Command 'drbdsetup verify r0 0 0' terminated with exit code 11

dmesg

... drbd r0: Preparing cluster-wide state change 999491511 (1->0 496/288)
... drbd r0: State change 999491511: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
... drbd r0: Committing cluster-wide state change 999491511 (1ms)
... drbd r0/0 drbd1 drbd-a: repl( Established -> VerifyS )
... drbd r0/0 drbd1 drbd-a: Starting Online Verify from sector 0
... drbd r0/0 drbd1 drbd-a: Starting Online Verify from sector 48
... after some time
... drbd r0/0 drbd1 drbd-a: Online verify done (total 140 sec; paused 0 sec; 74888 K/sec)
... drbd r0/0 drbd1 drbd-a: repl( VerifyS -> Established )

Remark:

Regardless of whether the node is Primary or Secondary, the side that runs the CLI becomes VerifyT, and the other side becomes VerifyS

dstat -c -n -N eth1 -d -D vdb

----total-cpu-usage---- --net/eth1- --dsk/vdb--
usr sys idl wai hiq siq| recv  send| read  writ
  1  15  85   0   0   0|1696k 3063k|  87M    0
  0   9  91   0   0   1|1161k 2052k|  61M    0
  0   9  91   0   0   0|1486k 2547k|  72M    0
...

Test 2

On the Secondary

dd if=/dev/zero of=/dev/vdb1 bs=1M count=1

drbdadm status

r0 role:Secondary
  disk:UpToDate
  drbd-b role:Primary
    peer-disk:UpToDate

drbdadm verify all

dmesg

... drbd r0/0 drbd1 drbd-b: Out of sync: start=0, size=6240 (sectors)
... drbd r0/0 drbd1 drbd-b: Out of sync: start=7248, size=184 (sectors)
...
... drbd r0/0 drbd1 drbd-b: Online verify done (total 149 sec; paused 0 sec; 76912 K/sec)
... drbd r0/0 drbd1 drbd-b: Online verify found 244000 4k blocks out of sync!

drbdadm status

r0 role:Secondary
  disk:UpToDate                 <- ?!
  drbd-b role:Primary
    peer-disk:UpToDate

Resynchronize the marked blocks:

drbdadm disconnect r0

drbdadm connect r0

drbdadm status

r0 role:Secondary
  disk:Outdated
  drbd-b connection:Connecting
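
To verify regularly instead of by hand, a cron entry can run it off-hours; a sketch, assuming a weekly verify of r0 every Sunday at 00:42:

# /etc/cron.d/drbd-verify
42 0 * * 0    root    /sbin/drbdadm verify r0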

 


Sync Data Rate

 

Synchronization is distinct from device replication

Synchronization is necessary if the replication link has been interrupted for any reason

These settings control the bandwidth used during resynchronization (not normal replication)
(i.e. in the SyncTarget state, on the (inconsistent) receiver side)

V8.4 uses the variable rate by default

Setting

  • Fixed rate
  • Temporary boost
  • Variable sync rate

Fixed Rate

# 100MByte/second of IO bandwidth

common {
  disk {
    c-plan-ahead 0;
    resync-rate 100M;
  }
}

P.S.

"resync-rate 100M;" only takes effect when "c-plan-ahead 0;" is also set

1G network IO ~ 110 MiByte/s

Temporary speed boost:

drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M <resource>

Restore the original settings:

drbdadm adjust resource

Variable Rate:

dynamic resync speed controller setting: c-plan-ahead, c-fill-target, c-delay-target, c-max-rate, c-min-rate

resource <resource> {
  disk {
    # agility of the controller (Higher values => slower responses => 誤差大)
    # Default: 0. Unit: 0.1 seconds. It should be at least 5 times RTT.
    c-plan-ahead 20;

    # The most bandwidth that can be used by a resync.
    c-max-rate 100M;
    
    # c-fill-target and c-delay-target are mutually exclusive (pick one)
    # Only when fill_target is set to 0 the controller will use delay_target
    # c-fill-target = a constant amount of data fill_target
    # c-delay-target = constant delay time of delay_target along the path, unit: 0.1s
    c-fill-target 50M;

    # tells DRBD to use only up to min_rate for resync IO and
    # to dedicate all other available IO bandwidth to application requests.
    # 0 = disables the limitation
    c-min-rate 10M;
  }
}

 


Checksum-based Synchronization

 

Purpose

Reduces the data to resynchronize when a sector was re-written with identical contents while DRBD was in disconnected mode

How it works

When using checksum-based synchronization, rather than performing a brute-force overwrite of blocks marked out of sync,

DRBD reads blocks before synchronizing them and computes a hash of the contents currently found on disk.

Settings

# Default off, algorithm: sha1, md5, and crc32c
resource <resource> {
  net {
    csums-alg <algorithm>;
  }
  ...
}

 


Replication Traffic Integrity Checking (for debugging)

 

 * This feature is not intended for production use. Enable only if you need to diagnose data corruption problems.

# default off
resource <resource> {
  net {
    data-integrity-alg <algorithm>;
  }
  ...
}

 


I/O Errors (Disk error handling strategies)

 

Config

resource <resource> {
  disk {
    on-io-error <strategy>;
    ...
  }
  ...
}

Handling strategies

  • Passing on I/O errors (it is left to upper layers to deal with such errors)

(may result in a file system being remounted read-only)

 

  • Masking I/O errors (V8.4 Default)

DRBD transparently fetches the affected block from the peer node, over the network.
(carries out all subsequent I/O operations, read and write, on the peer node.)
 => service continues without interruption

strategy

detach (Default)

the node drops its backing device, and continues in diskless mode.

pass_on

This causes DRBD to report the I/O error to the upper layers.

On the primary node, it is reported to the mounted file system.

On the secondary node, it is ignored (because the secondary has no upper layer to report to).

call-local-io-error

Invokes the command defined as the local I/O error handler.

(defined in the resource’s handlers section)
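
A sketch of wiring up call-local-io-error, assuming a hypothetical notification script /usr/local/bin/drbd-io-error.sh:

resource r0 {
  handlers {
    local-io-error "/usr/local/bin/drbd-io-error.sh";
  }
  disk {
    on-io-error call-local-io-error;
  }
  ...
}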

 


Growing on-line

 

backing block devices can be grown while in operation (online)

On the node in the primary state:

drbdadm resize resource

* This triggers a synchronization of the new section.
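
A sketch of the common on-line case, assuming the backing device is an LVM logical volume /dev/vg0/drbd-disk:

# On both nodes: grow the backing device first

lvextend -L +2G /dev/vg0/drbd-disk

# On one node in the Primary role

drbdadm resize r0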

Growing off-line:

 "internal meta data" must be moved to the end of the grown device

on both nodes:

drbdadm dump-md resource > /tmp/metadata

# Adjust the size information (la-size-sect[sectors]) in the file /tmp/metadata

# Re-initialize the metadata area

drbdadm create-md resource

# Re-import the corrected meta data

drbdmeta_cmd=$(drbdadm -d dump-md test-disk)
${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata

# Re-enable

drbdadm up resource

 


Disabling backing device flushes

 

Only enable these options when the device has a battery-backed write cache (BBWC)

(while the controller still has battery power, it flushes the cached data to disk)

Once data is written to the cache it already counts as valid data, so a power failure is not a concern

If the device runs in write-through mode, this feature can also be enabled

resource <resource> {
  disk {
 
    # replicated data set
    disk-flushes no;
    
    # meta data
    md-flushes no;
    ...
  }
  ...
}

 


Dealing with hard drive failure

 

# Replacing a failed disk when using internal meta data

# Full synchronization of the new hard disk starts instantaneously and automatically.

drbdadm create-md <resource>
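
A sketch of the whole manual replacement flow on the affected node, assuming resource r0 and the new disk reappearing as /dev/sdb1:

drbdadm detach r0        # only if DRBD has not already detached it (on-io-error detach)

# ... physically replace the disk and recreate the partition ...

drbdadm create-md r0
drbdadm attach r0        # full sync from the peer starts automatically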

 

 


Using Truck based replication (Disk shipping)

 

Purpose:

Sync a large amount of data to a remote site in one go

Method:

By removing a hot-swappable drive from a RAID-1 mirror

Local node:

# after this cmd, the hard disk can be pulled out

# Create a consistent, verbatim copy of the resource’s data and its metadata.

drbdadm new-current-uuid --clear-bitmap <resource>

# after pulling out the hard disk

drbdadm new-current-uuid resource

Disk to remote node:

# after inserting the shipped hard disk, run the following cmd

drbdadm up <resource>

 

 
