The Most Powerful FileSystem - BTRFS

Last updated: 2018-04-09

 

Introduction

 

btrfs is a versatile filesystem. Its core architecture is CoW (copy-on-write), and it supports the following features:

  • Extent based file storage (2^64 max file size)
  • Dynamic inode allocation
  • Writable snapshots
  • Subvolumes
  • Object level mirroring and striping
  • Compression (gzip and LZO)
  • Online filesystem check
  • Online filesystem defragmentation
  • Integrated multiple device support (RAID-0, RAID-1 and RAID-10 )
  • Checksums on data and metadata
  • Space-efficient packing of small files
  • Space-efficient indexed directories
  • Seed devices
  • Background scrub process for finding and fixing errors on files with redundant copies

In short, you would be missing out by not using it ~

However, be careful before using it, because on Linux 2.6 there is still no working fsck !! (The btrfsck in Linux 3.2 is still not usable..)

Also, if the CPU supports hardware CRC32, performance will be even better.

The following is based on Btrfs v0.19.

 

Contents

  • Caveats
  • New-style commands
  • Kernel threads
  • Recovery
  • Build btrfs-progs
  • extref(extended inode refs)
  • More about btrfs
  • Creating a multi-device btrfs (mkfs.btrfs)
  • Conversion (single -> raid1)
  • Replacing failed devices
  • Play btrfs with image file
  • btrfs-send / btrfs-receive
  • GlobalReserve
  • XOR module
  • Common use cases
     

DOC

https://btrfs.wiki.kernel.org/index.php/Main_Page

 


btrfs commands

 

btrfs commands now come in two styles: the old style and the new style.

New style:

  • btrfs [option]                     # yes, the new style is just one command, accompanied by a pile of options

Old style:

  • btrfsctl
  • btrfs-show
  • btrfstune

 


checking & mount & fstab

 

scan

This is required after loading the btrfs module if you're running with more than one device in a filesystem.

btrfs device scan /dev/sdd1 /dev/sde1

Scanning for Btrfs filesystems in '/dev/sdd1'
Scanning for Btrfs filesystems in '/dev/sde1'

dmesg

[454919.695156] btrfs: device label raid4t devid 1 transid 4 /dev/sdd1
[454919.705341] btrfs: device label raid4t devid 2 transid 4 /dev/sde1

# Show the structure of a filesystem

btrfs filesystem show /dev/sdd1

Label: raid4t  uuid: da8c8ae3-163b-47ba-aafc-6ec0199df818
        Total devices 2 FS bytes used 640.00KiB
        devid    1 size 3.64TiB used 2.03GiB path /dev/sdd1
        devid    2 size 3.64TiB used 2.01GiB path /dev/sde1

# Show space usage information for a mount point

btrfs filesystem df /data/raid4t

Data, RAID1: total=1.00GiB, used=512.00KiB
Data, single: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

df -h

/dev/sdd1             7.3T  4.0G  7.3T   1% /data/raid4t

 * BTRFS isn't actually mirroring disks like "raid 1"; it stores at least two copies of the data.

 

fstab

/etc/fstab

If the boot process doesn't perform a "btrfs device scan [/dev/sd?]",

you can still mount a multi-volume btrfs filesystem by passing all the devices in the filesystem explicitly to the mount command:

/dev/sdb     /mnt    btrfs    device=/dev/sdb1,device=/dev/sdc2,device=/dev/sdd1,device=/dev/sde1    0 0

 

Setting the default mount volume

List the existing volumes:

btrfs sub list /mnt/sda8

ID 256 top level 5 path vps

Set the default:

btrfs sub set-default 256 /mnt/sda8

fstab:

# myserver
/dev/sda8       /var/lib/lxc/myserver/rootfs    btrfs   noatime,subvol=vps      0       0
/dev/sda8       /mnt/sda8                       btrfs   noatime,subvolid=0      0       0

Specifying which sub-volume is used as the root directory when mounting the device

mount:

mount -t btrfs -o subvol=vm.13-9-11 /dev/sda7 /mnt/tmp

 

Mount Options

noatime

device

# mount a multi-volume btrfs filesystem by passing all the devices in the filesystem

device=/dev/sdd1,device=/dev/sde1

subvol

label

subvol=home

id

subvolid=0    # 0 = top level

autodefrag (since 3.0)

Will detect random writes into existing files and kick off background defragging.

It is well suited to bdb or sqlite databases, but not virtualization images or big databases (yet).

commit=number (since 3.12)

Set the interval of periodic commit, 30 seconds by default.

check_int (since 3.3)

Switch on integrity checker for metadata

check_int_data (since 3.3)

Switch on integrity checker for data and metadata

compress

compress=zlib - Better compression ratio. It's the default and safe for older kernels.

compress=lzo - Faster compression.

compress=no - Disables compression (starting with kernel 3.6).

space_cache (since 2.6.37)

Btrfs stores the free space data on-disk to make the caching of a block group much quicker.

If enabled, the kernel keeps the addresses of the filesystem's free-space blocks in memory,

so when you create a new file it can immediately start writing data to disk.

Without this, Btrfs has to scan the entire tree every time looking for the free space that can be allocated.

clear_cache (since 2.6.37)

Clear all the free space caches during mount.
(use it once, and only after you notice problems with free space)

degraded

Allow mounting even though devices are missing, when the raid level would otherwise require a certain number of devices for a successful mount.

discard

Enables discard/TRIM on freed blocks.

inode_cache (since 3.0)

Enable free inode number caching.
(This option may slow down your system at first run. )

recovery (since 3.2)

Enable autorecovery upon mount; currently it scans list of several previous tree roots and tries to use the first readable. The information about the tree root backups is stored by kernels starting with 3.2, older kernels do not and thus no recovery can be done.

nodatacow

Do not copy-on-write data for newly created files, existing files are unaffected. This also turns off checksumming! IOW, nodatacow implies nodatasum. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large. NOTE: switches off compression !
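To see how several of these options combine in practice, here is a minimal fstab sketch (the label, mount point and chosen options are illustrative assumptions, not a recommendation):

# hypothetical fstab entry combining a few of the options above
LABEL=raid4t   /data/raid4t   btrfs   noatime,compress=lzo,autodefrag,commit=60   0  0

# the same options for a one-off mount
mount -t btrfs -o noatime,compress=lzo,autodefrag,commit=60 LABEL=raid4t /data/raid4t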

DOC

https://btrfs.wiki.kernel.org/index.php/Mount_options

 


Caveats

 

Block-level copies of devices

Do NOT

    make a block-level copy of a Btrfs filesystem to another block device...
    use LVM snapshots, or any other kind of block level snapshots...
    turn a copy of a filesystem that is stored in a file into a block device with the loopback driver...

... and then try to mount either the original or the snapshot while both are visible to the same kernel.

Why?

If there are multiple block devices visible at the same time, and those block devices have the same filesystem UUID,

then they're treated as part of the same filesystem.

If they are actually copies of each other (copied by dd or LVM snapshot, or any other method),

then mounting either one of them could cause data corruption in one or both of them.

If you for example make an LVM snapshot of a btrfs filesystem, you can't mount either the LVM snapshot or the original,

because the kernel gets confused: it thinks it is mounting a Btrfs filesystem that consists of two disks,

and then runs into two devices which have the same device number.

Fragmentation

Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.

On servers and workstations this affects databases and virtual machine images.

On desktops this primarily affects application databases

You can use filefrag to locate heavily fragmented files (may not work correctly with compression).
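A quick way to hunt such files down is to combine find with filefrag; a minimal sketch (the path and the 1000-extent threshold are arbitrary assumptions):

# list files under /var/lib with more than 1000 extents (threshold is arbitrary)
find /var/lib -xdev -type f -size +1M -exec filefrag {} + | awk -F': ' '$2+0 > 1000'

# then defragment the worst offenders (recursively, flushing each file)
btrfs filesystem defragment -r -f /var/lib/mysql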

Having many subvolumes can be very slow

The cost of several operations, including currently balance and device delete, is proportional to the number of subvolumes,

including snapshots, and (slightly super-linearly) the number of extents in the subvolumes.

This is "obvious" for "pure" subvolumes, as each is an independent file tree and has independent extents anyhow (except for ref-linked ones).

But in the case of snapshots, metadata and extents are (usually) largely ref-linked with the ancestor subvolume,

so a full scan of the snapshot should not be necessary, but currently it still happens.
 

 



New-style commands

 

btrfs - control a btrfs filesystem

Version

btrfs version

btrfs-progs v4.4

help

btrfs help [--full]

btrfs subvolume --help

subvolume

# The following example creates a subvolume called data and sets it as the root used at mount time

#1 Preparation

mount /dev/sdg1 /mnt

cd /mnt

btrfs subvolume show .

/mnt/btrfs is toplevel subvolume

#2 Create a subvolume called data

btrfs subvolume create data

Remark: deleting a subvolume

btrfs subvolume delete <subvolume>

#3 List the IDs of all subvolumes (the newly created subvolume has ID 257)

btrfs subvolume list .

ID 257 gen 11 top level 5 path data

#4 Check the current default (root) ID

btrfs subvolume get-default .

ID 5 (FS_TREE)

#5 Set the subvolume with ID 257 as the default root

btrfs subvolume set-default 257 .

#6 Test

cd ..

umount /mnt

mount /dev/sdg1 /mnt

cd /mnt

btrfs subvolume show .

/mnt
        Name:                   data
        UUID:                   6bb748d2-21a2-ec4a-84b8-18d441bb6bd0
        Parent UUID:            -
        Received UUID:          -
        Creation time:          2019-01-03 23:13:14 +0800
        Subvolume ID:           257
        Generation:             11
        Gen at creation:        8
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):

#7 Mount the top level somewhere else (in order to take snapshots)

mount -o subvolid=5 /dev/sdg1 /path/to/mountpoint

Remark

toplevel subvolume: ID=5

#8 take snapshot

btrfs subvolume snapshot <source> [<dest>/]<name>
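For example, a read-only, dated snapshot of the data subvolume from step #2 could look like this (the mount point placeholder follows step #7; the naming scheme is an assumption):

btrfs subvolume snapshot -r /path/to/mountpoint/data /path/to/mountpoint/data-snap-$(date +%Y%m%d)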

 

filesystem - Manage a btrfs filesystem

show [device]

i.e.

btrfs filesystem show /dev/sdd1

Label: 'raid4t'  uuid: 0e100ef3-00c3-4761-90a4-965420a5c5fd
        Total devices 2 FS bytes used 112.00KiB
        devid    1 size 3.64TiB used 2.01GiB path /dev/sdd1
        devid    2 size 3.64TiB used 2.01GiB path /dev/sde1

-m|--mbytes
               show sizes in MiB

df <mount_point>

summary information about allocation of block group types of a given mount point

i.e.

btrfs filesystem df /data/raid4t

Data, RAID1: total=1.00GiB, used=512.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

GlobalReserve

It is an artificial and internal emergency space. It is used eg. when the filesystem is full.

Its total size is dynamic based on the filesystem size, usually not larger than 512MiB, used may fluctuate.

usage <path>

Show detailed information about internal filesystem usage.

Overall:
    Device size:                   7.28TiB
    Device allocated:            128.02GiB
    Device unallocated:            7.15TiB
    Device missing:                  0.00B
    Used:                        122.70GiB
    Free (estimated):              3.58TiB      (min: 3.58TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               65.00MiB      (used: 0.00B)
....

defragment [options] <file>|<dir> [<file>|<dir>...]

-r              # files in dir will be defragmented recursively

-f               # flush data for each file before going to the next file.

i.e.

btrfs filesystem defragment pc_data

resize [+/-]<size>[gkm]|max <filesystem>

i.e.

cd /data

btrfs fi show .

Label: none  uuid: 1711bb02-3009-42f2-952e-ea34ec1f218a
        Total devices 1 FS bytes used 1014.43GiB
        devid    1 size 1.10TiB used 1024.00GiB path /dev/mapper/vg3t-data_disk

btrfs fi resize max .

sync <path>

# This is done via a special ioctl and will also trigger cleaning of deleted subvolumes.

i.e.

btrfs fi sync /data/raid4t

FSSync '/data/raid4t'

Change Label

set

btrfs filesystem label <mountpoint> <newlabel>

i.e.

btrfs filesystem label /mnt/tmp mytestlw

get

btrfs filesystem label /mnt/tmp

mytestlw

 

device - Manage devices

       btrfs device scan [<device> [<device>..]]                   # Scan devices for a btrfs filesystem.

# If no devices are passed, btrfs uses block devices containing btrfs filesystem as listed by blkid.

       btrfs device add <dev> [<dev>..]

       btrfs device delete <dev> [<dev>..] <path>

       btrfs device usage <path>                                          # Show detailed information about internal allocations in devices

/dev/sdd1, ID: 1
   Device size:             3.64TiB
   Data,RAID1:              1.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:            8.00MiB
   Unallocated:             3.64TiB

/dev/sde1, ID: 2
   Device size:             3.64TiB
   Data,RAID1:              1.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:            8.00MiB
   Unallocated:             3.64TiB

btrfs device stats [-z] <path>|<device>                  # Read and print the device IO stats

-z               # Reset stats to zero after reading them.

[/dev/sdd1].write_io_errs   0
[/dev/sdd1].read_io_errs    0
[/dev/sdd1].flush_io_errs   0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0

scrub

start

# The default IO priority of scrub is the idle class.

btrfs scrub start [-BdqrR] [-c ioprio_class -n ioprio_classdata] <path> | <device>

 * identified by <path> or on a single <device>

# scrub at top speed

-c <ioprio_class>               # set IO priority class (see ionice(1) manpage)

                                        # 0 for none, 1 for realtime, 2 for best-effort, 3 for idle

-n <ioprio_classdata>        # set IO priority classdata (see ionice(1) manpage)

                                        # For realtime and best-effort, 0-7 are valid data

btrfs scrub start -c 2 -n 0 /data/raid4t

status

btrfs scrub status [-dR] <path>|<device>

-d     stats per device

-R     print raw stats

i.e.

btrfs scrub status .

scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 1.25MiB with 0 errors

btrfs scrub status -d .

scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
scrub device /dev/sde1 (id 1) history
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 640.00KiB with 0 errors
scrub device /dev/sdf1 (id 2) history
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 640.00KiB with 0 errors

btrfs scrub status -R .

scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        data_extents_scrubbed: 16
        tree_extents_scrubbed: 16
        data_bytes_scrubbed: 1048576
        tree_bytes_scrubbed: 262144
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 256
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 4333764608

cancel

btrfs scrub cancel <path>|<device>

resume

Resume a canceled or interrupted scrub cycle

check

# Check an unmounted btrfs filesystem (Do off-line check on a btrfs filesystem)

btrfs check [options] <device>

--repair                                     # try to repair the filesystem.

-s|--super <superblock>           # use this superblock copy.

--init-csum-tree                         # create a new CRC tree.

--init-extent-tree                       # create a new extent tree.

i.e.

btrfs check --repair  /dev/sdd1

enabling repair mode
Checking filesystem on /dev/sdd1
UUID: 84b02a2b-f730-475d-af12-e05d03ba4391
checking extents
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
found 181503785603 bytes used err is 0
total csum bytes: 421463376
total tree bytes: 515047424
total fs tree bytes: 35569664
total extent tree bytes: 9011200
btree space waste bytes: 48060766
file data blocks allocated: 431684403200
 referenced 431684403200
Btrfs v3.12

balance

# spread block groups across all devices so they match constraints defined by the respective profiles.

btrfs [filesystem] balance start [options] <path>

btrfs [filesystem] balance cancel <path>

btrfs [filesystem] balance status [-v] <path>
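balance also accepts filters; a common use is compacting mostly-empty chunks to free unallocated space. A minimal sketch (the 50% usage threshold is an arbitrary assumption):

# rebalance only data and metadata chunks that are less than 50% full
btrfs balance start -dusage=50 -musage=50 /data/raid4t

# check progress from another shell
btrfs balance status -v /data/raid4t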

quota

btrfs quota enable <path>

btrfs quota disable <path>

btrfs quota rescan [-sw] <path>
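A minimal usage sketch (the paths and the 10G limit are assumptions): quotas are enabled per filesystem, then limits are applied per subvolume through qgroups:

# enable quota tracking on the filesystem
btrfs quota enable /data/raid4t

# show per-subvolume usage
btrfs qgroup show /data/raid4t

# limit one subvolume to 10GiB
btrfs qgroup limit 10G /data/raid4t/data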

snapshot

i.e.

# Create a snapshot

cd /mnt/sda6                                           # sda6 is mounted as btrfs

mkdir snap                                               # create a folder to hold the snapshots

btrfs sub snap -r ./vm snap/vm-20170914   # vm is a subvolume; vm-20170914 is the snapshot name

# show snapshot in subvolume

btrfs sub show /backup/vm_admin_bak

....
        Snapshot(s):
                                .snapshots/1606/snapshot
                                .snapshots/2379/snapshot
....

# Delete the snapshot "/backup/vm_admin_bak/.snapshots/2702/snapshot"

btrfs subvolume delete /backup/vm_admin_bak/.snapshots/2702/snapshot

Delete subvolume (no-commit): '/backup/vm_admin_bak/.snapshots/2702/snapshot'

property

# Lists available properties with their descriptions for the given object.

btrfs property list /data/raid4t

ro                  Set/get read-only flag of subvolume.
label               Set/get label of device.
compression         Set/get compression for a file or directory

 

readonly snapshot to rw snapshot

btrfs property list /path/to/snapshot

ro                  Set/get read-only flag of subvolume.
compression         Set/get compression for a file or directory

btrfs property get /path/to/snapshot ro

ro=true

btrfs property set /path/to/snapshot ro false

 


Kernel threads

 

btrfs-cleaner
btrfs-delalloc
btrfs-delayed-m
btrfs-endio-met
btrfs-endio-wri
btrfs-freespace
btrfs-readahead
btrfs-transacti
btrfs-cache-<n>
btrfs-endio-<n>
btrfs-fixup-<n>
btrfs-genwork-<n>
btrfs-submit-<n>
btrfs-worker-<n>
flush-btrfs-<n>

 


Recovery

 

When things go wrong

cp: reading `winxp-vio.qcow2': Input/output error
cp: failed to extend `/mnt/storage1/winxp-vio.qcow2': Input/output error

dmesg:

[4483186.892609] btrfs csum failed ino 100268 off 377020416 csum 1102225462 private 2315516263
[4483186.892795] btrfs csum failed ino 100268 off 377020416 csum 1102225462 private 2315516263

Surprisingly, btrfsck says nothing is wrong ...

# btrfsck /dev/sda6

found 14301544448 bytes used err is 0
total csum bytes: 13947324
total tree bytes: 19484672
total fs tree bytes: 200704
btree space waste bytes: 4095840
file data blocks allocated: 14282059776
 referenced 14282059776
Btrfs Btrfs v0.19

Find the file with the problem

# find . -inum 100268

./winxp-vio.qcow2

Looks like winxp-vio.qcow2 is beyond saving -__-

btrfs-zero-log

# clear out log tree

example:

server:btrfs-progs# ./btrfs-zero-log /dev/sdb1

recovery,nospace_cache,clear_cache

mount -t btrfs -o recovery,nospace_cache,clear_cache DEVICE MOUNTPOINT

 


Build btrfs-progs

 

Check Version:

btrfs version

btrfs-progs v4.4

Preparation:

apt-get install uuid-dev libattr1-dev zlib1g-dev libacl1-dev e2fslibs-dev libblkid-dev liblzo2-dev

Git repository:

# Official:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git

# Btrfs progs developments and patch integration

http://repo.or.cz/w/btrfs-progs-unstable/devel.git

cd btrfs-progs/

make

Usage

./btrfs fi show

 


Create fs opts

 

# list all opts

mkfs.btrfs -O list-all

Filesystem features available:
mixed-bg            - mixed data and metadata block groups (0x4)
extref              - increased hardlink limit per file to 65536 (0x40, default)
raid56              - raid56 extended format (0x80)
skinny-metadata     - reduced-size metadata extent refs (0x100, default)
no-holes            - no explicit hole extents for files (0x200)

extref(extended inode refs)

Without extref, the total number of links that can be stored for a given inode / parent dir pair is limited to under "4k"

 * decided at format (mkfs) time
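The features above are chosen at mkfs time with -O; a minimal sketch (the device name is an assumption):

# enable extref and skinny-metadata explicitly at format time
mkfs.btrfs -O extref,skinny-metadata /dev/sdz1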

 

 


More about btrfs

 

============== Data allocation

btrfs allocates chunks of this raw storage, typically in 1GiB lumps.

Many files may be placed within a chunk, and files may span across more than one chunk.

 

============== RAID-1

btrfs replicates data on a per-chunk basis (chunks are allocated in pairs(different block device))

 

============== Balancing

A btrfs balance operation rewrites things at the level of chunks.

 

============== CoW

If you mount the filesystem with nodatacow, or use chattr +C on the file, then it only does the CoW operation for data if there’s more than one copy referenced.
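For data that suffers under CoW (VM images, large databases), the usual trick is to mark the directory NOCOW before the files are created, so that new files inherit the attribute; a minimal sketch (the path is an assumption):

# new files created inside this directory inherit the NOCOW (+C) attribute
mkdir -p /data/vm-images
chattr +C /data/vm-images

# verify
lsattr -d /data/vm-images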

 

============== Subvolumes

A subvolume in btrfs is not the same as an LVM logical volume or a ZFS subvolume.
(With LVM, a logical volume is a block device in its own right(filesystem))

* A btrfs filesystem has a default subvolume, which is initially set to be the top-level subvolume and

which is mounted if no subvol or subvolid option is specified. (subvolume can be mounted by subvol or subvolid )

Changing the default subvolume with btrfs subvolume set-default will make the top level of the filesystem inaccessible,

except by use of the subvol=/ or subvolid=5 mount options.

 * mounted top-level subvolume (subvol=/ or subvolid=5) => the full filesystem structure will be seen at the mount

 

Layout

 - Flat Subvolumes
 - Nested Subvolumes

When to use Subvolumes

 * "Split" of areas which are "complete" and/or "consistent" in themselves. ("/var/lib/postgresql/")

 * Split of areas which need special properties. (VM images, "nodatacow")

 

====================== Snapshots

A snapshot is simply a subvolume that shares its data (and metadata) with some other subvolume, using btrfs's COW capabilities.

Once a [writable] snapshot is made, there is no difference in status between the original subvolume, and the new snapshot subvolume.

# Structure

toplevel
  +-- root                 (subvolume, to be mounted at /)
  +-- home                 (subvolume, to be mounted at /home)
  \-- snapshots            (directory)
      +-- root             (directory)
          +-- 2015-01-01   (subvolume, ro-snapshot of subvolume "root")
          +-- 2015-06-01   (subvolume, ro-snapshot of subvolume "root")
      \-- home             (directory)
          \-- 2015-01-01   (subvolume, ro-snapshot of subvolume "home")
          \-- 2015-12-01   (subvolume, rw-snapshot of subvolume "home")

# fstab

LABEL=the-btrfs-fs-device   /                    subvol=/root,defaults,noatime      0  0
LABEL=the-btrfs-fs-device   /home                subvol=/home,defaults,noatime      0  0
LABEL=the-btrfs-fs-device   /root/btrfs-top-lvl  subvol=/,defaults,noauto,noatime   0  0

# Creating a rw-snapshot

mount /root/btrfs-top-lvl
btrfs subvolume snapshot /root/btrfs-top-lvl/home /root/btrfs-top-lvl/snapshots/home/2015-12-01
umount /root/btrfs-top-lvl

# Restore

mount /root/btrfs-top-lvl
umount /home
mv /root/btrfs-top-lvl/home /root/btrfs-top-lvl/home.tmp     # or it could have been deleted, see below
mv /root/btrfs-top-lvl/snapshots/home/2015-12-01 /root/btrfs-top-lvl/home
mount /home

Remark - ro-snapshot

subvolume snapshot -r <source> <dest>|[<dest>/]<name>

 * Read-only subvolumes cannot be moved.

 


Creating a multi-device btrfs (mkfs.btrfs)

 

Multiple vs single device

When creating a btrfs filesystem, the defaults depend on whether you give it one device or several.

Multiple devices:

metadata: mirrored across two devices
data: striped

Single device:

metadata: duplicated on same disk

# Don't duplicate metadata on a single drive

mkfs.btrfs -m single /dev/sdz1

-m  <opt>  <--- metadata profile <-- make sure it can still be mounted in degraded mode

-d  <opt>  <--- data profile

 * more devices can be added after the FS has been created

 * the RAID level can be converted after the FS has been created

Creating the Filesystem

mkfs.btrfs /dev/sdz1

For an HDD this is equivalent to

mkfs.btrfs -m dup -d single /dev/sdz1

For an SSD (or other non-rotational device) this is equivalent to

mkfs.btrfs -m single -d single /dev/sdz1

Multiple devices

mkfs.btrfs /dev/sdx1 /dev/sdy1

Characteristics

  • Data: RAID0
  • System: RAID1
  • Metadata: RAID1

Supported RAID levels:

single, raid0, raid1, raid10 (4 devices), raid5 and raid6

single: full capacity of multiple drives with different sizes (metadata mirrored, data not mirrored and not striped)

Usage: RAID1

create raid1:

mkfs.btrfs -L raid4t -m raid1 -d raid1 /dev/sdd1 /dev/sde1

opt:

-m level                # metadata (raid0, raid1, raid10 or single)

-d  level                # data (raid0, raid1, raid10 or single)

-L  name               # set the filesystem label (it can also be changed later with btrfs filesystem label)

P.S.

More than 2 hard disks can be used for the RAID !!

Why do I have "single" chunks in my RAID filesystem?

Data, RAID1: total=3.08TiB, used=3.02TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=3.88MiB, used=336.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=4.19GiB, used=3.56GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

The single chunks are perfectly normal, and are a result of the way that mkfs works.

They are small, harmless, and will remain unused as the FS grows, so you won't risk any unreplicated data.

 


Conversion (single -> raid1)

single -> raid1

Going from a non-raid /dev/sda1 to raid1 (sda1 plus sdb1)

mount /dev/sda1 /mnt
btrfs device add /dev/sdb1 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

This command keeps running until the whole process is finished; the mount point stays R/W during that time.

If the metadata is not converted from the single-device default,

it remains as DUP, which does not guarantee that copies of a block are on separate devices.

If data is not converted it does not have any redundant copies at all.
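If an earlier balance converted only the data, the metadata can be converted on its own in a second run; a minimal sketch (mount point as above):

# convert the metadata profile from DUP to raid1
btrfs balance start -mconvert=raid1 /mnt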

# Check the progress from another shell

btrfs balance status .

Balance on '.' is running
52 out of about 2583 chunks balanced (53 considered),  98% left

A moment later

Balance on '.' is running
55 out of about 2583 chunks balanced (56 considered),  98% left

btrfs filesystem df .

Data, RAID1: total=18.00GiB, used=16.31GiB             # gradually increasing
Data, single: total=2.50TiB, used=2.49TiB              # gradually decreasing
System, DUP: total=8.00MiB, used=304.00KiB
Metadata, DUP: total=3.50GiB, used=2.70GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Profile names
  • raid0
  • raid1
  • raid10
  • raid5
  • raid6
  • dup (A form of "RAID" which stores two copies of each piece of data on the same device.)
  • single

Default:

# checking: btrfs fi df /data/pc_data

Data, single: total=696.01GiB, used=677.99GiB
System, DUP: total=8.00MiB, used=96.00KiB
Metadata, DUP: total=1.50GiB, used=903.45MiB
GlobalReserve, single: total=512.00MiB, used=0.00B

raid1 -> single

btrfs balance start -dconvert=single -mconvert=dup /data/1T

Done, had to relocate 6 out of 6 chunks

 


Replacing failed devices

 

Symptoms:

mount fails

mount /dev/sdg1 /data/1T

mount: wrong fs type, bad option, bad superblock on /dev/sdg1,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

dmesg shows

[1519957.580510] BTRFS info (device sdg1): disk space caching is enabled
[1519957.580513] BTRFS info (device sdg1): has skinny extents
[1519957.591738] BTRFS warning (device sdg1): devid 2 uuid 289afc10-5bd5-47c4-b0f7-4c6e129554ae is missing
[1519957.591743] BTRFS error (device sdg1): failed to read chunk tree: -5
[1519957.628542] BTRFS error (device sdg1): open_ctree failed

fi show gives the details

# Show the structure of a filesystem

btrfs filesystem show /dev/sdg1

warning, device 2 is missing
warning devid 2 not found already
Label: '1T-RAID'  uuid: 394c3048-5fee-4bc9-b1b2-ec0084c0e2f0
        Total devices 2 FS bytes used 1.12MiB
        devid    1 size 931.51GiB used 4.04GiB path /dev/sdg1
        *** Some devices missing

Recovery procedure:

  • mount in degraded mode
# /dev/sdg1 is the disk in the raid1 that is still healthy
mount -o degraded /dev/sdg1 /data/1T
  • determine the raid level currently in use

btrfs fi df /data/1T

Data, RAID1: total=1.00GiB, used=0.00B
Data, single: total=1.00GiB, used=1.00MiB
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=32.00KiB
Metadata, single: total=1.00GiB, used=80.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
  • add a new device
btrfs device add /dev/sdc1 /data/1T
  • remove the missing device
# 'missing' is a special device name
btrfs device delete missing /data/1T

 * After the remove command finishes, it redistributes any extents in use on the device being removed to the other devices in the filesystem.

Success:

 * After the command finishes, sdb1 is gone from the filesystem

Failure:

ERROR: error removing device 'missing': unable to go below two devices on raid1
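As an alternative to the add + delete steps above, recent kernels also offer btrfs replace, which rebuilds the data of the failed device directly onto the new disk in one step; a minimal sketch, assuming the missing device had devid 2 and the new disk is /dev/sdc1:

# replace the missing devid 2 with the new disk; data is rebuilt from the remaining mirror
btrfs replace start 2 /dev/sdc1 /data/1T

# watch progress
btrfs replace status /data/1T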

Spotting device problems in advance:

btrfs device stats /data/1T

[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0

 


Play btrfs with image file

 

  • Create Image File
  • Create RAID1
  • Expand from 2 Disk to 4 Disk (1g -> 2g)
  • Upgrade RAID1 to RAID10

# Create 1G image file X 4

truncate -s 1g vdisk1.img vdisk2.img vdisk3.img vdisk4.img

# map image file to block device

losetup -l

           # no output

losetup /dev/loop1 vdisk1.img

losetup /dev/loop2 vdisk2.img

losetup /dev/loop3 vdisk3.img

losetup /dev/loop4 vdisk4.img

# Create RAID1 (loop1, loop2)

mkfs.btrfs -m raid1 -d raid1 /dev/loop1 /dev/loop2

# mount it

mkdir /mnt/btrfs

mount -t btrfs /dev/loop1 /mnt/btrfs

# check status

btrfs filesystem df /mnt/btrfs

Data, RAID1: total=102.38MiB, used=128.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=102.38MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

btrfs filesystem show /mnt/btrfs

Label: none  uuid: cdcded1d-edeb-4c73-948d-c3628703b47d
        Total devices 2 FS bytes used 256.00KiB
        devid    1 size 1.00GiB used 212.75MiB path /dev/loop1
        devid    2 size 1.00GiB used 212.75MiB path /dev/loop2

btrfs device usage /mnt/btrfs/

/dev/loop1, ID: 1
   Device size:             1.00GiB
   Device slack:              0.00B
   Data,RAID1:            102.38MiB
   Metadata,RAID1:        102.38MiB
   System,RAID1:            8.00MiB
   Unallocated:           811.25MiB

/dev/loop2, ID: 2
   ...

df -h | grep btrfs

/dev/loop1                1.0G   17M  913M   2% /mnt/btrfs

# Add devices to the existing btrfs filesystem (loop3, loop4)

btrfs device add /dev/loop3 /dev/loop4 /mnt/btrfs

btrfs filesystem df /mnt/btrfs

Data, RAID1: total=102.38MiB, used=128.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=102.38MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

btrfs filesystem show /mnt/btrfs

Label: none  uuid: bd7b494c-f242-4bcd-9160-443b5adf141c
        Total devices 4 FS bytes used 256.00KiB
        devid    1 size 1.00GiB used 212.75MiB path /dev/loop1
        devid    2 size 1.00GiB used 212.75MiB path /dev/loop2
        devid    3 size 1.00GiB used 0.00B path /dev/loop3
        devid    4 size 1.00GiB used 0.00B path /dev/loop4

df -h | grep btrfs

/dev/loop1                2.0G   17M  1.9G   1% /mnt/btrfs

blkid vdisk*.img

 * All the files have the same UUID, but different UUID_SUBs

vdisk1.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="e333a51b-9ef1-4369-b440-83d538e628d9" TYPE="btrfs"
vdisk2.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="85e90b67-9f60-4430-8c73-a1cf91e86bdd" TYPE="btrfs"
vdisk3.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="53469a58-803a-4847-a383-2f1470b8c126" TYPE="btrfs"
vdisk4.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="e19db13f-2995-4e9e-82af-3f5863cf61c5" TYPE="btrfs"

# raid1 -> raid10

btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/btrfs

Done, had to relocate 3 out of 3 chunks

# check status

btrfs filesystem df /mnt/btrfs

Data, RAID10: total=416.00MiB, used=128.00KiB
System, RAID10: total=64.00MiB, used=16.00KiB
Metadata, RAID10: total=256.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Cleanup

umount /mnt/btrfs

losetup -l

NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                   DIO
/dev/loop1         0      0         0  0 /root/btrfs-test/vdisk1.img   0
/dev/loop4         0      0         0  0 /root/btrfs-test/vdisk4.img   0
/dev/loop2         0      0         0  0 /root/btrfs-test/vdisk2.img   0
/dev/loop3         0      0         0  0 /root/btrfs-test/vdisk3.img   0

# -D, --detach-all              detach all used devices

losetup -D

 


btrfs-send / btrfs-receive

 

Where it beats rsync

 * rsync does not track file renames

 * huge files are handled well

btrfs-send

btrfs-send - generate a stream of changes between two subvolume snapshots

This command will generate a stream of instructions that describe changes between two subvolume snapshots.
The stream can be consumed by the btrfs receive command to replicate the sent snapshot on a different filesystem.
The command operates in two modes: full and incremental.

Only the send side is happening in-kernel. Receive is happening in user-space.

Based on the differences found by btrfs_compare_tree, we generate a stream of instructions.

"btrfs send" requires read-only subvolumes to operate on.

btrfs-receive - receive subvolumes from send stream

btrfs-receive

btrfs receive will fail in the following cases:

    - receiving subvolume already exists
    - previously received subvolume has been changed after it was received

A subvolume is made read-only after the receiving process finishes successfully

btrfs receive sets the subvolume read-only after it completes successfully. However, while the receive is in progress, users who have write access to files or directories in the receiving path can add, remove, or modify files, in which case the resulting read-only subvolume will not be an exact copy of the sent subvolume.

If the intention is to create an exact copy, the receiving path should be protected from access by users until the receive operation has completed and the subvolume is set to read-only.

Additionally, receive does not currently do a very good job of validating that an incremental send streams actually makes sense, and it is thus possible for a specially crafted send stream to create a subvolume with reflinks to arbitrary files in the same filesystem. Because of this, users are advised to not use btrfs receive on send streams from untrusted sources, and to protect trusted streams when sending them across untrusted networks.

Usage

Initial Bootstrapping

btrfs subvolume snapshot -r /home /home/BACKUP

sync

btrfs send /home/BACKUP | btrfs receive /backup/home

Incremental Operation

btrfs subvolume snapshot -r /home /home/BACKUP-new

sync

# -p <parent>               <=   send an incremental stream from parent to subvol

# send: compares the previous and the new snapshot

# receive: creates the subvolume "BACKUP-new" inside "/backup/home"

btrfs send -p /home/BACKUP /home/BACKUP-new | btrfs receive /backup/home

# Side note

btrfs sub sh /data/_snap/photo-new

UUID:                   5f081984-2843-a94b-aa53-3314422beb89
Flags:                  readonly

btrfs sub sh /data/1T/backup/photo-new

Received UUID:          5f081984-2843-a94b-aa53-3314422beb89
Flags:                  readonly

Cleanup

btrfs subvolume delete /home/BACKUP

mv /home/BACKUP-new /home/BACKUP

btrfs subvolume delete /backup/home/BACKUP

mv /backup/home/BACKUP-new /backup/home/BACKUP

# To keep a dated backup copy

btrfs subvolume snapshot -r /backup/home/BACKUP /backup/home.$(date +%Y-%m-%d)

My Script

#!/bin/bash

bk_path="/data/pc_data/photo"
snap_path="/data/_snap"
dest_path="/data/1T/backup"
keep=3

##############################################

lockfile=/tmp/btrfs-backup.lck

if [ -e $lockfile ]; then
        echo "Under locking" && exit
fi
touch $lockfile

folder=$(basename $bk_path)

org_snap=$snap_path/$folder
new_snap=$snap_path/${folder}-new

org_bak=$dest_path/$folder
new_bak=$dest_path/${folder}-new

echo " * create new snapshot"
btrfs sub snap -r $bk_path $new_snap
sync
echo " * send | receive"
btrfs send -p $org_snap $new_snap | btrfs receive $dest_path

echo " * send side cleanup"
btrfs sub del $org_snap
mv $new_snap $org_snap

echo "receive side cleanup"
btrfs subvolume delete $org_bak
mv $new_bak $org_bak
btrfs subvolume snapshot -r $org_bak $org_bak.$(date +%Y-%m-%d)

echo " * rotate backup to $keep"
ls -rd ${org_bak}.* | tail -n +$(( $keep + 1 ))| while read snap
do
        echo $snap
        btrfs subvolume delete "$snap"
done

rm -f $lockfile

ls $dest_path

 


GlobalReserve

 

What is the GlobalReserve and why does 'btrfs fi df' show it as single even on RAID filesystems?

The global block reserve is last-resort space for filesystem operations that may require allocating workspace even on a full filesystem.

An example is removing a file, subvolume or truncating a file.

This is mandated by the COW model, even removing data blocks requires to allocate some metadata blocks first (and free them once the change is persistent).

The block reserve is only virtual and is not stored on the devices.

It's an internal notion of Metadata but normally unreachable for the user actions (besides the ones mentioned above). For ease it's displayed as single.

The size of the global reserve is determined dynamically according to the filesystem size, but is capped at 512MiB. A used value greater than 0 means that it is in use.
 

 


XOR module

 

modprobe xor

[4945537.298825] xor: automatically using best checksumming function: generic_sse
[4945537.316018]    generic_sse:  3983.000 MB/sec
[4945537.316026] xor: using function: generic_sse (3983.000 MB/sec)

P.S.

btrfs depends: libcrc32c, zlib_deflate

 



Common use cases

 

# Convert an existing directory into a subvolume

cp -a --reflink=always Path2Folder Path2Subvolume

Remark

When --reflink[=always] is specified, perform a lightweight copy,

where the data blocks are copied only when modified.

If this is not possible the copy fails, or if --reflink=auto is specified,

fall back to a standard copy.

If you use --reflink=always on a filesystem that is not CoW-capable, you will get an error.

cp --reflink=always my_file.bin my_file_copy.bin
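Putting the above together, the usual recipe for turning a directory into a subvolume is to create the subvolume first, reflink-copy the contents into it, and then swap the names; a minimal sketch (the paths are assumptions):

# create the target subvolume and reflink-copy the directory contents into it
btrfs subvolume create /data/myfolder.subvol
cp -a --reflink=always /data/myfolder/. /data/myfolder.subvol/

# swap the names and, once verified, remove the old directory
mv /data/myfolder /data/myfolder.old
mv /data/myfolder.subvol /data/myfolder
rm -rf /data/myfolder.old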