btrfs - The Most Powerful FileSystem

Last updated: 2023-03-10

 

Table of Contents

  • 注意事項
  • 新式指令
  • Scrub
  • Kernel threads
  • Recovery
  • Build btrfs-progs
  • extref(extended inode refs)
  • More about btrfs
  • Creating a multi-device btrfs (mkfs.btrfs)
  • RAID1 Usage
  • Conversion (single -> raid1)
  • Replacing failed devices
  • Play btrfs with image file
  • btrfs-send / btrfs-receive
  • GlobalReserve
  • XOR module
  • Common use cases
  • Data Integrity Test

 

Introduction

btrfs is a versatile FileSystem. Its core architecture is CoW (copy-on-write), and it supports the following features:

  • Extent based file storage (2^64 max file size = 16EiB)
  • Dynamic inode allocation
  • Writable snapshots
  • Subvolumes (internal filesystem roots)
  • Subvolume-aware quota support
  • Object level mirroring and striping
  • Compression (ZLIB, LZO)
  • Online filesystem defragmentation
  • Integrated multiple device support (RAID0, RAID1, RAID10, RAID5 )
  • Checksums on metadata and data (crc32c, xxhash, sha256, blake2b)
  • Space-efficient packing of small files
  • Space-efficient indexed directories
  • Seed devices
    (Create a readonly filesystem that acts as a template to seed other Btrfs filesystems)
  • Background scrub process for finding and fixing errors on files with redundant copies
  • Efficient incremental backup(btrfs send)

In short, you would be missing out by not using it ~

However, be careful before using it, because on Linux 2.6 it still has no working fsck !! (The btrfsck shipped with Linux 3.2 is still not usable..)

In addition, performance is better if the CPU supports hardware CRC32
(grep -i SSE4.2 /proc/cpuinfo)

Version

Improvements in each version: btrfs_version

 


btrfs Commands

 

btrfs commands now fall into two groups: the old style and the new style

New style:

  • btrfs [option]                     # Yes, the new style has only one command, followed by a bunch of options

Old style:

  • btrfsctl
  • btrfs-show
  • btrfstune

 


Checking & mount & fstab

 

scan

This is required after loading the btrfs module if you're running with more than one device in a filesystem.

btrfs device scan /dev/sdd1 /dev/sde1

Scanning for Btrfs filesystems in '/dev/sdd1'
Scanning for Btrfs filesystems in '/dev/sde1'

dmesg

[454919.695156] btrfs: device label raid4t devid 1 transid 4 /dev/sdd1
[454919.705341] btrfs: device label raid4t devid 2 transid 4 /dev/sde1

# Show the structure of a filesystem

btrfs filesystem show /dev/sdd1

Label: raid4t  uuid: da8c8ae3-163b-47ba-aafc-6ec0199df818
        Total devices 2 FS bytes used 640.00KiB
        devid    1 size 3.64TiB used 2.03GiB path /dev/sdd1
        devid    2 size 3.64TiB used 2.01GiB path /dev/sde1

# Show space usage information for a mount point

btrfs filesystem df /data/raid4t

Data, RAID1: total=1.00GiB, used=512.00KiB
Data, single: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

df -h

/dev/sdd1             7.3T  4.0G  7.3T   1% /data/raid4t

 * BTRFS isn't actually mirroring disks like "raid 1", it's storing at least two copies of the data.

 

fstab

/etc/fstab

If the boot process doesn't perform a "btrfs device scan [/dev/sd?]",

you can still mount a multi-volume btrfs filesystem by passing all the devices in the filesystem explicitly to the mount command:

/dev/sdb     /mnt    btrfs    device=/dev/sdb1,device=/dev/sdc2,device=/dev/sdd1,device=/dev/sde1    0 0

 

Setting the default mount volume

List the volumes:

btrfs sub list /mnt/sda8

ID 256 top level 5 path vps

Set it:

btrfs sub set-default 256 /mnt/sda8

fstab:

# myserver
/dev/sda8       /var/lib/lxc/myserver/rootfs    btrfs   noatime,subvol=vps      0       0
/dev/sda8       /mnt/sda8                       btrfs   noatime,subvolid=0      0       0

Specify which sub-volume is used as the root directory when mounting the device

mount:

mount -t btrfs -o subvol=vm.13-9-11 /dev/sda7 /mnt/tmp

 


Mount Options

 

noatime

device

# mount a multi-volume btrfs filesystem by passing all the devices in the filesystem

device=/dev/sdd1,device=/dev/sde1

subvol

By name (subvolume path)

subvol=home

By id

subvolid=0    # both 0 and 5 refer to the top level

autodefrag (since 3.0)

Will detect random writes into existing files and kick off background defragging.

It is well suited to bdb or sqlite databases, but not virtualization images or big databases (yet).

commit=number (since 3.12)

Set the interval of periodic commit, 30 seconds by default.

checksum

* debugging options

check_int (since 3.3)

Switch on integrity checker for metadata

check_int_data (since 3.3)

Switch on integrity checker for data and metadata

compress

compress=zlib - Better compression ratio. It's the default and safe for older kernels.

compress=lzo - Faster compression.

compress=no - Disables compression (starting with kernel 3.6).
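
Mount options are combined with commas. A minimal, hedged example (the device, mount point and chosen values are only illustrative, not from the original notes):

mount -o noatime,compress=lzo,commit=120 /dev/sdb1 /data

# or the equivalent fstab line
/dev/sdb1    /data    btrfs    noatime,compress=lzo,commit=120    0 0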

space_cache (since 2.6.37)

space_cache (space_cache=v1 and space_cache=v2 since 4.5)
space_cache=version
nospace_cache

Btrfs stores the free space data on-disk to make the caching of a block group much quicker.

If enabled, the kernel keeps the filesystem's free-space block addresses in memory,

so when you create a new file it can immediately start writing data to disk.

Without this, Btrfs has to scan the entire tree every time looking for the free space that can be allocated.

clear_cache (since 2.6.37)

Clear all the free space caches during mount.
(used one time and only after you notice some problems with free space. )

degraded

With RAID1 you only get one chance to mount read-write in degraded mode, so this must be used with care.

"-o degraded"

discard

Enables discard/TRIM on freed blocks.

inode_cache (since 3.0)

Enable free inode number caching.
(This option may slow down your system at first run. )

recovery (since 3.2)

Enable autorecovery upon mount; currently it scans list of several previous tree roots and tries to use the first readable. The information about the tree root backups is stored by kernels starting with 3.2, older kernels do not and thus no recovery can be done.

nodatacow

Do not copy-on-write data for newly created files, existing files are unaffected. This also turns off checksumming! IOW, nodatacow implies nodatasum. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large. NOTE: switches off compression !

DOC

https://btrfs.wiki.kernel.org/index.php/Mount_options

 


注意事項

 

Block-level copies of devices

Do NOT

    make a block-level copy of a Btrfs filesystem to another block device...
    use LVM snapshots, or any other kind of block level snapshots...
    turn a copy of a filesystem that is stored in a file into a block device with the loopback driver...

... and then try to mount either the original or the snapshot while both are visible to the same kernel.

Why?

If there are multiple block devices visible at the same time, and those block devices have the same filesystem UUID,

then they're treated as part of the same filesystem.

If they are actually copies of each other (copied by dd or LVM snapshot, or any other method),

then mounting either one of them could cause data corruption in one or both of them.

If you for example make an LVM snapshot of a btrfs filesystem, you can't mount either the LVM snapshot or the original,

because the kernel will get confused, because it thinks it's mounting a Btrfs filesystem that consists of two disks,

after which it runs into two devices which have the same device number.

Fragmentation

Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.

On servers and workstations this affects databases and virtual machine images.

On desktops this primarily affects application databases

You can use filefrag to locate heavily fragmented files (may not work correctly with compression).
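
A hedged example of hunting for fragmented files with filefrag (the path is illustrative; extent counts of compressed or reflinked files may be misleading):

# print the extent count of every file and show the worst offenders
find /var/lib/libvirt/images -type f -exec filefrag {} + | sort -t: -k2 -n | tail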

Having many subvolumes can be very slow

The cost of several operations, including currently balance and device delete, is proportional to the number of subvolumes,

including snapshots, and (slightly super-linearly) the number of extents in the subvolumes.

This is "obvious" for "pure" subvolumes, as each is an independent file tree and has independent extents anyhow (except for ref-linked ones).

But in the case of snapshots metadata and extents are (usually) largely ref-linked with the ancestor subvolume,

so a full scan of the snapshot should not be necessary, but currently it still happens.
 

 



新式指令

 

btrfs - tool to control a btrfs filesystem

Version

btrfs version

btrfs-progs v4.4

help

btrfs help [--full]

btrfs subvolume --help

subvolume

# The following example creates a subvolume called data and sets it as the root used at mount time

#1 Preparation

mount /dev/sdg1 /mnt

cd /mnt

btrfs subvol show .

/mnt/btrfs is toplevel subvolume

#2 Create a subvolume called data

btrfs subvol create data

Remark: deletion

btrfs subvolume delete <subvolume>

#3 List the IDs of all subvolumes (the newly created subvolume's ID is 257)

btrfs subvol list .

ID 257 gen 11 top level 5 path data

#4 Check the current default root ID

btrfs subvol get-default .

ID 5 (FS_TREE)

#5 Set the subvolume with ID 257 as the default root

btrfs subvol set-default 257 .

#6 Test

cd ..

umount /mnt

mount /dev/sdg1 /mnt

cd /mnt

btrfs subvolume show .

/mnt
        Name:                   data
        UUID:                   6bb748d2-21a2-ec4a-84b8-18d441bb6bd0
        Parent UUID:            -
        Received UUID:          -
        Creation time:          2019-01-03 23:13:14 +0800
        Subvolume ID:           257
        Generation:             11
        Gen at creation:        8
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):

#7 Mount the top level again elsewhere (in order to take snapshots)

mount -o subvolid=5 /dev/sdg1 /path/to/mountpoint

Remark

toplevel subvolume: ID=5

#8 take snapshot

btrfs subvolume snapshot <source> [<dest>/]<name>
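
For example (paths are only illustrative), to take a read-only snapshot of the data subvolume created above while the top level is mounted at /path/to/mountpoint:

btrfs subvolume snapshot -r /path/to/mountpoint/data /path/to/mountpoint/data-snap1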

 

filesystem - Manage a btrfs filesystem

show [device]

e.g.

btrfs filesystem show /dev/sdd1

Label: 'raid4t'  uuid: 0e100ef3-00c3-4761-90a4-965420a5c5fd
        Total devices 2 FS bytes used 112.00KiB
        devid    1 size 3.64TiB used 2.01GiB path /dev/sdd1
        devid    2 size 3.64TiB used 2.01GiB path /dev/sde1

-m|--mbytes               # show sizes in MiB

df <mount_point>

summary information about allocation of block group types of a given mount point

e.g.

btrfs filesystem df /data/raid4t

Data, RAID1: total=1.00GiB, used=512.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

GlobalReserve

It is an artificial and internal emergency space. It is used eg. when the filesystem is full.

Its total size is dynamic based on the filesystem size, usually not larger than 512MiB, used may fluctuate.

usage <path>

Show detailed information about internal filesystem usage.

Overall:
    Device size:                   7.28TiB
    Device allocated:            128.02GiB
    Device unallocated:            7.15TiB
    Device missing:                  0.00B
    Used:                        122.70GiB
    Free (estimated):              3.58TiB      (min: 3.58TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               65.00MiB      (used: 0.00B)
....

defragment [options] <file>|<dir> [<file>|<dir>...]

-r              # files in dir will be defragmented recursively

-f               # flush data for each file before going to the next file.

e.g.

btrfs filesystem defragment pc_data
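
A hedged variant (the directory name is illustrative): recursively defragment a directory and re-compress the files with lzo while doing so:

btrfs filesystem defragment -r -clzo pc_data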

resize [+/-]<size>[gkm]|max <filesystem>

e.g.

cd /data

btrfs fi show .

Label: none  uuid: 1711bb02-3009-42f2-952e-ea34ec1f218a
        Total devices 1 FS bytes used 1014.43GiB
        devid    1 size 1.10TiB used 1024.00GiB path /dev/mapper/vg3t-data_disk

btrfs fi resize max .
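
Other resize forms, as a hedged sketch (the sizes and devid are illustrative); on a multi-device filesystem a device is selected with the devid prefix:

# shrink by 10GiB
btrfs fi resize -10g .

# grow a specific device (devid 2) to its maximum
btrfs fi resize 2:max .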

sync <path>

# This is done via a special ioctl and will also trigger cleaning of deleted subvolumes.

e.g.

btrfs fi sync /data/raid4t

FSSync '/data/raid4t'

Set / Get Label

Set Label

btrfs filesystem label <mountpoint> <newlabel>

e.g.

btrfs filesystem label /mnt/tmp mytestlw

Get Label

btrfs filesystem label /mnt/tmp

mytestlw

 

device - Manage devices

       btrfs device scan [<device> [<device>..]]           # Scan devices for a btrfs filesystem.

# If no devices are passed, btrfs uses block devices containing btrfs filesystem as listed by blkid.

       btrfs device add <dev> [<dev>..]

       btrfs device delete <dev> [<dev>..] <path>

       btrfs device usage <path>                                  # Show detailed information about internal allocations in devices

/dev/sdd1, ID: 1
   Device size:             3.64TiB
   Data,RAID1:              1.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:            8.00MiB
   Unallocated:             3.64TiB

/dev/sde1, ID: 2
   Device size:             3.64TiB
   Data,RAID1:              1.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:            8.00MiB
   Unallocated:             3.64TiB

btrfs device stats [-z] <path>|<device>           # Read and print the device IO stats

-z                                                         # Reset stats to zero after reading them.

[/dev/sdd1].write_io_errs   0
[/dev/sdd1].read_io_errs    0
[/dev/sdd1].flush_io_errs   0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0

 


Scrub

 

start

# The default IO priority of scrub is the idle class.

btrfs scrub start [-BdqrR] [-c ioprio_class -n ioprio_classdata] <path> | <device>

 * identified by <path> or on a single <device>

 * Speed up the scrub (Doc: man ionice)

-c <ioprio_class>               # set IO priority class

                                        # 0 for none, 1 for realtime, 2 for best-effort, 3 for idle

-n <ioprio_classdata>        # Specify the scheduling class data. 

                                        # This only has an effect if the class accepts an argument.

                                        # 0(highest) ~ 7

btrfs scrub start -c 2 .

scrub started on ., fsid 1711bb02-3009-42f2-952e-ea34ec1f218a (pid=27455)
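
A hedged example of scheduling a regular scrub via cron (the schedule, file location and mount point are assumptions, not from the original notes):

# /etc/cron.d/btrfs-scrub : run at 03:00 on the 1st of each month, idle IO class
0 3 1 * * root /usr/bin/btrfs scrub start -c 3 /data/raid4t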

status

btrfs scrub status [-dR] <path>|<device>

-d     stats per device

-R     print raw stats

i.e.

btrfs scrub status .

scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 1.25MiB with 0 errors

btrfs scrub status -d .

scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
scrub device /dev/sde1 (id 1) history
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 640.00KiB with 0 errors
scrub device /dev/sdf1 (id 2) history
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 640.00KiB with 0 errors

btrfs scrub status -R .

scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        data_extents_scrubbed: 16
        tree_extents_scrubbed: 16
        data_bytes_scrubbed: 1048576
        tree_bytes_scrubbed: 262144
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 256
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 4333764608

cancel

btrfs scrub cancel <path>|<device>

Progress is saved in the scrub progress file and scrubbing can be resumed later using the scrub resume command.

i.e.

btrfs scrub cancel .

btrfs scrub status .

scrub status for 056bc837-b369-4ced-b47a-6ff9a0d1ad6c
        scrub started at Sat Jun 27 18:23:37 2020 and was aborted after 05:02:30
        total bytes scrubbed: 394.23GiB with 0 errors

resume

Resume a canceled or interrupted scrub cycle

i.e.

btrfs scrub resume .

 


Check (btrfsck)

 

# Check an unmounted btrfs filesystem (Do off-line check on a btrfs filesystem)

# By default, btrfs check will not modify the device but you can reaffirm that by the option --readonly.

 * The amount of memory required can be high

btrfs check [options] <device>

-b|--backup                             # use the first valid backup root copy

-s|--super <superblock>            # use the <superblock>'th superblock copy

--check-data-csum                    # verify checksums of data blocks

                                               # (offline scrub but does not repair data from spare copies)

-p|--progress                            # indicate progress

--clear-space-cache v1|v2         # clear space cache for v1 or v2 (the clear_cache kernel mount option)

DANGEROUS OPTIONS

--repair                                     # try to repair the filesystem.

--init-csum-tree                         # create a new CRC tree.

--init-extent-tree                       # create a new extent tree.

e.g.

btrfs check /dev/sdd1

UUID: cd3ac961-71c8-4295-b86c-eb0edd697a47
checking extents [o]
checking free space cache [O]
checking fs roots [o]
checking csums
checking root refs
found 1683487281152 bytes used, no error found
total csum bytes: 1641723396
total tree bytes: 1917534208
total fs tree bytes: 132743168
total extent tree bytes: 30687232
btree space waste bytes: 121731080
file data blocks allocated: 1681569746944
 referenced 1681569746944

btrfs check --repair  /dev/sdd1

balance

# spread block groups across all devices so they match constraints defined by the respective profiles.

btrfs [filesystem] balance start [options] <path>

btrfs [filesystem] balance cancel <path>

btrfs [filesystem] balance status [-v] <path>

quota

btrfs quota enable <path>

btrfs quota disable <path>

btrfs quota rescan [-sw] <path>
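
Once quota is enabled, qgroup usage can be inspected and limited; a minimal hedged sketch (the path and the 10G limit are illustrative):

btrfs quota enable /data

btrfs qgroup show /data

btrfs qgroup limit 10G /data/some_subvol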

snapshot

i.e.

# Create a snapshot

# sda6 is btrfs, mounted at /mnt/sda6

cd /mnt/sda6

# Create a folder to hold the snapshots (for easier management)

mkdir snap

# ./vm is a subvolume, and vm-20170914 is the snapshot name
# -r =  Make the new snapshot read only.

btrfs subvol snap -r ./vm snap/vm-20170914

# List snapshot

btrfs subvol list .

Type filtering

  • -s                # show only snapshot subvolumes
  • -r                # show only readonly subvolumes
  • -d                # list deleted subvolumes that are not yet cleaned.

btrfs subvol list . -r

ID 2074 gen 2380 top level 5 path snapshot/2022-11-01
ID 2098 gen 2403 top level 5 path snapshot/2024-01-06

# Show snapshot in subvolume

btrfs subvol show /backup/vm_admin_bak

....
        Snapshot(s):
                                .snapshots/1606/snapshot
                                .snapshots/2379/snapshot
....

# Delete the snapshot "/backup/vm_admin_bak/.snapshots/2702/snapshot"

btrfs subvol delete /backup/vm_admin_bak/.snapshots/2702/snapshot

Delete subvolume (no-commit): '/backup/vm_admin_bak/.snapshots/2702/snapshot'

property

# Lists available properties with their descriptions for the given object.

btrfs property list /data/raid4t

ro                  Set/get read-only flag of subvolume.
label               Set/get label of device.
compression         Set/get compression for a file or directory

readonly snapshot <-> rw snapshot

btrfs property list /path/to/snapshot

ro                  Set/get read-only flag of subvolume.
compression         Set/get compression for a file or directory

(one property fewer here: label is missing compared with the filesystem mount point)

btrfs property set /path/to/snapshot ro true

btrfs property get /path/to/snapshot ro

ro=true

Remark

readonly means the snapshot's contents are ro; the snapshot itself can still be deleted directly

i.e.

btrfs subvolume delete 2020-07-03

 


Kernel threads

 

  • btrfs-cleaner
  • btrfs-delalloc
  • btrfs-delayed-m
  • btrfs-endio-met
  • btrfs-endio-wri
  • btrfs-freespace
  • btrfs-readahead
  • btrfs-transacti
  • btrfs-cache-<n>
  • btrfs-endio-<n>
  • btrfs-fixup-<n>
  • btrfs-genwork-<n>
  • btrfs-submit-<n>
  • btrfs-worker-<n>
  • flush-btrfs-<n>

 


Recovery

 

When you hit a problem

cp: reading `winxp-vio.qcow2': Input/output error
cp: failed to extend `/mnt/storage1/winxp-vio.qcow2': Input/output error

dmesg:

[4483186.892609] btrfs csum failed ino 100268 off 377020416 csum 1102225462 private 2315516263
[4483186.892795] btrfs csum failed ino 100268 off 377020416 csum 1102225462 private 2315516263

# Surprisingly, btrfsck says nothing is wrong ...

btrfsck /dev/sda6                # btrfs check [options] <device>

found 14301544448 bytes used err is 0
total csum bytes: 13947324
total tree bytes: 19484672
total fs tree bytes: 200704
btree space waste bytes: 4095840
file data blocks allocated: 14282059776
 referenced 14282059776
Btrfs Btrfs v0.19

Solution

# Find the problematic file from dmesg (by its inode number)

find . -inum 100268

./winxp-vio.qcow2

It looks like winxp-vio.qcow2 is beyond saving -__-

btrfs-zero-log

# clear out log tree

i.e.

server:btrfs-progs# ./btrfs-zero-log /dev/sdb1

recovery,nospace_cache,clear_cache

mount -t btrfs -o recovery,nospace_cache,clear_cache DEVICE MOUNTPOINT
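
On newer kernels (roughly 4.6 onwards) the recovery mount option was renamed to usebackuproot; a hedged equivalent:

mount -t btrfs -o usebackuproot,nospace_cache,clear_cache DEVICE MOUNTPOINT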

 


Build btrfs-progs

 

Check Version:

btrfs version

btrfs-progs v4.4

Preparation:

apt-get install uuid-dev libattr1-dev zlib1g-dev libacl1-dev e2fslibs-dev libblkid-dev liblzo2-dev

Git repository:

# Official:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git

# Btrfs progs developments and patch integration

http://repo.or.cz/w/btrfs-progs-unstable/devel.git

cd btrfs-progs/

make
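
Newer btrfs-progs releases use autotools, so the build is slightly different; a hedged sketch:

cd btrfs-progs/

./autogen.sh
./configure
make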

Usage

./btrfs fi show

 


Create fs opts

 

# list all opts

mkfs.btrfs -O list-all

Filesystem features available:
mixed-bg            - mixed data and metadata block groups (0x4)
extref              - increased hardlink limit per file to 65536 (0x40, default)
raid56              - raid56 extended format (0x80)
skinny-metadata     - reduced-size metadata extent refs (0x100, default)
no-holes            - no explicit hole extents for files (0x200)

extref(extended inode refs)

Without extref, the total number of hard links that can be stored for a given inode / parent dir pair is limited to just under "4k"; extref raises the limit to 65536.

 * Decided at format (mkfs) time
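
Features are selected with -O at mkfs time; for example (device name illustrative, and the ^ prefix to disable a default feature is an assumption based on the mkfs.btrfs man page):

# enable no-holes in addition to the defaults
mkfs.btrfs -O no-holes /dev/sdz1

# disable a default feature by prefixing it with ^
mkfs.btrfs -O ^extref /dev/sdz1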

 

 


More about btrfs

 

============== Data allocation

it allocates chunks of this raw storage, typically in 1GiB lumps

Many files may be placed within a chunk, and files may span across more than one chunk.

 

============== RAID-1

btrfs replicates data on a per-chunk basis (chunks are allocated in pairs, each on a different block device)

 

============== Balancing

A btrfs balance operation rewrites things at the level of chunks.

 

============== CoW

If you mount the filesystem with nodatacow, or use chattr +C on the file, then it only does the CoW operation for data if there’s more than one copy referenced.

 

============== Subvolumes

A subvolume in btrfs is not the same as an LVM logical volume or a ZFS subvolume.
(With LVM, a logical volume is a block device in its own right(filesystem))

* A btrfs filesystem has a default subvolume, which is initially set to be the top-level subvolume and

which is mounted if no subvol or subvolid option is specified. (subvolume can be mounted by subvol or subvolid )

Changing the default subvolume with btrfs subvolume set-default will make the top level of the filesystem inaccessible,

except by use of the subvol=/ or subvolid=5 mount options.

 * mounted top-level subvolume (subvol=/ or subvolid=5) => the full filesystem structure will be seen at the mount

 

Layout

 - Flat Subvolumes
 - Nested Subvolumes

When to use Subvolumes

 * "Split" of areas which are "complete" and/or "consistent" in themselves. ("/var/lib/postgresql/")

 * Split of areas which need special properties. (VM images, "nodatacow")
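
A common recipe for the VM-image case, as a hedged sketch (paths are illustrative): create a dedicated subvolume and mark it No_COW so new images created inside it are not copy-on-write:

btrfs subvolume create /data/vm-images

chattr +C /data/vm-images          # new files created inside inherit the No_COW attribute

lsattr -d /data/vm-images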

 

====================== Snapshots

A snapshot is simply a subvolume that shares its data (and metadata) with some other subvolume, using btrfs's COW capabilities.

Once a [writable] snapshot is made, there is no difference in status between the original subvolume, and the new snapshot subvolume.

# Structure

toplevel
  +-- root                 (subvolume, to be mounted at /)
  +-- home                 (subvolume, to be mounted at /home)
  \-- snapshots            (directory)
      +-- root             (directory)
          +-- 2015-01-01   (subvolume, ro-snapshot of subvolume "root")
          +-- 2015-06-01   (subvolume, ro-snapshot of subvolume "root")
      \-- home             (directory)
          \-- 2015-01-01   (subvolume, ro-snapshot of subvolume "home")
          \-- 2015-12-01   (subvolume, rw-snapshot of subvolume "home")

# fstab

LABEL=the-btrfs-fs-device   /                    subvol=/root,defaults,noatime      0  0
LABEL=the-btrfs-fs-device   /home                subvol=/home,defaults,noatime      0  0
LABEL=the-btrfs-fs-device   /root/btrfs-top-lvl  subvol=/,defaults,noauto,noatime   0  0

# Creating a rw-snapshot

mount /root/btrfs-top-lvl
btrfs subvolume snapshot /root/btrfs-top-lvl/home /root/btrfs-top-lvl/snapshots/home/2015-12-01
umount /root/btrfs-top-lvl

# Restore

mount /root/btrfs-top-lvl
umount /home
mv /root/btrfs-top-lvl/home /root/btrfs-top-lvl/home.tmp     # or it could have been deleted, see below
mv /root/btrfs-top-lvl/snapshots/home/2015-12-01 /root/btrfs-top-lvl/home
mount /home

Remark - ro-snapshot

subvolume snapshot -r <source> <dest>|[<dest>/]<name>

 * Read-only subvolumes cannot be moved.

 


Creating a multi-device btrfs (mkfs.btrfs)

 

Multiple devices vs a single device

When a btrfs filesystem is created, the defaults depend on whether there are multiple devices

Multiple devices:

  • metadata: mirrored across two devices
  • data: striped

Single device:

  • metadata: duplicated on same disk

# Don't duplicate metadata on a single drive

mkfs.btrfs -m single /dev/sdz1

-m  <opt>  <--- metadata profile <-- make sure it can still be mounted in degraded mode

-d   <opt>  <--- data profile

Remark

 * more devices can be added after the FS has been created

 * RAID levels can be converted after the FS has been created

Differences when creating the Filesystem

mkfs.btrfs /dev/sdz1

For an HDD this is equivalent to

mkfs.btrfs -m dup -d single /dev/sdz1

For an SSD (or non-rotational device) this is equivalent to

mkfs.btrfs -m single -d single /dev/sdz1

Multiple devices

mkfs.btrfs /dev/sdx1 /dev/sdy1

Resulting profiles

  • Metadata: RAID1
  • System: RAID1
  • Data: RAID0

Supported RAID levels:

single, RAID0, RAID1, RAID10 (4 devices), RAID5 and RAID6

single: full capacity of multiple drives with different sizes (metadata mirrored, data not mirrored and not striped)

 


RAID1 Usage

 

Create raid1:

mkfs.btrfs -m raid1 -d raid1 -L R1_Data /dev/sdd1 /dev/sde1

Opt:

  • -m level                # metadata (raid0, raid1, raid10 or single)
  • -d  level                # data (raid0, raid1, raid10 or single)
  • -L  name               # Label (can also be changed later with "btrfs filesystem label")

 * raid1 volumes only mountable once RW if degraded (kernel versions: 4.9.x, 4.4.x)

    The read-only degraded mount policy is enforced by the kernel code, not btrfs user space tools.

 * When in RO mode, it seems I cannot do anything; cannot replace, nor add, nor delete a disk.

 * To get the filesystem mounted rw again, one needs to patch the kernel.

     => Once that one RW degraded mount has been used, the only way to recover and get RW again is to re-create the filesystem and copy the data back

     => Therefore, right after a degraded mount you should immediately convert RAID1 -> Single

dmesg

[ 3961.007510] BTRFS info (device sdf1): allowing degraded mounts
[ 3961.007520] BTRFS info (device sdf1): disk space caching is enabled
[ 3961.007526] BTRFS info (device sdf1): has skinny extents
[ 3961.157803] BTRFS info (device sdf1): bdev (null) errs: wr 452660, rd 0, flush 0, corrupt 0, gen 0
[ 3976.264045] BTRFS warning (device sdf1): missing devices (1) exceeds the limit (0),
  writeable mount is not allowed
[ 3976.291632] BTRFS: open_ctree failed

P.S.

It can do RAID with more than 2 hard disks !!

 

Why do I have "single" chunks in my RAID filesystem?

Data, RAID1: total=3.08TiB, used=3.02TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=3.88MiB, used=336.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=4.19GiB, used=3.56GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

The single chunks are perfectly normal, and are a result of the way that mkfs works.

They are small, harmless, and will remain unused as the FS grows, so you won't risk any unreplicated data.

 


Conversion (Single -> RAID1)

 

Single -> RAID1

From /dev/sda1 with no raid to raid1 (sda1 and sdb1)

mount /dev/sda1 /mnt
btrfs device add /dev/sdb1 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

This command keeps running until the whole process completes; the mount point remains usable for R/W in the meantime.

If the metadata is not converted from the single-device default,

it remains as DUP, which does not guarantee that copies of block are on separate devices.

If data is not converted it does not have any redundant copies at all.

# Check the progress from another shell

btrfs balance status .

Balance on '.' is running
52 out of about 2583 chunks balanced (53 considered),  98% left

A moment later

Balance on '.' is running
55 out of about 2583 chunks balanced (56 considered),  98% left

btrfs filesystem df .

Data, RAID1: total=18.00GiB, used=16.31GiB             # gradually increasing
Data, single: total=2.50TiB, used=2.49TiB              # gradually decreasing
System, DUP: total=8.00MiB, used=304.00KiB
Metadata, DUP: total=3.50GiB, used=2.70GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Profile names
  • raid0
  • raid1
  • raid10
  • raid5
  • raid6
  • dup (A form of "RAID" which stores two copies of each piece of data on the same device.)
  • single

Default:

# checking: btrfs fi df /data/pc_data

Data, single: total=696.01GiB, used=677.99GiB
System, DUP: total=8.00MiB, used=96.00KiB
Metadata, DUP: total=1.50GiB, used=903.45MiB
GlobalReserve, single: total=512.00MiB, used=0.00B

RAID1 -> Single

cd /path/to/btrfs-mount-point

btrfs balance start -dconvert=single -mconvert=dup .

Done, had to relocate 6 out of 6 chunks

btrfs balance status .

Balance on '.' is running
5 out of about 2187 chunks balanced (6 considered), 100% left

 


Replacing failed devices

 

[Case 1] A disk is found to be dead:

Mounting fails

mount /dev/sdg1 /data/1T

mount: wrong fs type, bad option, bad superblock on /dev/sdg1,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

dmesg shows

# This RAID1 was originally composed of sdg1 and sdh1

[1519957.580510] BTRFS info (device sdg1): disk space caching is enabled
[1519957.580513] BTRFS info (device sdg1): has skinny extents
[1519957.591738] BTRFS warning (device sdg1): devid 2 uuid 289afc10-5bd5-47c4-b0f7-4c6e129554ae is missing
[1519957.591743] BTRFS error (device sdg1): failed to read chunk tree: -5
[1519957.628542] BTRFS error (device sdg1): open_ctree failed

Use fi show to see the details

# Show the structure of a filesystem

btrfs filesystem show /dev/sdg1

warning, device 2 is missing
warning devid 2 not found already
Label: '1T-RAID'  uuid: 394c3048-5fee-4bc9-b1b2-ec0084c0e2f0
        Total devices 2 FS bytes used 1.12MiB
        devid    1 size 931.51GiB used 4.04GiB path /dev/sdg1
        *** Some devices missing

Recovery steps:

  • mount in degraded mode
# /dev/sdg1 is the disk in the raid1 that is still healthy
mount -o degraded /dev/sdg1 /data/1T
  • determine the raid level currently in use

btrfs fi df /data/1T

Data, RAID1: total=1.00GiB, used=0.00B
Data, single: total=1.00GiB, used=1.00MiB
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=32.00KiB
Metadata, single: total=1.00GiB, used=80.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
  • add a new device
btrfs device add /dev/sdc1 /data/1T
  • remove the missing device
# 'missing' is a special device name
btrfs device delete missing /data/1T

 * After the remove cmd finishes, the extents will be spread onto the new disk (sdc1)

When there are not enough devices left, you will get

ERROR: error removing device 'missing': unable to go below two devices on raid1

Success:

 * After the command finishes, the missing device no longer appears

Failure:

ERROR: error removing device 'missing': unable to go below two devices on raid1

[Case 2] A device is known in advance to be failing:

btrfs device stats /data/1T

[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
[/dev/sdh1].write_io_errs   100
[/dev/sdh1].read_io_errs    50
[/dev/sdh1].flush_io_errs   30
[/dev/sdh1].corruption_errs 30
[/dev/sdh1].generation_errs 30

=> So we know in advance that sdh is failing

# It will try to read the data from sdh1 (or from the other mirror), then write it to sdi1

# start [-Bfr] <srcdev>|<devid> <targetdev> <path>

# If the source device is disconnected from the system, you have to use the devid parameter format.

# "-r" only read from <srcdev> if no other zero-defect mirror exists.

btrfs replace start /dev/sdh1 /dev/sdi1 /data/1T

# Status

# print continuously until the replace operation finishes

btrfs replace status /data/1T
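
If /dev/sdh1 has already disappeared from the system, the devid form has to be used instead; a hedged sketch (devid 2 is an assumption taken from the earlier fi show output):

btrfs replace start 2 /dev/sdi1 /data/1T

btrfs replace status /data/1T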

 


Play btrfs with image file

 

  • Create Image File
  • Create RAID1
  • Expand from 2 Disk to 4 Disk (1g -> 2g)
  • Upgrade RAID1 to RAID10

# Create 1G image file X 4

truncate -s 1g vdisk1.img vdisk2.img vdisk3.img vdisk4.img

# map image file to block device

losetup -l

           # No output

losetup /dev/loop1 vdisk1.img

losetup /dev/loop2 vdisk2.img

losetup /dev/loop3 vdisk3.img

losetup /dev/loop4 vdisk4.img

# Create RAID1 (loop1, loop2)

mkfs.btrfs -m raid1 -d raid1 /dev/loop1 /dev/loop2

# mount it

mkdir /mnt/btrfs

mount -t btrfs /dev/loop1 /mnt/btrfs

# check status

btrfs filesystem df /mnt/btrfs

Data, RAID1: total=102.38MiB, used=128.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=102.38MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

btrfs filesystem show /mnt/btrfs

Label: none  uuid: cdcded1d-edeb-4c73-948d-c3628703b47d
        Total devices 2 FS bytes used 256.00KiB
        devid    1 size 1.00GiB used 212.75MiB path /dev/loop1
        devid    2 size 1.00GiB used 212.75MiB path /dev/loop2

btrfs device usage /mnt/btrfs/

/dev/loop1, ID: 1
   Device size:             1.00GiB
   Device slack:              0.00B
   Data,RAID1:            102.38MiB
   Metadata,RAID1:        102.38MiB
   System,RAID1:            8.00MiB
   Unallocated:           811.25MiB

/dev/loop2, ID: 2
   ...

df -h | grep btrfs

/dev/loop1                1.0G   17M  913M   2% /mnt/btrfs

# Add devices to the existing btrfs filesystem (loop3, loop4)

btrfs device add /dev/loop3 /dev/loop4 /mnt/btrfs

btrfs filesystem df /mnt/btrfs

Data, RAID1: total=102.38MiB, used=128.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=102.38MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

btrfs filesystem show /mnt/btrfs

Label: none  uuid: bd7b494c-f242-4bcd-9160-443b5adf141c
        Total devices 4 FS bytes used 256.00KiB
        devid    1 size 1.00GiB used 212.75MiB path /dev/loop1
        devid    2 size 1.00GiB used 212.75MiB path /dev/loop2
        devid    3 size 1.00GiB used 0.00B path /dev/loop3
        devid    4 size 1.00GiB used 0.00B path /dev/loop4

df -h | grep btrfs

/dev/loop1                2.0G   17M  1.9G   1% /mnt/btrfs

blkid vdisk*.img

 * All image files share the same UUID but have different UUID_SUB values

vdisk1.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="e333a51b-9ef1-4369-b440-83d538e628d9" TYPE="btrfs"
vdisk2.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="85e90b67-9f60-4430-8c73-a1cf91e86bdd" TYPE="btrfs"
vdisk3.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="53469a58-803a-4847-a383-2f1470b8c126" TYPE="btrfs"
vdisk4.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="e19db13f-2995-4e9e-82af-3f5863cf61c5" TYPE="btrfs"

# raid1 -> raid10

btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/btrfs

Done, had to relocate 3 out of 3 chunks

# check status

btrfs filesystem df /mnt/btrfs

Data, RAID10: total=416.00MiB, used=128.00KiB
System, RAID10: total=64.00MiB, used=16.00KiB
Metadata, RAID10: total=256.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Cleanup

umount /mnt/btrfs

losetup -l

NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                   DIO
/dev/loop1         0      0         0  0 /root/btrfs-test/vdisk1.img   0
/dev/loop4         0      0         0  0 /root/btrfs-test/vdisk4.img   0
/dev/loop2         0      0         0  0 /root/btrfs-test/vdisk2.img   0
/dev/loop3         0      0         0  0 /root/btrfs-test/vdisk3.img   0

# -D, --detach-all              detach all used devices

losetup -D

 


btrfs-send / btrfs-receive

 

Where it beats rsync

 * rsync does not track file renames

 * Supports huge files

btrfs-send

btrfs-send - generate a stream of changes between two subvolume snapshots

This command will generate a stream of instructions that describe changes between two subvolume snapshots.
The stream can be consumed by the btrfs receive command to replicate the sent snapshot on a different filesystem.
The command operates in two modes: full and incremental.

Only the send side is happening in-kernel. Receive is happening in user-space.

Based on the differences found by btrfs_compare_tree, we generate a stream of instructions.

"btrfs send" requires read-only subvolumes to operate on.

btrfs-receive - receive subvolumes from send stream

btrfs-receive

btrfs receive will fail in the following cases:

    - receiving subvolume already exists
    - previously received subvolume has been changed after it was received

A subvolume is made read-only after the receiving process finishes successfully

btrfs receive sets the subvolume read-only after it completes successfully. However, while the receive is in progress, users who have write access to files or directories in the receiving path can add, remove, or modify files, in which case the resulting read-only subvolume will not be an exact copy of the sent subvolume.

If the intention is to create an exact copy, the receiving path should be protected from access by users until the receive operation has completed and the subvolume is set to read-only.

Additionally, receive does not currently do a very good job of validating that an incremental send stream actually makes sense, and it is thus possible for a specially crafted send stream to create a subvolume with reflinks to arbitrary files in the same filesystem. Because of this, users are advised to not use btrfs receive on send streams from untrusted sources, and to protect trusted streams when sending them across untrusted networks.

Usage (test)

Preparation

mount | grep btrfs

/dev/vdd on /backup type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
/dev/vdc on /home type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)

btrfs subvol create /home/data              # must be a subvol, since only a subvol can be snapshotted

mkdir /home/snap /backup/data            # create folders to hold the snapshots

Initial Bootstrapping

# Create a readonly snapshot of the subvolume

btrfs subvol snap -r /home/data /home/snap/data-snap1

# Send the whole snapshot to /backup, creating data-snap1

btrfs send /home/snap/data-snap1 | btrfs receive /backup/data
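
The same stream can also be piped over the network; a hedged sketch (backuphost and the remote path are illustrative):

btrfs send /home/snap/data-snap1 | ssh root@backuphost "btrfs receive /backup/data"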

Side note

btrfs subvol show /home/snap/data-snap1

snap/data-snap1
        Name:                   data-snap1
        UUID:                   79798a57-ad04-1a4f-8f6e-89b99687d8d2
        Parent UUID:            b1e3cb62-a34a-2541-a15a-ac6032564e6b
        Received UUID:          -
        ...

Notes: on the send side, "Send time" is always the same as "Creation time", so it carries no extra meaning

btrfs subvol show /backup/data/data-snap1

data/data-snap1
        Name:                   data-snap1
        UUID:                   83bb6d33-b120-0a40-bf01-81b96ba5fb78
        Parent UUID:            -
        Received UUID:          79798a57-ad04-1a4f-8f6e-89b99687d8d2
        Send time:              2023-02-16 17:58:38 +0800
        ...
        Receive time:           2023-02-16 17:58:44 +0800

Notes: Receive time - Send time = how long the transfer took

Incremental Backup

btrfs subvol snap -r /home/data /home/snap/data-snap2

# send side: compares the two snapshots in kernel mode and builds the "instruction stream"

# -p <parent> <=  send an incremental stream from parent to subvol

# receive side: creates the subvolume "data-snap2" inside "/backup/data"

btrfs send -p /home/snap/data-snap1 /home/snap/data-snap2 | btrfs receive /backup/data

btrfs subvol show /home/snap/data-snap2

snap/data-snap2
        Name:                   data-snap2
        UUID:                   2a5054d1-5065-8c40-86bc-4c1599595b6f
        Parent UUID:            b1e3cb62-a34a-2541-a15a-ac6032564e6b
        Received UUID:          -
        ...

btrfs subvol show /backup/data/data-snap2

data/data-snap2
        Name:                   data-snap2
        UUID:                   54ee9c82-5020-0646-a22f-a924cda56001
        Parent UUID:            83bb6d33-b120-0a40-bf01-81b96ba5fb78
        Received UUID:          2a5054d1-5065-8c40-86bc-4c1599595b6f
        ...

Notes

An incremental receive has both a "Parent UUID" and a "Received UUID"

Cleanup

# The next send / receive only depends on snap2, so snap1 can be deleted

btrfs subvol del /home/snap/data-snap1

btrfs subvol del /backup/data/data-snap1

 * The receive side must still have the snapshot that send -p points to

At subvol /home/snap/data-snap4
At snapshot data-snap4
ERROR: cannot find parent subvolume

MyBackupScript

#!/bin/bash

bk_path="/data/pc_data/photo"
snap_path="/data/_snap"
dest_path="/data/1T/backup"
keep=3

##############################################

lockfile=/tmp/btrfs-backup.lck

if [ -e $lockfile ]; then
        echo "Under locking" && exit
fi
touch $lockfile

folder=$(basename $bk_path)

org_snap=$snap_path/$folder
new_snap=$snap_path/${folder}-new

org_bak=$dest_path/$folder
new_bak=$dest_path/${folder}-new

echo " * create new snapshot"
btrfs sub snap -r $bk_path $new_snap
sync
echo " * send | receive"
btrfs send -p $org_snap $new_snap | btrfs receive $dest_path

echo " * send side cleanup"
btrfs sub del $org_snap
mv $new_snap $org_snap

echo " * receive side cleanup"
btrfs subvolume delete $org_bak
mv $new_bak $org_bak
btrfs subvolume snapshot -r $org_bak $org_bak.$(date +%Y-%m-%d)

echo " * rotate backup to $keep"
ls -rd ${org_bak}.* | tail -n +$(( $keep + 1 ))| while read snap
do
        echo $snap
        btrfs subvolume delete "$snap"
done

rm -f $lockfile

ls $dest_path

 


GlobalReserve

 

What is the GlobalReserve and why does 'btrfs fi df' show it as single even on RAID filesystems?

The global block reserve is last-resort space for filesystem operations that may require allocating workspace even on a full filesystem.

An example is removing a file, subvolume or truncating a file.

This is mandated by the COW model, even removing data blocks requires to allocate some metadata blocks first (and free them once the change is persistent).

The block reserve is only virtual and is not stored on the devices.

It's an internal notion of Metadata but normally unreachable for the user actions (besides the ones mentioned above). For ease it's displayed as single.

The size of the global reserve is determined dynamically according to the filesystem size but is capped at 512MiB. A "used" value greater than 0 means that it is in use.
 

 


XOR module

 

modprobe xor

[4945537.298825] xor: automatically using best checksumming function: generic_sse
[4945537.316018]    generic_sse:  3983.000 MB/sec
[4945537.316026] xor: using function: generic_sse (3983.000 MB/sec)

P.S.

btrfs depends: libcrc32c, zlib_deflate

 


Common use cases

 

# Convert an existing directory into a subvolume

cp -a --reflink=always Path2Folder Path2Subvolume

Remark

--reflink[=always]  is  specified, perform a lightweight copy,

where the data blocks are copied only when modified. 

If this is not possible the copy fails, or if --reflink=auto is specified,

fall back to a standard copy.

If you use --reflink=always on a non-CoW capable filesystem, you will be given an error.

cp --reflink=always my_file.bin my_file_copy.bin
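
A fuller hedged recipe for replacing an existing directory with a subvolume (names are illustrative; run it while nothing is writing to the directory):

btrfs subvolume create /data/olddir.new
cp -a --reflink=always /data/olddir/. /data/olddir.new/
mv /data/olddir /data/olddir.bak
mv /data/olddir.new /data/olddir

# after verifying the content, remove the backup
rm -rf /data/olddir.bak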
 


Data Integrity Test

 

Fault isolation

Btrfs generates checksums for data and metadata blocks.

Checksums for data blocks and metadata blocks are enabled separately

Default: Metadata: On; Data: Off

Corruption detection and correction

In Btrfs, checksums are verified each time a data block is read from disk.

If the file system detects a checksum mismatch while reading a block,

it first tries to obtain (or create) a good copy of this block from another device.

Test

sha256sum CentOS-7-x86_64-Minimal-1708.iso

bba314624956961a2ea31dd460cd860a77911c1e0a56e4820a12b9c5dad363f5

cp CentOS-7-x86_64-Minimal-1708.iso /mnt/btrfs

Check 1 byte of content at a given position on the disk

# Position: 500MB = 524288000 bytes

xxd -ps -s 524288000 -l 1 vdisk1.img

84

xxd -ps -s 524288000 -l 1 vdisk2.img

d7

Remark

Under RAID1, vdisk1 and vdisk2 do not hold the same content at the same offset

# Corrupt 1 byte at a given position

# Corrupt vdisk1

printf '\x2a' | dd of=vdisk1.img bs=1 count=1 seek=524288000  conv=notrunc

sha256sum CentOS-7-x86_64-Minimal-1708.iso        # checksum still correct, no error, and nothing in dmesg !!

# Corrupt vdisk2

printf '\x2a' | dd of=vdisk2.img bs=1 count=1 seek=524288000  conv=notrunc

sha256sum CentOS-7-x86_64-Minimal-1708.iso        # surprisingly still fine

Making the error surface

echo 1 > /proc/sys/vm/drop_caches

# reading the file produces nothing on stderr; the error only shows up in dmesg

sha256sum CentOS-7-x86_64-Minimal-1708.iso

# Check the error

dmesg

BTRFS warning (device loop1): csum failed root 5 ino 257 off 195297280 csum 0xe07abfd6 expected csum 0xdc613d1e mirror 2
BTRFS warning (device loop1): csum failed root 5 ino 257 off 195297280 csum 0xe07abfd6 expected csum 0xdc613d1e mirror 2
BTRFS info (device loop1): read error corrected: ino 257 off 195297280 (dev /dev/loop1 sector 1024000)

Explanation

root 5 ino 257                                      # "stat filename" shows the root and ino

find . -mount -inum 257

./CentOS-7-x86_64-Minimal-1708.iso

"(dev /dev/loop1 sector 1024000)"   # the ~500 MB position on loop1 is where the corruption was

# Check the repaired content: 2a -> 84

xxd -ps -s 524288000 -l 1 vdisk1.img

84

 

Scrubbing

Scrub job that is performed in the background.
(only checks and repairs the portions of disks that are in use)

Start Scrub

cd /path/to/mountpoint

btrfs scrub start .

scrub started on ., fsid c5dfef39-c902-4d51-8217-17e268a77661 (pid=27078)
server:btrfs# WARNING: errors detected during scrubbing, corrected

dmesg

... BTRFS warning (device loop1): checksum error at logical 524288000 on dev /dev/loop1, 
                              physical 524288000, root 5, inode 257, offset 195297280, length 4096, 
                              links 1 (path: CentOS-7-x86_64-Minimal-1708.iso)
... BTRFS error (device loop1): bdev /dev/loop1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
... BTRFS error (device loop1): fixed up error at logical 524288000 on dev /dev/loop1

Health Check

# -z|--reset                # Print the stats and reset the values to zero afterwards

btrfs dev stats .          # Read and print the device IO error statistics

[/dev/mapper/vg3t-data_disk].write_io_errs    0
[/dev/mapper/vg3t-data_disk].read_io_errs     0
[/dev/mapper/vg3t-data_disk].flush_io_errs    0
[/dev/mapper/vg3t-data_disk].corruption_errs  0
[/dev/mapper/vg3t-data_disk].generation_errs  0

btrfs scrub status .

UUID:             ...
Scrub started:    Fri Dec 29 10:07:22 2023
Status:           running
Duration:         0:01:55
Time left:        1:34:27
ETA:              Fri Dec 29 11:43:45 2023
Total to scrub:   1.00TiB
Bytes scrubbed:   20.48GiB  (1.99%)
Rate:             182.39MiB/s
Error summary:    no errors found

 


btrfs: HUGE metadata allocated

 

First of all, BTRFS does allocate metadata (and data) one chunk at a time.

A typical metadata block group is 256MiB (for filesystems smaller than 50GiB) or 1GiB (for larger ones),

while a data block group is typically 1GiB. The system block group size is a few megabytes.

Keep in mind that BTRFS also stores smaller files in the metadata

(which may contribute to your "high" metadata usage)

By default BTRFS also duplicates metadata (so it can recover from a corrupted copy)

btrfs balance filters

Opts

  • -d[<filters>]       # act on data block groups
  • -m[<filters>]      # act on metadata chunks

filters

limit its action to a subset of the full filesystem

usage=<percent>

Balances only block groups with usage under the given percentage.

The value of 0 is allowed and will clean up completely unused block groups (ENOSPC)

i.e.

# usage=60 means that only chunks which are less than 60% used (i.e. with more than 40% wasted space) are balanced

# => such chunks are merged with others into new, fuller chunks

btrfs balance start -v -musage=60 /path
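
A common first step when hitting ENOSPC, as a hedged sketch: reclaim completely empty chunks before running a heavier balance:

btrfs balance start -dusage=0 -musage=0 /path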

 


btrfs allocated 100%

 

Cause

BTRFS starts every write in a freshly allocated chunk. (COW)

Data usage: there is a block-layer view and a file-layer view

Checking

df -h /var/lib/lxc/lamp/rootfs

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdd6        38G   34G  1.9G  95% /var/lib/lxc/lamp/rootfs

btrfs fi show /var/lib/lxc/lamp/rootfs

Label: none  uuid: 8145ac80-1173-473f-994f-9080e7d03713
        Total devices 1 FS bytes used 33.31GiB
        devid    1 size 37.16GiB used 37.16GiB path /dev/sdd6

btrfs fi usage /var/lib/lxc/lamp/rootfs

Overall:
    Device size:                  37.16GiB
    Device allocated:             37.16GiB
    Device unallocated:              0.00B
    Device missing:                  0.00B
    Used:                         33.83GiB
    Free (estimated):              1.85GiB      (min: 1.85GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.99
    Global reserve:              137.54MiB      (used: 0.00B)
...

[Fix]

btrfs balance start /var/lib/lxc/lamp/rootfs &

btrfs balance status -v /var/lib/lxc/lamp/rootfs

Balance on '/var/lib/lxc/lamp/rootfs' is running
0 out of about 42 chunks balanced (21 considered), 100% left
Dumping filters: flags 0x7, state 0x1, force is off
  DATA (flags 0x0): balancing
  METADATA (flags 0x0): balancing
  SYSTEM (flags 0x0): balancing

 


 

 
