Last updated: 2023-03-10
Table of Contents
- Caveats
- New-style commands
- Scrub
- Kernel threads
- Recovery
- Build btrfs-progs
- extref(extended inode refs)
- More about btrfs
- Creating a multi-device btrfs (mkfs.btrfs)
- RAID1 Usage
- Conversion (single -> raid1)
- Replacing failed devices
- Play btrfs with image file
- btrfs-send / btrfs-receive
- GlobalReserve
- XOR module
- Common use cases
- Data Integrity Test
Introduction
btrfs is a versatile filesystem. Its core design is copy-on-write (CoW), and it supports the following features:
- Extent based file storage (2^64 max file size = 16EiB)
- Dynamic inode allocation
- Writable snapshots
- Subvolumes (internal filesystem roots)
- Subvolume-aware quota support
- Object level mirroring and striping
- Compression (ZLIB, LZO)
- Online filesystem defragmentation
- Integrated multiple device support (RAID0, RAID1, RAID10, RAID5 )
- Checksums on metadata and data (crc32c, xxhash, sha256, blake2b)
- Space-efficient packing of small files
- Space-efficient indexed directories
- Seed devices (Create a readonly filesystem that acts as a template to seed other Btrfs filesystems)
- Background scrub process for finding and fixing errors on files with redundant copies
- Efficient incremental backup (btrfs send)
In short: if you are not using it, you are missing out ~
However, be careful before using it: on Linux 2.6 there is still no fsck !! (the btrfsck shipped with Linux 3.2 is still not usable..)
Also, performance is better if the CPU supports hardware CRC32
(grep -i SSE4.2 /proc/cpuinfo)
Version
Improvements in each version: btrfs_version
btrfs commands
btrfs commands now come in two styles, the old style and the new style
New style:
- btrfs [option] # yes, the new style is a single command, followed by a pile of options
Old style:
- btrfsctl
- btrfs-show
- btrfstune
Checking & mount & fstab
scan
This is required after loading the btrfs module if you're running with more than one device in a filesystem.
btrfs device scan /dev/sdd1 /dev/sde1
Scanning for Btrfs filesystems in '/dev/sdd1'
Scanning for Btrfs filesystems in '/dev/sde1'
dmesg
[454919.695156] btrfs: device label raid4t devid 1 transid 4 /dev/sdd1
[454919.705341] btrfs: device label raid4t devid 2 transid 4 /dev/sde1
# Show the structure of a filesystem
btrfs filesystem show /dev/sdd1
Label: raid4t  uuid: da8c8ae3-163b-47ba-aafc-6ec0199df818
        Total devices 2 FS bytes used 640.00KiB
        devid    1 size 3.64TiB used 2.03GiB path /dev/sdd1
        devid    2 size 3.64TiB used 2.01GiB path /dev/sde1
# Show space usage information for a mount point
btrfs filesystem df /data/raid4t
Data, RAID1: total=1.00GiB, used=512.00KiB
Data, single: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
df -h
/dev/sdd1 7.3T 4.0G 7.3T 1% /data/raid4t
* BTRFS isn't actually mirroring whole disks like classic "raid 1"; it stores at least two copies of the data.
fstab
/etc/fstab
If your boot process doesn't perform a "btrfs device scan [/dev/sd?]",
you can still mount a multi-volume btrfs filesystem by passing all the devices in the filesystem explicitly to the mount command
/dev/sdb /mnt btrfs device=/dev/sdb1,device=/dev/sdc2,device=/dev/sdd1,device=/dev/sde1 0 0
Setting the default mount volume
List the volumes:
btrfs sub list /mnt/sda8
ID 256 top level 5 path vps
Set it:
btrfs sub set-default 256 /mnt/sda8
fstab:
# myserver
/dev/sda8 /var/lib/lxc/myserver/rootfs btrfs noatime,subvol=vps 0 0
/dev/sda8 /mnt/sda8 btrfs noatime,subvolid=0 0 0
Specify which sub-volume is used as the root directory when mounting /dev/sda5
mount:
mount -t btrfs -o subvol=vm.13-9-11 /dev/sda7 /mnt/tmp
Mount Options
noatime
device
# mount a multi-volume btrfs filesystem by passing all the devices in the filesystem
device=/dev/sdd1,device=/dev/sde1
subvol
By label
subvol=home
By id
subvolid=0 # both 0 and 5 refer to the top level
autodefrag (since 3.0)
Will detect random writes into existing files and kick off background defragging.
It is well suited to bdb or sqlite databases, but not virtualization images or big databases (yet).
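e.g. a minimal sketch (the device and mount point are placeholders):
mount -o autodefrag /dev/sdX1 /mnt
# or persistently via /etc/fstab:
# /dev/sdX1  /mnt  btrfs  autodefrag,noatime  0 0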
commit=number (since 3.12)
Set the interval of periodic commit, 30 seconds by default.
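e.g. a sketch (device and mount point are placeholders) - raise the commit interval to 120 seconds:
mount -o commit=120 /dev/sdX1 /mnt
# or adjust it on an already-mounted filesystem:
mount -o remount,commit=120 /mnt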
checksum
* debugging options
check_int (since 3.3)
Switch on integrity checker for metadata
check_int_data (since 3.3)
Switch on integrity checker for data and metadata
compress
compress=zlib - Better compression ratio. It's the default and safe for older kernels.
compress=lzo - Faster compression.
compress=no - Disables compression (starting with kernel 3.6).
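e.g. a sketch (paths are placeholders). Only data written after mounting is compressed; existing files can be (re)compressed via defragment's -c option:
mount -o compress=lzo /dev/sdX1 /mnt
btrfs filesystem defragment -r -czlib /mnt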
space_cache (since 2.6.37)
space_cache[=version] (space_cache=v1 and space_cache=v2 since 4.5), nospace_cache
Btrfs stores the free space data on-disk to make the caching of a block group much quicker.
If enabled, the kernel keeps the filesystem's free-space block addresses in memory,
so when you create a new file it can immediately start writing data to disk.
Without this, Btrfs has to scan the entire tree every time looking for the free space that can be allocated.
clear_cache (since 2.6.37)
Clear all the free space caches during mount.
(Use it one time, and only after you notice some problems with free space.)
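e.g. a one-off sketch (device and mount point are placeholders):
mount -o clear_cache /dev/sdX1 /mnt    # rebuild the free space cache once
# on kernels with space_cache=v2 (4.5+, see above) the rebuilt cache can be the v2 one:
# mount -o clear_cache,space_cache=v2 /dev/sdX1 /mnt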
degraded
With RAID1 you only get one chance to mount read-write in degraded mode (on the affected kernels, see the RAID1 section below), so use it with care.
"-o degraded"
discard
Enables discard/TRIM on freed blocks.
inode_cache (since 3.0)
Enable free inode number caching.
(This option may slow down your system at first run. )
recovery (since 3.2)
Enable autorecovery upon mount; currently it scans list of several previous tree roots and tries to use the first readable. The information about the tree root backups is stored by kernels starting with 3.2, older kernels do not and thus no recovery can be done.
nodatacow
Do not copy-on-write data for newly created files, existing files are unaffected. This also turns off checksumming! IOW, nodatacow implies nodatasum. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large. NOTE: switches off compression !
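e.g. a sketch (paths are placeholders) - nodatacow can be set for the whole filesystem at mount time, or per directory/file with chattr +C (only effective on new/empty files; files created inside the directory inherit the flag):
mount -o nodatacow /dev/sdX1 /mnt     # whole filesystem
mkdir /mnt/vm-images
chattr +C /mnt/vm-images              # new files created here will be No_COW
lsattr -d /mnt/vm-images              # verify the 'C' attribute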
DOC
https://btrfs.wiki.kernel.org/index.php/Mount_options
Caveats
Block-level copies of devices
Do NOT
make a block-level copy of a Btrfs filesystem to another block device...
use LVM snapshots, or any other kind of block level snapshots...
turn a copy of a filesystem that is stored in a file into a block device with the loopback driver...
... and then try to mount either the original or the snapshot while both are visible to the same kernel.
Why?
If there are multiple block devices visible at the same time, and those block devices have the same filesystem UUID,
then they're treated as part of the same filesystem.
If they are actually copies of each other (copied by dd or LVM snapshot, or any other method),
then mounting either one of them could cause data corruption in one or both of them.
If you for example make an LVM snapshot of a btrfs filesystem, you can't mount either the LVM snapshot or the original:
the kernel gets confused, because it thinks it is mounting a Btrfs filesystem that consists of two disks,
and then runs into two devices which have the same device number.
Fragmentation
Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.
On servers and workstations this affects databases and virtual machine images.
On desktops this primarily affects application databases.
You can use filefrag to locate heavily fragmented files (may not work correctly with compression).
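e.g. a sketch (the file path is a placeholder) - count a suspect file's extents and defragment it if needed:
filefrag /path/to/vm-image.qcow2            # prints the number of extents
filefrag -v /path/to/vm-image.qcow2 | head  # list the individual extents
btrfs filesystem defragment -f /path/to/vm-image.qcow2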
Having many subvolumes can be very slow
The cost of several operations, including currently balance and device delete, is proportional to the number of subvolumes,
including snapshots, and (slightly super-linearly) the number of extents in the subvolumes.
This is "obvious" for "pure" subvolumes, as each is an independent file tree and has independent extents anyhow (except for ref-linked ones).
But in the case of snapshots, metadata and extents are (usually) largely ref-linked with the ancestor subvolume,
so a full scan of the snapshot should not be necessary, but currently it still happens.
New-style commands
btrfs - tool to control a btrfs filesystem
Version
btrfs version
btrfs-progs v4.4
help
btrfs help [--full]
btrfs subvolume --help
# The following example creates a subvolume called data and sets it as the root used at mount time
#1 Preparation
mount /dev/sdg1 /mnt
cd /mnt
btrfs subvol show .
/mnt/btrfs is toplevel subvolume
#2 Create a subvolume called data
btrfs subvol create data
Remark: deletion
btrfs subvolume delete <subvolume>
#3 List the IDs of all subvolumes (the newly created subvolume has ID 257)
btrfs subvol list .
ID 257 gen 11 top level 5 path data
#4 Check the current default (root) ID
btrfs subvol get-default .
ID 5 (FS_TREE)
#5 Set the subvolume with ID 257 as the default root
btrfs subvol set-default 257 .
#6 Test
cd ..
umount /mnt
mount /dev/sdg1 /mnt
cd /mnt
btrfs subvolume show .
/mnt
        Name:                   data
        UUID:                   6bb748d2-21a2-ec4a-84b8-18d441bb6bd0
        Parent UUID:            -
        Received UUID:          -
        Creation time:          2019-01-03 23:13:14 +0800
        Subvolume ID:           257
        Generation:             11
        Gen at creation:        8
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
#7 Mount the top level somewhere else (in order to take snapshots)
mount -o subvolid=5 /dev/sdg1 /path/to/mountpoint
Remark
toplevel subvolume: ID=5
#8 take snapshot
btrfs subvolume snapshot <source> [<dest>/]<name>
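e.g. a sketch reusing the subvolume created above (snapshot names are arbitrary):
# writable snapshot of the 'data' subvolume, created inside the top-level mount
btrfs subvolume snapshot /path/to/mountpoint/data /path/to/mountpoint/data-snap-20190103
# read-only variant
btrfs subvolume snapshot -r /path/to/mountpoint/data /path/to/mountpoint/data-snap-ro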
filesystem - Manage a btrfs filesystem
show [device]
e.g.
btrfs filesystem show /dev/sdd1
Label: 'raid4t'  uuid: 0e100ef3-00c3-4761-90a4-965420a5c5fd
        Total devices 2 FS bytes used 112.00KiB
        devid    1 size 3.64TiB used 2.01GiB path /dev/sdd1
        devid    2 size 3.64TiB used 2.01GiB path /dev/sde1
-m|--mbytes # show sizes in MiB
df <mount_point>
summary information about allocation of block group types of a given mount point
e.g.
btrfs filesystem df /data/raid4t
Data, RAID1: total=1.00GiB, used=512.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
GlobalReserve
It is an artificial and internal emergency space. It is used eg. when the filesystem is full.
Its total size is dynamic based on the filesystem size, usually not larger than 512MiB, used may fluctuate.
usage <path>
Show detailed information about internal filesystem usage.
Overall:
    Device size:                   7.28TiB
    Device allocated:            128.02GiB
    Device unallocated:            7.15TiB
    Device missing:                  0.00B
    Used:                        122.70GiB
    Free (estimated):              3.58TiB      (min: 3.58TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               65.00MiB      (used: 0.00B)
....
defragment [options] <file>|<dir> [<file>|<dir>...]
-r # files in dir will be defragmented recursively
-f # flush data for each file before going to the next file.
e.g.
btrfs filesystem defragment pc_data
resize [+/-]<size>[gkm]|max <filesystem>
e.g.
cd /data
btrfs fi show .
Label: none uuid: 1711bb02-3009-42f2-952e-ea34ec1f218a Total devices 1 FS bytes used 1014.43GiB devid 1 size 1.10TiB used 1024.00GiB path /dev/mapper/vg3t-data_disk
btrfs fi resize max .
sync <path>
# This is done via a special ioctl and will also trigger cleaning of deleted subvolumes.
e.g.
btrfs fi sync /data/raid4t
FSSync '/data/raid4t'
Setting / viewing the Label
Set Label
btrfs filesystem label <mountpoint> <newlabel>
e.g.
btrfs filesystem label /mnt/tmp mytestlw
Get Label
btrfs filesystem label /mnt/tmp
mytestlw
device - Manage devices
btrfs device scan [<device> [<device>..]] # Scan devices for a btrfs filesystem.
# If no devices are passed, btrfs uses block devices containing btrfs filesystem as listed by blkid.
btrfs device add <dev> [<dev>..]
btrfs device delete <dev> [<dev>..] <path> ]
btrfs device usage <path> # Show detailed information about internal allocations in devices
/dev/sdd1, ID: 1
   Device size:             3.64TiB
   Data,RAID1:              1.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:            8.00MiB
   Unallocated:             3.64TiB

/dev/sde1, ID: 2
   Device size:             3.64TiB
   Data,RAID1:              1.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:            8.00MiB
   Unallocated:             3.64TiB
btrfs device stats [-z] <path>|<device> # Read and print the device IO stats
-z # Reset stats to zero after reading them.
[/dev/sdd1].write_io_errs   0
[/dev/sdd1].read_io_errs    0
[/dev/sdd1].flush_io_errs   0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
Scrub
start
# The default IO priority of scrub is the idle class.
btrfs scrub start [-BdqrR] [-c ioprio_class -n ioprio_classdata] <path> | <device>
* identified by <path> or on a single <device>
* Speeding up the scrub (Doc: man ionice)
-c <ioprio_class> # set IO priority class
# 0 for none, 1 for realtime, 2 for best-effort, 3 for idle
-n <ioprio_classdata> # Specify the scheduling class data.
# This only has an effect if the class accepts an argument.
# 0(highest) ~ 7
btrfs scrub start -c 2 .
scrub started on ., fsid 1711bb02-3009-42f2-952e-ea34ec1f218a (pid=27455)
status
btrfs scrub status [-dR] <path>|<device>
-d stats per device
-R print raw stats
i.e.
btrfs scrub status .
scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 1.25MiB with 0 errors
btrfs scrub status -d .
scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
scrub device /dev/sde1 (id 1) history
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 640.00KiB with 0 errors
scrub device /dev/sdf1 (id 2) history
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        total bytes scrubbed: 640.00KiB with 0 errors
btrfs scrub status -R .
scrub status for 8ce021c5-6ea3-4694-93f9-d924e93d6eb4
        scrub started at Sun Jul 28 22:33:13 2019 and finished after 00:00:00
        data_extents_scrubbed: 16
        tree_extents_scrubbed: 16
        data_bytes_scrubbed: 1048576
        tree_bytes_scrubbed: 262144
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 256
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 4333764608
cancel
btrfs scrub cancel <path>|<device>
Progress is saved in the scrub progress file and scrubbing can be resumed later using the scrub resume command.
i.e.
btrfs scrub cancel .
btrfs scrub status .
scrub status for 056bc837-b369-4ced-b47a-6ff9a0d1ad6c
        scrub started at Sat Jun 27 18:23:37 2020 and was aborted after 05:02:30
        total bytes scrubbed: 394.23GiB with 0 errors
resume
Resume a canceled or interrupted scrub cycle
i.e.
btrfs scrub resume .
Check (btrfsck)
# Check an unmounted btrfs filesystem (Do off-line check on a btrfs filesystem)
# By default, btrfs check will not modify the device but you can reaffirm that by the option --readonly.
* The amount of memory required can be high
btrfs check [options] <device>
-b|--backup # use the first valid backup root copy
-s|--super <superblock> # use the <superblock>'th superblock copy
--check-data-csum # verify checksums of data blocks
# (offline scrub but does not repair data from spare copies)
-p|--progress # indicate progress
--clear-space-cache v1|v2 # clear space cache for v1 or v2 (the clear_cache kernel mount option)
DANGEROUS OPTIONS
--repair # try to repair the filesystem.
--init-csum-tree # create a new CRC tree.
--init-extent-tree # create a new extent tree.
e.g.
btrfs check /dev/sdd1
UUID: cd3ac961-71c8-4295-b86c-eb0edd697a47
checking extents [o]
checking free space cache [O]
checking fs roots [o]
checking csums
checking root refs
found 1683487281152 bytes used, no error found
total csum bytes: 1641723396
total tree bytes: 1917534208
total fs tree bytes: 132743168
total extent tree bytes: 30687232
btree space waste bytes: 121731080
file data blocks allocated: 1681569746944
 referenced 1681569746944
btrfs check --repair /dev/sdd1
balance
# spread block groups across all devices so they match constraints defined by the respective profiles.
btrfs [filesystem] balance start [options] <path>
btrfs [filesystem] balance cancel <path>
btrfs [filesystem] balance status [-v] <path>
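e.g. a sketch (mount point is a placeholder; see the "btrfs balance filters" section below for the filter syntax):
btrfs balance start -dusage=50 -musage=50 /mnt   # only rewrite block groups that are less than 50% used
btrfs balance status -v /mnt
btrfs balance cancel /mnt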
quota
btrfs quota enable <path>
btrfs quota disable <path>
btrfs quota rescan [-sw] <path>
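e.g. a minimal sketch (mount point is a placeholder) - after enabling quotas, per-subvolume usage can be read with the qgroup subcommand:
btrfs quota enable /mnt
btrfs qgroup show /mnt    # one qgroup per subvolume, with referenced/exclusive bytes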
snapshot
e.g.
# Create a snapshot
# sda6 is btrfs, mounted at /mnt/sda6
cd /mnt/sda6
# Create a folder to hold the snapshots (easier to manage)
mkdir snap
# ./vm is a subvolume; vm-20170914 is the snapshot name
# -r = Make the new snapshot read only.
btrfs subvol snap -r ./vm snap/vm-20170914
# List snapshot
btrfs subvol list .
Type filtering
- -s # show only snapshot subvolumes
- -r # show only readonly subvolumes
- -d # list deleted subvolumes that are not yet cleaned.
btrfs subvol list . -r
ID 2074 gen 2380 top level 5 path snapshot/2022-11-01
ID 2098 gen 2403 top level 5 path snapshot/2024-01-06
# Show snapshot in subvolume
btrfs subvol show /backup/vm_admin_bak
....
        Snapshot(s):
                .snapshots/1606/snapshot
                .snapshots/2379/snapshot
....
# Delete the snapshot "/backup/vm_admin_bak/.snapshots/2702/snapshot"
btrfs subvol delete /backup/vm_admin_bak/.snapshots/2702/snapshot
Delete subvolume (no-commit): '/backup/vm_admin_bak/.snapshots/2702/snapshot'
property
# Lists available properties with their descriptions for the given object.
btrfs property list /data/raid4t
ro              Set/get read-only flag of subvolume.
label           Set/get label of device.
compression     Set/get compression for a file or directory
readonly snapshot to rw snapshot
btrfs property list /path/to/snapshot
ro              Set/get read-only flag of subvolume.
compression     Set/get compression for a file or directory
Note: the 'label' property is missing here (compared with listing on the mount point above)
btrfs property set /path/to/snapshot ro true
btrfs property get /path/to/snapshot ro
ro=true
Remark
readonly means the snapshot's contents are read-only; the snapshot itself can still be deleted directly
e.g.
btrfs subvolume delete 2020-07-03
Kernel threads
- btrfs-cleaner
- btrfs-delalloc
- btrfs-delayed-m
- btrfs-endio-met
- btrfs-endio-wri
- btrfs-freespace
- btrfs-readahead
- btrfs-transacti
- btrfs-cache-<n>
- btrfs-endio-<n>
- btrfs-fixup-<n>
- btrfs-genwork-<n>
- btrfs-submit-<n>
- btrfs-worker-<n>
- flush-btrfs-<n>
Recovery
When problems occur
cp: reading `winxp-vio.qcow2': Input/output error
cp: failed to extend `/mnt/storage1/winxp-vio.qcow2': Input/output error
dmesg:
[4483186.892609] btrfs csum failed ino 100268 off 377020416 csum 1102225462 private 2315516263
[4483186.892795] btrfs csum failed ino 100268 off 377020416 csum 1102225462 private 2315516263
# surprisingly, btrfsck says everything is fine ...
btrfsck /dev/sda6 # btrfs check [options] <device>
found 14301544448 bytes used err is 0
total csum bytes: 13947324
total tree bytes: 19484672
total fs tree bytes: 200704
btree space waste bytes: 4095840
file data blocks allocated: 14282059776
 referenced 14282059776
Btrfs Btrfs v0.19
Solution
# Find the problematic file from the inode number reported in dmesg
find . -inum 100268
./winxp-vio.qcow2
Looks like the file winxp-vio.qcow2 is beyond saving -__-
btrfs-zero-log
# clear out log tree
ie.
server:btrfs-progs# ./btrfs-zero-log /dev/sdb1
recovery,nospace_cache,clear_cache
mount -t btrfs -o recovery,nospace_cache,clear_cache DEVICE MOUNTPOINT
Build btrfs-progs
Check Version:
btrfs version
btrfs-progs v4.4
Preparation:
apt-get install uuid-dev libattr1-dev zlib1g-dev libacl1-dev e2fslibs-dev libblkid-dev liblzo2-dev
Git repository:
# Official:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
# Btrfs progs developments and patch integration
http://repo.or.cz/w/btrfs-progs-unstable/devel.git
cd btrfs-progs/
make
Usage
./btrfs fi show
Create fs opts
# list all opts
mkfs.btrfs -O list-all
Filesystem features available:
mixed-bg            - mixed data and metadata block groups (0x4)
extref              - increased hardlink limit per file to 65536 (0x40, default)
raid56              - raid56 extended format (0x80)
skinny-metadata     - reduced-size metadata extent refs (0x100, default)
no-holes            - no explicit hole extents for files (0x200)
extref(extended inode refs)
Without extref, the total number of hard links that can be stored for a given inode / parent dir pair is limited to under "4k"
* decided at format (mkfs) time
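e.g. a sketch (the device is a placeholder) - extref is chosen at mkfs time via -O (it is already part of the default feature set shown above):
mkfs.btrfs -O extref /dev/sdX1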
More about btrfs
============== Data allocation
btrfs allocates chunks of the raw storage, typically in 1GiB lumps
Many files may be placed within a chunk, and files may span across more than one chunk.
============== RAID-1
btrfs replicates data on a per-chunk basis (chunks are allocated in pairs(different block device))
============== Balancing
A btrfs balance operation rewrites things at the level of chunks.
============== CoW
If you mount the filesystem with nodatacow, or use chattr +C on the file, then it only does the CoW operation for data if there’s more than one copy referenced.
============== Subvolumes
A subvolume in btrfs is not the same as an LVM logical volume or a ZFS subvolume.
(With LVM, a logical volume is a block device in its own right, on which you then put a filesystem; a btrfs subvolume is not.)
* A btrfs filesystem has a default subvolume, which is initially set to be the top-level subvolume and
which is mounted if no subvol or subvolid option is specified. (subvolume can be mounted by subvol or subvolid )
Changing the default subvolume with btrfs subvolume set-default will make the top level of the filesystem inaccessible,
except by use of the subvol=/ or subvolid=5 mount options.
* mounted top-level subvolume (subvol=/ or subvolid=5) => the full filesystem structure will be seen at the mount
Layout
- Flat Subvolumes
- Nested Subvolumes
When to use subvolumes
* "Split" of areas which are "complete" and/or "consistent" in themselves. ("/var/lib/postgresql/")
* Split of areas which need special properties. (VM images, "nodatacow")
====================== Snapshots
A snapshot is simply a subvolume that shares its data (and metadata) with some other subvolume, using btrfs's COW capabilities.
Once a [writable] snapshot is made, there is no difference in status between the original subvolume, and the new snapshot subvolume.
# Structure
toplevel
  +-- root          (subvolume, to be mounted at /)
  +-- home          (subvolume, to be mounted at /home)
  \-- snapshots     (directory)
        +-- root          (directory)
        |     +-- 2015-01-01  (subvolume, ro-snapshot of subvolume "root")
        |     +-- 2015-06-01  (subvolume, ro-snapshot of subvolume "root")
        \-- home          (directory)
              \-- 2015-01-01  (subvolume, ro-snapshot of subvolume "home")
              \-- 2015-12-01  (subvolume, rw-snapshot of subvolume "home")
# fstab
LABEL=the-btrfs-fs-device  /                    subvol=/root,defaults,noatime      0 0
LABEL=the-btrfs-fs-device  /home                subvol=/home,defaults,noatime      0 0
LABEL=the-btrfs-fs-device  /root/btrfs-top-lvl  subvol=/,defaults,noauto,noatime   0 0
# Creating a rw-snapshot
mount /root/btrfs-top-lvl
btrfs subvolume snapshot /root/btrfs-top-lvl/home /root/btrfs-top-lvl/snapshots/home/2015-12-01
umount /root/btrfs-top-lvl
# Restore
mount /root/btrfs-top-lvl
umount /home
mv /root/btrfs-top-lvl/home /root/btrfs-top-lvl/home.tmp #or it could have been deleted see below
mv /root/btrfs-top-lvl/snapshots/home/2015-12-01 /root/btrfs-top-lvl/home
mount /home
Remark - ro-snapshot
subvolume snapshot -r <source> <dest>|[<dest>/]<name>
* Read-only subvolumes cannot be moved.
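e.g. a sketch using the layout above:
btrfs subvolume snapshot -r /root/btrfs-top-lvl/home /root/btrfs-top-lvl/snapshots/home/2015-01-01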
Creating a multi-device btrfs (mkfs.btrfs)
Multiple devices vs a single device
When creating a btrfs filesystem, the defaults depend on how many devices you give it
Multiple devices:
- metadata: mirrored across two devices
- data: striped
Single device:
- metadata: duplicated on same disk
# Don't duplicate metadata on a single drive
mkfs.btrfs -m single /dev/sdz1
-m <opt> <--- metadata profile <-- make sure it can still be mounted in degraded mode
-d <opt> <--- data profile
Remark
* more devices can be added after the FS has been created
* you can convert between RAID levels after the FS has been created (via balance)
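e.g. a sketch (device names and mount point are placeholders) - grow a single-device filesystem and convert it to raid1, as detailed in the Conversion section below:
btrfs device add /dev/sdy1 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt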
Differences at filesystem creation time
mkfs.btrfs /dev/sdz1
For an HDD this is equivalent to
mkfs.btrfs -m dup -d single /dev/sdz1
For an SSD (or other non-rotational device) this is equivalent to
mkfs.btrfs -m single -d single /dev/sdz1
Multiple devices
mkfs.btrfs /dev/sdx1 /dev/sdy1
Resulting profiles
- Metadata: RAID1
- System: RAID1
- Data: RAID0
Supported RAID levels:
single, RAID0, RAID1, RAID10 (needs 4 devices), RAID5 and RAID6
single: full capacity of multiple drives with different sizes (metadata mirrored, data neither mirrored nor striped)
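e.g. a sketch (device names are placeholders) - pool three drives of different sizes with no data redundancy but mirrored metadata:
mkfs.btrfs -d single -m raid1 /dev/sdx1 /dev/sdy1 /dev/sdz1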
RAID1 Usage
Create raid1:
mkfs.btrfs -m raid1 -d raid1 -L R1_Data /dev/sdd1 /dev/sde1
Opt:
- -m level # metadata (raid0, raid1, raid10 or single)
- -d level # data (raid0, raid1, raid10 or single)
- -L name # set the label at creation time (it can also be changed later with 'btrfs filesystem label')
* raid1 volumes only mountable once RW if degraded (kernel versions: 4.9.x, 4.4.x)
The read-only degraded mount policy is enforced by the kernel code, not btrfs user space tools.
* When in RO mode it seems nothing can be done: you cannot replace, add, or delete a disk.
* To get the filesystem mounted rw again, one needs to patch the kernel.
=> after that one RW mount, the only way to recover and get RW again is to re-create the filesystem and copy the data back
=> therefore, immediately after a degraded mount you should convert RAID1 -> single
dmesg
[ 3961.007510] BTRFS info (device sdf1): allowing degraded mounts
[ 3961.007520] BTRFS info (device sdf1): disk space caching is enabled
[ 3961.007526] BTRFS info (device sdf1): has skinny extents
[ 3961.157803] BTRFS info (device sdf1): bdev (null) errs: wr 452660, rd 0, flush 0, corrupt 0, gen 0
[ 3976.264045] BTRFS warning (device sdf1): missing devices (1) exceeds the limit (0),
writeable mount is not allowed
[ 3976.291632] BTRFS: open_ctree failed
P.S.
You can use more than 2 hard disks for the RAID !!
Why do I have "single" chunks in my RAID filesystem?
Data, RAID1: total=3.08TiB, used=3.02TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=3.88MiB, used=336.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=4.19GiB, used=3.56GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B
The single chunks are perfectly normal, and are a result of the way that mkfs works.
They are small, harmless, and will remain unused as the FS grows, so you won't risk any unreplicated data.
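If you still want to get rid of the empty single chunks, a usage-filtered balance (see "btrfs balance filters" below) is one way; a sketch (mount point is a placeholder):
btrfs balance start -dusage=0 -musage=0 /mnt   # removes only completely unused block groups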
Conversion (Single -> RAID1)
Single -> RAID1
Going from a non-raid /dev/sda1 to raid1 (sda1 and sdb1)
mount /dev/sda1 /mnt
btrfs device add /dev/sdb1 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
The balance command keeps running until the whole process completes; the mount point stays R/W in the meantime
If the metadata is not converted from the single-device default,
it remains as DUP, which does not guarantee that copies of block are on separate devices.
If data is not converted it does not have any redundant copies at all.
# Check progress from another shell
btrfs balance status .
Balance on '.' is running 52 out of about 2583 chunks balanced (53 considered), 98% left
A moment later
Balance on '.' is running 55 out of about 2583 chunks balanced (56 considered), 98% left
btrfs filesystem df .
Data, RAID1: total=18.00GiB, used=16.31GiB # gradually increasing
Data, single: total=2.50TiB, used=2.49TiB # gradually decreasing
System, DUP: total=8.00MiB, used=304.00KiB
Metadata, DUP: total=3.50GiB, used=2.70GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
- raid0
- raid1
- raid10
- raid5
- raid6
- dup (A form of "RAID" which stores two copies of each piece of data on the same device.)
- single
Default:
# checking: btrfs fi df /data/pc_data
Data, single: total=696.01GiB, used=677.99GiB
System, DUP: total=8.00MiB, used=96.00KiB
Metadata, DUP: total=1.50GiB, used=903.45MiB
GlobalReserve, single: total=512.00MiB, used=0.00B
RAID1 -> Single
cd /path/to/btrfs-mount-point
btrfs balance start -dconvert=single -mconvert=dup .
Done, had to relocate 6 out of 6 chunks
btrfs balance status .
Balance on '.' is running 5 out of about 2187 chunks balanced (6 considered), 100% left
Replacing failed devices
[Case 1] A disk is found to be dead:
Mounting fails
mount /dev/sdg1 /data/1T
mount: wrong fs type, bad option, bad superblock on /dev/sdg1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try dmesg | tail or so.
dmesg shows
# This RAID1 was originally made up of sdg1 and sdh1
[1519957.580510] BTRFS info (device sdg1): disk space caching is enabled
[1519957.580513] BTRFS info (device sdg1): has skinny extents
[1519957.591738] BTRFS warning (device sdg1): devid 2 uuid 289afc10-5bd5-47c4-b0f7-4c6e129554ae is missing
[1519957.591743] BTRFS error (device sdg1): failed to read chunk tree: -5
[1519957.628542] BTRFS error (device sdg1): open_ctree failed
Use 'fi show' to see the details
# Show the structure of a filesystem
btrfs filesystem show /dev/sdg1
warning, device 2 is missing
warning devid 2 not found already
Label: '1T-RAID' uuid: 394c3048-5fee-4bc9-b1b2-ec0084c0e2f0
Total devices 2 FS bytes used 1.12MiB
devid 1 size 931.51GiB used 4.04GiB path /dev/sdg1
*** Some devices missing
Recovery procedure:
- mount in degraded mode
# /dev/sdg1 is the disk in the raid1 that is still healthy
mount -o degraded /dev/sdg1 /data/1T
- determine the raid level currently in use
btrfs fi df /data/1T
Data, RAID1: total=1.00GiB, used=0.00B Data, single: total=1.00GiB, used=1.00MiB System, RAID1: total=8.00MiB, used=16.00KiB System, single: total=32.00MiB, used=0.00B Metadata, RAID1: total=1.00GiB, used=32.00KiB Metadata, single: total=1.00GiB, used=80.00KiB GlobalReserve, single: total=16.00MiB, used=0.00B
- add a new device
btrfs device add /dev/sdc1 /mnt
- remove the missing device
# 'missing' is a special device name
btrfs device delete missing /data/1T
* After the remove command finishes, the extents will have been redistributed to the new disk (sdc1)
When there are not enough devices, you get:
ERROR: error removing device 'missing': unable to go below two devices on raid1
Success:
* After the command finishes, sdb1 is no longer listed
Failure:
ERROR: error removing device 'missing': unable to go below two devices on raid1
[Case 2] A device is known in advance to be failing:
btrfs device stats /data/1T
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
[/dev/sdh1].write_io_errs   100
[/dev/sdh1].read_io_errs    50
[/dev/sdh1].flush_io_errs   30
[/dev/sdh1].corruption_errs 30
[/dev/sdh1].generation_errs 30
=> so we know in advance that sdh is failing
# It will read the data from sdh1 (falling back to the other mirror where needed), then write it to sdi1
# start [-Bfr] <srcdev>|<devid> <targetdev> <path>
# If the source device is disconnected from the system, you have to use the devid parameter format.
# "-r" only read from <srcdev> if no other zero-defect mirror exists.
btrfs replace start /dev/sdh1 /dev/sdi1 /data/1T
# Status
# print continuously until the replace operation finishes
btrfs replace status /data/1T
Play btrfs with image file
- Create Image File
- Create RAID1
- Expand from 2 Disk to 4 Disk (1g -> 2g)
- Upgrade RAID1 to RAID10
# Create 1G image file X 4
truncate -s 1g vdisk1.img vdisk2.img vdisk3.img vdisk4.img
# map image file to block device
losetup -l
# no output
losetup /dev/loop1 vdisk1.img
losetup /dev/loop2 vdisk2.img
losetup /dev/loop3 vdisk3.img
losetup /dev/loop4 vdisk4.img
# Create RAID1 (loop1, loop2)
mkfs.btrfs -m raid1 -d raid1 /dev/loop1 /dev/loop2
# mount it
mkdir /mnt/btrfs
mount -t btrfs /dev/loop1 /mnt/btrfs
# check status
btrfs filesystem df /mnt/btrfs
Data, RAID1: total=102.38MiB, used=128.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=102.38MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
btrfs filesystem show /mnt/btrfs
Label: none  uuid: cdcded1d-edeb-4c73-948d-c3628703b47d
        Total devices 2 FS bytes used 256.00KiB
        devid    1 size 1.00GiB used 212.75MiB path /dev/loop1
        devid    2 size 1.00GiB used 212.75MiB path /dev/loop2
btrfs device usage /mnt/btrfs/
/dev/loop1, ID: 1
   Device size:           1.00GiB
   Device slack:            0.00B
   Data,RAID1:          102.38MiB
   Metadata,RAID1:      102.38MiB
   System,RAID1:          8.00MiB
   Unallocated:         811.25MiB

/dev/loop2, ID: 2
...
df -h | grep btrfs
/dev/loop1 1.0G 17M 913M 2% /mnt/btrfs
# Add devices to the existing btrfs filesystem (loop3, loop4)
btrfs device add /dev/loop3 /dev/loop4 /mnt/btrfs
btrfs filesystem df /mnt/btrfs
Data, RAID1: total=102.38MiB, used=128.00KiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=102.38MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
btrfs filesystem show /mnt/btrfs
Label: none  uuid: bd7b494c-f242-4bcd-9160-443b5adf141c
        Total devices 4 FS bytes used 256.00KiB
        devid    1 size 1.00GiB used 212.75MiB path /dev/loop1
        devid    2 size 1.00GiB used 212.75MiB path /dev/loop2
        devid    3 size 1.00GiB used 0.00B path /dev/loop3
        devid    4 size 1.00GiB used 0.00B path /dev/loop4
df -h | grep btrfs
/dev/loop1 2.0G 17M 1.9G 1% /mnt/btrfs
blkid vdisk*.img
* All the image files share the same UUID but have different UUID_SUB values
vdisk1.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="e333a51b-9ef1-4369-b440-83d538e628d9" TYPE="btrfs"
vdisk2.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="85e90b67-9f60-4430-8c73-a1cf91e86bdd" TYPE="btrfs"
vdisk3.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="53469a58-803a-4847-a383-2f1470b8c126" TYPE="btrfs"
vdisk4.img: UUID="bd7b494c-f242-4bcd-9160-443b5adf141c" UUID_SUB="e19db13f-2995-4e9e-82af-3f5863cf61c5" TYPE="btrfs"
# raid1 -> raid10
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/btrfs
Done, had to relocate 3 out of 3 chunks
# check status
btrfs filesystem df /mnt/btrfs
Data, RAID10: total=416.00MiB, used=128.00KiB
System, RAID10: total=64.00MiB, used=16.00KiB
Metadata, RAID10: total=256.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
Cleanup
umount /mnt/btrfs
losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                    DIO
/dev/loop1         0      0         0  0 /root/btrfs-test/vdisk1.img    0
/dev/loop4         0      0         0  0 /root/btrfs-test/vdisk4.img    0
/dev/loop2         0      0         0  0 /root/btrfs-test/vdisk2.img    0
/dev/loop3         0      0         0  0 /root/btrfs-test/vdisk3.img    0
# -D, --detach-all detach all used devices
losetup -D
btrfs-send / btrfs-receive
Where it beats rsync
* rsync does not track file renames
* it handles huge files well
btrfs-send
btrfs-send - generate a stream of changes between two subvolume snapshots
This command will generate a stream of instructions that describe changes between two subvolume snapshots.
The stream can be consumed by the btrfs receive command to replicate the sent snapshot on a different filesystem.
The command operates in two modes: full and incremental.
Only the send side is happening in-kernel. Receive is happening in user-space.
Based on the differences found by btrfs_compare_tree, we generate a stream of instructions.
"btrfs send" requires read-only subvolumes to operate on.
btrfs-receive - receive subvolumes from send stream
btrfs-receive
btrfs receive will fail in the following cases:
- receiving subvolume already exists
- previously received subvolume has been changed after it was received
A subvolume is made read-only after the receiving process finishes successfully
btrfs receive sets the subvolume read-only after it completes successfully. However, while the receive is in progress, users who have write access to files or directories in the receiving path can add, remove, or modify files, in which case the resulting read-only subvolume will not be an exact copy of the sent subvolume.
If the intention is to create an exact copy, the receiving path should be protected from access by users until the receive operation has completed and the subvolume is set to read-only.
Additionally, receive does not currently do a very good job of validating that an incremental send streams actually makes sense, and it is thus possible for a specially crafted send stream to create a subvolume with reflinks to arbitrary files in the same filesystem. Because of this, users are advised to not use btrfs receive on send streams from untrusted sources, and to protect trusted streams when sending them across untrusted networks.
Usage (test)
Preparation
mount | grep btrfs
/dev/vdd on /backup type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
/dev/vdc on /home type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
btrfs subvol create /home/data # must be a subvolume, since only subvolumes can be snapshotted
mkdir /home/snap /backup/data # create folders to hold the snapshots
Initial Bootstrapping
# Create a readonly snapshot of the subvolume
btrfs subvol snap -r /home/data /home/snap/data-snap1
# Send the whole snapshot to /backup, which creates data-snap1 there
btrfs send /home/snap/data-snap1 | btrfs receive /backup/data
Side note
btrfs subvol show /home/snap/data-snap1
snap/data-snap1
Name: data-snap1
UUID: 79798a57-ad04-1a4f-8f6e-89b99687d8d2
Parent UUID: b1e3cb62-a34a-2541-a15a-ac6032564e6b
Received UUID: -
...
Notes: on the send side, "Send time" is always the same as "Creation time", so it tells you nothing
btrfs subvol show /backup/data/data-snap1
data/data-snap1
        Name:                   data-snap1
        UUID:                   83bb6d33-b120-0a40-bf01-81b96ba5fb78
        Parent UUID:            -
        Received UUID:          79798a57-ad04-1a4f-8f6e-89b99687d8d2
        Send time:              2023-02-16 17:58:38 +0800
        ...
        Receive time:           2023-02-16 17:58:44 +0800
Notes: Receive time - Send time = how long the transfer took
Incremental Backup
btrfs subvol snap -r /home/data /home/snap/data-snap2
# send side: compares the two snapshots in kernel mode and builds an "instruction stream"
# -p <parent> <= send an incremental stream from parent to subvol
# receive side: creates the new subvolume (here data-snap2) under /backup/data
btrfs send -p /home/snap/data-snap1 /home/snap/data-snap2 | btrfs receive /backup/data
btrfs subvol show /home/snap/data-snap2
snap/data-snap2
Name: data-snap2
UUID: 2a5054d1-5065-8c40-86bc-4c1599595b6f
Parent UUID: b1e3cb62-a34a-2541-a15a-ac6032564e6b
Received UUID: -
...
btrfs subvol show /backup/data/data-snap2
data/data-snap2
        Name:                   data-snap2
        UUID:                   54ee9c82-5020-0646-a22f-a924cda56001
        Parent UUID:            83bb6d33-b120-0a40-bf01-81b96ba5fb78
        Received UUID:          2a5054d1-5065-8c40-86bc-4c1599595b6f
        ...
Notes
A snapshot received incrementally has both a "Parent UUID" and a "Received UUID"
Cleanup
# The next send / receive only depends on snap2, so snap1 can be deleted
btrfs subvol del /home/snap/data-snap1
btrfs subvol del /backup/data/data-snap1
* the receive side must already have the snapshot that send -p points to, otherwise:
At subvol /home/snap/data-snap4
At snapshot data-snap4
ERROR: cannot find parent subvolume
MyBackupScript
#!/bin/bash

bk_path="/data/pc_data/photo"
snap_path="/data/_snap"
dest_path="/data/1T/backup"
keep=3

##############################################

lockfile=/tmp/btrfs-backup.lck

if [ -e $lockfile ]; then
    echo "Under locking" && exit
fi
touch $lockfile

folder=$(basename $bk_path)
org_snap=$snap_path/$folder
new_snap=$snap_path/${folder}-new
org_bak=$dest_path/$folder
new_bak=$dest_path/${folder}-new

echo " * create new snapshot"
btrfs sub snap -r $bk_path $new_snap
sync

echo " * send | receive"
btrfs send -p $org_snap $new_snap | btrfs receive $dest_path

echo " * send side cleanup"
btrfs sub del $org_snap
mv $new_snap $org_snap

echo "receive side cleanup"
btrfs subvolume delete $org_bak
mv $new_bak $org_bak
btrfs subvolume snapshot -r $org_bak $org_bak.$(date +%Y-%m-%d)

echo " * rotate backup to $keep"
ls -rd ${org_bak}.* | tail -n +$(( $keep + 1 )) | while read snap
do
    echo $snap
    btrfs subvolume delete "$snap"
done

rm -f $lockfile
ls $dest_path
GlobalReserve
What is the GlobalReserve and why does 'btrfs fi df' show it as single even on RAID filesystems?
The global block reserve is last-resort space for filesystem operations that may require allocating workspace even on a full filesystem.
An example is removing a file, subvolume or truncating a file.
This is mandated by the COW model: even removing data blocks requires allocating some metadata blocks first (and freeing them once the change is persistent).
The block reserve is only virtual and is not stored on the devices.
It's an internal notion of Metadata but normally unreachable for the user actions (besides the ones mentioned above). For ease it's displayed as single.
The size of the global reserve is determined dynamically according to the filesystem size but is capped at 512MiB. A 'used' value greater than 0 means that it is currently in use.
XOR module
modprobe xor
[4945537.298825] xor: automatically using best checksumming function: generic_sse
[4945537.316018]    generic_sse:  3983.000 MB/sec
[4945537.316026] xor: using function: generic_sse (3983.000 MB/sec)
P.S.
btrfs depends: libcrc32c, zlib_deflate
Common use cases
# Convert an existing directory into a subvolume
cp -a --reflink=always Path2Folder Path2Subvolume
Remark
When --reflink[=always] is specified, perform a lightweight copy,
where the data blocks are copied only when modified.
If this is not possible the copy fails, or, if --reflink=auto is specified,
it falls back to a standard copy.
If you use --reflink=always on a non-CoW-capable filesystem, you will get an error.
cp --reflink=always my_file.bin my_file_copy.bin
Data Integrity Test
Fault isolation
Btrfs generates checksums for data and metadata blocks.
Checksums are enabled separately for data blocks and metadata blocks.
Default: both metadata and data checksums are on (data checksums are disabled by the nodatasum / nodatacow mount options).
Corruption detection and correction
In Btrfs, checksums are verified each time a data block is read from disk.
If the file system detects a checksum mismatch while reading a block,
it first tries to obtain (or create) a good copy of this block from another device.
Test
sha256sum CentOS-7-x86_64-Minimal-1708.iso
bba314624956961a2ea31dd460cd860a77911c1e0a56e4820a12b9c5dad363f5
cp CentOS-7-x86_64-Minimal-1708.iso /mnt/btrfs
Inspect 1 byte at a given offset of the disk images
# offset: 500MB = 524288000
xxd -ps -s 524288000 -l 1 vdisk1.img
84
xxd -ps -s 524288000 -l 1 vdisk2.img
d7
Remark
Under the RAID1 layout, vdisk1 and vdisk2 do not contain the same bytes at the same offset
# Corrupt 1 byte at that offset
# corrupt vdisk1
printf '\x2a' | dd of=vdisk1.img bs=1 count=1 seek=524288000 conv=notrunc
sha256sum CentOS-7-x86_64-Minimal-1708.iso # checksum still matches, no error, and nothing in dmesg either !!
# corrupt vdisk2
printf '\x2a' | dd of=vdisk2.img bs=1 count=1 seek=524288000 conv=notrunc
sha256sum CentOS-7-x86_64-Minimal-1708.iso # surprisingly still fine (the file is still served from the page cache)
Making the error show up
echo 1 > /proc/sys/vm/drop_caches
# reading the file produces no error on stderr; it only shows up in dmesg
sha256sum CentOS-7-x86_64-Minimal-1708.iso
# Check the error
dmesg
BTRFS warning (device loop1): csum failed root 5 ino 257 off 195297280 csum 0xe07abfd6 expected csum 0xdc613d1e mirror 2
BTRFS warning (device loop1): csum failed root 5 ino 257 off 195297280 csum 0xe07abfd6 expected csum 0xdc613d1e mirror 2
BTRFS info (device loop1): read error corrected: ino 257 off 195297280 (dev /dev/loop1 sector 1024000)
Explanation
root 5 ino 257 # "stat filename" shows the root and inode (ino) numbers
find . -mount -inum 257
./CentOS-7-x86_64-Minimal-1708.iso
"(dev /dev/loop1 sector 1024000)" # loop1 的 500 MB 位於壞左
# 查看 recovery 的內容 2a -> 84
xxd -ps -s 524288000 -l 1 vdisk1.img
84
Scrubbing
Scrub job that is performed in the background.
(only checks and repairs the portions of disks that are in use)
Start Scrub
cd /path/to/mountpoint
btrfs scrub start .
scrub started on ., fsid c5dfef39-c902-4d51-8217-17e268a77661 (pid=27078)
server:btrfs# WARNING: errors detected during scrubbing, corrected
dmesg
... BTRFS warning (device loop1): checksum error at logical 524288000 on dev /dev/loop1, physical 524288000, root 5, inode 257, offset 195297280, length 4096, links 1 (path: CentOS-7-x86_64-Minimal-1708.iso)
... BTRFS error (device loop1): bdev /dev/loop1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
... BTRFS error (device loop1): fixed up error at logical 524288000 on dev /dev/loop1
Health Check
# -z|--reset # Print the stats and reset the values to zero afterwards
btrfs dev stats . # Read and print the device IO error statistics
[/dev/mapper/vg3t-data_disk].write_io_errs    0
[/dev/mapper/vg3t-data_disk].read_io_errs     0
[/dev/mapper/vg3t-data_disk].flush_io_errs    0
[/dev/mapper/vg3t-data_disk].corruption_errs  0
[/dev/mapper/vg3t-data_disk].generation_errs  0
btrfs scrub status .
UUID:             ...
Scrub started:    Fri Dec 29 10:07:22 2023
Status:           running
Duration:         0:01:55
Time left:        1:34:27
ETA:              Fri Dec 29 11:43:45 2023
Total to scrub:   1.00TiB
Bytes scrubbed:   20.48GiB  (1.99%)
Rate:             182.39MiB/s
Error summary:    no errors found
btrfs: HUGE metadata allocated
First of all, BTRFS allocates metadata (and data) one chunk at a time.
A typical size of a metadata block group is 256MiB (filesystems smaller than 50GiB) or 1GiB (larger than 50GiB); for data it is 1GiB.
The system block group size is a few megabytes.
Keep in mind that BTRFS also stores smaller files directly in the metadata
(which may contribute to your "high" metadata usage).
By default BTRFS also duplicates metadata (to recover in case of corruption).
btrfs balance filters
Opts
- -d[<filters>] # act on data block groups
- -m[<filters>] # act on metadata chunks
filters
limit its action to a subset of the full filesystem
usage=<percent>
Balances only block groups with usage under the given percentage.
The value of 0 is allowed and will clean up completely unused block groups (ENOSPC)
e.g.
# usage=60: chunks that are less than 60% used get rewritten and merged into fuller chunks
# => chunks above 60% usage (i.e. with less than 40% wasted, unused space) are left alone
btrfs balance start -v -musage=60 /path
btrfs allocated 100%
Cause
BTRFS starts every write in a freshly allocated chunk. (COW)
Data usage: distinguish the block layer from the file layer
Checking
df -h /var/lib/lxc/lamp/rootfs
Filesystem Size Used Avail Use% Mounted on
/dev/sdd6 38G 34G 1.9G 95% /var/lib/lxc/lamp/rootfs
btrfs fi show /var/lib/lxc/lamp/rootfs
Label: none uuid: 8145ac80-1173-473f-994f-9080e7d03713
Total devices 1 FS bytes used 33.31GiB
devid 1 size 37.16GiB used 37.16GiB path /dev/sdd6
btrfs fi usage /var/lib/lxc/lamp/rootfs
Overall:
    Device size:                  37.16GiB
    Device allocated:             37.16GiB
    Device unallocated:              0.00B
    Device missing:                  0.00B
    Used:                         33.83GiB
    Free (estimated):              1.85GiB      (min: 1.85GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.99
    Global reserve:              137.54MiB      (used: 0.00B)
...
[Fix]
btrfs balance start /var/lib/lxc/lamp/rootfs &
btrfs balance status -v /var/lib/lxc/lamp/rootfs
Balance on '/var/lib/lxc/lamp/rootfs' is running
0 out of about 42 chunks balanced (21 considered), 100% left
Dumping filters: flags 0x7, state 0x1, force is off
  DATA (flags 0x0): balancing
  METADATA (flags 0x0): balancing
  SYSTEM (flags 0x0): balancing