ext4

Last updated: 2020-08-22

Introduction

An ext4 file system is split into a series of block groups.

To reduce performance problems caused by fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times.

The size of a block group is given in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes.

With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB.

The number of block groups is the size of the device divided by the size of a block group.
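The arithmetic above can be checked with a short shell sketch. The function name block_group_layout and the 1 TiB device size are illustrative assumptions, and the group count is rounded up on the assumption that a trailing partial group still counts as a group.

```sh
# Sketch: derive block-group geometry from the block size.
# block_group_layout BLOCK_SIZE_BYTES DEVICE_SIZE_BYTES
block_group_layout() (
    block_size=$1
    device_size=$2
    # One block-sized bitmap tracks the group, one bit per block
    blocks_per_group=$((8 * block_size))
    group_bytes=$((blocks_per_group * block_size))
    # Round up (assumption: a trailing partial group counts as a group)
    groups=$(( (device_size + group_bytes - 1) / group_bytes ))
    echo "blocks_per_group=$blocks_per_group group_MiB=$((group_bytes / 1048576)) groups=$groups"
)

# 4 KiB blocks on a hypothetical 1 TiB device
block_group_layout 4096 $((1024 * 1024 * 1024 * 1024))
```

On a real filesystem the authoritative numbers come from dumpe2fs -h, so this is only a way to sanity-check them.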

All fields in ext4 are written to disk in little-endian order. HOWEVER, all fields in jbd2 (the journal) are written to disk in big-endian order.
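As an illustration of the little-endian layout, the sketch below reads the ext4 magic number from a device or image file. It relies on two details of the on-disk format that this page does not spell out: the superblock starts at byte 1024, and its 16-bit s_magic field (value 0xEF53) sits at byte offset 56 within it. The function name read_ext4_magic is made up for this example.

```sh
# read_ext4_magic DEVICE_OR_IMAGE
read_ext4_magic() {
    # 1024 (superblock start) + 56 (s_magic offset) = 1080
    # od prints the two bytes in on-disk order, e.g. " 53 ef"
    set -- $(od -An -tx1 -j1080 -N2 "$1")
    # Little-endian: the low byte comes first, so swap before printing
    echo "0x$2$1"
}

# Usage (needs read permission on the device or image):
#   read_ext4_magic /dev/sd?     # an ext4 filesystem should give 0xef53
```

A big-endian field, such as those in the journal, would be printed without the swap.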

 


Features

 

Delayed allocation

With delayed allocation, blocks are not allocated immediately when a process calls write().

Instead, allocation is deferred while the data is held in the cache, until it is actually about to be written to disk.

This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't.

Extents

An extent is a range of contiguous physical blocks. Mapping files with extents improves large-file performance and reduces fragmentation.

Extents replace the traditional block mapping scheme used by ext2 and ext3.

For example, a 100 MB file can be allocated as a single extent of that size, instead of needing an indirect mapping for 25600 blocks (4 KB per block).

Huge files are split into several extents.
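The block count in the example is simple arithmetic, sketched here in shell. The 32768-block ceiling for a single extent comes from the ext4 on-disk format rather than from this page, and the function name extent_estimate is illustrative.

```sh
# extent_estimate FILE_SIZE_MIB [BLOCK_SIZE_BYTES]
# Prints the number of blocks and the fewest extents that could map them.
extent_estimate() (
    size_mib=$1
    block_size=${2:-4096}
    max_extent_blocks=32768   # assumption: limit of one initialized extent
    blocks=$(( size_mib * 1048576 / block_size ))
    extents=$(( (blocks + max_extent_blocks - 1) / max_extent_blocks ))
    echo "blocks=$blocks min_extents=$extents"
)

extent_estimate 100    # the 100 MB file from the example
extent_estimate 1024   # a 1 GiB file needs several extents
```

This gives a lower bound only. The extents a real file actually uses can be listed with filefrag -v.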

Multiblock allocation

The ext3 block allocator allocates only one block (4 KB) at a time.
Ext4 uses a "multiblock allocator" (mballoc), which allocates many blocks in a single call
instead of one block per call, avoiding a lot of overhead.

Journal checksumming

Journal checksumming is turned off by default for now.

Fast fsck

A list of unused inodes is stored at the end of each group's inode table, so fsck can skip them.

Online defragmentation

 

Barriers

Options:

  • barrier (default)
  • nobarrier (about 3% faster)

A write barrier is a kernel mechanism used to ensure that file system metadata is correctly written and ordered on persistent storage, even when storage devices with volatile write caches lose power.

(This ensures that fsync() actually works.)

Implementation:

After the data transaction is written, the storage cache is flushed, the commit block is written, and the cache is flushed again.

 * This requires an IO stack which can support barriers.

When "nobarrier" is safe:

 * Write caches are disabled (hdparm -W0 /dev/sd?)

 * The storage device uses a battery-backed write cache

Persistent preallocation

The filesystem preallocates the necessary blocks and data structures for a file,
but the blocks hold no data until the application actually writes it later.
(This saves applications such as P2P clients from doing it themselves inefficiently by filling a file with zeros.)
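From the shell, preallocation can be tried with the fallocate utility from util-linux, which asks the filesystem to reserve blocks without writing zeros. This is a sketch: the helper name prealloc_file is made up, and it assumes the target filesystem supports the fallocate() system call.

```sh
# prealloc_file PATH SIZE   (SIZE accepts suffixes such as M or G)
prealloc_file() {
    fallocate -l "$2" "$1"
}

# Reserve 1 MiB in a scratch file and show its apparent size in bytes
scratch=$(mktemp)
prealloc_file "$scratch" 1M
wc -c < "$scratch"
rm -f "$scratch"
```

The call returns almost immediately even for large sizes, because no data blocks are written.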

Enabling metadata checksums

When a filesystem has been created with e2fsprogs 1.44 or later,

metadata checksums should already be enabled by default.

dumpe2fs -h /dev/sde1 | grep metadata_csum

Filesystem features:      ... metadata_csum

P.S. The CRC32C algorithm is hardware accelerated if the CPU supports SSE 4.2:

grep sse /proc/cpuinfo

flags           : ...  sse4_1 sse4_2

grep driver /proc/crypto | grep crc32

driver       : crc32-pclmul
driver       : crc32c-generic
driver       : crc32c-intel

 


Format a Partition

 

mkfs.ext4

-L new-volume-label     # Set the volume label for the filesystem

-b block-size           # Block size in bytes; valid values are 1024, 2048 and 4096
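Because mkfs.ext4 destroys existing data, it can help to assemble and review the command first. The wrapper below is a hypothetical dry-run helper (mkfs_ext4_cmd is not part of e2fsprogs): it checks the block size against the values listed above and prints the command instead of running it. The label and device are example values.

```sh
# mkfs_ext4_cmd LABEL BLOCK_SIZE DEVICE
# Prints the mkfs.ext4 command instead of running it, because mkfs destroys data.
mkfs_ext4_cmd() {
    case $2 in
        1024|2048|4096) ;;
        *) echo "invalid block size: $2" >&2; return 1 ;;
    esac
    echo "mkfs.ext4 -L $1 -b $2 $3"
}

mkfs_ext4_cmd BackupDisk 4096 /dev/sdd1
```

After checking the printed line, it can be run by hand as root.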

 


jbd2

 

JBD is the journaling block device that sits between the file system and the block device driver.
(It is used by ext3, ext4 and OCFS2.)

The jbd2 version is for ext4

Atomic handle

An atomic handle guarantees that a high-level update either happens completely or not at all.

Transaction

For the sake of efficiency and performance, JBD groups several atomic handles into a single transaction

 


ext4 mount options

 

  • user_xattr ( man 5 attr )
  • noacl (Disables POSIX ACL)
  • noatime
  • nodiratime
  • ro
  • journal_checksum                  # Enable checksumming of journal transactions (lets the kernel and e2fsck detect corruption)
  • stripe=n                          # Number of filesystem blocks that mballoc will try to use for allocation size and alignment
                                      # (RAID5/6 => number of data disks * RAID chunk size in filesystem blocks)
  • data=<writeback|ordered*|journal> # Data journaling mode; * marks the default
  • nobarrier                         # Disable write barriers
  • commit=nrsec                      # Sync data and metadata every nrsec seconds; 0 means the default (5 seconds)
                                      # With commit=5, a hang loses at most 5 seconds of data (the filesystem will not be damaged)
     
  • quota,                             # These options are ignored by the filesystem. They are used only by quota tools
    noquota,
    grpquota,
    usrquota
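The stripe value in the list above is a count of filesystem blocks, so it has to be converted from the RAID geometry. The sketch below does that conversion. The function name stripe_blocks and the example array (4-disk RAID5, 512 KiB chunks) are assumptions for illustration.

```sh
# stripe_blocks DATA_DISKS CHUNK_KIB [BLOCK_SIZE_BYTES]
stripe_blocks() (
    data_disks=$1
    chunk_kib=$2
    block_size=${3:-4096}
    # RAID chunk size expressed in filesystem blocks
    chunk_blocks=$(( chunk_kib * 1024 / block_size ))
    echo $(( data_disks * chunk_blocks ))
)

# Hypothetical 4-disk RAID5 (3 data disks) with 512 KiB chunks and 4 KiB blocks
echo "stripe=$(stripe_blocks 3 512)"
```

The printed value can be added to the mount options as is. Parity disks are not counted: RAID5 has one, RAID6 has two.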

 


Data Mode

 

There are 3 different data modes: writeback, ordered, journal

 * writeback: data is not journaled at all (metadata journaling only)

 * ordered (default): only metadata is journaled
                 (when it is time to write new metadata out to disk,
                 the associated data blocks are written first)

 * journal: full data and metadata journaling (slowest)
                (everything is written to the journal first, and then to its final location)
                (enabling this mode disables delayed allocation and O_DIRECT support)

 


ext4 online defrag

 

Install

yum install e2fsprogs

Defragment a single file:

e4defrag /path/to/file

Options:

-c     Get the current fragmentation count and the ideal fragmentation count

<File>                                         now/best       size/ext
win7.qcow2                                     667/20         62883 KB

 Total/best extents                             667/20
 Average size per extent                        62883 KB
 Fragmentation score                            0
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This file (win7.qcow2) does not need defragmentation.
 Done.

-v          Print the fragmentation count before and after defrag for each file

[1/1] "win7.qcow2"                extents: 667 -> 667
        Defrag size is larger than filesystem's free space              [ NG ]
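The score bands printed in the -c report above can be reused in a script, for example to decide whether a file is worth defragmenting. A minimal sketch, with a made-up function name and the thresholds copied from that report:

```sh
# defrag_advice SCORE  -> prints the band that the e4defrag -c legend gives for SCORE
defrag_advice() {
    if [ "$1" -le 30 ]; then
        echo "no problem"
    elif [ "$1" -le 55 ]; then
        echo "a little bit fragmented"
    else
        echo "needs defrag"
    fi
}

defrag_advice 0    # the win7.qcow2 score above
```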

Defragment all files in a directory:

e4defrag /path/to/directory/

Defragment a partition or a filesystem image:

e4defrag /dev/sda1

e4defrag file_system_image.img

 


ext4 - Why doesn't deleting files increase available space

 

The filesystem reserves some space that only root can write to.

That is why you see 124G of 130G used, but zero available.

To release it, set the reserved-block percentage to 0:

tune2fs -m 0 /dev/sd?
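The size of the reserve is easy to estimate. The sketch below assumes the usual mke2fs default of 5% reserved blocks, which this page does not state, and uses integer arithmetic, so the result is rounded down. The function name reserved_gib is illustrative.

```sh
# reserved_gib TOTAL_GIB [PERCENT]
reserved_gib() {
    echo $(( $1 * ${2:-5} / 100 ))
}

echo "reserved: $(reserved_gib 130)G"      # 5% of a 130G filesystem
echo "after -m 0: $(reserved_gib 130 0)G"
```

About 6G reserved matches the gap between 124G used and 130G total in the example. The actual count appears as "Reserved block count" in the tune2fs -l output.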

 


Fix superblock

 

fsck.ext4 -v /dev/sd?

dumpe2fs /dev/sd? | grep superblock

dumpe2fs 1.42.9 (4-Feb-2014)
  Primary superblock at 0, Group descriptors at 1-175
  Backup superblock at 32768, Group descriptors at 32769-32943
  Backup superblock at 98304, Group descriptors at 98305-98479
  ...

e2fsck -y -b block_number /dev/sd?
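If dumpe2fs cannot read the primary superblock at all, the backup locations can be computed instead. The sketch below assumes the sparse_super feature, where backups live in group 1 and in groups whose number is a power of 3, 5 or 7. It also assumes 4 KiB blocks, so that group g starts at block g * 32768. Both function names are made up for this example.

```sh
# is_power_of N BASE: succeeds when N is BASE^k for some k >= 0
is_power_of() (
    n=$1
    while [ "$n" -gt 1 ] && [ $(( n % $2 )) -eq 0 ]; do
        n=$(( n / $2 ))
    done
    [ "$n" -eq 1 ]
)

# backup_superblocks GROUP_COUNT [BLOCKS_PER_GROUP]
backup_superblocks() (
    g=1
    while [ "$g" -lt "$1" ]; do
        if is_power_of "$g" 3 || is_power_of "$g" 5 || is_power_of "$g" 7; then
            echo $(( g * ${2:-32768} ))
        fi
        g=$(( g + 1 ))
    done
)

# Backup superblock locations within the first 10 groups
backup_superblocks 10
```

The first two values printed, 32768 and 98304, agree with the dumpe2fs output above. Any of them can be passed to e2fsck -b.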


 


ext4 Enabling metadata checksums

 

To protect against non-hostile corruption

When a filesystem has been created with e2fsprogs 1.44 or later,

  metadata checksums should already be enabled by default.

[0]

  • OS: Ubuntu 16.04 <- no
  • OS: Ubuntu 18.04 <- yes

[1]

hardware accelerated CRC32C algorithm

CPU supports SSE 4.2 (grep sse /proc/cpuinfo)

[2]

kernel module (modprobe crc32c_generic / modprobe crc32c_intel)

grep CONFIG_CRYPTO_CRC32C /boot/config-*

CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y

[3]

# By creating a new filesystem

mkfs.ext4 -O metadata_csum /dev/path/to/disk

OR

# By conversion

# -D     Optimize directories in filesystem

e2fsck -Df /dev/path/to/disk 

# Convert the filesystem to 64-bit

# (the 64bit feature provides room for full 32-bit checksums;

# it also lets filesystems span 1024 PiB instead of just 16 TiB with 32-bit block numbers)

resize2fs -b /dev/path/to/disk

# Check: tune2fs -l /dev/path/to/disk | grep features => ... 64bit ...

# Enable metadata checksums (to disable: tune2fs -O ^metadata_csum /dev/path/to/disk)

tune2fs -O metadata_csum /dev/path/to/disk

# Check: tune2fs -l /dev/path/to/disk | grep features => ... metadata_csum ...

# mount option

It is not necessary to provide any mount options to enable the feature.

 


e2label

 

# e2label device [ volume-label ]

e2label /dev/sdd1 BackupDisk

e2label /dev/sdd1

BackupDisk

fstab

LABEL=BackupDisk    /backup    ext4    defaults 1 1

 

 


Use external journal to optimize performance

 

Step1

mke2fs -O journal_dev /dev/journal_device

Step2

tune2fs -J device=/dev/journal_device /dev/ext4_fs

 


DOC

https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

 

 

 
