ext4

Last updated: 2017-01-01

Introduction

 

An ext4 file system is split into a series of block groups.

To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group,

thereby reducing seek times.

The size of a block group is specified in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes.

With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB.

The number of block groups is the size of the device divided by the size of a block group.
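The arithmetic above can be checked with a short sketch. The function name and the 100 GiB device size are made up for illustration, and rounding the group count up (so that a trailing partial group still counts) is an assumption rather than something stated above.

```python
def block_group_geometry(block_size: int, device_size: int) -> dict:
    """Derive block-group figures from the block size (all sizes in bytes)."""
    # The block bitmap of a group is one block, one bit per block,
    # so a group can describe 8 * block_size blocks.
    blocks_per_group = 8 * block_size
    group_bytes = blocks_per_group * block_size
    # Assumption: round up so a partial last group is counted.
    group_count = -(-device_size // group_bytes)
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": group_bytes,
        "group_count": group_count,
    }


# Hypothetical 100 GiB device with the default 4 KiB block size.
print(block_group_geometry(4096, 100 * 1024**3))
```

With 4 KiB blocks this gives 32,768 blocks per group and 134,217,728 bytes (128 MiB) per group, matching the figures in the text.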

All fields in ext4 are written to disk in little-endian order. HOWEVER, all fields in jbd2 (the journal) are written to disk in big-endian order.
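To make the byte-order difference concrete, here is a minimal sketch using Python's struct module. It uses the two well-known magic numbers (0xEF53 for the ext4 superblock, 0xC03B3998 for jbd2 block headers); the helper names are invented for this example.

```python
import struct

EXT4_MAGIC = 0xEF53        # 16-bit magic in the ext4 superblock
JBD2_MAGIC = 0xC03B3998    # 32-bit magic in jbd2 block headers

# How the two values look as raw bytes on disk.
ext4_bytes = struct.pack("<H", EXT4_MAGIC)   # "<" = little-endian
jbd2_bytes = struct.pack(">I", JBD2_MAGIC)   # ">" = big-endian


def read_le16(buf: bytes, offset: int = 0) -> int:
    """Read a 16-bit little-endian field (ext4 style)."""
    return struct.unpack_from("<H", buf, offset)[0]


def read_be32(buf: bytes, offset: int = 0) -> int:
    """Read a 32-bit big-endian field (jbd2 style)."""
    return struct.unpack_from(">I", buf, offset)[0]


print(ext4_bytes.hex())            # low byte comes first
print(jbd2_bytes.hex())            # high byte comes first
print(hex(read_le16(ext4_bytes)))
print(hex(read_be32(jbd2_bytes)))
```

Reading an ext4 field with the jbd2 byte order (or the other way round) silently yields a wrong number, which is why a parser has to know which structure it is looking at.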
 

 


Features

 

Delayed allocation

Ext4 does not allocate blocks immediately when a process calls write().

Instead, it delays the allocation while the data is kept in cache, until the data is really going to be written to the disk.

This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't.

Extents

An extent is a run of contiguous physical blocks. In effect it says "the data is in the next n blocks".

For example,

a 100 MB file can be allocated into a single extent of that size,

instead of needing to create the indirect mapping for 25600 blocks (4 KB per block).

Huge files are split into several extents.
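The idea can be sketched as collapsing a per-block list into (start, length) runs. This is a simplified model for illustration only: the function name is invented, and real ext4 extents also record the logical offset within the file and have a maximum length.

```python
def to_extents(blocks):
    """Collapse physical block numbers (in file order) into (start, length) runs."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block directly follows the previous run: extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap or first block: start a new run.
            extents.append((block, 1))
    return extents


BLOCK_SIZE = 4 * 1024
blocks_needed = (100 * 1024 * 1024) // BLOCK_SIZE   # blocks in a 100 MB file

# Hypothetical layouts: one contiguous file, one fragmented file.
contiguous = list(range(5000, 5000 + blocks_needed))
fragmented = [10, 11, 12, 40, 41, 90]

print(blocks_needed)
print(to_extents(contiguous))
print(to_extents(fragmented))
```

The contiguous file needs one entry instead of 25600 per-block mappings, while the fragmented file needs one entry per run.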

Multiblock allocation

The ext3 block allocator only allocates one block (4KB) at a time.
Ext4 uses a "multiblock allocator" (mballoc) which allocates many blocks in a single call,
instead of a single block per call, avoiding a lot of overhead.

Journal checksumming

Journal checksumming is turned off by default for now.

Fast fsck

A list of unused inodes is stored at the end of each group's inode table, so fsck does not need to check those inodes.

Online defragmentation

 

Barriers

On by default.

A barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media.

This also requires an I/O stack that supports barriers.

Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use.

opts:

- barrier
- nobarrier

Persistent preallocation

Preallocates the necessary blocks and data structures for a file,
but the blocks hold no data until the application actually writes it.
(This saves applications, such as P2P clients, from doing it themselves inefficiently by filling a file with zeros.)
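From an application's point of view, preallocation can be requested roughly as follows. This is a minimal sketch assuming a Linux system; os.posix_fallocate is the standard-library wrapper for posix_fallocate(3), while the helper name, file name and the 16 MiB size are made up for the example.

```python
import os
import tempfile


def preallocate(path: str, size: int) -> None:
    """Reserve `size` bytes for `path` without writing zeros from user space."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate blocks for [0, size).
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "prealloc.bin")   # hypothetical file name
    preallocate(target, 16 * 1024 * 1024)
    print(os.path.getsize(target))               # file size after preallocation
```

On filesystems without native support the C library may fall back to writing the blocks itself, so the speed benefit depends on the underlying filesystem.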

 


Format a Partition

 

mkfs.ext4

-L new-volume-label        # Set the volume label for the filesystem

 


jbd2

 

JBD is the journaling block device that sits between the file system and the block device driver.
(ext3, ext4 and OCFS2 all use it)

The jbd2 version is for ext4

Atomic handle

An atomic handle guarantees that a high-level update either happens completely or not at all.

Transaction

For the sake of efficiency and performance, JBD groups several atomic handles into a single transaction.

 


ext4 mount options

 

  • user_xattr ( man 5 attr )
  • noacl (Disables POSIX ACL)
  • noatime
  • nodiratime
  • nobarrier
  • ro
  • journal_checksum (Enable checksumming of the journal transactions, so that the kernel and e2fsck can detect corruption)
  • stripe=n (Number of filesystem blocks that mballoc will try to use for allocation size and alignment;
                  for RAID5/6 this should be the number of data disks * RAID chunk size, in filesystem blocks)
  • data=<writeback|ordered*|journal> (* = default)
  • commit=nrsec (Sync all data and metadata every 'nrsec' seconds; the default is 5 seconds, and 0 means the default)
  • quota, noquota, grpquota, usrquota (These options are ignored by the filesystem. They are used only by quota tools)
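As a worked example for the stripe option, the value for a RAID5/6 array can be computed like this. The function name and the array layouts are hypothetical; the formula assumes the stripe width is counted in filesystem blocks over the data disks only (parity disks excluded).

```python
def stripe_blocks(data_disks: int, chunk_kib: int, block_size: int = 4096) -> int:
    """Return a stripe= value: data disks * RAID chunk size, in filesystem blocks."""
    chunk_blocks = (chunk_kib * 1024) // block_size
    return data_disks * chunk_blocks


# Hypothetical 4-disk RAID5 (3 data disks + 1 parity) with 64 KiB chunks.
print(stripe_blocks(3, 64))

# Hypothetical 6-disk RAID6 (4 data disks + 2 parity) with 512 KiB chunks.
print(stripe_blocks(4, 512))
```

The first layout would then be mounted with stripe=48 and the second with stripe=512, assuming 4 KiB filesystem blocks.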

 


Data Mode

 

There are 3 different data modes: writeback, ordered, journal

 * writeback: ext4 does not journal data at all (metadata journaling only)

 * ordered (default): only metadata is officially journaled
                 (when it is time to write the new metadata out to disk,
                 the associated data blocks are written first)

 * journal: provides full data and metadata journaling (slowest)
                (everything is written to the journal first, and then to its final location)
                (enabling this mode disables delayed allocation and O_DIRECT support)

 


ext4 online defrag

 

Install

yum install e2fsprogs

defrag single files:

e4defrag /path/to/file

opts:

-c     Get  a  current fragmentation count and an ideal fragmentation count

<File>                                         now/best       size/ext
win7.qcow2                                     667/20         62883 KB

 Total/best extents                             667/20
 Average size per extent                        62883 KB
 Fragmentation score                            0
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This file (win7.qcow2) does not need defragmentation.
 Done.

-v          Print the fragmentation count before and after defrag for each file

[1/1] "win7.qcow2"                extents: 667 -> 667
        Defrag size is larger than filesystem's free space              [ NG ]

defrag all files in a directory:

e4defrag /path/to/directory/

defrag a partition:

e4defrag /dev/sda1

e4defrag file_system_image.img

 


ext4 - Why doesn't deleting files increase available space

 

The filesystem reserves some space that only root can write to (5% of the blocks by default).

That's why you see 124G of 130G used, but zero available.

To set the reserved percentage to 0:

tune2fs -m 0 /dev/sd?
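The numbers in this example can be reproduced with a rough sketch. The function names are invented, the 5% figure is the usual mke2fs default, and df's rounding plus filesystem overhead mean the result only approximately matches what df shows.

```python
import os

GIB = 1024**3


def reserved_bytes(total_bytes: int, percent: float = 5.0) -> int:
    """Space set aside for root, given the reserved-blocks percentage."""
    return int(total_bytes * percent / 100)


def root_only_bytes(mount_point: str) -> int:
    """Free space that exists on a mounted filesystem but is not available to ordinary users."""
    st = os.statvfs(mount_point)
    # f_bfree counts all free blocks, f_bavail only those unprivileged users may use.
    return (st.f_bfree - st.f_bavail) * st.f_frsize


# A 130G filesystem with the default 5% reserve keeps about 6.5G for root,
# which is roughly the gap between 124G used and 130G total.
print(reserved_bytes(130 * GIB) / GIB)

# The same gap measured on a live mount point ("/" is just an example).
print(root_only_bytes("/"))
```

After tune2fs -m 0 the second value should drop to zero for that filesystem, since no blocks remain reserved.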

 


Fix superblock

 

check the filesystem:

fsck.ext4 -v /dev/sd?

list the primary and backup superblock locations:

dumpe2fs /dev/sd? | grep superblock

dumpe2fs 1.42.9 (4-Feb-2014)
  Primary superblock at 0, Group descriptors at 1-175
  Backup superblock at 32768, Group descriptors at 32769-32943
  Backup superblock at 98304, Group descriptors at 98305-98479
  ...

repair using a backup superblock (block_number taken from the list above):

e2fsck -y -b block_number /dev/sd?
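If dumpe2fs cannot read the device, the candidate block numbers can also be computed. The sketch below assumes the sparse_super feature (backups in group 1 and in groups that are powers of 3, 5 and 7), 4 KiB blocks (so the first data block is 0) and 32768 blocks per group; the function names and the group count of 30 are made up for the example.

```python
def has_superblock_copy(group: int) -> bool:
    """True if this block group holds a superblock copy under sparse_super."""
    if group in (0, 1):
        return True
    for base in (3, 5, 7):
        power = base
        while power < group:
            power *= base
        if power == group:
            return True
    return False


def backup_superblock_blocks(group_count: int, blocks_per_group: int = 32768):
    """Block numbers of the backup superblocks (group 0 holds the primary)."""
    return [
        group * blocks_per_group
        for group in range(1, group_count)
        if has_superblock_copy(group)
    ]


# Hypothetical filesystem with 30 block groups.
print(backup_superblock_blocks(30))
```

The first two values it prints, 32768 and 98304, match the dumpe2fs output shown above; any of the listed blocks can be passed to e2fsck -b.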

 

 


DOC

https://www.kernel.org/doc/Documentation/filesystems/ext4.txt