Last updated: 2020-04-17
Contents
CPU
cpuset
Sets the CPU placement (CPU affinity) of the group's tasks
e.g.
0-2,7
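A minimal sketch of applying it (the mount point and the group name ct1 are assumptions):
mkdir /sys/fs/cgroup/cpuset/ct1
echo 0-2,7 > /sys/fs/cgroup/cpuset/ct1/cpuset.cpus   # pin the group to CPUs 0, 1, 2 and 7
echo 0 > /sys/fs/cgroup/cpuset/ct1/cpuset.mems       # cpuset.mems must also be set before tasks can be attached
echo $$ > /sys/fs/cgroup/cpuset/ct1/tasks            # move the current shell into the group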
cpu.shares
- The weight of each group living in the same hierarchy (same level). # Default: 1024
- CPU shares are relative. (When one CT is idle, the other CT can use 100% of the CPU.)
cfs # only affects non-RT tasks
- cpu.cfs_period_us # Default: 100000 (100ms)
The length of each scheduler period. Larger periods improve throughput at the expense of latency.
- cpu.cfs_quota_us # Default: -1
The run time allowed for the current group during each cfs_period_us.
This represents aggregate time over all CPUs in the system:
to allow full usage of two CPUs, set this value to twice the value of cfs_period_us.
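For example, to allow a group up to two full CPUs (mount point and group name ct1 are assumptions):
echo 100000 > /sys/fs/cgroup/cpu/ct1/cpu.cfs_period_us   # 100ms period (the default)
echo 200000 > /sys/fs/cgroup/cpu/ct1/cpu.cfs_quota_us    # 200ms of CPU time per 100ms period = 2 CPUs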
rt
- cpu.rt_runtime_us 4000000 # Default: 0
- cpu.rt_period_us 5000000 # Default: 1000000
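A sketch of applying the values above (group path is an assumption; the cpu.rt_* files only exist with CONFIG_RT_GROUP_SCHED, and the root group must have enough RT runtime to hand out):
echo 5000000 > /sys/fs/cgroup/cpu/ct1/cpu.rt_period_us    # 5s period
echo 4000000 > /sys/fs/cgroup/cpu/ct1/cpu.rt_runtime_us   # RT tasks may run at most 4s in every 5s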
Statistics
# Total CPU time used by all tasks in this group (nanoseconds)
cpuacct.usage
reset counter
echo 0 > /cgroups/cpuacct/cpuacct.usage
# CPU time used on each core
cpuacct.usage_percpu
# aggregate user and system time consumed by tasks in this group.
cpuacct.stat
user 589182 <-- USER_HZ
system 41986
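Reading the counters (assuming the same cpuacct mount point as the reset example above):
cat /cgroups/cpuacct/cpuacct.usage          # total CPU time, nanoseconds
cat /cgroups/cpuacct/cpuacct.usage_percpu   # one value per core, nanoseconds
cat /cgroups/cpuacct/cpuacct.stat           # user/system time in USER_HZ ticks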
CPU cgroup example:
give firefox and movie_player different amounts of CPU time
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu
# mkdir multimedia    # create "multimedia" group of tasks
# mkdir browser       # create "browser" group of tasks
# # Configure the multimedia group to receive twice the CPU bandwidth
# # that of the browser group
# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares
# firefox &           # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks
# # Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
memory
See: Documentation/cgroups/memory.txt
- memory.limit_in_bytes
- memory.memsw.limit_in_bytes
The suffixes k, m, g may be used (upper or lower case); the value is always stored and reported in bytes. Writing -1 removes an existing limit.
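For example (group path is an assumption):
echo 512M > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes        # 536870912 (reported in bytes)
echo -1 > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes  # remove the limit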
RAM
memory.limit_in_bytes
Set limit of memory (RAM)
memory.soft_limit_in_bytes
upper limit for user memory including the file cache.
When the system runs short of RAM, the kernel tries to push each cgroup back down towards its soft_limit.
When read (cat memory.soft_limit_in_bytes), the unit is bytes.
soft_limit_in_bytes vs limit_in_bytes
Set/Show soft limit of memory usage
This limit is only enforced when the whole OS is short of RAM.
limit_in_bytes should be set higher than soft_limit_in_bytes.
Swap
memory.memsw.limit_in_bytes
Set limit of memory (RAM) + Swap
* memory.limit_in_bytes must be set before memory.memsw.limit_in_bytes can be set
memory.swappiness # Default: 60
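Sketch of setting RAM and RAM+Swap limits in the required order (same hypothetical group path as above):
echo 512M > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes       # RAM limit first
echo 1G > /sys/fs/cgroup/memory/ct1/memory.memsw.limit_in_bytes   # then RAM+Swap (must be >= the RAM limit)
echo 10 > /sys/fs/cgroup/memory/ct1/memory.swappiness             # swap less eagerly than the default 60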
Checking
memory.usage_in_bytes
show current usage for memory
memory.memsw.usage_in_bytes
show current usage for memory+Swap
memory.failcnt
show the number of times memory usage hit the limit
=> roughly, how many times the group was forced to fall back to swap
memory.memsw.failcnt
show the number of times memory+Swap usage hit the limit
=> even swap was not enough
memory.stat
show various statistics
e.g.
cache 0
rss 0
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
...
total_*
OOM
cat memory.oom_control
oom_kill_disable 0
under_oom 0
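Writing 1 to memory.oom_control disables the OOM killer for the group; tasks that exceed the limit then hang until memory is freed or the limit is raised (group path is an assumption):
echo 1 > /sys/fs/cgroup/memory/ct1/memory.oom_control   # oom_kill_disable becomes 1
cat /sys/fs/cgroup/memory/ct1/memory.oom_control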
Location under LXC
/sys/fs/cgroup/memory/lxc/CT_NAME
Network
net_prio.prioidx // a read-only value the kernel uses as an internal representation of this cgroup
net_prio.ifpriomap
net_cls.classid // shown in decimal when read; used together with tc
0xAAAABBBB (major handle AAAA, minor BBBB; e.g. 0x10001 = 1:1)
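A sketch tying both knobs to tc (interface eth0, group path, and the 1:1 class are assumptions):
echo "eth0 5" > /sys/fs/cgroup/net_prio/ct1/net_prio.ifpriomap   # priority 5 for traffic leaving eth0
echo 0x10001 > /sys/fs/cgroup/net_cls/ct1/net_cls.classid        # classid 1:1
tc qdisc add dev eth0 root handle 1: htb
tc class add dev eth0 parent 1: classid 1:1 htb rate 10mbit
tc filter add dev eth0 parent 1: protocol ip prio 10 handle 1: cgroup   # cls_cgroup matches net_cls.classid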
Block Device IO (blkio)
It can control:
- Proportional weight division (Implemented in CFQ)
- I/O throttling
cat /sys/block/sda/queue/scheduler
echo cfq > /sys/block/sda/queue/scheduler
Check Kernel Support
- grep CONFIG_BLK_CGROUP /boot/config-* # Block IO controller
- grep CONFIG_BLK_DEV_THROTTLING /boot/config-* # throttling in block layer
- grep CONFIG_CFQ_GROUP_IOSCHED /boot/config-* # group scheduling in CFQ
- grep CONFIG_IOSCHED_CFQ /boot/config-*
Setting
weight
blkio.weight # 100 ~ 1000 (default: 500)
Specifies the relative proportion (weight) of block I/O access available to the group.
This value is overridden for specific devices by blkio.weight_device.
throttle
blkio.throttle.read_bps_device # bytes/second
Format: major:minor bytes_per_second
blkio.throttle.read_iops_device
blkio.throttle.write_bps_device
blkio.throttle.write_iops_device
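For example, to cap a group's reads from sda (8:0, a device assumption) at 10 MB/s, as Test 2 below does for both reads and writes:
echo "8:0 10485760" > /cgroup/blkio/test1/blkio.throttle.read_bps_device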
Report
blkio.reset_stats
resets the statistics recorded in the other pseudofiles.
blkio.time
reports the time that a cgroup had I/O access to specific devices.
major, minor, and time # time in milliseconds (ms)
blkio.sectors
reports the number of sectors transferred to or from specific devices by a cgroup.
major, minor, and sectors
Throttle Report
blkio.throttle.io_serviced
Number of IOs completed to/from the disk by the group (read or write, sync or async)
Major:Minor Operation number_of_IO
blkio.throttle.io_service_bytes
Number of bytes transferred to/from the disk by the group.
CFQ Report
blkio.io_serviced
reports the number of I/O operations performed on specific devices by a cgroup as seen by the CFQ scheduler.
Format: major, minor, operation, and number
operation: read, write, sync, or async
On the other hand, blkio.throttle.io_serviced counts the number of IOs in terms of bios, as seen by the throttling policy.
blkio.io_service_bytes
blkio.io_service_time
Test 1
Preparation
mkdir /cgroup
mount -t tmpfs cgroup_root /cgroup
mkdir /cgroup/blkio
mount -t cgroup -o blkio none /cgroup/blkio
mkdir /cgroup/blkio/test1
mkdir /cgroup/blkio/test2
# Create two same size files on same disk
cd /root
dd if=/dev/zero of=zerofile1 bs=1M count=4096
cp zerofile1 zerofile2
Set the weights
echo 1000 > /cgroup/blkio/test1/blkio.weight
echo 500 > /cgroup/blkio/test2/blkio.weight
Start Test
sync; echo 3 > /proc/sys/vm/drop_caches
cgexec -g blkio:test1 time dd if=zerofile1 of=/dev/null &
cgexec -g blkio:test2 time dd if=zerofile2 of=/dev/null &
Or
dd if=zerofile1 of=/dev/null & P1=$!; echo $P1 > /cgroup/blkio/test1/tasks
dd if=zerofile2 of=/dev/null & P2=$!; echo $P2 > /cgroup/blkio/test2/tasks
Checking
iotop -qqq -p $P1 -p $P2
Remark: Cleanup
rmdir /cgroup/blkio/test1 /cgroup/blkio/test2
umount /cgroup/blkio /cgroup
Test 2
Preparation
ls -l /dev/sda # brw-rw---- 1 root disk 8, 0 Apr 3 17:26 /dev/sda
echo "10 * 1024 * 1024" | bc # 10485760
echo "8:0 10485760" > /cgroup/blkio/test1/blkio.throttle.read_bps_device
echo "8:0 10485760" > /cgroup/blkio/test1/blkio.throttle.write_bps_device
Start Test
# "oflag=direct" <= Currently only sync IO queues are support.
# All the buffered writes are still system wide and not per group. Throttle 不支援 Buffer IO
dd if=/dev/zero of=zerofile1 bs=1M count=4096 oflag=direct &
echo $! > /cgroup/blkio/test1/tasks
iotop -p $!
# Test
sync; echo 3 > /proc/sys/vm/drop_caches
Block Device IO
blkio.weight // default: 500
cat blkio.time // ms
8:0 29478 // 8, 0 = sda
blkio.weight_device
echo 8:0 1000 > blkio.weight_device
info
blkio.sectors // number of sectors transferred to or from
blkio.time // ms
blkio.io_serviced
8:0 Read 33258
8:0 Write 0
8:0 Sync 33258
8:0 Async 0
8:0 Total 33258
Total 33258
blkio.io_service_bytes
8:0 Read 857239552
8:0 Write 0
8:0 Sync 857239552
8:0 Async 0
8:0 Total 857239552
Total 857239552
blkio.io_wait_time // time spent waiting for service in the scheduler queues
blkio.io_queued
8:0 Read 0
8:0 Write 0
8:0 Sync 0
8:0 Async 0
8:0 Total 0
Total 0
blkio.throttle.read_bps_device
blkio.throttle.write_bps_device
blkio.throttle.read_iops_device
blkio.throttle.write_iops_device
echo "8:0 10" > blkio.throttle.write_iops_device
echo "8:0 10485760" > blkio.throttle.write_bps_device // 10 mb
blkio.reset_stats // reset counter
echo 1 > blkio.reset_stats
Block Device IO CFQ cgroup settings
Setting
slice_idle = 0
group_idle = 1
quantum = 16
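These are per-device CFQ scheduler tunables (not cgroup files); a sketch for sda:
echo 0 > /sys/block/sda/queue/iosched/slice_idle
echo 1 > /sys/block/sda/queue/iosched/group_idle
echo 16 > /sys/block/sda/queue/iosched/quantum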
group_isolation
When group isolation is disabled, fairness can be expected only for a sequential workload.
By default, group isolation is enabled and fairness can be expected for random I/O workloads as well.
echo 1 > /sys/block/<disk_device>/queue/iosched/group_isolation
If group_isolation=0, then CFQ automatically moves all the random seeky queues into the root group.
That means there will be no service differentiation for that kind of workload.
This leads to better throughput as we do collective idling on root sync-noidle tree.
slice_idle
This specifies how long CFQ should idle for next request on certain cfq queues (for sequential workloads)
and service trees (for random workloads) before queue is expired and CFQ selects next queue to dispatch from.
By default slice_idle is a non-zero value. That means by default we idle on queues/service trees.
This can be very helpful on highly seeky media like single spindle SATA/SAS disks
where we can cut down on overall number of seeks and see improved throughput.
"0" => CFQ will not idle between cfq queues of a cfq group => able to driver higher queue depth => achieve better throughput.
group_idle
When set, CFQ will idle on the last process issuing I/O in a cgroup.
This should be set to 1 when using proportional weight I/O cgroups and setting slice_idle to 0
By default group_idle is same as slice_idle and does not do anything if slice_idle is enabled.
quantum
The quantum controls the number of I/Os that CFQ will send to the storage at a time,
essentially limiting the device queue depth. By default, this is set to 8.