Traffic Control

Last updated: 2023-09-12


 


Terminology

 

qdisc (Queue Discipline)

 * Queueing: determines the way in which data is SENT

  • pfifo_fast - FIFO with priority bands (default)
    It has 3 bands: 0 (highest) ~ 2 (lowest)
    man tc-pfifo_fast
  • sfq - Stochastic Fairness Queueing
    Only useful when the link bandwidth is fully used
  • htb - Hierarchy Token Bucket
    CLASSFUL
    replacement for CBQ (supports hierarchical borrowing)
    man tc-htb
  • tbf - Token Bucket Filter
    CLASSLESS
    man tc-tbf
  • cbq - Class Based Queueing
  • fq_codel - Fair Queuing (FQ) with Controlled Delay (CoDel)
    man tc-fq_codel

Classful qdisc

A classful qdisc contains multiple classes. Some of these classes contain a further qdisc.

Level

The level of a class determines its position in the hierarchy. Leaves have level 0, root classes have level LEVEL_COUNT-1.

Scheduling

A qdisc may, with the help of a classifier,

decide that some packets need to go out earlier than others.

This process is called Scheduling ( Example: pfifo_fast )

Classes

  • A class, in turn, may have several classes added to it.
  • A classful qdisc may have many classes
  • A leaf class is a class with no child classes
  • When you create a class, a fifo qdisc is attached to it.
    (When you add a child class, this qdisc is removed)

Classifier

Each classful qdisc needs to determine to which class it needs to send a packet.

Shaping

The process of delaying packets before they go out to make traffic conform to a configured maximum rate.

Shaping is performed on egress.

Policing

Delaying or dropping packets in order to make traffic stay below a configured bandwidth.

In Linux, policing can only "drop" a packet and not delay it

non-Work-Conserving

Some qdiscs, such as the Token Bucket Filter, may need to hold on to a packet for a certain time in order to limit the bandwidth.

This means that they sometimes refuse to pass a packet, even though they have one available.

Ingress Qdisc

The ingress qdisc handles incoming packets at a very early stage, before they have seen much of the kernel.

It is therefore a very good place to drop traffic very early, without consuming a lot of CPU power.

dequeueing

The packet now sits in the qdisc,

waiting for the kernel to ask for it for transmission over the network interface.

 

 


Install

 

yum -y install iproute            # CentOS 7

yum -y install iproute-tc         # CentOS 8

 


tc Command Syntax

 

tc  [ OPTIONS ]  OBJECT  COMMAND  dev  <eth0 | ppp0> [ parent n:m | root ]

OBJECT:

  • qdisc
  • class
  • filter
  • action
  • monitor

COMMAND:

  • add
  • del
  • show
  • replace

dev

  • Must be a primary interface; an alias such as eth0:0 is not allowed

parent

  • n:m
  • root

tc qdisc

qdiscs are elementary to understanding traffic control.

tc class

Some qdiscs can contain classes, which contain further qdiscs.
Traffic may then be enqueued in any of the inner qdiscs, which are within the classes.

tc filter

A filter is used by a classful qdisc to determine in which class a packet will be enqueued.

COMMANDS

add, delete, change, replace (replace performs a nearly atomic remove/add on an existing node id)

 


Token bucket (algorithm)

 

- A token is added to the bucket every 1/r seconds.
- Each token permits one byte of data to be sent (token:data => 1:1)
- The bucket can hold at the most b tokens.

Average rate = r

accumulation of token => allows a short burst

* At 10mbit/s on Intel (10 ms timer tick), you need at least a 10kbyte buffer to reach the configured rate
* latency => the maximum amount of time a packet can sit in the TBF
* mpu => minimum packet unit. A zero-sized packet does not use zero bandwidth (on ethernet, no packet uses less than 64 bytes)
* The achievable peakrate depends on the bucket size
* Due to the default 10 ms timer resolution of Unix (see CONFIG_HZ_*), with packets averaging 10,000 bits, peakrate is limited to 1mbit/s
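The relation between rate, timer tick and bucket size can be sketched as follows. This is a rule-of-thumb estimate only, assuming the bucket must hold at least the bytes that accumulate during one timer tick; min_burst_bytes is a hypothetical helper, not part of tc.

```shell
#!/bin/sh
# min_burst_bytes RATE_BITS_PER_SEC HZ
# Bytes that accumulate during one timer tick at the given rate.
# (hypothetical helper for illustration, not a tc command)
min_burst_bytes() {
    rate_bits=$1
    hz=$2
    echo $(( rate_bits / 8 / hz ))
}

min_burst_bytes 10000000 100    # 10mbit/s with a 10 ms tick (HZ=100): 12500
min_burst_bytes 220000 100      # 220kbit with a 10 ms tick: 275
```

The 10mbit/s result is in the same range as the "at least 10kbyte" figure quoted above. In practice the burst should also be no smaller than one full packet (MTU).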

Application

# If you have a networking device with a large queue, like a DSL modem or a cable modem,
# and you talk to it over a fast device, like over an ethernet interface,
# you will find that uploading absolutely destroys interactivity.

tc qdisc add dev ppp0 root tbf rate 220kbit latency 50ms burst 1540

 


Units

 

SI prefix (k-, m-, g-, t-)           # 1000

IEC prefix (ki-, mi-, gi- and ti-) # 1024

e.g.

Bits per second (bit)

  • kbit
  • mbit

Bytes per second (bps); note that in tc "bps" means bytes, not bits, per second

  • kibps
  • mibps
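Because "kbit" and "kbps" differ by a factor of 8, it can help to normalise rates to bits per second before comparing them. The following is a minimal sketch; rate_to_bits is a hypothetical helper that handles only integer values and the subset of units listed in its case statement.

```shell
#!/bin/sh
# rate_to_bits RATE
# Convert a tc-style rate (integer value plus unit) to bits per second.
# (hypothetical helper; covers only a subset of the tc units)
rate_to_bits() {
    value=${1%%[!0-9]*}      # leading digits
    unit=${1#"$value"}       # remaining suffix
    case $unit in
        ""|bit) mult=1 ;;          # a bare number means bits per second
        kbit)   mult=1000 ;;
        mbit)   mult=1000000 ;;
        kibit)  mult=1024 ;;
        mibit)  mult=1048576 ;;
        bps)    mult=8 ;;          # bytes per second
        kbps)   mult=8000 ;;
        mbps)   mult=8000000 ;;
        kibps)  mult=8192 ;;
        mibps)  mult=8388608 ;;
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
    echo $(( value * mult ))
}

rate_to_bits 100kbit    # 100000
rate_to_bits 100kbps    # 800000
```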

 


pfifo_fast (Linux default qdisc)

 

pfifo_fast provides three different bands (individual FIFOs):

0 (highest priority)
1
2

* Within a particular band, packets are sent in the order they arrived.
* pfifo_fast does not delay packets - it sends them at the speed the device can accept them

Mapping

TOS (4 bits, 16 combinations in total)

Binary Decimal   Meaning
-----------------------------------------
1000   8         Minimize delay (md)
0100   4         Maximize throughput (mt)
0010   2         Maximize reliability (mr)
0001   1         Minimize monetary cost (mmc)
0000   0         Normal Service

Mapping of the 16 TOS combinations to Linux priority and band

TOS     Bits  Means                    Linux Priority    Band
------------------------------------------------------------
0x0     0     Normal Service           0 Best Effort     1
0x2     1     Minimize Monetary Cost   1 Filler          2
0x4     2     Maximize Reliability     0 Best Effort     1
0x6     3     mmc+mr                   0 Best Effort     1
0x8     4     Maximize Throughput      2 Bulk            2
0xa     5     mmc+mt                   2 Bulk            2
0xc     6     mr+mt                    2 Bulk            2
0xe     7     mmc+mr+mt                2 Bulk            2
0x10    8     Minimize Delay           6 Interactive     0
0x12    9     mmc+md                   6 Interactive     0
0x14    10    mr+md                    6 Interactive     0
0x16    11    mmc+mr+md                6 Interactive     0
0x18    12    mt+md                    4 Int. Bulk       1
0x1a    13    mmc+mt+md                4 Int. Bulk       1
0x1c    14    mr+mt+md                 4 Int. Bulk       1
0x1e    15    mmc+mr+mt+md             4 Int. Bulk       1

Linux Priority => index into the priomap; the value at that position is the band
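The two-step lookup (TOS to Linux priority, then priority to band through the priomap) can be sketched as below. The priority list is copied from the "Linux Priority" column of the table above and the priomap from the default pfifo_fast output; nth and tos_to_band are hypothetical helper names.

```shell
#!/bin/sh
# Linux priority for each of the 16 TOS combinations (table rows 0..15)
TOS2PRIO="0 1 0 0 2 2 2 2 6 6 6 6 4 4 4 4"
# Default pfifo_fast priomap: band for Linux priority 0..15
PRIOMAP="1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1"

# nth LIST INDEX: print the INDEX-th (0-based) word of LIST
nth() {
    list=$1
    idx=$2
    set -- $list
    shift "$idx"
    echo "$1"
}

# tos_to_band TOS: print the pfifo_fast band for a TOS value (e.g. 0x10)
tos_to_band() {
    tos=$(( $1 ))                      # accepts hex (0x10) or decimal (16)
    row=$(( (tos & 0x1e) >> 1 ))       # the 4 TOS bits select the table row
    prio=$(nth "$TOS2PRIO" "$row")
    nth "$PRIOMAP" "$prio"
}

tos_to_band 0x10    # Minimize Delay      -> band 0
tos_to_band 0x08    # Maximize Throughput -> band 2
```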

TOS used by common services

TELNET                   1000 (8)          (minimize delay)
FTP
        Control          1000 (8)          (minimize delay)
        Data             0100 (4)          (maximize throughput)

Show the priomap mapping

tc qdisc show dev eth0

qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

priomap classForPrio_0 classForPrio_1 ... classForPrio_15

Show the TOS

Only visible with "tcpdump -v -v"

 


fifo (First In, First Out)

 

bfifo (Byte limited)

tc qdisc ... add bfifo [ limit bytes ]

pfifo (Packet limited)

tc qdisc ... add pfifo [ limit packets ]

limit => Maximum queue size.

If the list is too long, no further packets are allowed on. This is called 'tail drop'.

* [p|b]fifo, pfifo_fast (CLASSLESS)

# Usage

If you don't want to shape, but only want to see if your interface is so loaded that it has to queue, check the qdisc statistics.

# To list current rules

tc [-s] qdisc show [dev ethX]

-s[tatistics]

e.g.

tc -s qdisc show

qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 30482712 bytes 528027 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

"dropped 0, overlimits 0" => the qdisc neither dropped nor slowed down any packets

limited FIFO

tc qdisc add root dev <eth0> pfifo [ limit packets ]
tc qdisc add root dev <eth0> bfifo [ limit bytes ]

tc qdisc change dev eth0 root pfifo limit 100

Default limit:

For pfifo, it defaults to the interface txqueuelen

For bfifo, it defaults to txqueuelen * MTU

 


SFQ (with PRIO qdisc)

 

* SFQ is only useful in case your actual outgoing interface is really full!
* If you were only to run SFQ, nothing would happen, as packets enter & leave your router without delay

'Stochastic' because it doesn't really allocate a queue for each session,

it divides traffic over a limited number of queues using a hashing algorithm. The hash is changed periodically (perturb).

Set

tc qdisc add dev ppp0 root sfq perturb 10

perturb       # Reconfigure hashing once every this many seconds.

limit            # The total number of packets that will be queued by this SFQ (after that it starts dropping them)

Show

tc -s -d qdisc ls

limit 128p flows 128/1024 perturb 10sec

(1) 128 packets can wait in this queue
(2) 128 flows can be active at a time
(3) 1024 hashbuckets available for accounting
(4) every 10 seconds, the hashes are reconfigured

Application

          1:   root qdisc
        / | \ 
       /  |  \
     1:1  1:2  1:3    classes
      |    |    |
     10:  20:  30:    qdiscs
     sfq  tbf  sfq
band  0    1    2

CLI

tc qdisc add dev eth0 root handle 1: prio

# This *instantly* creates classes 1:1, 1:2, 1:3

tc qdisc add dev eth0 parent 1:1 handle 10: sfq
tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
tc qdisc add dev eth0 parent 1:3 handle 30: sfq     

** The bands are classes, and are called major:1 to major:3 by default

 


Incoming traffic

 

To 'shape' incoming traffic which you are not forwarding, use the Ingress Policer.

Incoming shaping is called 'policing', by the way, not 'shaping'.

 


roots, handles, siblings and parents

 

handle = a unique identifier within the traffic control structure for class and classful qdisc
              ("handle" "1:" just a name or identifier)

major:minor                 # x:y

1) the object as a qdisc if minor is 0. (x: / x:0) Any other value identifies the object as a class.

2) All classes sharing a parent must have unique minor numbers.
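The rule in 1) can be expressed as a small check. This is only a sketch: handle_kind is a hypothetical helper, and it treats just an empty minor or a literal "0" as zero.

```shell
#!/bin/sh
# handle_kind MAJOR:MINOR
# Print "qdisc" if the minor part is empty or 0, otherwise "class".
# (hypothetical helper for illustration, not a tc command)
handle_kind() {
    case $1 in
        *:*) ;;
        *) echo invalid; return 1 ;;
    esac
    minor=${1#*:}
    case $minor in
        ""|0) echo qdisc ;;
        *)    echo class ;;
    esac
}

handle_kind 1:      # qdisc
handle_kind 1:0     # qdisc
handle_kind 1:12    # class
```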

qdisc and class

########## classifier chain ##########

Kernel
========================================
          1:            # root qdisc (1:0)
           |
          1:1           # child class
        /  |  \
       /   |   \
   1:10  1:11  1:12     # child class
    |      |     |
    |     11:    |      # leaf class
    |            |
    10:         12:     # qdisc
   /   \       /   \
10:1  10:2  12:1  12:2  # leaf classes
========================================
NIC

* Packets get enqueued and dequeued at the root qdisc

* classes never get dequeued faster than their parents allow.

* nested classes ONLY talk to their parent qdiscs, never to an interface.

1: -> 1:1 -> 1:12 -> 12: -> 12:2

Equivalent to (a filter attached to the root can send the packet there directly)

1: -> 12:2

 * A classful qdisc can only have children classes of its type.

    For example, an HTB qdisc can only have HTB classes as children.

 


Timer Interrupt Frequency Configuration

 

100 Hz may be more beneficial for servers and NUMA systems that do not need to have
a fast response for user interaction and that may experience bus
contention and cacheline bounces as a result of timer interrupts.

Note that the timer interrupt occurs on each processor in an SMP
environment leading to NR_CPUS * HZ number of timer interrupts per second.

grep -E '^CONFIG_HZ_[0-9]+' "/boot/config-$(uname -r)"

100 Hz is a typical choice for servers, SMP and NUMA systems with lots of processors
250 Hz is a good compromise choice allowing server performance
1000 Hz is the preferred choice for desktop systems

 


NetEm - Network Emulator

 

 * Settings apply to outgoing packets on the chosen network interface.

limit packets
delay TIME [ JITTER [ CORRELATION ] ]
loss PERCENT [ CORRELATION ]   # packets are dropped at random
corrupt PERCENT
duplicate PERCENT
reorder PERCENT
rate RATE

Slow down traffic by 200 ms

# delete all existing rules first

tc qdisc del dev eth0 root

# delay 200ms

tc qdisc add dev eth0 root netem delay 200ms

# show

tc -s qdisc ls dev eth0

qdisc netem 8001: root refcnt 2 limit 1000 delay 200.0ms
 Sent 18994 bytes 305 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 1p requeues 0

# jitter: 50ms ± 30ms => 20ms ~ 80ms

tc qdisc replace dev eth0 root netem delay 50ms 30ms

Randomly drop 1% of packets

tc qdisc add dev ens4 root netem loss 1%

 * The smallest possible non-zero value: 2^-32 (0.0000000232%)

 


tc Filter (filtering commands)

 

filtertype: u32

It extracts a bit field from a 32 bit word in the packet

Bases the decision on fields within the packet and if it is equal to a value supplied by you it has a match.

 * Filters with a lower prio (preference) number are checked first; the first match wins

# attach to eth0, parent is the root qdisc 1: (1:0)
# add a u32 filter with priority 60
# match remote (destination) port 22
# send it to class 1:101

tc filter add dev eth0 protocol ip parent 1: \
    prio 60 u32 \
    match ip dport 22 0xffff \
    flowid 1:101

Supported matches:

  • match ip dst 3.2.1.0/24           # dst, src, all
  • match ip sport 80 0xffff           # dport, sport

filtertype: fw (fwmark, set with iptables)

iptables -A PREROUTING -t mangle -i eth0 -j MARK --set-mark 6

tc filter add dev eth1 protocol ip parent 1: prio 1 handle 6 fw flowid 1:1

# show

iptables -L -t mangle -n -v

Delete Filter Example

First add a filter

tc filter add dev eth0 parent 1: protocol ip handle 80 fw flowid 1:20

Show it

tc filter show dev eth0

filter parent 1: protocol ip pref 49152 fw
filter parent 1: protocol ip pref 49152 fw handle 0x50 classid 20:

# Delete it

tc filter del dev eth0 protocol ip pref 49152

 


Limit Outgoing (HTB)

 

HTB ensures that the amount of service provided to each class is

  at least the minimum of the amount it requests and the amount assigned to it.

When a class requests less than the amount assigned,

  the remaining (excess) bandwidth is distributed to other classes which request service.

 * The sum of the class rates at each level must not exceed the rate of the level above, otherwise the limit is not effective

 * With HTB, you should attach all filters to the root !!

 * Each node within the tree can have its own filters

 * HTB controls the use of the outbound bandwidth on a given link

 * each class has a single parent

 * each class contains a "leaf" qdisc which by default has pfifo

 * one root class cannot borrow from another root class

Doc

Example 1: Sharing Hierarchy with u32

Diagram

       qdisc (1:)  # attach htb
         |
       _1:1_       # root class. 80mbit                       Level 3
      /     \
    1:11     1:12  # leaf class: 1:12; child class: 1:11      Level 2
   /    \
1:21   1:22        #                                          Level 1
  |      |
 21:    22:        #                                          Level 0 (leaf: 21:, 22:)

source port map to class

  • 1:12 - *
  • 1:21 - 8021/tcp
  • 1:22 - 8022/tcp

[1] Delete existing rules

tc qdisc del dev eth0 root
iptables -t mangle -F           # do not run unless necessary

P.S.

tc qdisc del dev eth0 root       # removes not only the qdisc but also the tc filters attached to it

When the interface has no qdisc configured, you will see

Error: Cannot delete qdisc with handle of zero.          # OS: Rocky8

[2] Attaches queue discipline HTB to eth0, and set default class(1:12)

tc qdisc add dev eth0 root handle 1: htb default 12

Explanation

"handle 1:" => "handle 1:0"                           # x:y

The handle is just a name or identifier with which to refer to the qdisc.

The handle for a qdisc must have zero for its y value.

"default minor-id"    # e.g. "default 12"

Unclassified traffic gets sent to the class with this minor-id,
so here any traffic that is not otherwise classified is assigned to class 1:12.

(If "default" is not given, the minor-id defaults to 0.)

[3] "root" class, "1:1" under the qdisc "1:"

tc class add dev eth0 parent 1: classid 1:1 htb rate 80mbit

Explanation

classid major:minor

Classes are identified by their classid.

 * The major number must be equal to the major number of the qdisc to which it belongs.

[4] Create two classes under the root class 1:1

tc class add dev eth0 parent 1:1 classid 1:11 htb rate 48mbit ceil 80mbit
tc class add dev eth0 parent 1:1 classid 1:12 htb rate 32mbit

Explanation

ceil:

The maximum bandwidth a class can reach by borrowing from its parent when the parent has bandwidth to spare.

The default ceil is the same as the rate.

[5] Create two child classes under "1:11"

tc class add dev eth0 parent 1:11 classid 1:21 htb rate 32mbit
tc class add dev eth0 parent 1:11 classid 1:22 htb rate 16mbit

Explanation

parent major:minor:

Place of this class within the hierarchy.

[6] Attach queuing disciplines to the leaf classes (if none is set, the default is pfifo) [optional step]

tc qdisc add dev eth0 parent 1:21 handle 21: sfq perturb 10
tc qdisc add dev eth0 parent 1:22 handle 22: pfifo
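Steps [1] to [6] can be collected into one script. The sketch below is a dry run by default: the TC and DEV variables and the htb_setup function are assumptions made for illustration, and with the default TC="echo tc" the commands are only printed. The leaf classes use major number 1 so that they match the qdisc "1:".

```shell
#!/bin/sh
# Dry run by default: TC="echo tc" only prints the commands.
# Run with TC=tc (as root) to apply them.
TC=${TC:-echo tc}
DEV=${DEV:-eth0}

htb_setup() {
    # [1] remove any existing root qdisc (ignore the error if there is none)
    $TC qdisc del dev "$DEV" root 2>/dev/null || true
    # [2] root qdisc, unclassified traffic goes to 1:12
    $TC qdisc add dev "$DEV" root handle 1: htb default 12
    # [3] root class
    $TC class add dev "$DEV" parent 1: classid 1:1 htb rate 80mbit
    # [4] classes under the root class
    $TC class add dev "$DEV" parent 1:1 classid 1:11 htb rate 48mbit ceil 80mbit
    $TC class add dev "$DEV" parent 1:1 classid 1:12 htb rate 32mbit
    # [5] child classes under 1:11
    $TC class add dev "$DEV" parent 1:11 classid 1:21 htb rate 32mbit
    $TC class add dev "$DEV" parent 1:11 classid 1:22 htb rate 16mbit
    # [6] leaf qdiscs (optional)
    $TC qdisc add dev "$DEV" parent 1:21 handle 21: sfq perturb 10
    $TC qdisc add dev "$DEV" parent 1:22 handle 22: pfifo
}

htb_setup
```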

[7] Which packets belong in which class

Set directly with tc filter

Backup Server IP (n.n.n.n)

# By IP - traffic to backup server (float 100k)
tc filter add dev eth0 protocol ip parent 1:0 prio 50 u32 \
   match ip dst n.n.n.n flowid 1:101

Source Port (8080)

# By Port - tcp port 8080 (fix 300k)
tc filter add dev eth0 protocol ip parent 1:0 prio 49 u32 \
   match ip sport 8080 0xffff flowid 1:102

# By IP & Port

tc filter add dev eth0 protocol ip parent 1:0 prio 48 u32 \
   match ip src 192.168.123.10 match ip sport 1080 0xffff flowid 1:12

prio

* classes with higher priority are offered excess bandwidth first.

Which classes should you prioritize? Generally those classes where you really need low delays.

1 => highest priority

Filter combined with iptables (fwmark) version

# Send marked packets to a class

tc filter add dev eth0 parent 1: protocol ip handle 8021 fw flowid 1:21
tc filter add dev eth0 parent 1: protocol ip handle 8022 fw flowid 1:22

 * flowid = classid

# Set a mark on packets with source port 8021 / 8022

iptables -t mangle -A OUTPUT -p tcp --sport 8021 -j MARK --set-mark 8021
iptables -t mangle -A OUTPUT -p tcp --sport 8022 -j MARK --set-mark 8022

 * The mark must be a number. "iptables -t mangle -L" displays it in hex (0x????)
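Since tc and iptables print the mark in hex, converting the decimal mark makes it easier to match the listings. A minimal sketch (mark_to_hex is a hypothetical helper built on printf):

```shell
#!/bin/sh
# mark_to_hex MARK: print a decimal fwmark the way tc and iptables show it
mark_to_hex() {
    printf '0x%x\n' "$1"
}

mark_to_hex 8021    # 0x1f55
mark_to_hex 8022    # 0x1f56
mark_to_hex 80      # 0x50
```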

Checking

(1)

tc qdisc show dev eth0

qdisc htb 1: root refcnt 2 r2q 10 default 0x12 direct_packets_stat 0 direct_qlen 1000
qdisc sfq 21: parent 1:21 limit 127p quantum 1514b depth 127 divisor 1024 perturb 10sec
qdisc pfifo 22: parent 1:22 limit 1000p

(2)

tc class show dev eth0

class htb 1:22 parent 1:11 leaf 22: prio 0 rate 16Mbit ceil 16Mbit burst 1600b cburst 1600b
class htb 1:11 parent 1:1 rate 48Mbit ceil 80Mbit burst 1590b cburst 1600b
class htb 1:1 root rate 80Mbit ceil 80Mbit burst 1600b cburst 1600b
class htb 1:12 parent 1:1 prio 0 rate 32Mbit ceil 32Mbit burst 1600b cburst 1600b
class htb 1:21 parent 1:11 leaf 21: prio 0 rate 32Mbit ceil 32Mbit burst 1600b cburst 1600b

(3)

tc filter show dev eth0

filter parent 1: protocol ip pref 49151 fw chain 0
filter parent 1: protocol ip pref 49151 fw chain 0 handle 0x1f56 classid 1:22
filter parent 1: protocol ip pref 49152 fw chain 0
filter parent 1: protocol ip pref 49152 fw chain 0 handle 0x1f55 classid 1:21

(4)

iptables -v -nL -t mangle | grep MARK

    0     0 MARK       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp spt:8021 MARK set 0x1f55
    0     0 MARK       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp spt:8022 MARK set 0x1f56

Note

 * "firewall-cmd --reload" wipes any rules previously added with iptables

Testing

# Test from 192.168.123.10

IP=192.168.88.128

wget $IP:8012/test.bin -O /dev/null

wget $IP:8021/test.bin -O /dev/null

wget $IP:8022/test.bin -O /dev/null

Statistics

tc -s -d qdisc show dev eth0

tc -s -d class show dev eth0

-d[etails]

-s[tatistics]

overlimits

how many times the discipline delayed a packet.

level

quantum

You don't need to specify quantums manually, as HTB chooses precomputed values.

pps

tells you the actual (10 sec averaged) rate going through the class.

giants (Default: 1600 bytes)

number of packets larger than mtu set in tc command.
HTB will work with these but rates will not be accurate at all.

lended

packets donated by this class

borrowed (borrows are transitive)

borrowed from parent.

Other

quantums:

In fact when more classes want to borrow bandwidth they are each given some number of bytes before serving other competing class.

This number is called quantum.

burst

burst bytes: Amount of bytes that can be burst at ceil speed

cburst bytes: Amount of bytes that can be burst at "infinite" speed

Why use bursts? They are a cheap and simple way to improve response times on a congested link.

 * The burst and cburst of a class should always be at least as high as that of any of its children.

e.g.

... burst 2k

Notes

nginx config

http {
    server {
        listen       8012 default_server;
        listen       8021 default_server;
        listen       8022 default_server;
        ...
    }
    ...
}

 


Over root traffic

 

Leaf rates exceeding the root total

parent 1: classid 1:1 htb rate 100kbps

parent 1:1 classid 1:80 htb rate 60kbps
parent 1:1 classid 1:81 htb rate 70kbps
parent 1:1 classid 1:82 htb rate 80kbps

With the settings above, classes 80, 81 and 82 get 60kbps, 70kbps and 80kbps respectively, ignoring the 100kbps limit of the root class!!
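A quick sanity check before loading a configuration is to compare the sum of the child rates with the parent rate. The sketch below assumes all rates are given as plain integers in the same unit; check_child_rates is a hypothetical helper.

```shell
#!/bin/sh
# check_child_rates PARENT CHILD...
# Report whether the sum of the child rates exceeds the parent rate.
# (hypothetical helper; all rates are integers in the same unit)
check_child_rates() {
    parent=$1
    shift
    total=0
    for rate in "$@"; do
        total=$(( total + rate ))
    done
    if [ "$total" -gt "$parent" ]; then
        echo "oversubscribed: children=$total parent=$parent"
        return 1
    fi
    echo "ok: children=$total parent=$parent"
}

# the rates from the example above: 60 + 70 + 80 = 210 > 100
check_child_rates 100 60 70 80 || true
```

With a parent rate of 300 and the same children the check passes, since 210 does not exceed 300.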

Using ceil

parent 1: classid 1:1 htb rate 300kbps
parent 1:1 classid 1:80 htb rate 60kbps ceil 100kbps
parent 1:1 classid 1:81 htb rate 70kbps ceil 100kbps
parent 1:1 classid 1:82 htb rate 80kbps ceil 100kbps

With the settings above, classes 80, 81 and 82 each get 9x kbps

 


tc on container node

 

Packet routes Diagram

venet0:0                venet0    eth0
CT ------------->------------- HN --------->-------- Remote

venet0:0                venet0    eth0
CT -------------<------------- HN ---------<-------- Remote

Limiting outgoing bandwidth

We can limit container outgoing bandwidth by setting the tc filter on eth0.

DEV=eth0
tc qdisc del dev $DEV root
tc qdisc add dev $DEV root handle 1: cbq avpkt 1000 bandwidth 100mbit
tc class add dev $DEV parent 1: classid 1:1 cbq rate 256kbit allot 1500 prio 5 bounded isolated
tc filter add dev $DEV parent 1: protocol ip prio 16 u32 match ip src X.X.X.X flowid 1:1
tc qdisc add dev $DEV parent 1:1 sfq perturb 10

Limiting incoming bandwidth

This can be done by setting the tc filter on venet0:

DEV=venet0
tc qdisc del dev $DEV root
tc qdisc add dev $DEV root handle 1: cbq avpkt 1000 bandwidth 100mbit
tc class add dev $DEV parent 1: classid 1:1 cbq rate 256kbit allot 1500 prio 5 bounded isolated
tc filter add dev $DEV parent 1: protocol ip prio 16 u32 match ip dst X.X.X.X flowid 1:1
tc qdisc add dev $DEV parent 1:1 sfq perturb 10

Limiting CT to HN talks

DEV=venet0
tc filter add dev $DEV parent 1: protocol ip prio 20 u32 match u32 1 0x0000 police rate 2kbit buffer 10k drop flowid :1

Limiting packets per second rate from container

DEV=eth0
iptables -I FORWARD 1 -o $DEV -s X.X.X.X -m limit --limit 200/sec -j ACCEPT
iptables -I FORWARD 2 -o $DEV -s X.X.X.X -j DROP

 

 


The Intermediate queueing device (IMQ)

 

 

 


Doc

 

Linux Advanced Routing & Traffic Control

 - http://lartc.org/howto/index.html

tc man page

 - http://lartc.org/manpages/tc.txt

 
