Linux Network Tuning (sysctl)

Last updated: 2019-06-13

 


Diagram

 

NIC
 - verify destination MAC (unless in promiscuous mode) and FCS
 - DMA packets into RAM (driver)
 - HW: hard IRQ
 - Software (driver): soft IRQ
 - rx/tx ring buffer (ethtool -g ethX)
------------------------------------ tcpdump
qdisc (ifconfig ethX)
------------------------------------ iptables
send/rcv buffer (tcp_wmem / tcp_rmem)
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
------------------------------------ sendmsg / epoll
Application

 


Frame

 

The preamble consists of a 56-bit (seven-byte) pattern of alternating 1 and 0 bits

The SFD (start frame delimiter) is the eight-bit (one-byte) pattern 10101011

...

FCS (frame check sequence)

error-detecting code added to a frame

In Ethernet: CRC32

Running the CRC32 algorithm over the whole received frame, FCS included, always leaves the residue 0xC704DD7B when the data has been received correctly.
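
The residue property can be checked with a short Python sketch. It assumes `zlib.crc32`, which uses the same CRC-32 polynomial but works on bit-reflected values and applies a final inversion, so it reports the constant as 0x2144DF1C. Undoing the inversion and reversing the bit order gives the 0xC704DD7B form quoted above. The payloads and helper names are made up for illustration.

```python
import zlib

def residue_zlib(payload: bytes) -> int:
    """CRC-32 over payload + its own checksum, as reported by zlib."""
    fcs = zlib.crc32(payload)
    # Appending the checksum in little-endian byte order is what makes
    # the result independent of the payload.
    frame = payload + fcs.to_bytes(4, "little")
    return zlib.crc32(frame)

def residue_register(payload: bytes) -> int:
    """Same residue as an MSB-first register value, before the final inversion."""
    raw = residue_zlib(payload) ^ 0xFFFFFFFF   # undo zlib's final inversion
    return int(f"{raw:032b}"[::-1], 2)         # reverse the 32 bits

for payload in (b"hello", bytes(range(64))):
    print(hex(residue_zlib(payload)), hex(residue_register(payload)))
```

A corrupted frame would give a different value, which is how the receiver detects the error without comparing the FCS field separately.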

 


TCP - Termination

 

* Termination is not a single handshake like establishment (a three-way handshake);
 it is a pair of two-way handshakes, one FIN/ACK exchange per direction.

A  --> FIN_WAIT_1 |            | FIN_WAIT_2 ||          | TIME-WAIT   | CLOSED
           fin    |    ack     |            ||    fin   |    ack      |
B                 | CLOSE-WAIT |            || LAST-ACK |             | CLOSED

FIN_WAIT_1: The local end-point has sent a connection termination request (FIN) to the remote end-point.

CLOSE-WAIT: The local end-point has received a connection termination request and acknowledged it.
            It now waits for the local application to close the socket.

FIN_WAIT_2: The local end-point has received the ACK for the FIN it sent in FIN_WAIT_1.
            It must now wait for the remote end-point to close.

LAST-ACK: The local end-point has performed a passive close (it has sent its own FIN) and waits for the final ACK.

TIME-WAIT: The local end-point has acknowledged the remote FIN and waits for twice
           the maximum segment lifetime (MSL) to pass before going to CLOSED.
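
The state descriptions above can be sketched as a small transition table in Python. This is a simplified model for illustration only: the event names are invented here, and a simultaneous close (the CLOSING state) is left out.

```python
# (state, event) -> next state, for the close sequence described above
TRANSITIONS = {
    ("ESTABLISHED", "send_fin"): "FIN_WAIT_1",    # active close starts
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",      # the final ACK is sent here
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",    # passive close, FIN is ACKed
    ("CLOSE_WAIT", "send_fin"): "LAST_ACK",
    ("LAST_ACK", "recv_ack"): "CLOSED",
}

def walk(events, state="ESTABLISHED"):
    """Return the list of states visited while applying the events in order."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

active = walk(["send_fin", "recv_ack", "recv_fin", "timeout_2msl"])
passive = walk(["recv_fin", "send_fin", "recv_ack"])
print(" -> ".join(active))
print(" -> ".join(passive))
```

Only the side that closes first passes through TIME_WAIT, which is why busy clients and proxies accumulate TIME-WAIT sockets.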

 

 


Autotuning setting: tcp_mem, tcp_rmem, tcp_wmem, tcp_max_tw_buckets

 

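# Three thresholds, counted in memory pages (not bytes): low, pressure, high
# 1: Below this number of pages, TCP does not regulate its memory use.
# 2: Above this number of pages, TCP enters memory pressure mode and moderates its memory use;
#      it stays in this mode until usage falls below the low threshold again.
# 3: The maximum number of pages all TCP sockets may use.
#      Above it, TCP streams and packets are dropped until memory usage falls again.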

net.ipv4.tcp_mem = 3097431 4129911 6194862
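
Because tcp_mem is counted in memory pages rather than bytes, the three numbers are easier to judge after conversion. The Python sketch below assumes a 4096-byte page, which is common on x86 but not universal (check with `getconf PAGESIZE`); the function name is made up for illustration.

```python
def tcp_mem_bytes(setting: str, page_size: int) -> dict:
    """Convert a 'low pressure high' tcp_mem string from pages to bytes."""
    low, pressure, high = (int(value) for value in setting.split())
    return {
        "low": low * page_size,
        "pressure": pressure * page_size,
        "high": high * page_size,
    }

PAGE_SIZE = 4096  # assumption: 4 KiB pages

limits = tcp_mem_bytes("3097431 4129911 6194862", PAGE_SIZE)
for name, size in limits.items():
    print(f"{name:8s} {size / 2**30:.1f} GiB")
```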

# 1: the minimum receive buffer for each TCP socket
# 2: the default receive buffer allocated for each TCP socket
# 3: the maximum receive buffer that autotuning may allocate for a TCP socket
# The third value does not override "/proc/sys/net/core/rmem_max",
# which caps buffers requested explicitly with SO_RCVBUF.

net.ipv4.tcp_rmem = 4096 87380 6291456

# Send buffer sizes for TCP sockets.
# Every TCP socket has this much buffer space to use before the buffer is filled up.
# 1: the minimum, available as soon as the socket is opened (4k)
# 2: the default
# 3: the maximum that autotuning may allocate; it does not override
#    "/proc/sys/net/core/wmem_max", which caps buffers requested with SO_SNDBUF

net.ipv4.tcp_wmem = 4096 65536 4194304

Remark

# 512 KiB (524288 bytes)

echo 'net.core.wmem_max=524288' >> /etc/sysctl.conf
echo 'net.core.rmem_max=524288' >> /etc/sysctl.conf

# The maximum number of sockets held in TIME-WAIT simultaneously.
# If this number is exceeded, the excess sockets are destroyed immediately and a warning is logged.

net.ipv4.tcp_max_tw_buckets = 262144
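
To see how close a host is to this limit, the current TIME-WAIT count can be read from the `tw` field of `/proc/net/sockstat`. The Python sketch below parses a sample of that file; the sample numbers are invented, and the function names are illustrative only.

```python
def time_wait_count(sockstat_text: str) -> int:
    """Return the 'tw' counter from the TCP line of /proc/net/sockstat."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]                    # name value name value ...
            stats = dict(zip(fields[0::2], map(int, fields[1::2])))
            return stats["tw"]
    raise ValueError("no TCP line found")

def tw_headroom(sockstat_text: str, max_tw_buckets: int) -> int:
    """Sockets that can still enter TIME-WAIT before the limit is hit."""
    return max_tw_buckets - time_wait_count(sockstat_text)

# Invented sample in the format of /proc/net/sockstat
SAMPLE = """sockets: used 291
TCP: inuse 12 orphan 0 tw 1840 alloc 15 mem 3
UDP: inuse 4 mem 2
"""

print(tw_headroom(SAMPLE, 262144))
```

On a live system the text would come from `open("/proc/net/sockstat").read()` instead of the sample.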

 


SYN cookies

 

# Once the SYN queue exceeds tcp_max_syn_backlog, fall back to SYN cookies

net.ipv4.tcp_syncookies  = 1

Other Tuning

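# tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12; leave it off
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse  = 1

# Default: 60 (seconds an orphaned connection may stay in FIN_WAIT_2)
net.ipv4.tcp_fin_timeout = 15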

net.ipv4.ip_local_port_range = 1024 65535
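
The port range matters mostly for hosts that open many outgoing connections. As a rough model, assume every closed connection to one destination IP and port holds its local port in TIME-WAIT for 60 seconds and ports are not reused early. The Python sketch below then gives an upper bound on new connections per second; the function names are made up, and real limits depend on tcp_tw_reuse and on how many destinations are involved.

```python
def usable_ports(port_range: str) -> int:
    """Number of local ports in an 'low high' ip_local_port_range string."""
    low, high = (int(value) for value in port_range.split())
    return high - low + 1

def max_new_connections_per_second(port_range: str, time_wait_seconds: int = 60) -> float:
    """Rough ceiling per destination if each port is blocked for time_wait_seconds."""
    return usable_ports(port_range) / time_wait_seconds

for setting in ("32768 60999", "1024 65535"):
    ports = usable_ports(setting)
    rate = max_new_connections_per_second(setting)
    print(f"{setting}: {ports} ports, about {rate:.0f} new connections/s")
```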

 


backlog setting

 

# Limit of the socket listen() backlog (Default: 128)
# A high net.core.somaxconn is often used to hide problems with a service:
# from the user's point of view a process stall then looks like a latency spike
# instead of an interrupted or timed-out connection.
# The real cause is usually slow processing of some requests
# (e.g. an insufficient number of worker threads/processes in the software).
# cat /proc/sys/net/core/somaxconn

net.core.somaxconn  = 65535 

# Maximum number of remembered connection requests (Default: 128)
# that have not yet received an acknowledgment from the connecting client.
# cat /proc/sys/net/ipv4/tcp_max_syn_backlog

net.ipv4.tcp_max_syn_backlog = 65535

# Maximum number of packets allowed to queue (Default: 300)
# when a particular interface receives packets faster than the kernel can process them.
# Use a high value for high-speed cards to prevent losing packets.
# (Data that waits long in the queue may be out of date.)
# cat /proc/sys/net/core/netdev_max_backlog

net.core.netdev_max_backlog  = 200000
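
Whether netdev_max_backlog is too small can be judged from `/proc/net/softnet_stat`, which has one line per CPU with hexadecimal counters. The second column counts packets dropped because the backlog queue was full. The Python sketch below parses a sample; the sample values are invented and shortened to four columns.

```python
def dropped_per_cpu(softnet_text: str) -> list:
    """Return the 'dropped' counter (second hex column) for each CPU line."""
    drops = []
    for line in softnet_text.splitlines():
        columns = line.split()
        if len(columns) >= 2:
            drops.append(int(columns[1], 16))
    return drops

# Invented sample: column 1 = processed, column 2 = dropped, column 3 = time_squeeze
SAMPLE = """\
0004a1b2 00000000 00000003 00000000
00039f10 0000001f 00000001 00000000
"""

for cpu, dropped in enumerate(dropped_per_cpu(SAMPLE)):
    print(f"cpu{cpu}: {dropped} packets dropped")
```

A drop counter that keeps growing suggests raising the value; a counter that stays at zero suggests the current setting is sufficient.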

 


By default, netfilter is in TCP “loose” mode

 

Loose mode allows ACK packets to create a new connection entry

Disable via cmd:

sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

 


SynProxy

 

Take advantage of the conntrack state “INVALID”

Drop invalid packets before they reach the LISTEN socket

# INPUT chain assumed here; adjust to your ruleset
iptables -A INPUT -m state --state INVALID -j DROP

 


Conntrack

 

Conntrack (lock-less) lookups are really fast

 - The problem is inserting and deleting conntrack entries (central lock)

Entries tuning

288 bytes * 2 million entries = 576.0 MB

sysctl -w net/netfilter/nf_conntrack_max=2000000

echo 2000000 > /sys/module/nf_conntrack/parameters/hashsize
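
The memory estimate above can be reproduced with a few lines of Python. The 288-byte entry size is the figure quoted in these notes and varies between kernel versions; the 8 bytes per hash bucket is an assumption (one pointer on a 64-bit system).

```python
ENTRY_BYTES = 288     # per-entry size quoted above; varies by kernel
BUCKET_BYTES = 8      # assumption: one 8-byte pointer per hash bucket

def conntrack_memory(max_entries: int, hash_buckets: int) -> tuple:
    """Return (bytes used by entries, bytes used by the hash table)."""
    return max_entries * ENTRY_BYTES, hash_buckets * BUCKET_BYTES

entries, table = conntrack_memory(max_entries=2_000_000, hash_buckets=2_000_000)
print(f"entries:    {entries / 1e6:.1f} MB")
print(f"hash table: {table / 1e6:.1f} MB")
```

With hashsize equal to nf_conntrack_max, a full table averages about one entry per bucket, which keeps lookups short at the cost of the extra table memory.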

 


timestamping

 

Timestamps are an optional addition to the TCP layer to provide information on round-trip times and to help with sequencing

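# 1: (with tcp_tw_recycle enabled) If the timestamps in packets from the peer jump around or go backwards, the server will not reply.
# The server treats a packet whose timestamp "goes backwards" as retransmitted data from a recycled TIME-WAIT connection, not as a new request,
# so it drops the packet without replying, and the SYN never gets a SYN-ACK.
# Within 60 s (the TIME-WAIT period), the timestamps of connections from the same source IP must be increasing.

# 0: TCP TS values are not sent.
# With TCP timestamps disabled, enabling tcp_tw_recycle has no effect;
# TCP timestamps, however, can be enabled and work on their own.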
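
The per-source-IP check described in the comments above explains why clients behind one NAT address fail at random. The Python sketch below is a simplified model, not the kernel's code: the class name is invented, and 32-bit timestamp wrap-around is ignored.

```python
TIME_WAIT_SECONDS = 60

class PerHostTimestampCheck:
    """Reject SYNs whose timestamp goes backwards for the same source IP."""

    def __init__(self):
        self._last = {}   # source IP -> (timestamp value, time it was seen)

    def accept_syn(self, src_ip: str, ts_val: int, now: float) -> bool:
        last = self._last.get(src_ip)
        if last is not None:
            last_ts, seen_at = last
            if now - seen_at < TIME_WAIT_SECONDS and ts_val < last_ts:
                return False          # treated as an old duplicate: no SYN-ACK
        self._last[src_ip] = (ts_val, now)
        return True

check = PerHostTimestampCheck()
# Two clients behind one NAT address, each with its own timestamp clock
print(check.accept_syn("203.0.113.7", ts_val=500_000, now=0.0))    # client A
print(check.accept_syn("203.0.113.7", ts_val=120_000, now=1.0))    # client B, smaller clock
print(check.accept_syn("203.0.113.7", ts_val=120_500, now=61.5))   # client B, after 60 s
```

Client B is ignored only because client A, sharing the same public address, happened to have a larger timestamp value.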

Check TCP timestamping

cat /proc/sys/net/ipv4/tcp_timestamps         # Default: 1

Status

netstat -s | grep timestamp

        timestamp request: X
        timestamp replies: X
    X packets rejects in established connections because of timestamp

The downside of TCP timestamps is that adversaries can remotely calculate the system uptime and boot time of the machine and the host's clock down to millisecond precision.

 


tcp_mtu_probing

 

Finds the path MTU between your client and your server using the "Path MTU discovery mechanism"
(probing can raise the packet size from a conservative starting MSS; it cannot exceed the interface MTU, 1500 by default on Linux)

ip link show br0

# 0: Disabled (Default)
# 1: Disabled by default, enabled when an ICMP black hole is detected
# 2: Always enabled, using an initial MSS of tcp_base_mss

net.ipv4.tcp_mtu_probing = 2

Checking

cat /proc/sys/net/ipv4/tcp_mtu_probing

man 7 tcp

 


Disable ICMP Timestamps

 

ICMP timestamps need to be blocked with the firewall.

 * They cannot be disabled in the kernel

 

 


TCP Congestion Control

 

Cubic

Designed for high-speed networks; it performs better than Hybla under high latency and low error rates.

Hybla

Optimizes channel throughput in heterogeneous networks that incorporate satellite links.

sysctl net.ipv4.tcp_available_congestion_control

net.ipv4.tcp_available_congestion_control = cubic reno

modprobe tcp_hybla

lsmod | grep hybla

/etc/sysctl.conf

# Default: cubic
net.ipv4.tcp_congestion_control = hybla

sysctl -p

# Debian 9 (the module is named tcp_hybla, as in the modprobe call)

echo tcp_hybla >> /etc/modules-load.d/modules.conf

# Checking

cat /proc/sys/net/ipv4/tcp_congestion_control

 


Full conn hashlimit trick

 

An attacker needs many real hosts to reach the full-connection scalability limit

 - Fixed: htable-size 2097152 * 8 bytes = 16.8 MB

 - Variable: entry size 104 bytes * 500000 = 52 MB

iptables -t raw -A PREROUTING -i $DEV \
         -p tcp -m tcp --dport 80 --syn \
         -m hashlimit \
         --hashlimit-above 200/sec --hashlimit-burst 1000 \
         --hashlimit-mode srcip --hashlimit-name syn \
         --hashlimit-htable-size 2097152 \
         --hashlimit-srcmask 24 -j DROP
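
The effect of the rule can be approximated with a token bucket per /24 source network. The Python sketch below is a simplified model and not the algorithm used by the hashlimit match itself: the class name is invented, and only the rate (200/sec), burst (1000) and srcmask (24) are taken from the rule above.

```python
import ipaddress

class SynLimiter:
    """Token bucket per source network; a False result corresponds to DROP."""

    def __init__(self, rate: float = 200.0, burst: int = 1000, srcmask: int = 24):
        self.rate, self.burst, self.srcmask = rate, burst, srcmask
        self._buckets = {}   # network -> (tokens left, time of last update)

    def allow(self, src_ip: str, now: float) -> bool:
        # Group sources by network, like --hashlimit-srcmask
        net = ipaddress.ip_network(f"{src_ip}/{self.srcmask}", strict=False)
        tokens, last = self._buckets.get(net, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self._buckets[net] = (tokens, now)
        return allowed

limiter = SynLimiter()
passed = sum(limiter.allow("198.51.100.10", now=0.0) for _ in range(1200))
print(passed)                                     # SYNs let through from the burst
print(limiter.allow("198.51.100.99", now=0.0))    # same /24, bucket is empty
print(limiter.allow("192.0.2.1", now=0.0))        # different /24, own bucket
```

Because the bucket is shared per /24, one noisy host can also block its neighbours in the same network; that is the trade-off for a smaller table.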

 


RST

 

When an unexpected TCP packet arrives at a host, that host usually responds by sending a reset packet back on the same connection.

A reset packet is simply one with no payload and with the RST bit set in the TCP header flags.
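
To make the header layout concrete, the Python sketch below packs a 20-byte TCP header with only the RST flag set. It is an illustration of the field layout, not a way to send packets: the checksum is left at zero (a real one needs the IP pseudo-header), and the ports and sequence number are arbitrary.

```python
import struct

RST = 0x04   # RST bit in the TCP flags byte

def tcp_reset_header(src_port: int, dst_port: int, seq: int) -> bytes:
    """Build a bare TCP header (no options, no payload) with the RST flag set."""
    offset_and_flags = (5 << 12) | RST   # data offset: 5 x 32-bit words
    return struct.pack(
        "!HHIIHHHH",
        src_port, dst_port,
        seq,                 # sequence number
        0,                   # acknowledgment number
        offset_and_flags,
        0,                   # window
        0,                   # checksum (not computed in this sketch)
        0,                   # urgent pointer
    )

header = tcp_reset_header(src_port=80, dst_port=40000, seq=1000)
print(len(header), hex(header[13]))
```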