Last updated: 2019-06-13
Contents
- Diagram
- Frame
- TCP - Termination
- Autotuning settings: tcp_mem, tcp_wmem, tcp_max_tw_buckets
- SYN cookies
- Backlog setting
- Default netfilter is in TCP “loose” mode
- SynProxy
- Conntrack
- Timestamping
- tcp_mtu_probing
- TCP Congestion Control
- Full conn hashlimit trick
- RST
Diagram
NIC - verify MAC (if not in promiscuous mode) & FCS - DMA packets to RAM (driver)
HW: hard IRQ
Software (driver): soft IRQ
rx/tx buffer ring (ethtool -g ethX)
------------------------------------
tcpdump
qdisc (ifconfig ethX)
------------------------------------
iptables
send/rcv buffer (tcp_wmem / tcp_rmem)
    sysctl net.ipv4.tcp_rmem
    sysctl net.ipv4.tcp_wmem
------------------------------------
sendmsg / epoll
Application
Frame
The preamble consists of a 56-bit (seven-byte) pattern of alternating 1 and 0 bits.
The SFD is the eight-bit (one-byte) value [10101011].
...
FCS (frame check sequence)
error-detecting code added to a frame
In Ethernet: CRC32
The algorithm's result is always the CRC32 residue 0xC704DD7B when the data has been received correctly.
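A quick way to see whether incoming frames are failing the FCS check is the NIC's RX error counters (eth0 is a placeholder; ethtool counter names vary by driver):
ip -s link show eth0             # the RX "errors" column includes CRC failures
ethtool -S eth0 | grep -i crc    # driver-specific counters, e.g. rx_crc_errors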
TCP - Termination
* Termination isn't a single four-way handshake the way establishment is a three-way handshake
(it is a pair of two-way handshakes: each side sends its own FIN and receives an ACK for it.)
A                                    B
FIN_WAIT_1   ---- fin --->           CLOSE_WAIT
FIN_WAIT_2   <--- ack ----
             <--- fin ----           LAST_ACK
TIME_WAIT    ---- ack --->           CLOSED
CLOSED (after 2*MSL)
FIN_WAIT_1: The local end-point has sent a connection termination request (FIN) to the remote end-point.
CLOSE_WAIT: The local end-point has received a connection termination request and acknowledged it.
FIN_WAIT_2: The local end-point has received the ACK for its FIN (the one sent in FIN_WAIT_1).
It must now wait for the remote end-point to close.
LAST_ACK: The local end-point has performed a passive close and sent its own FIN; it now waits for the final ACK.
TIME_WAIT: The local end-point sends the final ACK, goes into TIME_WAIT, and after some time into CLOSED.
(The local end-point waits for twice the maximum segment lifetime (2*MSL) to pass before going to CLOSED)
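These states can be watched live with ss from iproute2 (a quick sanity check; filter names follow ss's lowercase convention):
ss -tan state fin-wait-1
ss -tan state close-wait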
Autotuning settings: tcp_mem, tcp_wmem, tcp_max_tw_buckets
# Units: pages (typically 4 KiB)
# 1: below this, TCP does not regulate its memory usage
# 2: above this, the so-called memory pressure mode kicks in and continues
#    until usage drops below the low threshold again, at which point
#    the default behaviour of the low threshold resumes
# 3: above this, TCP streams and packets start getting dropped
#    until memory usage drops again
net.ipv4.tcp_mem = 3097431 4129911 6194862
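Current usage can be compared against these thresholds via /proc/net/sockstat, whose mem field is also counted in pages:
cat /proc/net/sockstat
# TCP: inuse 5 orphan 0 tw 2 alloc 7 mem 3   (illustrative output)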
# The first value tells the kernel the minimum receive buffer for each TCP connection
# The second value tells the kernel the default receive buffer allocated for each TCP socket
# 3: the maximum receive buffer that can be allocated for a TCP socket
#    (note it does not override "/proc/sys/net/core/rmem_max",
#     which instead caps buffers set explicitly with SO_RCVBUF)
net.ipv4.tcp_rmem = 4096 87380 6291456
# Per-socket send buffer sizes for all types of connections.
# Every TCP socket has this much buffer space to use before the buffer fills up.
# 1: buffer each socket gets as soon as it is opened (4 KiB)
# 2: default send buffer
# 3: maximum send buffer; note it does not override "/proc/sys/net/core/wmem_max",
#    which instead caps buffers set explicitly with SO_SNDBUF
net.ipv4.tcp_wmem = 4096 65536 4194304
Remark
# 512 KiB
echo 'net.core.wmem_max=524288' >> /etc/sysctl.conf
echo 'net.core.rmem_max=524288' >> /etc/sysctl.conf
# Tells the system the maximum number of sockets in TIME-WAIT to be held simultaneously.
# If this number is exceeded, the excess sockets are destroyed and a warning is printed.
net.ipv4.tcp_max_tw_buckets = 262144
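The current TIME-WAIT population can be checked against this limit with the ss summary:
ss -s    # the "timewait" figure in the TCP line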
SYN cookies
# Use SYN cookies once the SYN queue exceeds tcp_max_syn_backlog
net.ipv4.tcp_syncookies = 1
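Whether cookies are actually being used shows up in the SNMP counters (nstat is part of iproute2; the counters live in /proc/net/netstat):
nstat -az TcpExtSyncookiesSent TcpExtSyncookiesRecv TcpExtSyncookiesFailed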
Other Tuning
# tcp_tw_recycle is dangerous behind NAT and was removed entirely in Linux 4.12; keep it off
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 1
# Default 60
net.ipv4.tcp_fin_timeout = 15
net.ipv4.ip_local_port_range = 1024 65535
Backlog setting
# Upper limit on the socket listen() backlog (Default: 128)
# A high net.core.somaxconn can hide problems with a service:
# from the user's point of view a process stall looks like a latency spike
# instead of an interrupted or timed-out connection.
# The real cause is usually slow processing of some requests
# (e.g. an insufficient number of worker threads/processes in the software).
# cat /proc/sys/net/core/somaxconn
net.core.somaxconn = 65535
# Maximum number of remembered connection requests (Default: 128)
# that have not yet received an acknowledgment from the connecting client.
# cat /proc/sys/net/ipv4/tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 65535
# Maximum number of packets allowed to queue (Default: 300)
# when an interface receives packets faster than the kernel can process them.
# Use a high value for high-speed cards to prevent losing packets
# (although data sitting in the queue may become stale).
# cat /proc/sys/net/core/netdev_max_backlog
net.core.netdev_max_backlog = 65535
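Accept-queue trouble is visible without any tuning: the SNMP counters record overflows, and for LISTEN sockets ss shows the current queue length (Recv-Q) against the configured backlog (Send-Q):
netstat -s | grep -i listen    # e.g. "N times the listen queue of a socket overflowed"
ss -lnt                        # Recv-Q vs Send-Q on LISTEN sockets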
Default netfilter is in TCP “loose” mode
Allows ACK packets to create new connections
Disable it with:
sysctl -w net/netfilter/nf_conntrack_tcp_loose=0
SynProxy
Takes advantage of the conntrack state "INVALID"
Drops invalid packets before they reach the LISTEN socket
iptables -m state --state INVALID -j DROP
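A minimal SYNPROXY sketch for port 80 (assumes the SYNPROXY target is available and nf_conntrack_tcp_loose=0 as above; the TCP options must match what the protected service would negotiate):
iptables -t raw -A PREROUTING -p tcp --dport 80 --syn -j CT --notrack
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate INVALID,UNTRACKED \
    -j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460
iptables -A INPUT -m conntrack --ctstate INVALID -j DROP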
Conntrack
Conntrack (lock-less) lookups are really fast
- The problem is inserting and deleting conntracks (central lock)
Entries tuning
288 bytes * 2 million entries = 576.0 MB
sysctl -w net.netfilter.nf_conntrack_max=2000000
echo 2000000 > /sys/module/nf_conntrack/parameters/hashsize
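Current occupancy vs. the configured limits:
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
cat /sys/module/nf_conntrack/parameters/hashsize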
Timestamping
Timestamps are an optional addition to the TCP layer to provide information on round-trip times and to help with sequencing
(https://www.rfc-editor.org/rfc/rfc1323)
The timestamps are used for two distinct mechanisms:
- RTTM (Round Trip Time Measurement) and
- PAWS (Protect Against Wrapped Sequences)
The downside of TCP timestamps is that adversaries can remotely calculate the system's uptime
and boot time, and observe the host's clock, down to millisecond precision.
# 1: If the peer's packets carry timestamps that jump around or lag behind, the server will not reply.
# The server treats a packet whose timestamp has "gone backwards" as a retransmission
# belonging to a recycled TIME_WAIT connection rather than as a new request,
# so it drops the packet without replying: SYNs get no SYN-ACK.
# Within the 60s TIME_WAIT period, timestamps on connections from the same source IP must be increasing.
# 0: The TCP TS value is not sent at all.
# With TCP timestamps disabled, enabling tcp_tw_recycle has no effect;
# TCP timestamps, however, can be enabled on their own and still work.
Check TCP timestamping
cat /proc/sys/net/ipv4/tcp_timestamps # Default: 1
0) Disable
1) Enable timestamps as defined in RFC1323 and
use random offset for each connection rather than only using the current time.
2) Like 1, but without random offsets.
Config
sysctl.conf
# Disable the TCP timestamp response on Linux
net.ipv4.tcp_timestamps = 0
Checking
# nmap to check for TCP timestamps
nmap -d -v -O server.domain.com
Status
netstat -s | grep timestamp
timestamp requests: X
timestamp replies: X
X packets rejects in established connections because of timestamp
Disable ICMP Timestamps
ICMP timestamps need to be blocked with the firewall.
* They cannot be disabled in the kernel.
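A hedged iptables example (ICMP types 13/14 are timestamp request/reply):
iptables -A INPUT  -p icmp --icmp-type timestamp-request -j DROP
iptables -A OUTPUT -p icmp --icmp-type timestamp-reply   -j DROP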
tcp_mtu_probing
Finds the MTU between your client and your server using the "Path MTU discovery" mechanism
(packetization-layer probing per RFC 4821; the result can end up above the Linux default MTU of 1500)
ip link show br0
# Default: 0
# 1: Disabled by default, enabled when an ICMP black hole is detected
# 2: Always enabled, use initial MSS of tcp_base_mss.
net.ipv4.tcp_mtu_probing = 2
Checking
cat /proc/sys/net/ipv4/tcp_mtu_probing
man 7 tcp
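The path MTU toward a host can also be probed by hand (server.domain.com as used above; 1472 bytes of payload + 28 bytes of IP/ICMP headers = 1500):
tracepath -n server.domain.com            # reports the discovered pmtu
ping -M do -s 1472 server.domain.com      # fails if a 1500-byte packet can't pass unfragmented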
TCP Congestion Control
Cubic
Designed for high-speed networks; better than Hybla under high latency and low error rates.
Hybla
Optimizes channel throughput in heterogeneous networks incorporating satellite links.
sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = cubic reno
modprobe tcp_hybla
lsmod | grep hybla
/etc/sysctl.conf
# Default: cubic
net.ipv4.tcp_congestion_control = hybla
sysctl -p
# Debian 9
echo hybla >> /etc/modules-load.d/modules.conf
# Checking
cat /proc/sys/net/ipv4/tcp_congestion_control
Full conn hashlimit trick
An attacker needs many real hosts to reach the full-conn scalability limit
- Fixed: htable-size 2097152 * 8 bytes = 16.7 MB
- Variable: entry size 104 bytes * 500000 = 52 MB
iptables -t raw -A PREROUTING -i $DEV \
    -p tcp -m tcp --dport 80 --syn \
    -m hashlimit \
    --hashlimit-above 200/sec --hashlimit-burst 1000 \
    --hashlimit-mode srcip --hashlimit-name syn \
    --hashlimit-htable-size 2097152 \
    --hashlimit-srcmask 24 -j DROP
RST
When an unexpected TCP packet arrives at a host,
that host usually responds by sending a reset packet back on the same connection.
A reset packet is simply one with no payload and with the RST bit set in the TCP header flags.
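RSTs on the wire can be watched with tcpdump's flag filter (eth0 is a placeholder):
tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0'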