M05 - TCP Internals

NETWORKING MASTERY · PHASE 2 · MODULE 05 · WEEKS 4–5

⚡ TCP Internals

3-way handshake · State machine · Sequence numbers · Flow control · Congestion control · SACK · Timers

Beginner → Intermediate · Prerequisite: M03 IPv4 · RFC 793 + RFC 9293 · Most Critical Transport Protocol · 3 Labs

TCP — RELIABLE, ORDERED, BIDIRECTIONAL BYTE STREAMS

📡

What TCP Guarantees — and What It Doesn't

OVERVIEW

TCP (Transmission Control Protocol, RFC 793 / RFC 9293) is Layer 4's workhorse. It takes IP's unreliable, unordered packet delivery and builds a reliable, ordered, bidirectional byte stream on top of it. Most major application protocols (HTTP/1.1 and HTTP/2, HTTPS, SSH, SMTP, FTP) run over TCP because reliability matters more than raw speed for those use cases.

What TCP guarantees:

Reliability — every byte sent will be received, or the sender will know it failed. If a packet is lost, TCP detects it and retransmits automatically
Ordering — bytes arrive in the same order they were sent, even if packets arrive out of order in transit
No duplication — TCP detects and discards duplicate packets
Error detection — checksum on every segment
Flow control — sender doesn't overwhelm receiver's buffer
Congestion control — sender adapts to network capacity, doesn't collapse the network

What TCP does NOT guarantee:

Timing / latency — retransmissions add unpredictable delay
Bandwidth — TCP adapts to available capacity, never reserves it
Message boundaries — TCP is a stream, not a message protocol. If you send "Hello" and "World" as two separate write() calls, the receiver may get "HelloWorld" in one read() or "He" and "lloWorld" in two. Applications must implement their own framing
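Because TCP delivers a byte stream, the application has to mark where each message ends. The sketch below shows one common approach, a 4-byte length prefix. The names `frame` and `FrameDecoder` are illustrative assumptions, not part of any library.

```python
import struct

def frame(message: bytes) -> bytes:
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(message)) + message

class FrameDecoder:
    """Reassembles length-prefixed messages from arbitrary stream chunks."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add bytes from one read() and return every complete message."""
        self._buf.extend(chunk)
        messages = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack("!I", self._buf[:4])
            if len(self._buf) < 4 + length:
                break  # message not complete yet, wait for more bytes
            messages.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return messages
```

The decoder returns the same messages no matter how the stream is split across reads, which is exactly the property raw TCP does not give you.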

📞 Analogy — A Phone Call vs Postcards

UDP is like sending postcards — you write one, drop it in the postbox, and hope it arrives. No confirmation, no order guarantee, no retry. TCP is like a phone call: first you establish the call (3-way handshake), then both parties speak in turn and confirm they heard each other ("uh-huh, go on"), and if one side goes silent the other says "hello? are you still there?" (keepalive). When the call ends, both sides say goodbye properly (4-way teardown). This setup and teardown overhead is why TCP is slower for small one-shot queries — but the reliability is worth it for file transfers, web pages, and anything where missing data is unacceptable.

⚖️

TCP vs UDP — When to Use Which

COMPARISON

Property	TCP	UDP
Connection	Connection-oriented (3-way handshake)	Connectionless — fire and forget
Reliability	Guaranteed delivery + retransmission	Best-effort — no retransmission
Ordering	In-order delivery guaranteed	Packets may arrive out of order
Speed	Slower — overhead for reliability	Faster — minimal overhead
Header size	20–60 bytes	8 bytes
Flow control	Yes — sliding window	No
Congestion control	Yes — reduces sending rate under congestion	No — keeps sending regardless
Use cases	HTTP/HTTPS, SSH, SMTP, FTP, database	DNS, VoIP, video streaming, gaming, QUIC

💡 NGFW relevance: TCP is the dominant protocol for web traffic (HTTP/HTTPS), management traffic (SSH), and email (SMTP). Your NGFW must maintain connection state for every TCP session — tracking sequence numbers, connection phase (handshake/established/closing), and detecting anomalies. UDP sessions are tracked differently (timeout-based, no handshake state). Understanding TCP deeply is essential for building correct stateful inspection.

TCP HEADER — 20 BYTES MINIMUM, UP TO 60 BYTES WITH OPTIONS

Row	Field	Size
1	Source Port	16 bits
1	Destination Port	16 bits
2	Sequence Number	32 bits
3	Acknowledgement Number	32 bits
4	Data Offset	4 bits
4	Reserved (Res)	4 bits
4	Flags: CWR ECE URG ACK PSH RST SYN FIN	8 bits
4	Window Size	16 bits
5	Checksum	16 bits
5	Urgent Pointer	16 bits
6+	Options (if Data Offset > 5) + Padding	0–40 bytes — MSS, SACK, Timestamps, Window Scale

🔍

Every Field Explained

FIELD REFERENCE

Source Port and Destination Port (16 bits each)

Port numbers identify the application on each end. Combined with IP addresses, they form the 5-tuple that uniquely identifies a TCP connection: (src_ip, src_port, dst_ip, dst_port, protocol=TCP). Well-known ports: 80=HTTP, 443=HTTPS, 22=SSH, 25=SMTP, 53=DNS-TCP, 3306=MySQL. The client uses an ephemeral port chosen by the OS (IANA range 49152–65535; Linux defaults to 32768–60999).

Sequence Number (32 bits)

Identifies the position of the first byte of data in this segment within the entire byte stream. The sequence number space is 0 to 2³²−1 (wraps around). The Initial Sequence Number (ISN) is chosen randomly at connection setup — not starting at 0 — to prevent stale segments from old connections being confused with new ones. In a SYN segment, the sequence number is the ISN itself (no data yet).

Acknowledgement Number (32 bits)

The sequence number of the next byte the receiver expects from the sender. This acknowledges all bytes up to (but not including) this number. For example, if the receiver has successfully received bytes 0–999, it sends ACK=1000 meaning "I have everything up to 999, send me 1000 next". ACK is only valid when the ACK flag is set.

Data Offset (4 bits)

TCP header length in 32-bit words — same concept as IPv4's IHL. Minimum 5 (20 bytes). Maximum 15 (60 bytes). Tells the receiver where the payload data starts: data_offset_bytes = data_offset × 4.

TCP Flags (8 bits) — The Most Important Field for NGFW

CWR

Congestion Window Reduced — ECN response

ECE

ECN Echo — congestion signal received

URG

Urgent Pointer is valid (rarely used)

ACK

ACK number is valid — set on all except initial SYN

PSH

Push — receiver should flush buffer to app immediately

RST

Reset — abortive connection close

SYN

Synchronise — connection initiation

FIN

Finish — orderly connection close

Flag combinations reveal connection phase: SYN only = new connection attempt; SYN+ACK = server accepting; ACK only = data transfer; FIN+ACK = graceful close; RST = abort. Your NGFW inspects these flags to track connection state in its connection table.
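The fixed 20-byte header described above can be decoded with a single `struct.unpack` call. This is a minimal sketch: the function name and the dictionary keys are assumptions chosen for illustration, and it does not verify the checksum.

```python
import struct

# Flag names indexed by bit position in the flags byte (bit 0 = FIN ... bit 7 = CWR)
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed TCP header and split off options and payload."""
    if len(segment) < 20:
        raise ValueError("TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack, off_res, flags,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (off_res >> 4) * 4          # Data Offset is in 32-bit words
    if header_len < 20 or header_len > len(segment):
        raise ValueError("invalid Data Offset")
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": header_len,
        "flags": {n for i, n in enumerate(FLAG_NAMES) if flags & (1 << i)},
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],
        "payload": segment[header_len:],
    }
```

A result whose flags are exactly `{"SYN"}` is a new connection attempt; `{"SYN", "ACK"}` is the server's reply.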

Window Size (16 bits)

Advertises how many bytes the receiver can accept in its buffer right now. This is the foundation of TCP flow control: the sender must not send more unacknowledged data than the receiver's window allows. The Window Scale option left-shifts this value by up to 14 bits (×16,384, giving windows up to about 1 GiB) for high-bandwidth links. We cover this in the Flow Control tab.

Checksum (16 bits)

Computed over a "pseudo-header" (IP src, IP dst, Protocol=6, TCP length) plus the entire TCP header and payload. Detects corruption. The pseudo-header inclusion means the checksum also validates that the segment reached the correct destination IP — no mis-delivery.
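The checksum is the standard 16-bit one's-complement Internet checksum. The sketch below computes it over the IPv4 pseudo-header plus the segment; the function names are assumptions, and the IP addresses used to call it are up to the caller.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header followed by the TCP segment."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(segment)))
    return internet_checksum(pseudo + segment)
```

To build a segment, compute the value with the Checksum field set to zero and write it into bytes 16–17. A receiver that recomputes over a correctly checksummed segment gets 0, and the result changes if the segment is delivered to a different IP address.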

Key TCP Options

Option	Kind	Purpose	NGFW Impact
MSS	2	Maximum Segment Size — largest payload sender will send	NGFW can reduce MSS to avoid fragmentation (MSS clamping)
Window Scale	3	Multiplier for Window Size (2^scale, scale ≤ 14, so up to ×16,384)	Must track for correct window calculation
SACK Permitted	4	Signals both sides support Selective ACK	Signals need to track SACK blocks
SACK	5	Reports which out-of-order blocks were received	Must parse for correct retransmit tracking
Timestamps	8	RTT measurement + PAWS (protect against wrapped seqs)	Used for RTT monitoring in NGFW analytics
TFO (Fast Open)	34	Carries a cookie so data can ride in the SYN (saves one RTT on repeat connections)	NGFW must parse data-in-SYN for DPI
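All of these options share a kind/length/value layout, except for the single-byte End of Option List (kind 0) and No-Operation (kind 1). A minimal sketch of a walker over the options bytes follows; the function name is an assumption.

```python
def parse_tcp_options(options: bytes) -> list:
    """Return (kind, value_bytes) pairs from the TCP options area."""
    parsed = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation, used as padding
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated option")
        length = options[i + 1]  # length covers the kind and length bytes too
        if length < 2 or i + length > len(options):
            raise ValueError("bad option length")
        parsed.append((kind, options[i + 2:i + length]))
        i += length
    return parsed
```

For a typical SYN, the result contains kind 2 with a 2-byte MSS, kind 3 with a 1-byte shift count, and kind 4 with an empty value.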

THE THREE-WAY HANDSHAKE — CONNECTION ESTABLISHMENT

🤝

Why Three Steps?

CONCEPT

A TCP connection needs both sides to agree on two things before data can flow: (1) the connection exists, and (2) both sides know each other's initial sequence numbers (ISN) so they can properly track bytes. The three-way handshake achieves both with the minimum number of round trips.

Two steps (SYN → SYN+ACK) would confirm the client's ISN to both sides, but the server would never learn whether the client received its SYN+ACK, so the server's ISN would be unconfirmed. The third step (the client's ACK) closes that gap: both sides have now exchanged and acknowledged ISNs, establishing a reliable bidirectional channel.

📊

The Handshake Step by Step

SEQUENCE DIAGRAM

Client                          Server
SYN seq=x (ISN)            ▶    SYN received
SYN+ACK received           ◀    SYN+ACK seq=y ack=x+1
ACK seq=x+1 ack=y+1        ▶    ACK received
ESTABLISHED ✓                   ESTABLISHED ✓

/* Step 1 — Client sends SYN */
Flags:  SYN
Seq:    x        # randomly chosen ISN — e.g. 1,000,000
Ack:    0        # ACK flag not set — nothing to ack yet
Options: MSS=1460, SACK permitted, Window Scale=7, Timestamps

/* Step 2 — Server sends SYN+ACK */
Flags:  SYN, ACK
Seq:    y        # server's own randomly chosen ISN — e.g. 5,000,000
Ack:    x+1      # "I received your SYN (which consumed 1 seq byte), send me x+1 next"
Options: MSS=1460, SACK permitted, Window Scale=9, Timestamps

/* Step 3 — Client sends ACK */
Flags:  ACK
Seq:    x+1      # client's next byte
Ack:    y+1      # "I received your SYN, send me y+1 next"
# Connection is now ESTABLISHED on both sides
# Client may already include data in this segment; sending data in the SYN itself requires TCP Fast Open

💡 Why random ISN? If ISN always started at 0, an attacker could inject forged segments into an existing connection: they only need to guess the current sequence number, which is easy if it started from 0. A random 32-bit ISN forces an off-path attacker to guess blindly, which makes forging in-window segments far harder.

⚠️

SYN Flood Attack and SYN Cookies

SECURITY

A SYN flood is one of the oldest and most effective DoS attacks. The attacker sends thousands of SYN packets with spoofed source IPs. The server allocates state for each half-open connection, waiting for the final ACK that never comes. Eventually, the server's connection table fills up and it can't accept legitimate connections.

SYN Cookies (RFC 4987) solve this: instead of allocating state on SYN receipt, the server encodes the connection parameters (MSS, timestamp, etc.) into the initial sequence number (ISN) of the SYN+ACK. The state is "stored" in the sequence number itself. When the final ACK arrives, the server decodes the parameters from the ACK number (cookie + 1) and allocates state only then. No state is allocated for connections that never complete, so a SYN flood can no longer exhaust the connection table (it still consumes bandwidth and CPU).

# Check if SYN cookies are enabled on Linux
cat /proc/sys/net/ipv4/tcp_syncookies
# 0 = disabled, 1 = enabled when backlog full, 2 = always enabled

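# Enable until the next reboot
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
# Enable permanently: add "net.ipv4.tcp_syncookies = 1" to /etc/sysctl.conf, then run sysctl -p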

# NGFW-level SYN flood protection
# Rate-limit SYN packets per source IP per second
# Drop SYN packets exceeding threshold (e.g., >100 SYN/sec from one IP)
# TCP proxy: NGFW completes handshake on behalf of server, only forwards verified connections
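The idea of storing state in the sequence number can be sketched as follows. This is a simplified illustration, not Linux's actual cookie layout: the bit split (5-bit time counter, 3-bit MSS index, 24-bit MAC), the `MSS_TABLE` values, the `SECRET` key and both function names are assumptions made for the example.

```python
import hashlib
import hmac

MSS_TABLE = [536, 1300, 1440, 1460]   # illustrative MSS values, indexed by 3 bits
SECRET = b"demo-secret"               # hypothetical; a real server uses a random key

def _mac24(src, sport, dst, dport, client_isn, t):
    """24-bit keyed hash binding the cookie to this connection and time slot."""
    msg = f"{src}:{sport}>{dst}:{dport}|{client_isn}|{t}".encode()
    return int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest()[:3], "big")

def make_cookie(src, sport, dst, dport, client_isn, mss, t):
    """Server ISN for the SYN+ACK: time counter, MSS index and MAC packed in 32 bits."""
    idx = max((i for i, m in enumerate(MSS_TABLE) if m <= mss), default=0)
    t &= 0x1F
    return (t << 27) | (idx << 24) | _mac24(src, sport, dst, dport, client_isn, t)

def check_cookie(src, sport, dst, dport, client_isn, ack, now_t):
    """Validate the final ACK; return the recovered MSS, or None if invalid."""
    cookie = (ack - 1) & 0xFFFFFFFF          # client ACKs server ISN + 1
    t, idx, mac = cookie >> 27, (cookie >> 24) & 0x7, cookie & 0xFFFFFF
    if ((now_t - t) & 0x1F) > 1:             # cookie older than one time slot
        return None
    if idx >= len(MSS_TABLE):
        return None
    if mac != _mac24(src, sport, dst, dport, client_isn, t):
        return None
    return MSS_TABLE[idx]
```

The server keeps nothing between the SYN and the final ACK; everything it needs is recomputed from the packet itself, which is why options that do not fit in the cookie are lost.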

TCP STATE MACHINE — 11 STATES, EVERY TRANSITION

🔄

TCP States — What Each Means

STATE MACHINE

A TCP connection moves through a well-defined sequence of states. Your NGFW must track the state of every TCP connection in its connection table — this is the essence of "stateful inspection". A packet that doesn't match expected state transitions is suspicious or malicious.

CLOSED

No connection. Initial and final state. No resources allocated.

LISTEN

Server waiting for incoming SYN. Socket bound and listening.

SYN_SENT

Client sent SYN, waiting for SYN+ACK from server.

SYN_RECEIVED

Server received SYN, sent SYN+ACK, waiting for client's ACK.

ESTABLISHED ✓

Full duplex connection open. Data transfer in progress. This is the normal operating state.

FIN_WAIT_1

This side sent FIN, waiting for ACK or FIN+ACK.

FIN_WAIT_2

Our FIN acknowledged. Waiting for remote FIN.

CLOSE_WAIT

Remote side closed. Waiting for local app to close its side.

CLOSING

Both sides sent FIN simultaneously. Waiting for ACK.

LAST_ACK

Passive close side sent FIN, waiting for final ACK.

TIME_WAIT

Both FINs ACKed. Wait 2×MSL before CLOSED. Prevents stale segment confusion.

🗺️

State Transitions — Full Diagram in Text

TRANSITIONS

/* CLIENT (active open) state transitions */
CLOSED
  → app calls connect()                    → SYN_SENT
  → SYN_SENT  + receive SYN+ACK, send ACK → ESTABLISHED
  → SYN_SENT  + receive SYN (simultaneous) → SYN_RECEIVED

/* SERVER (passive open) state transitions */
CLOSED
  → app calls listen()                     → LISTEN
  → LISTEN    + receive SYN, send SYN+ACK  → SYN_RECEIVED
  → SYN_RECEIVED + receive ACK             → ESTABLISHED

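/* TEARDOWN — active close (initiating side) */
ESTABLISHED
  → app calls close(), send FIN            → FIN_WAIT_1
  → FIN_WAIT_1 + receive ACK              → FIN_WAIT_2
  → FIN_WAIT_1 + receive FIN, send ACK    → CLOSING   (simultaneous close)
  → FIN_WAIT_1 + receive FIN that also ACKs our FIN, send ACK → TIME_WAIT
  → CLOSING    + receive ACK              → TIME_WAIT
  → FIN_WAIT_2 + receive FIN, send ACK    → TIME_WAIT
  → TIME_WAIT  + 2*MSL timeout            → CLOSED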

/* TEARDOWN — passive close (receiving side) */
ESTABLISHED
  → receive FIN, send ACK                  → CLOSE_WAIT
  → CLOSE_WAIT + app calls close(), send FIN → LAST_ACK
  → LAST_ACK   + receive ACK               → CLOSED

/* RST — abortive close (any state) */
any state
  → receive RST or send RST                → CLOSED (immediately)

/* Check states on Linux */
ss -tn          # show TCP connections with states
ss -tn state established
ss -tn state time-wait | wc -l   # count TIME_WAIT connections
netstat -ant    # older alternative: all TCP sockets, numeric addresses
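The transitions listed above map naturally onto a lookup table keyed by (state, event). A minimal sketch follows; the event names such as `app_connect` and `recv_syn_ack` are hypothetical labels invented for this example.

```python
TRANSITIONS = {
    ("CLOSED", "app_connect"): "SYN_SENT",
    ("CLOSED", "app_listen"): "LISTEN",
    ("LISTEN", "recv_syn"): "SYN_RECEIVED",
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",
    ("SYN_SENT", "recv_syn"): "SYN_RECEIVED",      # simultaneous open
    ("SYN_RECEIVED", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "app_close"): "FIN_WAIT_1",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",         # simultaneous close
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "app_close"): "LAST_ACK",
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
}

def step(state: str, event: str) -> str:
    """Return the next state, or raise if the event is invalid in this state."""
    if event == "recv_rst":
        return "CLOSED"                            # abortive close from any state
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid event {event!r} in state {state}") from None

def run(events, start="CLOSED"):
    """Apply a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

An event that has no entry for the current state raises an error, which is the same judgement a stateful firewall makes when it sees a packet that does not fit the connection's phase.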

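⚠️ TIME_WAIT accumulation is a common production problem. Each connection in TIME_WAIT holds a socket on the side that closed first for 2×MSL (RFC 793 suggests MSL = 2 minutes; Linux uses a fixed 60-second TIME_WAIT). A busy host that actively closes 10,000 connections/second will hold about 600,000 TIME_WAIT sockets at any moment. For outbound connections to the same destination this can exhaust the ephemeral port range and prevent new connections. Mitigations: connection reuse (keep-alive, pooling), tcp_tw_reuse (Linux sysctl, outbound connections only), a wider ip_local_port_range, or SO_REUSEADDR so a restarted server can bind its listening port. Your NGFW must not confuse TIME_WAIT connections with malicious activity.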

🔥

NGFW State Tracking — What to Watch For

NGFW

A stateful NGFW must track TCP state transitions and reject packets that violate them:

Anomaly	Flags	Why It's Suspicious	Action
SYN-ACK without prior SYN	SYN+ACK	No SYN seen — spoofed or session spliced	Drop + log
Data without ESTABLISHED	PSH+ACK, no connection entry	Injected data, blind injection attack	Drop
RST with wrong sequence number	RST	RST injection attack to terminate connections	Drop if seq out of window
FIN before ESTABLISHED	FIN	Port scan (FIN scan) or evasion attempt	Drop + log
SYN to non-listening port	SYN	Port scan	Drop (no server) or RST
Christmas tree packet	FIN+PSH+URG	Nmap XMAS scan — OS fingerprinting	Drop + alert
NULL scan	no flags	Nmap NULL scan — firewall evasion	Drop + alert
Overlapping segments	varies	IDS evasion — inconsistent reassembly	Reassemble + inspect
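Several rows of this table can be checked from the flags byte alone, before any connection lookup. The sketch below classifies a few well-known combinations; the function name and the returned labels are assumptions, and it treats Nmap's Xmas scan as FIN+PSH+URG.

```python
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def classify_flags(flags: int) -> str:
    """Label flag combinations that are anomalous regardless of connection state."""
    flags &= 0x3F                      # ignore ECE/CWR for this check
    if flags == 0:
        return "null_scan"
    if flags & SYN and flags & FIN:
        return "syn_fin_invalid"       # open and close in one segment
    if flags & SYN and flags & RST:
        return "syn_rst_invalid"
    if flags == FIN | PSH | URG:
        return "xmas_scan"
    if flags == FIN:
        return "fin_scan_candidate"    # a genuine FIN also carries ACK
    return "ok"
```

Anything labelled `ok` still has to pass the stateful checks, since a lone SYN+ACK or PSH+ACK is only valid for a connection the table already knows.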

SEQUENCE NUMBERS — ORDERING, RELIABILITY, AND BYTE TRACKING

🔢

How Sequence Numbers Work

CORE CONCEPT

TCP numbers every byte it sends with a sequence number. This enables: (1) the receiver to detect missing bytes, (2) the receiver to reorder out-of-order segments, and (3) the sender to know exactly which bytes were received via the ACK number.

/* Example: sending "Hello World" (11 bytes) */
ISN = 1000  # chosen randomly at handshake

Segment 1: seq=1001  data="Hello" (5 bytes)   → covers bytes 1001-1005
Segment 2: seq=1006  data=" Worl" (5 bytes)   → covers bytes 1006-1010
Segment 3: seq=1011  data="d"     (1 byte)    → covers bytes 1011-1011

/* Receiver sends ACKs */
After Segment 1: ACK=1006  # "I have 1001-1005, send me 1006 next"
After Segment 2: ACK=1011  # "I have 1001-1010, send me 1011 next"
After Segment 3: ACK=1012  # "I have 1001-1011, send me 1012 next"

/* What if Segment 2 is lost? */
Receiver gets Segment 1:  ACK=1006  (normal)
Receiver gets Segment 3:  ACK=1006  (still 1006 — can't advance past gap!)
                          → This is a duplicate ACK — signals a gap

/* Sequence number arithmetic — always modular (wraps at 2^32) */
/* Compare via a signed 32-bit difference so wraparound is handled correctly */
int32_t diff = (int32_t)(seq_a - seq_b);
if (diff > 0) { /* seq_a is ahead of seq_b */ }

💡 SYN and FIN each consume one sequence number even though they carry no data. This is why the ACK after a SYN is ISN+1 (not ISN+0). It means both sides can unambiguously detect whether the connection control messages (SYN/FIN) were delivered.
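Python integers do not wrap, so the modular comparison has to be written out explicitly. A minimal sketch of the same arithmetic, with helper names that are assumptions:

```python
MOD = 1 << 32

def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n bytes, wrapping at 2^32."""
    return (seq + n) % MOD

def seq_diff(a: int, b: int) -> int:
    """Signed distance a - b in modulo-2^32 space (positive if a is ahead of b)."""
    d = (a - b) % MOD
    return d - MOD if d >= (1 << 31) else d

def in_window(seq: int, start: int, size: int) -> bool:
    """True if seq lies in [start, start + size), allowing for wraparound."""
    return (seq - start) % MOD < size
```

With these helpers, a sequence number of 5 correctly compares as 10 bytes ahead of 0xFFFFFFFB, even though it is numerically smaller.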

📬

Cumulative vs Selective Acknowledgement

ACK MODES

TCP's basic ACK is cumulative — it acknowledges all bytes up to a point. This works well in the common case but is inefficient when segments arrive out of order:

/* Cumulative ACK — without SACK */
Sender sends:  seg[1000] seg[1500] seg[2000] seg[2500]
Network drops: seg[1500]
Receiver gets: seg[1000] ✓  ACK=1500
               seg[2000] ✓  ACK=1500  (still! — can't advance past 1500)
               seg[2500] ✓  ACK=1500  (still!)

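Without SACK: sender only learns that data from 1500 onward is missing;
it cannot tell that seg[2000] and seg[2500] already arrived. After an RTO it may
resend all of them (go-back-N behaviour); with fast retransmit it resends
seg[1500] first and recovers further losses at roughly one per RTT.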

/* Selective ACK (SACK) — RFC 2018 */
Receiver gets: seg[1000] ✓  ACK=1500
               seg[2000] ✓  ACK=1500  SACK=[2000-2499]
               seg[2500] ✓  ACK=1500  SACK=[2000-2999]

With SACK: sender knows ONLY seg[1500] is missing
           retransmits ONLY seg[1500]
           receiver ACKs=3000 after receiving it → done

SACK enabled by: "SACK Permitted" option in SYN/SYN+ACK
Up to 4 SACK blocks per segment, or 3 when Timestamps are also present (each block = 2×32-bit seq numbers = 8 bytes)
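The receiver's side of this exchange can be sketched as a function that turns the set of received segments into a cumulative ACK plus SACK blocks. The function name is an assumption, sequence wraparound is ignored for brevity, and blocks are returned as (left edge, right edge) with the right edge being the next byte after the block, as on the wire.

```python
def ack_and_sack(rcv_nxt: int, segments) -> tuple:
    """segments: iterable of (seq, length) received so far.

    Returns (cumulative_ack, sack_blocks).
    """
    blocks = []
    for seq, length in sorted(segments):
        end = seq + length
        if end <= rcv_nxt:
            continue                           # entirely old data
        if seq <= rcv_nxt:
            rcv_nxt = end                      # in order: advance cumulative ACK
        elif blocks and seq <= blocks[-1][1]:
            # adjacent or overlapping: extend the previous block
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((seq, end))          # new island after a gap
    return rcv_nxt, blocks
```

Applied to the example above with 500-byte segments, the cumulative ACK stays at 1500 while the block grows to (2000, 3000), then jumps to 3000 once seg[1500] arrives.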

FLOW CONTROL — SLIDING WINDOW

🪟

The Sliding Window Mechanism

CONCEPT

Flow control prevents a fast sender from overwhelming a slow receiver's buffer. The receiver tells the sender exactly how much buffer space it has available via the Window Size field in every ACK. The sender must not have more than Window Size bytes of unacknowledged data in flight at any time.

The window "slides" forward as data is acknowledged — the sender's send window moves right as ACKs arrive, allowing more data to be sent.

📊

Sender's View of the Sequence Number Space

SEND BUFFER

The sender categorises its byte stream into four regions:

Sender

Sent + ACKed
already delivered

Sent, not ACKed
in flight

Can send
within window

Cannot send yet
window full or no data

        ← SND.UNA (oldest unACKed) → ← SND.NXT (next to send) → ← SND.UNA + win (window edge) →
      

Receiver

Received + Delivered
to application

Received in-order
buffered, not read yet

Out-of-order
buffered, gap before

Available buffer
= advertised window

        ← RCV.NXT (next expected) → ← RCV.WND (advertised window size) →
      

/* Flow control in action */
Receiver has a 65535-byte buffer, app reads slowly:
  Initial window advertised: 65535 bytes

Sender sends 32768 bytes (32 KiB) → receiver buffers it, app hasn't read yet:
  Receiver advertises: Window = 65535 - 32768 = 32767 bytes

Sender sends another 20000 bytes → receiver buffers:
  Receiver advertises: Window = 65535 - 52768 = 12767 bytes

Sender sends 12000 bytes → buffer nearly full:
  Receiver advertises: Window = 65535 - 64768 = 767 bytes

App reads 40000 bytes from buffer:
  Receiver advertises: Window = 767 + 40000 = 40767 bytes   # window re-opens

/* Zero window — sender must stop */
Buffer completely full:
  Receiver advertises: Window = 0   # sender MUST stop sending data
  Sender starts Zero Window Probe timer
  Sender sends 1-byte probes periodically
  When receiver's app reads data → receiver sends Window Update ACK
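The advertised window is simply the free space left in the receive buffer. The sketch below models that bookkeeping with byte counts of 32768, 20000, 12000 and 40000; the class and method names are assumptions made for illustration.

```python
class ReceiveBuffer:
    """Tracks buffered bytes and the window the receiver would advertise."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffered = 0

    def window(self) -> int:
        """Free space, which is what goes into the Window Size field."""
        return self.capacity - self.buffered

    def receive(self, n: int) -> int:
        """Buffer n bytes from the sender and return the new window."""
        if n > self.window():
            raise ValueError("sender exceeded the advertised window")
        self.buffered += n
        return self.window()

    def app_read(self, n: int) -> int:
        """The application consumes up to n bytes; return the reopened window."""
        self.buffered -= min(n, self.buffered)
        return self.window()
```

Only the application reading data reopens the window; acknowledging data does not free buffer space by itself.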

⚡

Window Scale Option — High-Bandwidth Networks

OPTIMIZATION

The Window Size field is 16 bits — maximum 65,535 bytes. On a 1 Gbps link with 10ms RTT, the bandwidth-delay product (BDP) is 1 Gbps × 0.01s = 1.25 MB. With only 64 KB in flight, the link is only 64KB/1250KB = 5% utilised. The Window Scale option (RFC 7323) solves this by multiplying the window by a power of 2:

/* Window Scale option in SYN */
Scale factor = 7  # window size is multiplied by 2^7 = 128
Effective max window = 65535 × 128 = 8,388,480 bytes (8 MB)

/* Both sides must negotiate it in SYN / SYN+ACK */
/* If one side doesn't include Window Scale in SYN, neither side uses scaling */

/* Check on Linux */
ss -tni | grep rcv_space   # shows receiver socket buffer size
sysctl net.ipv4.tcp_rmem   # min/default/max receive buffer, e.g. "4096 131072 6291456" (defaults vary by kernel)
sysctl net.ipv4.tcp_wmem   # min/default/max send buffer
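The bandwidth-delay product and the scale factor needed to cover it are quick to compute. A minimal sketch, using integer arithmetic to avoid rounding surprises; both function names are assumptions.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes for a link speed in bit/s and an RTT in ms."""
    return bandwidth_bps * rtt_ms // 8000

def min_window_shift(target_bytes: int) -> int:
    """Smallest Window Scale shift (0..14) whose maximum window covers target_bytes."""
    for shift in range(15):
        if (65535 << shift) >= target_bytes:
            return shift
    return 14   # the option cannot express more than a 14-bit shift
```

For the 1 Gbps, 10 ms example, the BDP is 1,250,000 bytes and a shift of 5 is the smallest that covers it.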

CONGESTION CONTROL — PROTECTING THE NETWORK

🌐

The Congestion Collapse Problem

MOTIVATION

Flow control protects the receiver. Congestion control protects the network. In 1986, the internet experienced "congestion collapse": throughput dropped to roughly 0.1% of capacity because senders kept retransmitting lost packets, further overloading already-saturated routers. Van Jacobson designed TCP congestion control (now standardised in RFC 5681) to solve this: senders automatically reduce their sending rate when they detect packet loss.

TCP's congestion control maintains a Congestion Window (cwnd) — a sender-side limit on unacknowledged data in addition to the receiver's window. The effective window is: min(cwnd, receiver_window).

📈

Four Phases of TCP Congestion Control

ALGORITHM

Slow Start

cwnd starts at 1-10 MSS. Doubles every RTT (exponential growth). Continues until cwnd reaches ssthresh.

Congestion Avoidance

cwnd grows by 1 MSS per RTT (linear). Cautious probing of available bandwidth until loss detected.

Fast Retransmit

3 duplicate ACKs signal loss. Retransmit missing segment immediately without waiting for RTO timeout.

Fast Recovery

After fast retransmit: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS. When the retransmission is ACKed, cwnd deflates to ssthresh and the sender continues in Congestion Avoidance (not Slow Start).

/* NewReno algorithm (most common baseline) */
/* State variables */
cwnd = 10 * MSS    # congestion window (starts at 10 MSS per RFC 6928)
ssthresh = 65535   # slow start threshold (initial: large value)

/* Slow Start phase */
on each ACK: cwnd += MSS          # doubles every RTT (exponential)
when cwnd >= ssthresh: → Congestion Avoidance

/* Congestion Avoidance phase */
on each ACK: cwnd += MSS² / cwnd  # +1 MSS per RTT (linear)

/* Packet loss detected by TIMEOUT */
ssthresh = max(cwnd / 2, 2*MSS)
cwnd = 1 MSS        # drastic reduction — restart Slow Start

/* Packet loss detected by 3 DUPLICATE ACKs (mild congestion) */
ssthresh = max(cwnd / 2, 2*MSS)
cwnd = ssthresh + 3*MSS   # smaller reduction (Fast Recovery)
# retransmit the missing segment immediately
# each further duplicate ACK: cwnd += MSS
# new ACK arrives: cwnd = ssthresh, enter Congestion Avoidance (skip Slow Start)

/* Check congestion control algorithm in use */
sysctl net.ipv4.tcp_congestion_control   # typical: "cubic" or "bbr"
ss -tni dst :80 | grep cwnd              # see live cwnd for connections
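To see how the phases shape the window over time, the rules above can be simulated at one-RTT granularity. This is a deliberately coarse sketch in MSS units: the function name and the `loss_at` mapping are assumptions, slow start is capped at ssthresh, and a duplicate-ACK loss is modelled as landing directly on the post-recovery window.

```python
def simulate_cwnd(rtts: int, loss_at: dict, init_cwnd: int = 10, ssthresh: int = 64) -> list:
    """Return the cwnd (in MSS) at the start of each RTT.

    loss_at maps an RTT index to "dupack" or "timeout".
    """
    cwnd = init_cwnd
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        event = loss_at.get(rtt)
        if event == "timeout":
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1                          # restart Slow Start
        elif event == "dupack":
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh                   # window once Fast Recovery completes
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)    # Slow Start: double per RTT
        else:
            cwnd += 1                         # Congestion Avoidance: +1 MSS per RTT
    return trace
```

The trace shows the two reactions side by side: a duplicate-ACK loss halves the window and continues linearly, while a timeout drops it to 1 MSS and climbs exponentially again.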

💡 Modern algorithms: CUBIC and BBR. NewReno is the baseline. Linux defaults to CUBIC (RFC 8312, obsoleted by RFC 9438), which uses a cubic function for window growth and recovers faster after loss on high-BDP links. Google's BBR (Bottleneck Bandwidth and RTT) is newer and model-based rather than loss-based: it estimates the bottleneck bandwidth and RTT instead of reacting only to drops. BBR can substantially improve throughput on lossy networks (mobile, satellite). Understanding NewReno gives you the conceptual foundation; CUBIC refines the same loss-based approach, while BBR replaces the loss signal with a path model.

TCP TIMERS — RETRANSMISSION, KEEPALIVE, TIME-WAIT

⏱️

The Four TCP Timers

TIMERS

Timer	Trigger	Action on Expiry	Default Value
RTO (Retransmission Timeout)	Segment sent with no ACK received within RTO	Retransmit oldest unacknowledged segment. Double RTO (exponential backoff). Reduce cwnd (Slow Start). After about 15 retries (Linux tcp_retries2) the connection is aborted.	Dynamically calculated from RTT (Linux: min 200ms, max 120s; RFC 6298 recommends a 1s floor)
Persist (Zero Window)	Receiver advertises window=0	Send 1-byte Zero Window Probe to check if window has opened. Exponential backoff.	Starts at RTO, doubles each probe
Keepalive	No data exchanged for keepalive idle time	Send TCP keepalive probe (empty or 1-byte segment with seq=SND.NXT-1). If no response after N probes → close connection.	Idle: 7200s (2 hrs), Interval: 75s, Count: 9 probes (Linux defaults)
TIME_WAIT (2×MSL)	Connection enters TIME_WAIT state	After 2×MSL expires, move to CLOSED. Prevents stale segments from old connection being received by new connection with same 4-tuple.	RFC 793 suggests MSL=2 min. Linux uses a fixed TIME_WAIT of 60s (compile-time constant, not a sysctl).

RTO Calculation — Jacobson's Estimator and Karn's Algorithm

/* RTT measurement and RTO calculation (RFC 6298) */

/* Measure RTT for each ACKed segment (not retransmitted ones — Karn's rule) */
/* First sample: SRTT = RTT_sample, RTTVAR = RTT_sample / 2 */
RTTVAR = 0.75 * RTTVAR + 0.25 * |SRTT - RTT_sample|  # RTT variance (uses the old SRTT)
SRTT = 0.875 * SRTT + 0.125 * RTT_sample    # smoothed RTT (EWMA)
RTO = SRTT + 4 * RTTVAR                      # RTO with safety margin
RTO = max(1 second, RTO)                     # floor: 1 second per RFC 6298 (Linux uses 200ms)

/* On RTO timeout: double the RTO (exponential backoff) */
RTO = RTO * 2   # until max (typically 120 seconds)

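/* After a retransmission: do not take an RTT sample from that segment */
# (Can't tell if the ACK is for the original or the retransmitted copy: Karn's algorithm)
# Keep the backed-off RTO until an ACK for a non-retransmitted segment gives a fresh sample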

/* Check on Linux */
ss -tni | grep rtt   # shows rtt:X/Y for established connections
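The RFC 6298 update rules fit in a small class. This sketch omits the clock-granularity term and leaves Karn's rule to the caller, who should only pass samples from segments that were not retransmitted. The class name and the `min_rto`/`max_rto` parameters are assumptions.

```python
class RtoEstimator:
    """Smoothed RTT and RTO estimator following the RFC 6298 update order."""

    def __init__(self, min_rto: float = 1.0, max_rto: float = 120.0):
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                       # initial RTO before any sample

    def on_rtt_sample(self, r: float) -> float:
        """Feed one RTT measurement in seconds and return the new RTO."""
        if self.srtt is None:                # first measurement
            self.srtt = r
            self.rttvar = r / 2
        else:                                # RTTVAR first, using the old SRTT
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        self.rto = min(self.max_rto, max(self.min_rto, self.srtt + 4 * self.rttvar))
        return self.rto

    def on_timeout(self) -> float:
        """Exponential backoff after a retransmission timeout."""
        self.rto = min(self.max_rto, self.rto * 2)
        return self.rto
```

Passing `min_rto=0.2` approximates the Linux floor; with the default 1-second floor, a 100 ms path still yields an RTO of 1 second.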

⚙️

Key TCP Tuning Parameters (Linux)

TUNING

# View all TCP-relevant sysctl parameters
sysctl -a | grep tcp

# Buffer sizes (affects window size and throughput)
sysctl net.ipv4.tcp_rmem    # receive: e.g. "4096 131072 6291456" (min/default/max; defaults vary by kernel)
sysctl net.ipv4.tcp_wmem    # send:    e.g. "4096 16384 4194304"
sysctl net.core.rmem_max    # cap on SO_RCVBUF requested via setsockopt()

# Connection setup
sysctl net.ipv4.tcp_syn_retries      # SYN retransmit attempts (default 6)
sysctl net.ipv4.tcp_synack_retries   # SYN+ACK retransmit attempts (default 5)
sysctl net.ipv4.tcp_syncookies       # SYN flood protection
sysctl net.ipv4.tcp_max_syn_backlog  # max half-open connections per socket

# TIME_WAIT
sysctl net.ipv4.tcp_tw_reuse    # reuse TIME_WAIT sockets for new connections
sysctl net.ipv4.tcp_fin_timeout # FIN_WAIT_2 timeout (default 60s)

# Keepalive
sysctl net.ipv4.tcp_keepalive_time     # idle time before probes (default 7200s)
sysctl net.ipv4.tcp_keepalive_intvl   # interval between probes (default 75s)
sysctl net.ipv4.tcp_keepalive_probes  # probe count before giving up (default 9)

# Congestion control
sysctl net.ipv4.tcp_congestion_control  # algorithm: cubic, bbr, reno
sysctl net.ipv4.tcp_sack                # SACK enabled (default 1)
sysctl net.ipv4.tcp_timestamps          # timestamps enabled (default 1)

TCP CONNECTION TEARDOWN — GRACEFUL AND ABORTIVE CLOSE

👋

Four-Way Graceful Close

TEARDOWN

TCP is full-duplex, so each direction must be closed independently. The graceful close uses four messages (or three if the passive side sends its FIN together with the ACK of the first FIN):

Client (active close)                  Server (passive close)
FIN+ACK seq=m → FIN_WAIT_1        ▶    receive FIN → CLOSE_WAIT
ACK received → FIN_WAIT_2         ◀    ACK ack=m+1
wait for server FIN...                 app closes → send FIN → LAST_ACK
receive FIN → TIME_WAIT (2×MSL)   ◀    FIN+ACK seq=n
ACK ack=n+1                       ▶    ACK received → CLOSED

💡 Half-close: After sending FIN, the local side can no longer send data but can still receive data. The server may continue sending data (e.g., flushing a file) after acknowledging the client's FIN. This "half-closed" state (FIN_WAIT_2 on client, CLOSE_WAIT on server) persists until the server also sends its FIN.

⚡

RST — Abortive Close

RESET

A RST (Reset) segment immediately closes a connection without the graceful 4-way teardown. No data is buffered, no TIME_WAIT is entered — the connection is gone instantly. RST is sent in three main situations:

Connection to closed port — server receives SYN or data for a port nothing is listening on → sends RST
Abortive close — application enables SO_LINGER with a linger time of 0 and calls close() → kernel sends RST instead of FIN
Segment for a non-existent connection — e.g. data arrives after the peer crashed or already closed → RST. (An out-of-window segment on a live connection is answered with an ACK, not a RST.)

/* RST injection attack */
/* Attacker crafts RST segment with sequence number in receiver's window */
/* Target receives RST → connection terminated immediately */
/* Historically a concern for long-lived BGP sessions (the 2004 "Slipping in the Window" analysis; see RFC 4953) */

/* Protection (RFC 5961 "Improving TCP's Robustness to Blind In-Window Attacks"): */
/* accept a RST only if seq == RCV.NXT exactly; if seq is merely inside */
/* [RCV.NXT, RCV.NXT + RCV.WND), reply with a challenge ACK; otherwise drop */

/* NGFW RST injection for connection termination */
/* Some NGFWs send RST to both sides to terminate blacklisted connections */
/* Must spoof the correct source IP and use the exact next expected sequence number */
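The RFC 5961 rule for incoming RST segments comes down to a three-way decision on the sequence number. A minimal sketch, with the function name and return labels as assumptions:

```python
MOD = 1 << 32

def classify_rst(seq: int, rcv_nxt: int, rcv_wnd: int) -> str:
    """Decide how to treat an incoming RST on a synchronised connection."""
    if seq == rcv_nxt:
        return "reset"             # exact match: tear the connection down
    if (seq - rcv_nxt) % MOD < rcv_wnd:
        return "challenge_ack"     # in window but not exact: ask the peer to confirm
    return "drop"                  # outside the window: discard silently
```

A blind attacker therefore has to hit one exact value out of 2^32 instead of any value inside the window, while a genuine peer answers the challenge ACK with a RST carrying the right sequence number.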

TCP IN AN NGFW — STATEFUL INSPECTION DEEP DIVE

🛡️

How a Stateful Firewall Tracks TCP

STATEFUL INSPECTION

A stateful firewall maintains a connection table (also called session table or conntrack table) — a hash table keyed by the 5-tuple, storing the connection's current state and sequence number tracking data.

/* Connection table entry (conntrack) */
typedef struct {
    /* 5-tuple key (stored in bihash) */
    ip4_address_t   src_ip, dst_ip;
    uint16_t        src_port, dst_port;
    uint8_t         proto;              /* 6 = TCP */

    /* TCP state tracking */
    tcp_state_t     state;              /* SYN_SENT, ESTABLISHED, etc. */
    uint32_t        client_isn;         /* client's initial sequence number */
    uint32_t        server_isn;         /* server's initial sequence number */
    uint32_t        client_next_seq;    /* expected next seq from client */
    uint32_t        server_next_seq;    /* expected next seq from server */
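    uint32_t        client_window;      /* client's advertised window */
    uint32_t        server_window;      /* server's advertised window */
    uint8_t         client_wscale;      /* client's Window Scale shift (0 if not negotiated) */
    uint8_t         server_wscale;      /* server's Window Scale shift (0 if not negotiated) */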

    /* Policy and metadata */
    uint32_t        policy_id;          /* which policy matched this flow */
    uint64_t        bytes_client;       /* bytes from client → server */
    uint64_t        bytes_server;       /* bytes from server → client */
    uint64_t        last_seen;          /* timestamp for idle timeout */
    uint8_t         app_id;             /* L7 application (from DPI) */
} conntrack_entry_t;

For every packet, the NGFW:

Extracts the 5-tuple from IP + TCP headers
Looks up the 5-tuple in the connection table (O(1) bihash lookup)
If found: validates the packet against expected state (sequence numbers, flags) → allow, drop, or flag
If not found: check if it's a valid new connection attempt (SYN only, SYN+ACK for asymmetric routing) → create new entry or drop
Updates the connection entry (sequence numbers, bytes, last_seen)
Applies policy (allow, drop, inspect for DPI)
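The lookup-then-validate flow above can be sketched with a dictionary standing in for the hash table. This is a toy model: it normalises the 5-tuple so both directions hit the same entry, tracks only the handshake, and does not check packet direction or sequence numbers. The names `Flow`, `flow_key` and `ConnTable` are assumptions.

```python
from dataclasses import dataclass

FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10

@dataclass
class Flow:
    state: str = "SYN_SENT"
    packets: int = 0

def flow_key(src_ip, src_port, dst_ip, dst_port, proto=6):
    """Direction-independent key: order the two endpoints consistently."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

class ConnTable:
    def __init__(self):
        self.flows = {}

    def process(self, src_ip, src_port, dst_ip, dst_port, flags) -> str:
        """Return "allow" or "drop" for one packet and update the table."""
        key = flow_key(src_ip, src_port, dst_ip, dst_port)
        flow = self.flows.get(key)
        if flow is None:
            if flags != SYN:
                return "drop"                 # no entry and not a new connection
            self.flows[key] = Flow(packets=1)
            return "allow"
        flow.packets += 1
        if flags & RST:
            del self.flows[key]               # abortive close removes the entry
        elif flow.state == "SYN_SENT" and flags == SYN | ACK:
            flow.state = "SYN_RECEIVED"
        elif flow.state == "SYN_RECEIVED" and flags == ACK:
            flow.state = "ESTABLISHED"
        return "allow"
```

Normalising the key is one design choice; the alternative is to store two entries per connection, one for each direction.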

🔬

TCP Sequence Number Validation

SEQUENCE TRACKING

A sophisticated NGFW validates sequence numbers on every packet to detect injection attacks:

/* Validate incoming segment from client */
/* Sketch: assumes ct also stores client_wscale and that tcp_header_t */
/* mirrors Linux struct tcphdr (seq, ack_seq, window, syn, ack, rst) */
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohl, ntohs */

bool validate_tcp_segment(conntrack_entry_t *ct,
                          tcp_header_t *tcp, uint32_t payload_len) {
    uint32_t seq     = ntohl(tcp->seq);
    uint32_t ack     = ntohl(tcp->ack_seq);
    /* The client's advertised window is scaled by the client's own shift */
    uint32_t win     = (uint32_t)ntohs(tcp->window) << ct->client_wscale;

    /* Check 1: segment must lie inside the server's receive window */
    /* [client_next_seq, client_next_seq + server_window) */
    /* Strict: a production tracker also tolerates retransmissions below client_next_seq */
    int32_t seq_delta = (int32_t)(seq - ct->client_next_seq);
    if (seq_delta < 0 ||
        (uint32_t)seq_delta + payload_len > ct->server_window) {
        /* Out-of-window segment, could be injected */
        return false;
    }

    /* Check 2: ACK must not acknowledge data the server has not sent yet */
    if (tcp->ack && (int32_t)(ack - ct->server_next_seq) > 0) {
        return false;
    }

    /* Check 3: flags match expected state */
    if (ct->state == TCP_ESTABLISHED) {
        if (tcp->syn && !tcp->rst)
            return false;   /* SYN in ESTABLISHED is suspicious */
    }

    ct->client_window = win;   /* used when validating server → client data */
    return true;
}

⚡

MSS Clamping — Preventing Fragmentation

MSS CLAMPING

When a TCP connection passes through a firewall or VPN that reduces the effective MTU (e.g., PPPoE reduces MTU from 1500 to 1492, VPN adds header overhead), packets larger than the new MTU need to be fragmented — or dropped if DF=1. MSS clamping rewrites the MSS option in SYN/SYN+ACK segments to force both sides to use smaller segments that fit without fragmentation.

/* MSS clamping — rewrite MSS option in SYN segments */
/* Called "TCP MSS clamping" — applied on SYN and SYN+ACK */

Original SYN: MSS=1460 (assuming Ethernet MTU=1500, IP hdr=20, TCP hdr=20)
PPPoE link MTU: 1492 bytes
New MSS: 1492 - 20 (IP) - 20 (TCP) = 1452

NGFW rewrites MSS=1460 → MSS=1452 in the SYN before forwarding
Both sides now use 1452-byte segments → no fragmentation needed

/* Linux iptables MSS clamping */
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

/* In VPP (your data plane) */
# This would be implemented in your TCP normalisation plugin
# Find TCP Options in SYN segment, locate MSS option (Kind=2),
# compare with interface MTU, rewrite if MSS > (MTU - 40)
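The option walk described above can be sketched in plain C. This is an illustration under stated assumptions, not VPP plugin code: the function name clamp_mss and its buffer-based interface are hypothetical, it assumes IPv4 without IP options (40 bytes of overhead), and it leaves the TCP checksum update to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL 0   /* End of Option List */
#define OPT_NOP 1   /* No-Operation (padding) */
#define OPT_MSS 2   /* Maximum Segment Size, length 4 */

/* Hypothetical helper: clamp the MSS option found in a SYN's option bytes.
 * opts/opts_len cover only the TCP options area (header bytes 20 onward).
 * Returns true if the option was rewritten. The caller must then fix the
 * TCP checksum; this sketch does not. Assumes overhead = 20 (IP) + 20 (TCP). */
bool clamp_mss(uint8_t *opts, size_t opts_len, uint16_t link_mtu) {
    if (link_mtu <= 40) return false;
    uint16_t max_mss = (uint16_t)(link_mtu - 40);
    size_t i = 0;
    while (i < opts_len) {
        uint8_t kind = opts[i];
        if (kind == OPT_EOL) break;
        if (kind == OPT_NOP) { i++; continue; }     /* single-byte option */
        if (i + 1 >= opts_len) break;               /* truncated: no length byte */
        uint8_t len = opts[i + 1];
        if (len < 2 || i + len > opts_len) break;   /* malformed option */
        if (kind == OPT_MSS && len == 4) {
            uint16_t mss = (uint16_t)((opts[i + 2] << 8) | opts[i + 3]);
            if (mss <= max_mss) return false;       /* already small enough */
            opts[i + 2] = (uint8_t)(max_mss >> 8);  /* network byte order */
            opts[i + 3] = (uint8_t)(max_mss & 0xff);
            return true;
        }
        i += len;
    }
    return false;   /* no MSS option present */
}
```

Clamping only ever lowers the value: an MSS that already fits is left untouched, so the rewrite is safe to apply to every SYN and SYN+ACK that crosses the reduced-MTU link.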

LAB 1

Capture and Decode a Complete TCP Lifecycle

Objective: Capture a full TCP session — handshake, data transfer, and teardown — and decode every flag, sequence number, ACK, and window size in Wireshark. Understand the connection from first SYN to last ACK.

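Start Wireshark capture on your interface. Run: curl http://example.com. Stop capture. Filter on the server's address to isolate the example.com conversation, e.g. ip.addr == 93.184.216.34. The address can change, so confirm it with dig +short example.com or read it from the capture.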

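Handshake analysis: Find the SYN packet. Record: Sequence Number (ISN), Window Size, MSS option, SACK Permitted option, Window Scale option. Wireshark shows relative sequence numbers by default, so the ISN appears as 0; disable "Relative sequence numbers" in the TCP protocol preferences to see the raw 32-bit values. Find SYN+ACK: verify ACK = client ISN + 1. Find the final ACK: verify ACK = server ISN + 1.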

Data transfer analysis: Find the HTTP GET request packet. Record: Flags (PSH+ACK), Sequence Number, payload length. Find the HTTP response: record sequence numbers of the first and last response segment. Use Wireshark's "Follow TCP Stream" to see the full conversation.

Teardown analysis: Find the FIN+ACK from one side, the ACK reply, then the FIN+ACK from the other side, and the final ACK. Identify which side initiated the close. Look for TIME_WAIT: run ss -tn state time-wait immediately after curl — you may catch the socket in TIME_WAIT.

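Sequence number graph: In Wireshark, go to Statistics → TCP Stream Graphs → Time/Sequence (Stevens). Sequence numbers only ever increase, so the plot is a rising staircase: steps that get taller each round trip during slow start, steadier growth afterwards, and retransmissions as points that repeat an earlier sequence number. (The familiar sawtooth belongs to cwnd over time, not to this graph.) If there are no retransmissions, add artificial delay and loss: sudo tc qdisc add dev eth0 root netem delay 100ms loss 2%, then curl again. Remove the rule afterwards with sudo tc qdisc del dev eth0 root.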

Check the connection state machine using: ss -tn (during the connection) — observe ESTABLISHED state. Check ss -tn state time-wait after connection closes. Map each ss state to the TCP state diagram in Tab 3.
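The handshake checks from this lab can also be written as a small C function, which makes the "+1" rule explicit. This is a sketch: the struct and function names are hypothetical, and the fields are the raw (not relative) values read from the three handshake packets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the values copied from the capture. */
typedef struct {
    uint32_t syn_seq;      /* client ISN, from the SYN */
    uint32_t synack_seq;   /* server ISN, from the SYN+ACK */
    uint32_t synack_ack;   /* ACK field of the SYN+ACK */
    uint32_t final_seq;    /* SEQ field of the final ACK */
    uint32_t final_ack;    /* ACK field of the final ACK */
} handshake_t;

/* SYN consumes one sequence number, so each side must acknowledge
 * ISN + 1. The casts keep the addition modulo 2^32. */
bool handshake_consistent(const handshake_t *h) {
    return h->synack_ack == (uint32_t)(h->syn_seq + 1u)
        && h->final_seq  == (uint32_t)(h->syn_seq + 1u)
        && h->final_ack  == (uint32_t)(h->synack_seq + 1u);
}
```

The same rule holds when an ISN happens to be 0xFFFFFFFF: the expected ACK is then 0, because sequence arithmetic wraps.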

LAB 2

Simulate TCP Attacks and Defences with Scapy

Objective: Use Scapy to craft malformed TCP segments and observe how Linux handles them. Understand SYN flood, RST injection, and invalid flag combinations.

SYN flood simulation (on loopback, so safe): start Python as root, run from scapy.all import *, then send 100 SYNs with random source IPs to a closed port: for i in range(100): send(IP(src=RandIP(), dst="127.0.0.1")/TCP(sport=RandShort(), dport=9999, flags="S"), verbose=0). Capture with tcpdump -i lo -n 'port 9999'. If nothing shows up on lo, set conf.L3socket = L3RawSocket before sending; Scapy on Linux may need this to reach the loopback interface. What does the server return for a closed port?

Flag anomaly detection: Start a listening server first: nc -l 12345. Then send a Christmas tree packet (FIN, SYN, RST, PSH, ACK and URG all set) to that port and observe: send(IP(dst="127.0.0.1")/TCP(dport=12345, flags="FSRPAU"), verbose=1). Does the server accept it? What does the Linux kernel do with it?

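SYN cookies demo: Force SYN cookies on: sudo sysctl net.ipv4.tcp_syncookies=2 (a value of 1 sends cookies only once the SYN backlog overflows; 2 sends them unconditionally). Start nc -l 8888. Send 500 SYNs from random IPs to port 8888. Monitor the connection backlog: ss -Htn state syn-recv | wc -l (-H drops the header line from the count). With syncookies=2, the backlog should not grow indefinitely.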

Build a mini port scanner: Write a Python script using Scapy that sends SYN to ports 1-1024 on localhost and records which ports return SYN+ACK (open) vs RST (closed) vs no response (filtered). This is exactly how Nmap's SYN scan works.

Analyse the output: Run your port scanner against your local machine. Cross-reference with ss -tlnp (listening TCP ports). Every port showing SYN+ACK in your scan should match a listening service. Ports showing RST are closed. Understand why firewall-filtered ports show no response.
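The open / closed / filtered decision in the scanner comes down to inspecting the flags byte of the reply. The lab asks for a Python script; the sketch below shows only the classification logic in C, with hypothetical names, and assumes the caller passes NULL when the probe timed out.

```c
#include <stddef.h>
#include <stdint.h>

/* TCP flag bits as they appear in byte 13 of the TCP header. */
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_ACK 0x10

typedef enum { PORT_OPEN, PORT_CLOSED, PORT_FILTERED, PORT_UNKNOWN } port_status_t;

/* Hypothetical helper: classify the reply to a SYN probe.
 * reply_flags points at the flags byte of the reply, or is NULL when
 * no reply arrived before the timeout. */
port_status_t classify_syn_reply(const uint8_t *reply_flags) {
    if (reply_flags == NULL) return PORT_FILTERED;   /* silence: likely dropped */
    uint8_t f = *reply_flags;
    if (f & FLAG_RST) return PORT_CLOSED;            /* RST or RST+ACK */
    if ((f & (FLAG_SYN | FLAG_ACK)) == (FLAG_SYN | FLAG_ACK))
        return PORT_OPEN;                            /* SYN+ACK: a listener answered */
    return PORT_UNKNOWN;                             /* anything else is unexpected */
}
```

A firewall that silently drops the probe produces no reply at all, which is why filtered ports are inferred from a timeout rather than from a packet.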

LAB 3

Write a TCP Connection Tracker in C

Objective: Implement a simplified TCP state machine tracker using libpcap. This is the core of what a stateful firewall does — track each connection through its state transitions based on observed TCP flags.

Install libpcap: sudo apt install libpcap-dev. Create tcp_tracker.c. Define a connection table as a simple array of structs with fields: src_ip, dst_ip, src_port, dst_port, state (enum: SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT, CLOSED), last_seen.

Use pcap to capture TCP packets: pcap_open_live("eth0", 65535, 1, 1000, errbuf). Set filter: pcap_compile + pcap_setfilter with filter string "tcp". In the packet handler, parse Ethernet → IP → TCP headers manually using byte offsets.

Implement state transitions: if SYN-only → create new entry with state=SYN_SENT; if SYN+ACK → find matching entry (reversed 5-tuple), update to state=SYN_RECEIVED; if ACK after SYN+ACK → state=ESTABLISHED; if FIN → state=FIN_WAIT; after second FIN+ACK → state=CLOSED, remove entry.

Print a summary every second: number of connections in each state (SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT, CLOSED), total connections seen, connections per second. Run it while browsing the web or downloading a file, and watch the ESTABLISHED count grow and shrink.

Bonus — Add anomaly detection: Log a warning when you see: (a) SYN+ACK without a prior SYN in the table, (b) RST with a sequence number outside the expected window, (c) data segments before ESTABLISHED state, (d) more than 5 SYNs per second from the same source IP.
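As a starting point for the lab, the byte-offset parsing and the state transitions can be sketched as two small functions. This is a simplified illustration, not a reference solution: all names are hypothetical, the parser assumes untagged Ethernet II carrying IPv4, the state function ignores packet direction, and the RST handling is an addition beyond the steps listed above.

```c
#include <stddef.h>
#include <stdint.h>

#define F_FIN 0x01
#define F_SYN 0x02
#define F_RST 0x04
#define F_ACK 0x10

typedef enum {
    ST_NONE, ST_SYN_SENT, ST_SYN_RECEIVED, ST_ESTABLISHED, ST_FIN_WAIT, ST_CLOSED
} conn_state_t;

/* Pull the TCP flags byte out of a raw frame using fixed offsets.
 * Assumes Ethernet II without a VLAN tag. Returns 0 on success, -1 if
 * the frame is too short or is not IPv4/TCP. */
int tcp_flags_from_frame(const uint8_t *f, size_t len, uint8_t *flags_out) {
    if (len < 14 + 20) return -1;                    /* Ethernet + minimal IPv4 */
    if (f[12] != 0x08 || f[13] != 0x00) return -1;   /* EtherType 0x0800 = IPv4 */
    if ((f[14] >> 4) != 4) return -1;                /* IP version */
    size_t ihl = (size_t)(f[14] & 0x0f) * 4;         /* IPv4 header length */
    if (ihl < 20) return -1;
    if (f[14 + 9] != 6) return -1;                   /* protocol 6 = TCP */
    size_t tcp_off = 14 + ihl;
    if (len < tcp_off + 20) return -1;               /* minimal TCP header */
    *flags_out = f[tcp_off + 13];                    /* flags live in byte 13 */
    return 0;
}

/* Simplified transition function following the lab steps. Direction is
 * not tracked, so a retransmitted FIN would count as the second FIN. */
conn_state_t next_state(conn_state_t cur, uint8_t flags) {
    if (flags & F_RST) return ST_CLOSED;             /* abortive close */
    int syn = (flags & F_SYN) != 0;
    int ack = (flags & F_ACK) != 0;
    int fin = (flags & F_FIN) != 0;
    switch (cur) {
    case ST_NONE:         return (syn && !ack) ? ST_SYN_SENT : ST_NONE;
    case ST_SYN_SENT:     return (syn && ack) ? ST_SYN_RECEIVED : ST_SYN_SENT;
    case ST_SYN_RECEIVED: return (ack && !syn) ? ST_ESTABLISHED : ST_SYN_RECEIVED;
    case ST_ESTABLISHED:  return fin ? ST_FIN_WAIT : ST_ESTABLISHED;
    case ST_FIN_WAIT:     return fin ? ST_CLOSED : ST_FIN_WAIT;
    default:              return cur;
    }
}
```

In the pcap callback, one possible flow is to call tcp_flags_from_frame on each packet, look up the connection entry by 5-tuple (trying the reversed tuple too), and store the result of next_state back into the entry.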

M05 MASTERY CHECKLIST

Can explain TCP's 6 guarantees: reliability, ordering, no duplication, error detection, flow control, congestion control
Know what TCP does NOT guarantee: timing, bandwidth, message boundaries
Can draw the TCP header from memory with all field names and sizes (20 bytes, 5 rows)
Know all 8 TCP flags and what each does: CWR, ECE, URG, ACK, PSH, RST, SYN, FIN
Know flag combinations that indicate state: SYN=new connection, SYN+ACK=server reply, ACK=acknowledgement (set on nearly every segment after the initial SYN), FIN+ACK=close, RST=abort
Know the key TCP options: MSS, Window Scale, SACK Permitted, SACK, Timestamps
Can explain the 3-way handshake step by step with sequence numbers: SYN(seq=x) → SYN+ACK(seq=y,ack=x+1) → ACK(ack=y+1)
Know why ISN is random: prevents stale segment injection and sequence number prediction attacks
Know what a SYN flood attack is and how SYN cookies defend against it
Can name and describe all 11 TCP states: CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT
Know the active vs passive close distinction and which side enters TIME_WAIT
Know why TIME_WAIT lasts 2×MSL and what problem it solves
Understand sequence numbers: each byte is numbered, SYN and FIN consume one number each
Understand ACK semantics: ACK=N means "I have received all bytes up to N-1, send me N next"
Understand cumulative vs selective ACK (SACK): SACK reports received out-of-order blocks, allows selective retransmission
Explain flow control: receiver's window size limits sender's unACKed data in flight
Know the four flow control regions: sent+ACKed, sent+unACKed, can send, cannot send
Know what zero window means and how the sender handles it: persist timer + zero window probes
Explain congestion control's 4 phases: Slow Start, Congestion Avoidance, Fast Retransmit, Fast Recovery
Know cwnd and ssthresh: cwnd doubles every RTT in SS, grows by about one MSS per RTT in CA, halves on loss
Know the difference between timeout loss (cwnd→1) and 3-dupACK loss (cwnd→ssthresh, skip SS)
Know the 4 TCP timers: RTO (retransmit), Persist (zero window), Keepalive, TIME_WAIT
Know how RTO is calculated: SRTT + 4×RTTVAR, minimum 1 second per RFC 6298 (Linux uses a 200 ms floor)
Understand 4-way teardown vs RST: FIN is graceful (buffered data delivered), RST is abortive (immediate)
Know what a stateful NGFW stores per TCP connection: 5-tuple, state, ISNs, sequence tracking, window sizes
Know 7+ TCP attack types and NGFW defences: SYN flood, RST injection, Christmas tree scan, NULL scan, data before ESTABLISHED, overlapping segments, invalid flags
Know MSS clamping: why it's needed, when applied (SYN/SYN+ACK), how it prevents fragmentation
Completed Lab 1: captured full TCP lifecycle in Wireshark, decoded sequence numbers and flags at every stage
Completed Lab 2: used Scapy to craft TCP attacks, implemented mini SYN port scanner
Completed Lab 3: built TCP connection tracker in C using libpcap with state machine and anomaly detection

✅ When complete: Move to M06 - UDP and ICMP. You now have deep TCP knowledge. M06 is shorter, because UDP has almost no complexity by design, but understanding UDP's simplicity (and its implications for NGFW) is essential before moving to DNS (M07), which runs mostly over UDP, and HTTP (M08), whose latest version, HTTP/3, runs over QUIC on top of UDP.
