NETWORKING MASTERY · PHASE 2 · MODULE 05 · WEEKS 4–5
⚡ TCP Internals
3-way handshake · State machine · Sequence numbers · Flow control · Congestion control · SACK · Timers
Beginner → Intermediate · Prerequisite: M03 IPv4 · RFC 793 + RFC 9293 · Most Critical Transport Protocol · 3 Labs

TCP — RELIABLE, ORDERED, BIDIRECTIONAL BYTE STREAMS

📡

What TCP Guarantees — and What It Doesn't

OVERVIEW

TCP (Transmission Control Protocol, originally RFC 793, now specified by RFC 9293) is Layer 4's workhorse. It takes IP's unreliable, unordered packet delivery and builds a reliable, ordered, bidirectional byte stream on top of it. Many major application protocols (HTTP/1.1 and HTTP/2, HTTPS, SSH, SMTP, FTP) run over TCP because reliability matters more than raw speed for those use cases.

What TCP guarantees:

  • Reliability — every byte sent will be received, or the sender will know it failed. If a packet is lost, TCP detects it and retransmits automatically
  • Ordering — bytes arrive in the same order they were sent, even if packets arrive out of order in transit
  • No duplication — TCP detects and discards duplicate packets
  • Error detection — checksum on every segment
  • Flow control — sender doesn't overwhelm receiver's buffer
  • Congestion control — sender adapts to network capacity, doesn't collapse the network

What TCP does NOT guarantee:

  • Timing / latency — retransmissions add unpredictable delay
  • Bandwidth — TCP adapts to available capacity, never reserves it
  • Message boundaries — TCP is a stream, not a message protocol. If you send "Hello" and "World" as two separate write() calls, the receiver may get "HelloWorld" in one read() or "He" and "lloWorld" in two. Applications must implement their own framing
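Since TCP delivers a byte stream, the application has to mark where each message ends. A minimal sketch of one common approach, length-prefix framing, is shown below. The function names `frame` and `extract_frames` and the 4-byte big-endian prefix are illustrative choices, not part of TCP or of any particular protocol.

```python
import struct

def frame(message: bytes) -> bytes:
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(message)) + message

def extract_frames(buffer: bytearray) -> list[bytes]:
    """Pop every complete message from buffer; leave any partial tail in place."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack_from("!I", buffer, 0)
        if len(buffer) < 4 + length:
            break  # the rest of this message has not arrived yet
        messages.append(bytes(buffer[4:4 + length]))
        del buffer[:4 + length]
    return messages

# Simulate the stream arriving in arbitrary chunks, as read() may deliver it.
stream = frame(b"Hello") + frame(b"World")
rx = bytearray()
received = []
for chunk in (stream[:6], stream[6:11], stream[11:]):
    rx += chunk
    received += extract_frames(rx)
```

However the chunks are split, the receiver recovers the two original messages, because it only acts once a full length prefix and the full body have arrived.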
📞 Analogy — A Phone Call vs Postcards

UDP is like sending postcards — you write one, drop it in the postbox, and hope it arrives. No confirmation, no order guarantee, no retry. TCP is like a phone call: first you establish the call (3-way handshake), then both parties speak in turn and confirm they heard each other ("uh-huh, go on"), and if one side goes silent the other says "hello? are you still there?" (keepalive). When the call ends, both sides say goodbye properly (4-way teardown). This setup and teardown overhead is why TCP is slower for small one-shot queries — but the reliability is worth it for file transfers, web pages, and anything where missing data is unacceptable.

⚖️

TCP vs UDP — When to Use Which

COMPARISON
Property | TCP | UDP
Connection | Connection-oriented (3-way handshake) | Connectionless (fire and forget)
Reliability | Guaranteed delivery + retransmission | Best-effort, no retransmission
Ordering | In-order delivery guaranteed | Packets may arrive out of order
Speed | Slower: overhead for reliability | Faster: minimal overhead
Header size | 20–60 bytes | 8 bytes
Flow control | Yes (sliding window) | No
Congestion control | Yes (reduces sending rate under congestion) | No (keeps sending regardless)
Use cases | HTTP/HTTPS, SSH, SMTP, FTP, database | DNS, VoIP, video streaming, gaming, QUIC

💡 NGFW relevance: TCP is the dominant protocol for web traffic (HTTP/HTTPS), management traffic (SSH), and email (SMTP). Your NGFW must maintain connection state for every TCP session — tracking sequence numbers, connection phase (handshake/established/closing), and detecting anomalies. UDP sessions are tracked differently (timeout-based, no handshake state). Understanding TCP deeply is essential for building correct stateful inspection.

TCP HEADER — 20 BYTES MINIMUM, UP TO 60 BYTES WITH OPTIONS

Row 1: Source Port (16 bits) | Destination Port (16 bits)
Row 2: Sequence Number (32 bits)
Row 3: Acknowledgement Number (32 bits)
Row 4: Data Offset (4 bits) | Reserved (4 bits) | Flags: CWR ECE URG ACK PSH RST SYN FIN (8 bits) | Window Size (16 bits)
Row 5: Checksum (16 bits) | Urgent Pointer (16 bits)
Row 6+: Options (if Data Offset > 5) + Padding, 0–40 bytes: MSS, SACK, Timestamps, Window Scale
🔍

Every Field Explained

FIELD REFERENCE

Source Port and Destination Port (16 bits each)

Port numbers identify the application on each end. Combined with IP addresses, they form the 5-tuple that uniquely identifies a TCP connection: (src_ip, src_port, dst_ip, dst_port, protocol=TCP). Well-known ports: 80=HTTP, 443=HTTPS, 22=SSH, 25=SMTP, 53=DNS-TCP, 3306=MySQL. The client uses an ephemeral port assigned by the OS (IANA suggests 49152–65535; Linux defaults to 32768–60999, see net.ipv4.ip_local_port_range).

Sequence Number (32 bits)

Identifies the position of the first byte of data in this segment within the entire byte stream. The sequence number space is 0 to 2³²−1 (wraps around). The Initial Sequence Number (ISN) is chosen randomly at connection setup — not starting at 0 — to prevent stale segments from old connections being confused with new ones. In a SYN segment, the sequence number is the ISN itself (no data yet).

Acknowledgement Number (32 bits)

The sequence number of the next byte the receiver expects from the sender. This acknowledges all bytes up to (but not including) this number. For example, if the receiver has successfully received bytes 0–999, it sends ACK=1000 meaning "I have everything up to 999, send me 1000 next". ACK is only valid when the ACK flag is set.

Data Offset (4 bits)

TCP header length in 32-bit words — same concept as IPv4's IHL. Minimum 5 (20 bytes). Maximum 15 (60 bytes). Tells the receiver where the payload data starts: data_offset_bytes = data_offset × 4.

TCP Flags (8 bits): The Most Important Field for NGFW

CWR: Congestion Window Reduced (ECN response)
ECE: ECN Echo (congestion signal received)
URG: Urgent Pointer is valid (rarely used)
ACK: ACK number is valid; set on every segment after the initial SYN
PSH: Push; receiver should deliver buffered data to the app promptly
RST: Reset; abortive connection close
SYN: Synchronise; connection initiation
FIN: Finish; orderly connection close

Older diagrams show a ninth flag, NS (RFC 3540, now historic); RFC 9293 returns that bit to the reserved field.

Flag combinations reveal connection phase: SYN only = new connection attempt; SYN+ACK = server accepting; ACK only = data transfer; FIN+ACK = graceful close; RST = abort. Your NGFW inspects these flags to track connection state in its connection table.
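To make the field layout concrete, here is a small sketch that unpacks the fixed 20-byte header and names the flags that are set. `parse_tcp_header` and `FLAG_NAMES` are hypothetical names for this example; the bit positions follow the header layout described above.

```python
import struct

# Flag names from least-significant bit upward (FIN = 0x01 ... CWR = 0x80).
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options are not parsed)."""
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", segment[:20])
    data_offset = off_flags >> 12      # header length in 32-bit words
    flags = off_flags & 0xFF           # low 8 bits hold the flags
    return {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_len": data_offset * 4,  # payload starts at this byte offset
        "flags": [n for i, n in enumerate(FLAG_NAMES) if flags & (1 << i)],
        "window": window,
    }

# Hand-built SYN: ports 49152 -> 443, seq=1000000, data offset 5, SYN flag (0x02).
syn = struct.pack("!HHIIHHHH", 49152, 443, 1_000_000, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
```

Here `hdr["flags"]` contains only `SYN`, which a connection tracker would treat as a new connection attempt.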

Window Size (16 bits)

Advertises how many bytes the receiver can accept in its buffer right now. This is the foundation of TCP flow control: the sender must not send more unacknowledged data than the receiver's window allows. The Window Scale option left-shifts this value by up to 14 bits (a multiplier of up to 16,384) for high-bandwidth links. We cover this in the Flow Control tab.

Checksum (16 bits)

Computed over a "pseudo-header" (IP src, IP dst, Protocol=6, TCP length) plus the entire TCP header and payload. Detects corruption. The pseudo-header inclusion means the checksum also validates that the segment reached the correct destination IP — no mis-delivery.
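The checksum calculation can be sketched as follows for IPv4. The helper names `ones_complement_sum` and `tcp_checksum` are made up for this example, and the addresses used to exercise it are documentation ranges, not values from the module.

```python
import socket
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus TCP header and payload."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(segment)))   # zero, protocol=6, TCP length
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF

# Build a segment with the checksum field zeroed, then fill it in (bytes 16-17).
seg = bytearray(struct.pack("!HHIIHHHH", 49152, 80, 1, 0, (5 << 12) | 0x02, 65535, 0, 0) + b"hi!")
seg[16:18] = struct.pack("!H", tcp_checksum("192.0.2.1", "198.51.100.7", bytes(seg)))
```

A receiver verifies by running the same calculation over the segment as received: a result of 0 means the sum checks out. Recomputing with a different destination address gives a non-zero result, which is how the pseudo-header catches mis-delivery.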

Key TCP Options

Option | Kind | Purpose | NGFW Impact
MSS | 2 | Maximum Segment Size: largest segment payload the advertising host is willing to receive | NGFW can reduce MSS to avoid fragmentation (MSS clamping)
Window Scale | 3 | Shift count for Window Size (multiplier 2^shift, shift ≤ 14, so up to ×16,384) | Must track for correct window calculation
SACK Permitted | 4 | Signals support for Selective ACK (used only if both sides send it) | Signals need to track SACK blocks
SACK | 5 | Reports which out-of-order blocks were received | Must parse for correct retransmit tracking
Timestamps | 8 | RTT measurement + PAWS (Protection Against Wrapped Sequence numbers) | Used for RTT monitoring in NGFW analytics
TFO (Fast Open) | 34 | Send data in the SYN on repeat connections using a cookie (saves one RTT) | NGFW must parse data-in-SYN for DPI
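The options area is a sequence of kind/length/value records, with two single-byte exceptions (End of Option List = 0 and No-Operation = 1). A minimal parser sketch follows; `parse_tcp_options` is a hypothetical helper that stops at the first malformed record instead of guessing.

```python
import struct

def parse_tcp_options(options: bytes) -> list[tuple[int, bytes]]:
    """Return (kind, value) pairs from a raw TCP options area."""
    parsed, i = [], 0
    while i < len(options):
        kind = options[i]
        if kind == 0:              # End of Option List
            break
        if kind == 1:              # No-Operation: single padding byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                  # truncated option
        length = options[i + 1]    # length covers kind + length + value
        if length < 2 or i + length > len(options):
            break                  # malformed length
        parsed.append((kind, options[i + 2:i + length]))
        i += length
    return parsed

# Options as often seen in a SYN: MSS=1460, SACK Permitted, NOP, Window Scale=7
raw = bytes([2, 4]) + struct.pack("!H", 1460) + bytes([4, 2, 1, 3, 3, 7])
opts = dict(parse_tcp_options(raw))
mss = struct.unpack("!H", opts[2])[0]
wscale = opts[3][0]
```

The length checks matter in a firewall context: option lengths come from untrusted input, so the parser must never read past the end of the header.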

THE THREE-WAY HANDSHAKE — CONNECTION ESTABLISHMENT

🤝

Why Three Steps?

CONCEPT

A TCP connection needs both sides to agree on two things before data can flow: (1) the connection exists, and (2) both sides know each other's initial sequence numbers (ISN) so they can properly track bytes. The three-way handshake achieves both with the minimum number of round trips.

Two steps (SYN → SYN+ACK) would let the client learn the server's ISN and confirm that its own SYN was acknowledged, but the server would never learn whether its SYN+ACK (and therefore its ISN) reached the client. Three steps (SYN → SYN+ACK → ACK) confirm that both sides have exchanged and acknowledged ISNs, establishing a reliable bidirectional channel.

📊

The Handshake Step by Step

SEQUENCE DIAGRAM
Client                                      Server
  |  SYN      seq=x (client ISN)       →      |   SYN received
  |  ←        SYN+ACK  seq=y  ack=x+1         |   (client: SYN+ACK received)
  |  ACK      seq=x+1  ack=y+1         →      |   ACK received
ESTABLISHED ✓                               ESTABLISHED ✓
/* Step 1 — Client sends SYN */
Flags:  SYN
Seq:    x        # randomly chosen ISN — e.g. 1,000,000
Ack:    0        # ACK flag not set — nothing to ack yet
Options: MSS=1460, SACK permitted, Window Scale=7, Timestamps

/* Step 2 — Server sends SYN+ACK */
Flags:  SYN, ACK
Seq:    y        # server's own randomly chosen ISN — e.g. 5,000,000
Ack:    x+1      # "I received your SYN (which consumed 1 seq byte), send me x+1 next"
Options: MSS=1460, SACK permitted, Window Scale=9, Timestamps

/* Step 3 — Client sends ACK */
Flags:  ACK
Seq:    x+1      # client's next byte
Ack:    y+1      # "I received your SYN, send me y+1 next"
# Connection is now ESTABLISHED on both sides
# Client may already include data in this ACK segment; TCP Fast Open goes further and carries data in the SYN itself

💡 Why random ISN? If ISN always started at 0, an attacker could inject forged segments into an existing connection — they just need to guess the current sequence number, which is trivial if it started from 0. Random ISN makes it computationally infeasible to forge in-window segments.

⚠️

SYN Flood Attack and SYN Cookies

SECURITY

A SYN flood is one of the oldest and most effective DoS attacks. The attacker sends thousands of SYN packets with spoofed source IPs. The server allocates state for each half-open connection, waiting for the final ACK that never comes. Eventually, the server's connection table fills up and it can't accept legitimate connections.

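SYN Cookies (described in RFC 4987) address this: instead of allocating state on SYN receipt, the server encodes the connection parameters (MSS, a time counter, etc.) into the initial sequence number (ISN) of the SYN+ACK. The state is "stored" in the sequence number itself. When the final ACK arrives, the server recovers the cookie as ACK number minus 1, verifies it, decodes the parameters, and allocates state only then. No state is allocated for connections that never complete, so a SYN flood can no longer exhaust the connection table (it still consumes bandwidth and CPU). The trade-off is that only a few bits are available, so some negotiated options may be lost or coarsened.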

# Check if SYN cookies are enabled on Linux
cat /proc/sys/net/ipv4/tcp_syncookies
# 0 = disabled, 1 = enabled when backlog full, 2 = always enabled

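# Enable until the next reboot (run as root)
sysctl -w net.ipv4.tcp_syncookies=1
# To persist across reboots, add "net.ipv4.tcp_syncookies = 1" to /etc/sysctl.conf or a file in /etc/sysctl.d/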

# NGFW-level SYN flood protection
# Rate-limit SYN packets per source IP per second
# Drop SYN packets exceeding threshold (e.g., >100 SYN/sec from one IP)
# TCP proxy: NGFW completes handshake on behalf of server, only forwards verified connections
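The stateless idea behind SYN cookies can be illustrated with a deliberately simplified sketch. This is not the Linux algorithm: the bit layout (6-bit time counter, 2-bit MSS index, 24-bit keyed hash), the `SECRET`, the `MSS_TABLE` values and the function names are all assumptions made for the example.

```python
import hashlib
import struct

SECRET = b"example-secret"             # hypothetical per-boot server secret
MSS_TABLE = [536, 1220, 1440, 1460]    # cookie stores a 2-bit index into this table

def _mac24(src, sport, dst, dport, counter):
    """24-bit keyed hash over the 4-tuple and a coarse time counter."""
    msg = f"{src}:{sport}>{dst}:{dport}#{counter}".encode()
    return struct.unpack("!I", hashlib.sha256(SECRET + msg).digest()[:4])[0] & 0xFFFFFF

def make_cookie(src, sport, dst, dport, client_isn, mss, counter):
    """Build the server ISN for the SYN+ACK without storing any state."""
    mss_idx = max((i for i, m in enumerate(MSS_TABLE) if m <= mss), default=0)
    top = ((counter & 0x3F) << 2) | mss_idx
    low = (_mac24(src, sport, dst, dport, counter) + client_isn) & 0xFFFFFF
    return (top << 24) | low

def check_cookie(src, sport, dst, dport, client_isn, cookie, counter):
    """On the final ACK: cookie = ack - 1, client_isn = seq - 1. Returns MSS or None."""
    top, low = cookie >> 24, cookie & 0xFFFFFF
    for c in (counter, counter - 1):          # accept current or previous time slot
        if (c & 0x3F) != (top >> 2):
            continue
        if ((_mac24(src, sport, dst, dport, c) + client_isn) & 0xFFFFFF) == low:
            return MSS_TABLE[top & 0x3]
    return None

cookie = make_cookie("203.0.113.5", 40000, "192.0.2.10", 443, 1_000_000, 1460, counter=100)
```

A valid final ACK lets the server rebuild the MSS from the cookie alone; a stale or tampered cookie returns `None` and no connection state is ever created.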

TCP STATE MACHINE — 11 STATES, EVERY TRANSITION

🔄

TCP States — What Each Means

STATE MACHINE

A TCP connection moves through a well-defined sequence of states. Your NGFW must track the state of every TCP connection in its connection table — this is the essence of "stateful inspection". A packet that doesn't match expected state transitions is suspicious or malicious.

CLOSED: No connection. Initial and final state. No resources allocated.
LISTEN: Server waiting for incoming SYN. Socket bound and listening.
SYN_SENT: Client sent SYN, waiting for SYN+ACK from server.
SYN_RECEIVED: Server received SYN, sent SYN+ACK, waiting for client's ACK.
ESTABLISHED ✓: Full duplex connection open. Data transfer in progress. This is the normal operating state.
FIN_WAIT_1: This side sent FIN, waiting for ACK or FIN+ACK.
FIN_WAIT_2: Our FIN acknowledged. Waiting for remote FIN.
CLOSE_WAIT: Remote side closed. Waiting for local app to close its side.
CLOSING: Both sides sent FIN simultaneously. Waiting for ACK.
LAST_ACK: Passive close side sent FIN, waiting for final ACK.
TIME_WAIT: Both FINs ACKed. Wait 2×MSL before CLOSED. Prevents stale segment confusion.
🗺️

State Transitions — Full Diagram in Text

TRANSITIONS
/* CLIENT (active open) state transitions */
CLOSED
  → app calls connect()                    → SYN_SENT
  → SYN_SENT  + receive SYN+ACK, send ACK → ESTABLISHED
  → SYN_SENT  + receive SYN (simultaneous) → SYN_RECEIVED

/* SERVER (passive open) state transitions */
CLOSED
  → app calls listen()                     → LISTEN
  → LISTEN    + receive SYN, send SYN+ACK  → SYN_RECEIVED
  → SYN_RECEIVED + receive ACK             → ESTABLISHED

/* TEARDOWN — active close (initiating side) */
ESTABLISHED
  → app calls close(), send FIN            → FIN_WAIT_1
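  → FIN_WAIT_1 + receive ACK              → FIN_WAIT_2
  → FIN_WAIT_1 + receive FIN, send ACK    → CLOSING (simultaneous close)
  → CLOSING    + receive ACK              → TIME_WAIT
  → FIN_WAIT_2 + receive FIN, send ACK    → TIME_WAIT
  → TIME_WAIT  + 2*MSL timeout            → CLOSED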

/* TEARDOWN — passive close (receiving side) */
ESTABLISHED
  → receive FIN, send ACK                  → CLOSE_WAIT
  → CLOSE_WAIT + app calls close(), send FIN → LAST_ACK
  → LAST_ACK   + receive ACK               → CLOSED

/* RST — abortive close (any state) */
any state
  → receive RST or send RST                → CLOSED (immediately)

/* Check states on Linux */
ss -tn          # show TCP connections with states
ss -tn state established
ss -tn state time-wait | wc -l   # count TIME_WAIT connections
netstat -ant    # legacy net-tools alternative; Linux prints the protocol in lowercase, so "grep TCP" matches nothing

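⚠️ TIME_WAIT accumulation is a common production problem. Each connection in TIME_WAIT holds a socket for 2×MSL (RFC 793 suggests MSL = 2 minutes; Linux uses a fixed 60-second TIME_WAIT). A high-traffic host that actively closes 10,000 connections/second will hold about 600,000 TIME_WAIT sockets at 60 seconds each. When the closing side is a client or proxy connecting to the same destination IP and port, this can exhaust the ephemeral port range and prevent new connections. Mitigations: tcp_tw_reuse (Linux sysctl, outgoing connections only, requires TCP timestamps), connection pooling or keep-alive so fewer connections are closed, or widening net.ipv4.ip_local_port_range. SO_REUSEADDR lets a restarted server bind its listening port while old connections sit in TIME_WAIT, but it does not free ephemeral ports. Your NGFW must not confuse TIME_WAIT connections with malicious activity.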
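The state transitions above map naturally onto a lookup table keyed by (state, event). The sketch below is one way to express that; the event names (`app_connect`, `rcv_syn_ack`, and so on) are invented for this example, and treating a RST as an immediate move to CLOSED from any state is a simplification.

```python
TRANSITIONS = {
    ("CLOSED", "app_connect"): "SYN_SENT",
    ("CLOSED", "app_listen"): "LISTEN",
    ("LISTEN", "rcv_syn"): "SYN_RECEIVED",
    ("SYN_SENT", "rcv_syn_ack"): "ESTABLISHED",
    ("SYN_SENT", "rcv_syn"): "SYN_RECEIVED",        # simultaneous open
    ("SYN_RECEIVED", "rcv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "app_close"): "FIN_WAIT_1",
    ("ESTABLISHED", "rcv_fin"): "CLOSE_WAIT",
    ("FIN_WAIT_1", "rcv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "rcv_fin"): "CLOSING",           # simultaneous close
    ("FIN_WAIT_2", "rcv_fin"): "TIME_WAIT",
    ("CLOSING", "rcv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "app_close"): "LAST_ACK",
    ("LAST_ACK", "rcv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
}

def next_state(state, event):
    """Return the new state, or None if the event is not valid in this state."""
    if event == "rcv_rst":
        return "CLOSED"                 # simplification: RST aborts from any state
    return TRANSITIONS.get((state, event))

def run(events, state="CLOSED"):
    """Apply events in order; return the final state or None on an invalid step."""
    for event in events:
        state = next_state(state, event)
        if state is None:
            return None
    return state

client = run(["app_connect", "rcv_syn_ack", "app_close", "rcv_ack", "rcv_fin", "timeout_2msl"])
server = run(["app_listen", "rcv_syn", "rcv_ack", "rcv_fin", "app_close", "rcv_ack"])
```

A `None` result is the interesting case for stateful inspection: it marks a packet that no legitimate endpoint would send in that state.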

🔥

NGFW State Tracking — What to Watch For

NGFW

A stateful NGFW must track TCP state transitions and reject packets that violate them:

Anomaly | Flags | Why It's Suspicious | Action
SYN-ACK without prior SYN | SYN+ACK | No SYN seen: spoofed or session spliced | Drop + log
Data without ESTABLISHED | PSH+ACK, no connection entry | Injected data, blind injection attack | Drop
RST with wrong sequence number | RST | RST injection attack to terminate connections | Drop if seq out of window
FIN before ESTABLISHED | FIN | Port scan (FIN scan) or evasion attempt | Drop + log
SYN to non-listening port | SYN | Port scan | Drop (no server) or RST
Christmas tree packet | FIN+PSH+URG | Nmap Xmas scan: OS fingerprinting | Drop + alert
NULL scan | no flags | Nmap NULL scan: firewall evasion | Drop + alert
Overlapping segments | varies | IDS evasion: inconsistent reassembly | Reassemble + inspect
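Several rows of the table depend only on the flag byte, so they can be checked before any connection lookup. The sketch below shows one possible stateless pre-filter; `classify_flags` and its return labels are hypothetical, and the state-dependent rows (such as SYN-ACK without a prior SYN) still need the connection table.

```python
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def classify_flags(flags: int) -> str:
    """Label flag combinations that are invalid regardless of connection state."""
    flags &= 0x3F                        # ignore ECE/CWR for this check
    if flags == 0:
        return "null-scan"
    if flags & SYN and flags & FIN:
        return "syn-fin"                 # open and close in one segment
    if flags & SYN and flags & RST:
        return "syn-rst"
    if flags & (FIN | PSH | URG) == (FIN | PSH | URG):
        return "xmas-scan"
    if flags & FIN and not flags & ACK:
        return "fin-without-ack"         # typical of a FIN scan
    return "ok"

verdicts = {name: classify_flags(value) for name, value in {
    "syn": SYN, "xmas": FIN | PSH | URG, "null": 0, "fin_ack": FIN | ACK,
}.items()}
```

Packets labelled `ok` here are not necessarily legitimate; they simply pass on to the stateful checks.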

SEQUENCE NUMBERS — ORDERING, RELIABILITY, AND BYTE TRACKING

🔢

How Sequence Numbers Work

CORE CONCEPT

TCP numbers every byte it sends with a sequence number. This enables: (1) the receiver to detect missing bytes, (2) the receiver to reorder out-of-order segments, and (3) the sender to know exactly which bytes were received via the ACK number.

/* Example: sending "Hello World" (11 bytes) */
ISN = 1000  # chosen randomly at handshake

Segment 1: seq=1001  data="Hello" (5 bytes)   → covers bytes 1001-1005
Segment 2: seq=1006  data=" Worl" (5 bytes)   → covers bytes 1006-1010
Segment 3: seq=1011  data="d"     (1 byte)    → covers bytes 1011-1011

/* Receiver sends ACKs */
After Segment 1: ACK=1006  # "I have 1001-1005, send me 1006 next"
After Segment 2: ACK=1011  # "I have 1001-1010, send me 1011 next"
After Segment 3: ACK=1012  # "I have 1001-1011, send me 1012 next"

/* What if Segment 2 is lost? */
Receiver gets Segment 1:  ACK=1006  (normal)
Receiver gets Segment 3:  ACK=1006  (still 1006 — can't advance past gap!)
                          → This is a duplicate ACK, which signals a gap

/* Sequence number arithmetic is always modular (wraps at 2^32) */
/* Cast the unsigned difference to int32_t for correct comparison across the wrap */
int32_t diff = (int32_t)(seq_a - seq_b);
if (diff > 0) { /* seq_a is ahead of seq_b */ }

💡 SYN and FIN each consume one sequence number even though they carry no data. This is why the ACK after a SYN is ISN+1 (not ISN+0). It means both sides can unambiguously detect whether the connection control messages (SYN/FIN) were delivered.
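Python integers do not wrap, so the same modular comparison has to be written out explicitly. The helpers below (`seq_diff`, `seq_after`, `in_window` are illustrative names) are a sketch of how sequence numbers can be compared safely across the 2^32 wrap.

```python
MOD = 1 << 32

def seq_diff(a: int, b: int) -> int:
    """Signed distance from b to a in 32-bit sequence space."""
    d = (a - b) % MOD
    return d - MOD if d >= (1 << 31) else d

def seq_after(a: int, b: int) -> bool:
    """True if a is ahead of b, even when a has wrapped past zero."""
    return seq_diff(a, b) > 0

def in_window(seq: int, start: int, size: int) -> bool:
    """True if seq lies in [start, start + size), allowing for wraparound."""
    return (seq - start) % MOD < size

# A sender 16 bytes below the wrap point sends 21 more bytes.
old, new = 0xFFFFFFF0, 5
distance = seq_diff(new, old)
```

A plain `new > old` comparison would get this case wrong, because 5 is numerically smaller than 0xFFFFFFF0 even though it is 21 bytes further along the stream.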

📬

Cumulative vs Selective Acknowledgement

ACK MODES

TCP's basic ACK is cumulative — it acknowledges all bytes up to a point. This works well in the common case but is inefficient when segments arrive out of order:

/* Cumulative ACK — without SACK */
Sender sends:  seg[1000] seg[1500] seg[2000] seg[2500]
Network drops: seg[1500]
Receiver gets: seg[1000] ✓  ACK=1500
               seg[2000] ✓  ACK=1500  (still! — can't advance past 1500)
               seg[2500] ✓  ACK=1500  (still!)

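Without SACK: the sender learns only that seg[1500] is missing (three duplicate ACKs trigger a fast retransmit)
It cannot tell whether seg[2000] and seg[2500] arrived; after a timeout it may resend them needlessly (go-back-N-like behaviour)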

/* Selective ACK (SACK) — RFC 2018 */
Receiver gets: seg[1000] ✓  ACK=1500
               seg[2000] ✓  ACK=1500  SACK=[2000-2499]
               seg[2500] ✓  ACK=1500  SACK=[2000-2999]

With SACK: sender knows ONLY seg[1500] is missing
           retransmits ONLY seg[1500]
           receiver ACKs=3000 after receiving it → done

SACK enabled by: "SACK Permitted" option in SYN/SYN+ACK
Up to 4 SACK blocks per segment (each block = 2×32-bit seq numbers = 8 bytes); only 3 fit when the Timestamps option is also present
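From the sender's side, the cumulative ACK plus the SACK blocks define exactly which byte ranges are still missing. A minimal sketch of that bookkeeping follows; `missing_ranges` is a hypothetical helper, it uses half-open [left, right) blocks as RFC 2018 does, and it ignores sequence wraparound for clarity.

```python
def missing_ranges(cum_ack, sack_blocks):
    """Return [start, end) holes between the cumulative ACK and the highest SACKed byte."""
    holes, cursor = [], cum_ack
    for left, right in sorted(sack_blocks):
        if left > cursor:
            holes.append((cursor, left))   # nothing received in this gap
        cursor = max(cursor, right)
    return holes

# The example above: ACK=1500 with bytes 2000-2999 selectively acknowledged.
holes = missing_ranges(1500, [(2000, 3000)])
```

With this information the sender retransmits only bytes 1500 to 1999 rather than everything after the cumulative ACK.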

FLOW CONTROL — SLIDING WINDOW

🪟

The Sliding Window Mechanism

CONCEPT

Flow control prevents a fast sender from overwhelming a slow receiver's buffer. The receiver tells the sender exactly how much buffer space it has available via the Window Size field in every ACK. The sender must not have more than Window Size bytes of unacknowledged data in flight at any time.

The window "slides" forward as data is acknowledged — the sender's send window moves right as ACKs arrive, allowing more data to be sent.

📊

Sender's View of the Sequence Number Space

SEND BUFFER

The sender categorises its byte stream into four regions:

Sender:
  [ Sent + ACKed: already delivered ] [ Sent, not ACKed: in flight ] [ Can send: within window ] [ Cannot send yet: window full or no data ]
  Boundaries: SND.UNA (oldest unACKed byte) | SND.NXT (next byte to send) | SND.UNA + window (window edge)
Receiver:
  [ Received + delivered to application ] [ Received in-order: buffered, not read yet ] [ Out-of-order: buffered, gap before ] [ Available buffer = advertised window ]
  Boundaries: RCV.NXT (next expected byte) | RCV.NXT + RCV.WND (edge of advertised window)
/* Flow control in action */
Receiver has a 65,535-byte buffer (the largest unscaled window), app reads slowly:
  Initial window advertised: 65535 bytes

Sender sends 32,768 bytes → receiver buffers it, app hasn't read yet:
  Receiver advertises: Window = 65535 - 32768 = 32767 bytes

Sender sends another 20,000 bytes → receiver buffers:
  Receiver advertises: Window = 65535 - 52768 = 12767 bytes

Sender sends 12,000 bytes → buffer nearly full:
  Receiver advertises: Window = 767 bytes

App reads 40,000 bytes from buffer:
  Receiver advertises: Window = 40767 bytes   # window re-opens

/* Zero window — sender must stop */
Buffer completely full:
  Receiver advertises: Window = 0   # sender MUST stop sending data
  Sender starts Zero Window Probe timer
  Sender sends 1-byte probes periodically
  When receiver's app reads data → receiver sends Window Update ACK
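The receiver's side of this exchange amounts to simple bookkeeping: the advertised window is whatever buffer space is left. The class below is a toy model of that, with byte counts chosen for illustration; `ReceiveBuffer` and its method names are hypothetical and no real sockets are involved.

```python
class ReceiveBuffer:
    """Toy model of a TCP receive buffer and the window it advertises."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffered = 0

    @property
    def window(self) -> int:
        return self.capacity - self.buffered

    def on_segment(self, nbytes: int) -> int:
        """Buffer incoming data (never more than fits) and return the new window."""
        self.buffered += min(nbytes, self.window)
        return self.window

    def app_read(self, nbytes: int) -> int:
        """Application drains data; return the re-opened window."""
        self.buffered -= min(nbytes, self.buffered)
        return self.window

rb = ReceiveBuffer(65535)
advertised = [rb.on_segment(32768), rb.on_segment(20000), rb.on_segment(12000)]
reopened = rb.app_read(40000)
```

The window shrinks as data is buffered and re-opens only when the application reads, which is why a slow application, not a slow network, is the usual cause of a zero window.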

Window Scale Option — High-Bandwidth Networks

OPTIMIZATION

The Window Size field is 16 bits — maximum 65,535 bytes. On a 1 Gbps link with 10ms RTT, the bandwidth-delay product (BDP) is 1 Gbps × 0.01s = 1.25 MB. With only 64 KB in flight, the link is only 64KB/1250KB = 5% utilised. The Window Scale option (RFC 7323) solves this by multiplying the window by a power of 2:

/* Window Scale option in SYN */
Scale factor = 7  # window size is multiplied by 2^7 = 128
Effective max window = 65535 × 128 = 8,388,480 bytes (8 MB)

/* Both sides must negotiate it in SYN / SYN+ACK */
/* If one side doesn't include Window Scale in SYN, neither side uses scaling */

/* Check on Linux */
ss -tni | grep rcv_space   # shows receiver socket buffer size
sysctl net.ipv4.tcp_rmem   # min/default/max receive buffer, e.g. "4096 131072 6291456" (defaults vary by kernel version)
sysctl net.ipv4.tcp_wmem   # min/default/max send buffer
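The bandwidth-delay arithmetic above can be checked with a few lines of code. The sketch below computes the BDP and the smallest Window Scale shift that covers it; the function names are illustrative, and integer milliseconds are used to avoid floating-point rounding.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: bits per second x seconds / 8."""
    return bandwidth_bps * rtt_ms // 8000

def min_window_scale(target_window: int) -> int:
    """Smallest shift (0-14) for which a 16-bit window can reach target_window."""
    shift = 0
    while (65535 << shift) < target_window and shift < 14:
        shift += 1
    return shift

bdp = bdp_bytes(1_000_000_000, 10)      # 1 Gbps link, 10 ms RTT
shift = min_window_scale(bdp)
unscaled_utilisation = 65535 / bdp      # fraction of the link an unscaled window can fill
```

For the 1 Gbps, 10 ms example a shift of 5 is enough (65535 × 32 ≈ 2 MB), so the scale factor of 7 shown above leaves comfortable headroom.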

CONGESTION CONTROL — PROTECTING THE NETWORK

🌐

The Congestion Collapse Problem

MOTIVATION

Flow control protects the receiver. Congestion control protects the network. In October 1986, the internet experienced the first of a series of "congestion collapses": throughput on one path dropped by roughly a factor of 1,000 (from 32 kbps to 40 bps) because senders kept retransmitting lost packets, further overloading already-saturated routers. Van Jacobson's 1988 congestion control algorithms, standardised today in RFC 5681, address this: senders automatically reduce their sending rate when they detect packet loss.

TCP's congestion control maintains a Congestion Window (cwnd) — a sender-side limit on unacknowledged data in addition to the receiver's window. The effective window is: min(cwnd, receiver_window).

📈

Four Phases of TCP Congestion Control

ALGORITHM
Slow Start: cwnd starts at 1-10 MSS. Doubles every RTT (exponential growth). Continues until cwnd reaches ssthresh or a loss is detected.
Congestion Avoidance: cwnd grows by 1 MSS per RTT (linear). Cautious probing of available bandwidth until loss detected.
Fast Retransmit: 3 duplicate ACKs signal loss. Retransmit missing segment immediately without waiting for RTO timeout.
Fast Recovery: After fast retransmit: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS. Once the retransmitted data is ACKed, cwnd = ssthresh and the sender enters Congestion Avoidance (not Slow Start).
/* NewReno algorithm (most common baseline) */
/* State variables */
cwnd = 10 * MSS    # congestion window (starts at 10 MSS per RFC 6928)
ssthresh = 65535   # slow start threshold (initial: large value)

/* Slow Start phase */
on each ACK: cwnd += MSS          # doubles every RTT (exponential)
when cwnd >= ssthresh: → Congestion Avoidance

/* Congestion Avoidance phase */
on each ACK: cwnd += MSS² / cwnd  # +1 MSS per RTT (linear)

/* Packet loss detected by TIMEOUT */
ssthresh = max(cwnd / 2, 2*MSS)
cwnd = 1 MSS        # drastic reduction — restart Slow Start

/* Packet loss detected by 3 DUPLICATE ACKs (mild congestion) */
ssthresh = max(cwnd / 2, 2*MSS)
cwnd = ssthresh + 3*MSS   # smaller reduction: Fast Recovery (3 dup ACKs mean 3 segments have left the network)
# retransmit the missing segment immediately
# on the ACK that covers the retransmission: cwnd = ssthresh, then Congestion Avoidance (skip Slow Start)

/* Check congestion control algorithm in use */
sysctl net.ipv4.tcp_congestion_control   # typical: "cubic" or "bbr"
ss -tni dst :80 | grep cwnd              # see live cwnd for connections

💡 Modern algorithms, CUBIC and BBR: NewReno is the baseline. Linux defaults to CUBIC (RFC 8312, since replaced by RFC 9438), which uses a cubic function for window growth and recovers faster after loss on high-BDP links. Google's BBR (Bottleneck Bandwidth and RTT) is newer and model-based rather than loss-based: it estimates the bottleneck bandwidth and RTT instead of reacting to drops, which helps on paths with random, non-congestion loss (mobile, satellite). Understanding NewReno gives you the conceptual foundation: CUBIC refines the same loss-based approach, while BBR replaces the loss signal with a model of the path.
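The shape of the Reno-style window is easiest to see in a small simulation. The sketch below works in whole MSS units and advances one RTT per step; `simulate_cwnd` and its parameters are hypothetical, and it skips the temporary inflation during Fast Recovery, so it is a teaching model rather than a faithful implementation of RFC 5681.

```python
def simulate_cwnd(rtts, ssthresh=64, initial_cwnd=10, loss_at=(), timeout_at=()):
    """Return cwnd (in MSS) at the start of each RTT under a simple Reno-style model."""
    cwnd, trace = initial_cwnd, []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in timeout_at:                # RTO: collapse to 1 MSS, restart Slow Start
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif rtt in loss_at:                 # 3 dup ACKs: halve, continue in avoidance
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:                # Slow Start: double per RTT
            cwnd = min(cwnd * 2, ssthresh)
        else:                                # Congestion Avoidance: +1 MSS per RTT
            cwnd += 1
    return trace

fast_recovery = simulate_cwnd(8, ssthresh=64, loss_at={5})
after_timeout = simulate_cwnd(5, ssthresh=16, timeout_at={1})
```

Comparing the two traces shows why a timeout is so much more expensive than a fast retransmit: the first resumes from half the previous window, the second from a single segment.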

TCP TIMERS — RETRANSMISSION, KEEPALIVE, TIME-WAIT

⏱️

The Four TCP Timers

TIMERS
Timer | Trigger | Action on Expiry | Default Value
RTO (Retransmission Timeout) | Segment sent with no ACK received within RTO | Retransmit oldest unacknowledged segment. Double RTO (exponential backoff). Reduce cwnd (Slow Start). After the retry limit (Linux tcp_retries2, default 15) the connection is aborted with a timeout error. | Dynamically calculated from RTT (on Linux: min 200ms, max ~120s)
Persist (Zero Window) | Receiver advertises window=0 | Send 1-byte Zero Window Probe to check if window has opened. Exponential backoff. | Starts at RTO, doubles each probe
Keepalive | No data exchanged for keepalive idle time | Send TCP keepalive probe (typically an empty or 1-byte segment with seq=SND.NXT-1). If no response after N probes → close connection. | Idle: 7200s (2 hrs), Interval: 75s, Count: 9 probes (Linux defaults)
TIME_WAIT (2×MSL) | Connection enters TIME_WAIT state | After 2×MSL expires, move to CLOSED. Prevents stale segments from old connection being received by new connection with same 4-tuple. | RFC 793 suggests MSL = 2 minutes; Linux uses a fixed 60s TIME_WAIT that is not sysctl-tunable

RTO Calculation (RFC 6298) and Karn's Algorithm

/* RTT measurement and RTO calculation (RFC 6298) */

/* Measure RTT for each ACKed segment (not retransmitted ones: Karn's rule) */
/* First sample R: SRTT = R, RTTVAR = R / 2 */
/* Later samples: update RTTVAR first, using the old SRTT */
RTTVAR = 0.75 * RTTVAR + 0.25 * |SRTT - RTT_sample|  # RTT variance
SRTT = 0.875 * SRTT + 0.125 * RTT_sample    # smoothed RTT (EWMA)
RTO = SRTT + 4 * RTTVAR                      # RTO with safety margin
RTO = max(1 second, RTO)                     # RFC 6298 floor: 1 second (Linux uses 200 ms)

/* On RTO timeout: double the RTO (exponential backoff) */
RTO = RTO * 2   # until max (typically 120 seconds)

/* After a retransmission: take no RTT sample from the retransmitted segment */
# (Can't tell if the ACK is for the original or the retransmission: Karn's algorithm)
# The backed-off RTO stays in effect until a segment sent only once is ACKed and yields a fresh sample

/* Check on Linux */
ss -tni | grep rtt   # shows rtt:X/Y for established connections
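The RFC 6298 estimator is small enough to express as a class. The sketch below follows the RFC's update order (RTTVAR before SRTT); the class name `RtoEstimator`, the default bounds and the sample values are assumptions for illustration, and the caller is expected to apply Karn's rule by never passing in samples from retransmitted segments.

```python
class RtoEstimator:
    """Smoothed RTT and retransmission timeout in seconds, after RFC 6298."""
    ALPHA, BETA, K = 1 / 8, 1 / 4, 4

    def __init__(self, min_rto: float = 1.0, max_rto: float = 120.0):
        self.srtt = None
        self.rttvar = None
        self.min_rto, self.max_rto = min_rto, max_rto
        self.rto = 1.0                        # initial RTO before any sample

    def on_rtt_sample(self, r: float) -> None:
        if self.srtt is None:                 # first measurement
            self.srtt, self.rttvar = r, r / 2
        else:                                 # RTTVAR first, using the old SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + self.K * self.rttvar
        self.rto = min(max(rto, self.min_rto), self.max_rto)

    def on_timeout(self) -> None:
        self.rto = min(self.rto * 2, self.max_rto)   # exponential backoff

est = RtoEstimator(min_rto=0.2)     # 200 ms floor, as on Linux
est.on_rtt_sample(0.100)
est.on_rtt_sample(0.200)
```

After these two samples the smoothed RTT is 112.5 ms and the RTO about 362.5 ms; the variance term is what pushes the timeout well above the average RTT when samples are jittery.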
⚙️

Key TCP Tuning Parameters (Linux)

TUNING
# View all TCP-relevant sysctl parameters
sysctl -a | grep tcp

# Buffer sizes (affects window size and throughput)
sysctl net.ipv4.tcp_rmem    # receive: e.g. "4096 131072 6291456" (min/default/max; defaults vary by kernel version)
sysctl net.ipv4.tcp_wmem    # send:    e.g. "4096 16384 4194304"
sysctl net.core.rmem_max    # cap on buffers requested explicitly with SO_RCVBUF (does not limit tcp_rmem autotuning)

# Connection setup
sysctl net.ipv4.tcp_syn_retries      # SYN retransmit attempts (default 6)
sysctl net.ipv4.tcp_synack_retries   # SYN+ACK retransmit attempts (default 5)
sysctl net.ipv4.tcp_syncookies       # SYN flood protection
sysctl net.ipv4.tcp_max_syn_backlog  # max half-open connections per socket

# TIME_WAIT
sysctl net.ipv4.tcp_tw_reuse    # reuse TIME_WAIT sockets for new outgoing connections (needs TCP timestamps)
sysctl net.ipv4.tcp_fin_timeout # FIN_WAIT_2 timeout (default 60s)

# Keepalive
sysctl net.ipv4.tcp_keepalive_time     # idle time before probes (default 7200s)
sysctl net.ipv4.tcp_keepalive_intvl   # interval between probes (default 75s)
sysctl net.ipv4.tcp_keepalive_probes  # probe count before giving up (default 9)

# Congestion control
sysctl net.ipv4.tcp_congestion_control  # algorithm: cubic, bbr, reno
sysctl net.ipv4.tcp_sack                # SACK enabled (default 1)
sysctl net.ipv4.tcp_timestamps          # timestamps enabled (default 1)

TCP CONNECTION TEARDOWN — GRACEFUL AND ABORTIVE CLOSE

👋

Four-Way Graceful Close

TEARDOWN

TCP is full-duplex, so each direction must be closed independently. The graceful close uses four segments (or three if the passive side has nothing more to send and combines its FIN with the ACK of the first FIN):

Client (active close)                          Server (passive close)
1. Client sends FIN+ACK seq=m  →               client: → FIN_WAIT_1 | server: receive FIN → CLOSE_WAIT
2. Server sends ACK ack=m+1    ←               client: ACK received → FIN_WAIT_2, wait for server FIN...
3. Server sends FIN+ACK seq=n  ←               server: app closes → send FIN → LAST_ACK | client: receive FIN → TIME_WAIT (2×MSL)
4. Client sends ACK ack=n+1    →               server: ACK received → CLOSED

💡 Half-close: After sending FIN, the local side can no longer send data but can still receive data. The server may continue sending data (e.g., flushing a file) after acknowledging the client's FIN. This "half-closed" state (FIN_WAIT_2 on client, CLOSE_WAIT on server) persists until the server also sends its FIN.

RST — Abortive Close

RESET

A RST (Reset) segment immediately closes a connection without the graceful 4-way teardown. Any unsent or unread data is discarded and no TIME_WAIT is entered: the connection is gone instantly. RST is sent in three main situations:

  • Connection to closed port: server receives SYN or data for a port nothing is listening on → sends RST
  • Abortive close: application calls close() with SO_LINGER enabled and a linger time of 0 → kernel sends RST instead of FIN
  • Segment for a connection that no longer exists: for example the peer crashed and rebooted, then a segment for the old connection arrives → RST. (An out-of-window segment on a live, synchronised connection is answered with an ACK, not a RST.)
/* RST injection attack */
/* Attacker crafts RST segment with sequence number in receiver's window */
/* Target receives RST → connection terminated immediately */
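/* Historically a concern for long-lived BGP sessions (the 2004 "Slipping in the Window" analysis; see RFC 4953) */

/* Protection: RFC 5961 "Improving TCP's Robustness to Blind In-Window Attacks" */
/* Accept a RST only if seq == RCV.NXT exactly; if seq is elsewhere in [RCV.NXT, RCV.NXT + RCV.WND), reply with a challenge ACK; otherwise discard */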

/* NGFW RST injection for connection termination */
/* Some NGFWs send RST to both sides to terminate blacklisted connections */
/* Must spoof the correct source IP and use a valid in-window sequence number */

TCP IN AN NGFW — STATEFUL INSPECTION DEEP DIVE

🛡️

How a Stateful Firewall Tracks TCP

STATEFUL INSPECTION

A stateful firewall maintains a connection table (also called session table or conntrack table) — a hash table keyed by the 5-tuple, storing the connection's current state and sequence number tracking data.

/* Connection table entry (conntrack) */
typedef struct {
    /* 5-tuple key (stored in bihash) */
    ip4_address_t   src_ip, dst_ip;
    uint16_t        src_port, dst_port;
    uint8_t         proto;              /* 6 = TCP */

    /* TCP state tracking */
    tcp_state_t     state;              /* SYN_SENT, ESTABLISHED, etc. */
    uint32_t        client_isn;         /* client's initial sequence number */
    uint32_t        server_isn;         /* server's initial sequence number */
    uint32_t        client_next_seq;    /* expected next seq from client */
    uint32_t        server_next_seq;    /* expected next seq from server */
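    uint32_t        client_window;      /* client's advertised window (after scaling) */
    uint32_t        server_window;      /* server's advertised window (after scaling) */
    uint8_t         client_wscale;      /* Window Scale shift from client's SYN */
    uint8_t         server_wscale;      /* Window Scale shift from server's SYN+ACK */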

    /* Policy and metadata */
    uint32_t        policy_id;          /* which policy matched this flow */
    uint64_t        bytes_client;       /* bytes from client → server */
    uint64_t        bytes_server;       /* bytes from server → client */
    uint64_t        last_seen;          /* timestamp for idle timeout */
    uint8_t         app_id;             /* L7 application (from DPI) */
} conntrack_entry_t;

For every packet, the NGFW:

  1. Extracts the 5-tuple from IP + TCP headers
  2. Looks up the 5-tuple in the connection table (O(1) bihash lookup)
  3. If found: validates the packet against expected state (sequence numbers, flags) → allow, drop, or flag
  4. If not found: check if it's a valid new connection attempt (SYN only, SYN+ACK for asymmetric routing) → create new entry or drop
  5. Updates the connection entry (sequence numbers, bytes, last_seen)
  6. Applies policy (allow, drop, inspect for DPI)
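The lookup steps above can be sketched with an ordinary dictionary standing in for the bihash. Everything here is illustrative: `flow_key`, `Entry`, `process` and the module-level `table` are hypothetical names, only the "not found" rule for a plain SYN is modelled, and a real entry would also record which endpoint initiated the flow.

```python
from dataclasses import dataclass

def flow_key(src_ip, src_port, dst_ip, dst_port, proto=6):
    """Direction-independent key so both halves of a flow hit the same entry."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b, proto) if a <= b else (b, a, proto)

@dataclass
class Entry:
    state: str
    packets: int = 0

table = {}   # stand-in for the connection table

def process(src_ip, src_port, dst_ip, dst_port, flags):
    """Return 'allow' or 'drop' for one packet and update the table."""
    key = flow_key(src_ip, src_port, dst_ip, dst_port)
    entry = table.get(key)
    if entry is None:
        if flags != {"SYN"}:
            return "drop"                    # no entry and not a new connection attempt
        table[key] = Entry(state="SYN_SENT", packets=1)
        return "allow"
    entry.packets += 1                       # existing flow: update counters
    return "allow"

first = process("10.0.0.1", 40000, "198.51.100.7", 80, {"SYN"})
reply = process("198.51.100.7", 80, "10.0.0.1", 40000, {"SYN", "ACK"})
stray = process("10.0.0.9", 40001, "198.51.100.7", 80, {"ACK"})
```

Normalising the key means the SYN+ACK coming back from the server finds the entry created by the client's SYN, while a stray ACK with no entry is dropped.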
🔬

TCP Sequence Number Validation

SEQUENCE TRACKING

A sophisticated NGFW validates sequence numbers on every packet to detect injection attacks:

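#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohl, ntohs */

/* Validate incoming segment from client.
 * client_wscale is assumed to have been recorded from the client's SYN. */
bool validate_tcp_segment(conntrack_entry_t *ct,
                          tcp_header_t *tcp, uint32_t payload_len) {
    uint32_t seq     = ntohl(tcp->seq);
    uint32_t ack     = ntohl(tcp->ack_seq);
    /* The window field in a client segment is the client's own window,
     * so it is scaled with the client's factor */
    uint32_t win     = (uint32_t)ntohs(tcp->window) << ct->client_wscale;

    /* Check 1: segment must fit in the server's receive window */
    /* [client_next_seq, client_next_seq + server_window) */
    int32_t seq_delta = (int32_t)(seq - ct->client_next_seq);
    if (seq_delta < 0 ||
        (uint64_t)seq_delta + payload_len > ct->server_window) {
        /* Out-of-window segment, could be injected. Note: this strict lower
         * bound also rejects genuine retransmissions; production code
         * usually tolerates recently ACKed data. */
        return false;
    }

    /* Check 2: ACK must not cover data the server has not sent yet */
    if (tcp->ack && (int32_t)(ack - ct->server_next_seq) > 0) {
        return false;   /* ACKing data we haven't sent */
    }

    /* Check 3: flags match expected state */
    if (ct->state == TCP_ESTABLISHED) {
        if (tcp->syn && !tcp->rst)
            return false;   /* SYN in ESTABLISHED is suspicious */
    }

    ct->client_window = win;   /* remember the client's latest advertised window */
    return true;
}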

MSS Clamping — Preventing Fragmentation

MSS CLAMPING

When a TCP connection passes through a firewall or VPN that reduces the effective MTU (e.g., PPPoE reduces MTU from 1500 to 1492, VPN adds header overhead), packets larger than the new MTU need to be fragmented — or dropped if DF=1. MSS clamping rewrites the MSS option in SYN/SYN+ACK segments to force both sides to use smaller segments that fit without fragmentation.

/* MSS clamping — rewrite MSS option in SYN segments */
/* Called "TCP MSS clamping" — applied on SYN and SYN+ACK */

Original SYN: MSS=1460 (assuming Ethernet MTU=1500, IP hdr=20, TCP hdr=20)
PPPoE link MTU: 1492 bytes
New MSS: 1492 - 20 (IP) - 20 (TCP) = 1452

NGFW rewrites MSS=1460 → MSS=1452 in the SYN before forwarding
Both sides now use 1452-byte segments → no fragmentation needed

/* Linux iptables MSS clamping */
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

/* In VPP (your data plane) */
# This would be implemented in your TCP normalisation plugin
# Find TCP Options in SYN segment, locate MSS option (Kind=2),
# compare with interface MTU, rewrite if MSS > (MTU - 40)
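As a rough sketch of that rewrite step (in Python for readability rather than as VPP plugin code), the function below walks the options area of a SYN and lowers the MSS value if it exceeds what the link can carry. `clamp_mss` is a hypothetical helper and assumes IPv4 with 20-byte IP and TCP headers; a real data plane must also update the TCP checksum after changing the option.

```python
import struct

def clamp_mss(options: bytes, link_mtu: int) -> bytes:
    """Return a copy of the TCP options with MSS lowered to link_mtu - 40 if needed."""
    max_mss = link_mtu - 40                 # 20 bytes IPv4 header + 20 bytes TCP header
    out, i = bytearray(options), 0
    while i < len(out):
        kind = out[i]
        if kind == 0:                       # End of Option List
            break
        if kind == 1:                       # No-Operation
            i += 1
            continue
        if i + 1 >= len(out) or out[i + 1] < 2 or i + out[i + 1] > len(out):
            break                           # malformed: leave the rest untouched
        if kind == 2 and out[i + 1] == 4:   # MSS option: kind=2, length=4
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > max_mss:
                struct.pack_into("!H", out, i + 2, max_mss)
        i += out[i + 1]
    return bytes(out)

# SYN options: MSS=1460, NOP, Window Scale=7, forwarded over a PPPoE link (MTU 1492)
syn_options = bytes([2, 4]) + struct.pack("!H", 1460) + bytes([1, 3, 3, 7])
clamped = clamp_mss(syn_options, link_mtu=1492)
```

Only the two MSS value bytes change; every other option is passed through as received.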
LAB 1

Capture and Decode a Complete TCP Lifecycle

Objective: Capture a full TCP session — handshake, data transfer, and teardown — and decode every flag, sequence number, ACK, and window size in Wireshark. Understand the connection from first SYN to last ACK.

1
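Start Wireshark capture on your interface. Run: curl http://example.com. Stop capture. Filter: ip.addr == 93.184.216.34 to isolate the example.com conversation. The address can change over time, so confirm it with dig +short example.com and substitute the result if it differs.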
2
Handshake analysis: Find the SYN packet. Record: Sequence Number (ISN), Window Size, MSS option, SACK Permitted option, Window Scale option. Find SYN+ACK: verify ACK = client ISN + 1. Find the final ACK: verify ACK = server ISN + 1.
3
Data transfer analysis: Find the HTTP GET request packet. Record: Flags (PSH+ACK), Sequence Number, payload length. Find the HTTP response: record sequence numbers of the first and last response segment. Use Wireshark's "Follow TCP Stream" to see the full conversation.
4
Teardown analysis: Find the FIN+ACK from one side, the ACK reply, then the FIN+ACK from the other side, and the final ACK. Identify which side initiated the close. Look for TIME_WAIT: run ss -tn state time-wait immediately after curl — you may catch the socket in TIME_WAIT.
5
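Sequence number graph: In Wireshark, go to Statistics → TCP Stream Graphs → Time/Sequence (Stevens). The curve shows sequence numbers over time: steep growth during slow start, steadier growth afterwards, and any retransmissions as points that repeat an earlier sequence number. A transfer as short as this one may show no loss at all. To provoke retransmissions, add delay and loss (replace eth0 with your interface): sudo tc qdisc add dev eth0 root netem delay 100ms loss 2%, then curl again. Remove the rule afterwards with sudo tc qdisc del dev eth0 root.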
6
Check the connection state machine using: ss -tn (during the connection) — observe ESTABLISHED state. Check ss -tn state time-wait after connection closes. Map each ss state to the TCP state diagram in Tab 3.
LAB 2

Simulate TCP Attacks and Defences with Scapy

Objective: Use Scapy to craft malformed TCP segments and observe how Linux handles them. Understand SYN flood, RST injection, and invalid flag combinations.

1
SYN flood simulation (on loopback — safe): from scapy.all import *, then send 100 SYNs with random source IPs to a closed port: for i in range(100): send(IP(src=RandIP(), dst="127.0.0.1")/TCP(sport=RandShort(), dport=9999, flags="S"), verbose=0). Capture with tcpdump -i lo -n 'port 9999'. What does the server return for a closed port?
2
Flag anomaly detection: Start a listening server first: nc -l 12345. Then send a packet with all six flags set (a superset of the classic FIN+PSH+URG "Christmas tree" scan) to that port and observe: send(IP(dst="127.0.0.1")/TCP(dport=12345, flags="FSRPAU"), verbose=1). Does the server accept it? What does the Linux kernel do with it?
3
SYN cookies demo: Enable SYN cookies: sudo sysctl net.ipv4.tcp_syncookies=2 (2 sends cookies for every SYN; 1 sends them only once the SYN backlog overflows). Start nc -l 8888. Send 500 SYNs from random IPs to port 8888. Monitor the connection backlog: ss -tn state syn-recv | wc -l (the count includes the header line). With syncookies=2, the backlog should not grow indefinitely.
4
Build a mini port scanner: Write a Python script using Scapy that sends SYN to ports 1-1024 on localhost and records which ports return SYN+ACK (open) vs RST (closed) vs no response (filtered). This is the same half-open technique that Nmap's SYN scan (-sS) uses.
5
Analyse the output: Run your port scanner against your local machine. Cross-reference with ss -tlnp (listening TCP ports). Every port showing SYN+ACK in your scan should match a listening service. Ports showing RST are closed. Understand why firewall-filtered ports show no response.
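
The flag anomalies exercised in this lab can also be expressed as a small classifier of the kind a firewall might run per packet. The sketch below is illustrative only: tcp_flags_suspicious and the TCPF_ constants are hypothetical names, and the rule list covers a few well-known combinations rather than everything a production NGFW checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits in byte 13 of the TCP header (names are local to this sketch) */
#define TCPF_FIN 0x01
#define TCPF_SYN 0x02
#define TCPF_RST 0x04
#define TCPF_PSH 0x08
#define TCPF_ACK 0x10
#define TCPF_URG 0x20

/* Hypothetical check: true for flag combinations that a normal TCP
 * stack does not send and that scanners commonly use. */
static bool tcp_flags_suspicious(uint8_t flags)
{
    flags &= 0x3F;                        /* ignore the ECE/CWR bits */
    if (flags == 0)
        return true;                      /* NULL scan: no flags at all */
    if ((flags & TCPF_SYN) && (flags & (TCPF_FIN | TCPF_RST)))
        return true;                      /* SYN combined with FIN or RST */
    if ((flags & TCPF_FIN) && !(flags & TCPF_ACK))
        return true;                      /* bare FIN, including FIN+PSH+URG */
    return false;
}

/* Usage: the packet from step 2 versus ordinary handshake and data flags */
static void tcp_flags_demo(void)
{
    assert(tcp_flags_suspicious(0x3F));   /* Scapy flags="FSRPAU" */
    assert(tcp_flags_suspicious(TCPF_FIN | TCPF_PSH | TCPF_URG));
    assert(!tcp_flags_suspicious(TCPF_SYN));
    assert(!tcp_flags_suspicious(TCPF_SYN | TCPF_ACK));
    assert(!tcp_flags_suspicious(TCPF_PSH | TCPF_ACK));
}
```

Checking flags before any connection lookup keeps obviously invalid packets out of the state table.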
LAB 3

Write a TCP Connection Tracker in C

Objective: Implement a simplified TCP state machine tracker using libpcap. This is the core of what a stateful firewall does — track each connection through its state transitions based on observed TCP flags.

1
Install libpcap: sudo apt install libpcap-dev. Create tcp_tracker.c. Define a connection table as a simple array of structs with fields: src_ip, dst_ip, src_port, dst_port, state (enum: SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT, CLOSED), last_seen.
2
Use pcap to capture TCP packets: pcap_open_live("eth0", 65535, 1, 1000, errbuf). Set filter: pcap_compile + pcap_setfilter with filter string "tcp". In the packet handler, parse Ethernet → IP → TCP headers manually using byte offsets.
3
Implement state transitions: if SYN-only → create new entry with state=SYN_SENT; if SYN+ACK → find matching entry (reversed 5-tuple), update to state=SYN_RECEIVED; if ACK after SYN+ACK → state=ESTABLISHED; if FIN → state=FIN_WAIT; after second FIN+ACK → state=CLOSED, remove entry.
4
Print a summary every second: number of connections in each state (SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT, CLOSED), total connections seen, connections per second. Run it while browsing the web or downloading a file and watch the ESTABLISHED count grow and shrink.
5
Bonus — Add anomaly detection: Log a warning when you see: (a) SYN+ACK without a prior SYN in the table, (b) RST with a sequence number outside the expected window, (c) data segments before ESTABLISHED state, (d) more than 5 SYNs per second from the same source IP.
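
The transitions in step 3 can be sketched as a pure function from the current state and the observed flags to the next state. The names below (conn_state, next_state, the TCPF_ constants) are hypothetical. The sketch ignores packet direction, so a retransmitted FIN would be counted as the second FIN, and the RST rule is an addition that step 3 does not list.

```c
#include <assert.h>
#include <stdint.h>

#define TCPF_FIN 0x01
#define TCPF_SYN 0x02
#define TCPF_RST 0x04
#define TCPF_PSH 0x08
#define TCPF_ACK 0x10

enum conn_state {
    ST_NONE,           /* no table entry yet */
    ST_SYN_SENT,
    ST_SYN_RECEIVED,
    ST_ESTABLISHED,
    ST_FIN_WAIT,
    ST_CLOSED
};

/* Hypothetical transition function following step 3 of this lab. */
static enum conn_state next_state(enum conn_state cur, uint8_t flags)
{
    int syn = (flags & TCPF_SYN) != 0;
    int ack = (flags & TCPF_ACK) != 0;
    int fin = (flags & TCPF_FIN) != 0;

    /* Assumption beyond step 3: an RST tears down any tracked connection */
    if ((flags & TCPF_RST) && cur != ST_NONE)
        return ST_CLOSED;

    switch (cur) {
    case ST_NONE:         return (syn && !ack) ? ST_SYN_SENT     : ST_NONE;
    case ST_SYN_SENT:     return (syn && ack)  ? ST_SYN_RECEIVED : ST_SYN_SENT;
    case ST_SYN_RECEIVED: return (ack && !syn) ? ST_ESTABLISHED  : ST_SYN_RECEIVED;
    case ST_ESTABLISHED:  return fin           ? ST_FIN_WAIT     : ST_ESTABLISHED;
    case ST_FIN_WAIT:     return fin           ? ST_CLOSED       : ST_FIN_WAIT;
    default:              return cur;
    }
}

/* Usage: replay the flags of one complete connection lifecycle */
static void tracker_demo(void)
{
    const uint8_t seen[] = {
        TCPF_SYN, TCPF_SYN | TCPF_ACK, TCPF_ACK,   /* handshake */
        TCPF_PSH | TCPF_ACK,                       /* data */
        TCPF_FIN | TCPF_ACK, TCPF_ACK,             /* first FIN and its ACK */
        TCPF_FIN | TCPF_ACK                        /* second FIN */
    };
    enum conn_state s = ST_NONE;
    for (unsigned i = 0; i < sizeof seen / sizeof seen[0]; i++)
        s = next_state(s, seen[i]);
    assert(s == ST_CLOSED);
}
```

Keeping the transition logic free of pcap and table code makes it easy to test on its own before wiring it into the packet handler.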

M05 MASTERY CHECKLIST

When complete: Move to M06 - UDP and ICMP. You now have deep TCP knowledge. M06 is shorter, because UDP has almost no complexity by design, but understanding UDP's simplicity (and its implications for NGFW) is essential before moving to DNS (M07), which runs primarily over UDP, and HTTP (M08), whose newest version, HTTP/3, runs over QUIC on top of UDP.
