WHY STATEFUL INSPECTION — THE FOUNDATION OF NGFW
Stateless vs Stateful Packet Filtering
The first generation of firewalls was stateless: each packet was evaluated in isolation. A rule "allow TCP dst-port 80" permits any TCP packet with that destination port, including crafted RSTs, out-of-sequence data, and attack payloads. The attacker simply sets dst_port=80 and everything passes.
Stateful inspection tracks the state of every network flow. A packet is only forwarded if it represents a valid state transition in a known, existing connection — a SYN-ACK without a preceding SYN is dropped, a RST from a server with no known session is dropped, data from an address that hasn't completed the handshake is dropped.
Stateless Filter Weaknesses
- Cannot distinguish new vs established connections
- Cannot detect TCP sequence number anomalies
- Cannot track UDP "sessions" (no inherent state)
- Cannot enforce return-traffic matching
- Easily evaded by packets that match rule syntax but violate protocol semantics
Stateful Inspection Advantages
- Validates every packet against valid protocol state
- Automatically permits return traffic for established connections
- Detects SYN floods, RST injection, sequence attacks
- Enables per-flow policy: inspect, permit, deny based on full session context
- Foundation for all NGFW L7 features: app ID, DPI, TLS inspection, QoS
The Five-Tuple — Unique Flow Identifier
Every TCP/UDP flow is uniquely identified by five fields. The conntrack engine uses this five-tuple as the hash key to locate the flow's state entry in the session table.
/* Five-tuple: the universal flow key */
src_ip — Source IP address (4B IPv4, 16B IPv6)
dst_ip — Destination IP address
src_port — Source TCP/UDP port (2B)
dst_port — Destination TCP/UDP port
protocol — IP protocol (6=TCP, 17=UDP, 1=ICMP, 50=ESP)
/* CRITICAL: both directions of a flow must hash to the same entry */
/* Forward: (client_ip, client_port, server_ip, server_port, proto) */
/* Reverse: (server_ip, server_port, client_ip, client_port, proto) */
/* Solution: canonical form — sort so "smaller" endpoint is always first */
typedef struct {
uint8_t src_ip[16]; /* zero-pad IPv4 to 16B for uniform handling */
uint8_t dst_ip[16];
uint16_t src_port;
uint16_t dst_port;
uint8_t proto;
uint8_t pad[3]; /* 40 bytes total — fits 2x AVX2 registers */
} flow_key_t;
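/* Swap endpoints into canonical order. is_initiator is optional (may be NULL). */
/* It only reports whether the packet was already in canonical order; the */
/* session must record which orientation its first packet had in order to */
/* tell initiator from responder. */
void normalise_key(flow_key_t *k, int *is_initiator) {
int swapped = 0;
int cmp = memcmp(k->src_ip, k->dst_ip, 16);
if (cmp > 0 || (cmp == 0 && k->src_port > k->dst_port)) {
/* swap to canonical form */
uint8_t tmp[16]; memcpy(tmp, k->src_ip, 16);
memcpy(k->src_ip, k->dst_ip, 16);
memcpy(k->dst_ip, tmp, 16);
uint16_t p = k->src_port; k->src_port = k->dst_port; k->dst_port = p;
swapped = 1;
}
if (is_initiator)
*is_initiator = !swapped; /* 1 = packet was already in canonical order */
}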
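The property that matters is that both directions of a flow end up with byte-identical keys. The following self-contained sketch checks that for one IPv4 flow. The `demo_` names are hypothetical and the key layout simply mirrors `flow_key_t` above, with the IPv4 address assumed to sit in the last four bytes of the 16-byte field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  src_ip[16];
    uint8_t  dst_ip[16];
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
    uint8_t  pad[3];
} demo_key_t;

/* Build a key from IPv4 addresses given as 4 bytes in network order. */
static void demo_key_v4(demo_key_t *k, const uint8_t sip[4], uint16_t sport,
                        const uint8_t dip[4], uint16_t dport, uint8_t proto) {
    assert(k != NULL);
    memset(k, 0, sizeof(*k));   /* zero padding so memcmp and hashing are stable */
    memcpy(&k->src_ip[12], sip, 4);
    memcpy(&k->dst_ip[12], dip, 4);
    k->src_port = sport;
    k->dst_port = dport;
    k->proto = proto;
}

/* Put the "smaller" endpoint first; returns 1 if the endpoints were swapped. */
static int demo_canonicalise(demo_key_t *k) {
    int cmp = memcmp(k->src_ip, k->dst_ip, 16);
    if (cmp > 0 || (cmp == 0 && k->src_port > k->dst_port)) {
        uint8_t tmp[16];
        memcpy(tmp, k->src_ip, 16);
        memcpy(k->src_ip, k->dst_ip, 16);
        memcpy(k->dst_ip, tmp, 16);
        uint16_t p = k->src_port;
        k->src_port = k->dst_port;
        k->dst_port = p;
        return 1;
    }
    return 0;
}

/* Returns 1 if forward and reverse packets map to the same canonical key
 * and exactly one of them needed a swap. */
static int demo_directions_match(void) {
    const uint8_t a[4] = {10, 0, 0, 5}, b[4] = {203, 0, 113, 7};
    demo_key_t fwd, rev;
    demo_key_v4(&fwd, a, 51000, b, 80, 6);   /* client -> server */
    demo_key_v4(&rev, b, 80, a, 51000, 6);   /* server -> client */
    int s1 = demo_canonicalise(&fwd);
    int s2 = demo_canonicalise(&rev);
    return memcmp(&fwd, &rev, sizeof(fwd)) == 0 && s1 != s2;
}
```

The swap flag differs between the two directions, which is what lets the session distinguish them after the shared lookup.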
/* For ICMP: no ports, so encode the echo identifier in the port fields */
/* Echo request/reply: key on the identifier; the type (8 vs 0) differs per */
/* direction and must not make the two keys diverge */
/* ICMP error: extract inner IP header five-tuple from payload */
SESSION TABLE — DESIGN FOR CORRECTNESS AND SPEED
Session Entry Structure — Every Field Explained
typedef struct session {
/* ── Lookup key (must match hash table key size) ─── */
flow_key_t key; /* canonical five-tuple, 40B */
/* ── Protocol state ─────────────────────────────── */
uint8_t proto; /* IPPROTO_TCP, _UDP, _ICMP */
uint8_t tcp_state; /* TCP_S_SYN_SENT .. TCP_S_CLOSED */
uint8_t flags; /* F_NAT, F_TLS_INSPECT, F_LOG */
/* ── TCP sequence tracking ───────────────────────── */
uint32_t fwd_seq; /* highest seq seen (forward) */
uint32_t rev_seq; /* highest seq seen (reverse) */
uint32_t fwd_ack; /* last ACK seen (forward) */
uint32_t rev_ack;
uint32_t fwd_window; /* advertised receive window */
uint32_t rev_window;
uint8_t fwd_wscale; /* window scale factor (TCP option) */
uint8_t rev_wscale;
/* ── Timestamps ──────────────────────────────────── */
uint64_t created_ns; /* session birth (CLOCK_MONOTONIC) */
uint64_t last_seen_ns; /* last packet timestamp */
uint32_t timeout_s; /* per-protocol inactivity timeout */
uint32_t expire_tick; /* timer wheel bucket */
/* ── Traffic counters ────────────────────────────── */
uint64_t fwd_pkts;
uint64_t fwd_bytes;
uint64_t rev_pkts;
uint64_t rev_bytes;
uint32_t rst_inject_count; /* out-of-window RSTs dropped by sequence validation */
/* ── NAT rewrite (populated if F_NAT set) ────────── */
uint32_t nat_src_ip; /* translated src IP (SNAT) */
uint16_t nat_src_port; /* translated src port (NAPT) */
uint32_t nat_dst_ip; /* translated dst IP (DNAT) */
uint16_t nat_dst_port; /* translated dst port (DNAT) */
/* ── Application layer ───────────────────────────── */
uint16_t app_id; /* identified app: HTTP=1, TLS=2, DNS=3... */
uint8_t risk_score; /* 0–100 threat score from DPI */
void *dpi_state; /* DPI engine per-flow state machine */
/* ── Policy result (cached after first policy eval) ─ */
uint32_t policy_id; /* matched ACL rule index */
uint8_t action; /* ACTION_PERMIT, _DENY, _INSPECT */
uint8_t ssl_inspect; /* 1 = TLS MITM proxy active */
/* ── Linked list for timer wheel ─────────────────── */
struct session *tw_next;
struct session *tw_prev;
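} __attribute__((packed)) session_t; /* packed saves space but leaves the 32/64-bit fields unaligned; reordering fields is the alternative */
/* Target size: ≤ 256 bytes per session */
/* 1M sessions × 256B = 256MB: larger than the L3 cache of most server CPUs, */
/* so keep the fields touched per packet in the first cache line */
/* Typical enterprise NGFW: 500K–4M concurrent sessions */
Hash Table and Memory Pool — Lock-Free Design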
/* Option 1: VPP clib_bihash_48_8 (production choice) */
clib_bihash_48_8_t session_table;
clib_bihash_init_48_8(&session_table, "sessions", 1<<20, 512<<20);
/* Lookup */
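clib_bihash_kv_48_8_t kv;
memset(&kv, 0, sizeof(kv)); /* flow_key_t is 40B; the remaining 8 key bytes must be zero */
memcpy(kv.key, &flow_key, sizeof(flow_key_t)); /* copying 48B would read past the key */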
if (!clib_bihash_search_48_8(&session_table, &kv, &kv)) {
session_t *s = pool_elt_at_index(session_pool, (uint32_t)kv.value);
/* fast path: cached policy, cached NAT mapping */
}
/* Option 2: DPDK rte_hash with lock-free concurrent access */
struct rte_hash_parameters p = {
.name = "sessions",
.entries = 1<<20,
.key_len = sizeof(flow_key_t),
.hash_func = rte_jhash,
.extra_flag = RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF,
.socket_id = rte_socket_id(),
};
struct rte_hash *ht = rte_hash_create(&p);
/* Bulk lookup: 8 flows per call */
const void *keys[8] = {&k0,&k1,&k2,&k3,&k4,&k5,&k6,&k7};
int32_t positions[8];
int rc = rte_hash_lookup_bulk(ht, keys, 8, positions); /* 0 on success, -EINVAL on bad arguments */
/* positions[i] >= 0 → flow i found, value is the session index; -ENOENT → miss */
/* rte_hash_lookup_bulk_data() is the variant that reports a hit mask */
/* Memory pool: pre-allocated array, no malloc per session */
session_t *session_pool = rte_malloc_socket("sessions",
MAX_SESSIONS * sizeof(session_t), 64, rte_eth_dev_socket_id(0));
/* Per-NUMA-socket pools: each socket has its own pool and hash table */
/* Workers pinned to same socket as their NIC → zero cross-NUMA memory */
💡 Two critical performance design choices: (1) Use a pre-allocated memory pool indexed by integer: malloc/free per session causes heap fragmentation and cache thrashing at scale. (2) Use per-NUMA-socket tables with per-CPU-core session creation to eliminate cross-socket locking entirely. VPP's clib_bihash is designed exactly for this access pattern.
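The first of those choices can be sketched as an array of fixed-size slots with a free list threaded through the unused ones. This is a minimal, single-threaded illustration, assuming one pool per worker core so that no locking is needed. The `pool_` names and the 256-byte slot are hypothetical, and `calloc` stands in for the NUMA-aware allocator a real data plane would use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_INVALID UINT32_MAX

typedef struct {
    uint32_t next_free;      /* free-list link, meaningful only while free */
    uint8_t  payload[252];   /* stand-in for the session fields */
} pool_slot_t;

typedef struct {
    pool_slot_t *slots;
    uint32_t capacity;
    uint32_t free_head;      /* index of first free slot, or POOL_INVALID */
    uint32_t in_use;
} session_pool_t;

/* Allocate all slots up front and chain them into the free list. */
static int pool_init(session_pool_t *p, uint32_t capacity) {
    p->slots = calloc(capacity, sizeof(pool_slot_t));
    if (!p->slots)
        return -1;
    for (uint32_t i = 0; i < capacity; i++)
        p->slots[i].next_free = (i + 1 < capacity) ? i + 1 : POOL_INVALID;
    p->capacity = capacity;
    p->free_head = 0;
    p->in_use = 0;
    return 0;
}

/* O(1): pop the head of the free list; returns POOL_INVALID when full. */
static uint32_t pool_alloc(session_pool_t *p) {
    uint32_t idx = p->free_head;
    if (idx == POOL_INVALID)
        return POOL_INVALID;
    p->free_head = p->slots[idx].next_free;
    p->in_use++;
    return idx;
}

/* O(1): push the slot back onto the free list. */
static void pool_free(session_pool_t *p, uint32_t idx) {
    assert(idx < p->capacity);
    p->slots[idx].next_free = p->free_head;
    p->free_head = idx;
    p->in_use--;
}
```

The 32-bit index returned by `pool_alloc` is what would be stored as the hash-table value, so a lookup resolves to `&pool.slots[idx]` without any pointer stored in the table.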
Timer Wheel — O(1) Session Expiry
| Protocol / State | Timeout | Rationale |
|---|---|---|
| TCP SYN (half-open) | 10–30s | SYN flood defence — expire incomplete handshakes quickly |
| TCP ESTABLISHED | 3600–86400s | Long-lived HTTP/2, SSH; keepalives refresh timeout |
| TCP FIN_WAIT / TIME_WAIT | 120–240s | RFC 793 minimum 2×MSL; allow delayed packets |
| UDP generic | 30–60s | Most UDP is short request-response |
| UDP DNS | 10–20s | Single exchange; keep short |
| UDP QUIC | 120–300s | Connection migration; mobile clients change IPs |
| ICMP Echo | 10–30s | Ping timeout window |
| ICMP Error | 5s | One-shot messages; no reply expected |
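One way to turn the table into code is a small selector that picks a timeout from the protocol, destination port and connection phase. The sketch below is illustrative only: the function and enum names are hypothetical, each value is one point chosen from the ranges in the table, and treating UDP port 443 as QUIC is an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    TCP_PHASE_HALF_OPEN,     /* SYN seen, handshake incomplete */
    TCP_PHASE_ESTABLISHED,
    TCP_PHASE_CLOSING        /* FIN_WAIT / TIME_WAIT */
} tcp_phase_t;

/* Returns the inactivity timeout in seconds for a session.
 * phase is only read for TCP; icmp_is_error only for ICMP. */
static uint32_t demo_timeout_s(uint8_t proto, uint16_t dst_port,
                               tcp_phase_t phase, int icmp_is_error) {
    assert(phase <= TCP_PHASE_CLOSING);
    switch (proto) {
    case 6:   /* TCP */
        if (phase == TCP_PHASE_HALF_OPEN) return 30;
        if (phase == TCP_PHASE_CLOSING)   return 120;
        return 3600;
    case 17:  /* UDP */
        if (dst_port == 53)  return 10;    /* DNS */
        if (dst_port == 443) return 120;   /* assumed to be QUIC */
        return 30;
    case 1:   /* ICMP */
        return icmp_is_error ? 5 : 10;
    default:
        return 30;  /* conservative default for other protocols */
    }
}
```

The result would be passed to the timer wheel each time a packet refreshes the session, so a TCP flow moves from the short half-open timeout to the long established one as soon as the handshake completes.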
/* Timer wheel — O(1) insert and expire */
/* 65536 buckets with 1-second resolution = ~18 hours of range */
#define WHEEL_SIZE 65536
typedef struct {
session_t *buckets[WHEEL_SIZE]; /* linked list head per bucket */
uint32_t current_tick;
} timer_wheel_t;
void session_refresh(session_t *s, timer_wheel_t *tw, uint32_t timeout_s) {
/* Remove from current bucket; fix the bucket head if s was first in its list */
uint32_t old_b = s->expire_tick & (WHEEL_SIZE - 1);
if (s->tw_prev) s->tw_prev->tw_next = s->tw_next;
else if (tw->buckets[old_b] == s) tw->buckets[old_b] = s->tw_next;
if (s->tw_next) s->tw_next->tw_prev = s->tw_prev;
/* Insert into new bucket */
s->expire_tick = tw->current_tick + timeout_s;
uint32_t b = s->expire_tick & (WHEEL_SIZE - 1);
s->tw_next = tw->buckets[b];
s->tw_prev = NULL;
if (tw->buckets[b]) tw->buckets[b]->tw_prev = s;
tw->buckets[b] = s;
}
/* Called once per second; the caller must serialise this against session_refresh() */
void timer_wheel_advance(timer_wheel_t *tw, session_table_t *st) {
tw->current_tick++;
uint32_t b = tw->current_tick & (WHEEL_SIZE - 1);
session_t *s = tw->buckets[b];
tw->buckets[b] = NULL;
while (s) {
session_t *next = s->tw_next;
s->tw_next = s->tw_prev = NULL; /* detach before destroy or re-insert */
if (s->expire_tick == tw->current_tick) {
session_destroy(st, s); /* remove from hash, return to pool */
} else {
/* Timeout longer than the wheel range: re-insert for another lap */
session_refresh(s, tw, s->expire_tick - tw->current_tick);
}
s = next;
}
}
TCP STATE MACHINE — NGFW SEQUENCE TRACKING
TCP Conntrack States and Transitions
/* TCP states from NGFW perspective (observer, not endpoint) */
typedef enum {
TCP_S_NONE = 0, /* no session — first packet */
TCP_S_SYN_SENT, /* SYN seen from initiator */
TCP_S_SYN_RCVD, /* SYN+ACK seen from responder */
TCP_S_ESTABLISHED, /* ACK completing 3-way handshake */
TCP_S_FIN_WAIT_1, /* FIN from initiator */
TCP_S_FIN_WAIT_2, /* FIN ACKed by responder */
TCP_S_CLOSE_WAIT, /* FIN from responder (responder closing first) */
TCP_S_LAST_ACK, /* FIN from initiator after CLOSE_WAIT */
TCP_S_TIME_WAIT, /* waiting for stray packets */
TCP_S_CLOSED, /* RST received or graceful close complete */
} tcp_state_t;
/* State transition table */
uint8_t tcp_next_state[TCP_S_CLOSED+1][2 /*dir*/][64 /*flags*/];
/* Build the table at init (simplified) */
void tcp_init_transitions(void) {
/* SYN from initiator → create session, SYN_SENT */
/* Handled at session creation, not in table */
/* SYN+ACK from responder → SYN_RCVD */
tcp_next_state[TCP_S_SYN_SENT][0/*rev*/][TH_SYN|TH_ACK] = TCP_S_SYN_RCVD;
/* ACK from initiator → ESTABLISHED */
tcp_next_state[TCP_S_SYN_RCVD][1/*fwd*/][TH_ACK] = TCP_S_ESTABLISHED;
/* FIN from initiator → FIN_WAIT_1 */
tcp_next_state[TCP_S_ESTABLISHED][1/*fwd*/][TH_FIN] = TCP_S_FIN_WAIT_1;
tcp_next_state[TCP_S_ESTABLISHED][1/*fwd*/][TH_FIN|TH_ACK] = TCP_S_FIN_WAIT_1;
/* ACK of FIN → FIN_WAIT_2 */
tcp_next_state[TCP_S_FIN_WAIT_1][0/*rev*/][TH_ACK] = TCP_S_FIN_WAIT_2;
/* RST → CLOSED from any state */
for (int st = 0; st <= TCP_S_CLOSED; st++) {
tcp_next_state[st][0][TH_RST] = TCP_S_CLOSED;
tcp_next_state[st][1][TH_RST] = TCP_S_CLOSED;
}
/* ... more transitions ... */
}
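For readers who prefer to see the handshake portion end to end, here is a reduced, switch-based version of the same idea. It is a sketch under stated assumptions rather than the table-driven code above: the `demo_` names and `F_` flag constants are hypothetical, only the open and reset transitions are covered, and any flag combination that is not listed leaves the state unchanged.

```c
#include <assert.h>
#include <stdint.h>

#define F_FIN 0x01
#define F_SYN 0x02
#define F_RST 0x04
#define F_ACK 0x10

typedef enum {
    D_NONE, D_SYN_SENT, D_SYN_RCVD, D_ESTABLISHED, D_CLOSED
} demo_state_t;

/* Returns the next state as seen by an on-path observer.
 * from_initiator: 1 if the packet travels initiator -> responder. */
static demo_state_t demo_tcp_step(demo_state_t st, int from_initiator,
                                  uint8_t flags) {
    assert(st <= D_CLOSED);
    /* PSH/URG do not affect state, so only these four bits are compared. */
    uint8_t f = flags & (F_FIN | F_SYN | F_RST | F_ACK);

    /* RST closes an existing session; the caller is assumed to have
     * sequence-validated it first. Without a session there is nothing to close. */
    if ((f & F_RST) && st != D_NONE)
        return D_CLOSED;

    switch (st) {
    case D_NONE:
        if (from_initiator && f == F_SYN) return D_SYN_SENT;
        break;
    case D_SYN_SENT:
        if (!from_initiator && f == (F_SYN | F_ACK)) return D_SYN_RCVD;
        break;
    case D_SYN_RCVD:
        if (from_initiator && f == F_ACK) return D_ESTABLISHED;
        break;
    default:
        break;
    }
    return st;   /* unlisted combination: keep the current state */
}
```

Feeding it SYN, SYN+ACK, ACK in order walks `D_NONE` to `D_ESTABLISHED`, while a SYN+ACK with no prior SYN leaves the state at `D_NONE`, which the caller treats as a drop.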
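/* Fast-path state lookup. The table is zero-initialised, so TCP_S_NONE means */
/* "no transition listed" and must not overwrite the current state. */
/* Run sequence validation on RSTs before this lookup. */
uint8_t new_state = tcp_next_state[s->tcp_state][is_initiator][tcp_flags & 0x3f];
if (new_state != TCP_S_NONE && new_state != s->tcp_state) {
s->tcp_state = new_state;
if (new_state == TCP_S_CLOSED)
session_schedule_destroy(s, 5); /* brief grace period */
}
TCP Sequence Number Validation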
/* Sequence validation: reject RST injection and data injection attacks */
/* Only accept packets whose sequence number falls within expected window */
#define SEQ_GT(a, b) ((int32_t)((a) - (b)) > 0) /* wrap-safe "a is after b" */
int tcp_seq_in_window(session_t *s, const struct tcphdr *th,
uint16_t data_len, int is_fwd) {
uint32_t seq = ntohl(th->th_seq);
uint32_t ack = ntohl(th->th_ack);
uint32_t e_seq = is_fwd ? s->fwd_seq : s->rev_seq; /* expected */
uint32_t window = is_fwd ? s->rev_window : s->fwd_window; /* peer window */
/* Allow [e_seq - window, e_seq + window] accounting for 32-bit wrap */
int32_t lo = (int32_t)(seq - (e_seq - window));
int32_t hi = (int32_t)(seq + data_len - (e_seq + window));
if (lo < 0 || hi > 0) {
/* Out of window */
if (th->th_flags & TH_RST) {
/* RST injection — drop silently, increment counter */
s->rst_inject_count++;
return 0;
}
/* Data outside window — could be retransmit or injection */
/* Log and drop; don't tear down session */
return 0;
}
/* Update tracked sequence numbers */
if (is_fwd) {
if (SEQ_GT(seq + data_len, s->fwd_seq))
s->fwd_seq = seq + data_len;
if (th->th_flags & TH_ACK)
s->fwd_ack = ack;
} else {
if (SEQ_GT(seq + data_len, s->rev_seq))
s->rev_seq = seq + data_len;
if (th->th_flags & TH_ACK)
s->rev_ack = ack;
}
return 1;
}
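The window used in that check is only meaningful once the window scale factor has been applied, and the factor is carried in a TCP option on the SYN and SYN-ACK. A minimal parser for it could look like the following; `tcp_parse_wscale` is a hypothetical name, and the input is assumed to be the raw option bytes that follow the fixed 20-byte TCP header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the TCP options and return the window scale shift (0..14),
 * or -1 if the option is absent or the option list is malformed. */
static int tcp_parse_wscale(const uint8_t *opts, size_t len) {
    assert(opts != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint8_t kind = opts[i];
        if (kind == 0)                 /* End of Option List */
            break;
        if (kind == 1) {               /* NOP: single byte, no length */
            i++;
            continue;
        }
        if (i + 1 >= len)              /* truncated: length byte missing */
            return -1;
        uint8_t olen = opts[i + 1];
        if (olen < 2 || i + olen > len)
            return -1;                 /* bogus length or runs past the end */
        if (kind == 3 && olen == 3) {  /* Window Scale: kind=3, len=3, shift */
            uint8_t shift = opts[i + 2];
            return shift > 14 ? 14 : shift;   /* RFC 7323 caps the shift at 14 */
        }
        i += olen;
    }
    return -1;
}
```

Scaling applies only when both the SYN and the SYN-ACK carried the option. In that case the effective window is `advertised_window << shift`, using the shift announced by the side that advertises the window.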
/* Window scale option — must be parsed from SYN/SYN-ACK */
/* Effective window = advertised_window << window_scale */
/* Not parsing window scale → sequence window may be too narrow */
/* Result: false positives on legitimate high-BDP connections */
UDP AND ICMP TRACKING
UDP Pseudo-Session Tracking
/* UDP has no handshake, so the NGFW creates a pseudo-session on the first permitted datagram */
typedef enum {
UDP_S_NEW, /* first packet seen, no reply yet */
UDP_S_REPLIED, /* return packet seen — bidirectional confirmed */
UDP_S_CLOSING /* idle → pending removal */
} udp_state_t;
/* Per-protocol timeouts */
static const struct {
uint16_t dst_port;
uint32_t timeout_s;
} udp_port_timeouts[] = {
{ 53, 10 }, /* DNS — short, single exchange */
{ 67, 30 }, /* DHCP */
{ 123, 60 }, /* NTP */
{ 500, 3600}, /* IKEv2 */
{ 4500, 3600}, /* IKE NAT-T */
{ 5060, 300 }, /* SIP registration */
{ 0, 30 }, /* default */
};
uint32_t udp_timeout(uint16_t dst_port) {
for (int i = 0; udp_port_timeouts[i].dst_port; i++)
if (udp_port_timeouts[i].dst_port == dst_port)
return udp_port_timeouts[i].timeout_s;
return 30;
}
/* QUIC sessions (UDP 443) require special handling */
/* QUIC Connection IDs survive IP address changes (mobile handoff) */
/* Proper QUIC tracking: parse QUIC header → extract Connection ID */
/* Create secondary index: connection_id → session */
/* On IP change: update session's five-tuple but keep DPI state */
/* UDP session creation in forwarding loop */
if (!session_lookup(&key)) {
if (policy_permits(&key)) {
session_t *s = session_create(&key, IPPROTO_UDP,
udp_timeout(ntohs(udph->dest)));
/* On first packet from responder: s->state = UDP_S_REPLIED */
/* This is the confirmation that remote is responding */
} else {
drop_packet(m);
}
}
ICMP Tracking — Echo and Error Correlation
/* ICMP Echo tracking: identifier field plays role of session ID */
/* Request: type=8, code=0, id=PID, seq=N */
/* Reply: type=0, code=0, id=PID, seq=N */
/* Encode ICMP into flow_key_t */
void icmp_to_key(const struct iphdr *iph,
const struct icmphdr *ich, flow_key_t *k) {
memset(k, 0, sizeof(*k));
memcpy(&k->src_ip[12], &iph->saddr, 4); /* IPv4 in the last 4 bytes, network order */
memcpy(&k->dst_ip[12], &iph->daddr, 4);
k->proto = IPPROTO_ICMP;
if (ich->type == ICMP_ECHO || ich->type == ICMP_ECHOREPLY) {
/* Identifier goes in both port fields so the swap in normalise_key() leaves it unchanged */
k->src_port = ntohs(ich->un.echo.id);
k->dst_port = k->src_port;
/* The type (8 or 0) is left out of the key: putting it in one port field */
/* would give request and reply different keys after normalisation */
}
}
/* ICMP error correlation — find the session that triggered the error */
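void icmp_error_correlate(const struct icmphdr *ich, size_t len,
session_t **orig_session) {
/* ICMP error payload: original IP header + first 8B of transport */
const struct iphdr *inner = (const struct iphdr *)(ich + 1);
*orig_session = NULL;
if (len < sizeof(*ich) + sizeof(*inner) + 8) return;
size_t ihl = (size_t)inner->ihl * 4; /* inner header may carry IP options */
if (ihl < sizeof(*inner) || len < sizeof(*ich) + ihl + 8) return;
flow_key_t inner_key;
memset(&inner_key, 0, sizeof(inner_key)); /* padding and unused ports must be zero */
const uint8_t *inner_l4 = (const uint8_t *)inner + ihl;
inner_key.proto = inner->protocol;
/* ... fill src/dst IP from inner header ... */
if (inner->protocol == IPPROTO_TCP || inner->protocol == IPPROTO_UDP) {
uint16_t sport, dport; /* memcpy avoids unaligned loads */
memcpy(&sport, inner_l4 + 0, 2);
memcpy(&dport, inner_l4 + 2, 2);
inner_key.src_port = ntohs(sport);
inner_key.dst_port = ntohs(dport);
}
int dir; /* direction is not needed here, but normalise_key() writes it */
normalise_key(&inner_key, &dir);
*orig_session = session_lookup(&inner_key);
}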
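The ICMP handling rules for this section reduce to a per-type decision made before any session lookup. A compact way to express that is sketched below; the verdict enum and `icmp_classify` are hypothetical names, and only the types discussed in this section are given non-drop verdicts.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    ICMP_V_DROP,         /* never forwarded unless policy explicitly allows */
    ICMP_V_TRACK_ECHO,   /* look up or create an echo pseudo-session */
    ICMP_V_CORRELATE     /* forward only if the inner packet matches a session */
} icmp_verdict_t;

/* Classify an ICMPv4 message by its type field. */
static icmp_verdict_t icmp_classify(uint8_t type) {
    switch (type) {
    case 8:    /* Echo Request */
    case 0:    /* Echo Reply */
        return ICMP_V_TRACK_ECHO;
    case 3:    /* Destination Unreachable */
    case 11:   /* Time Exceeded */
        return ICMP_V_CORRELATE;
    case 5:    /* Redirect: routing attack vector */
    case 13:   /* Timestamp Request */
    default:
        return ICMP_V_DROP;
    }
}
```

A caller would run `icmp_error_correlate()` only for `ICMP_V_CORRELATE` and drop the message when no originating session is found.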
/* ICMP policy */
/* Echo request/reply: permit if within session */
/* ICMP unreachable / TTL exceeded: permit only if correlates to a session */
/* ICMP Redirect (type=5): ALWAYS DROP — routing attack vector */
/* ICMP Timestamp (type=13): drop unless explicitly permitted */
NAT FUNDAMENTALS — TYPES AND POLICY
NAT Type Taxonomy
/* 1. Static NAT (1:1 bidirectional mapping) */
/* Private IP ←→ Public IP, permanent */
/* Use: servers that need a consistent public address */
192.168.1.100 ←→ 203.0.113.100
/* 2. Dynamic NAT (pool-based, no port sharing) */
/* Each private IP gets one public IP from a pool */
/* Pool exhausted → new connections fail */
/* Rarely used today; NAPT replaced it */
/* 3. NAPT / PAT (many-to-one with port multiplexing) */
/* The standard for home/enterprise edge */
192.168.1.10:4501 → 203.0.113.1:10001
192.168.1.20:4502 → 203.0.113.1:10002
192.168.1.30:4501 → 203.0.113.1:10003 /* same client port OK: different IP */
/* 4. DNAT (Destination NAT): server publishing */
/* Rewrite destination; used for load balancing, port forwarding */
dst:203.0.113.1:80 → dst:10.0.0.5:8080
/* 5. Twice NAT (full NAT) */
/* Rewrite both src and dst: overlapping address spaces */
src:10.0.0.5 dst:10.0.0.1 → src:172.16.0.5 dst:172.16.0.1
/* NAT and conntrack coupling */
/* NAT must use conntrack because it needs to: */
/* 1. Know which translation to apply to RETURN traffic */
/* 2. Apply the REVERSE translation (DNAT reply → SNAT) */
/* 3. Persist the mapping for the lifetime of the session */
/* Session entry stores both original and translated addresses */
/* Netfilter hook points for NAT */
/* SNAT: POSTROUTING hook, after routing decision, before NIC */
/* DNAT: PREROUTING hook, before routing decision */
/* VPP NAT44-ED: processes both directions in ip4-unicast arc */
NAPT MECHANICS — PORT ALLOCATION AND PACKET REWRITE
Port Pool and Packet Rewrite
/* Port pool: bitmap allocator, O(1) amortised */
typedef struct {
uint32_t public_ip;
uint16_t port_start; /* 1024 */
uint16_t port_end; /* 65535 */
uint8_t bitmap[8192]; /* 64512 ports / 8 = 8064 bytes */
uint16_t next_hint; /* round-robin start hint */
uint32_t in_use; /* current count */
} nat_port_pool_t;
uint16_t nat_alloc_port(nat_port_pool_t *pool) {
uint16_t start = pool->next_hint;
for (int i = 0; i < 64512; i++) {
uint16_t p = pool->port_start + ((start + i) % 64512);
uint32_t off = (p - pool->port_start);
uint32_t by = off / 8, bi = off % 8;
if (!(pool->bitmap[by] & (1u << bi))) {
pool->bitmap[by] |= (1u << bi);
pool->next_hint = (start + i + 1) % 64512;
pool->in_use++;
return p;
}
}
return 0; /* port exhaustion */
}
void nat_free_port(nat_port_pool_t *pool, uint16_t port) {
uint32_t off = port - pool->port_start;
pool->bitmap[off / 8] &= ~(1u << (off % 8));
pool->in_use--;
}
/* Packet rewrite — what changes in each direction */
/*
Outbound SNAT:
IP header: src_addr = nat_src_ip (was private IP)
TCP header: src_port = nat_src_port (was ephemeral port)
Checksums: both IP and TCP/UDP must be updated
Inbound reverse:
IP header: dst_addr = orig_src_ip (restore private IP)
TCP header: dst_port = orig_src_port (restore original port)
Checksums: updated
DNAT outbound:
IP header: dst_addr = real_server_ip
TCP header: dst_port = real_server_port
DNAT inbound (reply):
IP header: src_addr = public_vip_ip
TCP header: src_port = published_port
*/
/* Incremental checksum update (RFC 1624) */
/* Much faster than full recalculation — only 1-2 words changed */
static inline uint16_t csum_update_u32(uint16_t old_csum,
uint32_t old_val, uint32_t new_val) {
uint32_t s = (uint16_t)~old_csum;
s += (uint16_t)~(old_val & 0xFFFF) + (uint16_t)~(old_val >> 16);
s += (uint16_t)(new_val & 0xFFFF) + (uint16_t)(new_val >> 16);
while (s >> 16) s = (s & 0xFFFF) + (s >> 16);
return (uint16_t)~s;
}
void nat_rewrite_outbound(struct iphdr *iph, struct tcphdr *th,
uint32_t new_sip, uint16_t new_sport) {
/* new_sip and new_sport arrive in host order. Convert them once so every value */
/* fed to the checksum helper is in network order, like the header fields. */
uint32_t old_sip = iph->saddr;
uint16_t old_sport = th->source;
uint32_t new_sip_n = htonl(new_sip);
uint16_t new_sport_n = htons(new_sport);
iph->saddr = new_sip_n;
th->source = new_sport_n;
/* Update IP checksum (the header checksum covers the 32-bit source address) */
iph->check = csum_update_u32(iph->check, old_sip, new_sip_n);
/* Update TCP/UDP checksum (pseudo-header includes src IP; header includes src port) */
th->check = csum_update_u32(th->check, old_sip, new_sip_n);
th->check = csum_update_u32(th->check, old_sport, new_sport_n);
}
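A quick way to gain confidence in an incremental update is to compare it with a full recalculation on the same header. The sketch below does that for one IPv4 header whose source address is rewritten. It is independent of the functions above: `csum_full`, `csum_fix16` and the sample header bytes are hypothetical, and words are read explicitly in network byte order so the result does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Full Internet checksum (RFC 1071) over a buffer, words read big-endian. */
static uint16_t csum_full(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;   /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);   /* end-around carry */
    return (uint16_t)~sum;
}

/* RFC 1624 eqn. 3 for one 16-bit word: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_fix16(uint16_t hc, uint16_t old_w, uint16_t new_w) {
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~old_w;
    sum += new_w;
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Rewrites the source address of a sample header and returns 1 if the
 * incrementally updated checksum equals the fully recomputed one. */
static int demo_incremental_matches_full(void) {
    uint8_t hdr[20] = {0x45, 0x00, 0x00, 0x54, 0x12, 0x34, 0x40, 0x00,
                       0x40, 0x06, 0x00, 0x00,       /* checksum field = 0 */
                       192, 168, 1, 10,              /* src: private */
                       8, 8, 8, 8};                  /* dst */
    const uint8_t new_src[4] = {203, 0, 113, 1};
    uint16_t inc = csum_full(hdr, sizeof(hdr));

    /* The address spans two 16-bit words; patch the checksum for each. */
    for (int w = 0; w < 2; w++) {
        uint16_t old_w = (uint16_t)((hdr[12 + 2 * w] << 8) | hdr[13 + 2 * w]);
        uint16_t new_w = (uint16_t)((new_src[2 * w] << 8) | new_src[2 * w + 1]);
        inc = csum_fix16(inc, old_w, new_w);
    }
    memcpy(&hdr[12], new_src, 4);
    assert(hdr[10] == 0 && hdr[11] == 0);   /* recompute with checksum field zero */
    return inc == csum_full(hdr, sizeof(hdr));
}
```

The same comparison, run over many random headers and ports, is a reasonable regression test for any incremental checksum helper before it goes near production traffic.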
/* Hardware offload (DPDK RTE_MBUF_F_TX_IP_CKSUM) avoids software calc */
ALG — APPLICATION LEVEL GATEWAYS
Why Protocols Break NAT and How ALGs Fix Them
Protocols that embed IP addresses or ports in their application payload break NAT: the outer IP header is rewritten correctly, but the embedded address in the payload still points to the private IP, which the remote host cannot reach. An ALG must inspect and rewrite the payload to fix this.
/* Protocols requiring ALG */
/* FTP (see M09): PORT/PASV commands contain IP:port */
/* SIP/VoIP: SDP body contains media IP:port */
/* H.323: Q.931/H.245 contain internal IPs */
/* SIP ALG — the most complex common ALG */
Incoming INVITE from UA behind NAT:
INVITE sip:bob@remote.com SIP/2.0
Contact: sip:alice@192.168.1.5:5060 ← private IP!
Via: SIP/2.0/UDP 192.168.1.5:5060 ← private IP!
Content-Type: application/sdp
Content-Length: 128
v=0
o=alice 1234 5678 IN IP4 192.168.1.5 ← private IP in SDP!
c=IN IP4 192.168.1.5 ← private IP!
m=audio 16384 RTP/AVP 0 ← media port
ALG processing:
1. Detect SIP (port 5060 or inspect INVITE/REGISTER keywords)
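2. For SIP over TCP, buffer the stream until a full SIP message is available (it may span segments); a UDP message like the one above arrives in a single datagram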
3. Parse Contact and Via headers → replace 192.168.1.5 with public IP
4. Parse SDP body → replace IP in c= and o= lines
5. Allocate NAT port for RTP media:
203.0.113.1:20000 ← 192.168.1.5:16384
6. Replace 16384 with 20000 in m= line
7. Recalculate Content-Length (body size changed!)
8. For SIP over TCP, adjust sequence and ACK numbers on all later segments if the message size changed
9. Update IP/TCP checksums
typedef struct sip_alg {
uint8_t *buf; /* TCP reassembly buffer */
uint32_t buf_len;
int32_t seq_delta; /* cumulative sequence number adjustment */
uint32_t media_ip; /* original media IP (to replace) */
uint16_t media_port; /* original media port */
uint16_t nat_port; /* allocated NAT port for media stream */
} sip_alg_t;
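The core of steps 3, 4 and 7 is a substitution that may change the message length, and that length change is what `seq_delta` accumulates. The sketch below shows only the substitution: `alg_replace_all` is a hypothetical helper that does plain substring replacement on a NUL-terminated buffer. A real ALG would have to match whole address tokens (so that 192.168.1.5 does not match inside 192.168.1.50) and parse the headers rather than search raw text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, replacing every occurrence of `from` with `to`.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 if dst is too small or `from` is empty. */
static long alg_replace_all(const char *src, const char *from, const char *to,
                            char *dst, size_t dst_cap) {
    assert(src && from && to && dst);
    size_t flen = strlen(from), tlen = strlen(to), out = 0;
    if (flen == 0 || dst_cap == 0)
        return -1;
    while (*src) {
        if (strncmp(src, from, flen) == 0) {
            if (out + tlen >= dst_cap)      /* keep room for the NUL */
                return -1;
            memcpy(dst + out, to, tlen);
            out += tlen;
            src += flen;
        } else {
            if (out + 1 >= dst_cap)
                return -1;
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
    return (long)out;
}
```

The difference between the returned length and `strlen(src)` is the value the ALG would write into the new Content-Length and, for SIP over TCP, add to `seq_delta`.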
/* Modern alternative: STUN/ICE (RFC 5389/8445) */
/* Application-level NAT traversal — no ALG needed */
/* Client queries public STUN server → learns public IP:port */
/* Encodes public address in SDP → remote can connect directly */
/* WebRTC uses ICE exclusively, with no SIP ALG involvement */
VPP NAT44-ED — PRODUCTION IMPLEMENTATION
VPP NAT44-ED Architecture
/* NAT44-ED: Endpoint-Dependent, meaning the key includes the destination address */
/* Advantage: same client port can be used for different destinations */
/* Required for: hairpinning, multiple ISP links, port-restricted NAT */
/* VPP NAT44-ED session key (endpoint-dependent) */
typedef struct {
ip4_address_t local_addr; /* pre-NAT source IP */
ip4_address_t ext_addr; /* destination IP (the "endpoint") */
u16 local_port;
u16 ext_port;
u8 proto;
u8 pad[3];
} ed_key_t; /* 13 bytes of fields + 3 bytes of padding = 16 */
/* Two-level lookup for ED NAT */
/* Table 1: (local_ip, local_port, ext_ip, ext_port, proto) → session */
/* Table 2: (nat_ip, nat_port, ext_ip, ext_port, proto) → session */
/* Both tables point to same session pool entry */
/* VPP NAT configuration */
nat44 enable sessions 1048576 /* 1M session table */
nat44 add address 203.0.113.1
nat44 add address 203.0.113.2 /* multiple public IPs for load */
set interface nat44 in tap0 out tap1
nat44 add interface address tap1 /* masquerade mode */
/* Static DNAT mapping */
nat44 add static mapping tcp local 10.0.0.5 8080 external 203.0.113.1 80
/* Per-worker session distribution */
/* VPP: each worker thread has its own session cache slice */
/* New session → hash five-tuple → assign to worker thread */
/* All subsequent packets of same flow → same worker (RSS ensures this) */
/* Monitoring */
show nat44 sessions /* active session count and list */
show nat44 sessions detail /* full field dump */
show nat44 summary /* pool usage, allocations */
show nat44 addresses /* public IP pool */
show errors /* NAT error counters */
/* VPP NAT error counters */
nat44-ed-in2out: packet does not match existing session: new connections
nat44-ed-in2out: out of ports: port pool exhausted
nat44-ed-in2out: TCP RST packet without session: likely stale RST
nat44-ed-out2in: unknown session: return packet with no matching session
/* Debugging a specific session */
nat44 del session in 10.0.0.5:54321 out 203.0.113.1:12345 tcp
/* Manually remove a stale session */
ATTACKS AND HARDENING
Attack Taxonomy and Mitigations
| Attack | Mechanism | Impact | Mitigation |
|---|---|---|---|
| SYN Flood | Spoofed SYNs → half-open sessions → session table exhaustion | DoS — all new connections rejected | SYN cookies; rate limit per source; half-open session cap |
| Session Table Exhaustion | Many short-lived connections filling the table | New connections dropped globally | Per-source rate limit; aggressive half-open timeout; alert at 80% fill |
| RST Injection | Crafted RST with valid five-tuple → session torn down | Disrupts legitimate connections; BGP attacks | Sequence number validation — RST must be within window |
| NAT Port Exhaustion | Source opens 64K connections → port pool drained | Further outbound connections fail | Per-source connection limit; monitor port pool; alert |
| NAT Slipstreaming | Browser crafts HTTP request that tricks SIP/FTP ALG → arbitrary port opened | External attacker reaches internal services | Disable unused ALGs; validate ALG content; block inbound 5060/21 |
| IP Fragmentation Bypass | Fragment 1 with no ports passes filter; fragment 2 contains attack in offset | Policy bypass | Reassemble ALL fragments before conntrack lookup |
| Overlapping Fragments (incl. Teardrop) | Send two overlapping fragments with different content; OS and NGFW reassemble differently | IDS/NGFW evasion; reassembly crash on unpatched stacks (Teardrop, historical) | Drop overlapping fragments; or enforce first/last-fragment policy |
| Tiny Fragment | Fragment so small that the TCP header is split across two fragments | Policy bypass | Drop fragments too small to contain a complete L4 header |
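The tiny-fragment mitigation in the last row can be checked before reassembly with a few comparisons on the IPv4 fragment field. The function below is a sketch with a hypothetical name and a deliberately strict policy: for TCP and UDP it rejects a first fragment that does not hold the whole fixed L4 header, and any later fragment whose offset falls inside that header. The fragment field is assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP_FLAG_MF      0x2000   /* More Fragments */
#define IP_OFFSET_MASK  0x1FFF   /* fragment offset, in 8-byte units */

/* frag_field: IPv4 flags+offset field in host byte order.
 * proto:      IP protocol number (6 = TCP, 17 = UDP).
 * l4_bytes:   number of bytes after the IP header in this packet.
 * Returns true if the packet is a fragment that should be dropped. */
static bool frag_should_drop(uint16_t frag_field, uint8_t proto,
                             uint16_t l4_bytes) {
    bool more = (frag_field & IP_FLAG_MF) != 0;
    uint32_t offset_bytes = (uint32_t)(frag_field & IP_OFFSET_MASK) * 8;

    if (!more && offset_bytes == 0)
        return false;                       /* not fragmented at all */

    /* Fixed header size that must be visible for a policy decision. */
    uint32_t min_hdr = (proto == 6) ? 20 : (proto == 17) ? 8 : 0;

    if (offset_bytes == 0)
        return l4_bytes < min_hdr;          /* first fragment too small */

    /* Later fragment starting inside the L4 header could overwrite
     * ports or flags during reassembly. */
    return offset_bytes < min_hdr;
}
```

Fragments that pass this check still need to be reassembled, or matched to the session created by the first fragment, before the conntrack lookup described in the table.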
/* SYN Cookie — no state until 3-way handshake verified */
/* ISN = HMAC(4-tuple + timestamp) — encodes connection info */
uint32_t syn_cookie_gen(uint32_t sip, uint32_t dip,
uint16_t sport, uint16_t dport,
uint32_t ts_tick, const uint8_t *key) {
uint8_t data[14];
memcpy(data, &sip, 4);
memcpy(data + 4, &dip, 4);
memcpy(data + 8, &sport, 2);
memcpy(data + 10, &dport, 2);
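uint16_t ts16 = (uint16_t)(ts_tick & 0xFFFF); /* low 16 bits, independent of host endianness */
memcpy(data + 12, &ts16, 2);
uint8_t mac[32];
HMAC_SHA256(key, 32, data, sizeof(data), mac); /* placeholder for your crypto library's HMAC-SHA256 */
uint32_t h; memcpy(&h, mac, 4); /* avoids an unaligned, aliasing cast */
return ntohl(h) ^ (ts_tick & 0xFFFF);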
}
/* On receiving ACK for SYN-ACK: */
int syn_cookie_verify(uint32_t client_ack, ...) {
/* client sends ACK = cookie + 1 */
uint32_t expected = syn_cookie_gen(...) + 1;
return (client_ack == expected);
}
/* If valid: create session entry NOW — entire SYN phase was stateless */
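/* SYN cookies lose most TCP options: the classic scheme keeps only the MSS, as a 3-bit index */
/* inside the ISN; window scale and SACK are lost unless carried in the timestamp option. */
/* The simplified generator above does not encode the MSS at all. */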
/* Half-open connection limit */
#define MAX_HALF_OPEN 100000 /* global limit */
#define MAX_PER_SOURCE 100 /* per source IP limit */
atomic_uint half_open_count = 0;
int session_allow_new_syn(uint32_t src_ip) {
if (atomic_load(&half_open_count) >= MAX_HALF_OPEN)
return 0; /* global limit — trigger SYN cookies */
if (src_rate_get(src_ip) > MAX_PER_SOURCE)
return 0; /* per-source limit */
atomic_fetch_add(&half_open_count, 1);
return 1;
}
High-Performance Session Table in C
Objective: Build a complete session table with five-tuple hashing, TCP state machine, timer-wheel timeout management, and SYN cookie defence. Benchmark at realistic NGFW throughput.
- normalise_key() and unit-test it exhaustively: verify that (A:1234→B:80,TCP) and (B:80→A:1234,TCP) produce the same canonical key. Test with IPv4 and IPv6 addresses. This is the most common conntrack bug, so get it right before building anything else.
- clock_gettime(CLOCK_MONOTONIC). Target: <100ns average.
- tcpdump -r test.pcap -tttt to generate your test data.
NAPT Packet Rewriter
Objective: Implement a working NAPT engine using AF_PACKET, with correct incremental checksum updates. Verify with real HTTP traffic flowing through it.
- csum_update_u32() from Tab 5. Test against full checksum recalculation: for 10,000 random packets, verify that incremental update gives the same result as recalculating from scratch. Any mismatch means a bug, and connections will silently break in production.
- curl http://external_ip:8080 reaching the internal web server.
- ping -s 3000 -M dont 10.2.0.2 (sends oversized packets that will be fragmented). Verify that your NAPT correctly handles fragment 1 (has TCP/UDP header) and fragment 2+ (no transport header, so use the session from fragment 1). Incorrect handling here is a common security bypass.
VPP NAT44-ED Deep Inspection
Objective: Configure VPP NAT44-ED in detail, inspect session internals, test edge cases, and compare with your Lab 1 implementation.
- vppctl show nat44 sessions detail: capture the output. For each session, identify: ED key fields, state, byte counters, NAT port allocation. Compare to your Lab 1 session_t layout.
- vppctl show nat44 addresses for port pool fill level. Identify the exact error counter that increments on exhaustion (show errors | grep nat44). What happens to existing connections when new ones fail? Is there any prioritisation?
M23 MASTERY CHECKLIST
- Know why stateful inspection beats stateless: validates protocol state, blocks crafted packets, auto-permits return traffic, enables per-flow L7 features
- Know the five-tuple: src_ip, dst_ip, src_port, dst_port, protocol — and that both flow directions must map to the same session entry
- Know canonical normalisation: why it's needed, how to implement it (compare and swap if needed), and what bugs occur without it
- Know session_t key fields: five-tuple, TCP state, seq/ack/window tracking, timestamps, counters, NAT mapping, app_id, policy_id, DPI state pointer
- Know session sizing guidance: 256 bytes × 1M sessions = 256MB; typical NGFW 500K–4M sessions
- Know why memory pool beats malloc/free: no fragmentation, O(1) alloc, better cache locality
- Know hash table requirements: O(1) lookup, lock-free concurrent access, cache-friendly layout
- Know clib_bihash and rte_hash as the two primary production choices
- Know timer wheel: O(1) session expiry; 65536 buckets at 1s resolution = 18h range
- Know per-protocol timeouts: TCP SYN 30s, TCP ESTAB 3600s, UDP 30s, UDP DNS 10s, ICMP 10s
- Know TCP conntrack states and transitions: NONE → SYN_SENT → SYN_RCVD → ESTABLISHED → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT → CLOSED
- Know TCP sequence validation: accept only within window; RST with wrong seq = injection attack → drop
- Know window scale option must be parsed from SYN — else window is too narrow and false positives occur
- Know UDP pseudo-session: created on first permitted datagram, transitions NEW → REPLIED on first return
- Know QUIC special handling: Connection IDs survive IP change; 5-tuple keyed sessions break on mobile handoff
- Know ICMP Echo tracking: identifier field is the session discriminator; must decode type/code to normalise direction
- Know ICMP error correlation: extract inner IP five-tuple from ICMP error payload to find the triggered session
- Know ICMP Redirect (type 5) must always be dropped — routing attack vector
- Know 5 NAT types: static 1:1, dynamic pool, NAPT/PAT, DNAT, twice-NAT
- Know NAPT port pool: bitmap allocator, 64K ports per public IP per protocol, round-robin allocation
- Know NAPT packet rewrite: which fields change (src_ip, src_port for SNAT; dst_ip, dst_port for DNAT) and why checksums must be updated
- Know incremental checksum update (RFC 1624): faster than full recalculation; critical correctness requirement
- Know why ALGs exist: embedded IP:port in payload breaks NAT; ALG patches payload content
- Know SIP ALG complexity: must patch SDP body, Content-Length, multiple headers, TCP seq numbers on size change
- Know VPP NAT44-ED: endpoint-dependent keying; two lookup tables; worker-thread per-session distribution
- Know VPP NAT commands: show nat44 sessions, show nat44 summary, nat44 add static mapping
- Know SYN cookie: ISN = HMAC(4-tuple + timestamp); no state during SYN flood; session created on ACK verification
- Know 8 conntrack/NAT attacks and mitigations: SYN flood, table exhaustion, RST injection, port exhaustion, NAT slipstreaming, fragmentation bypass, overlapping fragment, tiny fragment
- Completed Lab 1: session table with hash, TCP FSM, timer wheel, SYN cookies — benchmarked and integration-tested
- Completed Lab 2: NAPT engine with bitmap port pool, incremental checksum, AF_PACKET forwarding, DNAT rule, fragment handling
- Completed Lab 3: VPP NAT44-ED configuration, session inspection, port exhaustion test, hairpinning test
✅ When complete: Move to M24 - Deep Packet Inspection and Application Identification — building L7 inspection on top of the conntrack session table foundation you now fully understand.