M03 - IPv4 Deep Dive

NETWORKING MASTERY · PHASE 1 · MODULE 03 · WEEK 2

🌐 IPv4 Deep Dive

IP addressing · Subnetting · CIDR · Header fields · Fragmentation · TTL · ICMP · Routing basics

Beginner → Intermediate · Prerequisite: M01, M02 · RFC 791 · Subnetting · ICMP · 3 Labs

THE INTERNET PROTOCOL — WHY IT EXISTS

🌐

IP — The Language of the Internet

FOUNDATION

The Internet Protocol (IP) is the fundamental protocol that makes the internet work. Defined in RFC 791 (1981), it gives every device a logical address and defines how data is packaged into packets and routed across interconnected networks.

Without IP, you could only talk to devices on your same physical network — your switch's MAC table handles that. IP is what lets your laptop in Mumbai send data to a server in Frankfurt through dozens of intermediate networks and routers, none of which need to know anything about your laptop or the server directly.

IP's three core jobs:

Logical addressing — Every device gets an IP address. Unlike MAC addresses (hardware), IP addresses are logical and can be assigned, changed, and hierarchically organised for efficient routing
Packet fragmentation and reassembly — If a packet is too large for a network link, IP splits it into smaller fragments and reassembles at the destination
Best-effort delivery — IP makes its best effort to deliver packets but makes no guarantees. Packets can be lost, duplicated, reordered, or corrupted. Reliability is left to upper layers (TCP at L4)

📮 Analogy — The Postal System

IP is like the postal service. Your letter (packet) has a destination address (IP address). The postal system (internet) routes it through intermediate sorting offices (routers) without you needing to know the route. Each sorting office reads the destination address, decides which direction to send it, and passes it along. If the letter is too thick for a slot (MTU exceeded), it gets split into multiple envelopes (fragmentation). The postal service doesn't guarantee delivery — letters can get lost, arrive late, or arrive out of order. If you need guarantees, you use registered mail (TCP).

📊

IP in Context — The Protocol Stack

POSITION IN STACK

IP sits at Layer 3 of the OSI model — above Ethernet (L2) and below TCP/UDP (L4). Every TCP connection, every UDP datagram, every DNS query, every HTTP request — they all travel inside IP packets.

/* Stack position of IPv4 */

Application layer:  HTTP data ("GET /index.html...")
                         ↓ TCP wraps with segment header
Transport layer:    [TCP hdr: sport=52341 dport=80] + [HTTP data]
                         ↓ IP wraps with packet header
Network layer:      [IP hdr: src=10.0.0.5 dst=93.184.216.34] + [TCP] + [HTTP]
                         ↓ Ethernet wraps with frame header
Data Link layer:    [Eth hdr: dst_mac src_mac 0x0800] + [IP] + [TCP] + [HTTP] + [CRC]
                         ↓ NIC transmits as bits
Physical layer:     01001000 01010100 01010100...

The IP header's Protocol field (1 byte) tells the receiver what L4 protocol lives inside the packet: 6 = TCP, 17 = UDP, 1 = ICMP, 89 = OSPF, 50 = ESP (IPsec). This is how the kernel knows which protocol handler to pass the packet to after stripping the IP header.

IPv4 HEADER — 20 BYTES MINIMUM, EVERY FIELD EXPLAINED

📦

IPv4 Header Layout

HEADER FORMAT

The IPv4 header is a minimum of 20 bytes (160 bits). It precedes the payload (TCP segment, UDP datagram, ICMP message, etc.). Each row below represents 32 bits (4 bytes) as transmitted on the wire.

Row	Fields (width)
1	Ver (4 bits) · IHL (4 bits) · DSCP / ECN (8 bits) · Total Length (16 bits)
2	Identification (16 bits) · Flags (3 bits) · Fragment Offset (13 bits)
3	TTL (8 bits) · Protocol (8 bits) · Header Checksum (16 bits)
4	Source IP Address (32 bits, 4 bytes)
5	Destination IP Address (32 bits, 4 bytes)
6+	Options (if IHL > 5) + Padding (0–40 bytes, rare in practice)

🔍

Every Field — What It Does and Why It Matters

FIELD REFERENCE

Version (4 bits)

Always 0100 = 4 for IPv4. IPv6 uses 0110 = 6. The receiver checks this first to confirm which IP version it's dealing with. In DPDK/VPP, this is the first thing ip4-input validates.

IHL — Internet Header Length (4 bits)

Specifies the header length in 32-bit words. Minimum value is 5 (5 × 4 bytes = 20 bytes, the minimum header with no options). Maximum is 15 (15 × 4 = 60 bytes). IHL tells the receiver where the payload starts: payload offset = IHL × 4.

/* C: find where IP payload begins */
uint8_t *ip_hdr = packet_start;
uint8_t  ihl     = (ip_hdr[0] & 0x0F);     /* low nibble of first byte */
uint8_t *payload = ip_hdr + (ihl * 4);     /* jump over header */

DSCP / ECN (8 bits — formerly TOS)

Originally called Type of Service, now split into two fields:

DSCP (Differentiated Services Code Point, 6 bits) — QoS marking. Routers and firewalls use this to prioritise packets. Common values: 0 = Best Effort, 46 = Expedited Forwarding (EF, typically voice), 34 = AF41 (an Assured Forwarding class, often used for video). NGFW policy engines can mark and classify traffic using DSCP.
ECN (Explicit Congestion Notification, 2 bits) — allows congestion notification without packet drops. Routers mark ECN bits when they're near capacity; the receiver signals the sender to slow down.

Total Length (16 bits)

The total size of the IP packet in bytes — header + payload. Maximum value: 65535. Practical maximum on standard Ethernet: 1500 (MTU). This field is critical: receivers use it to know how many bytes to read, and it allows detection of truncated packets.

Identification (16 bits)

A unique ID assigned by the sender to identify all fragments of the same original packet. When a large packet is fragmented, all fragments get the same Identification value — the receiver uses it to reassemble them. Not used for non-fragmented packets (but still set by the OS).

Flags (3 bits)

Bit 0: Reserved (always 0) · Bit 1: DF (Don't Fragment) · Bit 2: MF (More Fragments)

DF (Don't Fragment) — tells routers not to fragment this packet. If the packet is too large for a link and DF=1, the router drops it and sends an ICMP "Fragmentation Needed" message back. Used by Path MTU Discovery (PMTUD)
MF (More Fragments) — set to 1 on all fragments except the last. The receiver uses this to know when it has collected all fragments

Fragment Offset (13 bits)

Position of this fragment's data within the original packet, measured in units of 8 bytes. A value of 185 means this fragment's data starts at byte offset 185 × 8 = 1480 in the original packet. The receiver uses Identification + Fragment Offset to put fragments back in order.

TTL — Time To Live (8 bits)

A counter decremented by 1 at each router hop. When TTL reaches 0, the router discards the packet and sends an ICMP Time Exceeded message back to the sender. Purpose: prevent packets from looping forever in a routing loop. Starting TTL is typically 64 (Linux), 128 (Windows), or 255 (some routers). We cover TTL in detail in the TTL and Routing tab.

Protocol (8 bits)

Identifies the L4 protocol inside the payload:

1 — ICMP
6 — TCP
17 — UDP
41 — IPv6-in-IPv4 (6in4 tunnel)
47 — GRE
50 — ESP (IPsec)
51 — AH (IPsec)
89 — OSPF
132 — SCTP

Header Checksum (16 bits)

A checksum computed over the IP header only (not the payload — TCP/UDP have their own checksums). Each router must recompute it after decrementing TTL. If the checksum fails, the packet is silently dropped. Modern NICs (including your Mellanox) offload checksum verification to hardware.
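To make the recompute step concrete, here is a minimal Python sketch of the header checksum (the ones'-complement sum of 16-bit words defined in RFC 791). The sample header bytes and the function name ipv4_checksum are illustrative assumptions, not taken from any particular stack; real fast paths usually update the checksum incrementally or leave it to the NIC.

```python
def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement checksum over an IPv4 header.

    Pass the header with its checksum field (bytes 10-11) zeroed to compute
    the value to store; pass a complete header to verify it (0 means valid).
    """
    total = 0
    for i in range(0, len(header), 2):      # header length is always even (IHL * 4)
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                      # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Illustrative 20-byte header: 192.168.0.1 to 192.168.0.199, UDP, TTL 64,
# with the checksum field zeroed.
header = bytearray.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = ipv4_checksum(bytes(header))
header[10:12] = checksum.to_bytes(2, "big")
print(hex(checksum))                        # 0xb861
print(ipv4_checksum(bytes(header)))         # 0, so the header verifies

# What a router does: decrement TTL (byte 8), then recompute the checksum.
header[8] -= 1
header[10:12] = b"\x00\x00"
header[10:12] = ipv4_checksum(bytes(header)).to_bytes(2, "big")
print(header[10:12].hex())                  # b961
```

Verifying a received header is the same computation with the checksum field left in place: a result of 0 means the header is intact.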

Source IP Address (32 bits) and Destination IP Address (32 bits)

The 4-byte IPv4 addresses of sender and receiver. These are the primary fields routers use for forwarding decisions. In NAT, both source and destination addresses may be rewritten by the firewall/NAT device.

💡 In DPDK/VPP code, the IPv4 header is accessed via ip4_header_t (VPP) or a manual struct. Key fields accessed in the fast path: ip4->dst_address (FIB lookup), ip4->protocol (dispatch to TCP/UDP), ip4->ttl (decrement), ip4->checksum (recompute after TTL change). These are the fields your graph nodes will read millions of times per second.
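As a rough illustration of the field reference above, the following Python sketch unpacks the fixed 20-byte part of a header with the standard struct module. The function name parse_ipv4_header, the dictionary keys, and the sample bytes are assumptions made for this example; they are not VPP's ip4_header_t layout or API.

```python
import socket
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte IPv4 header (options, if any, are not parsed)."""
    if len(packet) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = ver_ihl & 0x0F                        # low nibble of the first byte
    return {
        "version": ver_ihl >> 4,                # high nibble of the first byte
        "ihl": ihl,
        "payload_offset": ihl * 4,              # where the L4 header starts
        "dscp": tos >> 2,                       # upper 6 bits of the old TOS byte
        "ecn": tos & 0x03,                      # lower 2 bits
        "total_length": total_len,
        "id": ident,
        "df": bool(flags_frag & 0x4000),        # Don't Fragment
        "mf": bool(flags_frag & 0x2000),        # More Fragments
        "frag_offset": flags_frag & 0x1FFF,     # in units of 8 bytes
        "ttl": ttl,
        "protocol": proto,
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Illustrative header: UDP from 192.168.0.1 to 192.168.0.199 with DF set.
sample = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
fields = parse_ipv4_header(sample)
print(fields["version"], fields["ihl"], fields["ttl"], fields["protocol"])  # 4 5 64 17
print(fields["src"], "->", fields["dst"])       # 192.168.0.1 -> 192.168.0.199
```

The "!" in the format string selects network byte order, which is why no explicit ntohs-style conversion is needed here.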

IPv4 ADDRESSING — 32-BIT ADDRESSES, NOTATION, CLASSES

🏷️

IPv4 Address Structure

CORE CONCEPT

An IPv4 address is a 32-bit number — four groups of 8 bits (octets) separated by dots. We write it in dotted-decimal notation where each octet is expressed as a decimal number from 0 to 255.

Binary representation of 192.168.1.100

Octet	Binary
192	11000000
168	10101000
1	00000001
100	01100100

Full 32-bit binary: 11000000.10101000.00000001.01100100

Every IP address has two parts — a network portion and a host portion. The subnet mask tells you which bits are the network part (1s) and which are the host part (0s).

All devices in the same network have identical network bits
Each device has a unique host portion within its network
Routers forward packets based on the network portion — they don't care about individual host bits
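The conversion between dotted-decimal, the 32-bit integer, and the binary form can be sketched in a few lines of Python. The helper names (ip_to_int, int_to_ip, ip_to_binary) are made up for this example, and the /24 mask is an assumption chosen to show the network/host split.

```python
def ip_to_int(addr: str) -> int:
    """Pack four decimal octets into one 32-bit integer."""
    octets = [int(o) for o in addr.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {addr!r}")
    value = 0
    for o in octets:
        value = (value << 8) | o            # shift in one octet at a time
    return value


def int_to_ip(value: int) -> str:
    """Unpack a 32-bit integer back into dotted-decimal notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def ip_to_binary(addr: str) -> str:
    """Render each octet as 8 binary digits."""
    return ".".join(f"{int(o):08b}" for o in addr.split("."))


addr = ip_to_int("192.168.1.100")
mask = ip_to_int("255.255.255.0")           # assumed /24 mask for the example
print(ip_to_binary("192.168.1.100"))        # 11000000.10101000.00000001.01100100
print(int_to_ip(addr & mask))               # network portion: 192.168.1.0
print(addr & ~mask & 0xFFFFFFFF)            # host portion: 100
```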

📚

Classful Addressing — Historical but Still Referenced

BACKGROUND

Before CIDR (1993), IPv4 addresses were divided into fixed classes. You still hear these terms in networking conversations:

Class	First Bits	Range	Default Mask	Networks	Hosts/Network	Use
A	0xxxxxxx	1.0.0.0 – 126.255.255.255	/8 (255.0.0.0)	126	16,777,214	Large orgs
B	10xxxxxx	128.0.0.0 – 191.255.255.255	/16 (255.255.0.0)	16,384	65,534	Medium orgs
C	110xxxxx	192.0.0.0 – 223.255.255.255	/24 (255.255.255.0)	2,097,152	254	Small orgs
D	1110xxxx	224.0.0.0 – 239.255.255.255	N/A	N/A	N/A	Multicast
E	1111xxxx	240.0.0.0 – 255.255.255.255	N/A	N/A	N/A	Reserved/Experimental

Classful addressing wasted enormous numbers of IP addresses (a company needing 300 hosts got a Class B with 65,534 addresses — 65,234 wasted). CIDR replaced classful addressing, but the Class A/B/C terminology persists in configuration and documentation.

🔧

Subnet Mask — The Network/Host Boundary

SUBNET MASK

A subnet mask is a 32-bit number where all network bits are 1 and all host bits are 0. Two notation forms:

Dotted-decimal: 255.255.255.0 — easier to read for humans
CIDR prefix length: /24 — count of 1-bits. Much more compact.

/* Example: 192.168.1.100/24 */
IP address:   192.168.1.100  =  11000000.10101000.00000001.01100100
Subnet mask:  255.255.255.0  =  11111111.11111111.11111111.00000000
                                 ←──── Network portion ────→ ←Host→

/* AND operation: IP & mask = Network address */
Network addr: 192.168.1.0    =  11000000.10101000.00000001.00000000

/* Broadcast: network with all host bits = 1 */
Broadcast:    192.168.1.255  =  11000000.10101000.00000001.11111111

/* Usable hosts: from .1 to .254 (254 hosts for /24) */
First host:   192.168.1.1
Last host:    192.168.1.254

Three critical addresses in every subnet:

Network address — host bits all 0. Identifies the subnet itself, not assignable to a host (e.g., 192.168.1.0)
Broadcast address — host bits all 1. Sends to all hosts in the subnet, not assignable (e.g., 192.168.1.255)
Usable host range — everything between. For /24: 192.168.1.1 to 192.168.1.254 = 254 usable hosts

SUBNETTING AND CIDR — DIVIDING ADDRESS SPACE EFFICIENTLY

✂️

Why Subnetting Exists

MOTIVATION

Subnetting takes a large network and divides it into smaller sub-networks. This is done for three reasons:

Security isolation — different departments/zones in different subnets, firewall between them (your NGFW use case)
Performance — smaller broadcast domains mean less broadcast noise
Address efficiency — allocate exactly as many IPs as you need, no wastage

When you subnet, you borrow bits from the host portion and add them to the network portion — increasing the prefix length. More network bits = smaller subnets = fewer hosts per subnet.

📐

CIDR Prefix Reference Table

REFERENCE

Prefix	Subnet Mask	Total Addresses	Usable Hosts	Typical Use
`/8`	255.0.0.0	16,777,216	16,777,214	ISP, large org backbone
`/16`	255.255.0.0	65,536	65,534	Large campus, cloud VPC
`/20`	255.255.240.0	4,096	4,094	Medium office, data centre zone
`/24`	255.255.255.0	256	254	Standard office LAN, server subnet
`/25`	255.255.255.128	128	126	Split /24 into two halves
`/26`	255.255.255.192	64	62	Department subnets
`/27`	255.255.255.224	32	30	Small team subnet
`/28`	255.255.255.240	16	14	Small server cluster
`/29`	255.255.255.248	8	6	Router-to-router links
`/30`	255.255.255.252	4	2	Point-to-point links (2 hosts only)
`/31`	255.255.255.254	2	2*	P2P links (RFC 3021 — no network/broadcast)
`/32`	255.255.255.255	1	1	Host route, loopback, BGP next-hop

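Formula: Total addresses = 2^(32-prefix). Usable hosts = total - 2 (subtract network and broadcast). Exceptions: a /31 has no network or broadcast address, so both addresses are usable (RFC 3021), and a /32 is a single host address.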
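The formula and its two exceptions translate directly into code. This is a small sketch with hypothetical helper names (total_addresses, usable_hosts, prefix_to_mask); it reproduces the reference table rather than replacing it.

```python
def total_addresses(prefix: int) -> int:
    """Number of addresses covered by a prefix: 2^(32 - prefix)."""
    if not 0 <= prefix <= 32:
        raise ValueError("prefix must be between 0 and 32")
    return 2 ** (32 - prefix)


def usable_hosts(prefix: int) -> int:
    """Assignable host addresses, including the /31 and /32 special cases."""
    total = total_addresses(prefix)
    if prefix >= 31:
        return total                # /31: both usable (RFC 3021); /32: one host
    return total - 2                # drop the network and broadcast addresses


def prefix_to_mask(prefix: int) -> str:
    """Dotted-decimal subnet mask for a prefix length."""
    total_addresses(prefix)         # reuse the range check
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))


for prefix in (8, 16, 20, 24, 26, 30, 31, 32):
    print(f"/{prefix}", prefix_to_mask(prefix),
          total_addresses(prefix), usable_hosts(prefix))
```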

🧮

Subnetting by Hand — Step-by-Step Method

TECHNIQUE

Problem: You have 192.168.10.0/24 and need to create 4 equal subnets. What are the subnets?

Step 1 — How many bits to borrow?

You need 4 subnets = 2² → borrow 2 bits from the host portion. New prefix = /24 + 2 = /26.

Step 2 — What is the block size?

Block size = 256 - subnet_mask_last_octet = 256 - 192 = 64. (For /26: mask = 255.255.255.192, last octet = 192.)

Step 3 — List the subnets (increment by block size in the last octet):

Subnet	Network Addr	First Host	Last Host	Broadcast
`/26` #1	192.168.10.0	192.168.10.1	192.168.10.62	192.168.10.63
`/26` #2	192.168.10.64	192.168.10.65	192.168.10.126	192.168.10.127
`/26` #3	192.168.10.128	192.168.10.129	192.168.10.190	192.168.10.191
`/26` #4	192.168.10.192	192.168.10.193	192.168.10.254	192.168.10.255

Visual — network vs host bits for /26:

Network bits (26 bits fixed)
192.168.10.xx

Host bits (6 bits variable)
0–63 per subnet
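You can check the hand calculation with Python's standard ipaddress module. The sketch below derives the number of borrowed bits from the subnet count and then lists the four /26 subnets; the variable names and the describe helper are assumptions for this example.

```python
import ipaddress

parent = ipaddress.ip_network("192.168.10.0/24")
wanted_subnets = 4

# Bits to borrow: the smallest n with 2**n >= wanted_subnets.
borrowed_bits = (wanted_subnets - 1).bit_length()


def describe(net):
    """Return (network, first host, last host, broadcast) as strings."""
    hosts = list(net.hosts())
    return (str(net.network_address), str(hosts[0]),
            str(hosts[-1]), str(net.broadcast_address))


rows = [describe(subnet) for subnet in parent.subnets(prefixlen_diff=borrowed_bits)]
for number, row in enumerate(rows, start=1):
    print(f"/{parent.prefixlen + borrowed_bits} #{number}", *row)
```

Each printed row should match the corresponding row of the table in Step 3.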

💡 NGFW application: In a typical enterprise NGFW deployment you'll design security zones as subnets: 10.0.1.0/24 = Inside LAN, 10.0.2.0/24 = DMZ servers, 10.0.3.0/24 = Management. The firewall sits between these subnets and applies policy at the IP layer. Knowing subnetting lets you write precise ACL rules like permit ip 10.0.1.0/24 10.0.2.0/24.

💻

Subnet Arithmetic in C

CODE

#include <stdio.h>
#include <arpa/inet.h>
#include <stdint.h>

int main() {
    /* IP and prefix */
    uint32_t ip     = inet_addr("192.168.10.100");  /* network byte order */
    uint32_t prefix = 26;

    /* Build mask: ~0 shifted left by (32-prefix) bits.
       A shift by 32 is undefined in C, so prefix 0 is handled separately. */
    uint32_t mask = prefix ? htonl(~0u << (32 - prefix)) : 0;  /* 0xFFFFFFC0 = /26 */

    /* Network address = ip AND mask */
    uint32_t network   = ip & mask;

    /* Broadcast = network OR (NOT mask) */
    uint32_t broadcast = network | ~mask;

    /* First and last host */
    uint32_t first = htonl(ntohl(network) + 1);
    uint32_t last  = htonl(ntohl(broadcast) - 1);

    /* Usable host count */
    uint32_t hosts = ntohl(broadcast) - ntohl(network) - 1;

    char buf[INET_ADDRSTRLEN];
    printf("Network:   %s\n", inet_ntop(AF_INET, &network,   buf, sizeof(buf)));
    printf("Broadcast: %s\n", inet_ntop(AF_INET, &broadcast, buf, sizeof(buf)));
    printf("First:     %s\n", inet_ntop(AF_INET, &first,     buf, sizeof(buf)));
    printf("Last:      %s\n", inet_ntop(AF_INET, &last,      buf, sizeof(buf)));
    printf("Hosts:     %u\n", hosts);
    return 0;
}

SPECIAL IPv4 ADDRESS RANGES — KNOW THESE BY HEART

🗺️

Reserved and Special Address Ranges

REFERENCE

Range	Name	Purpose	RFC	NGFW Relevance
`10.0.0.0/8`	Private Class A	Internal networks — not routed on internet	RFC 1918	Typically "inside" zone — allow policy
`172.16.0.0/12`	Private Class B	Internal networks — covers 172.16–172.31.x.x	RFC 1918	Often used for DMZ / management
`192.168.0.0/16`	Private Class C	Internal networks — common in SOHO/offices	RFC 1918	Home/branch office subnets
`127.0.0.0/8`	Loopback	Local host communication. 127.0.0.1 = "localhost"	RFC 5735	Never route this — drop at perimeter
`169.254.0.0/16`	Link-Local / APIPA	Auto-assigned when DHCP fails. Not routable	RFC 3927	Indicator of DHCP failure on host
`100.64.0.0/10`	Shared Address Space	CGN (Carrier-Grade NAT) — ISP internal use	RFC 6598	Treat like RFC 1918 — don't route externally
`0.0.0.0/8`	Unspecified	0.0.0.0 = "this host" — used before IP assigned	RFC 1122	Drop all packets with source 0.0.0.0
`255.255.255.255/32`	Broadcast	Limited broadcast — all hosts on local network	RFC 919	Drop at firewall — never route
`224.0.0.0/4`	Multicast	Group communication (OSPF, video streaming)	RFC 5771	Allow selectively (OSPF: 224.0.0.5/6)
`240.0.0.0/4`	Reserved	Reserved for future use — treat as invalid	RFC 1112	Drop all packets in this range
`192.0.2.0/24`	TEST-NET-1	Documentation and examples — never real traffic	RFC 5737	Drop at perimeter
`198.51.100.0/24`	TEST-NET-2	Documentation — as above	RFC 5737	Drop at perimeter
`203.0.113.0/24`	TEST-NET-3	Documentation — as above	RFC 5737	Drop at perimeter

🛡️

Bogon Filtering — NGFW First Line of Defence

SECURITY

A bogon is an IP address that should never appear as a source on the public internet — either because it's reserved (RFC 1918 private, loopback, link-local) or unallocated. An NGFW at the internet perimeter should drop all packets with bogon source addresses — they indicate either misconfiguration or deliberate spoofing (attack).

/* Bogon filter — drop these source IPs at internet-facing interface */
/* These are source addresses that should NEVER arrive from the internet */

Bogon source ranges to block:
  10.0.0.0/8          RFC 1918 private
  172.16.0.0/12       RFC 1918 private
  192.168.0.0/16      RFC 1918 private
  127.0.0.0/8         Loopback
  169.254.0.0/16      Link-local
  100.64.0.0/10       Shared address space
  0.0.0.0/8           Unspecified
  240.0.0.0/4         Reserved
  224.0.0.0/4         Multicast (as source — invalid)
  192.0.2.0/24        TEST-NET-1
  198.51.100.0/24     TEST-NET-2
  203.0.113.0/24      TEST-NET-3

/* Unicast Reverse Path Forwarding (uRPF) — a smarter bogon filter */
/* Router drops packets if the source IP has no route back via the */
/* same interface the packet arrived on — prevents spoofed sources */

In VPP, bogon filtering is implemented as an IP feature arc plugin with a bihash lookup of source address against a prefix table. You'll build a version of this in Phase 6 (NGFW Development).
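The bogon check itself is a prefix-membership test. The sketch below uses Python's ipaddress module with a simple linear scan over the ranges listed above; it is only meant to show the logic and is not the bihash-based VPP design mentioned in the text. The names BOGON_NETWORKS and is_bogon are assumptions for this example.

```python
import ipaddress

# Source ranges from the bogon list above.
BOGON_NETWORKS = [ipaddress.ip_network(prefix) for prefix in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",   # RFC 1918 private
    "127.0.0.0/8",                                     # loopback
    "169.254.0.0/16",                                  # link-local
    "100.64.0.0/10",                                   # shared address space
    "0.0.0.0/8",                                       # unspecified
    "240.0.0.0/4",                                     # reserved
    "224.0.0.0/4",                                     # multicast (invalid as source)
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24",  # TEST-NETs
)]


def is_bogon(source_ip: str) -> bool:
    """True if the source address falls in any bogon range (linear scan)."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in BOGON_NETWORKS)


for candidate in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "8.8.8.8"):
    verdict = "drop" if is_bogon(candidate) else "pass"
    print(candidate, verdict)
```

Note the 172.31 / 172.32 pair: the /12 boundary is a common place to get the private range wrong.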

TTL, ROUTING BASICS, AND HOW ROUTERS FORWARD PACKETS

⏱️

TTL — Time To Live

TTL

TTL is an 8-bit counter in the IP header that starts at a value set by the sender (typically 64 for Linux, 128 for Windows, 255 for many routers) and is decremented by 1 at every router hop. When TTL reaches 0, the router discards the packet and sends an ICMP Time Exceeded message back to the original sender.

Why TTL exists: Without TTL, a packet caught in a routing loop (two routers sending it back and forth) would circulate forever, consuming bandwidth indefinitely. TTL guarantees every packet has a finite lifetime.

/* TTL trace: packet from your laptop to 8.8.8.8 */
Hop 1: Your router     TTL: 64 → 63   (decremented, forwarded)
Hop 2: ISP router 1   TTL: 63 → 62   (decremented, forwarded)
Hop 3: ISP router 2   TTL: 62 → 61   (decremented, forwarded)
...
Hop 12: Google router  TTL: 53 → 52   (decremented, forwarded)
Hop 13: 8.8.8.8        TTL: 52        (received — destination reached)

/* If TTL hits 0 at an intermediate router: */
Router discards packet + sends ICMP Type 11, Code 0 (Time Exceeded)
Sender receives ICMP with source IP of the discarding router
→ This is how traceroute works! (see ICMP tab)

/* Default TTL values by OS */
Linux:   64    (set in /proc/sys/net/ipv4/ip_default_ttl)
Windows: 128
Cisco:   255
macOS:   64

NGFW use of TTL: The received TTL enables passive OS fingerprinting. A packet arriving with TTL=127 likely came from a Windows host one hop away (started at 128, decremented once). Firewalls can use this for passive OS detection, and some NGFW features normalise TTL values to defeat such fingerprinting.
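The passive-detection idea can be sketched as follows: assume the sender started at the nearest common default at or above the observed TTL, and treat the difference as the hop count. The table of defaults comes from the list above; the function name guess_origin is hypothetical, and the result is only a guess because senders can set any initial TTL.

```python
# Default initial TTLs from the list above.
COMMON_INITIAL_TTLS = {64: "Linux/macOS", 128: "Windows", 255: "Cisco"}


def guess_origin(observed_ttl: int):
    """Return (assumed initial TTL, estimated hop count, likely OS family)."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    # Smallest common default that is not below what we observed.
    initial = min(ttl for ttl in COMMON_INITIAL_TTLS if ttl >= observed_ttl)
    return initial, initial - observed_ttl, COMMON_INITIAL_TTLS[initial]


print(guess_origin(127))    # (128, 1, 'Windows')
print(guess_origin(52))     # (64, 12, 'Linux/macOS')
```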

🗺️

How Routers Make Forwarding Decisions

ROUTING BASICS

Every router maintains a routing table (the RIB), from which it builds the FIB (Forwarding Information Base) that the forwarding path actually consults; the two terms are often used interchangeably. When a packet arrives, the router looks up the destination IP address in the FIB using Longest Prefix Match (LPM): find the most specific route that covers the destination.

/* Example routing table on a Linux router */
$ ip route show

10.0.0.0/8       via 192.168.1.1 dev eth0        # Match any 10.x.x.x
10.10.0.0/16     via 192.168.1.2 dev eth0        # More specific match
10.10.1.0/24     dev eth1 proto kernel scope link # Most specific — local
0.0.0.0/0        via 203.0.113.1 dev eth2         # Default route (catch-all)

/* LPM example: packet destined for 10.10.1.55 */
Matches 0.0.0.0/0    → /0  — too broad
Matches 10.0.0.0/8   → /8  — candidate
Matches 10.10.0.0/16 → /16 — more specific
Matches 10.10.1.0/24 → /24 — MOST SPECIFIC → this one wins

/* Router actions after lookup: */
1. Decrement TTL (if TTL becomes 0: drop + send ICMP Time Exceeded)
2. Recompute IP header checksum (TTL changed)
3. ARP-resolve next-hop MAC if not cached
4. Rewrite Ethernet header: new dst MAC (next-hop) + src MAC (this router's outgoing port)
5. Transmit on outgoing interface

💡 What a plain router does NOT touch: source IP, destination IP, and the entire IP payload (TCP/UDP/application data). IP routing is transparent to endpoints: your laptop doesn't know or care how many routers handled its packet. When simply forwarding, routers only touch the Ethernet header and the TTL/checksum fields of the IP header (NAT and fragmentation are the exceptions).
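Longest Prefix Match is easy to demonstrate with the example routing table. The sketch below does a linear scan and keeps the match with the longest prefix; production forwarding planes use trie-like structures instead, so treat this purely as an illustration. The names ROUTES, TABLE and lookup are assumptions for this example.

```python
import ipaddress

# The example routing table from above: (prefix, next hop / interface).
ROUTES = [
    ("10.0.0.0/8",   "via 192.168.1.1 dev eth0"),
    ("10.10.0.0/16", "via 192.168.1.2 dev eth0"),
    ("10.10.1.0/24", "dev eth1 (connected)"),
    ("0.0.0.0/0",    "via 203.0.113.1 dev eth2"),
]
TABLE = [(ipaddress.ip_network(prefix), next_hop) for prefix, next_hop in ROUTES]


def lookup(destination: str):
    """Return (prefix, next hop) of the most specific matching route, or None."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in TABLE if address in net]
    if not matches:
        return None
    best_net, best_hop = max(matches, key=lambda match: match[0].prefixlen)
    return str(best_net), best_hop


print(lookup("10.10.1.55"))     # ('10.10.1.0/24', 'dev eth1 (connected)')
print(lookup("8.8.8.8"))        # ('0.0.0.0/0', 'via 203.0.113.1 dev eth2')
```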

IP FRAGMENTATION AND PATH MTU DISCOVERY

✂️

Why Fragmentation Exists

CONCEPT

Every network link has a Maximum Transmission Unit (MTU) — the largest IP packet it can carry. Standard Ethernet: 1500 bytes. Some links are smaller (PPPoE adds 8 bytes overhead, reducing effective MTU to 1492). When a packet larger than a link's MTU needs to cross that link, IP fragments it into smaller pieces.

Fragmentation happens at any router along the path (not just the sender) and reassembly happens only at the destination host — not at intermediate routers. This design choice avoids reassembly overhead at every hop.

✂️

How Fragmentation Works

MECHANICS

Scenario: a 4000-byte IP packet arrives at a router whose outgoing link has MTU 1500. The router fragments it into three pieces.

Original packet: 4000 bytes (IP header 20B + data 3980B), too large for MTU 1500. The router fragments at the MTU 1500 boundary:

Fragment	Total Length	IP Hdr	ID	MF	Offset	Data Bytes	Data Size
1	1500B	20B	x	1	0	0–1479	1480 bytes
2	1500B	20B	x	1	185	1480–2959	1480 bytes
3	1040B	20B	x	0	370	2960–3979	1020 bytes

Fragment field values explained:

Identification = x — same value in all 3 fragments (receiver uses this to group them)
MF=1 — More Fragments — set on first two fragments, MF=0 on the last
Fragment Offset — in units of 8 bytes: 0 / 185 (1480÷8) / 370 (2960÷8)
Fragment data size — must be multiple of 8 bytes (except last) to allow correct offset calculation
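The arithmetic behind these values can be reproduced with a short Python sketch. It assumes a 20-byte header on every fragment (no IP options) and only computes the field values; it does not build real packets. The function name fragment and the dictionary keys are made up for this example.

```python
def fragment(total_length: int, mtu: int, header_len: int = 20):
    """Compute the fragments a router would emit for one oversized packet."""
    payload = total_length - header_len
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    offset = 0
    while offset < payload:
        size = min(max_data, payload - offset)
        is_last = offset + size >= payload
        fragments.append({
            "total_length": header_len + size,
            "mf": 0 if is_last else 1,
            "offset_units": offset // 8,        # value of the Fragment Offset field
            "first_byte": offset,
            "last_byte": offset + size - 1,
            "data_len": size,
        })
        offset += size
    return fragments


for frag in fragment(4000, 1500):
    print(frag)
```

For the 4000-byte example this yields offsets 0, 185 and 370, with MF set on the first two fragments only.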

⚠️

Fragmentation Problems and PMTUD

ISSUES

Fragmentation causes several real-world problems:

Performance overhead — reassembly at the destination consumes CPU and memory. Fragments must be buffered until all arrive.
Firewall complexity — stateful firewalls must reassemble fragments before inspecting the transport header (TCP/UDP ports are only in the first fragment). This is a significant processing cost.
Fragment attacks — attackers exploit fragmentation: overlapping fragments (Teardrop), tiny first fragment (hides TCP flags from firewall), missing last fragment (holds reassembly buffer forever).
ICMP filtering — some networks block ICMP, which breaks Path MTU Discovery (see below).

Path MTU Discovery (PMTUD)

Modern systems avoid fragmentation by discovering the smallest MTU along the path: they send packets with DF set and shrink them whenever a router reports that one was too large:

Sender sets DF=1 on all packets

Don't Fragment bit = 1 tells routers not to fragment — drop instead.

Router with smaller MTU drops packet

Router drops the oversized packet and sends ICMP Type 3, Code 4 "Fragmentation Needed" with the MTU of its link.

ICMP: Type=3 Code=4 Next-Hop-MTU=1492

Sender reduces packet size

Sender receives the ICMP and records the reduced MTU for this destination. Future packets use the smaller size. TCP adjusts its MSS (Maximum Segment Size) accordingly.

⚠️ ICMP Black Holes: If a firewall blocks ICMP (a common but misguided practice), PMTUD breaks. The sender never receives the "Fragmentation Needed" message, packets keep getting dropped silently, and connections hang. This manifests as "large downloads hang after a few KB". NGFW policy must allow ICMP Type 3, Code 4 through for PMTUD to work correctly.
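The PMTUD loop can be modelled without touching the network. In this sketch the path is a hypothetical list of link MTUs, and every drop is assumed to produce an ICMP Type 3, Code 4 message that reaches the sender (that is, no black hole). The function name discover_path_mtu and the example path are assumptions.

```python
def discover_path_mtu(link_mtus, initial_size):
    """Simulate PMTUD over a path given as the MTU of each link in order.

    Returns (discovered path MTU, list of (packet size, reported next-hop MTU)).
    """
    size = initial_size
    attempts = []
    while True:
        # The first link too small for the packet drops it and reports its MTU.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        attempts.append((size, bottleneck))
        if bottleneck is None:
            return size, attempts       # packet made it end to end
        size = bottleneck               # sender adopts the Next-Hop MTU from the ICMP


path_mtu, attempts = discover_path_mtu([1500, 1492, 1400, 1500], 1500)
print(path_mtu)                         # 1400
print(attempts)                         # [(1500, 1492), (1492, 1400), (1400, None)]

# TCP then derives its MSS: path MTU minus 20-byte IP and 20-byte TCP headers.
print(path_mtu - 40)                    # 1360
```

If the ICMP message is filtered, the sender never leaves the first iteration, which is exactly the black-hole symptom described above.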

ICMP — INTERNET CONTROL MESSAGE PROTOCOL (RFC 792)

📨

What ICMP Is and Why It Exists

OVERVIEW

ICMP is IP's built-in diagnostic and error-reporting protocol. It travels inside IP packets (Protocol = 1) and is used by routers and hosts to report errors and exchange control information. ICMP has no concept of ports: it is carried directly in IP, alongside TCP and UDP rather than on top of them.

ICMP is essential for:

Ping — testing reachability (Echo Request/Reply)
Traceroute — discovering the path to a destination (abuses TTL expiry)
Error reporting — telling senders why their packets were dropped
Path MTU Discovery — informing senders of MTU limitations (Type 3, Code 4)

ICMP format: 8-byte fixed header (Type, Code, Checksum, + 4 bytes of type-specific data) followed by optional additional data.

📋

ICMP Message Types — Complete Reference

REFERENCE

Type	Code	Name	Direction	Caused By / Use
`0`	0	Echo Reply	Host → Pinger	Response to ping (Type 8)
`3`	0	Dest Unreachable — Net	Router → Sender	No route to destination network
`3`	1	Dest Unreachable — Host	Router → Sender	No route to specific host
`3`	2	Dest Unreachable — Protocol	Host → Sender	Protocol not supported on destination
`3`	3	Dest Unreachable — Port	Host → Sender	UDP port not listening (no process bound)
`3`	4	Fragmentation Needed	Router → Sender	Packet too large + DF=1 set. Includes next-hop MTU.
`3`	13	Dest Unreachable — Admin Prohibited	Router/FW → Sender	Firewall rejected the packet (administrative filter). Codes 9 and 10 are the network- and host-specific variants.
`5`	0-3	Redirect	Router → Host	Better route exists via different gateway
`8`	0	Echo Request	Sender → Host	Ping — tests reachability
`11`	0	Time Exceeded (TTL)	Router → Sender	TTL reached 0 — used by traceroute
`11`	1	Time Exceeded (Reassembly)	Host → Sender	Fragment reassembly timer expired
`12`	0-2	Parameter Problem	Router/Host → Sender	IP header field is invalid
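A decoder for the 8-byte ICMP header shows how the Type/Code pairs in the table map onto bytes. This is a minimal sketch: the function name decode_icmp and the lookup tables are assumptions, only a subset of the table is included, and the checksum is read but not verified.

```python
import struct

# (type, code) -> name, for a subset of the reference table above.
ICMP_NAMES = {
    (0, 0): "Echo Reply",
    (3, 0): "Dest Unreachable: Net",
    (3, 1): "Dest Unreachable: Host",
    (3, 2): "Dest Unreachable: Protocol",
    (3, 3): "Dest Unreachable: Port",
    (3, 4): "Fragmentation Needed",
    (8, 0): "Echo Request",
    (11, 0): "Time Exceeded (TTL)",
    (11, 1): "Time Exceeded (Reassembly)",
}
TYPE_FALLBACK = {3: "Dest Unreachable", 5: "Redirect", 12: "Parameter Problem"}


def decode_icmp(message: bytes) -> dict:
    """Decode Type, Code, Checksum and the 4 type-specific bytes."""
    if len(message) < 8:
        raise ValueError("ICMP header is 8 bytes")
    icmp_type, code, checksum, rest = struct.unpack("!BBHI", message[:8])
    name = ICMP_NAMES.get((icmp_type, code)) or TYPE_FALLBACK.get(icmp_type, "Unknown")
    info = {"type": icmp_type, "code": code, "checksum": checksum, "name": name}
    if (icmp_type, code) == (3, 4):
        info["next_hop_mtu"] = rest & 0xFFFF        # low 16 bits (RFC 1191)
    elif icmp_type in (0, 8):
        info["id"], info["seq"] = rest >> 16, rest & 0xFFFF
    return info


# Hand-built samples (checksum left at 0 because it is not verified here).
frag_needed = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
echo_request = struct.pack("!BBHHH", 8, 0, 0, 0x1234, 7)
print(decode_icmp(frag_needed)["next_hop_mtu"])     # 1492
print(decode_icmp(echo_request)["name"])            # Echo Request
```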

🔍

Traceroute — How It Works

TECHNIQUE

Traceroute exploits TTL and ICMP Time Exceeded messages to discover every router on the path to a destination. It sends packets with progressively increasing TTL values (1, 2, 3, ...) and each router that drops the packet (TTL=0) sends back its IP address in the ICMP Time Exceeded message.

/* Traceroute algorithm */
Round 1: Send 3 packets with TTL=1
  → First router decrements to 0, drops, sends ICMP Time Exceeded
  → Reveal: first hop IP = 192.168.1.1 (your gateway)

Round 2: Send 3 packets with TTL=2
  → Second router decrements to 0, drops, sends ICMP Time Exceeded
  → Reveal: second hop IP = 10.10.1.1 (ISP edge router)

Round 3: Send 3 packets with TTL=3
  → Third router... and so on until destination replies

/* Two implementations */
traceroute on Linux: sends UDP packets to high port (33434+)
                     destination replies with ICMP Port Unreachable (type 3, code 3)
tracert   on Windows: sends ICMP Echo Requests
                     destination replies with Echo Reply (type 0)

/* Run it */
$ traceroute -n 8.8.8.8    # -n skips DNS resolution (faster)
$ traceroute -I 8.8.8.8    # -I uses ICMP instead of UDP
$ mtr 8.8.8.8              # live updating traceroute

Interpreting traceroute output:

* * * — router doesn't respond to probes (rate-limited or blocks ICMP) — does NOT mean the path is broken there
Increasing RTT — normal as you move further away
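RTT dropping at a later hop — the earlier router was slow to generate its ICMP reply (ICMP generation is rate-limited and low priority on many routers), or the replies took a different return path. It does not mean the later hop is closer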
Asymmetric routing — forward and return path may be different (explains apparent RTT anomalies)
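The probing loop is easier to follow as a simulation that sends no real packets. Below, the path is a hypothetical list of hop addresses ending with the destination; each router decrements TTL and answers with Time Exceeded when it reaches 0. The names probe and traceroute, and the third hop address, are assumptions for this example.

```python
def probe(path, ttl):
    """Walk one probe along the path; return (responder IP, reply kind)."""
    for index, hop_ip in enumerate(path):
        if index == len(path) - 1:
            return hop_ip, "destination reply"      # Echo Reply or Port Unreachable
        ttl -= 1                                    # each router decrements TTL
        if ttl == 0:
            return hop_ip, "ICMP Time Exceeded"     # router drops and reports


def traceroute(path, max_ttl=30):
    """Send probes with TTL=1,2,3,... until the destination answers."""
    if not path:
        raise ValueError("path must contain at least the destination")
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder, reply = probe(path, ttl)
        hops.append((ttl, responder, reply))
        if reply == "destination reply":
            break
    return hops


# Hypothetical path: two routers from the example above, one more, then 8.8.8.8.
for hop in traceroute(["192.168.1.1", "10.10.1.1", "203.0.113.9", "8.8.8.8"]):
    print(hop)
```

A real implementation additionally needs timeouts for hops that never answer, which is where the * * * lines come from.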

🛡️

ICMP and NGFW — What to Allow, What to Block

NGFW POLICY

A common mistake is to block all ICMP at the firewall. This breaks PMTUD and troubleshooting. Here's the correct NGFW policy for ICMP:

ICMP Type/Code	Direction	NGFW Action	Reason
Type 8 (Echo Request)	Inbound from internet	Block or rate-limit	Reduces attack surface, prevents mapping
Type 0 (Echo Reply)	Inbound (reply to outbound ping)	Allow (stateful)	Return traffic for initiated pings
Type 3, Code 4 (Frag Needed)	Inbound	Always allow	PMTUD — blocking this breaks connections
Type 3, Code 0-3 (Dest Unreach)	Inbound	Allow (stateful)	Error replies for existing connections
Type 11 (TTL Exceeded)	Inbound	Allow	Traceroute return path, debugging
Type 5 (Redirect)	Inbound	Block	ICMP redirect attacks — can reroute traffic
All ICMP	Outbound	Allow	Internal users need full diagnostic capability
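The policy table can be expressed as a small decision function. This is a sketch of the table's logic only, not a real firewall API: the function name icmp_action, the related_to_flow flag (standing in for a stateful connection lookup), the choice of rate-limit over block for Echo Request, and the default deny for unlisted types are all assumptions.

```python
def icmp_action(icmp_type, code, direction, related_to_flow=False):
    """Return 'allow', 'block' or 'rate-limit' for one ICMP message.

    direction is 'inbound' or 'outbound'; related_to_flow says whether the
    message matches state the firewall already tracks.
    """
    if direction == "outbound":
        return "allow"                      # internal users keep full diagnostics
    if icmp_type == 3 and code == 4:
        return "allow"                      # Fragmentation Needed: required for PMTUD
    if icmp_type == 5:
        return "block"                      # Redirect: can be used to reroute traffic
    if icmp_type == 8:
        return "rate-limit"                 # Echo Request from the internet
    if icmp_type == 11:
        return "allow"                      # Time Exceeded: traceroute return path
    if icmp_type == 0 or (icmp_type == 3 and code in (0, 1, 2, 3)):
        return "allow" if related_to_flow else "block"   # stateful allow
    return "block"                          # assumption: default deny for the rest


print(icmp_action(3, 4, "inbound"))                          # allow
print(icmp_action(0, 0, "inbound", related_to_flow=True))    # allow
print(icmp_action(5, 1, "inbound"))                          # block
```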

LAB 1

Dissect an IPv4 Header with Scapy and Wireshark

Objective: Craft raw IP packets with specific field values and observe exactly how each field appears in Wireshark. Build deep familiarity with every byte of the IP header.

Install Scapy: pip3 install scapy. Open Python3 as root: sudo python3. Import: from scapy.all import *.

Construct a minimal IP packet and inspect its fields: p = IP(dst="8.8.8.8") / ICMP() then p.show(). Observe every field Scapy has automatically set: version=4, ihl=5, ttl=64, proto=1 (ICMP), src (your IP), dst (8.8.8.8).

Examine the raw bytes: bytes(p). Count them — 20 bytes of IP header + 8 bytes ICMP = 28 bytes. Identify which bytes correspond to which fields (e.g., bytes 8–9 = TTL + Protocol).

Start a Wireshark capture. Send the packet: send(p). Find it in Wireshark. Expand the Internet Protocol layer. Verify every field matches what Scapy showed.

Now deliberately set unusual field values and observe: p = IP(dst="8.8.8.8", ttl=1, flags="DF", id=0xBEEF) / ICMP(). Send it. In Wireshark: (a) Does TTL appear as 1? (b) Is the DF flag set? (c) Is the Identification 0xBEEF (decimal 48879)? (d) What ICMP error did you get back — Time Exceeded?

Bonus — Hexdump analysis: hexdump(p) in Scapy shows the raw hex. Manually decode the first 20 bytes: byte 0 = version+IHL (0x45 = version 4, IHL 5), bytes 2-3 = total length, byte 8 = TTL, byte 9 = protocol. Cross-reference with the IP header diagram in Tab 1.

LAB 2

Subnetting Practice — Design an NGFW Network

Objective: Design a complete network layout for an NGFW deployment from scratch using subnetting. This simulates real-world work.

Task: You have the network 10.0.0.0/16 and need to create: (a) Inside LAN for 500 hosts, (b) DMZ for 50 servers, (c) Management network for 20 devices, (d) NGFW to router link (2 hosts only). Design subnets with the minimum waste. Calculate network address, mask, broadcast, first host, last host for each.

Write the answer before checking. Then verify using:

python3 -c "import ipaddress; n = ipaddress.ip_network('10.0.0.0/23'); print(list(n.hosts())[0], list(n.hosts())[-1], n.broadcast_address)"

Use Python's ipaddress module to validate all your subnet calculations.

Configure your subnets on Linux loopback interfaces to test them: sudo ip addr add 10.0.0.1/23 dev lo label lo:0. Add routes for each subnet, for example sudo ip route add 10.0.2.0/26 dev lo if you placed the DMZ (50 servers, so a /26) at 10.0.2.0. Verify routing with ip route get 10.0.2.50.

Write iptables rules reflecting your NGFW policy between zones: allow all from Inside LAN to internet (MASQUERADE), allow only HTTP/HTTPS from internet to DMZ, deny Inside LAN to DMZ except port 443, deny all to Management except SSH from specific host. Verify with iptables -L -n -v.

LAB 3

ICMP and Traceroute Analysis

Objective: Capture and fully decode ICMP messages including ping, Time Exceeded (traceroute), and Destination Unreachable. Understand the complete ICMP interaction with IP.

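Start Wireshark capture with filter icmp. Run: ping -c 4 8.8.8.8. Identify Type 8 (Echo Request) and Type 0 (Echo Reply) packets. In the hex dump, counting from the start of the IP header (add 14 bytes for the Ethernet header shown in Wireshark's byte pane, and assuming a 20-byte IP header), find: byte 20 = ICMP Type, byte 21 = ICMP Code, bytes 22-23 = Checksum, bytes 24-27 = Identifier + Sequence number.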

Run traceroute while capturing: sudo traceroute -n -I 8.8.8.8. Capture Type 11 (Time Exceeded) replies from intermediate routers. Expand an ICMP Time Exceeded packet and notice that it embeds the original IP header plus at least the first 8 bytes of the original payload, so the sender knows which packet caused the error.

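Generate a Destination Unreachable (Port Unreachable): run nc -u against UDP port 9999 on a target you control, such as a Linux host or VM on your LAN with nothing listening on that port, then type anything and press Enter. The target should answer with ICMP Type 3, Code 3 (Port Unreachable). Public servers such as 8.8.8.8 usually drop these probes silently, so do not expect a reply from them. Capture and decode the reply in Wireshark.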

Use Scapy to build a custom ICMP packet: force a specific TTL to trigger Time Exceeded from a known router hop. Use ans, unans = sr(IP(dst="8.8.8.8", ttl=3)/ICMP(), timeout=2). Then ans.show() — you'll see the response from hop 3 on your path to 8.8.8.8.

Bonus: Write a 20-line Python traceroute using Scapy. Send ICMP with TTL=1,2,3,... until you reach the destination or TTL=30. Print the IP of each responding router. This is the core of how traceroute works — you're implementing it from scratch.

M03 MASTERY CHECKLIST

Can explain IP's three core jobs: logical addressing, fragmentation, best-effort delivery
Know the IPv4 header is 20 bytes minimum (5 × 32-bit rows) and can draw it from memory
Know every header field: Version, IHL, DSCP/ECN, Total Length, ID, Flags, Fragment Offset, TTL, Protocol, Checksum, Src IP, Dst IP
Know the key Protocol field values: 1=ICMP, 6=TCP, 17=UDP, 47=GRE, 50=ESP, 89=OSPF
Know how to find the payload start using IHL: payload_offset = IHL × 4
Know the two flag bits: DF (Don't Fragment) and MF (More Fragments) and what each does
Know the three classful address classes (A/B/C) and their default masks (/8, /16, /24)
Can convert between dotted-decimal and CIDR notation for any prefix length
Can manually calculate network address, broadcast address, first host, last host, and host count for any given CIDR prefix
Can subnet a given network into N equal subnets by hand using the block-size method
Know all RFC 1918 private ranges by heart: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
Know at least 8 special address ranges: loopback, link-local, multicast, broadcast, shared (CGN), TEST-NETs
Understand bogon filtering and can list the key ranges an NGFW should drop at the internet perimeter
Know how TTL works: decremented at each router, ICMP Time Exceeded when 0, default values per OS
Understand how LPM routing works: router looks up destination IP in FIB, most specific match wins
Know what a router does and does NOT modify: decrements TTL, recomputes checksum, rewrites Ethernet header — never touches IP src/dst or payload
Can explain IP fragmentation: what triggers it, Identification/MF/Offset fields, reassembly at destination only
Understand Path MTU Discovery (PMTUD): DF=1 + ICMP Type 3 Code 4 — and why blocking ICMP breaks it
Know key ICMP types: 0 (Echo Reply), 3 (Unreachable), 5 (Redirect), 8 (Echo Request), 11 (Time Exceeded)
Know which ICMP types to allow at NGFW: always allow Type 3 Code 4, block Type 5 (Redirect)
Can explain how traceroute works: sends TTL=1,2,3... probes, collects ICMP Time Exceeded from each hop
Completed Lab 1: crafted IPv4 packets in Scapy, decoded header bytes, verified fields in Wireshark
Completed Lab 2: designed an NGFW subnet layout for 4 zones, configured on Linux, wrote iptables rules
Completed Lab 3: captured ping, traceroute, and Destination Unreachable ICMP messages, wrote custom traceroute in Scapy

✅ When complete: Move to M04 - IPv6. You now have deep IPv4 knowledge. IPv6 keeps the same layered approach but changes addressing fundamentally: 128-bit addresses, no broadcast, address autoconfiguration via SLAAC, and Neighbor Discovery over ICMPv6 in place of ARP. Much of what you learned here maps directly.
