WHY TUNNELING EXISTS — OVERLAY OVER UNDERLAY
The Tunneling Concept
Tunneling encapsulates one network protocol inside another — creating a virtual link between two endpoints that may be separated by many intermediate hops that don't need to understand the inner protocol. The underlay is the physical/IP network; the overlay is the virtual network running on top.
Core use cases for tunneling:
- Carry non-IP traffic over IP — legacy protocols (IPX, SNA) encapsulated in IP/GRE for transport over modern IP networks
- Connect private networks over public internet — VPN tunnels (GRE+IPsec, WireGuard) connect branch offices over the internet as if they were directly connected
- Scale L2 over L3 — VxLAN extends Layer 2 Ethernet broadcast domains across Layer 3 IP networks — essential for data centre multi-tenancy and VM migration
- Traffic engineering — MPLS labels allow routers to forward packets along pre-computed explicit paths, bypassing normal IP routing
- Network virtualisation — SDN overlays (OVN, NSX, ACI) use tunnels to implement virtual networks with arbitrary topology on top of physical hardware
Encapsulation Overhead Comparison
| Tunnel Type | Added Headers | Total Overhead | Effective MTU (from 1500) |
|---|---|---|---|
| GRE (basic) | IP(20) + GRE(4) | 24 bytes | 1476 bytes |
| GRE + IPsec (ESP) | IP(20) + GRE(4) + ESP(~50) | ~74 bytes | ~1426 bytes |
| VxLAN | Eth(14) + IP(20) + UDP(8) + VxLAN(8) | 50 bytes | 1450 bytes |
| MPLS (1 label) | MPLS label(4) | 4 bytes per label | 1496 bytes |
| MPLS (2 labels) | MPLS label(8) | 8 bytes | 1492 bytes |
| WireGuard | IP(20) + UDP(8) + WireGuard(~32) | ~60 bytes | ~1440 bytes |
| IPsec (ESP transport) | ESP(~40) | ~40 bytes | ~1460 bytes |
⚠️ MTU fragmentation is the #1 tunneling operational problem. When the effective MTU is reduced by tunnel overhead, packets that filled the original MTU now exceed the tunnel's MTU. If DF=1 is set (common with TCP), they get dropped. Solutions: MSS clamping (TCP only), Path MTU Discovery, configuring tunnel endpoints with reduced MTU, jumbo frames on the underlay.
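The MSS-clamping arithmetic can be sketched in a few lines. The overhead values mirror the table above; the function name and dictionary are illustrative, not a real API:

```python
# Sketch of MSS clamping arithmetic. Overhead values are from the
# overhead table; names here are illustrative.
TUNNEL_OVERHEAD = {
    "gre": 24,        # outer IP(20) + GRE(4)
    "vxlan": 50,      # outer Eth(14) + IP(20) + UDP(8) + VxLAN(8)
    "wireguard": 60,  # outer IP(20) + UDP(8) + WireGuard(~32)
}

def clamped_mss(underlay_mtu: int, tunnel: str) -> int:
    """TCP MSS that fits: effective MTU minus inner IP(20) and TCP(20)."""
    effective_mtu = underlay_mtu - TUNNEL_OVERHEAD[tunnel]
    return effective_mtu - 20 - 20

print(clamped_mss(1500, "gre"))    # 1436
print(clamped_mss(1500, "vxlan"))  # 1410
```

On Linux the same effect is typically achieved with the iptables TCPMSS target (`--clamp-mss-to-pmtu` or `--set-mss`) on forwarded SYN packets.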
MPLS — MULTIPROTOCOL LABEL SWITCHING
MPLS Architecture and Label Forwarding
MPLS (RFC 3032) inserts a 32-bit label between the Layer 2 header and the IP header — hence the nickname "Layer 2.5". Labels allow routers to forward packets with a fixed-length exact-match lookup (O(1)) rather than a longest-prefix-match IP lookup, and enable traffic engineering by pre-computing explicit paths through the network.
```
/* MPLS label format (32 bits) */
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Label (20 bits)          | TC(3b)|S|   TTL (8b)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Label: 20-bit forwarding label (0–15 = reserved)
TC:    3-bit Traffic Class (QoS); formerly called "EXP", renamed by RFC 5462
S bit: Bottom of Stack, set on the innermost label
TTL:   copied from the IP TTL on ingress, decremented at each LSR hop

/* MPLS packet structure */
[Ethernet hdr][MPLS label 1][MPLS label 2][IP hdr][TCP hdr][Data]
              ↑ outer label  ↑ inner label   (multiple labels = "label stack")

/* Label operations */
PUSH: Ingress LER adds label(s) to the packet
SWAP: Transit LSR replaces the label with a new one (the forwarding operation)
POP:  Egress LER removes the label, exposing the inner packet

/* MPLS forwarding table (LFIB) */
Incoming label | Operation | Outgoing label | Outgoing interface
100            | SWAP      | 200            | eth1
200            | POP       | (none)         | eth2  → IP routing takes over
300            | PUSH 400  | 400            | eth3  → add outer label
```
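As a sketch of the 32-bit layout, the four fields can be packed and unpacked with plain bit operations. The label values and TTLs below are illustrative:

```python
def parse_mpls_label(word: int) -> dict:
    """Decode one 32-bit MPLS label stack entry (RFC 3032 layout)."""
    return {
        "label": (word >> 12) & 0xFFFFF,  # top 20 bits: forwarding label
        "tc":    (word >> 9) & 0x7,       # 3-bit Traffic Class (ex-EXP)
        "s":     (word >> 8) & 0x1,       # Bottom-of-Stack bit
        "ttl":   word & 0xFF,             # 8-bit TTL
    }

def make_mpls_label(label: int, tc: int, s: int, ttl: int) -> int:
    """Pack the four fields back into a 32-bit stack entry."""
    return (label << 12) | (tc << 9) | (s << 8) | ttl

# A two-label stack as in the packet diagram: only the inner label has S=1
outer = make_mpls_label(label=100, tc=0, s=0, ttl=64)   # transport label
inner = make_mpls_label(label=5001, tc=0, s=1, ttl=64)  # VPN label, bottom of stack
assert parse_mpls_label(outer)["s"] == 0
assert parse_mpls_label(inner) == {"label": 5001, "tc": 0, "s": 1, "ttl": 64}
```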
💡 Penultimate Hop Popping (PHP): The second-to-last router in an MPLS path removes the label (POP) before forwarding to the egress router. This allows the egress router to process the packet as pure IP without needing a label lookup. Signalled by the egress router advertising label 3 (Implicit NULL) to its upstream neighbour.
MPLS Traffic Engineering and VPNs
MPLS has two dominant applications in service-provider networks:
MPLS-TE (Traffic Engineering)
RSVP-TE (or, in newer deployments, Segment Routing) signals explicit Label Switched Paths (LSPs) along a pre-computed route — not necessarily the shortest IGP path. (LDP, by contrast, only sets up LSPs that follow the IGP.) MPLS-TE allows bandwidth reservation, fast reroute (sub-50 ms failover), and load distribution across parallel paths.
MPLS L3VPN (BGP/MPLS VPN)
Service providers use MPLS+BGP to provide isolated virtual private networks to customers. Customer routes are carried in BGP with a Route Distinguisher (RD) to separate them. The MPLS label stack (outer=transport, inner=VPN) directs packets to the correct customer VRF at the egress PE router.
GRE — GENERIC ROUTING ENCAPSULATION (RFC 2784)
GRE Header and Operation
GRE (Generic Routing Encapsulation) is the simplest tunnel protocol. It encapsulates any L3 protocol packet inside an IP packet with a small GRE header. GRE itself provides no encryption or authentication — it's just a wrapper. Encryption is typically added by combining GRE with IPsec.
```
/* GRE packet structure */
[Outer IP hdr: src=tunnel_src dst=tunnel_dst proto=47]
[GRE header: 4 bytes minimum]
    C(1b) | Reserved(12b) | Version(3b) | Protocol Type(16b)
    [Optional: Checksum(16b) + Reserved(16b), present when C=1]
    [Optional: Key(32b)]                    (RFC 2890 extension)
    [Optional: Sequence Number(32b)]        (RFC 2890 extension)
[Inner IP packet: src=orig_src dst=orig_dst]
[Original payload]

/* GRE Protocol Type field — what's carried inside */
0x0800 = IPv4 (most common)
0x86DD = IPv6
0x0806 = ARP
0x8847 = MPLS

/* Linux GRE tunnel setup */
# Create the GRE tunnel interface
ip tunnel add gre1 mode gre local 203.0.113.1 remote 198.51.100.1 ttl 255
ip link set gre1 up
ip addr add 10.100.0.1/30 dev gre1

# Route traffic through the tunnel
ip route add 192.168.2.0/24 via 10.100.0.2 dev gre1

# Verify
ip tunnel show
ping 10.100.0.2        # ping the far tunnel endpoint

/* Failure detection */
# GRE itself has no keepalive — run OSPF/BFD over the tunnel,
# or use GRE keepalives (a Cisco extension that nests a pre-built
# reply packet inside the keepalive)
```
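To make the minimal header concrete, here is a small sketch that builds an RFC 2784 GRE header (C bit, 12 reserved bits, 3-bit version, 16-bit Protocol Type). The function name is illustrative:

```python
import struct

def gre_header(proto_type: int, checksum_present: bool = False) -> bytes:
    """Build a minimal RFC 2784 GRE header: C bit + reserved + Ver=0,
    then the 16-bit Protocol Type (e.g. 0x0800 for IPv4)."""
    flags_ver = 0x8000 if checksum_present else 0x0000  # C is the top bit
    hdr = struct.pack("!HH", flags_ver, proto_type)
    if checksum_present:
        hdr += struct.pack("!HH", 0, 0)  # checksum (left zero here) + reserved
    return hdr

assert gre_header(0x0800) == b"\x00\x00\x08\x00"            # plain GRE over IPv4
assert len(gre_header(0x8847, checksum_present=True)) == 8  # MPLS-in-GRE + checksum
```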
💡 GRE + IPsec is the classic site-to-site VPN. GRE provides the tunnel (any-protocol encapsulation, routing over the tunnel), and IPsec provides encryption and authentication. Most enterprise VPN gateways still use this combination. Modern alternatives: WireGuard (simpler, faster), IPsec IKEv2 (no GRE needed), OpenVPN.
VxLAN — VIRTUAL EXTENSIBLE LAN (RFC 7348)
Why VxLAN Exists — Scaling L2 Over L3
Traditional VLANs have a fundamental limitation: they cannot cross a Layer 3 boundary. Two VMs in the same VLAN must share an L2 segment — you can't have VLAN 100 span multiple data centre buildings connected by IP routing. With cloud and hyperscale data centres needing millions of isolated tenant networks, the 4094-VLAN limit was a second constraint.
VxLAN solves both problems: it encapsulates entire Ethernet frames (including VLAN tags) inside UDP/IP packets, allowing L2 segments to span any IP network. The VxLAN Network Identifier (VNI) is 24 bits — supporting 16 million isolated networks.
VxLAN Encapsulation and VTEP
```
/* VxLAN packet structure */
[Outer Ethernet: src=VTEP_MAC dst=next-hop_MAC type=0x0800]
[Outer IP: src=VTEP_IP dst=remote_VTEP_IP proto=17 (UDP)]
[Outer UDP: src=ephemeral dst=4789 (IANA VxLAN port)]
[VxLAN header: 8 bytes]
    Flags(8b) | Reserved(24b) | VNI(24b) | Reserved(8b)
    (I flag = 1 when the VNI is valid)
[Inner Ethernet frame: src=VM_MAC dst=dest_VM_MAC type=0x0800]
[Inner IP packet]
[Payload]

Total overhead: 50 bytes → effective MTU 1450 on a standard 1500-byte underlay

/* VNI — VxLAN Network Identifier */
24 bits → 16,777,216 unique overlay networks
Equivalent to a VLAN ID, at vastly larger scale
Each VNI is a separate L2 broadcast domain

/* VTEP — VxLAN Tunnel End Point */
The device that encapsulates/decapsulates VxLAN:
  Ingress (from VM): Ethernet frame → wrap in VxLAN/UDP/IP
  Egress (to VM):    VxLAN/UDP/IP → unwrap → deliver Ethernet frame
VTEPs can be:
  - a hypervisor (Linux bridge/OVS with VxLAN)
  - a hardware switch (ToR switch with VxLAN support)
  - a dedicated gateway appliance

/* Linux VxLAN setup */
# Create the VxLAN tunnel interface
ip link add vxlan100 type vxlan id 100 dstport 4789 \
    local 10.0.0.1 remote 10.0.0.2 dev eth0
ip link set vxlan100 up
ip addr add 192.168.100.1/24 dev vxlan100

# Add a static FDB entry (tell Linux: this MAC is behind remote VTEP 10.0.0.2)
bridge fdb add aa:bb:cc:dd:ee:ff dev vxlan100 dst 10.0.0.2

# Multicast VxLAN (learning mode)
ip link add vxlan100 type vxlan id 100 group 239.1.1.1 dstport 4789 dev eth0
# BUM (Broadcast, Unknown unicast, Multicast) traffic → multicast group
# VTEPs join the group and learn each other's MACs via flooding
```
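The 8-byte VxLAN header can be decoded with two 32-bit reads. This sketch (illustrative function name) checks the I flag and extracts the VNI:

```python
import struct

def parse_vxlan_header(hdr: bytes) -> dict:
    """Decode the 8-byte VxLAN header:
    Flags(8b) + Reserved(24b), then VNI(24b) + Reserved(8b)."""
    word1, word2 = struct.unpack("!II", hdr[:8])
    return {
        "i_flag": bool(word1 & 0x08000000),  # I bit: VNI field is valid
        "vni": word2 >> 8,                   # 24-bit VNI; low byte is reserved
    }

# Header for VNI 100 with the I flag set, as a VTEP would emit it
hdr = struct.pack("!II", 0x08000000, 100 << 8)
assert parse_vxlan_header(hdr) == {"i_flag": True, "vni": 100}
```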
EVPN — BGP Control Plane for VxLAN
Traditional VxLAN floods BUM (Broadcast, Unknown unicast, Multicast) traffic to discover MACs — this doesn't scale. EVPN (Ethernet VPN, RFC 7432) uses BGP as a control plane to distribute MAC-to-IP-to-VTEP mappings, eliminating flooding:
```
/* EVPN route types (the key ones) */
Type 2 (MAC/IP Advertisement):
    "MAC aa:bb:cc:dd:ee:ff, IP 192.168.1.5 is at VTEP 10.0.0.1, VNI 100"
    → VTEPs learn MAC/IP locations via BGP; no flooding needed
Type 3 (Inclusive Multicast):
    "VTEP 10.0.0.1 participates in VNI 100 BUM forwarding"
    → builds an ingress-replication list instead of relying on multicast

/* Symmetric IRB — Integrated Routing and Bridging */
# Layer 3 routing between VNIs without leaving the VxLAN fabric
# Each VTEP acts as a distributed gateway for its local VMs
# No hairpinning through a central gateway router

/* Modern data centre: leaf-spine with VxLAN + EVPN */
Spine switches: pure IP underlay + iBGP route reflectors for EVPN
Leaf switches:  VTEPs + EVPN BGP speakers
VMs/containers: attached to leaf switches, placed in VxLAN VNIs

/* FRR VxLAN + EVPN config */
router bgp 65001
 address-family l2vpn evpn
  neighbor SPINE activate
  advertise-all-vni
```
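What Type 2 routes buy a VTEP can be modelled as a toy lookup table: forwarding consults a BGP-learned (VNI, MAC) → VTEP map, and only misses fall back to BUM handling. This is a teaching sketch, not a real EVPN implementation; all names and addresses are illustrative:

```python
# Toy model: EVPN Type 2 routes pre-populate a (VNI, MAC) -> remote-VTEP
# table, so known destinations are forwarded unicast without flooding.
evpn_table = {}

def learn_type2(vni, mac, vtep_ip):
    """Install a MAC/IP Advertisement (Type 2 route) received over BGP."""
    evpn_table[(vni, mac)] = vtep_ip

def next_hop_vtep(vni, mac):
    """Known MAC -> unicast to that VTEP; None -> fall back to BUM handling."""
    return evpn_table.get((vni, mac))

learn_type2(100, "aa:bb:cc:dd:ee:ff", "10.0.0.1")
assert next_hop_vtep(100, "aa:bb:cc:dd:ee:ff") == "10.0.0.1"
assert next_hop_vtep(200, "aa:bb:cc:dd:ee:ff") is None   # VNIs are isolated
```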
OTHER TUNNEL TYPES — GENEVE, WIREGUARD, 6IN4
Tunnel Protocol Reference
| Protocol | RFC | Transport | Overhead | Use Case |
|---|---|---|---|---|
| GRE | RFC 2784 | IP Proto 47 | 24B | Site-to-site VPN (with IPsec), multi-protocol transport, GRE keepalives |
| IP-in-IP | RFC 2003 | IP Proto 4 | 20B | Simple IPv4-in-IPv4; no options/encryption, minimum overhead |
| 6in4 | RFC 4213 | IP Proto 41 | 20B | IPv6-in-IPv4 tunnels; connect IPv6 islands over IPv4 backbone |
| VxLAN | RFC 7348 | UDP 4789 | 50B | Data centre overlay, VM mobility, L2 over L3, cloud networking |
| GENEVE | RFC 8926 | UDP 6081 | 50B+ | Next-gen overlay (OpenStack, OVN); extensible TLV options in header |
| MPLS | RFC 3032 | Between L2/L3 | 4B/label | Service provider TE, L3VPN, L2VPN, fast-reroute |
| IPsec (tunnel) | RFC 4303 | IP Proto 50/51 | ~50B | Encrypted site-to-site and remote-access VPN; mandatory encryption |
| WireGuard | — | UDP (custom) | ~60B | Modern VPN: simple, fast, strong crypto (ChaCha20/Poly1305/Curve25519) |
| VLAN (802.1Q) | IEEE 802.1Q | Ethernet tag | 4B | L2 network segmentation; not technically a tunnel but a virtual L2 overlay |
| PPPoE | RFC 2516 | Ethernet | 8B | ISP DSL access; encapsulates PPP in Ethernet; reduces MTU to 1492 |
WHEN TO USE WHICH TUNNEL — DECISION GUIDE
Tunnel Selection Decision Guide
/* Which tunnel to use — decision tree */
Need to connect two office networks over internet securely?
→ IPsec IKEv2 (standard, vendor-interoperable)
→ WireGuard (modern, simple, fast — if both ends are Linux/modern)
→ GRE + IPsec (if you need routing protocols over the tunnel)
Need to carry non-IP traffic (e.g., IPX, MPLS) over IP?
→ GRE (supports any EtherType in Protocol Type field)
Need to scale L2 (VMs, containers) across IP data centre fabric?
→ VxLAN (with EVPN for control plane)
→ GENEVE (if you need extensible metadata in the header)
Need traffic engineering and bandwidth reservation in SP network?
→ MPLS-TE with RSVP-TE
Need the absolute minimum overhead (no encryption needed)?
→ IP-in-IP (20 bytes overhead, IPv4 only)
Connecting IPv6 island over IPv4 network?
→ 6in4 (static), 6to4 (automatic), Teredo (through NAT)
Need a simple test or diagnostic tunnel?
→ GRE (easiest to configure on Linux with ip tunnel add)
NGFW CHALLENGES WITH TUNNELED TRAFFIC
The Tunnel Inspection Problem
Tunnels present a fundamental challenge for NGFWs: the firewall sees the outer packet (which may be innocuous — UDP to port 4789, or IP proto 47) but not the inner packet (which may contain malicious traffic). An attacker can use a tunnel to bypass firewall rules by hiding prohibited traffic inside permitted tunnel traffic.
| Tunnel Type | What NGFW Sees Without Inspection | Inspection Approach |
|---|---|---|
| GRE | IP packets destined to tunnel endpoint (Proto 47) | Decapsulate GRE at firewall, inspect inner IP packet against policy, re-encapsulate or forward |
| VxLAN | UDP port 4789 traffic between VTEPs | Decapsulate at hypervisor/switch level before reaching NGFW, or deploy NGFW as a VTEP; EVPN allows policy attachment to VNIs |
| IPsec (encrypted) | Encrypted ESP/AH packets — opaque content | Terminate IPsec at NGFW → inspect decrypted content → re-encrypt. Or use split-tunneling to bypass NGFW for trusted traffic |
| DNS tunnelling | Legitimate-looking UDP 53 traffic | Deep DNS inspection: entropy analysis, label length, query frequency (see M07) |
| HTTPS tunnels | TLS-encrypted traffic on 443 | SSL inspection (see M08) |
| ICMP tunnels | ICMP Echo Request/Reply | Inspect ICMP data field for non-standard content (see M06) |
```
/* GRE decapsulation in an NGFW (VPP-style graph) */
/* Packet arrives: outer IP → GRE → inner IP → TCP → payload */
1. ip4-input:  outer IP validated, routed to the gre-input graph node
2. gre-input:  outer IP and GRE headers stripped
3. Inner packet re-injected into ip4-input
4. ip4-input:  inner IP subject to full policy (ACL, conntrack, DPI)
5. If policy permits: route the inner packet; the NGFW logs both the
   outer addresses (tunnel endpoints) and the inner (actual src/dst)

/* VxLAN inspection flow */
Outer UDP dst=4789 → vxlan-input → strip outer Eth + IP + UDP + VxLAN
Inner Ethernet frame → subject to L2/L3 policy per VNI
  VNI 100 = "tenant network A" → apply tenant A's security policy
  VNI 200 = "tenant network B" → apply tenant B's security policy
```
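The decapsulate-then-inspect flow can be sketched as follows. The policy function and packet bytes are illustrative, and only the optional GRE checksum field is handled:

```python
import struct

def decap_gre(gre_payload: bytes) -> bytes:
    """Strip a minimal RFC 2784 GRE header and return the inner packet.
    Only the optional checksum field (C bit) is handled in this sketch."""
    flags_ver, _proto = struct.unpack("!HH", gre_payload[:4])
    offset = 8 if flags_ver & 0x8000 else 4  # C bit adds checksum + reserved
    return gre_payload[offset:]

def inspect_tunnel(gre_payload: bytes, policy) -> bool:
    """The flow above: decapsulate, then run the *inner* packet through
    the same policy engine the outer packet already passed."""
    inner_ip = decap_gre(gre_payload)
    return policy(inner_ip)

# Toy policy: deny inner packets whose IP protocol byte (offset 9) is TCP (6)
deny_inner_tcp = lambda pkt: pkt[9] != 6

inner = bytearray(20)                     # fake 20-byte inner IPv4 header
inner[9] = 6                              # protocol = TCP
pkt = b"\x00\x00\x08\x00" + bytes(inner)  # GRE(proto=IPv4) + inner IP
assert inspect_tunnel(pkt, deny_inner_tcp) is False  # inner packet blocked
```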
GRE Tunnel Setup and Analysis
Objective: Create a GRE tunnel between two Linux VMs, route traffic through it, and capture the encapsulated packets to understand the header structure.
1. Setup — On VM1 (outer IP 10.0.0.1): sudo ip tunnel add gre1 mode gre local 10.0.0.1 remote 10.0.0.2 ttl 255; sudo ip link set gre1 up; sudo ip addr add 172.16.0.1/30 dev gre1. On VM2 (outer IP 10.0.0.2): the same commands with the IPs reversed. Test: ping 172.16.0.2.
2. Capture — Run sudo tcpdump -i eth0 proto 47 -v while pinging through the tunnel. You should see GRE packets (IP proto 47) with an outer IP src/dst and an inner ICMP payload. Note the double IP header in the capture.
3. MTU test — The effective MTU through GRE is 1476 (1500-20-4), so the largest ICMP payload that fits is 1448 (1476 minus 20B inner IP and 8B ICMP). ping -M do -s 1448 172.16.0.2 should succeed; with -s 1449 (a 1477-byte inner IP packet) you should get "Frag needed" / "Message too long".
4. Routing — Add a route to a remote subnet through the tunnel and verify end-to-end connectivity.
VxLAN Overlay Network
Objective: Create a VxLAN overlay that allows two VMs on different physical hosts (different subnets) to appear as if they're on the same L2 segment.
1. Setup — On Host1 (IP 10.0.0.1): sudo ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.0.1 remote 10.0.0.2 dev eth0; sudo ip link set vxlan100 up; sudo ip addr add 192.168.100.1/24 dev vxlan100. On Host2: the same with the .2 addresses.
2. Capture — Run sudo tcpdump -i eth0 udp port 4789 -v while pinging 192.168.100.2. In Wireshark, expand the packet: outer Ethernet, outer IP, outer UDP, VxLAN header (VNI=100), inner Ethernet, inner ICMP.
M13 MASTERY CHECKLIST
- Know the tunneling concept: overlay over underlay, encapsulate inner packet inside outer packet
- Know 5 use cases for tunneling: carry non-IP over IP, connect private nets over internet, scale L2 over L3, traffic engineering, network virtualisation
- Know tunnel overhead and effective MTU for each: GRE=24B/1476B, VxLAN=50B/1450B, MPLS=4B/label, WireGuard=~60B
- Know MTU is the #1 tunneling operational problem; know solutions: MSS clamping, PMTUD, jumbo frames on underlay
- Know MPLS label format: 20-bit label, 3-bit Exp (QoS), S bit (bottom of stack), 8-bit TTL
- Know 3 MPLS operations: PUSH (ingress adds label), SWAP (transit replaces label), POP (egress removes label)
- Know PHP (Penultimate Hop Popping): second-to-last router pops label so egress does pure IP lookup
- Know MPLS applications: Traffic Engineering (explicit paths), L3VPN (isolated customer routing)
- Know GRE: IP proto 47, 4-byte header, Protocol Type field (0x0800=IPv4), no built-in encryption
- Know GRE+IPsec is the classic site-to-site VPN combination
- Know why VxLAN exists: scale L2 over L3 networks, overcome 4094 VLAN limit (VNI=24 bits, 16M networks)
- Know VxLAN encapsulation: outer Eth+IP+UDP(4789)+VxLAN(8B) + inner Ethernet frame; total overhead=50B
- Know VTEP: device that encapsulates/decapsulates VxLAN; can be hypervisor, hardware switch, or appliance
- Know VNI: 24-bit VxLAN Network Identifier; each VNI is an isolated L2 broadcast domain
- Know EVPN: BGP control plane for VxLAN; distributes MAC/IP/VTEP mappings; eliminates BUM flooding
- Know GENEVE: next-gen overlay (RFC 8926), extensible TLV header, used by OVN and OpenStack
- Know when to use each tunnel: VxLAN for DC overlay, GRE+IPsec for site-to-site VPN, MPLS for SP TE
- Know the NGFW tunnel inspection challenge: outer packet may be permitted while inner packet violates policy
- Know NGFW approaches: GRE decapsulation for inspection, VxLAN per-VNI policy, IPsec termination + inspect + re-encrypt
- Completed Lab 1: created GRE tunnel, captured encapsulated packets, tested MTU limits
- Completed Lab 2: created VxLAN overlay, verified L2 connectivity across L3 network, tested VNI isolation
🎉 Phase 3 Complete — Routing and Forwarding
You have completed all 4 modules of Phase 3: Routing and FIB (M10), OSPF (M11), BGP (M12), and Tunneling (M13). You can now design, analyse, and implement the routing infrastructure an enterprise or service-provider network requires. Move to Phase 4 — Linux Networking and Socket Programming, starting with M14 - Linux Network Stack.