WHY TUNNELING EXISTS — OVERLAY OVER UNDERLAY
The Tunneling Concept
Tunneling encapsulates one network protocol inside another — creating a virtual link between two endpoints that may be separated by many intermediate hops that don't need to understand the inner protocol. The underlay is the physical/IP network; the overlay is the virtual network running on top.
Core use cases for tunneling:
- Carry non-IP traffic over IP — legacy protocols (IPX, SNA) encapsulated in IP/GRE for transport over modern IP networks
- Connect private networks over public internet — VPN tunnels (GRE+IPsec, WireGuard) connect branch offices over the internet as if they were directly connected
- Scale L2 over L3 — VxLAN extends Layer 2 Ethernet broadcast domains across Layer 3 IP networks — essential for data centre multi-tenancy and VM migration
- Traffic engineering — MPLS labels allow routers to forward packets along pre-computed explicit paths, bypassing normal IP routing
- Network virtualisation — SDN overlays (OVN, NSX, ACI) use tunnels to implement virtual networks with arbitrary topology on top of physical hardware
Encapsulation Overhead Comparison
| Tunnel Type | Added Headers | Total Overhead | Effective MTU (from 1500) |
|---|---|---|---|
| GRE (basic) | IP(20) + GRE(4) | 24 bytes | 1476 bytes |
| GRE + IPsec (ESP) | IP(20) + GRE(4) + ESP(~50) | ~74 bytes | ~1426 bytes |
| VxLAN | Eth(14) + IP(20) + UDP(8) + VxLAN(8) | 50 bytes | 1450 bytes |
| MPLS (1 label) | MPLS label(4) | 4 bytes per label | 1496 bytes |
| MPLS (2 labels) | MPLS label(8) | 8 bytes | 1492 bytes |
| WireGuard | IP(20) + UDP(8) + WireGuard(~32) | ~60 bytes | ~1440 bytes |
| IPsec (ESP transport) | ESP(~40) | ~40 bytes | ~1460 bytes |
⚠️ MTU fragmentation is the #1 tunneling operational problem. When the effective MTU is reduced by tunnel overhead, packets that filled the original MTU now exceed the tunnel's MTU. If DF=1 is set (common with TCP), they get dropped. Solutions: MSS clamping (TCP only), Path MTU Discovery, configuring tunnel endpoints with reduced MTU, jumbo frames on the underlay.
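The MSS-clamping arithmetic can be sketched in a few lines. The overhead values mirror the table above; the function name and dictionary are illustrative, not a real API:

```python
# Sketch of MSS clamping arithmetic. Overhead values are from the
# overhead table; names here are illustrative.
TUNNEL_OVERHEAD = {
    "gre": 24,        # outer IP(20) + GRE(4)
    "vxlan": 50,      # outer Eth(14) + IP(20) + UDP(8) + VxLAN(8)
    "wireguard": 60,  # outer IP(20) + UDP(8) + WireGuard(~32)
}

def clamped_mss(underlay_mtu: int, tunnel: str) -> int:
    """TCP MSS that fits: effective MTU minus inner IP(20) and TCP(20)."""
    effective_mtu = underlay_mtu - TUNNEL_OVERHEAD[tunnel]
    return effective_mtu - 20 - 20

print(clamped_mss(1500, "gre"))    # 1436
print(clamped_mss(1500, "vxlan"))  # 1410
```

On Linux the same effect is typically achieved with the iptables TCPMSS target (`--clamp-mss-to-pmtu` or `--set-mss`) on forwarded SYN packets.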
MPLS — MULTIPROTOCOL LABEL SWITCHING
MPLS Architecture and Label Forwarding
MPLS (RFC 3032) inserts a 32-bit label between the Layer 2 header and the IP header — hence the nickname "Layer 2.5". Labels allow routers to forward packets with a fixed-length exact-match lookup (O(1)) rather than a longest-prefix-match IP lookup, and enable traffic engineering by pre-computing explicit paths through the network.
```
/* MPLS label format (32 bits) */
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Label (20 bits)          | TC(3b)|S|   TTL (8b)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Label: 20-bit forwarding label (0–15 = reserved)
TC:    3-bit Traffic Class (QoS); formerly called "EXP", renamed by RFC 5462
S bit: Bottom of Stack, set on the innermost label
TTL:   copied from the IP TTL on ingress, decremented at each LSR hop

/* MPLS packet structure */
[Ethernet hdr][MPLS label 1][MPLS label 2][IP hdr][TCP hdr][Data]
              ↑ outer label  ↑ inner label   (multiple labels = "label stack")

/* Label operations */
PUSH: Ingress LER adds label(s) to the packet
SWAP: Transit LSR replaces the label with a new one (the forwarding operation)
POP:  Egress LER removes the label, exposing the inner packet

/* MPLS forwarding table (LFIB) */
Incoming label | Operation | Outgoing label | Outgoing interface
100            | SWAP      | 200            | eth1
200            | POP       | (none)         | eth2  → IP routing takes over
300            | PUSH 400  | 400            | eth3  → add outer label
```
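As a sketch of the 32-bit layout, the four fields can be packed and unpacked with plain bit operations. The label values and TTLs below are illustrative:

```python
def parse_mpls_label(word: int) -> dict:
    """Decode one 32-bit MPLS label stack entry (RFC 3032 layout)."""
    return {
        "label": (word >> 12) & 0xFFFFF,  # top 20 bits: forwarding label
        "tc":    (word >> 9) & 0x7,       # 3-bit Traffic Class (ex-EXP)
        "s":     (word >> 8) & 0x1,       # Bottom-of-Stack bit
        "ttl":   word & 0xFF,             # 8-bit TTL
    }

def make_mpls_label(label: int, tc: int, s: int, ttl: int) -> int:
    """Pack the four fields back into a 32-bit stack entry."""
    return (label << 12) | (tc << 9) | (s << 8) | ttl

# A two-label stack as in the packet diagram: only the inner label has S=1
outer = make_mpls_label(label=100, tc=0, s=0, ttl=64)   # transport label
inner = make_mpls_label(label=5001, tc=0, s=1, ttl=64)  # VPN label, bottom of stack
assert parse_mpls_label(outer)["s"] == 0
assert parse_mpls_label(inner) == {"label": 5001, "tc": 0, "s": 1, "ttl": 64}
```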
💡 Penultimate Hop Popping (PHP): The second-to-last router in an MPLS path removes the label (POP) before forwarding to the egress router. This allows the egress router to process the packet as pure IP without needing a label lookup. Signalled by the egress router advertising label 3 (Implicit NULL) to its upstream neighbour.
MPLS Traffic Engineering and VPNs
MPLS has two dominant applications in service-provider networks:
MPLS-TE (Traffic Engineering)
RSVP-TE (or, in newer deployments, Segment Routing) signals explicit Label Switched Paths (LSPs) along a pre-computed route — not necessarily the shortest IGP path. (LDP, by contrast, only sets up LSPs that follow the IGP.) MPLS-TE allows bandwidth reservation, fast reroute (sub-50 ms failover), and load distribution across parallel paths.
MPLS L3VPN (BGP/MPLS VPN)
Service providers use MPLS+BGP to provide isolated virtual private networks to customers. Customer routes are carried in BGP with a Route Distinguisher (RD) to separate them. The MPLS label stack (outer=transport, inner=VPN) directs packets to the correct customer VRF at the egress PE router.
GRE — GENERIC ROUTING ENCAPSULATION (RFC 2784)
GRE Header and Operation
GRE (Generic Routing Encapsulation) is the simplest tunnel protocol. It encapsulates any L3 protocol packet inside an IP packet with a small GRE header. GRE itself provides no encryption or authentication — it's just a wrapper. Encryption is typically added by combining GRE with IPsec.
```
/* GRE packet structure */
[Outer IP hdr: src=tunnel_src dst=tunnel_dst proto=47]
[GRE header: 4 bytes minimum]
    C(1b) | Reserved(12b) | Version(3b) | Protocol Type(16b)
    [Optional: Checksum(16b) + Reserved(16b), present when C=1]
    [Optional: Key(32b)]                    (RFC 2890 extension)
    [Optional: Sequence Number(32b)]        (RFC 2890 extension)
[Inner IP packet: src=orig_src dst=orig_dst]
[Original payload]

/* GRE Protocol Type field — what's carried inside */
0x0800 = IPv4 (most common)
0x86DD = IPv6
0x0806 = ARP
0x8847 = MPLS

/* Linux GRE tunnel setup */
# Create the GRE tunnel interface
ip tunnel add gre1 mode gre local 203.0.113.1 remote 198.51.100.1 ttl 255
ip link set gre1 up
ip addr add 10.100.0.1/30 dev gre1

# Route traffic through the tunnel
ip route add 192.168.2.0/24 via 10.100.0.2 dev gre1

# Verify
ip tunnel show
ping 10.100.0.2        # ping the far tunnel endpoint

/* Failure detection */
# GRE itself has no keepalive — run OSPF/BFD over the tunnel,
# or use GRE keepalives (a Cisco extension that nests a pre-built
# reply packet inside the keepalive)
```
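To make the minimal header concrete, here is a small sketch that builds an RFC 2784 GRE header (C bit, 12 reserved bits, 3-bit version, 16-bit Protocol Type). The function name is illustrative:

```python
import struct

def gre_header(proto_type: int, checksum_present: bool = False) -> bytes:
    """Build a minimal RFC 2784 GRE header: C bit + reserved + Ver=0,
    then the 16-bit Protocol Type (e.g. 0x0800 for IPv4)."""
    flags_ver = 0x8000 if checksum_present else 0x0000  # C is the top bit
    hdr = struct.pack("!HH", flags_ver, proto_type)
    if checksum_present:
        hdr += struct.pack("!HH", 0, 0)  # checksum (left zero here) + reserved
    return hdr

assert gre_header(0x0800) == b"\x00\x00\x08\x00"            # plain GRE over IPv4
assert len(gre_header(0x8847, checksum_present=True)) == 8  # MPLS-in-GRE + checksum
```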
💡 GRE + IPsec is the classic site-to-site VPN. GRE provides the tunnel (any-protocol encapsulation, routing over the tunnel), and IPsec provides encryption and authentication. Most enterprise VPN gateways still use this combination. Modern alternatives: WireGuard (simpler, faster), IPsec IKEv2 (no GRE needed), OpenVPN.
VxLAN — VIRTUAL EXTENSIBLE LAN (RFC 7348)
Why VxLAN Exists — Scaling L2 Over L3
Traditional VLANs have a fundamental limitation: they cannot cross a Layer 3 boundary. Two VMs in the same VLAN must share an L2 segment — you can't have VLAN 100 span multiple data centre buildings connected by IP routing. With cloud and hyperscale data centres needing millions of isolated tenant networks, the 4094-VLAN limit was a second constraint.
VxLAN solves both problems: it encapsulates entire Ethernet frames (including VLAN tags) inside UDP/IP packets, allowing L2 segments to span any IP network. The VxLAN Network Identifier (VNI) is 24 bits — supporting 16 million isolated networks.
VxLAN Encapsulation and VTEP
```
/* VxLAN packet structure */
[Outer Ethernet: src=VTEP_MAC dst=next-hop_MAC type=0x0800]
[Outer IP: src=VTEP_IP dst=remote_VTEP_IP proto=17 (UDP)]
[Outer UDP: src=ephemeral dst=4789 (IANA VxLAN port)]
[VxLAN header: 8 bytes]
    Flags(8b) | Reserved(24b) | VNI(24b) | Reserved(8b)
    (I flag = 1 when the VNI is valid)
[Inner Ethernet frame: src=VM_MAC dst=dest_VM_MAC type=0x0800]
[Inner IP packet]
[Payload]

Total overhead: 50 bytes → effective MTU 1450 on a standard 1500-byte underlay

/* VNI — VxLAN Network Identifier */
24 bits → 16,777,216 unique overlay networks
Equivalent to a VLAN ID, at vastly larger scale
Each VNI is a separate L2 broadcast domain

/* VTEP — VxLAN Tunnel End Point */
The device that encapsulates/decapsulates VxLAN:
  Ingress (from VM): Ethernet frame → wrap in VxLAN/UDP/IP
  Egress (to VM):    VxLAN/UDP/IP → unwrap → deliver Ethernet frame
VTEPs can be:
  - a hypervisor (Linux bridge/OVS with VxLAN)
  - a hardware switch (ToR switch with VxLAN support)
  - a dedicated gateway appliance

/* Linux VxLAN setup */
# Create the VxLAN tunnel interface
ip link add vxlan100 type vxlan id 100 dstport 4789 \
    local 10.0.0.1 remote 10.0.0.2 dev eth0
ip link set vxlan100 up
ip addr add 192.168.100.1/24 dev vxlan100

# Add a static FDB entry (tell Linux: this MAC is behind remote VTEP 10.0.0.2)
bridge fdb add aa:bb:cc:dd:ee:ff dev vxlan100 dst 10.0.0.2

# Multicast VxLAN (learning mode)
ip link add vxlan100 type vxlan id 100 group 239.1.1.1 dstport 4789 dev eth0
# BUM (Broadcast, Unknown unicast, Multicast) traffic → multicast group
# VTEPs join the group and learn each other's MACs via flooding
```
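The 8-byte VxLAN header can be decoded with two 32-bit reads. This sketch (illustrative function name) checks the I flag and extracts the VNI:

```python
import struct

def parse_vxlan_header(hdr: bytes) -> dict:
    """Decode the 8-byte VxLAN header:
    Flags(8b) + Reserved(24b), then VNI(24b) + Reserved(8b)."""
    word1, word2 = struct.unpack("!II", hdr[:8])
    return {
        "i_flag": bool(word1 & 0x08000000),  # I bit: VNI field is valid
        "vni": word2 >> 8,                   # 24-bit VNI; low byte is reserved
    }

# Header for VNI 100 with the I flag set, as a VTEP would emit it
hdr = struct.pack("!II", 0x08000000, 100 << 8)
assert parse_vxlan_header(hdr) == {"i_flag": True, "vni": 100}
```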
EVPN — BGP Control Plane for VxLAN
Traditional VxLAN floods BUM (Broadcast, Unknown unicast, Multicast) traffic to discover MACs — this doesn't scale. EVPN (Ethernet VPN, RFC 7432) uses BGP as a control plane to distribute MAC-to-IP-to-VTEP mappings, eliminating flooding:
```
/* EVPN route types (the key ones) */
Type 2 (MAC/IP Advertisement):
    "MAC aa:bb:cc:dd:ee:ff, IP 192.168.1.5 is at VTEP 10.0.0.1, VNI 100"
    → VTEPs learn MAC/IP locations via BGP; no flooding needed
Type 3 (Inclusive Multicast):
    "VTEP 10.0.0.1 participates in VNI 100 BUM forwarding"
    → builds an ingress-replication list instead of relying on multicast

/* Symmetric IRB — Integrated Routing and Bridging */
# Layer 3 routing between VNIs without leaving the VxLAN fabric
# Each VTEP acts as a distributed gateway for its local VMs
# No hairpinning through a central gateway router

/* Modern data centre: leaf-spine with VxLAN + EVPN */
Spine switches: pure IP underlay + iBGP route reflectors for EVPN
Leaf switches:  VTEPs + EVPN BGP speakers
VMs/containers: attached to leaf switches, placed in VxLAN VNIs

/* FRR VxLAN + EVPN config */
router bgp 65001
 address-family l2vpn evpn
  neighbor SPINE activate
  advertise-all-vni
```
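What Type 2 routes buy a VTEP can be modelled as a toy lookup table: forwarding consults a BGP-learned (VNI, MAC) → VTEP map, and only misses fall back to BUM handling. This is a teaching sketch, not a real EVPN implementation; all names and addresses are illustrative:

```python
# Toy model: EVPN Type 2 routes pre-populate a (VNI, MAC) -> remote-VTEP
# table, so known destinations are forwarded unicast without flooding.
evpn_table = {}

def learn_type2(vni, mac, vtep_ip):
    """Install a MAC/IP Advertisement (Type 2 route) received over BGP."""
    evpn_table[(vni, mac)] = vtep_ip

def next_hop_vtep(vni, mac):
    """Known MAC -> unicast to that VTEP; None -> fall back to BUM handling."""
    return evpn_table.get((vni, mac))

learn_type2(100, "aa:bb:cc:dd:ee:ff", "10.0.0.1")
assert next_hop_vtep(100, "aa:bb:cc:dd:ee:ff") == "10.0.0.1"
assert next_hop_vtep(200, "aa:bb:cc:dd:ee:ff") is None   # VNIs are isolated
```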
OTHER TUNNEL TYPES — GENEVE, WIREGUARD, 6IN4
Tunnel Protocol Reference
| Protocol | RFC | Transport | Overhead | Use Case |
|---|---|---|---|---|
| GRE | RFC 2784 | IP Proto 47 | 24B | Site-to-site VPN (with IPsec), multi-protocol transport, GRE keepalives |
| IP-in-IP | RFC 2003 | IP Proto 4 | 20B | Simple IPv4-in-IPv4; no options/encryption, minimum overhead |
| 6in4 | RFC 4213 | IP Proto 41 | 20B | IPv6-in-IPv4 tunnels; connect IPv6 islands over IPv4 backbone |
| VxLAN | RFC 7348 | UDP 4789 | 50B | Data centre overlay, VM mobility, L2 over L3, cloud networking |
| GENEVE | RFC 8926 | UDP 6081 | 50B+ | Next-gen overlay (OpenStack, OVN); extensible TLV options in header |
| MPLS | RFC 3032 | Between L2/L3 | 4B/label | Service provider TE, L3VPN, L2VPN, fast-reroute |
| IPsec (tunnel) | RFC 4303 | IP Proto 50/51 | ~50B | Encrypted site-to-site and remote-access VPN; mandatory encryption |
| WireGuard | — | UDP (custom) | ~60B | Modern VPN: simple, fast, strong crypto (ChaCha20/Poly1305/Curve25519) |
| VLAN (802.1Q) | IEEE 802.1Q | Ethernet tag | 4B | L2 network segmentation; not technically a tunnel but a virtual L2 overlay |
| PPPoE | RFC 2516 | Ethernet | 8B | ISP DSL access; encapsulates PPP in Ethernet; reduces MTU to 1492 |
WHEN TO USE WHICH TUNNEL — DECISION GUIDE
Tunnel Selection Decision Guide
/* Which tunnel to use — decision tree */
Need to connect two office networks over internet securely?
→ IPsec IKEv2 (standard, vendor-interoperable)
→ WireGuard (modern, simple, fast — if both ends are Linux/modern)
→ GRE + IPsec (if you need routing protocols over the tunnel)
Need to carry non-IP traffic (e.g., IPX, MPLS) over IP?
→ GRE (supports any EtherType in Protocol Type field)
Need to scale L2 (VMs, containers) across IP data centre fabric?
→ VxLAN (with EVPN for control plane)
→ GENEVE (if you need extensible metadata in the header)
Need traffic engineering and bandwidth reservation in SP network?
→ MPLS-TE with RSVP-TE
Need the absolute minimum overhead (no encryption needed)?
→ IP-in-IP (20 bytes overhead, IPv4 only)
Connecting IPv6 island over IPv4 network?
→ 6in4 (static), 6to4 (automatic), Teredo (through NAT)
Need a simple test or diagnostic tunnel?
→ GRE (easiest to configure on Linux with ip tunnel add)
NGFW CHALLENGES WITH TUNNELED TRAFFIC
The Tunnel Inspection Problem
Tunnels present a fundamental challenge for NGFWs: the firewall sees the outer packet (which may be innocuous — UDP to port 4789, or IP proto 47) but not the inner packet (which may contain malicious traffic). An attacker can use a tunnel to bypass firewall rules by hiding prohibited traffic inside permitted tunnel traffic.
| Tunnel Type | What NGFW Sees Without Inspection | Inspection Approach |
|---|---|---|
| GRE | IP packets destined to tunnel endpoint (Proto 47) | Decapsulate GRE at firewall, inspect inner IP packet against policy, re-encapsulate or forward |
| VxLAN | UDP port 4789 traffic between VTEPs | Decapsulate at hypervisor/switch level before reaching NGFW, or deploy NGFW as a VTEP; EVPN allows policy attachment to VNIs |
| IPsec (encrypted) | Encrypted ESP/AH packets — opaque content | Terminate IPsec at NGFW → inspect decrypted content → re-encrypt. Or use split-tunneling to bypass NGFW for trusted traffic |
| DNS tunnelling | Legitimate-looking UDP 53 traffic | Deep DNS inspection: entropy analysis, label length, query frequency (see M07) |
| HTTPS tunnels | TLS-encrypted traffic on 443 | SSL inspection (see M08) |
| ICMP tunnels | ICMP Echo Request/Reply | Inspect ICMP data field for non-standard content (see M06) |
```
/* GRE decapsulation in an NGFW (VPP-style graph) */
/* Packet arrives: outer IP → GRE → inner IP → TCP → payload */
1. ip4-input:  outer IP validated, routed to the gre-input graph node
2. gre-input:  outer IP and GRE headers stripped
3. Inner packet re-injected into ip4-input
4. ip4-input:  inner IP subject to full policy (ACL, conntrack, DPI)
5. If policy permits: route the inner packet; the NGFW logs both the
   outer addresses (tunnel endpoints) and the inner (actual src/dst)

/* VxLAN inspection flow */
Outer UDP dst=4789 → vxlan-input → strip outer Eth + IP + UDP + VxLAN
Inner Ethernet frame → subject to L2/L3 policy per VNI
  VNI 100 = "tenant network A" → apply tenant A's security policy
  VNI 200 = "tenant network B" → apply tenant B's security policy
```
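The decapsulate-then-inspect flow can be sketched as follows. The policy function and packet bytes are illustrative, and only the optional GRE checksum field is handled:

```python
import struct

def decap_gre(gre_payload: bytes) -> bytes:
    """Strip a minimal RFC 2784 GRE header and return the inner packet.
    Only the optional checksum field (C bit) is handled in this sketch."""
    flags_ver, _proto = struct.unpack("!HH", gre_payload[:4])
    offset = 8 if flags_ver & 0x8000 else 4  # C bit adds checksum + reserved
    return gre_payload[offset:]

def inspect_tunnel(gre_payload: bytes, policy) -> bool:
    """The flow above: decapsulate, then run the *inner* packet through
    the same policy engine the outer packet already passed."""
    inner_ip = decap_gre(gre_payload)
    return policy(inner_ip)

# Toy policy: deny inner packets whose IP protocol byte (offset 9) is TCP (6)
deny_inner_tcp = lambda pkt: pkt[9] != 6

inner = bytearray(20)                     # fake 20-byte inner IPv4 header
inner[9] = 6                              # protocol = TCP
pkt = b"\x00\x00\x08\x00" + bytes(inner)  # GRE(proto=IPv4) + inner IP
assert inspect_tunnel(pkt, deny_inner_tcp) is False  # inner packet blocked
```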
GRE Tunnel Setup and Analysis
Objective: Create a GRE tunnel between two Linux VMs, route traffic through it, and capture the encapsulated packets to understand the header structure.
1. Setup — On VM1 (outer IP 10.0.0.1): sudo ip tunnel add gre1 mode gre local 10.0.0.1 remote 10.0.0.2 ttl 255; sudo ip link set gre1 up; sudo ip addr add 172.16.0.1/30 dev gre1. On VM2 (outer IP 10.0.0.2): the same commands with the IPs reversed. Test: ping 172.16.0.2.
2. Capture — Run sudo tcpdump -i eth0 proto 47 -v while pinging through the tunnel. You should see GRE packets (IP proto 47) with an outer IP src/dst and an inner ICMP payload. Note the double IP header in the capture.
3. MTU test — The effective MTU through GRE is 1476 (1500-20-4), so the largest ICMP payload that fits is 1448 (1476 minus 20B inner IP and 8B ICMP). ping -M do -s 1448 172.16.0.2 should succeed; with -s 1449 (a 1477-byte inner IP packet) you should get "Frag needed" / "Message too long".
4. Routing — Add a route to a remote subnet through the tunnel and verify end-to-end connectivity.
VxLAN Overlay Network
Objective: Create a VxLAN overlay that allows two VMs on different physical hosts (different subnets) to appear as if they're on the same L2 segment.
1. Setup — On Host1 (IP 10.0.0.1): sudo ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.0.1 remote 10.0.0.2 dev eth0; sudo ip link set vxlan100 up; sudo ip addr add 192.168.100.1/24 dev vxlan100. On Host2: the same with the .2 addresses.
2. Capture — Run sudo tcpdump -i eth0 udp port 4789 -v while pinging 192.168.100.2. In Wireshark, expand the packet: outer Ethernet, outer IP, outer UDP, VxLAN header (VNI=100), inner Ethernet, inner ICMP.
M13 MASTERY CHECKLIST
- Know the tunneling concept: overlay over underlay, encapsulate inner packet inside outer packet
- Know 5 use cases for tunneling: carry non-IP over IP, connect private nets over internet, scale L2 over L3, traffic engineering, network virtualisation
- Know tunnel overhead and effective MTU for each: GRE=24B/1476B, VxLAN=50B/1450B, MPLS=4B/label, WireGuard=~60B
- Know MTU is the #1 tunneling operational problem; know solutions: MSS clamping, PMTUD, jumbo frames on underlay
- Know MPLS label format: 20-bit label, 3-bit Exp (QoS), S bit (bottom of stack), 8-bit TTL
- Know 3 MPLS operations: PUSH (ingress adds label), SWAP (transit replaces label), POP (egress removes label)
- Know PHP (Penultimate Hop Popping): second-to-last router pops label so egress does pure IP lookup
- Know MPLS applications: Traffic Engineering (explicit paths), L3VPN (isolated customer routing)
- Know GRE: IP proto 47, 4-byte header, Protocol Type field (0x0800=IPv4), no built-in encryption
- Know GRE+IPsec is the classic site-to-site VPN combination
- Know why VxLAN exists: scale L2 over L3 networks, overcome 4094 VLAN limit (VNI=24 bits, 16M networks)
- Know VxLAN encapsulation: outer Eth+IP+UDP(4789)+VxLAN(8B) + inner Ethernet frame; total overhead=50B
- Know VTEP: device that encapsulates/decapsulates VxLAN; can be hypervisor, hardware switch, or appliance
- Know VNI: 24-bit VxLAN Network Identifier; each VNI is an isolated L2 broadcast domain
- Know EVPN: BGP control plane for VxLAN; distributes MAC/IP/VTEP mappings; eliminates BUM flooding
- Know GENEVE: next-gen overlay (RFC 8926), extensible TLV header, used by OVN and OpenStack
- Know when to use each tunnel: VxLAN for DC overlay, GRE+IPsec for site-to-site VPN, MPLS for SP TE
- Know the NGFW tunnel inspection challenge: outer packet may be permitted while inner packet violates policy
- Know NGFW approaches: GRE decapsulation for inspection, VxLAN per-VNI policy, IPsec termination + inspect + re-encrypt
- Completed Lab 1: created GRE tunnel, captured encapsulated packets, tested MTU limits
- Completed Lab 2: created VxLAN overlay, verified L2 connectivity across L3 network, tested VNI isolation
🎉 Phase 3 Complete — Routing and Forwarding
You have completed all 4 modules of Phase 3: Routing and FIB (M10), OSPF (M11), BGP (M12), and Tunneling (M13). You can now design, analyse, and implement the routing infrastructure an enterprise or service-provider network requires. Move to Phase 4 — Linux Networking and Socket Programming, starting with M14 - Linux Network Stack.