HOW ROUTERS DECIDE WHERE PACKETS GO
The Routing Problem
OVERVIEW

Routing is the process of selecting a path for network traffic. When a packet arrives at a router, the router must answer one question in microseconds: which interface should I send this packet out of? The answer depends on the packet's destination IP address and the router's routing table.
Three types of routes populate a routing table:
- Connected routes — automatically created when an interface is brought up and assigned an IP. The router knows it can reach that subnet directly. These are the most trusted routes.
- Static routes — manually configured by an administrator. Simple, predictable, but don't adapt to topology changes.
- Dynamic routes — learned from other routers via routing protocols (OSPF, BGP, RIP). Adapt automatically when links fail or topology changes.
A router's routing table is like a GPS navigation system's map. When you enter a destination, the GPS looks at all known roads, finds the best route, and gives you turn-by-turn directions. A routing table contains all known network destinations (like roads on a map) with their next hops (like turns). The GPS's "fastest route" is like a router's "lowest metric". Just as GPS updates its map when roads are closed, dynamic routing protocols update the routing table when links fail. And just as GPS uses the most specific address you enter — a full street address wins over just a city name — routers use the most specific prefix that matches.
The Five-Step Packet Forwarding Process
FORWARDING

```
/* When a router receives a packet */

Step 1: Receive packet on ingress interface
  - NIC copies frame from wire to memory
  - Strip Ethernet header, verify CRC

Step 2: Verify IP header
  - Check IP version (4 or 6)
  - Validate IP header checksum (drop if corrupt)
  - Check TTL: if TTL <= 1 → discard + send ICMP Time Exceeded
  - Decrement TTL

Step 3: FIB lookup (Longest Prefix Match)
  - Look up destination IP in Forwarding Information Base
  - Find most specific matching prefix
  - Get: egress interface + next-hop IP

Step 4: Resolve next-hop MAC (ARP / Neighbour Cache)
  - Look up next-hop IP in ARP cache
  - If miss: send ARP request, queue packet

Step 5: Rewrite L2 header + transmit
  - New Ethernet header: dst=next-hop MAC, src=egress port MAC
  - Recompute IP checksum (TTL changed)
  - Transmit on egress interface

/* What the router does NOT change */
# Source IP, Destination IP — unchanged (except NAT)
# Payload (TCP/UDP/application data) — unchanged
# Only TTL (decremented) and checksum (recomputed) change in IP header
```
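The checksum recompute in Step 5 need not rescan the whole header: since only the TTL byte changed, the checksum can be patched incrementally (the idea behind RFC 1624). A minimal sketch in C — byte offsets assume a standard 20-byte IPv4 header, and `fwd_demo` is just an illustrative self-check, not part of any real stack:

```c
#include <stdint.h>
#include <string.h>

/* Full IPv4 header checksum: ones'-complement sum of 16-bit words
 * (computed with the checksum field zeroed). */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement TTL (byte 8) and patch the checksum (bytes 10-11) in
 * place: the TTL/protocol word dropped by 0x0100, so the checksum
 * rises by 0x0100 with end-around carry. */
static void decrement_ttl(uint8_t *hdr)
{
    uint32_t check = (uint32_t)((hdr[10] << 8) | hdr[11]) + 0x0100;
    check = (check & 0xFFFF) + (check >> 16);   /* fold carry */
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)check;
    hdr[8]--;                                   /* TTL at offset 8 */
}

/* Self-check: incremental patch must equal a full recompute. */
int fwd_demo(void)
{
    uint8_t h[20] = { 0x45, 0x00, 0x00, 0x54, 0x1C, 0x46, 0x40, 0x00,
                      0x40, 0x06, 0x00, 0x00, 0xC0, 0xA8, 0x00, 0x01,
                      0xC0, 0xA8, 0x00, 0xC7 };
    uint16_t c = ip_checksum(h, 20);
    h[10] = (uint8_t)(c >> 8);
    h[11] = (uint8_t)c;

    decrement_ttl(h);                 /* TTL 0x40 -> 0x3F */

    uint8_t copy[20];
    memcpy(copy, h, 20);
    copy[10] = copy[11] = 0;          /* zero field for recompute */
    uint16_t full = ip_checksum(copy, 20);
    return h[8] == 0x3F &&
           (uint16_t)((h[10] << 8) | h[11]) == full;
}
```

High-speed data planes use the incremental form because touching two bytes is far cheaper than re-summing ten header words per packet.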
RIB AND FIB — THE TWO ROUTING DATABASES
RIB vs FIB — Different Purposes, Different Structures
ARCHITECTURE

Modern routing systems maintain two separate databases for routes. Understanding the distinction is critical for data-plane engineering:
RIB — Routing Information Base
The control plane database. Contains all routes learned from all sources — connected, static, OSPF, BGP, RIP, etc. — including multiple competing routes to the same destination from different protocols or next-hops. The RIB is the "complete knowledge" database. It is managed by the routing daemon (Quagga, FRR, VPP's routing engine).
The RIB is not used for packet forwarding — it's too large and complex for per-packet lookups at line rate.
FIB — Forwarding Information Base
The data plane database. Contains only the best route per destination prefix — selected from the RIB by the routing daemon after applying administrative distance and metric comparisons. The FIB is what the forwarding engine (ASIC, NP, or CPU) uses for every packet lookup.
The FIB is optimised for speed: it may be stored in TCAM (hardware), a radix trie, or a hash table. In VPP, the FIB is a multi-level lookup structure in hugepage memory. In Linux, it's the kernel routing table.
```
/* RIB → FIB population process */

RIB contains (for destination 10.0.0.0/8):
  OSPF:   10.0.0.0/8 via 192.168.1.2  metric=20   AD=110
  Static: 10.0.0.0/8 via 192.168.1.3  metric=0    AD=1
  BGP:    10.0.0.0/8 via 10.255.0.1   metric=100  AD=200

Route selection:
  1. Prefer lower Administrative Distance:
     Static AD=1 wins over OSPF (110) and BGP (200)
  2. Among equal-AD routes: prefer lower metric

Best route: Static 10.0.0.0/8 via 192.168.1.3
FIB entry:  10.0.0.0/8 → egress=eth1 nexthop=192.168.1.3 mac=aa:bb:cc:dd:ee:ff

/* Linux commands */
ip route show table main   # FIB (main routing table)
ip route show table all    # all routing tables
ip route get 8.8.8.8       # which route would be used for 8.8.8.8?

/* VPP FIB inspection */
# vppctl: show ip fib
# vppctl: show ip fib 10.0.0.0/8
# vppctl: show ip fib summary
```
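The selection step above is a simple two-key comparison and can be sketched in a few lines of C. The route values mirror the example; `select_best` and `rib_demo` are illustrative names, not any real daemon's API:

```c
#include <stdint.h>

typedef struct {
    const char *proto;
    uint8_t     ad;       /* administrative distance: lower wins */
    uint32_t    metric;   /* tie-break among equal-AD routes     */
    const char *nexthop;
} rib_route_t;

/* Pick the RIB route to install in the FIB: lowest AD first,
 * lowest metric among equal-AD candidates. */
int select_best(const rib_route_t *r, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (r[i].ad < r[best].ad ||
            (r[i].ad == r[best].ad && r[i].metric < r[best].metric))
            best = i;
    return best;
}

/* The example above: static (AD=1) should beat OSPF and BGP.
 * Returns the AD of the winning route. */
int rib_demo(void)
{
    rib_route_t rib[] = {
        { "ospf",   110,  20, "192.168.1.2" },
        { "static",   1,   0, "192.168.1.3" },
        { "bgp",    200, 100, "10.255.0.1"  },
    };
    return rib[select_best(rib, 3)].ad;
}
```

Running the demo picks the static route, matching the FIB entry shown in the block above.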
Hardware FIB — TCAM
HARDWARE

High-end routers and switches implement the FIB in TCAM (Ternary Content-Addressable Memory) — a specialised memory that can match a search key against all stored entries simultaneously in O(1) time, regardless of table size.
TCAM stores entries with three states per bit: 0, 1, or X (don't care). For IP routing: the network bits are stored as 0/1, and the host bits as X. A lookup of any destination IP matches the most specific entry in a single clock cycle at line rate — 100 Gbps, no software involved.
```
/* TCAM entry for 192.168.1.0/24 */
Value: 11000000.10101000.00000001.00000000
Mask:  11111111.11111111.11111111.00000000   (X = don't care on unmasked bits)

/* A lookup of 192.168.1.55: */
Match: 11000000.10101000.00000001.00110111
AND mask: first 24 bits match → HIT

/* TCAM limitations */
# Expensive per-bit (vs SRAM)
# High power consumption
# Limited capacity (typically 256K–2M entries in enterprise routers)
# Internet full routing table ~950K prefixes — approaches TCAM limits
# BGP router TCAM exhaustion is a real operational concern
```
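The value/mask rule in the example can be modelled in C: an entry "hits" when the key agrees with the stored value on every masked-in bit (mask bit 0 means X / don't care). Real TCAM evaluates all entries in one clock cycle; the loop below is only a software model of the matching rule, with hypothetical names:

```c
#include <stdint.h>

/* One TCAM-style match: key hits if it equals `value` on every
 * bit set in `mask`. */
int tcam_hit(uint32_t value, uint32_t mask, uint32_t key)
{
    return ((key ^ value) & mask) == 0;
}

/* Hardware returns the highest-priority (longest-prefix) hit; a
 * software model scans entries sorted longest-prefix-first. */
int tcam_lookup(const uint32_t *values, const uint32_t *masks,
                int n, uint32_t key)
{
    for (int i = 0; i < n; i++)
        if (tcam_hit(values[i], masks[i], key))
            return i;   /* index of first (most specific) hit */
    return -1;          /* miss */
}
```

With value 0xC0A80100 (192.168.1.0) and mask 0xFFFFFF00 (/24), the key 0xC0A80137 (192.168.1.55) hits, while 0xC0A80237 (192.168.2.55) does not.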
LONGEST PREFIX MATCH — THE CORE ROUTING ALGORITHM
What is Longest Prefix Match?
CONCEPT

Longest Prefix Match (LPM) is the rule for selecting which route to use when multiple routes match a destination address. The route with the most specific prefix (longest subnet mask / highest prefix length) wins.
```
/* Routing table example */
10.0.0.0/8     via 192.168.1.1     # matches any 10.x.x.x
10.10.0.0/16   via 192.168.1.2     # matches any 10.10.x.x
10.10.1.0/24   via 192.168.1.3     # matches any 10.10.1.x
10.10.1.5/32   via 192.168.1.4     # matches ONLY 10.10.1.5
0.0.0.0/0      via 192.168.1.254   # default — matches anything

/* LPM for destination 10.10.1.5 */
0.0.0.0/0      matches → /0 prefix length
10.0.0.0/8     matches → /8 prefix length
10.10.0.0/16   matches → /16 prefix length
10.10.1.0/24   matches → /24 prefix length
10.10.1.5/32   matches → /32 prefix length ← LONGEST MATCH → use this route

/* LPM for destination 10.10.2.50 */
0.0.0.0/0      matches → /0
10.0.0.0/8     matches → /8
10.10.0.0/16   matches → /16 ← LONGEST MATCH → use this route
10.10.1.0/24   no match (wrong third octet)
10.10.1.5/32   no match

/* LPM for destination 8.8.8.8 */
0.0.0.0/0      matches → /0 ← LONGEST (only) MATCH → default route
```
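The three walkthroughs above can be checked mechanically with a linear scan: test every entry, keep the longest matching prefix. This is the pure semantics of LPM (real FIBs use tries or TCAM for speed); the table mirrors the example, and `lpm_select` is a hypothetical helper that returns the winning prefix length:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t prefix; uint8_t len; } route_t;

/* The example table (next-hops omitted; we return the prefix length). */
static const route_t table[] = {
    { 0x0A000000,  8 },   /* 10.0.0.0/8        */
    { 0x0A0A0000, 16 },   /* 10.10.0.0/16      */
    { 0x0A0A0100, 24 },   /* 10.10.1.0/24      */
    { 0x0A0A0105, 32 },   /* 10.10.1.5/32      */
    { 0x00000000,  0 },   /* 0.0.0.0/0 default */
};

/* Return the longest matching prefix length for dst (-1 = no route). */
int lpm_select(uint32_t dst)
{
    int best = -1;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        uint8_t len = table[i].len;
        /* len == 0 would shift by 32 (undefined), so mask is 0 there */
        uint32_t mask = len ? 0xFFFFFFFFu << (32 - len) : 0;
        if ((dst & mask) == table[i].prefix && (int)len > best)
            best = len;
    }
    return best;
}
```

`lpm_select(0x0A0A0105)` (10.10.1.5) returns 32, `lpm_select(0x0A0A0232)` (10.10.2.50) returns 16, and `lpm_select(0x08080808)` (8.8.8.8) falls through to the default route, returning 0 — matching the walkthroughs.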
💡 The /32 host route is the most specific possible — it matches exactly one IP address. Used for: BGP next-hop routes, traffic engineering, sinkholing specific IPs, loopback interfaces. In VPP/DPDK data planes you will frequently add /32 routes for specific flows.
LPM Data Structures — Radix Trie
ALGORITHM

Software LPM (used in the Linux kernel, in VPP for IPv6, and wherever TCAM isn't available) is typically implemented with a radix trie (Patricia trie). Each bit of the IP address is a branch point. The lookup traverses the trie bit by bit from MSB to LSB, recording the last matching node that has a route entry. After all bits are processed, the last recorded entry is the LPM result.
```
/* Binary trie LPM lookup for 10.10.1.5 */
/* 10.10.1.5 = 00001010.00001010.00000001.00000101 */

Bit 1 (MSB): 0 → go left
Bit 2:       0 → go left
Bit 3:       0 → go left
Bit 4:       0 → go left
Bit 5:       1 → go right   ← first '1' bit
Found entry: 10.0.0.0/8 (prefix ends at bit 8) → record as current best
...continue to bit 8,  record 10.0.0.0/8...
...continue to bit 16, record 10.10.0.0/16...
...continue to bit 24, record 10.10.1.0/24...
...continue to bit 32, record 10.10.1.5/32...
End of bits → return 10.10.1.5/32 (last recorded = longest match)

/* Implementing LPM in C — simple version */
typedef struct trie_node {
    struct trie_node *child[2];   /* 0=left, 1=right */
    uint32_t nexthop;             /* 0 = no route at this node */
    uint8_t  prefix_len;
} trie_node_t;

uint32_t lpm_lookup(trie_node_t *root, uint32_t dst_ip)
{
    trie_node_t *node = root;
    uint32_t best_nexthop = 0;

    for (int bit = 31; bit >= 0 && node; bit--) {
        if (node->nexthop)
            best_nexthop = node->nexthop;
        int b = (dst_ip >> bit) & 1;
        node = node->child[b];
    }
    if (node && node->nexthop)
        best_nexthop = node->nexthop;
    return best_nexthop;   /* 0 = no route (drop) */
}
```
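The lookup above needs a matching insert routine to be usable (Lab 1 asks for one). A sketch under the same node layout — the lookup is repeated so this block is self-contained, and `trie_demo` is just a self-check with made-up next-hop IDs:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct trie_node {
    struct trie_node *child[2];   /* 0=left, 1=right */
    uint32_t nexthop;             /* 0 = no route at this node */
    uint8_t  prefix_len;
} trie_node_t;

/* Walk the top prefix_len bits of `prefix`, creating nodes as
 * needed, and store the nexthop at the terminal node. */
void trie_insert(trie_node_t *root, uint32_t prefix,
                 uint8_t prefix_len, uint32_t nexthop)
{
    trie_node_t *node = root;
    for (int bit = 31; bit > 31 - (int)prefix_len; bit--) {
        int b = (prefix >> bit) & 1;
        if (!node->child[b])
            node->child[b] = calloc(1, sizeof(trie_node_t));
        node = node->child[b];
    }
    node->nexthop    = nexthop;
    node->prefix_len = prefix_len;
}

/* Same algorithm as the lookup above. */
uint32_t lpm_lookup(trie_node_t *node, uint32_t dst_ip)
{
    uint32_t best = 0;
    for (int bit = 31; bit >= 0 && node; bit--) {
        if (node->nexthop) best = node->nexthop;
        node = node->child[(dst_ip >> bit) & 1];
    }
    if (node && node->nexthop) best = node->nexthop;
    return best;
}

/* Self-check: overlapping prefixes — the longest must win. */
int trie_demo(void)
{
    trie_node_t *root = calloc(1, sizeof(trie_node_t));
    trie_insert(root, 0x0A000000,  8, 1);   /* 10.0.0.0/8   -> id 1 */
    trie_insert(root, 0x0A0A0000, 16, 2);   /* 10.10.0.0/16 -> id 2 */
    trie_insert(root, 0x0A0A0105, 32, 3);   /* 10.10.1.5/32 -> id 3 */
    return lpm_lookup(root, 0x0A0A0105) == 3 &&   /* /32 wins      */
           lpm_lookup(root, 0x0A0A0232) == 2 &&   /* /16 covers it */
           lpm_lookup(root, 0x0B000001) == 0;     /* no route      */
}
```

A production trie would also need delete and path compression (Patricia), but insert plus lookup is enough for the lab's test cases.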
LPM at Scale — DIR-24-8 and LC-Trie
For IPv4 in software at high speed, routers often use DIR-24-8 (Direct Index Route lookup): a two-level table in which the first 24 bits of the destination index a 16M-entry array (one 32-bit entry for each possible /24), and the last 8 bits are resolved via a secondary table for prefixes longer than /24. This achieves a lookup in one or two memory accesses with excellent cache behaviour; DPDK's LPM library is built on this scheme.
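A sketch of the structure in C, under two simplifying assumptions: routes are inserted shortest-prefix-first (the real DPDK implementation stores a depth per entry so insertion order doesn't matter), and next-hop 0 means "no route". All names here are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

#define FLAG_TBL8 0x80000000u   /* tbl24 entry points into tbl8 */

typedef struct {
    uint32_t *tbl24;   /* 2^24 entries: nexthop, or FLAG_TBL8|group */
    uint32_t *tbl8;    /* one 256-entry group per extended /24 */
    uint32_t  groups;
} dir24_8_t;

dir24_8_t *dir_new(void)
{
    dir24_8_t *t = calloc(1, sizeof *t);
    t->tbl24 = calloc(1u << 24, sizeof(uint32_t));   /* 64 MB */
    return t;
}

/* Insert shortest-prefix-first so longer prefixes overwrite. */
void dir_add(dir24_8_t *t, uint32_t prefix, uint8_t len, uint32_t nh)
{
    if (len <= 24) {
        uint32_t first = prefix >> 8, span = 1u << (24 - len);
        for (uint32_t i = 0; i < span; i++)
            if (!(t->tbl24[first + i] & FLAG_TBL8))
                t->tbl24[first + i] = nh;
    } else {
        uint32_t i24 = prefix >> 8, g;
        if (t->tbl24[i24] & FLAG_TBL8) {
            g = t->tbl24[i24] & ~FLAG_TBL8;
        } else {                       /* spill this /24 into tbl8 */
            g = t->groups++;
            t->tbl8 = realloc(t->tbl8, t->groups * 256 * sizeof(uint32_t));
            for (int i = 0; i < 256; i++)      /* seed with /24 result */
                t->tbl8[g * 256 + i] = t->tbl24[i24];
            t->tbl24[i24] = FLAG_TBL8 | g;
        }
        uint32_t first = prefix & 0xFF, span = 1u << (32 - len);
        for (uint32_t i = 0; i < span; i++)
            t->tbl8[g * 256 + first + i] = nh;
    }
}

/* One memory access for prefixes <= /24, two for longer ones. */
uint32_t dir_lookup(const dir24_8_t *t, uint32_t dst)
{
    uint32_t e = t->tbl24[dst >> 8];
    if (e & FLAG_TBL8)
        e = t->tbl8[(e & ~FLAG_TBL8) * 256 + (dst & 0xFF)];
    return e;   /* 0 = no route */
}

int dir_demo(void)
{
    dir24_8_t *t = dir_new();
    dir_add(t, 0x0A000000,  8, 1);   /* 10.0.0.0/8   */
    dir_add(t, 0x0A0A0100, 24, 2);   /* 10.10.1.0/24 */
    dir_add(t, 0x0A0A0105, 32, 3);   /* 10.10.1.5/32 */
    return dir_lookup(t, 0x0A0A0105) == 3 &&
           dir_lookup(t, 0x0A0A0142) == 2 &&
           dir_lookup(t, 0x0A000001) == 1 &&
           dir_lookup(t, 0x0B000001) == 0;
}
```

The trade-off is explicit: 64 MB of mostly-redundant entries buys a lookup that touches at most two cache lines, instead of up to 32 trie nodes.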
ECMP — EQUAL-COST MULTI-PATH ROUTING
What ECMP Is and Why It Exists
CONCEPT

ECMP (Equal-Cost Multi-Path) allows a router to use multiple next-hops simultaneously for a destination when multiple paths have the same metric. This provides both load balancing (traffic distributed across multiple links) and redundancy (if one path fails, others continue).
```
/* ECMP in Linux routing table */
$ ip route show
10.0.0.0/8
    nexthop via 192.168.1.1 dev eth0 weight 1
    nexthop via 192.168.1.2 dev eth1 weight 1
    nexthop via 192.168.1.3 dev eth2 weight 1
# Three equal-cost paths — traffic balanced across all three

/* ECMP hashing — how traffic is distributed */
# Each packet is assigned to a path using a hash of its flow identifier
# This ensures packets of the SAME FLOW go to the SAME next-hop
# (required for stateful protocols — TCP reordering causes retransmits)

/* Hash inputs (per-flow consistent hashing) */
5-tuple hash (most common):
    hash(src_ip, dst_ip, src_port, dst_port, protocol) % num_paths

/* For IPv6: also include Flow Label (20-bit) */
    hash(src_ip6, dst_ip6, flow_label) % num_paths

/* Weighted ECMP (unequal links) */
10.0.0.0/8
    nexthop via 192.168.1.1 weight 3   # 3/4 of traffic
    nexthop via 192.168.1.2 weight 1   # 1/4 of traffic
```
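The per-flow hash can be sketched as a small mixing function over the 5-tuple. The mixing constants and the per-router seed are arbitrary choices for illustration (real implementations use e.g. Toeplitz or jhash); `ecmp_path` is a hypothetical name:

```c
#include <stdint.h>

static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9E3779B1u;        /* golden-ratio multiplier */
    return h ^ (h >> 16);
}

/* Map a flow's 5-tuple to one of num_paths next-hops. Every packet
 * of the same flow yields the same hash, hence the same path. The
 * seed should differ per router to avoid hash polarisation. */
uint32_t ecmp_path(uint32_t seed,
                   uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port,
                   uint8_t proto, uint32_t num_paths)
{
    uint32_t h = seed;
    h = mix(h, src_ip);
    h = mix(h, dst_ip);
    h = mix(h, ((uint32_t)src_port << 16) | dst_port);
    h = mix(h, proto);
    return h % num_paths;
}
```

Because the function is deterministic in the tuple, a TCP connection stays pinned to one next-hop for its lifetime, which is exactly what prevents reordering.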
ECMP Challenges — The Polarisation Problem
CHALLENGES

ECMP has two well-known operational problems:
- Hash polarisation — if every router in a topology uses the same hash function and inputs, all traffic may land on the same path through the network. The fix: vary hash seeds per router, or include router-specific fields (e.g., ingress interface) in the hash.
- Large flows dominate — ECMP distributes by flow, not by byte count. A single TCP elephant flow (file download) sending 10 Gbps gets one path. 1000 small flows might be evenly distributed across 3 paths but the elephant's path carries 10× more traffic. Fix: flow-aware traffic engineering (MPLS TE, Segment Routing).
ECMP and Stateful Firewalls
ECMP creates a critical challenge for stateful firewalls and NGFW clusters: if outbound and inbound packets of the same TCP session take different paths and hit different firewall nodes, the node handling the return traffic has no state for the connection and drops it (asymmetric routing). Solutions: session synchronisation between firewall nodes, consistent hashing to ensure a symmetric path, or stateless inspection modes.
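One of the listed fixes — consistent (symmetric) hashing — can be sketched by canonicalising the endpoints before hashing, so the forward and reverse directions of a session land on the same node. `sym_path` and its constants are illustrative:

```c
#include <stdint.h>

/* Order the two endpoints canonically (lower IP, then port, first),
 * then hash. hash(A->B) == hash(B->A), so both directions of a TCP
 * session reach the same firewall node. */
uint32_t sym_path(uint32_t a_ip, uint16_t a_port,
                  uint32_t b_ip, uint16_t b_port,
                  uint8_t proto, uint32_t num_nodes)
{
    uint32_t lo_ip = a_ip, hi_ip = b_ip;
    uint16_t lo_pt = a_port, hi_pt = b_port;
    if (a_ip > b_ip || (a_ip == b_ip && a_port > b_port)) {
        lo_ip = b_ip; hi_ip = a_ip;      /* swap to canonical order */
        lo_pt = b_port; hi_pt = a_port;
    }
    uint32_t h = lo_ip;
    h = (h ^ hi_ip) * 0x9E3779B1u;
    h = (h ^ (((uint32_t)lo_pt << 16) | hi_pt)) * 0x85EBCA6Bu;
    h = (h ^ proto) ^ (h >> 13);
    return h % num_nodes;
}
```

The cost of symmetry is a slightly less uniform spread than a direction-sensitive hash, which clusters usually accept in exchange for not losing connection state.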
```
/* ECMP configuration in Linux */

# Add ECMP route
ip route add 10.0.0.0/8 \
    nexthop via 192.168.1.1 dev eth0 weight 1 \
    nexthop via 192.168.1.2 dev eth1 weight 1

# Control ECMP hash inputs
sysctl net.ipv4.fib_multipath_hash_policy
# 0 = L3 (src+dst IP only)
# 1 = L3+L4 (5-tuple) — recommended for better distribution
# 2 = L3+L4 including inner headers for encapsulated packets

# VPP ECMP (load-balance object)
# vppctl: ip route add 10.0.0.0/8 via 192.168.1.1 GigE0/0
# vppctl: ip route add 10.0.0.0/8 via 192.168.1.2 GigE0/1
# VPP creates a load-balance object with flow-hash buckets
```
ADMINISTRATIVE DISTANCE — ROUTE TRUSTWORTHINESS
Administrative Distance — Route Source Preference
AD

When multiple routing protocols learn routes to the same destination, the router must choose which one to install in the FIB. Administrative Distance (AD) is the preference value assigned to each routing source — lower AD = more trusted = preferred. AD is a Cisco term; other vendors use similar concepts (route preference, distance).
| Route Source | AD (Cisco) | AD (Linux metric) | Why This Priority |
|---|---|---|---|
| Connected interface | 0 | 0 | Router is directly attached — 100% reliable |
| Static route | 1 | — | Manually configured — administrator knows best |
| BGP (eBGP) | 20 | — | External BGP — specific route from peer, high trust |
| EIGRP (internal) | 90 | — | Cisco proprietary — high trust |
| OSPF | 110 | — | Standard IGP — trusted but below static |
| IS-IS | 115 | — | Standard IGP — similar to OSPF |
| RIP | 120 | — | Old distance-vector — lower trust |
| EIGRP (external) | 170 | — | Redistributed from another protocol — less trusted |
| BGP (iBGP) | 200 | — | Internal BGP — redistributed, lowest trust by default |
| Unreachable (blackhole) | 255 | — | Used to mark routes as unusable |
```
/* AD in practice — floating static route */
# Primary path: OSPF learns 10.0.0.0/8 via fiber link (AD=110)
# Backup: static route via 4G modem (should only be used if OSPF fails)

ip route add 10.0.0.0/8 via 192.168.100.1 metric 200

# Linux has no AD; route priority is expressed through the metric
# While OSPF is active: the OSPF-installed route has a lower metric and wins
# When OSPF fails: OSPF route removed → static route with metric 200 takes over
# This is a "floating static route" — it floats below the dynamic routes

/* Viewing route sources in Linux */
ip route show proto ospf     # routes from OSPF
ip route show proto bgp      # routes from BGP
ip route show proto static   # static routes
ip route show proto kernel   # connected (kernel-generated)
```
POLICY-BASED ROUTING — BEYOND DESTINATION-ONLY FORWARDING
What Policy Routing Adds
PBR

Standard routing forwards packets based only on destination IP address. Policy-Based Routing (PBR) allows routing decisions based on additional criteria: source IP, DSCP/TOS, protocol, port, or even incoming interface. This is essential for NGFW deployments where different traffic classes must take different paths.
```
/* Linux Policy Routing (ip rule + multiple routing tables) */
# Scenario: ISP-A (eth0) for normal traffic, ISP-B (eth1) for VoIP

# Step 1: Create a separate routing table
# /etc/iproute2/rt_tables: add "200 isp_b"
echo "200 isp_b" >> /etc/iproute2/rt_tables

# Step 2: Populate the table for ISP-B
ip route add default via 10.2.0.1 dev eth1 table isp_b
ip route add 10.2.0.0/24 dev eth1 table isp_b   # local subnet

# Step 3: Create rules to select the table based on criteria
ip rule add from 192.168.1.0/24 dport 5060 table isp_b priority 100  # SIP → ISP-B
ip rule add dscp 46 table isp_b priority 200                         # EF (VoIP) → ISP-B
ip rule add from 10.0.0.5 table isp_b priority 300                   # specific host → ISP-B
# Default rule: use main table (ISP-A)

# View rules (evaluated in priority order, lower = first)
ip rule show
# 0:     from all lookup local
# 100:   from 192.168.1.0/24 dport 5060 lookup isp_b
# 200:   dscp 46 lookup isp_b
# 300:   from 10.0.0.5 lookup isp_b
# 32766: from all lookup main
# 32767: from all lookup default
```
💡 NGFW use cases for PBR: Route management traffic out a dedicated OOB (out-of-band) interface. Send IDS/IPS traffic to an inline inspection appliance. Route different VLANs to different next-hops. Forward specific threat-tagged traffic to a honeypot or sandbox. Force all DNS to the internal resolver regardless of destination port.
LINUX ROUTING INTERNALS AND COMMANDS
Linux Kernel Routing Architecture
LINUX

```
/* Linux routing table management commands */

# View main routing table (FIB)
ip route show
ip route show table main
route -n                     # older tool (still useful)

# View a specific route and which would be used
ip route get 8.8.8.8
# 8.8.8.8 via 192.168.1.1 dev eth0 src 192.168.1.5 uid 1000
#     cache

# Add static routes
ip route add 10.0.0.0/8 via 192.168.1.1              # via next-hop
ip route add 10.0.0.0/8 dev eth0                     # directly connected
ip route add 10.0.0.0/8 via 192.168.1.1 metric 100   # with metric
ip route add blackhole 203.0.113.0/24                # null route (drop)
ip route add prohibit 192.0.2.0/24                   # drop + ICMP prohibit
ip route add unreachable 198.51.100.0/24             # drop + ICMP unreachable

# Delete routes
ip route del 10.0.0.0/8 via 192.168.1.1

# Default route
ip route add default via 192.168.1.1
ip route add 0.0.0.0/0 via 192.168.1.1   # same thing

# Make Linux forward packets (act as router)
sysctl net.ipv4.ip_forward=1
echo 1 > /proc/sys/net/ipv4/ip_forward
# Permanent: add to /etc/sysctl.conf

# Route cache: Linux 3.6+ uses the FIB directly — no separate route cache
# Earlier kernels had a route cache (dst_entry) vulnerable to hash-collision attacks

# IPv6 routing
ip -6 route show
ip -6 route add 2001:db8::/32 via fe80::1 dev eth0
ip -6 route get 2001:4860:4860::8888
```
Routing in VPP — The Data-Plane Perspective
VPP

```
/* VPP FIB architecture */
# VPP uses a multi-level FIB structure in hugepage memory
# IP4 FIB: hash table + mtrie (multi-way trie) for IPv4
# IP6 FIB: hash table for /128s, mtrie for others

# VPP routing commands (vppctl)
show ip fib                # entire IPv4 FIB
show ip fib 10.0.0.0/8     # specific prefix
show ip fib summary        # prefix count by length
ip route add 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0
ip route add 0.0.0.0/0 via 192.168.1.254 GigabitEthernet0/8/0

# ECMP in VPP
ip route add 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0
ip route add 10.0.0.0/8 via 192.168.1.2 GigabitEthernet0/8/1
# VPP automatically creates a load-balance adjacency with flow-hash buckets
show ip fib 10.0.0.0/8     # shows load-balance object with N buckets

# VPP FIB lookup in C (graph node)
/* In ip4_lookup.c: */
ip4_fib_mtrie_lookup_step (mtrie, &leaf, &a->dst_address, 0);
ip4_fib_mtrie_lookup_step (mtrie, &leaf, &a->dst_address, 1);
ip4_fib_mtrie_lookup_step (mtrie, &leaf, &a->dst_address, 2);
ip4_fib_mtrie_lookup_step (mtrie, &leaf, &a->dst_address, 3);
/* 4 pipeline stages for 32-bit IPv4 address lookup */
```
Lab 1: LPM Trie Implementation in C
Objective: Implement a binary trie for LPM routing lookups. Insert routes and verify the correct next-hop is returned for various destination IPs.
Step 1: Define a trie_node_t struct and a trie_insert(root, prefix, prefix_len, nexthop) function. For each prefix, walk bits 0–prefix_len of the address down the trie, creating nodes as needed, and store the nexthop at the terminal node.
Step 2: Implement lpm_lookup(root, dst_ip) — traverse the trie bit by bit from MSB to LSB, recording the most recently seen nexthop. Return the last recorded nexthop (or 0 for no route).
Step 3: Insert several overlapping routes and verify the correct next-hop is returned for various destination IPs.

Lab 2: Build a Multi-Path Router with ECMP
Objective: Configure a Linux VM as a router with ECMP, observe traffic distribution, and test failover when one path goes down.
Step 1: Enable forwarding: sudo sysctl net.ipv4.ip_forward=1. Add an ECMP route with two nexthops: sudo ip route add 10.0.0.0/8 nexthop via 192.168.1.1 dev eth0 weight 1 nexthop via 192.168.1.2 dev eth1 weight 1.
Step 2: Use ip route get with different destination IPs to see which nexthop is selected. The hash of the source and destination IPs determines the path: for i in $(seq 1 20); do ip route get 10.0.$i.1 | grep via; done. Count how many go to each nexthop — it should be roughly 50/50.
Step 3: Switch to L3+L4 hashing: sudo sysctl net.ipv4.fib_multipath_hash_policy=1. Repeat. The distribution may change because port numbers now affect the hash.
Step 4: Simulate a link failure (sudo ip link set eth1 down). Verify the ECMP route has only one nexthop remaining: ip route show. Bring it back up and re-add the nexthop.

Lab 3: Policy Routing — Multiple Uplinks
Objective: Configure Linux Policy Routing to route different traffic classes to different uplinks.
Step 1: Create tables: echo "100 isp1" >> /etc/iproute2/rt_tables; echo "200 isp2" >> /etc/iproute2/rt_tables. Populate each: ip route add default via 10.1.0.1 table isp1; ip route add default via 10.2.0.1 table isp2.
Step 2: Add rules: ip rule add dport 22 table isp1 priority 100; ip rule add dport 80 table isp2 priority 101; ip rule add dport 443 table isp2 priority 102. View with ip rule show.
Step 3: Test: ip route get 8.8.8.8 dport 22 vs ip route get 8.8.8.8 dport 80. Verify different nexthops are selected. Use traceroute on each to confirm different first hops.

M10 MASTERY CHECKLIST
- Know the 5-step forwarding process: receive → validate IP (TTL, checksum) → FIB lookup (LPM) → ARP resolve → rewrite L2 + transmit
- Know what the router changes: decrements TTL, recomputes IP checksum, rewrites Ethernet src/dst MAC
- Know what the router does NOT change: src IP, dst IP, payload
- Know the difference between RIB (control plane, all routes) and FIB (data plane, best routes for forwarding)
- Know how TCAM works: ternary (0/1/X) match, all entries searched simultaneously, O(1) lookup
- Know TCAM limitations: expensive, power-hungry, limited capacity (~1M entries), BGP full table exhaustion risk
- Know LPM rule: among all matching prefixes, the one with the longest (most specific) prefix length wins
- Can manually perform LPM given a routing table and destination IP
- Know the radix trie LPM algorithm: traverse bit by bit MSB to LSB, record last matching node
- Know DIR-24-8: two-level array for IPv4 LPM at near-O(1) with good cache behaviour
- Know ECMP: multiple equal-cost next-hops, flow-based consistent hashing for per-flow path selection
- Know ECMP hash inputs: 5-tuple (src_ip, dst_ip, src_port, dst_port, proto) ensures same flow → same path
- Know ECMP challenges: hash polarisation, elephant flows, asymmetric routing breaking stateful firewalls
- Know Administrative Distance: lower AD = more trusted = installed in FIB. Key values: connected=0, static=1, OSPF=110, iBGP=200
- Know floating static route: static with high metric used as backup when dynamic route disappears
- Know Policy-Based Routing: route decisions based on src IP, DSCP, protocol, port — not just dst IP
- Know Linux PBR: ip rule + multiple routing tables; rules evaluated in priority order
- Know key Linux routing commands: ip route show/add/del/get, ip rule show/add, sysctl net.ipv4.ip_forward
- Know null/blackhole routes: drop packets matching prefix (used for traffic engineering and anti-DDoS)
- Completed Lab 1: implemented binary trie LPM in C, verified correct nexthop for all test cases
- Completed Lab 2: configured ECMP with two nexthops, verified hash distribution, tested failover
- Completed Lab 3: configured policy routing with two uplinks, verified different traffic classes use different paths
✅ When complete: Move to M11 - OSPF Internals. You now understand how routers forward packets — M11 covers how they learn routes dynamically via OSPF's link-state flooding and Dijkstra's SPF algorithm.