NETWORKING MASTERY · PHASE 3 · MODULE 10 · WEEK 8
🗺️ Routing Fundamentals and FIB
FIB/RIB architecture · Longest Prefix Match · LPM algorithms · ECMP · Policy routing · Administrative distance
Intermediate · Prerequisite: M03 (IPv4 / RFC 1812) · Core data-plane operation · 3 labs

HOW ROUTERS DECIDE WHERE PACKETS GO

🗺️

The Routing Problem

OVERVIEW

Routing is the process of selecting a path for network traffic. When a packet arrives at a router, the router must answer one question in microseconds: which interface should I send this packet out of? The answer depends on the packet's destination IP address and the router's routing table.

Three types of routes populate a routing table:

  • Connected routes — automatically created when an interface is brought up and assigned an IP. The router knows it can reach that subnet directly. These are the most trusted route source (administrative distance 0).
  • Static routes — manually configured by an administrator. Simple, predictable, but don't adapt to topology changes.
  • Dynamic routes — learned from other routers via routing protocols (OSPF, BGP, RIP). Adapt automatically when links fail or topology changes.
🚦 Analogy — GPS Navigation System

A router's routing table is like a GPS navigation system's map. When you enter a destination, the GPS looks at all known roads, finds the best route, and gives you turn-by-turn directions. A routing table contains all known network destinations (like roads on a map) with their next hops (like turns). The GPS's "fastest route" is like a router's "lowest metric". Just as GPS updates its map when roads are closed, dynamic routing protocols update the routing table when links fail. And just as GPS uses the most specific address you enter — a full street address wins over just a city name — routers use the most specific prefix that matches.

📊

The Five-Step Packet Forwarding Process

FORWARDING
/* When a router receives a packet */

Step 1: Receive packet on ingress interface
  - NIC copies frame from wire to memory
  - Strip Ethernet header, verify CRC

Step 2: Verify IP header
  - Check IP version (4 or 6)
  - Validate IP header checksum (drop if corrupt)
  - Check TTL: if TTL <= 1 → discard + send ICMP Time Exceeded
  - Otherwise decrement TTL

Step 3: FIB lookup (Longest Prefix Match)
  - Look up destination IP in Forwarding Information Base
  - Find most specific matching prefix
  - Get: egress interface + next-hop IP

Step 4: Resolve next-hop MAC (ARP / Neighbour Cache)
  - Look up next-hop IP in ARP cache
  - If miss: send ARP request, queue packet

Step 5: Rewrite L2 header + transmit
  - New Ethernet header: dst=next-hop MAC, src=egress port MAC
  - Recompute IP checksum (TTL changed)
  - Transmit on egress interface

/* What the router does NOT change */
# Source IP, Destination IP — unchanged (except NAT)
# Payload (TCP/UDP/application data) — unchanged
# Only TTL (decremented) and checksum (recomputed) change in IP header
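The checksum recompute in Step 5 need not re-sum the whole header: because only the TTL byte changed, the checksum can be patched incrementally (the RFC 1141/1624 technique). A minimal sketch — the header bytes used to exercise it below are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Full IPv4 header checksum (RFC 791): one's-complement sum of 16-bit
 * words, computed with the checksum field itself zeroed. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)                       /* fold end-around carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incremental update: TTL is the high byte of the TTL/protocol word, so
 * decrementing TTL subtracts 0x0100 from that word — the one's-complement
 * checksum therefore increases by 0x0100, with end-around carry. */
static uint16_t ttl_decrement_checksum(uint16_t old_check)
{
    uint32_t sum = (uint32_t)old_check + 0x0100;
    return (uint16_t)(sum + (sum >> 16));
}
```

This is why fast paths never re-read the other 18 header bytes: one add and one fold patch the checksum after the TTL decrement.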

RIB AND FIB — THE TWO ROUTING DATABASES

📚

RIB vs FIB — Different Purposes, Different Structures

ARCHITECTURE

Modern routing systems maintain two separate databases for routes. Understanding the distinction is critical for data-plane engineering:

RIB — Routing Information Base

The control plane database. Contains all routes learned from all sources — connected, static, OSPF, BGP, RIP, etc. — including multiple competing routes to the same destination from different protocols or next-hops. The RIB is the "complete knowledge" database. It is managed by the routing daemon (Quagga, FRR, VPP's routing engine).

The RIB is not used for packet forwarding — it's too large and complex for per-packet lookups at line rate.

FIB — Forwarding Information Base

The data plane database. Contains only the best route per destination prefix — selected from the RIB by the routing daemon after applying administrative distance and metric comparisons. The FIB is what the forwarding engine (ASIC, NP, or CPU) uses for every packet lookup.

The FIB is optimised for speed: it may be stored in TCAM (hardware), a radix trie, or a hash table. In VPP, the FIB is a multi-level lookup structure in hugepage memory. In Linux, it's the kernel routing table.

/* RIB → FIB population process */

RIB contains (for destination 10.0.0.0/8):
  OSPF:   10.0.0.0/8 via 192.168.1.2  metric=20  AD=110
  Static: 10.0.0.0/8 via 192.168.1.3  metric=0   AD=1
  BGP:    10.0.0.0/8 via 10.255.0.1   metric=100  AD=200

Route selection:
  1. Prefer lower Administrative Distance: Static AD=1 wins over OSPF(110) and BGP(200)
  2. Among equal AD routes: prefer lower metric
  Best route: Static 10.0.0.0/8 via 192.168.1.3

FIB entry: 10.0.0.0/8 → egress=eth1 nexthop=192.168.1.3 mac=aa:bb:cc:dd:ee:ff
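The selection logic above is a two-key comparison, sketched here in C with the example's AD and metric values (the next-hop encodings are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* One RIB candidate for a prefix. */
typedef struct {
    const char *proto;
    uint32_t    nexthop;   /* next-hop address as a 32-bit value */
    uint32_t    metric;
    uint8_t     ad;        /* administrative distance */
} rib_route_t;

/* Pick the route the daemon would install into the FIB:
 * lowest AD wins; metric breaks ties among equal-AD routes. */
static const rib_route_t *select_best(const rib_route_t *r, int n)
{
    const rib_route_t *best = NULL;
    for (int i = 0; i < n; i++) {
        if (!best ||
            r[i].ad < best->ad ||
            (r[i].ad == best->ad && r[i].metric < best->metric))
            best = &r[i];
    }
    return best;
}
```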

/* Linux commands */
ip route show table main        # FIB (main routing table)
ip route show table all         # all routing tables
ip route get 8.8.8.8            # which route would be used for 8.8.8.8?

/* VPP FIB inspection */
# vppctl: show ip fib
# vppctl: show ip fib 10.0.0.0/8
# vppctl: show ip fib summary
🏎️

Hardware FIB — TCAM

HARDWARE

High-end routers and switches implement the FIB in TCAM (Ternary Content-Addressable Memory) — a specialised memory that can match a search key against all stored entries simultaneously in O(1) time, regardless of table size.

TCAM stores entries with three states per bit: 0, 1, or X (don't care). For IP routing: the network bits are stored as 0/1, and the host bits as X. A lookup of any destination IP matches the most specific entry in a single clock cycle at line rate — 100 Gbps, no software involved.

/* TCAM entry for 192.168.1.0/24 */
Value: 11000000.10101000.00000001.00000000
Mask:  11111111.11111111.11111111.00000000  (mask-0 bits = X / don't care → the host bits)

/* A lookup of 192.168.1.55: */
Key:   11000000.10101000.00000001.00110111
(key AND mask) == value → first 24 bits match → HIT

/* TCAM limitations */
# Expensive per-bit (vs SRAM)
# High power consumption
# Limited capacity (typically 256K–2M entries in enterprise routers)
# Internet full routing table ~950K prefixes — approaches TCAM limits
# BGP router TCAM exhaustion is a real operational concern
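The value/mask matching above can be modelled in software. This sketch scans entries and keeps the most specific match, standing in for the parallel compare plus priority encoder that a real TCAM does in one cycle (the entries and port ids are illustrative):

```c
#include <stdint.h>

/* Software model of a TCAM FIB entry: a value/mask pair. */
typedef struct {
    uint32_t value;   /* network bits */
    uint32_t mask;    /* 1 = care, 0 = don't care (host bits) */
    int      port;    /* egress port / next-hop id */
} tcam_entry_t;

/* Contiguous-from-MSB masks compare numerically: a longer prefix has a
 * larger mask (0xFFFFFF00 > 0xFFFF0000), so "largest mask" = LPM. */
static int tcam_lookup(const tcam_entry_t *t, int n, uint32_t key)
{
    int best = -1;
    uint32_t best_mask = 0;
    for (int i = 0; i < n; i++) {
        if ((key & t[i].mask) == t[i].value &&
            (best < 0 || t[i].mask > best_mask)) {
            best = i;
            best_mask = t[i].mask;
        }
    }
    return best < 0 ? -1 : t[best].port;
}
```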

LONGEST PREFIX MATCH — THE CORE ROUTING ALGORITHM

🔍

What is Longest Prefix Match?

CONCEPT

Longest Prefix Match (LPM) is the rule for selecting which route to use when multiple routes match a destination address. The route with the most specific prefix (longest subnet mask / highest prefix length) wins.

/* Routing table example */
10.0.0.0/8       via 192.168.1.1   # matches any 10.x.x.x
10.10.0.0/16     via 192.168.1.2   # matches any 10.10.x.x
10.10.1.0/24     via 192.168.1.3   # matches any 10.10.1.x
10.10.1.5/32     via 192.168.1.4   # matches ONLY 10.10.1.5
0.0.0.0/0        via 192.168.1.254 # default — matches anything

/* LPM for destination 10.10.1.5 */
0.0.0.0/0     matches  → /0  prefix length
10.0.0.0/8    matches  → /8  prefix length
10.10.0.0/16  matches  → /16 prefix length
10.10.1.0/24  matches  → /24 prefix length
10.10.1.5/32  matches  → /32 prefix length  ← LONGEST MATCH → use this route

/* LPM for destination 10.10.2.50 */
0.0.0.0/0     matches  → /0
10.0.0.0/8    matches  → /8
10.10.0.0/16  matches  → /16  ← LONGEST MATCH → use this route
10.10.1.0/24  no match (wrong third octet)
10.10.1.5/32  no match

/* LPM for destination 8.8.8.8 */
0.0.0.0/0     matches  → /0   ← LONGEST (only) MATCH → default route

💡 The /32 host route is the most specific possible — it matches exactly one IP address. Used for: BGP next-hop routes, traffic engineering, sinkholing specific IPs, loopback interfaces. In VPP/DPDK data planes you will frequently add /32 routes for specific flows.
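As a baseline, LPM over a small table can be done with a linear scan: check every prefix, keep the longest match. This is the naive O(n) reference that the data structures below improve on (next-hop ids are illustrative):

```c
#include <stdint.h>

typedef struct {
    uint32_t prefix;
    uint8_t  len;       /* prefix length, 0..32 */
    uint32_t nexthop;   /* 0 = reserved for "no route" */
} route_t;

static uint32_t lpm_scan(const route_t *rt, int n, uint32_t dst)
{
    uint32_t best_nh = 0;
    int best_len = -1;
    for (int i = 0; i < n; i++) {
        /* /0 matches everything; a 32-bit shift by 32 is undefined in C,
         * so handle len == 0 explicitly. */
        uint32_t mask = rt[i].len ? 0xFFFFFFFFu << (32 - rt[i].len) : 0;
        if ((dst & mask) == rt[i].prefix && rt[i].len > best_len) {
            best_len = rt[i].len;
            best_nh = rt[i].nexthop;
        }
    }
    return best_nh;   /* 0 = no route */
}
```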

🌳

LPM Data Structures — Radix Trie

ALGORITHM

Software LPM (used in Linux kernel, VPP for IPv6, and when TCAM isn't available) is typically implemented with a radix trie (Patricia trie). Each bit of the IP address is a branch point. The lookup traverses the trie bit by bit from MSB to LSB, recording the last matching node that has a route entry. After all bits are processed, the last recorded entry is the LPM result.

/* Binary trie LPM lookup for 10.10.1.5 */
/* 10.10.1.5 = 00001010.00001010.00000001.00000101 */

Bit 1 (MSB): 0 → go left
Bit 2:       0 → go left
Bit 3:       0 → go left
Bit 4:       0 → go left
Bit 5:       1 → go right  ← first '1' bit
...continue to bit 8 → node holds entry 10.0.0.0/8 → record as current best...
...continue to bit 16, record 10.10.0.0/16...
...continue to bit 24, record 10.10.1.0/24...
...continue to bit 32, record 10.10.1.5/32...
End of bits → return 10.10.1.5/32 (last recorded = longest match)

/* Implementing LPM in C — simple binary-trie version */
#include <stdint.h>

typedef struct trie_node {
    struct trie_node *child[2]; /* 0=left, 1=right */
    uint32_t nexthop;           /* 0 = no route at this node */
    uint8_t  prefix_len;        /* length of the prefix stored at this node */
} trie_node_t;

uint32_t lpm_lookup(trie_node_t *root, uint32_t dst_ip) {
    trie_node_t *node = root;
    uint32_t best_nexthop = 0;
    for (int bit = 31; bit >= 0 && node; bit--) {
        if (node->nexthop) best_nexthop = node->nexthop;
        int b = (dst_ip >> bit) & 1;
        node = node->child[b];
    }
    if (node && node->nexthop) best_nexthop = node->nexthop;
    return best_nexthop; /* 0 = no route (drop) */
}

LPM at Scale — DIR-24-8 and LC-Trie

For IPv4 in software at high speed, routers often use DIR-24-8 (Direct Index Route lookup): a two-level table in which the first 24 bits of the destination index a 16M-entry array (one entry per possible /24), and prefixes longer than /24 redirect to a small second-level table indexed by the final 8 bits. Most lookups cost a single memory access, the rest two, with excellent cache performance.
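A toy sketch of the scheme (DPDK's rte_lpm is a production version of the same idea) — a single second-level block and shortest-prefix-first insertion keep it minimal; the names and next-hop ids are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

#define TBL24_SIZE (1u << 24)
#define FLAG_TBL8  0x8000u     /* entry redirects to the second level */

typedef struct {
    uint16_t *tbl24;           /* TBL24_SIZE entries, one per /24 */
    uint16_t  tbl8[256];       /* one second-level block, for the sketch */
} dir24_8_t;

/* Install a prefix of length <= 24: fill every /24 slot it covers.
 * In this sketch, routes must be installed shortest-prefix-first. */
static void add_short(dir24_8_t *d, uint32_t prefix, int len, uint16_t nh)
{
    uint32_t first = prefix >> 8;
    uint32_t count = 1u << (24 - len);
    for (uint32_t i = 0; i < count; i++)
        d->tbl24[first + i] = nh;
}

/* Install a /32: copy the covering route into the tbl8 block so other
 * hosts in the /24 still resolve, then override the one host slot. */
static void add_host(dir24_8_t *d, uint32_t dst, uint16_t nh)
{
    uint16_t prev = d->tbl24[dst >> 8];
    for (int i = 0; i < 256; i++)
        d->tbl8[i] = prev & (uint16_t)~FLAG_TBL8;
    d->tbl8[dst & 0xFF] = nh;
    d->tbl24[dst >> 8] = FLAG_TBL8;
}

static uint16_t d248_lookup(const dir24_8_t *d, uint32_t dst)
{
    uint16_t e = d->tbl24[dst >> 8];     /* first access: top 24 bits */
    if (e & FLAG_TBL8)
        return d->tbl8[dst & 0xFF];      /* second access: last 8 bits */
    return e;                            /* 0 = no route */
}
```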

ECMP — EQUAL-COST MULTI-PATH ROUTING

⚖️

What ECMP Is and Why It Exists

CONCEPT

ECMP (Equal-Cost Multi-Path) allows a router to use multiple next-hops simultaneously for a destination when multiple paths have the same metric. This provides both load balancing (traffic distributed across multiple links) and redundancy (if one path fails, others continue).

/* ECMP in Linux routing table */
$ ip route show
10.0.0.0/8
    nexthop via 192.168.1.1 dev eth0 weight 1
    nexthop via 192.168.1.2 dev eth1 weight 1
    nexthop via 192.168.1.3 dev eth2 weight 1
# Three equal-cost paths — traffic balanced across all three

/* ECMP hashing — how traffic is distributed */
# Each packet is assigned to a path using a hash of its flow identifier
# This ensures packets of the SAME FLOW always take the SAME next-hop
# (required for TCP — cross-path reordering triggers spurious retransmits)

/* Hash inputs (per-flow consistent hashing) */
5-tuple hash (most common):
  hash(src_ip, dst_ip, src_port, dst_port, protocol) % num_paths

/* For IPv6: also include Flow Label (20-bit) */
hash(src_ip6, dst_ip6, flow_label) % num_paths

/* Weighted ECMP (unequal links) */
10.0.0.0/8
    nexthop via 192.168.1.1 weight 3   # 3/4 of traffic
    nexthop via 192.168.1.2 weight 1   # 1/4 of traffic
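The hashing above can be sketched in C: hash the 5-tuple, then index a bucket table in which each next-hop appears weight-many times — the same structure as VPP's load-balance buckets and hardware ECMP tables. The mixer function here is arbitrary and illustrative, not a standard:

```c
#include <stdint.h>

/* Illustrative 5-tuple flow hash (not a standard algorithm). */
static uint32_t flow_hash(uint32_t src, uint32_t dst,
                          uint16_t sport, uint16_t dport, uint8_t proto)
{
    uint32_t h = src ^ dst ^ ((uint32_t)sport << 16 | dport) ^ proto;
    h ^= h >> 16; h *= 0x7feb352d;
    h ^= h >> 15; h *= 0x846ca68b;
    h ^= h >> 16;
    return h;
}

/* Weighted ECMP: replicate each next-hop into weight-many buckets,
 * then select a bucket with the flow hash. The same 5-tuple always
 * lands in the same bucket → no intra-flow reordering. */
static int ecmp_select(const int *nexthop, const int *weight, int npaths,
                       uint32_t hash)
{
    int buckets[64], n = 0;    /* sketch: total weight must be <= 64 */
    for (int p = 0; p < npaths; p++)
        for (int w = 0; w < weight[p] && n < 64; w++)
            buckets[n++] = nexthop[p];
    return buckets[hash % n];
}
```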
⚠️

ECMP Challenges — The Polarisation Problem

CHALLENGES

ECMP has two well-known operational problems:

  • Hash polarisation — if every router in a topology uses the same hash function and inputs, all traffic may land on the same path through the network. The fix: vary hash seeds per router, or include router-specific fields (e.g., ingress interface) in the hash.
  • Large flows dominate — ECMP distributes by flow, not by byte count. A single TCP elephant flow (file download) sending 10 Gbps gets one path. 1000 small flows might be evenly distributed across 3 paths but the elephant's path carries 10× more traffic. Fix: flow-aware traffic engineering (MPLS TE, Segment Routing).

ECMP and Stateful Firewalls

ECMP creates a critical challenge for stateful firewalls and NGFW clusters: if outbound and inbound packets of the same TCP session take different paths and hit different firewall nodes, the firewall node handling the return traffic has no state for the connection and drops it (asymmetric routing). Solutions: session synchronisation between firewall nodes, consistent hashing to ensure symmetric path, or stateless inspection modes.

/* ECMP configuration in Linux */
# Add ECMP route
ip route add 10.0.0.0/8 \
    nexthop via 192.168.1.1 dev eth0 weight 1 \
    nexthop via 192.168.1.2 dev eth1 weight 1

# Control ECMP hash inputs
sysctl net.ipv4.fib_multipath_hash_policy
# 0 = L3 (src+dst IP only)
# 1 = L3+L4 (5-tuple) — recommended for better distribution
# 2 = L3+L4 including inner headers for encapsulated packets

# VPP ECMP (load-balance object)
# vppctl: ip route add 10.0.0.0/8 via 192.168.1.1 GigE0/0
# vppctl: ip route add 10.0.0.0/8 via 192.168.1.2 GigE0/1
# VPP creates a load-balance object with flow-hash buckets

ADMINISTRATIVE DISTANCE — ROUTE TRUSTWORTHINESS

🏆

Administrative Distance — Route Source Preference

AD

When multiple routing protocols learn routes to the same destination, the router must choose which one to install in the FIB. Administrative Distance (AD) is the preference value assigned to each routing source — lower AD = more trusted = preferred. AD is a Cisco term; other vendors use similar concepts (route preference, distance).

Route Source               AD (Cisco)   Why This Priority
Connected interface        0            Router is directly attached — 100% reliable
Static route               1            Manually configured — administrator knows best
BGP (eBGP)                 20           External BGP — specific route from peer, high trust
EIGRP (internal)           90           Cisco proprietary — high trust
OSPF                       110          Standard IGP — trusted but below static
IS-IS                      115          Standard IGP — similar to OSPF
RIP                        120          Old distance-vector — lower trust
EIGRP (external)           170          Redistributed from another protocol — less trusted
BGP (iBGP)                 200          Internal BGP — lowest trust by default
Unreachable (blackhole)    255          Used to mark routes as unusable

(Linux has no AD as such — preference between same-prefix routes from different sources is expressed via route metric and proto, as in the floating-static example below.)
/* AD in practice — floating static route */
# Primary path: OSPF learns 10.0.0.0/8 via fiber link (AD=110)
# Backup: static route via 4G modem (should only be used if OSPF fails)

ip route add 10.0.0.0/8 via 192.168.100.1 metric 200
# Linux expresses preference via metric: lowest metric for a prefix wins
# While OSPF is active: the daemon installs its route with a lower metric → OSPF wins
# When OSPF fails: the daemon withdraws its route → the metric-200 static takes over
# This is a "floating static route" — it floats below the dynamic routes

/* Viewing route sources in Linux */
ip route show proto ospf    # routes from OSPF
ip route show proto bgp     # routes from BGP
ip route show proto static  # static routes
ip route show proto kernel  # connected (kernel-generated)

POLICY-BASED ROUTING — BEYOND DESTINATION-ONLY FORWARDING

📋

What Policy Routing Adds

PBR

Standard routing forwards packets based only on destination IP address. Policy-Based Routing (PBR) allows routing decisions based on additional criteria: source IP, DSCP/TOS, protocol, port, or even incoming interface. This is essential for NGFW deployments where different traffic classes must take different paths.

/* Linux Policy Routing (ip rule + multiple routing tables) */

# Scenario: ISP-A (eth0) for normal traffic, ISP-B (eth1) for VoIP

# Step 1: Create separate routing tables
# /etc/iproute2/rt_tables: add "200 isp_b"
echo "200 isp_b" >> /etc/iproute2/rt_tables

# Step 2: Populate table for ISP-B
ip route add default via 10.2.0.1 dev eth1 table isp_b
ip route add 10.2.0.0/24 dev eth1 table isp_b  # local subnet

# Step 3: Create rules to select table based on criteria
ip rule add from 192.168.1.0/24 ipproto udp dport 5060 table isp_b priority 100  # SIP → ISP-B
ip rule add tos 0xb8 table isp_b priority 200   # DSCP EF 46 (VoIP): tos = 46<<2 = 0xb8 → ISP-B
ip rule add from 10.0.0.5 table isp_b priority 300   # specific host → ISP-B
# Default rule: use main table (ISP-A)

# View rules (evaluated in priority order, lower = first)
ip rule show
# 0:      from all lookup local
# 100:    from 192.168.1.0/24 ipproto udp dport 5060 lookup isp_b
# 200:    from all tos 0xb8 lookup isp_b
# 300:    from 10.0.0.5 lookup isp_b
# 32766:  from all lookup main
# 32767:  from all lookup default

💡 NGFW use cases for PBR: Route management traffic out a dedicated OOB (out-of-band) interface. Send IDS/IPS traffic to an inline inspection appliance. Route different VLANs to different next-hops. Forward specific threat-tagged traffic to a honeypot or sandbox. Force all DNS to the internal resolver regardless of destination port.

LINUX ROUTING INTERNALS AND COMMANDS

🐧

Linux Kernel Routing Architecture

LINUX
/* Linux routing table management commands */

# View main routing table (FIB)
ip route show
ip route show table main
route -n   # older tool (still useful)

# View a specific route and which would be used
ip route get 8.8.8.8
# 8.8.8.8 via 192.168.1.1 dev eth0 src 192.168.1.5 uid 1000
#    cache

# Add static routes
ip route add 10.0.0.0/8 via 192.168.1.1              # via next-hop
ip route add 10.0.0.0/8 dev eth0                     # directly connected
ip route add 10.0.0.0/8 via 192.168.1.1 metric 100  # with metric
ip route add blackhole 203.0.113.0/24                # null route (drop)
ip route add prohibit 192.0.2.0/24                   # drop + ICMP prohibit
ip route add unreachable 198.51.100.0/24             # drop + ICMP unreachable

# Delete routes
ip route del 10.0.0.0/8 via 192.168.1.1

# Default route
ip route add default via 192.168.1.1
ip route add 0.0.0.0/0 via 192.168.1.1  # same thing

# Make Linux forward packets (act as router)
sysctl net.ipv4.ip_forward=1
echo 1 > /proc/sys/net/ipv4/ip_forward
# Permanent: add to /etc/sysctl.conf

# Route cache: since Linux 3.6 the kernel performs FIB lookups directly — no separate cache
# (earlier kernels cached per-destination dst_entry objects; the cache was
#  vulnerable to hash-collision and cache-exhaustion DoS and was removed)

# IPv6 routing
ip -6 route show
ip -6 route add 2001:db8::/32 via fe80::1 dev eth0
ip -6 route get 2001:4860:4860::8888
🔧

Routing in VPP — The Data-Plane Perspective

VPP
/* VPP FIB architecture */
# VPP uses a multi-level FIB structure in hugepage memory
# IP4 FIB: hash table + mtrie (multi-way trie) for IPv4
# IP6 FIB: hash table for /128s, mtrie for others

# VPP routing commands (vppctl)
show ip fib                          # entire IPv4 FIB
show ip fib 10.0.0.0/8              # specific prefix
show ip fib summary                  # prefix count by length

ip route add 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0
ip route add 0.0.0.0/0  via 192.168.1.254 GigabitEthernet0/8/0

# ECMP in VPP
ip route add 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0
ip route add 10.0.0.0/8 via 192.168.1.2 GigabitEthernet0/8/1
# VPP automatically creates a load-balance adjacency with flow-hash buckets
show ip fib 10.0.0.0/8    # shows load-balance object with N buckets

# VPP FIB lookup in C (graph node)
/* In ip4_lookup.c (simplified): recent VPP uses a 16-8-8 mtrie —
   step_one consumes the first 16 bits, then two steps consume one
   address byte each: */
leaf = ip4_fib_mtrie_lookup_step_one (mtrie, &a->dst_address);
leaf = ip4_fib_mtrie_lookup_step (mtrie, leaf, &a->dst_address, 2);
leaf = ip4_fib_mtrie_lookup_step (mtrie, leaf, &a->dst_address, 3);
/* 3 dependent memory reads resolve a 32-bit IPv4 address */
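A simplified model of an mtrie-style lookup — 256-way nodes, one address byte per step. Real VPP uses a 16-8-8 stride split and packs leaves into 32-bit words; this sketch keeps the best-match walk explicit:

```c
#include <stdint.h>

/* 256-way trie node: each level consumes one byte of the address. */
typedef struct mtrie_node {
    struct mtrie_node *child[256];
    uint32_t nexthop;              /* 0 = no route stored at this node */
} mtrie_node_t;

static uint32_t mtrie_lookup(const mtrie_node_t *root, uint32_t dst)
{
    const mtrie_node_t *n = root;
    uint32_t best = 0;
    for (int step = 0; step < 4 && n; step++) {
        if (n->nexthop) best = n->nexthop;            /* remember best so far */
        n = n->child[(dst >> (24 - 8 * step)) & 0xFF]; /* next address byte */
    }
    if (n && n->nexthop) best = n->nexthop;           /* depth-32 node */
    return best;
}
```

(Production mtries avoid the best-so-far bookkeeping by expanding shorter prefixes into the leaves at insert time, so the lookup is pure loads.)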
LAB 1

LPM Trie Implementation in C

Objective: Implement a binary trie for LPM routing lookups. Insert routes and verify the correct next-hop is returned for various destination IPs.

1
Implement the trie_node_t struct and a trie_insert(root, prefix, prefix_len, nexthop) function. For each route, walk the first prefix_len bits of the prefix from the MSB, creating child nodes as needed, and store the nexthop at the final node.
2
Implement lpm_lookup(root, dst_ip) — traverse the trie bit by bit from MSB to LSB, recording the most recently seen nexthop. Return the last recorded nexthop (or 0 for no route).
3
Insert: 0.0.0.0/0 → 1, 10.0.0.0/8 → 2, 10.10.0.0/16 → 3, 10.10.1.0/24 → 4, 10.10.1.5/32 → 5. Verify: lookup(10.10.1.5)=5, lookup(10.10.1.6)=4, lookup(10.10.2.1)=3, lookup(10.20.0.1)=2, lookup(8.8.8.8)=1.
4
Benchmark: insert 100,000 random /24 prefixes and measure lookups/second. Compare with a linear scan of the same routes. The trie should be 100–1000× faster on large tables.
LAB 2

Build a Multi-Path Router with ECMP

Objective: Configure a Linux VM as a router with ECMP, observe traffic distribution, and test failover when one path goes down.

1
Enable IP forwarding: sudo sysctl net.ipv4.ip_forward=1. Add an ECMP route with two nexthops: sudo ip route add 10.0.0.0/8 nexthop via 192.168.1.1 dev eth0 weight 1 nexthop via 192.168.1.2 dev eth1 weight 1.
2
Verify traffic distribution: use ip route get with different destination IPs to see which nexthop is selected. The hash of the destination IP determines the path: for i in $(seq 1 20); do ip route get 10.0.$i.1 | grep via; done. Count how many go to each nexthop — should be roughly 50/50.
3
Enable L4 hashing: sudo sysctl net.ipv4.fib_multipath_hash_policy=1. Repeat. The distribution should change because port numbers now affect the hash.
4
Test failover: bring down eth1 (sudo ip link set eth1 down). Verify the ECMP route has only one nexthop remaining: ip route show. Bring it back up and re-add the nexthop.
LAB 3

Policy Routing — Multiple Uplinks

Objective: Configure Linux Policy Routing to route different traffic classes to different uplinks.

1
Add routing tables: echo "100 isp1" >> /etc/iproute2/rt_tables; echo "200 isp2" >> /etc/iproute2/rt_tables. Populate each: ip route add default via 10.1.0.1 table isp1; ip route add default via 10.2.0.1 table isp2.
2
Add rules: SSH traffic (port 22) to ISP1 (management), web traffic to ISP2: ip rule add ipproto tcp dport 22 table isp1 priority 100; ip rule add ipproto tcp dport 80 table isp2 priority 101; ip rule add ipproto tcp dport 443 table isp2 priority 102. View with ip rule show.
3
Test: ip route get 8.8.8.8 ipproto tcp dport 22 vs ip route get 8.8.8.8 ipproto tcp dport 80. Verify different nexthops are selected. Use traceroute on each to confirm different first hops.

M10 MASTERY CHECKLIST

When complete: Move to M11 - OSPF Internals. You now understand how routers forward packets — M11 covers how they learn routes dynamically via OSPF's link-state flooding and Dijkstra's SPF algorithm.
