WHY DPDK — THE KERNEL IS NOT FAST ENOUGH
The Performance Problem DPDK Solves
Motivation: You already have 2.5 years of DPDK experience — this module deepens that foundation with the theoretical underpinning of each performance technique, connecting your practical knowledge to the "why".
/* Why kernel forwarding is slow — root causes */

1. Interrupt overhead (eliminated by PMD polling):
   10G at 64B = 14.8M packets/s → up to 14.8M IRQs/s
   Each IRQ: context switch + cache invalidation ≈ 1000 cycles
   ≈ 14.8 billion cycles/s wasted

2. Memory allocation (eliminated by mempool):
   kmalloc()/kfree() per sk_buff → fragmentation, lock contention
   DPDK mempool: pre-allocated, lock-free, O(1)

3. Memory copies (eliminated by zero-copy design):
   Kernel: NIC → DMA buffer → sk_buff → socket rcvbuf → userspace
   DPDK:   NIC DMA → mbuf in hugepage → application (only the NIC's DMA write, no CPU copies)

4. Cache misses (eliminated by hugepages + NUMA pinning):
   4KB pages:     1GB of packet buffers = 262,144 TLB entries → TLB thrash
   2MB hugepages: same 1GB = 512 TLB entries → fits in TLB

5. Lock contention (eliminated by per-core design):
   Kernel routing: locks on ARP cache, routing table, socket buffers
   DPDK: each core owns its own queues and mempools → no locks

/* Performance numbers */
Kernel stack:             ~1-3 Mpps per core (64B packets)
DPDK (Intel):             ~30-80 Mpps per core
DPDK (Mellanox/ConnectX): up to 100+ Mpps per core
Your servers: AMD EPYC + Mellanox — which PMD are you using?
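To make these numbers concrete, here is a rough per-packet cycle budget calculation; the 3 GHz core frequency is an illustrative assumption, not a figure from this module:

/* Rough per-packet cycle budget at 10G line rate with 64B frames.
 * Assumption: a 3 GHz core (illustrative). */
#include <stdio.h>

int main(void)
{
    double core_hz   = 3.0e9;          /* assumed 3 GHz core */
    double pps       = 14.8e6;         /* 10G line rate, 64B frames */
    double cycles_pp = core_hz / pps;  /* ≈ 200 cycles per packet */

    printf("budget: %.0f cycles/packet\n", cycles_pp);
    /* A single DRAM miss (~100+ cycles) or a single IRQ (~1000 cycles) blows
     * this budget, which is why DPDK polls, batches, and keeps data in cache. */
    return 0;
}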
EAL — ENVIRONMENT ABSTRACTION LAYER
EAL Initialization and Configuration
/* DPDK EAL — abstracts OS and hardware */
/* Manages: hugepages, NUMA, CPU affinity, PCI devices, logging */

/* Minimal DPDK application skeleton */
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static int port_init(uint16_t port, struct rte_mempool *mp);  /* see PMD tab */
static int lcore_main(void *arg);                             /* see PMD tab */

int main(int argc, char **argv)
{
    /* EAL init: parses EAL args, sets up hugepages, maps devices */
    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");
    argc -= ret;
    argv += ret;                      /* remaining args are app-specific */

    /* Check available ports */
    uint16_t nb_ports = rte_eth_dev_count_avail();
    printf("Available ports: %u\n", nb_ports);

    /* Create mempool (see the mbuf/mempool tab) */
    struct rte_mempool *mp = rte_pktmbuf_pool_create(
        "MBUF_POOL",                  /* name */
        8192 * nb_ports,              /* n: total mbufs */
        256,                          /* cache_size: per-lcore cache */
        0,                            /* priv_size */
        RTE_MBUF_DEFAULT_BUF_SIZE,
        rte_socket_id());             /* NUMA socket */

    /* Configure each port */
    uint16_t port_id;
    RTE_ETH_FOREACH_DEV(port_id) {
        port_init(port_id, mp);
    }

    /* Launch the worker on each lcore, then wait for them before cleanup */
    rte_eal_mp_remote_launch(lcore_main, NULL, CALL_MAIN);
    rte_eal_mp_wait_lcore();

    rte_eal_cleanup();
    return 0;
}

/* Key EAL command-line arguments */
-l 0-3                        # use lcores 0,1,2,3 (logical CPU cores)
-n 4                          # 4 memory channels
--socket-mem 2048             # 2GB hugepage memory on socket 0
--vdev eth_pcap0,iface=eth0   # pcap driver (for testing without a real NIC)
-a 0000:01:00.0               # allow only this PCI device

/* Hugepage setup (required before DPDK runs) */
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
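To compile and launch the skeleton against an installed DPDK, a typical invocation looks like the following; the binary name, core list and PCI address are placeholders:

# Build: pkg-config must be able to find libdpdk
cc main.c -o fwd $(pkg-config --cflags --libs libdpdk)

# Run: lcores 0-3, 4 memory channels, one allowed NIC
sudo ./fwd -l 0-3 -n 4 -a 0000:01:00.0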
mbuf AND MEMPOOL — PACKET BUFFER MANAGEMENT
rte_mbuf and rte_mempool
/* rte_mbuf — DPDK's packet buffer (analogous to sk_buff) */
struct rte_mbuf {
    /* Buffer addresses */
    void       *buf_addr;     /* virtual address of buffer start */
    rte_iova_t  buf_iova;     /* IO/DMA address */
    uint16_t    buf_len;      /* total buffer length */
    uint16_t    data_off;     /* offset to first byte of data (headroom) */
    uint16_t    data_len;     /* data length in THIS mbuf segment */
    uint32_t    pkt_len;      /* total packet length (all segments) */

    /* Segmentation (chained mbufs for large packets) */
    struct rte_mbuf *next;    /* next segment in chain (NULL if only one) */
    uint16_t    nb_segs;      /* number of segments */

    /* Offload flags */
    uint64_t    ol_flags;     /* RTE_MBUF_F_TX_IP_CKSUM, RTE_MBUF_F_RX_RSS_HASH, ... */
    uint32_t    packet_type;  /* RTE_PTYPE_L3_IPV4, L4_TCP, etc. */
    union {
        uint32_t rss;         /* RSS hash computed by NIC: m->hash.rss */
        /* ... other hash formats (fdir, sched, ...) */
    } hash;

    /* Port and queue */
    uint16_t    port;
    uint32_t    seqn;
};

/* Accessing packet data */
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
                                                  sizeof(struct rte_ether_hdr));
uint32_t src_ip = rte_be_to_cpu_32(ip->src_addr);
uint32_t dst_ip = rte_be_to_cpu_32(ip->dst_addr);

/* Prepend header (like skb_push) */
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)
    rte_pktmbuf_prepend(m, sizeof(struct rte_ether_hdr));

/* rte_mempool — pre-allocated, lock-free pool */
/* Pool has: global ring + per-lcore cache (avoids the lock on the common case) */

/* Allocate mbuf from pool */
struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
if (!m) {
    /* pool exhausted — back-pressure or drop */
}

/* Free mbuf back to pool */
rte_pktmbuf_free(m);          /* returns to per-lcore cache, then global ring */

/* Bulk allocate/free (amortizes pool overhead) */
struct rte_mbuf *mbufs[32];
rte_pktmbuf_alloc_bulk(mp, mbufs, 32);
rte_pktmbuf_free_bulk(mbufs, 32);            /* refcount-aware bulk free */
/* rte_mempool_put_bulk(mp, (void **)mbufs, 32) is the raw pool put:
 * only safe when refcnt == 1 and the mbufs were not otherwise modified */
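The alloc/append/prepend calls come together when you build a packet from scratch. A minimal sketch, assuming mp is the pool created in the EAL tab and using placeholder MAC addresses (make_frame is a hypothetical helper name):

/* Sketch: build an Ethernet frame in a fresh mbuf. */
#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_byteorder.h>

static struct rte_mbuf *make_frame(struct rte_mempool *mp)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
    if (m == NULL)
        return NULL;                              /* pool exhausted */

    /* append() reserves room at the tail and bumps data_len and pkt_len */
    struct rte_ether_hdr *eth = (struct rte_ether_hdr *)
        rte_pktmbuf_append(m, sizeof(*eth));
    if (eth == NULL) {                            /* not enough tailroom */
        rte_pktmbuf_free(m);
        return NULL;
    }

    struct rte_ether_addr src = {{0x02, 0x00, 0x00, 0x00, 0x00, 0x01}}; /* placeholder */
    struct rte_ether_addr dst = {{0x02, 0x00, 0x00, 0x00, 0x00, 0x02}}; /* placeholder */
    rte_ether_addr_copy(&src, &eth->src_addr);
    rte_ether_addr_copy(&dst, &eth->dst_addr);
    eth->ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);
    return m;
}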
PMD AND BURST API — THE CORE FORWARDING LOOP
Poll Mode Driver and rte_eth_rx/tx_burst
/* Port initialization */
static int port_init(uint16_t port, struct rte_mempool *mp)
{
    struct rte_eth_conf port_conf = {
        .rxmode = { .mtu = RTE_ETHER_MTU },            /* standard 1500-byte MTU */
        .txmode = { .mq_mode = RTE_ETH_MQ_TX_NONE },
    };
    const int rx_rings = 1, tx_rings = 1;
    uint16_t nb_rxd = 1024, nb_txd = 1024;

    rte_eth_dev_configure(port, rx_rings, tx_rings, &port_conf);
    rte_eth_dev_adjust_nb_rx_tx_desc(port, &nb_rxd, &nb_txd);

    /* Setup RX queue on the NUMA-local socket */
    rte_eth_rx_queue_setup(port, 0, nb_rxd,
                           rte_eth_dev_socket_id(port), NULL, mp);
    /* Setup TX queue */
    rte_eth_tx_queue_setup(port, 0, nb_txd,
                           rte_eth_dev_socket_id(port), NULL);

    rte_eth_dev_start(port);
    rte_eth_promiscuous_enable(port);
    return 0;
}

/* Core forwarding loop — the main performance-critical loop */
static int lcore_main(void *arg)
{
    uint16_t port;
    struct rte_mbuf *bufs[BURST_SIZE];               /* BURST_SIZE = 32 */

    RTE_ETH_FOREACH_DEV(port) {
        if (rte_eth_dev_socket_id(port) >= 0 &&
            rte_eth_dev_socket_id(port) != (int)rte_socket_id())
            printf("WARNING: port %u is on a remote NUMA node\n", port);
    }

    printf("Core %u forwarding. [Ctrl+C to quit]\n", rte_lcore_id());

    while (1) {
        RTE_ETH_FOREACH_DEV(port) {
            /* POLL: pull up to BURST_SIZE packets from the NIC RX queue */
            uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;                            /* nothing received */

            /* Process each packet */
            for (uint16_t i = 0; i < nb_rx; i++)
                process_packet(bufs[i]);             /* L3 lookup, NAT, filter... */

            /* Burst transmit — send all processed packets out the paired port */
            uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0, bufs, nb_rx);

            /* Free any packets that failed to transmit */
            if (nb_tx < nb_rx)
                rte_pktmbuf_free_bulk(&bufs[nb_tx], nb_rx - nb_tx);
        }
    }
}

/* Why burst size matters for performance */
# Burst=1:   function call overhead dominates → low throughput
# Burst=32:  amortizes call overhead, fills cache with packet data → optimal
# Burst=128: diminishing returns, prefetch distance too large
# Empirically: 32 is optimal for most workloads
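When packets destined for the same port trickle out one at a time, the rte_eth_tx_buffer helper can hold them until a full burst accumulates. A minimal sketch, assuming the BURST_SIZE from above; tx_buf, tx_buffer_setup, tx_one and tx_drain are hypothetical names:

/* Sketch: batch stray TX packets so single packets still leave in bursts. */
#include <rte_ethdev.h>
#include <rte_malloc.h>

static struct rte_eth_dev_tx_buffer *tx_buf;

static void tx_buffer_setup(uint16_t port)
{
    tx_buf = rte_zmalloc_socket("tx_buffer",
                                RTE_ETH_TX_BUFFER_SIZE(BURST_SIZE), 0,
                                rte_eth_dev_socket_id(port));
    rte_eth_tx_buffer_init(tx_buf, BURST_SIZE);   /* default error action: free the drops */
}

static void tx_one(uint16_t port, struct rte_mbuf *m)
{
    /* Queues the mbuf; transmits automatically once BURST_SIZE have accumulated */
    rte_eth_tx_buffer(port, 0, tx_buf, m);
}

static void tx_drain(uint16_t port)
{
    /* Call periodically so queued packets do not sit forever at low traffic rates */
    rte_eth_tx_buffer_flush(port, 0, tx_buf);
}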
RSS AND FLOW DIRECTOR — HARDWARE PACKET STEERING
RSS Configuration and Flow Director
/* RSS — Receive Side Scaling */
/* NIC hashes the packet 5-tuple → assigns to an RX queue → a specific lcore */
/* Ensures packets of the same flow always go to the same core (session affinity) */

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = RTE_ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,          /* NULL = use the default 40-byte key */
            .rss_hf  = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
        },
    },
};

/* Per-packet RSS hash (computed by NIC hardware) */
if (m->ol_flags & RTE_MBUF_F_RX_RSS_HASH) {
    uint32_t hash = m->hash.rss;      /* use for flow table lookup */
}

/* Symmetric RSS — ensure forward and return packets land on the same core */
/* Standard RSS: hash(sIP,dIP,sPort,dPort) — forward and return hashes differ! */
/* One software fix: hash(sIP^dIP, sPort^dPort) — XOR makes it symmetric */
/* With hardware Toeplitz RSS, use a symmetric key (0x6d5a repeated),
 * which hashes (A→B) and (B→A) to the same value: */
static uint8_t sym_rss_key[40] = {
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
};

/* Flow Director (rte_flow) — exact-match steering beyond RSS */
/* Program specific 5-tuples → a specific queue */
struct rte_flow_attr attr = { .ingress = 1 };
struct rte_flow_item pattern[4];
struct rte_flow_action action[2];
memset(pattern, 0, sizeof(pattern));
memset(action, 0, sizeof(action));

/* Match IPv4 + TCP dst port 80 */
struct rte_flow_item_ipv4 ipv4_spec = { .hdr.dst_addr = htonl(0xc0a80001) };
struct rte_flow_item_tcp  tcp_spec  = { .hdr.dst_port = htons(80) };
pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;  pattern[1].spec = &ipv4_spec;
pattern[2].type = RTE_FLOW_ITEM_TYPE_TCP;   pattern[2].spec = &tcp_spec;
pattern[3].type = RTE_FLOW_ITEM_TYPE_END;

/* → send to queue 3 */
struct rte_flow_action_queue queue_action = { .index = 3 };
action[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;  action[0].conf = &queue_action;
action[1].type = RTE_FLOW_ACTION_TYPE_END;

/* Create the flow rule */
struct rte_flow_error error;
struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, action, &error);
if (flow == NULL)
    printf("rte_flow_create failed: %s\n", error.message ? error.message : "unknown");
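To check which RX queue a given flow should land on, you can read back the NIC's RSS indirection table (RETA) and index it with the packet's RSS hash. A sketch under the assumption that the PMD supports RETA queries; predicted_queue is a hypothetical helper name:

/* Sketch: predict the RX queue for a flow from its RSS hash and the RETA. */
#include <stdint.h>
#include <rte_ethdev.h>

static int predicted_queue(uint16_t port, uint32_t rss_hash)
{
    struct rte_eth_dev_info info;
    if (rte_eth_dev_info_get(port, &info) != 0 || info.reta_size == 0)
        return -1;

    struct rte_eth_rss_reta_entry64 reta[info.reta_size / RTE_ETH_RETA_GROUP_SIZE];
    for (unsigned g = 0; g < info.reta_size / RTE_ETH_RETA_GROUP_SIZE; g++)
        reta[g].mask = UINT64_MAX;                    /* read every entry */

    if (rte_eth_dev_rss_reta_query(port, reta, info.reta_size) != 0)
        return -1;                                    /* PMD does not expose RETA */

    uint32_t idx = rss_hash % info.reta_size;         /* reta_size is a power of two */
    return reta[idx / RTE_ETH_RETA_GROUP_SIZE].reta[idx % RTE_ETH_RETA_GROUP_SIZE];
}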
DPDK PIPELINES — STRUCTURING COMPLEX DATA PLANES
Run-to-Completion vs Pipeline Models
/* Model 1: Run-to-Completion (RTC) */
/* Each lcore processes packets end-to-end: RX → all processing → TX */
/* Pros: no inter-core communication, lowest latency */
/* Cons: all processing must fit in one core's cycle budget */
lcore 0: RX port 0 → classify → ACL → NAT → TX port 1
lcore 1: RX port 1 → classify → ACL → NAT → TX port 0
lcore 2: RX port 2 → classify → ACL → NAT → TX port 3

/* Model 2: Pipeline (assembly line) */
/* Different cores handle different stages; communicate via ring queues */
/* Pros: each stage specialised, cache-friendly per stage */
/* Cons: ring enqueue/dequeue latency, pipeline stalls */
lcore 0 (RX):       NIC → mbuf → enqueue to classify_ring
lcore 1 (Classify): dequeue → L3 parse → enqueue to acl_ring
lcore 2 (ACL):      dequeue → policy check → enqueue to nat_ring
lcore 3 (NAT+TX):   dequeue → NAT → NIC TX

/* rte_ring — lock-free SPSC/MPSC/SPMC/MPMC ring queue */
struct rte_ring *ring = rte_ring_create("MY_RING", 4096, rte_socket_id(),
                                        RING_F_SP_ENQ | RING_F_SC_DEQ);

/* Enqueue (producer side) */
rte_ring_enqueue_burst(ring, (void **)mbufs, nb_mbufs, NULL);

/* Dequeue (consumer side) */
unsigned int n = rte_ring_dequeue_burst(ring, (void **)mbufs, BURST_SIZE, NULL);

/* DPDK Graph framework (DPDK 20.11+) */
/* Modern way to build pipelines: a graph of processing nodes
 * (ip4_lookup, ip4_rewrite, acl_classify, etc.) connected by edges */
/* Automatic vectorisation: each node processes a batch of packets at a time */
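A minimal two-stage sketch of the pipeline model, assuming the ring created above, port 0 as ingress, port 1 as egress, and the BURST_SIZE and process_packet() stub from the PMD tab; rx_stage and worker_stage are hypothetical names:

/* Sketch: RX core feeds a worker core through a ring. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

static int rx_stage(void *arg)
{
    struct rte_ring *ring = arg;
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
        if (n == 0)
            continue;
        /* Single-producer enqueue; drop what the ring cannot absorb */
        unsigned int sent = rte_ring_enqueue_burst(ring, (void **)bufs, n, NULL);
        if (sent < n)
            rte_pktmbuf_free_bulk(&bufs[sent], n - sent);
    }
    return 0;
}

static int worker_stage(void *arg)
{
    struct rte_ring *ring = arg;
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        unsigned int n = rte_ring_dequeue_burst(ring, (void **)bufs, BURST_SIZE, NULL);
        for (unsigned int i = 0; i < n; i++)
            process_packet(bufs[i]);              /* classify / ACL / NAT ... */
        if (n) {
            uint16_t tx = rte_eth_tx_burst(1, 0, bufs, n);
            if (tx < n)
                rte_pktmbuf_free_bulk(&bufs[tx], n - tx);
        }
    }
    return 0;
}

/* Launch, e.g.: rte_eal_remote_launch(rx_stage, ring, 1);
 *               rte_eal_remote_launch(worker_stage, ring, 2); */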
DPDK PERFORMANCE TUNING
Systematic Performance Optimisation
/* 1. CPU isolation — dedicate cores to DPDK */
# Kernel boot: isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7
# DPDK EAL:    -l 4-7 (use cores 4-7)
# These cores will spin at 100% polling — don't share them with the OS

/* 2. NUMA awareness — memory on the same socket as the NIC */
if (rte_eth_dev_socket_id(port) != (int)rte_socket_id()) {
    printf("NUMA mismatch: NIC on socket %d, core on socket %d\n",
           rte_eth_dev_socket_id(port), rte_socket_id());
    /* Cross-NUMA memory access: +60ns latency per access */
    /* Fix: pin workers to the same NUMA node as their NIC */
}
/* The mempool MUST be on the same NUMA node as the NIC: */
rte_pktmbuf_pool_create("MP", N, 256, 0, BUF_SIZE,
                        rte_eth_dev_socket_id(port));   /* NOT rte_socket_id() */

/* 3. Prefetching — hide memory latency */
for (i = 0; i < nb_rx; i++) {
    if (i + 4 < nb_rx)
        rte_prefetch0(rte_pktmbuf_mtod(bufs[i + 4], void *));
    process_packet(bufs[i]);   /* by the time we process [i], [i+4] is in L1 */
}

/* 4. TX descriptor writeback — reduce PCIe round trips */
struct rte_eth_txconf txconf = {
    .tx_thresh = { .pthresh = 32, .hthresh = 0, .wthresh = 0 },
    .tx_free_thresh = 32,      /* free 32 descriptors at once, not 1 by 1 */
};

/* 5. Offloads — hardware helps software */
struct rte_eth_conf port_conf = {
    .rxmode.offloads = RTE_ETH_RX_OFFLOAD_CHECKSUM |     /* HW validates cksums */
                       RTE_ETH_RX_OFFLOAD_RSS_HASH,      /* HW computes RSS */
    .txmode.offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |   /* HW fills IP cksum */
                       RTE_ETH_TX_OFFLOAD_TCP_CKSUM,     /* HW fills TCP cksum */
};

/* 6. Measuring performance */
uint64_t hz = rte_get_timer_hz();
uint64_t start = rte_get_timer_cycles();
/* ... process N packets ... */
uint64_t elapsed = rte_get_timer_cycles() - start;
double mpps = (double)N / ((double)elapsed / hz) / 1e6;
printf("%.2f Mpps (%.1f ns/packet)\n", mpps, 1e9 * (double)elapsed / hz / N);
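When throughput falls short, the first question is where packets are being lost. A sketch that reads the standard ethdev counters to separate hardware drops from software problems; report_port is a hypothetical helper name:

/* Sketch: decide whether drops happen in hardware or in software. */
#include <stdio.h>
#include <inttypes.h>
#include <rte_ethdev.h>

static void report_port(uint16_t port)
{
    struct rte_eth_stats st;
    if (rte_eth_stats_get(port, &st) != 0)
        return;

    printf("port %u: rx=%" PRIu64 " tx=%" PRIu64
           " rx_missed=%" PRIu64 " rx_errors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
           port, st.ipackets, st.opackets, st.imissed, st.ierrors, st.rx_nombuf);

    /* imissed rising   → RX descriptors ran out: core too slow, burst too small,
     *                    or NUMA mismatch
     * rx_nombuf rising → mempool too small or mbufs are being leaked */
}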
Lab 1: DPDK Packet Counter and L3 Forwarder
Objective: Build the classic DPDK "basicfwd" — receive packets, swap MAC addresses, transmit back. Add per-flow counters.
Bind the NIC to the userspace driver: dpdk-devbind.py --bind=vfio-pci 0000:01:00.0. Build and run the DPDK skeleton from the EAL tab. Verify you can receive packets. A sketch of the MAC-swap and per-flow counter fast path follows.
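A minimal sketch of the Lab 1 fast path, assuming a single RX queue; FLOW_BUCKETS, flow_hits and mac_swap_and_count are hypothetical names. It swaps the Ethernet addresses in place and counts packets per flow using the NIC-provided RSS hash:

/* Sketch: basicfwd-style MAC swap plus per-flow counters. */
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

#define FLOW_BUCKETS 4096                         /* power of two → cheap masking */
static uint64_t flow_hits[FLOW_BUCKETS];

static inline void mac_swap_and_count(struct rte_mbuf *m)
{
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    struct rte_ether_addr tmp;

    /* Swap source and destination MACs so the packet goes back where it came from */
    rte_ether_addr_copy(&eth->src_addr, &tmp);
    rte_ether_addr_copy(&eth->dst_addr, &eth->src_addr);
    rte_ether_addr_copy(&tmp, &eth->dst_addr);

    /* Count per flow, keyed on the hardware RSS hash */
    if (m->ol_flags & RTE_MBUF_F_RX_RSS_HASH)
        flow_hits[m->hash.rss & (FLOW_BUCKETS - 1)]++;
}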
Lab 2: Multi-Core DPDK with RSS
Objective: Scale DPDK forwarding to multiple cores using RSS for flow distribution.
Verify the per-queue packet counts with rte_eth_stats_get(). Distribution should be roughly even (within 10%). A sketch of the per-queue balance check follows.
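A sketch of that check, assuming nb_queues RSS queues were configured and that nb_queues does not exceed RTE_ETHDEV_QUEUE_STAT_CNTRS (the per-queue counters only cover the first 16 queues); check_rss_balance is a hypothetical helper name:

/* Sketch: print per-queue RX counters and their share of the total. */
#include <stdio.h>
#include <inttypes.h>
#include <rte_ethdev.h>

static void check_rss_balance(uint16_t port, uint16_t nb_queues)
{
    struct rte_eth_stats st;
    uint64_t total = 0;

    if (rte_eth_stats_get(port, &st) != 0)
        return;

    for (uint16_t q = 0; q < nb_queues; q++)
        total += st.q_ipackets[q];

    for (uint16_t q = 0; q < nb_queues; q++) {
        double share = total ? 100.0 * st.q_ipackets[q] / total : 0.0;
        printf("queue %u: %" PRIu64 " pkts (%.1f%%)\n", q, st.q_ipackets[q], share);
        /* With enough distinct flows, each queue should sit near 100/nb_queues %. */
    }
}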
Lab 3: Performance Profiling Deep-Dive
Objective: Profile your DPDK application at the cycle level and identify bottlenecks.
Profile the isolated polling core: perf stat -e cycles,cache-misses,dTLB-load-misses -C 4 sleep 5.
M17 MASTERY CHECKLIST
- Know the 5 root causes of kernel networking overhead and how DPDK eliminates each
- Know DPDK performance numbers: ~30-80 Mpps/core vs kernel ~1-3 Mpps/core
- Know EAL responsibilities: hugepage management, CPU affinity, PCI device mapping, NUMA awareness
- Know EAL CLI options: -l (cores), -n (memory channels), --socket-mem, -a (PCI allowlist)
- Know hugepage requirement and why: 2MB pages reduce TLB pressure (512 vs 262,144 entries for 1GB)
- Know rte_mbuf key fields: buf_addr/buf_iova, data_off, data_len, pkt_len, next, ol_flags, hash.rss
- Know rte_mempool design: global ring + per-lcore cache, lock-free on common path
- Know the PMD polling model: spin loop calling rte_eth_rx_burst, no interrupts ever
- Know why burst size matters: amortizes function call overhead; optimal typically 32
- Know NUMA mismatch penalty: ~60ns latency per cross-NUMA memory access
- Know RSS configuration: mq_mode=RSS, rss_hf for hash fields (IP, TCP, UDP)
- Know symmetric RSS problem and solution: XOR-based Toeplitz key ensures fwd/return on same core
- Know Flow Director: exact-match 5-tuple → specific queue steering (more setup than RSS, but deterministic and precise)
- Know run-to-completion vs pipeline models; when to choose each
- Know rte_ring: lock-free SPSC/MPSC/SPMC/MPMC; used to connect pipeline stages
- Know TX offloads: RTE_ETH_TX_OFFLOAD_IPV4_CKSUM, TCP_CKSUM — hardware fills checksums
- Know prefetching pattern: prefetch N+4 while processing N to hide memory latency
- Completed Lab 1: built DPDK L2 forwarder with per-flow hash table counters, benchmarked burst sizes
- Completed Lab 2: configured multi-core with RSS, verified symmetric flow distribution
- Completed Lab 3: profiled with PMU counters, quantified NUMA penalty and cache miss cost
✅ When complete: Move to M18 - VPP and Data Plane Development — the final Phase 4 module, covering the vector packet processor your team actively uses for R&D.