DPDK MASTERY · PHASE 1 OF 3 · MODULE A
Foundation, Architecture & EAL
Why DPDK exists · Full software stack · Environment Abstraction Layer · PCIe device binding
Ch 1 — Why DPDK · Ch 2 — Architecture · Ch 3 — EAL Deep Dive
C · Linux · PCIe · VFIO · Weeks 1–2

⚡ The Core Problem

DPDK (Data Plane Development Kit) exists because the Linux kernel packet processing path — perfectly adequate for general-purpose computing — is fundamentally too slow for line-rate network function (NF) processing at 10G, 25G, and 100G speeds. It eliminates five categories of overhead: interrupts, allocation, copies, context switches, and protocol stack traversal.
📌 Packet budget at 100Gbps / 64-byte frames: ~148 million packets per second → 6.7 nanoseconds per packet. A single L3 cache miss (~40 cycles @ 3 GHz = ~13 ns) already exceeds this budget. Every design choice in DPDK is shaped by this constraint.

OVERHEAD BREAKDOWN

Overhead | Description | Cost | DPDK Solution
Hardware interrupt | CPU suspends task, saves state, runs ISR | ~1–5 µs + cache pollution | Interrupts disabled — CPU polls NIC (PMD)
softirq scheduling | ISR → NET_RX_SOFTIRQ deferred | Unpredictable latency | No softirq — polling loop runs constantly
sk_buff allocation | Per-packet kmalloc from slab allocator | ~100–200 ns / packet | Pre-allocated mbuf pool — zero runtime alloc
Protocol stack traversal | Full IP/TCP even if NF doesn't need it | Many cache misses | App implements only what it needs
Memory copies | DMA ring → sk_buff → user buffer | 2–4 memcpy / packet | NIC DMA writes directly to user-space hugepage mbuf
Context switch | Kernel → user-space on recv() | ~1–10 µs | No syscalls in data path — pure user-space
TLB pressure | 4KB pages → many TLB entries → frequent misses | 10–100s cycles / miss | 2MB/1GB hugepages — far fewer TLB entries

NAPI — Good but Not Enough

NAPI (New API) switches from interrupt-driven to polling mode under high load, reducing interrupt rate. But it still requires: sk_buff allocation per packet, full protocol stack traversal, memory copies to user space, and context switches on recv(). DPDK eliminates all of these — NAPI only eliminates one.

LINUX KERNEL PACKET PATH — 9 STAGES

Linux Kernel Packet Path — Full Chain
① Packet arrives on wire → NIC DMA writes it to ring buffer in kernel memory
② NIC raises hardware interrupt → CPU suspends current task
③ CPU runs ISR (Interrupt Service Routine) → quick acknowledgement, schedules softirq
④ NET_RX_SOFTIRQ scheduled → deferred processing (unpredictable latency)
⑤ NAPI poll loop runs → pulls packets from NIC ring into sk_buff structures
⑥ sk_buff travels up the protocol stack → Ethernet → IP → TCP/UDP (many function calls, cache misses)
⑦ Packet data copied into socket receive buffer → first memory copy
⑧ Application calls recv() / read() → context switch to user space (~1–10 µs)
⑨ Packet data copied from socket buffer into application buffer → second memory copy
Total: 2+ memory copies · 1+ context switch · 1 interrupt · 1 softirq per packet

sk_buff — The Kernel Packet Structure

The sk_buff is the Linux kernel's equivalent of DPDK's rte_mbuf. Every packet is wrapped in an sk_buff.
  • Allocated per packet at interrupt time — per-packet malloc
  • Contains: head/data/tail/end pointers, network/transport header pointers, protocol info, device pointer, checksum fields
  • Supports cloning and reference counting — complex lifecycle with overhead
  • Size: 200+ bytes of metadata overhead per packet
Comparison Point | Linux sk_buff | DPDK rte_mbuf
Allocation | Per-packet kmalloc at interrupt time | Pre-allocated in pool at startup — zero runtime alloc
Memory location | Kernel heap (4KB pages, swappable) | Hugepage RAM (2MB/1GB, pinned, NUMA-local)
NIC DMA target | Kernel ring buffer → copied to sk_buff | Directly into mbuf data buffer — zero copy
User-space access | Requires syscall + copy | Direct pointer — no syscall, no copy
Metadata size | 200+ bytes | ~128 bytes (rte_mbuf header)

DPDK PACKET PATH — ZERO OVERHEAD

DPDK Fast Path — The Canonical Loop
NIC Rx Queue (hardware ring in hugepage memory)
  ↓ NIC DMA writes packet directly into pre-allocated mbuf
    → ZERO CPU involvement · ZERO kernel involvement
PMD polls ring — no interrupt, no softirq, no context switch
rte_eth_rx_burst(port, queue, mbufs[], burst_size) → returns batch of received mbufs
  ↓
Application processes packets (modify headers, lookup, forward)
  ↓
rte_eth_tx_burst(port, queue, mbufs[], n) → NIC DMA reads from mbuf and sends on wire
  ↓
rte_pktmbuf_free(mbuf) → return mbuf to pool (no free — just reset head pointer)
0 interrupts · 0 copies · 0 malloc/free · 0 context switches
The canonical DPDK main loop in C:
// Minimal DPDK fast-path polling loop
struct rte_mbuf *pkts[BURST_SZ];

while (1) {
    uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SZ);
    for (uint16_t i = 0; i < nb_rx; i++) {
        process_packet(pkts[i]);   // modify, forward, drop
    }
    uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_rx);
    // Free any packets the Tx ring couldn't accept
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(pkts[i]);
}

When NOT to Use DPDK

  • Low packet rate applications — the dedicated polling CPU is wasted
  • Need full TCP/IP stack — DPDK ships no TCP/IP stack of its own; user-space stacks such as VPP or mTCP fill this gap
  • Hardware without PMD support — older or obscure NICs
  • Rapid prototyping — DPDK has significant setup complexity

DPDK SOFTWARE STACK

DPDK Software Stack
┌────────────────────────────────────────────────────────────────┐
│ USER APPLICATION (NF / SASE-DP)                                │
│ SASE URL Filter │ Blaze Broker │ MRI Resolver │ NGFW           │
├──────────────┬────────────┬──────────────┬─────────────────────┤
│ rte_flow     │ rte_acl    │ rte_hash     │ rte_lpm             │
│ rte_timer    │ rte_meter  │ rte_sched    │ rte_distributor     │
├──────────────┴────────────┴──────────────┴─────────────────────┤
│ rte_mbuf │ rte_mempool │ rte_ring                              │
├────────────────────────────────────────────────────────────────┤
│ ethdev API (rte_ethdev.h)                                      │
├────────────────────────────────────────────────────────────────┤
│ Poll Mode Drivers (PMD): ixgbe │ i40e │ mlx5 │ tap │ ring      │
├────────────────────────────────────────────────────────────────┤
│ EAL — Environment Abstraction Layer                            │
│ hugepages │ lcore mgmt │ PCI init │ memory │ log │ timer       │
└────────────────────────────────────────────────────────────────┘
KERNEL: UIO/VFIO driver (tiny — only device init/interrupt)
NIC HARDWARE: Rx/Tx descriptor rings
Component | Responsibility | Key API
EAL | Hardware init, hugepage setup, lcore management, PCI device binding | rte_eal_init()
PMD | NIC-specific driver — polls hardware Rx/Tx queues directly | rte_eth_rx_burst() / rte_eth_tx_burst()
ethdev API | Hardware-agnostic interface to PMD — port configure, queue setup | rte_eth_dev_configure()
rte_mempool | Pre-allocated object pool — eliminates runtime malloc | rte_pktmbuf_pool_create()
rte_mbuf | Packet buffer structure — wraps packet data + metadata | rte_pktmbuf_mtod()
rte_ring | Lock-free FIFO circular buffer — inter-core packet passing | rte_ring_enqueue_bulk()
rte_flow | Hardware flow classification and steering (Flow API) | rte_flow_create()
rte_hash | Exact-match hash table — flow table lookups | rte_hash_lookup_bulk()
rte_lpm | Longest-prefix match — routing table | rte_lpm_lookup_bulk()
rte_acl | Multi-field ACL classification | rte_acl_classify()
rte_distributor | Work distributor — one RX core fans out to N worker cores | rte_distributor_process()

EAL — Environment Abstraction Layer

EAL is the foundation of every DPDK application. It must be the first call in main(). It initializes hugepages, discovers and probes NIC devices, pins lcores to CPUs, and provides all the primitives the rest of DPDK builds on.
// Minimal EAL initialization
int main(int argc, char *argv[])
{
    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");
    argc -= ret;   // EAL consumes its own args; remaining go to app
    argv += ret;

    unsigned nb_ports = rte_eth_dev_count_avail();
    printf("Available NIC ports: %u\n", nb_ports);

    unsigned nb_lcores = rte_lcore_count();
    printf("Configured lcores: %u\n", nb_lcores);

    // ... rest of application

    rte_eal_cleanup();
    return 0;
}

KEY EAL COMMAND-LINE FLAGS

Flag | Meaning | Example
-l | List of lcores to use (CPU threads) | -l 0-3 or -l 0,2,4,6
-n | Number of memory channels | -n 4
-a | Allowlist: probe only these PCI devices | -a 0000:03:00.0
--socket-mem | Hugepage memory per NUMA socket (MB) | --socket-mem 1024,1024
--huge-dir | Path where hugepages are mounted | --huge-dir /dev/hugepages
--proc-type | Process type for multi-process DPDK | --proc-type=primary
--file-prefix | Shared memory prefix (multi-process) | --file-prefix=myapp
--log-level | Component log verbosity | --log-level=pmd:8
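Flags combine on one command line, with `--` separating EAL arguments from application arguments — exactly the split rte_eal_init() performs when it returns the number of args it consumed. A sketch (myapp and --app-arg are placeholders, not real binaries/flags):

```shell
# Hypothetical invocation: 4 lcores, 4 memory channels, one allowed NIC,
# 1 GB of hugepage memory per NUMA socket; everything after -- goes to the app.
./myapp -l 0-3 -n 4 -a 0000:03:00.0 --socket-mem 1024,1024 -- --app-arg
```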

LCORE CONCEPTS

lcores vs Physical CPUs

An lcore is DPDK's logical core — it maps 1:1 to a hardware CPU thread (including hyperthreads). EAL pins each lcore to its CPU with pthread_setaffinity_np() at startup, so the OS scheduler can never migrate it. The main lcore (the first in the -l list, typically lcore 0) runs the code after rte_eal_init() returns. Worker lcores are launched with rte_eal_remote_launch().
// Enumerate and query lcores
unsigned lcore_id;
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    unsigned socket = rte_lcore_to_socket_id(lcore_id);
    unsigned cpu = rte_lcore_to_cpu_id(lcore_id);
    printf("lcore %u → CPU %u on socket %u\n", lcore_id, cpu, socket);
}

// Launch worker function on each lcore
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    rte_eal_remote_launch(worker_loop, NULL, lcore_id);
}
rte_eal_mp_wait_lcore();   // wait for all workers to finish

PCIe Device Binding — Taking NIC from Kernel

Before DPDK can use a NIC, that NIC must be unbound from its kernel driver and bound to a DPDK-compatible driver. This is how DPDK takes ownership of the NIC away from the kernel.

BINDING WORKFLOW

Driver | Mode | Security | Requirement | Use Case
vfio-pci | IOMMU-protected DMA | Safe — IOMMU blocks unauthorized DMA | IOMMU enabled in BIOS + kernel (intel_iommu=on) | Production — recommended
uio_pci_generic | No IOMMU | Unsafe — NIC can DMA anywhere | No IOMMU required | Dev/test only
igb_uio | No IOMMU | Unsafe | Out-of-tree kernel module | Legacy — avoid
🆕 Mellanox/NVIDIA mlx4/mlx5 Exception: These NICs do not require unbinding from the kernel driver. They use a bifurcated driver model — the kernel driver (mlx5_core) handles management traffic and control operations, while the DPDK PMD gets dedicated hardware queues for fast-path traffic. Both coexist on the same NIC simultaneously. This is what the Jio SASE-DP deployment uses.

IOMMU & IOVA

IOVA — How NIC DMA Addresses Host Memory
CPU Virtual Address  0x7F3A00001080  ← what your C code uses
Physical Address     0x200001080     ← actual silicon location
IOVA                 0x200001080     ← what the NIC DMA engine uses

VFIO mode: the IOMMU sits between the PCIe bus and RAM
  DPDK registers allowed DMA regions with the IOMMU at startup
  Unregistered DMA attempts → BLOCKED by the IOMMU
  Kernel memory is safe even if the NIC has a bug

How DPDK manages IOVA:
① EAL allocates hugepages (physically contiguous 2MB frames)
② mmap() maps hugepages into the process virtual address space
③ rte_mem_virt2iova() converts virtual ↔ IOVA for any hugepage address
④ Rx descriptor pre-filled: desc[i].buf_addr = IOVA of an empty mbuf
⑤ NIC DMA reads the descriptor → writes the packet to that IOVA
⑥ PMD converts back: mbuf ptr = rte_mem_iova2virt(desc[i].buf_addr)
⚠️ Why IOVAs must be physically contiguous: NIC DMA writes a packet as one contiguous PCIe burst. 2MB hugepages guarantee physical contiguity within each page — safe for DMA. Normal 4KB pages may not be contiguous in physical RAM — never use for DMA buffers.

Q: What is the main overhead DPDK eliminates?

The interrupt per packet and associated context switch. At 100Gbps/64B there are ~148 Mpps — one interrupt per packet would saturate the CPU just handling interrupts. DPDK disables interrupts and uses polling (PMD), so the CPU spends 100% of its time processing packets, not handling interrupts.

Q: What is NAPI and why isn't it enough?

NAPI (New API) switches from interrupt-driven to polling mode under high load, reducing interrupt overhead. But it still requires sk_buff allocation per packet, full protocol stack traversal, memory copies to user space, and context switches on recv(). DPDK eliminates all of these — NAPI only eliminates one.

Q: How many memory copies does Linux do vs DPDK?

Linux: DMA → kernel ring buffer → sk_buff → user-space buffer = 2–3 copies.
DPDK: NIC DMA writes directly into user-space hugepage mbuf = 0 copies (zero copy).

Q: What is the packet budget at 100Gbps with 64-byte packets?

~148 million packets per second → ~6.7 nanoseconds per packet. A single L3 cache miss (~40 cycles @ 3 GHz = ~13 ns) already exceeds this. This is why hugepages, NUMA alignment, and cache-conscious programming are mandatory in DPDK — not optional.

Q: What is a bifurcated driver model and which NIC uses it?

Mellanox/NVIDIA ConnectX NICs (mlx4/mlx5) support bifurcation: the NIC is bound to both the kernel driver (mlx5_core) and the DPDK PMD simultaneously. The kernel driver handles management traffic (ARP, ICMP, control). The DPDK PMD gets dedicated hardware queues for fast-path traffic. This allows DPDK and kernel networking on the same NIC without rebinding — a key operational advantage in production deployments.

Q: What does rte_eal_init() do?

It: (1) parses EAL command-line arguments, (2) mounts and allocates hugepages on configured NUMA sockets, (3) probes PCI devices and initializes PMD drivers, (4) pins lcores to physical CPUs via pthread_setaffinity_np(), (5) initializes the memory allocator, log system, and timer subsystem. It must be the first call in main(). Returns the number of args consumed — the rest belong to the application.

Q: Why does DPDK require hugepages?

Two reasons: (1) TLB efficiency — 2MB pages mean far fewer TLB entries, dramatically reducing TLB misses on the hot path. (2) DMA stability — hugepages are pinned (mlock'd) so they cannot be swapped out. NIC DMA needs fixed physical addresses (IOVAs). If a page is swapped out, its IOVA is stale — the NIC would write to wrong/freed memory. Hugepages guarantee the IOVA is always valid.
🔥 Lab 1: DPDK Hello World + EAL Probe

Build the standard DPDK helloworld sample and instrument it to probe every EAL primitive.

1. Set up hugepages: echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2. Mount hugetlbfs: mount -t hugetlbfs none /dev/hugepages
3. Bind the NIC to VFIO: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
4. Build helloworld: cd $DPDK_BUILD/examples/helloworld && make
5. Run with 4 lcores: ./helloworld -l 0-3 -n 4
6. Add probe code: print the lcore count, NUMA socket per lcore, available port count, and hugepage memory per socket
7. Verify IOVA: call rte_mem_virt2iova() on a stack variable — expect RTE_BAD_IOVA (the stack is not hugepage memory). Call it on a hugepage-allocated buffer — expect a valid IOVA.
🔥 Lab 2: Skeleton NIC Receiver (no mbuf pool yet)

Before setting up full rx/tx queues: use dpdk-devbind.py to cycle bind/unbind a NIC and observe how it disappears from ip link. Then probe the NIC capabilities.

1. Run ip link — note the NIC interface name (e.g. eth1)
2. Bind to vfio-pci: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
3. Run ip link again — eth1 is gone from the OS view
4. In C: call rte_eth_dev_info_get(0, &info) and print driver_name, max_rx_queues, max_tx_queues, rx_desc_lim.nb_max
5. Restore: dpdk-devbind.py --bind=ixgbe 0000:03:00.0 — eth1 reappears
