DPDK MASTERY · PHASE 1 OF 3 · MODULE A
Foundation, Architecture & EAL
Why DPDK exists · Full software stack · Environment Abstraction Layer · PCIe device binding
Ch 1 — Why DPDK
Ch 2 — Architecture
Ch 3 — EAL Deep Dive
C · Linux · PCIe · VFIO
Weeks 1–2
⚡ The Core Problem
DPDK (Data Plane Development Kit) exists because the Linux kernel packet-processing path — perfectly adequate for general-purpose computing — is fundamentally too slow for line-rate NF processing at 10G, 25G, and 100G speeds. It eliminates five categories of overhead: interrupts, allocation, copies, context switches, and protocol stack traversal.

📌 Packet budget at 100Gbps / 64-byte frames: ~148 million packets per second → ~6.7 nanoseconds per packet. A single L3 cache miss (~40 cycles @ 3 GHz ≈ 13 ns) already exceeds this budget. Every design choice in DPDK is shaped by this constraint.
OVERHEAD BREAKDOWN
| Overhead | Description | Cost | DPDK Solution |
|---|---|---|---|
| Hardware interrupt | CPU suspends task, saves state, runs ISR | ~1–5 µs + cache pollution | Interrupts disabled — CPU polls NIC (PMD) |
| softirq scheduling | ISR → NET_RX_SOFTIRQ deferred | Unpredictable latency | No softirq — polling loop runs constantly |
| sk_buff allocation | Per-packet kmalloc from slab allocator | ~100–200 ns / packet | Pre-allocated mbuf pool — zero runtime alloc |
| Protocol stack traversal | Full IP/TCP processing even if the NF doesn't need it | Many cache misses | App implements only what it needs |
| Memory copies | DMA ring → sk_buff → user buffer | 2–4 memcpy / packet | NIC DMA writes directly to user-space hugepage mbuf |
| Context switch | Kernel → user space on recv() | ~1–10 µs | No syscalls in data path — pure user space |
| TLB pressure | 4KB pages → many TLB entries → frequent misses | 10–100s of cycles / miss | 2MB/1GB hugepages — far fewer TLB entries |
NAPI — Good but Not Enough
NAPI (New API) switches the kernel from interrupt-driven to polling mode under high load, reducing the interrupt rate. But it still requires sk_buff allocation per packet, full protocol stack traversal, memory copies to user space, and context switches on recv(). DPDK eliminates all of these — NAPI eliminates only one.

LINUX KERNEL PACKET PATH — 9 STAGES
Linux Kernel Packet Path — Full Chain
① Packet arrives on wire
→ NIC DMA writes it to ring buffer in kernel memory
② NIC raises hardware interrupt
→ CPU suspends current task
③ CPU runs ISR (Interrupt Service Routine)
→ Quick acknowledgement, schedules softirq
④ NET_RX_SOFTIRQ scheduled
→ Deferred processing (unpredictable latency)
⑤ NAPI poll loop runs
→ Pulls packets from NIC ring into sk_buff structures
⑥ sk_buff travels up the protocol stack
→ Ethernet → IP → TCP/UDP (many function calls, cache misses)
⑦ Packet data copied into socket receive buffer
→ First memory copy
⑧ Application calls recv() / read()
→ Context switch to user space (~1–10 µs)
⑨ Packet data copied from socket buffer into application buffer
→ Second memory copy
Total: 2+ memory copies · 1+ context switch · 1 interrupt · 1 softirq per packet
sk_buff — The Kernel Packet Structure
The sk_buff is the Linux kernel's equivalent of DPDK's rte_mbuf. Every packet is wrapped in an sk_buff.
- Allocated per packet at interrupt time — per-packet malloc
- Contains: head/data/tail/end pointers, network/transport header pointers, protocol info, device pointer, checksum fields
- Supports cloning and reference counting — complex lifecycle with overhead
- Size: 200+ bytes of metadata overhead per packet
| Comparison Point | Linux sk_buff | DPDK rte_mbuf |
|---|---|---|
| Allocation | Per-packet kmalloc at interrupt time | Pre-allocated in pool at startup — zero runtime alloc |
| Memory location | Kernel heap (4KB pages, swappable) | Hugepage RAM (2MB/1GB, pinned, NUMA-local) |
| NIC DMA target | Kernel ring buffer → copied to sk_buff | Directly into mbuf data buffer — zero copy |
| User-space access | Requires syscall + copy | Direct pointer — no syscall, no copy |
| Metadata size | 200+ bytes | ~128 bytes (rte_mbuf header) |
DPDK PACKET PATH — ZERO OVERHEAD
DPDK Fast Path — The Canonical Loop
NIC Rx Queue (hardware ring in hugepage memory)
↓
NIC DMA writes packet directly into pre-allocated mbuf
→ ZERO CPU involvement · ZERO kernel involvement
PMD polls ring — no interrupt, no softirq, no context switch
rte_eth_rx_burst(port, queue, mbufs[], burst_size)
→ returns batch of received mbufs
↓
Application processes packets (modify headers, lookup, forward)
↓
rte_eth_tx_burst(port, queue, mbufs[], n)
→ NIC DMA reads from mbuf and sends on wire
↓
rte_pktmbuf_free(mbuf)
→ return mbuf to pool (no free — just reset head pointer)
0 interrupts · 0 copies · 0 malloc/free · 0 context switches
✅ The canonical DPDK main loop in C:

```c
// Minimal DPDK fast-path polling loop
struct rte_mbuf *pkts[BURST_SZ];

while (1) {
    uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SZ);
    for (uint16_t i = 0; i < nb_rx; i++)
        process_packet(pkts[i]);        // modify, forward, drop

    uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_rx);
    // Free any packets the Tx ring couldn't accept
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(pkts[i]);
}
```
When NOT to Use DPDK
- Low packet rate applications — the dedicated polling CPU is wasted
- Need full TCP/IP stack — DPDK has no kernel TCP (VPP or mTCP fill this gap)
- Hardware without PMD support — older or obscure NICs
- Rapid prototyping — DPDK has significant setup complexity
DPDK SOFTWARE STACK
DPDK Software Stack
┌─────────────────────────────────────────────────────────────────┐
│ USER APPLICATION (NF / SASE-DP) │
│ SASE URL Filter │ Blaze Broker │ MRI Resolver │ NGFW │
├──────────────┬────────────┬──────────────┬───────────────────────┤
│ rte_flow │ rte_acl │ rte_hash │ rte_lpm │
│ rte_timer │ rte_meter │ rte_sched │ rte_distributor │
├──────────────┴────────────┴──────────────┴───────────────────────┤
│ rte_mbuf │ rte_mempool │ rte_ring │
├─────────────────────────────────────────────────────────────────┤
│ ethdev API (rte_ethdev.h) │
├─────────────────────────────────────────────────────────────────┤
│ Poll Mode Drivers (PMD): ixgbe │ i40e │ mlx5 │ tap │ ring │
├─────────────────────────────────────────────────────────────────┤
│ EAL — Environment Abstraction Layer │
│ hugepages │ lcore mgmt │ PCI init │ memory │ log │ timer │
└─────────────────────────────────────────────────────────────────┘
KERNEL: UIO/VFIO driver (tiny — only device init/interrupt)
NIC HARDWARE: Rx/Tx descriptor rings
| Component | Responsibility | Key API |
|---|---|---|
| EAL | Hardware init, hugepage setup, lcore management, PCI device binding | rte_eal_init() |
| PMD | NIC-specific driver — polls hardware Rx/Tx queues directly | rte_eth_rx_burst() / rte_eth_tx_burst() |
| ethdev API | Hardware-agnostic interface to PMD — port configure, queue setup | rte_eth_dev_configure() |
| rte_mempool | Pre-allocated object pool — eliminates runtime malloc | rte_pktmbuf_pool_create() |
| rte_mbuf | Packet buffer structure — wraps packet data + metadata | rte_pktmbuf_mtod() |
| rte_ring | Lock-free FIFO circular buffer — inter-core packet passing | rte_ring_enqueue_bulk() |
| rte_flow | Hardware flow classification and steering (Flow API) | rte_flow_create() |
| rte_hash | Exact-match hash table — flow table lookups | rte_hash_lookup_bulk() |
| rte_lpm | Longest-prefix match — routing table | rte_lpm_lookup_bulk() |
| rte_acl | Multi-field ACL classification | rte_acl_classify() |
| rte_distributor | Work distributor — one RX core fans out to N worker cores | rte_distributor_process() |
EAL — Environment Abstraction Layer
EAL is the foundation of every DPDK application. rte_eal_init() must be the first call in main(). It initializes hugepages, discovers and probes NIC devices, pins lcores to CPUs, and provides all the primitives the rest of DPDK builds on.

```c
// Minimal EAL initialization
int main(int argc, char *argv[]) {
    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");
    argc -= ret;    // EAL consumes its own args; remaining go to app
    argv += ret;

    unsigned nb_ports = rte_eth_dev_count_avail();
    printf("Available NIC ports: %u\n", nb_ports);

    unsigned nb_lcores = rte_lcore_count();
    printf("Configured lcores: %u\n", nb_lcores);

    // ... rest of application

    rte_eal_cleanup();
    return 0;
}
```
KEY EAL COMMAND-LINE FLAGS
| Flag | Meaning | Example |
|---|---|---|
| -l | List of lcores to use (CPU threads) | -l 0-3 or -l 0,2,4,6 |
| -n | Number of memory channels | -n 4 |
| -a | Allowlist: bind only these PCI devices | -a 0000:03:00.0 |
| --socket-mem | Hugepage memory per NUMA socket (MB) | --socket-mem 1024,1024 |
| --huge-dir | Path where hugepages are mounted | --huge-dir /dev/hugepages |
| --proc-type | Process type for multi-process DPDK | --proc-type=primary |
| --file-prefix | Shared memory prefix (multi-process) | --file-prefix=myapp |
| --log-level | Component log verbosity | --log-level=pmd:8 |
LCORE CONCEPTS
lcores vs Physical CPUs
An lcore is DPDK's logical core — it maps 1:1 to a hardware CPU thread (including hyperthreads). EAL pins each lcore to a CPU using pthread_setaffinity_np() at startup, preventing OS scheduler migration. The main lcore (lcore 0 by default) runs after rte_eal_init() returns. Worker lcores are launched with rte_eal_remote_launch().
```c
// Enumerate and query lcores
unsigned lcore_id;
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    unsigned socket = rte_lcore_to_socket_id(lcore_id);
    unsigned cpu = rte_lcore_to_cpu_id(lcore_id);
    printf("lcore %u → CPU %u on socket %u\n", lcore_id, cpu, socket);
}

// Launch worker function on each lcore
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    rte_eal_remote_launch(worker_loop, NULL, lcore_id);
}
rte_eal_mp_wait_lcore();    // wait for all workers to finish
```
PCIe Device Binding — Taking NIC from Kernel
Before DPDK can use a NIC, that NIC must be unbound from its kernel driver and bound to a DPDK-compatible driver. This is how DPDK takes ownership of the NIC away from the kernel.

BINDING WORKFLOW
1. Default state: NIC bound to kernel driver (ixgbe, ice, mlx5_core) → kernel uses NIC for eth0, visible to ip link
2. Unbind from kernel: dpdk-devbind.py --unbind 0000:03:00.0
3. Bind to VFIO-PCI: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
4. DPDK state: NIC invisible to ip link / ifconfig. PMD accesses NIC registers via mmap. DPDK owns all queues.
| Driver | Mode | Security | Requirement | Use Case |
|---|---|---|---|---|
| vfio-pci | IOMMU-protected DMA | Safe — IOMMU blocks unauthorized DMA | IOMMU enabled in BIOS + kernel (intel_iommu=on) | Production — recommended |
| uio_pci_generic | No IOMMU | Unsafe — NIC can DMA anywhere | No IOMMU required | Dev/test only |
| igb_uio | No IOMMU | Unsafe | Out-of-tree kernel module | Legacy — avoid |
🆕 Mellanox/NVIDIA mlx4/mlx5 Exception: These NICs do not require unbinding from the kernel driver. They use a bifurcated driver model — the kernel driver (mlx5_core) handles management traffic and control operations, while the DPDK PMD gets dedicated hardware queues for fast-path traffic. Both coexist on the same NIC simultaneously. This is what the Jio SASE-DP deployment uses.

IOMMU & IOVA
IOVA — How NIC DMA Addresses Host Memory
CPU Virtual Address 0x7F3A00001080 ← what your C code uses
Physical Address 0x200001080 ← actual silicon location
IOVA 0x200001080 ← what NIC DMA engine uses
VFIO mode: IOMMU sits between PCIe bus and RAM
DPDK registers allowed DMA regions with IOMMU at startup
Unregistered DMA attempts → BLOCKED by IOMMU
Kernel memory is safe even if NIC has a bug
How DPDK manages IOVA:
① EAL allocs hugepages (physically contiguous 2MB frames)
② mmap() maps hugepages into process virtual address space
③ rte_mem_virt2iova() converts virtual ↔ IOVA for any hugepage addr
④ Rx descriptor pre-filled: desc[i].buf_addr = IOVA of empty mbuf
⑤ NIC DMA reads descriptor → writes packet to that IOVA
⑥ PMD converts: mbuf ptr = rte_mem_iova2virt(desc[i].buf_addr)
⚠️ Why IOVAs must be physically contiguous: NIC DMA writes a packet as one contiguous PCIe burst. 2MB hugepages guarantee physical contiguity within each page — safe for DMA. Normal 4KB pages may not be contiguous in physical RAM — never use for DMA buffers.
Q: What is the main overhead DPDK eliminates?
The interrupt per packet and the associated context switch. At 100Gbps/64B there are ~148 Mpps — one interrupt per packet would saturate the CPU just handling interrupts. DPDK disables interrupts and uses polling (PMD), so the CPU spends 100% of its time processing packets, not handling interrupts.

Q: What is NAPI and why isn't it enough?
NAPI (New API) switches from interrupt-driven to polling mode under high load, reducing interrupt overhead. But it still requires sk_buff allocation per packet, full protocol stack traversal, memory copies to user space, and context switches on recv(). DPDK eliminates all of these — NAPI eliminates only one.

Q: How many memory copies does Linux do vs DPDK?

Linux: DMA → kernel ring buffer → sk_buff → user-space buffer = 2–3 copies. DPDK: NIC DMA writes directly into the user-space hugepage mbuf = 0 copies (zero copy).
Q: What is the packet budget at 100Gbps with 64-byte packets?
~148 million packets per second → ~6.7 nanoseconds per packet. A single L3 cache miss (~40 cycles @ 3 GHz ≈ 13 ns) already exceeds this. This is why hugepages, NUMA alignment, and cache-conscious programming are mandatory in DPDK — not optional.

Q: What is a bifurcated driver model and which NIC uses it?

Mellanox/NVIDIA ConnectX NICs (mlx4/mlx5) support bifurcation: the NIC is bound to both the kernel driver (mlx5_core) and the DPDK PMD simultaneously. The kernel driver handles management traffic (ARP, ICMP, control). The DPDK PMD gets dedicated hardware queues for fast-path traffic. This allows DPDK and kernel networking on the same NIC without rebinding — a key operational advantage in production deployments.

Q: What does rte_eal_init() do?

It: (1) parses EAL command-line arguments, (2) mounts and allocates hugepages on configured NUMA sockets, (3) probes PCI devices and initializes PMD drivers, (4) pins lcores to physical CPUs via pthread_setaffinity_np(), (5) initializes the memory allocator, log system, and timer subsystem. It must be the first call in main(). It returns the number of args consumed — the rest belong to the application.
Q: Why does DPDK require hugepages?
Two reasons: (1) TLB efficiency — 2MB pages mean far fewer TLB entries, dramatically reducing TLB misses on the hot path. (2) DMA stability — hugepages are pinned (mlock'd) so they cannot be swapped out. NIC DMA needs fixed physical addresses (IOVAs). If a page is swapped out, its IOVA is stale — the NIC would write to wrong/freed memory. Hugepages guarantee the IOVA is always valid.

🔥 Lab 1: DPDK Hello World + EAL Probe
Build the standard DPDK helloworld sample and instrument it to probe every EAL primitive.
1. Setup hugepages: echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2. Mount hugetlbfs: mount -t hugetlbfs none /dev/hugepages
3. Bind NIC to VFIO: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
4. Build helloworld: cd $DPDK_BUILD/examples/helloworld && make
5. Run with 4 lcores: ./helloworld -l 0-3 -n 4
6. Add probe code: print lcore count, NUMA socket per lcore, available port count, hugepage memory per socket
7. Verify IOVA: call rte_mem_virt2iova() on a stack variable — expect RTE_BAD_IOVA (the stack is not hugepage-backed). Call it on a hugepage-allocated buffer — expect a valid IOVA.

🔥 Lab 2: Skeleton NIC Receiver (no mbuf pool yet)
Before setting up full rx/tx queues: use dpdk-devbind.py to cycle bind/unbind a NIC and observe how it disappears from ip link. Then probe the NIC capabilities.
1. Run ip link — note the NIC interface name (e.g. eth1)
2. Bind to vfio-pci: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
3. Run ip link again — eth1 is gone from the OS view
4. In C: call rte_eth_dev_info_get(0, &info) and print driver_name, max_rx_queues, max_tx_queues, rx_desc_lim.nb_max
5. Restore: dpdk-devbind.py --bind=ixgbe 0000:03:00.0 — eth1 reappears

MASTERY CHECKLIST
- Can state the seven categories of Linux kernel packet path overhead and DPDK's solution for each
- Can explain why NAPI is not enough
- Can draw the DPDK software stack from EAL to user application
- Can explain what rte_eal_init() does and why it must be first
- Can bind a NIC to vfio-pci and explain what changes in the kernel view
- Can explain IOVA, why hugepages must be used for DMA, and what happens if a page is swapped out
- Can explain the bifurcated driver model and which NICs use it
- Can enumerate lcores, query their NUMA socket, and launch a worker function