DPDK MASTERY · PHASE 1 OF 3 · MODULE A
Foundation, Architecture & EAL
Why DPDK exists · Full software stack · Environment Abstraction Layer · PCIe device binding
Ch 1 — Why DPDK
Ch 2 — Architecture
Ch 3 — EAL Deep Dive
C · Linux · PCIe · VFIO
Weeks 1–2
⚡ The Core Problem
DPDK (Data Plane Development Kit) exists because the Linux kernel packet-processing path — perfectly adequate for general-purpose computing — is fundamentally too slow for line-rate NF processing at 10G, 25G, and 100G speeds. It eliminates five categories of overhead: interrupts, allocation, copies, context switches, and protocol stack traversal.

📌 Packet budget at 100Gbps / 64-byte frames: ~148 million packets per second → ~6.7 nanoseconds per packet. A single L3 cache miss (~40 cycles @ 3 GHz ≈ 13 ns) already exceeds this budget. Every design choice in DPDK is shaped by this constraint.
OVERHEAD BREAKDOWN
| Overhead | Description | Cost | DPDK Solution |
|---|---|---|---|
| Hardware interrupt | CPU suspends task, saves state, runs ISR | ~1–5 µs + cache pollution | Interrupts disabled — CPU polls NIC (PMD) |
| softirq scheduling | ISR → NET_RX_SOFTIRQ deferred | Unpredictable latency | No softirq — polling loop runs constantly |
| sk_buff allocation | Per-packet kmalloc from slab allocator | ~100–200 ns / packet | Pre-allocated mbuf pool — zero runtime alloc |
| Protocol stack traversal | Full IP/TCP processing even if the NF doesn't need it | Many cache misses | App implements only what it needs |
| Memory copies | DMA ring → sk_buff → user buffer | 2–4 memcpy / packet | NIC DMA writes directly to user-space hugepage mbuf |
| Context switch | Kernel → user space on recv() | ~1–10 µs | No syscalls in data path — pure user space |
| TLB pressure | 4KB pages → many TLB entries → frequent misses | 10–100s of cycles / miss | 2MB/1GB hugepages — far fewer TLB entries |
NAPI — Good but Not Enough
NAPI (New API) switches the kernel from interrupt-driven to polling mode under high load, reducing the interrupt rate. But it still requires sk_buff allocation per packet, full protocol stack traversal, memory copies to user space, and context switches on recv(). DPDK eliminates all of these — NAPI eliminates only one.

LINUX KERNEL PACKET PATH — 9 STAGES
Linux Kernel Packet Path — Full Chain
① Packet arrives on wire
→ NIC DMA writes it to ring buffer in kernel memory
② NIC raises hardware interrupt
→ CPU suspends current task
③ CPU runs ISR (Interrupt Service Routine)
→ Quick acknowledgement, schedules softirq
④ NET_RX_SOFTIRQ scheduled
→ Deferred processing (unpredictable latency)
⑤ NAPI poll loop runs
→ Pulls packets from NIC ring into sk_buff structures
⑥ sk_buff travels up the protocol stack
→ Ethernet → IP → TCP/UDP (many function calls, cache misses)
⑦ Packet data copied into socket receive buffer
→ First memory copy
⑧ Application calls recv() / read()
→ Context switch to user space (~1–10 µs)
⑨ Packet data copied from socket buffer into application buffer
→ Second memory copy
Total: 2+ memory copies · 1+ context switch · 1 interrupt · 1 softirq per packet
sk_buff — The Kernel Packet Structure
The sk_buff is the Linux kernel's equivalent of DPDK's rte_mbuf. Every packet is wrapped in an sk_buff.
- Allocated per packet at interrupt time — per-packet malloc
- Contains: head/data/tail/end pointers, network/transport header pointers, protocol info, device pointer, checksum fields
- Supports cloning and reference counting — complex lifecycle with overhead
- Size: 200+ bytes of metadata overhead per packet
| Comparison Point | Linux sk_buff | DPDK rte_mbuf |
|---|---|---|
| Allocation | Per-packet kmalloc at interrupt time | Pre-allocated in pool at startup — zero runtime alloc |
| Memory location | Kernel heap (4KB pages, swappable) | Hugepage RAM (2MB/1GB, pinned, NUMA-local) |
| NIC DMA target | Kernel ring buffer → copied to sk_buff | Directly into mbuf data buffer — zero copy |
| User-space access | Requires syscall + copy | Direct pointer — no syscall, no copy |
| Metadata size | 200+ bytes | ~128 bytes (rte_mbuf header) |
DPDK PACKET PATH — ZERO OVERHEAD
DPDK Fast Path — The Canonical Loop
NIC Rx Queue (hardware ring in hugepage memory)
↓
NIC DMA writes packet directly into pre-allocated mbuf
→ ZERO CPU involvement · ZERO kernel involvement
PMD polls ring — no interrupt, no softirq, no context switch
rte_eth_rx_burst(port, queue, mbufs[], burst_size)
→ returns batch of received mbufs
↓
Application processes packets (modify headers, lookup, forward)
↓
rte_eth_tx_burst(port, queue, mbufs[], n)
→ NIC DMA reads from mbuf and sends on wire
↓
rte_pktmbuf_free(mbuf)
→ return mbuf to pool (no free — just reset head pointer)
0 interrupts · 0 copies · 0 malloc/free · 0 context switches
✅ The canonical DPDK main loop in C:

```c
// Minimal DPDK fast-path polling loop
struct rte_mbuf *pkts[BURST_SZ];

while (1) {
    uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SZ);
    for (uint16_t i = 0; i < nb_rx; i++)
        process_packet(pkts[i]);        // modify, forward, drop

    uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_rx);
    // Free any packets the Tx ring couldn't accept
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(pkts[i]);
}
```
When NOT to Use DPDK
- Low packet rate applications — the dedicated polling CPU is wasted
- Need full TCP/IP stack — DPDK has no kernel TCP (VPP or mTCP fill this gap)
- Hardware without PMD support — older or obscure NICs
- Rapid prototyping — DPDK has significant setup complexity
DPDK SOFTWARE STACK
DPDK Software Stack
┌─────────────────────────────────────────────────────────────────┐
│ USER APPLICATION (NF / SASE-DP) │
│ SASE URL Filter │ Blaze Broker │ MRI Resolver │ NGFW │
├──────────────┬────────────┬──────────────┬───────────────────────┤
│ rte_flow │ rte_acl │ rte_hash │ rte_lpm │
│ rte_timer │ rte_meter │ rte_sched │ rte_distributor │
├──────────────┴────────────┴──────────────┴───────────────────────┤
│ rte_mbuf │ rte_mempool │ rte_ring │
├─────────────────────────────────────────────────────────────────┤
│ ethdev API (rte_ethdev.h) │
├─────────────────────────────────────────────────────────────────┤
│ Poll Mode Drivers (PMD): ixgbe │ i40e │ mlx5 │ tap │ ring │
├─────────────────────────────────────────────────────────────────┤
│ EAL — Environment Abstraction Layer │
│ hugepages │ lcore mgmt │ PCI init │ memory │ log │ timer │
└─────────────────────────────────────────────────────────────────┘
KERNEL: UIO/VFIO driver (tiny — only device init/interrupt)
NIC HARDWARE: Rx/Tx descriptor rings
| Component | Responsibility | Key API |
|---|---|---|
| EAL | Hardware init, hugepage setup, lcore management, PCI device binding | rte_eal_init() |
| PMD | NIC-specific driver — polls hardware Rx/Tx queues directly | rte_eth_rx_burst() / rte_eth_tx_burst() |
| ethdev API | Hardware-agnostic interface to PMD — port configure, queue setup | rte_eth_dev_configure() |
| rte_mempool | Pre-allocated object pool — eliminates runtime malloc | rte_pktmbuf_pool_create() |
| rte_mbuf | Packet buffer structure — wraps packet data + metadata | rte_pktmbuf_mtod() |
| rte_ring | Lock-free FIFO circular buffer — inter-core packet passing | rte_ring_enqueue_bulk() |
| rte_flow | Hardware flow classification and steering (Flow API) | rte_flow_create() |
| rte_hash | Exact-match hash table — flow table lookups | rte_hash_lookup_bulk() |
| rte_lpm | Longest-prefix match — routing table | rte_lpm_lookup_bulk() |
| rte_acl | Multi-field ACL classification | rte_acl_classify() |
| rte_distributor | Work distributor — one RX core fans out to N worker cores | rte_distributor_process() |
EAL — Environment Abstraction Layer
EAL is the foundation of every DPDK application. rte_eal_init() must be the first call in main(). It initializes hugepages, discovers and probes NIC devices, pins lcores to CPUs, and provides all the primitives the rest of DPDK builds on.

```c
// Minimal EAL initialization
int main(int argc, char *argv[]) {
    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");
    argc -= ret;    // EAL consumes its own args; remaining go to app
    argv += ret;

    unsigned nb_ports = rte_eth_dev_count_avail();
    printf("Available NIC ports: %u\n", nb_ports);

    unsigned nb_lcores = rte_lcore_count();
    printf("Configured lcores: %u\n", nb_lcores);

    // ... rest of application

    rte_eal_cleanup();
    return 0;
}
```
KEY EAL COMMAND-LINE FLAGS
| Flag | Meaning | Example |
|---|---|---|
| -l | List of lcores to use (CPU threads) | -l 0-3 or -l 0,2,4,6 |
| -n | Number of memory channels | -n 4 |
| -a | Allowlist: bind only these PCI devices | -a 0000:03:00.0 |
| --socket-mem | Hugepage memory per NUMA socket (MB) | --socket-mem 1024,1024 |
| --huge-dir | Path where hugepages are mounted | --huge-dir /dev/hugepages |
| --proc-type | Process type for multi-process DPDK | --proc-type=primary |
| --file-prefix | Shared memory prefix (multi-process) | --file-prefix=myapp |
| --log-level | Component log verbosity | --log-level=pmd:8 |
LCORE CONCEPTS
lcores vs Physical CPUs
An lcore is DPDK's logical core — it maps 1:1 to a hardware CPU thread (including hyperthreads). EAL pins each lcore to a CPU using pthread_setaffinity_np() at startup, preventing OS scheduler migration. The main lcore (lcore 0 by default) runs after rte_eal_init() returns. Worker lcores are launched with rte_eal_remote_launch().
```c
// Enumerate and query lcores
unsigned lcore_id;
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    unsigned socket = rte_lcore_to_socket_id(lcore_id);
    unsigned cpu = rte_lcore_to_cpu_id(lcore_id);
    printf("lcore %u → CPU %u on socket %u\n", lcore_id, cpu, socket);
}

// Launch worker function on each lcore
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    rte_eal_remote_launch(worker_loop, NULL, lcore_id);
}
rte_eal_mp_wait_lcore();    // wait for all workers to finish
```
PCIe Device Binding — Taking NIC from Kernel
Before DPDK can use a NIC, that NIC must be unbound from its kernel driver and bound to a DPDK-compatible driver. This is how DPDK takes ownership of the NIC away from the kernel.

BINDING WORKFLOW
1. Default state: NIC bound to kernel driver (ixgbe, ice, mlx5_core) → kernel uses NIC for eth0, visible to ip link
2. Unbind from kernel: dpdk-devbind.py --unbind 0000:03:00.0
3. Bind to VFIO-PCI: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
4. DPDK state: NIC invisible to ip link / ifconfig. PMD accesses NIC registers via mmap. DPDK owns all queues.
| Driver | Mode | Security | Requirement | Use Case |
|---|---|---|---|---|
| vfio-pci | IOMMU-protected DMA | Safe — IOMMU blocks unauthorized DMA | IOMMU enabled in BIOS + kernel (intel_iommu=on) | Production — recommended |
| uio_pci_generic | No IOMMU | Unsafe — NIC can DMA anywhere | No IOMMU required | Dev/test only |
| igb_uio | No IOMMU | Unsafe | Out-of-tree kernel module | Legacy — avoid |
🆕 Mellanox/NVIDIA mlx4/mlx5 Exception: These NICs do not require unbinding from the kernel driver. They use a bifurcated driver model — the kernel driver (mlx5_core) handles management traffic and control operations, while the DPDK PMD gets dedicated hardware queues for fast-path traffic. Both coexist on the same NIC simultaneously. This is what the Jio SASE-DP deployment uses.

IOMMU & IOVA
IOVA — How NIC DMA Addresses Host Memory
CPU Virtual Address 0x7F3A00001080 ← what your C code uses
Physical Address 0x200001080 ← actual silicon location
IOVA 0x200001080 ← what NIC DMA engine uses
VFIO mode: IOMMU sits between PCIe bus and RAM
DPDK registers allowed DMA regions with IOMMU at startup
Unregistered DMA attempts → BLOCKED by IOMMU
Kernel memory is safe even if NIC has a bug
How DPDK manages IOVA:
① EAL allocs hugepages (physically contiguous 2MB frames)
② mmap() maps hugepages into process virtual address space
③ rte_mem_virt2iova() converts virtual ↔ IOVA for any hugepage addr
④ Rx descriptor pre-filled: desc[i].buf_addr = IOVA of empty mbuf
⑤ NIC DMA reads descriptor → writes packet to that IOVA
⑥ PMD converts: mbuf ptr = rte_mem_iova2virt(desc[i].buf_addr)
⚠️ Why IOVAs must be physically contiguous: NIC DMA writes a packet as one contiguous PCIe burst. 2MB hugepages guarantee physical contiguity within each page — safe for DMA. Normal 4KB pages may not be contiguous in physical RAM — never use for DMA buffers.
Q: What is the main overhead DPDK eliminates?
The interrupt per packet and the associated context switch. At 100Gbps/64B there are ~148 Mpps — one interrupt per packet would saturate the CPU just handling interrupts. DPDK disables interrupts and uses polling (PMD), so the CPU spends 100% of its time processing packets, not handling interrupts.

Q: What is NAPI and why isn't it enough?
NAPI (New API) switches from interrupt-driven to polling mode under high load, reducing interrupt overhead. But it still requires sk_buff allocation per packet, full protocol stack traversal, memory copies to user space, and context switches on recv(). DPDK eliminates all of these — NAPI eliminates only one.

Q: How many memory copies does Linux do vs DPDK?

Linux: DMA → kernel ring buffer → sk_buff → user-space buffer = 2–3 copies. DPDK: NIC DMA writes directly into the user-space hugepage mbuf = 0 copies (zero copy).
Q: What is the packet budget at 100Gbps with 64-byte packets?
~148 million packets per second → ~6.7 nanoseconds per packet. A single L3 cache miss (~40 cycles @ 3 GHz ≈ 13 ns) already exceeds this. This is why hugepages, NUMA alignment, and cache-conscious programming are mandatory in DPDK — not optional.

Q: What is a bifurcated driver model and which NIC uses it?

Mellanox/NVIDIA ConnectX NICs (mlx4/mlx5) support bifurcation: the NIC is bound to both the kernel driver (mlx5_core) and the DPDK PMD simultaneously. The kernel driver handles management traffic (ARP, ICMP, control). The DPDK PMD gets dedicated hardware queues for fast-path traffic. This allows DPDK and kernel networking on the same NIC without rebinding — a key operational advantage in production deployments.

Q: What does rte_eal_init() do?

It: (1) parses EAL command-line arguments, (2) mounts and allocates hugepages on configured NUMA sockets, (3) probes PCI devices and initializes PMD drivers, (4) pins lcores to physical CPUs via pthread_setaffinity_np(), (5) initializes the memory allocator, log system, and timer subsystem. It must be the first call in main(). It returns the number of args consumed — the rest belong to the application.
Q: Why does DPDK require hugepages?
Two reasons: (1) TLB efficiency — 2MB pages mean far fewer TLB entries, dramatically reducing TLB misses on the hot path. (2) DMA stability — hugepages are pinned (mlock'd) so they cannot be swapped out. NIC DMA needs fixed physical addresses (IOVAs). If a page is swapped out, its IOVA is stale — the NIC would write to wrong/freed memory. Hugepages guarantee the IOVA is always valid.

🔥 Lab 1: DPDK Hello World + EAL Probe
Build the standard DPDK helloworld sample and instrument it to probe every EAL primitive.
1. Setup hugepages: echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2. Mount hugetlbfs: mount -t hugetlbfs none /dev/hugepages
3. Bind NIC to VFIO: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
4. Build helloworld: cd $DPDK_BUILD/examples/helloworld && make
5. Run with 4 lcores: ./helloworld -l 0-3 -n 4
6. Add probe code: print lcore count, NUMA socket per lcore, available port count, hugepage memory per socket
7. Verify IOVA: call rte_mem_virt2iova() on a stack variable — expect RTE_BAD_IOVA (the stack is not hugepage-backed). Call it on a hugepage-allocated buffer — expect a valid IOVA.

🔥 Lab 2: Skeleton NIC Receiver (no mbuf pool yet)
Before setting up full rx/tx queues: use dpdk-devbind.py to cycle bind/unbind a NIC and observe how it disappears from ip link. Then probe the NIC capabilities.
1. Run ip link — note the NIC interface name (e.g. eth1)
2. Bind to vfio-pci: dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
3. Run ip link again — eth1 is gone from the OS view
4. In C: call rte_eth_dev_info_get(0, &info) and print driver_name, max_rx_queues, max_tx_queues, rx_desc_lim.nb_max
5. Restore: dpdk-devbind.py --bind=ixgbe 0000:03:00.0 — eth1 reappears

MASTERY CHECKLIST
- Can state the seven categories of Linux kernel packet path overhead and DPDK's solution for each
- Can explain why NAPI is not enough
- Can draw the DPDK software stack from EAL to user application
- Can explain what rte_eal_init() does and why it must be first
- Can bind a NIC to vfio-pci and explain what changes in the kernel view
- Can explain IOVA, why hugepages must be used for DMA, and what happens if a page is swapped out
- Can explain the bifurcated driver model and which NICs use it
- Can enumerate lcores, query their NUMA socket, and launch a worker function