DPDK MASTERY · PHASE 1 OF 3 · MODULE B
Hugepages, mempool & mbuf
Hugepage memory model · IOVA · rte_mempool internals · rte_mbuf anatomy · chained mbufs
Ch 4 — Hugepages & Memory · Ch 5 — rte_mempool · Ch 6 — rte_mbuf · C · DMA · NUMA · Weeks 3–5

Why Hugepages Are Mandatory in DPDK

Two orthogonal requirements drive hugepage usage: (1) DMA stability — hugepages are pinned (mlock'd) so the NIC's IOVA is always valid; (2) TLB efficiency — 2MB pages mean 512× fewer TLB entries than 4KB pages, dramatically cutting TLB miss rate on the hot packet path.
Property                  | Normal 4 KB pages                  | DPDK hugepages (2 MB)
--------------------------|------------------------------------|-------------------------------------------
Page size                 | 4 KB                               | 2 MB (or 1 GB)
Pinned in RAM             | No — OS can swap to disk           | Yes — mlock'd at allocation, never swapped
Physical address stable   | No — IOVA becomes stale after swap | Yes — IOVA always valid for NIC DMA
TLB entries for 1 GB data | 262,144 entries                    | 512 entries (512× fewer misses)
Page fault on access      | Possible — ~10 ms disk I/O         | Never — pages pre-faulted at EAL init
DMA safety                | Unsafe — may be freed under NIC    | Safe — physical addr never changes
⚠️ The catastrophic swap scenario: NIC DMA uses physical addresses (IOVAs). If a page is swapped out, the physical frame is freed. The NIC's IOVA is now stale — it writes to wrong or freed memory. Even without corruption, one swap = ~10 ms pause. At 100G, the ring fills in ~80 µs. 10 ms = millions of dropped packets.

HUGEPAGE SETUP COMMANDS

# Check available hugepage sizes
ls /sys/kernel/mm/hugepages/

# Allocate 1024 × 2MB hugepages (= 2 GB)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Mount hugetlbfs (if not already mounted)
mount -t hugetlbfs none /dev/hugepages

# Verify allocation
grep Huge /proc/meminfo

# DPDK EAL: use --socket-mem to specify per-NUMA-socket allocation
./my_app -l 0-3 -n 4 --socket-mem 1024,1024   # 1 GB on socket 0, 1 GB on socket 1
📌 1GB hugepages: For very large mempools or when 2MB pages still have too many TLB entries. Requires kernel boot parameter hugepagesz=1G hugepages=4. EAL will prefer 1GB pages if available.

VIRTUAL MEMORY — PROCESS ISOLATION

Virtual Memory vs Physical Memory — Process Isolation

  Process A (virtual)      Process B (virtual)      Physical RAM
  ▲ HIGH ADDRESS           ▲ HIGH ADDRESS
  kernel space             kernel space             kernel code (shared)
  stack  0x7FFF…           stack  0x7FFF…           frame 1024 ← A stack
  heap   0x0810…           heap   0x0810…           frame 2048 ← B stack
  data   0x0805…           data   0x0805…           frame 5632 ← A heap
  code   0x0804…           code   0x0804…           frame 8192 ← B heap
  ▼ 0x0000                 ▼ 0x0000                 DPDK hugepages: PINNED —
                                                    never swapped — fixed
                                                    IOVAs for NIC DMA

KEY INSIGHT: Both processes may use virtual address 0x08051000.
The MMU translates: A → physical frame 5632 | B → physical frame 8192.
Same virtual address. Completely different RAM. Complete isolation.

Virtual Memory Segments

Every process has: code (text, read-only), data (BSS + initialized globals), heap (grows up via malloc/mmap), stack (grows down, per-thread), and kernel space (top of virtual address space, Ring 0 only). DPDK hugepage allocations live in a separate mmap'd region, pinned against eviction.

NUMA MEMORY TOPOLOGY

NUMA — Non-Uniform Memory Access

  Socket 0                              Socket 1
  CPU cores 0-7                         CPU cores 8-15
  L1/L2/L3 cache                        L1/L2/L3 cache
  Local RAM (DDR channels 0-1)          Local RAM (DDR channels 2-3)
    ↑ low latency (~60 ns)                ↑ low latency (~60 ns)
  NIC port 0 (PCIe)                     NIC port 1 (PCIe)
    ↑ DMA into socket 0 RAM               ↑ DMA into socket 1 RAM

Cross-NUMA access (socket 0 CPU → socket 1 RAM): ~120 ns — 2× slower!

DPDK rule: ALWAYS allocate the mempool on the same NUMA socket as the NIC.

  rte_pktmbuf_pool_create("POOL", N, CACHE_SZ, 0, sz,
                          rte_eth_dev_socket_id(port));
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          returns the NIC's socket — use it!

rte_mempool — The Allocation Eliminator

rte_mempool pre-allocates all packet buffers at startup. The hot data path never calls malloc/free — it calls rte_mempool_get() (which pops from a lock-free ring or per-lcore cache) and rte_mempool_put() (which pushes back). This is what enables zero-allocation-overhead packet processing.

MEMPOOL INTERNAL ARCHITECTURE

rte_mempool Architecture

              rte_mempool header
  ┌─────────────────────────────────────┐
  │ name, size, elt_size, cache_size    │
  │ count = total_elts - in_use_count   │
  └───────────────┬─────────────────────┘
                  │
  ┌───────────────┼───────────────┐
  ▼               ▼               ▼
  Per-lcore       Per-lcore       Per-lcore
  cache           cache           cache
  (lcore 0)       (lcore 1)       (lcore 2)
  up to           up to           up to
  cache_size      cache_size      cache_size
  objects (stack) objects (stack) objects (stack)
  │               │               │
  └───────────────┴───────────────┘
                  │ (cache miss → fallback)
                  ▼
        Common pool (rte_ring)
        Lock-free MPMC ring
        Contains all remaining objects

ALLOCATION PATH

// Create a packet mempool
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
    "MBUF_POOL",                    // unique name
    8192,                           // total number of mbufs
    256,                            // per-lcore cache size (objects)
    0,                              // private data size per element
    RTE_MBUF_DEFAULT_BUF_SIZE,      // buffer size (2048 dataroom + 128 headroom)
    rte_eth_dev_socket_id(port_id)  // NUMA socket — MUST match NIC
);
if (!mbuf_pool)
    rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");

// Manual get/put (for non-packet objects)
void *obj;
rte_mempool_get(pool, &obj);    // borrow object
/* use obj */
rte_mempool_put(pool, obj);     // return object

// Bulk operations (preferred — reduces ring contention)
void *objs[32];
rte_mempool_get_bulk(pool, objs, 32);
rte_mempool_put_bulk(pool, objs, 32);
Pool size   | Use case                       | Notes
------------|--------------------------------|------------------------------------------------
1024 – 4096 | Dev / light load               | Small — may exhaust quickly under burst
8192        | Standard — most DPDK examples  | Good balance of memory vs exhaustion risk
65536+      | High burst / 100G line rate    | Large memory footprint but rarely exhausts under normal traffic
⚠️ Pool size should be a power of 2 minus 1 (e.g. 8191, not 8192) for optimal memory use — the backing rte_ring is sized to the next power of two, and one ring slot is always left unused, so a ring of 2^q slots stores at most 2^q − 1 objects. Requesting 8192 objects forces a 16384-slot ring; requesting 8191 fits exactly into 8192 slots. The API accepts any N and sizes the ring internally — the pool still works, it just wastes memory. Common mistake: using 8192 when you mean 8191.

rte_mbuf — The Packet Carrier

rte_mbuf is DPDK's equivalent of the kernel's sk_buff. Every received packet is wrapped in an mbuf. It has a fixed header (metadata) followed by a contiguous data buffer (where packet bytes live). The key design decision: metadata and packet data sit adjacent in the same hugepage allocation, so prefetching the header also warms the start of the packet data.

MBUF MEMORY LAYOUT

rte_mbuf Memory Layout (one hugepage allocation)

┌─────────────────────────────────────────────────────────────────┐
│ rte_mbuf header (~128 bytes)                                    │
│   buf_addr  ───────────────────────────────────────────────►    │
│   buf_iova  (physical address for NIC DMA)                      │
│   data_off  (offset from buf_addr to first packet byte)         │
│   pkt_len   (total packet length in bytes)                      │
│   data_len  (data length in this segment)                       │
│   nb_segs   (number of segments in chain)                       │
│   port      (Rx port index)                                     │
│   ol_flags  (offload flags: cksum, vlan, rss, etc.)             │
│   hash.rss  (RSS hash value from NIC hardware)                  │
│   vlan_tci  (VLAN tag if stripped by NIC)                       │
│   next      (pointer to next segment, or NULL)                  │
│   pool      (pointer back to mempool for free)                  │
│   refcnt    (reference count — for cloning)                     │
├─────────────────────────────────────────────────────────────────┤
│ Headroom (RTE_PKTMBUF_HEADROOM = 128 bytes)                     │
│   ← reserved for prepending headers                             │
├──────── ← buf_addr + data_off (= rte_pktmbuf_mtod result) ──────┤
│ [ Ethernet header (14B) │ IP (20B) │ TCP (20B) │ … ]            │
│ Packet data (data_len bytes)                                    │
└─────────────────────────────────────────────────────────────────┘
Data room: 2048 bytes (RTE_MBUF_DEFAULT_BUF_SIZE = 2048 + 128 headroom)

KEY MBUF MACROS & FIELDS

// Get pointer to packet data (most common operation)
struct rte_ether_hdr *eth = rte_pktmbuf_mtod(mbuf, struct rte_ether_hdr *);
// Expands to: (type)((char *)mbuf->buf_addr + mbuf->data_off)
// — a direct pointer into hugepage memory

// Access packet at a byte offset
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(mbuf, struct rte_ipv4_hdr *,
                                                  sizeof(struct rte_ether_hdr));

// Packet length
uint32_t total_len = mbuf->pkt_len;   // total bytes across all segments
uint16_t seg_len   = mbuf->data_len;  // bytes in this segment only

// Prepend a header (uses headroom)
struct rte_ether_hdr *new_eth = (struct rte_ether_hdr *)
    rte_pktmbuf_prepend(mbuf, sizeof(struct rte_ether_hdr));
// Returns NULL if no headroom available

// Append to tail
char *tail = rte_pktmbuf_append(mbuf, 4);  // add 4 bytes at end

// Remove from front (advances data_off)
rte_pktmbuf_adj(mbuf, sizeof(struct rte_ether_hdr));

// Free mbuf back to pool
rte_pktmbuf_free(mbuf);  // also frees chained segments
ol_flags bit                 | Direction | Meaning
-----------------------------|-----------|------------------------------------------------------------
RTE_MBUF_F_RX_RSS_HASH       | Rx        | NIC computed RSS hash — value in mbuf->hash.rss
RTE_MBUF_F_RX_IP_CKSUM_GOOD  | Rx        | NIC verified IP checksum — correct
RTE_MBUF_F_RX_IP_CKSUM_BAD   | Rx        | NIC verified IP checksum — bad (drop the packet)
RTE_MBUF_F_RX_VLAN           | Rx        | VLAN tag present — stripped to mbuf->vlan_tci
RTE_MBUF_F_TX_IPV4           | Tx        | IPv4 packet — required when requesting Tx IP cksum offload
RTE_MBUF_F_TX_IP_CKSUM       | Tx        | Ask NIC to compute and insert the IPv4 header checksum
RTE_MBUF_F_TX_TCP_CKSUM      | Tx        | Ask NIC to compute and insert the TCP checksum

Chained mbufs — For Jumbo Frames

A single mbuf data buffer is 2048 bytes by default. Jumbo frames (up to 9000 bytes for 9K MTU) require chained mbufs — a linked list of mbufs where mbuf->next points to the continuation segment. The first segment's pkt_len holds the total, nb_segs holds the count.
Chained mbuf Layout (jumbo frame example: 5000 bytes)

 Segment 0 (head)           Segment 1                  Segment 2
┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
│ pkt_len  = 5000   │      │ pkt_len  = 0      │      │ pkt_len  = 0      │
│ data_len = 1920   │      │ data_len = 1920   │      │ data_len = 1160   │
│ nb_segs  = 3      │      │ nb_segs  = 0      │      │ nb_segs  = 0      │
│ next ─────────────┼─────►│ next ─────────────┼─────►│ next = NULL       │
│ [packet data...]  │      │ [packet data...]  │      │ [packet data...]  │
└───────────────────┘      └───────────────────┘      └───────────────────┘
     1920 bytes                 1920 bytes                 1160 bytes

Total: 1920 + 1920 + 1160 = 5000 bytes
// Check if mbuf is chained
if (mbuf->nb_segs > 1) {
    // Walk the chain
    struct rte_mbuf *seg = mbuf;
    while (seg != NULL) {
        uint8_t *data = rte_pktmbuf_mtod(seg, uint8_t *);
        uint16_t len  = seg->data_len;
        /* process this segment */
        seg = seg->next;
    }
}

// Linearize (copy all segments into one buffer) — expensive, avoid on hot path
char buf[9000];
const void *p = rte_pktmbuf_read(mbuf, 0, mbuf->pkt_len, buf);
// Returns a pointer to the data (buf if a copy was needed),
// or NULL if the requested length exceeds the packet
⚠️ Most DPDK applications avoid chained mbufs on the hot path. The preferred approach is to set RTE_ETH_RX_OFFLOAD_SCATTER and handle multi-segment mbufs only in the exception path. For performance-critical NFs, configure MTU ≤ single-segment buffer size and drop/reject jumbo frames at the port level.

MBUF CLONE vs REFERENCE COUNT

rte_pktmbuf_clone() — Sharing Without Copy

Cloning creates a new mbuf header that shares the same data buffer as the original. The data buffer's reference count (refcnt) is incremented. rte_pktmbuf_free() on either the original or the clone decrements refcnt — the data buffer is only returned to the pool when refcnt reaches zero. Use case: multicast — send the same packet out multiple ports without copying the data.
// Clone for multicast (zero-copy)
struct rte_mbuf *clone = rte_pktmbuf_clone(original, pool);
// data buffer refcnt: 1 → 2 (new header, shared data buffer)

rte_eth_tx_burst(port_a, 0, &original, 1);  // refcnt 2 → 1 after tx
rte_eth_tx_burst(port_b, 0, &clone, 1);     // refcnt 1 → 0 → buffer freed

Q: Why can't DPDK use normal 4KB pages for packet buffers?

Two reasons: (1) DMA instability — 4KB pages can be swapped out by the OS at any time. The NIC's IOVA would become stale, causing DMA writes to wrong memory or segfaults. (2) TLB pressure — 1 GB of packet buffers needs 262,144 TLB entries with 4KB pages vs only 512 entries with 2MB hugepages. TLB misses on the hot path at 100G rates would dominate CPU time.

Q: What is the per-lcore cache in rte_mempool and why does it matter?

The per-lcore cache is a small, lcore-local stack of pre-fetched objects (typically 256 entries). Alloc/free to the lcore cache requires no atomic operations — it's just an array index increment/decrement. Only when the cache empties or overflows does it interact with the common pool ring (one CAS for a bulk transfer). This makes rte_mempool_get/put nearly as cheap as a stack pop on the hot path.

Q: What is rte_pktmbuf_mtod() and how does it work?

It's a macro: (type)(mbuf->buf_addr + mbuf->data_off). buf_addr is the pointer to the start of the data buffer. data_off is the byte offset to the first packet byte (defaults to RTE_PKTMBUF_HEADROOM = 128 bytes, leaving space to prepend headers). The result is a typed pointer directly into hugepage memory — no copy, no syscall.

Q: What is headroom in an mbuf and when is it used?

Headroom is a reserved region at the start of the data buffer, before the packet data. Default: 128 bytes (RTE_PKTMBUF_HEADROOM). It's used when your NF needs to prepend a header to an incoming packet — e.g., adding a VXLAN or GRE encapsulation header. Instead of copying the entire packet to a new buffer, you use rte_pktmbuf_prepend() which decrements data_off to expand into the headroom. Zero allocation, zero copy.

Q: What happens when rte_mempool runs out of objects?

rte_mempool_get() returns -ENOBUFS (non-zero). For pktmbuf pools, the PMD reports this as stats.rx_nombuf and the packet is dropped by the NIC before it reaches the application. This is a critical metric to monitor — it means the application is not returning mbufs to the pool fast enough, or the pool is undersized. Fix: increase pool size, check for mbuf leaks (tx_burst without freeing unsent packets), or reduce processing latency.

Q: What is the difference between pkt_len and data_len?

data_len: bytes of packet data in this segment only.
pkt_len: total bytes across all segments in the chain (only valid on the first segment/head mbuf).
For single-segment mbufs (the common case), both are equal. For chained mbufs (jumbo frames), pkt_len = sum of all data_len values across all segments.
🔥 Lab 3: mbuf Inspector — Decode Every Field

Create a DPDK application that receives one burst of packets and prints every mbuf field. The goal is to see the real hardware values — RSS hash, ol_flags, pkt_len — not just theoretical values.

1. Create mempool with rte_pktmbuf_pool_create() on the NIC's socket
2. Configure port: enable RTE_ETH_RX_OFFLOAD_CHECKSUM and RTE_ETH_RX_OFFLOAD_RSS_HASH
3. Receive one burst: rte_eth_rx_burst(port, 0, pkts, 32)
4. For each received mbuf print: buf_addr, buf_iova, data_off, pkt_len, data_len, nb_segs, port, ol_flags (as hex), hash.rss, vlan_tci
5. Use rte_pktmbuf_mtod() to get the Ethernet header — print src and dst MAC
6. Verify RTE_MBUF_F_RX_IP_CKSUM_GOOD is set on a valid IPv4 packet
7. Check mempool stats after: rte_mempool_avail_count(pool) — should decrease by nb_rx
8. Free all mbufs: rte_pktmbuf_free(pkts[i]) — verify avail_count is restored
🔥 Lab 4: Pool Exhaustion Experiment

Intentionally exhaust the mempool to observe the rx_nombuf counter. This teaches defensive mbuf management.

1. Create a small pool: 64 mbufs total
2. Receive packets in a loop — do not free them
3. After the pool empties: poll stats.rx_nombuf via rte_eth_stats_get() — observe it increment
4. Free all held mbufs — observe rx_nombuf stop incrementing
5. Lesson: every code path that receives mbufs MUST free them or hand them to TX. Mbuf leaks are the most common DPDK production bug.

MASTERY CHECKLIST
