DPDK MASTERY · PHASE 1 OF 3 · MODULE B
Hugepages, mempool & mbuf
Hugepage memory model · IOVA · rte_mempool internals · rte_mbuf anatomy · chained mbufs
Ch 4 — Hugepages & Memory
Ch 5 — rte_mempool
Ch 6 — rte_mbuf
C · DMA · NUMA
Weeks 3–5
Why Hugepages Are Mandatory in DPDK
Two orthogonal requirements drive hugepage usage: (1) DMA stability — hugepages are pinned (mlock'd) so the NIC's IOVA is always valid; (2) TLB efficiency — 2MB pages mean 512× fewer TLB entries than 4KB pages, dramatically cutting the TLB miss rate on the hot packet path.
| Property | Normal 4KB Pages | DPDK Hugepages (2MB) |
|---|---|---|
| Page size | 4 KB | 2 MB (or 1 GB) |
| Pinned in RAM | No — OS can swap to disk | Yes — mlock'd at allocation, never swapped |
| Physical address stable | No — IOVA becomes stale after swap | Yes — IOVA always valid for NIC DMA |
| TLB entries for 1 GB data | 262,144 entries | 512 entries (512× fewer misses) |
| Page fault on access | Possible — ~10 ms disk I/O | Never — pages pre-faulted at EAL init |
| DMA safety | Unsafe — may be freed under NIC | Safe — physical addr never changes |
⚠️ The catastrophic swap scenario: NIC DMA uses physical addresses (IOVAs). If a page is swapped out, the physical frame is freed. The NIC's IOVA is now stale — it writes to wrong or freed memory. Even without corruption, one swap = ~10 ms pause. At 100G, the ring fills in ~80 µs. 10 ms = millions of dropped packets.
HUGEPAGE SETUP COMMANDS
# Check available hugepage sizes
ls /sys/kernel/mm/hugepages/
# Allocate 1024 × 2MB hugepages (= 2 GB)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# Mount hugetlbfs (if not already mounted)
mount -t hugetlbfs none /dev/hugepages
# Verify allocation
cat /proc/meminfo | grep Huge
# DPDK EAL: use --socket-mem to specify per-NUMA-socket allocation
./my_app -l 0-3 -n 4 --socket-mem 1024,1024 # 1GB on socket 0, 1GB on socket 1
📌 1GB hugepages: for very large mempools, or when 2MB pages still leave too many TLB entries. Requires the kernel boot parameters hugepagesz=1G hugepages=4. EAL will prefer 1GB pages if available.
VIRTUAL MEMORY — PROCESS ISOLATION
Virtual Memory vs Physical Memory — Process Isolation
Process A (virtual) Process B (virtual) Physical RAM
▲ HIGH ADDRESS ▲ HIGH ADDRESS ▲
kernel space kernel space kernel code (shared)
stack 0x7FFF… stack 0x7FFF… frame 1024 ← A stack
heap 0x0810… heap 0x0810… frame 2048 ← B stack
data 0x0805… data 0x0805… frame 5632 ← A heap
code 0x0804… code 0x0804… frame 8192 ← B heap
▼ 0x0000 ▼ 0x0000 DPDK hugepages PINNED
— never swapped
— fixed IOVAs for NIC DMA
KEY INSIGHT: Both processes may use virtual address 0x08051000.
MMU translates: A → physical frame 5632 | B → physical frame 8192
Same virtual address. Completely different RAM. Complete isolation.
Virtual Memory Segments
Every process has: code (text, read-only), data (BSS + initialized globals), heap (grows up via malloc/mmap), stack (grows down, per-thread), and kernel space (top of the virtual address space, Ring 0 only). DPDK hugepage allocations live in a separate mmap'd region, pinned against eviction.
NUMA MEMORY TOPOLOGY
NUMA — Non-Uniform Memory Access
Socket 0 Socket 1
CPU cores 0-7 CPU cores 8-15
L1/L2/L3 cache L1/L2/L3 cache
Local RAM (DDR channels 0-1) Local RAM (DDR channels 2-3)
↑ low latency (~60 ns) ↑ low latency (~60 ns)
NIC port 0 (PCIe) NIC port 1 (PCIe)
↑ DMA into socket 0 RAM ↑ DMA into socket 1 RAM
Cross-NUMA access (socket 0 CPU → socket 1 RAM): ~120 ns — 2× slower!
DPDK rule: ALWAYS allocate mempool on the same NUMA socket as the NIC.
rte_pktmbuf_pool_create("POOL", N, CACHE_SZ, 0, sz, rte_eth_dev_socket_id(port))
^^^^^^^^^^^^^^^^^^^^^^^^^
returns NIC's socket — use it!
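A minimal sketch of the rule in practice, one pool per port; create_pool_for_port and the size constants are illustrative choices, not DPDK requirements:
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

// Create the Rx mempool on the NIC's NUMA node. rte_eth_dev_socket_id()
// can return -1 if the socket is unknown; fall back to the caller's socket.
static struct rte_mempool *
create_pool_for_port(uint16_t port_id)
{
    int socket = rte_eth_dev_socket_id(port_id);
    if (socket < 0)
        socket = (int)rte_socket_id();

    char name[RTE_MEMPOOL_NAMESIZE];
    snprintf(name, sizeof(name), "POOL_P%u", port_id);

    return rte_pktmbuf_pool_create(name,
            8191,                      /* elements: 2^13 - 1, see sizing note below */
            256,                       /* per-lcore cache */
            0,                         /* private data size */
            RTE_MBUF_DEFAULT_BUF_SIZE, /* headroom + data room */
            socket);                   /* same NUMA node as the NIC */
}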
rte_mempool — The Allocation Eliminator
rte_mempool pre-allocates all packet buffers at startup. The hot data path never calls malloc/free — it calls rte_mempool_get() (which pops from a lock-free ring or per-lcore cache) and rte_mempool_put() (which pushes back). This is what enables zero-allocation-overhead packet processing.
MEMPOOL INTERNAL ARCHITECTURE
rte_mempool Architecture
rte_mempool header
┌─────────────────────────────────────┐
│ name, size, elt_size, cache_size │
│ count = total_elts - in_use_count │
└───────────────┬─────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
Per-lcore cache Per-lcore cache Per-lcore cache
(lcore 0) (lcore 1) (lcore 2)
up to cache_size up to cache_size up to cache_size
objects (stack) objects (stack) objects (stack)
│ │ │
└─────────────────────┴─────────────────────┘
│ (cache miss → fallback)
▼
Common pool (rte_ring)
Lock-free MPMC ring
Contains all remaining objects
ALLOCATION PATH
1. rte_mempool_get(pool, &obj): check the per-lcore cache first (~3 cycles, no atomics)
2. Cache hit: pop an object from the lcore-local stack and return immediately. Zero contention.
3. Cache miss: refill the lcore cache in bulk from the common pool ring (one CAS → batch transfer)
4. rte_mempool_put(pool, obj): push to the lcore cache. If the cache is full → flush a bulk batch to the common ring.
// Create a packet mempool
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
"MBUF_POOL", // unique name
8192, // total number of mbufs
256, // per-lcore cache size (objects)
0, // private data size per element
RTE_MBUF_DEFAULT_BUF_SIZE, // buffer size: 2048B data room + 128B headroom = 2176
rte_eth_dev_socket_id(port_id) // NUMA socket — MUST match NIC
);
if (!mbuf_pool)
rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
// Manual get/put (for non-packet objects)
void *obj;
if (rte_mempool_get(pool, &obj) == 0) {  // returns 0 on success, -ENOBUFS if empty
    /* use obj */
    rte_mempool_put(pool, obj);          // return object to the pool
}
// Bulk operations (preferred — reduces ring contention)
void *objs[32];
if (rte_mempool_get_bulk(pool, objs, 32) == 0) { // all-or-nothing: 0 or -ENOBUFS
    /* use all 32 objects */
    rte_mempool_put_bulk(pool, objs, 32);
}
| Pool Size | Use Case | Notes |
|---|---|---|
| 1024 – 4096 | Dev / light load | Small — may exhaust quickly under burst |
| 8192 | Standard — most DPDK examples | Good balance of memory vs exhaustion risk |
| 65536+ | High burst / 100G line rate | Large memory footprint but never exhausts under normal traffic |
⚠️ Pool sizing: for optimal memory use, pick a power of 2 minus 1 (e.g. 8191, not 8192). rte_mempool is backed by a power-of-2-sized ring that needs N+1 slots, so asking for 8192 forces a 16384-slot ring. The API accepts any N and adjusts internally, so nothing breaks; 8192 simply wastes ring memory where 8191 would not.
rte_mbuf — The Packet Carrier
rte_mbuf is DPDK's equivalent of the kernel's sk_buff. Every received packet is wrapped in an mbuf: a fixed metadata header followed by a contiguous data buffer where the packet bytes live. The key design decision: metadata and packet data sit in the same hugepage allocation, so there is no pointer chase to a separately allocated buffer and prefetching both is cheap and predictable.
MBUF MEMORY LAYOUT
rte_mbuf Memory Layout (one hugepage allocation)
┌─────────────────────────────────────────────────────────────────┐
│ rte_mbuf header (~128 bytes) │
│ buf_addr ─────────────────────────────────────────────────► │
│ buf_iova (physical address for NIC DMA) │
│ data_off (offset from buf_addr to first packet byte) │
│ pkt_len (total packet length in bytes) │
│ data_len (data length in this segment) │
│ nb_segs (number of segments in chain) │
│ port (Rx port index) │
│ ol_flags (offload flags: cksum, vlan, rss, etc.) │
│ hash.rss (RSS hash value from NIC hardware) │
│ vlan_tci (VLAN tag if stripped by NIC) │
│ next (pointer to next segment, or NULL) │
│ pool (pointer back to mempool for free) │
│ refcnt (reference count — for cloning) │
├─────────────────────────────────────────────────────────────────┤
│ Headroom (RTE_PKTMBUF_HEADROOM = 128 bytes) │
│ ← reserved for prepending headers │
├──────────── ← buf_addr + data_off (= rte_pktmbuf_mtod result) ─┤
│ [ Ethernet header (14B) │ IP (20B) │ TCP (20B) │…]│
│ Packet data (data_len bytes) │
└─────────────────────────────────────────────────────────────────┘
Total buffer: RTE_MBUF_DEFAULT_BUF_SIZE = 2176 bytes (2048B data room + 128B headroom)
KEY MBUF MACROS & FIELDS
// Get pointer to packet data (most common operation)
struct rte_ether_hdr *eth = rte_pktmbuf_mtod(mbuf, struct rte_ether_hdr *);
// Expands to: (type)(mbuf->buf_addr + mbuf->data_off) — direct pointer into hugepage
// Access packet at byte offset
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(mbuf, struct rte_ipv4_hdr *,
sizeof(struct rte_ether_hdr));
// Packet length
uint32_t total_len = mbuf->pkt_len; // total bytes across all segments
uint16_t seg_len = mbuf->data_len; // bytes in this segment only
// Prepend a header (uses headroom)
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)
rte_pktmbuf_prepend(mbuf, sizeof(struct rte_ether_hdr));
// Returns NULL if no headroom available
// Append to tail (uses tailroom)
char *tail = rte_pktmbuf_append(mbuf, 4); // add 4 bytes at end; NULL if no tailroom
// Remove from front (advance data_off)
rte_pktmbuf_adj(mbuf, sizeof(struct rte_ether_hdr));
// Free mbuf back to pool
rte_pktmbuf_free(mbuf); // also frees chained segments
| ol_flags Bit | Direction | Meaning |
|---|---|---|
| RTE_MBUF_F_RX_RSS_HASH | Rx | NIC computed RSS hash — value in mbuf->hash.rss |
| RTE_MBUF_F_RX_IP_CKSUM_GOOD | Rx | NIC verified IP checksum — correct |
| RTE_MBUF_F_RX_IP_CKSUM_BAD | Rx | NIC verified IP checksum — bad (drop the packet) |
| RTE_MBUF_F_RX_VLAN | Rx | VLAN tag present — stripped to mbuf->vlan_tci |
| RTE_MBUF_F_TX_IPV4 | Tx | IPv4 packet — required when requesting Tx IP cksum offload |
| RTE_MBUF_F_TX_IP_CKSUM | Tx | Ask NIC to compute and insert IPv4 header checksum |
| RTE_MBUF_F_TX_TCP_CKSUM | Tx | Ask NIC to compute and insert TCP checksum |
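A short sketch of these flags in use: an Rx-side validity check, then a Tx-side IPv4 checksum-offload request. It assumes the port was configured with RTE_ETH_TX_OFFLOAD_IPV4_CKSUM and that the code sits inside a per-packet function (hence the bare return):
// Rx: drop packets whose IP checksum the NIC flagged as bad
if (mbuf->ol_flags & RTE_MBUF_F_RX_IP_CKSUM_BAD) {
    rte_pktmbuf_free(mbuf);
    return;
}

// Tx: ask the NIC to fill in the IPv4 header checksum
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(mbuf,
        struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
ip->hdr_checksum = 0;                      // must be zero before offload
mbuf->l2_len = sizeof(struct rte_ether_hdr);
mbuf->l3_len = sizeof(struct rte_ipv4_hdr);
mbuf->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM;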
Chained mbufs — For Jumbo Frames
A single mbuf data buffer holds 2048 bytes by default. Jumbo frames (e.g. a 9000-byte MTU) therefore require chained mbufs — a linked list of mbufs where mbuf->next points to the continuation segment. Only the first segment's pkt_len holds the total length; its nb_segs holds the segment count.
Chained mbuf Layout (jumbo frame example: 5000 bytes)
Segment 0 (head) Segment 1 Segment 2
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ pkt_len = 5000    │ │ pkt_len (ignored) │ │ pkt_len (ignored) │
│ data_len = 1920   │ │ data_len = 1920   │ │ data_len = 1160   │
│ nb_segs = 3       │ │ nb_segs (ignored) │ │ nb_segs (ignored) │
│ next ─────────────┼────►│ next ─────────────┼──►│ next = NULL │
│ [packet data...] │ │ [packet data...] │ │ [packet data...] │
└───────────────────┘ └───────────────────┘ └───────────────────┘
1920 bytes 1920 bytes 1160 bytes
Total: 1920 + 1920 + 1160 = 5000 bytes
// Check if mbuf is chained
if (mbuf->nb_segs > 1) {
// Walk the chain
struct rte_mbuf *seg = mbuf;
while (seg != NULL) {
uint8_t *data = rte_pktmbuf_mtod(seg, uint8_t *);
uint16_t len = seg->data_len;
/* process this segment */
seg = seg->next;
}
}
// Linearize (copy all segments into one flat buffer) — expensive, avoid on hot path
char buf[9000];
const void *data = rte_pktmbuf_read(mbuf, 0, mbuf->pkt_len, buf);
// rte_pktmbuf_read() returns a pointer to contiguous data: buf if a copy was
// needed, or directly into the mbuf if the range was already contiguous;
// NULL if the requested range exceeds the packet.
⚠️ Most DPDK applications avoid chained mbufs on the hot path. The preferred approach is to set RTE_ETH_RX_OFFLOAD_SCATTER and handle multi-segment mbufs only in the exception path. For performance-critical NFs, configure the MTU ≤ the single-segment buffer size and drop/reject jumbo frames at the port level.
MBUF CLONE vs REFERENCE COUNT
rte_pktmbuf_clone() — Sharing Without Copy
Cloning creates a new mbuf header that shares the same data buffer as the original. The data buffer's reference count (refcnt) is incremented. rte_pktmbuf_free() on either clone decrements refcnt — the data buffer is only returned to the pool when refcnt reaches zero. Use case: multicast — send the same packet out multiple ports without copying the data.
// Clone for multicast (zero-copy)
struct rte_mbuf *clone = rte_pktmbuf_clone(original, pool);
// original->refcnt: 1 → 2 (clone is an indirect mbuf sharing the data buffer)
rte_eth_tx_burst(port_a, 0, &original, 1); // refcnt 2→1 once the driver frees it
rte_eth_tx_burst(port_b, 0, &clone, 1);    // refcnt 1→0 → data buffer back to pool
// In real code check the tx_burst return value: unsent mbufs must be freed by the caller.
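A tiny verification sketch, assuming a valid pktmbuf pool handle named pool; rte_mbuf_refcnt_read() is the real accessor:
struct rte_mbuf *orig = rte_pktmbuf_alloc(pool);
// freshly allocated: rte_mbuf_refcnt_read(orig) == 1
struct rte_mbuf *c = rte_pktmbuf_clone(orig, pool);
// now rte_mbuf_refcnt_read(orig) == 2; c is an indirect mbuf
rte_pktmbuf_free(c);     // frees the clone header; orig refcnt back to 1
rte_pktmbuf_free(orig);  // refcnt 1 -> 0: data buffer returned to the pool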
Q: Why can't DPDK use normal 4KB pages for packet buffers?
Two reasons: (1) DMA instability — 4KB pages can be swapped out by the OS at any time, so the NIC's IOVA would become stale and DMA would write into wrong or freed memory. (2) TLB pressure — 1 GB of packet buffers needs 262,144 TLB entries with 4KB pages vs only 512 entries with 2MB hugepages; TLB misses on the hot path at 100G rates would dominate CPU time.
Q: What is the per-lcore cache in rte_mempool and why does it matter?
The per-lcore cache is a small, lcore-local stack of pre-fetched objects (typically 256 entries). Alloc/free to the lcore cache requires no atomic operations — it's just an array index increment/decrement. Only when the cache empties or overflows does it interact with the common pool ring (one CAS for a bulk transfer). This makes rte_mempool_get/put nearly as cheap as a stack pop on the hot path.
Q: What is rte_pktmbuf_mtod() and how does it work?
It's a macro: (type)(mbuf->buf_addr + mbuf->data_off). buf_addr is the pointer to the start of the data buffer. data_off is the byte offset to the first packet byte (defaults to RTE_PKTMBUF_HEADROOM = 128 bytes, leaving space to prepend headers). The result is a typed pointer directly into hugepage memory — no copy, no syscall.
Q: What is headroom in an mbuf and when is it used?
Headroom is a reserved region at the start of the data buffer, before the packet data. Default: 128 bytes (RTE_PKTMBUF_HEADROOM). It's used when your NF needs to prepend a header to an incoming packet — e.g., adding a VXLAN or GRE encapsulation header. Instead of copying the entire packet to a new buffer, you use rte_pktmbuf_prepend() which decrements data_off to expand into the headroom. Zero allocation, zero copy.
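A minimal sketch of the prepend pattern; struct outer_hdr here is a hypothetical encapsulation header, not a DPDK type, and the bare return implies a surrounding per-packet function:
struct outer_hdr {                  // hypothetical outer header
    struct rte_ether_hdr eth;
    struct rte_ipv4_hdr  ip;
} __rte_packed;

char *p = rte_pktmbuf_prepend(mbuf, sizeof(struct outer_hdr));
if (p == NULL) {
    rte_pktmbuf_free(mbuf);         // no headroom left: drop (or copy)
    return;
}
struct outer_hdr *outer = (struct outer_hdr *)p;
// fill in outer->eth and outer->ip here ...
// data_off shrank by sizeof(struct outer_hdr); pkt_len/data_len grew by it.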
Q: What happens when rte_mempool runs out of objects?
rte_mempool_get() returns -ENOBUFS (non-zero). For pktmbuf pools, the PMD reports this as stats.rx_nombuf and the packet is dropped by the NIC before it reaches the application. This is a critical metric to monitor — it means the application is not returning mbufs to the pool fast enough, or the pool is undersized. Fix: increase pool size, check for mbuf leaks (tx_burst without freeing unsent packets), or reduce processing latency.
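A small diagnostic sketch one might call periodically from a control-plane thread; check_pool_health is an illustrative name, not a DPDK API:
#include <inttypes.h>   // PRIu64
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>

static void
check_pool_health(struct rte_mempool *pool, uint16_t port_id)
{
    unsigned int avail  = rte_mempool_avail_count(pool);
    unsigned int in_use = rte_mempool_in_use_count(pool);

    struct rte_eth_stats st;
    if (rte_eth_stats_get(port_id, &st) != 0)
        return;

    printf("pool: %u avail / %u in use; rx_nombuf=%" PRIu64 "\n",
           avail, in_use, st.rx_nombuf);
    if (st.rx_nombuf > 0)
        printf("WARNING: NIC dropped packets for lack of mbufs\n");
}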
Q: What is the difference between pkt_len and data_len?
data_len: bytes of packet data in this segment only.
pkt_len: total bytes across all segments in the chain (only valid on the first segment / head mbuf).
For single-segment mbufs (the common case), both are equal. For chained mbufs (jumbo frames), pkt_len = the sum of data_len over all segments.
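The invariant is easy to assert in code; note that RTE_ASSERT compiles away unless RTE_ENABLE_ASSERT is defined:
// Head's pkt_len must equal the sum of data_len over the whole chain.
uint32_t sum = 0;
for (struct rte_mbuf *seg = mbuf; seg != NULL; seg = seg->next)
    sum += seg->data_len;
RTE_ASSERT(sum == mbuf->pkt_len);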
🔥 Lab 3: mbuf Inspector — Decode Every Field
Create a DPDK application that receives one burst of packets and prints every mbuf field. The goal is to see the real hardware values — RSS hash, ol_flags, pkt_len — not just theoretical values.
1. Create the mempool with rte_pktmbuf_pool_create() on the NIC's socket
2. Configure the port: enable RTE_ETH_RX_OFFLOAD_CHECKSUM and RTE_ETH_RX_OFFLOAD_RSS_HASH
3. Receive one burst: rte_eth_rx_burst(port, 0, pkts, 32)
4. For each received mbuf print: buf_addr, buf_iova, data_off, pkt_len, data_len, nb_segs, port, ol_flags (as hex), hash.rss, vlan_tci
5. Use rte_pktmbuf_mtod() to get the Ethernet header — print src and dst MAC
6. Verify RTE_MBUF_F_RX_IP_CKSUM_GOOD is set on a valid IPv4 packet
7. Check mempool stats afterwards: rte_mempool_avail_count(pool) should decrease by nb_rx
8. Free all mbufs with rte_pktmbuf_free(pkts[i]) — verify avail_count is restored
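A minimal sketch of steps 3–8 above, assuming EAL, port, and pool are already initialized; mbuf field names follow DPDK 21.11+ (src_addr/dst_addr) and error handling is trimmed:
#include <inttypes.h>   // PRIx64
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

struct rte_mbuf *pkts[32];
uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);

for (uint16_t i = 0; i < nb_rx; i++) {
    struct rte_mbuf *m = pkts[i];
    printf("mbuf %u: buf_addr=%p iova=0x%" PRIx64
           " data_off=%u pkt_len=%u data_len=%u nb_segs=%u"
           " port=%u ol_flags=0x%" PRIx64 " rss=0x%08x vlan=%u\n",
           i, m->buf_addr, (uint64_t)m->buf_iova,
           m->data_off, m->pkt_len, m->data_len, m->nb_segs,
           m->port, m->ol_flags, m->hash.rss, m->vlan_tci);

    // Step 5: typed pointer straight into the hugepage buffer
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    char src[RTE_ETHER_ADDR_FMT_SIZE], dst[RTE_ETHER_ADDR_FMT_SIZE];
    rte_ether_format_addr(src, sizeof(src), &eth->src_addr);
    rte_ether_format_addr(dst, sizeof(dst), &eth->dst_addr);
    printf("  %s -> %s\n", src, dst);

    rte_pktmbuf_free(m);   // step 8: return the mbuf to its pool
}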
🔥 Lab 4: Pool Exhaustion Experiment
Intentionally exhaust the mempool to observe the rx_nombuf counter. This teaches defensive mbuf management.
1. Create a small pool: 64 mbufs total
2. Receive packets in a loop — do not free them
3. After the pool empties: poll stats.rx_nombuf via rte_eth_stats_get() — observe it increment
4. Free all held mbufs — observe rx_nombuf stop incrementing
5. Lesson: every code path that receives mbufs MUST free them or hand them off to Tx. Mbuf leaks are the most common DPDK production bug.
MASTERY CHECKLIST
- Can explain the two reasons DPDK requires hugepages (DMA stability + TLB efficiency)
- Can draw the rte_mempool architecture including per-lcore cache and common ring
- Can draw the rte_mbuf memory layout with all key fields labeled
- Can explain what rte_pktmbuf_mtod() expands to and why it's zero-copy
- Can explain headroom: what it is, default value, and when prepend is used
- Can explain pkt_len vs data_len and when they differ
- Can explain what happens when the mempool runs out and how to diagnose it
- Can explain rte_pktmbuf_clone() reference counting semantics