DPDK MASTERY · PHASE 1 OF 3 · MODULE B
Hugepages, mempool & mbuf
Hugepage memory model · IOVA · rte_mempool internals · rte_mbuf anatomy · chained mbufs
Ch 4 — Hugepages & Memory
Ch 5 — rte_mempool
Ch 6 — rte_mbuf
C · DMA · NUMA
Weeks 3–5
Why Hugepages Are Mandatory in DPDK
Two orthogonal requirements drive hugepage usage: (1) DMA stability — hugepages are pinned (mlock'd) so the NIC's IOVA is always valid; (2) TLB efficiency — 2MB pages mean 512× fewer TLB entries than 4KB pages, dramatically cutting the TLB miss rate on the hot packet path.
| Property | Normal 4KB Pages | DPDK Hugepages (2MB) |
|---|---|---|
| Page size | 4 KB | 2 MB (or 1 GB) |
| Pinned in RAM | No — OS can swap to disk | Yes — mlock'd at allocation, never swapped |
| Physical address stable | No — IOVA becomes stale after swap | Yes — IOVA always valid for NIC DMA |
| TLB entries for 1 GB data | 262,144 entries | 512 entries (512× fewer misses) |
| Page fault on access | Possible — ~10 ms disk I/O | Never — pages pre-faulted at EAL init |
| DMA safety | Unsafe — may be freed under NIC | Safe — physical addr never changes |
⚠️ The catastrophic swap scenario: NIC DMA uses physical addresses (IOVAs). If a page is swapped out, the physical frame is freed. The NIC's IOVA is now stale — it writes to wrong or freed memory. Even without corruption, one swap = ~10 ms pause. At 100G, the ring fills in ~80 µs. 10 ms = millions of dropped packets.
HUGEPAGE SETUP COMMANDS
# Check available hugepage sizes
ls /sys/kernel/mm/hugepages/
# Allocate 1024 × 2MB hugepages (= 2 GB)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# Mount hugetlbfs (if not already mounted)
mount -t hugetlbfs none /dev/hugepages
# Verify allocation
cat /proc/meminfo | grep Huge
# DPDK EAL: use --socket-mem to specify per-NUMA-socket allocation
./my_app -l 0-3 -n 4 --socket-mem 1024,1024 # 1GB on socket 0, 1GB on socket 1
📌 1GB hugepages: for very large mempools, or when 2MB pages still leave too many TLB entries. Requires the kernel boot parameters hugepagesz=1G hugepages=4. EAL will prefer 1GB pages if available.
VIRTUAL MEMORY — PROCESS ISOLATION
Virtual Memory vs Physical Memory — Process Isolation
Process A (virtual) Process B (virtual) Physical RAM
▲ HIGH ADDRESS ▲ HIGH ADDRESS ▲
kernel space kernel space kernel code (shared)
stack 0x7FFF… stack 0x7FFF… frame 1024 ← A stack
heap 0x0810… heap 0x0810… frame 2048 ← B stack
data 0x0805… data 0x0805… frame 5632 ← A heap
code 0x0804… code 0x0804… frame 8192 ← B heap
▼ 0x0000 ▼ 0x0000 DPDK hugepages PINNED
— never swapped
— fixed IOVAs for NIC DMA
KEY INSIGHT: Both processes may use virtual address 0x08051000.
MMU translates: A → physical frame 5632 | B → physical frame 8192
Same virtual address. Completely different RAM. Complete isolation.
Virtual Memory Segments
Every process has: code (text, read-only), data (BSS + initialized globals), heap (grows up via malloc/mmap), stack (grows down, per-thread), and kernel space (top of the virtual address space, Ring 0 only). DPDK hugepage allocations live in a separate mmap'd region, pinned against eviction.
NUMA MEMORY TOPOLOGY
NUMA — Non-Uniform Memory Access
Socket 0 Socket 1
CPU cores 0-7 CPU cores 8-15
L1/L2/L3 cache L1/L2/L3 cache
Local RAM (DDR channels 0-1) Local RAM (DDR channels 2-3)
↑ low latency (~60 ns) ↑ low latency (~60 ns)
NIC port 0 (PCIe) NIC port 1 (PCIe)
↑ DMA into socket 0 RAM ↑ DMA into socket 1 RAM
Cross-NUMA access (socket 0 CPU → socket 1 RAM): ~120 ns — 2× slower!
DPDK rule: ALWAYS allocate mempool on the same NUMA socket as the NIC.
rte_pktmbuf_pool_create("POOL", N, CACHE_SZ, 0, sz, rte_eth_dev_socket_id(port))
^^^^^^^^^^^^^^^^^^^^^^^^^
returns NIC's socket — use it!
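A minimal sketch of the rule in practice, one pool per port; create_pool_for_port and the size constants are illustrative choices, not DPDK requirements:
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

// Create the Rx mempool on the NIC's NUMA node. rte_eth_dev_socket_id()
// can return -1 if the socket is unknown; fall back to the caller's socket.
static struct rte_mempool *
create_pool_for_port(uint16_t port_id)
{
    int socket = rte_eth_dev_socket_id(port_id);
    if (socket < 0)
        socket = (int)rte_socket_id();

    char name[RTE_MEMPOOL_NAMESIZE];
    snprintf(name, sizeof(name), "POOL_P%u", port_id);

    return rte_pktmbuf_pool_create(name,
            8191,                      /* elements: 2^13 - 1, see sizing note below */
            256,                       /* per-lcore cache */
            0,                         /* private data size */
            RTE_MBUF_DEFAULT_BUF_SIZE, /* headroom + data room */
            socket);                   /* same NUMA node as the NIC */
}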
rte_mempool — The Allocation Eliminator
rte_mempool pre-allocates all packet buffers at startup. The hot data path never calls malloc/free — it calls rte_mempool_get() (which pops from a lock-free ring or per-lcore cache) and rte_mempool_put() (which pushes back). This is what enables zero-allocation-overhead packet processing.
MEMPOOL INTERNAL ARCHITECTURE
rte_mempool Architecture
rte_mempool header
┌─────────────────────────────────────┐
│ name, size, elt_size, cache_size │
│ count = total_elts - in_use_count │
└───────────────┬─────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
Per-lcore cache Per-lcore cache Per-lcore cache
(lcore 0) (lcore 1) (lcore 2)
up to cache_size up to cache_size up to cache_size
objects (stack) objects (stack) objects (stack)
│ │ │
└─────────────────────┴─────────────────────┘
│ (cache miss → fallback)
▼
Common pool (rte_ring)
Lock-free MPMC ring
Contains all remaining objects
ALLOCATION PATH
1. rte_mempool_get(pool, &obj): check the per-lcore cache first (~3 cycles, no atomics)
2. Cache hit: pop an object from the lcore-local stack and return immediately. Zero contention.
3. Cache miss: refill the lcore cache in bulk from the common pool ring (one CAS → batch transfer)
4. rte_mempool_put(pool, obj): push to the lcore cache. If the cache is full → flush a bulk batch to the common ring.
// Create a packet mempool
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
"MBUF_POOL", // unique name
8192, // total number of mbufs
256, // per-lcore cache size (objects)
0, // private data size per element
RTE_MBUF_DEFAULT_BUF_SIZE, // buffer size: 2048B data room + 128B headroom = 2176
rte_eth_dev_socket_id(port_id) // NUMA socket — MUST match NIC
);
if (!mbuf_pool)
rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
// Manual get/put (for non-packet objects)
void *obj;
if (rte_mempool_get(pool, &obj) == 0) {  // returns 0 on success, -ENOBUFS if empty
    /* use obj */
    rte_mempool_put(pool, obj);          // return object to the pool
}
// Bulk operations (preferred — reduces ring contention)
void *objs[32];
if (rte_mempool_get_bulk(pool, objs, 32) == 0) { // all-or-nothing: 0 or -ENOBUFS
    /* use all 32 objects */
    rte_mempool_put_bulk(pool, objs, 32);
}
| Pool Size | Use Case | Notes |
|---|---|---|
| 1024 – 4096 | Dev / light load | Small — may exhaust quickly under burst |
| 8192 | Standard — most DPDK examples | Good balance of memory vs exhaustion risk |
| 65536+ | High burst / 100G line rate | Large memory footprint but never exhausts under normal traffic |
⚠️ Pool sizing: for optimal memory use, pick a power of 2 minus 1 (e.g. 8191, not 8192). rte_mempool is backed by a power-of-2-sized ring that needs N+1 slots, so asking for 8192 forces a 16384-slot ring. The API accepts any N and adjusts internally, so nothing breaks; 8192 simply wastes ring memory where 8191 would not.
rte_mbuf — The Packet Carrier
rte_mbuf is DPDK's equivalent of the kernel's sk_buff. Every received packet is wrapped in an mbuf: a fixed metadata header followed by a contiguous data buffer where the packet bytes live. The key design decision: metadata and packet data sit in the same hugepage allocation, so there is no pointer chase to a separately allocated buffer and prefetching both is cheap and predictable.
MBUF MEMORY LAYOUT
rte_mbuf Memory Layout (one hugepage allocation)
┌─────────────────────────────────────────────────────────────────┐
│ rte_mbuf header (~128 bytes) │
│ buf_addr ─────────────────────────────────────────────────► │
│ buf_iova (physical address for NIC DMA) │
│ data_off (offset from buf_addr to first packet byte) │
│ pkt_len (total packet length in bytes) │
│ data_len (data length in this segment) │
│ nb_segs (number of segments in chain) │
│ port (Rx port index) │
│ ol_flags (offload flags: cksum, vlan, rss, etc.) │
│ hash.rss (RSS hash value from NIC hardware) │
│ vlan_tci (VLAN tag if stripped by NIC) │
│ next (pointer to next segment, or NULL) │
│ pool (pointer back to mempool for free) │
│ refcnt (reference count — for cloning) │
├─────────────────────────────────────────────────────────────────┤
│ Headroom (RTE_PKTMBUF_HEADROOM = 128 bytes) │
│ ← reserved for prepending headers │
├──────────── ← buf_addr + data_off (= rte_pktmbuf_mtod result) ─┤
│ [ Ethernet header (14B) │ IP (20B) │ TCP (20B) │…]│
│ Packet data (data_len bytes) │
└─────────────────────────────────────────────────────────────────┘
Total buffer: RTE_MBUF_DEFAULT_BUF_SIZE = 2176 bytes (2048B data room + 128B headroom)
KEY MBUF MACROS & FIELDS
// Get pointer to packet data (most common operation)
struct rte_ether_hdr *eth = rte_pktmbuf_mtod(mbuf, struct rte_ether_hdr *);
// Expands to: (type)(mbuf->buf_addr + mbuf->data_off) — direct pointer into hugepage
// Access packet at byte offset
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(mbuf, struct rte_ipv4_hdr *,
sizeof(struct rte_ether_hdr));
// Packet length
uint32_t total_len = mbuf->pkt_len; // total bytes across all segments
uint16_t seg_len = mbuf->data_len; // bytes in this segment only
// Prepend a header (uses headroom)
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)
rte_pktmbuf_prepend(mbuf, sizeof(struct rte_ether_hdr));
// Returns NULL if no headroom available
// Append to tail (uses tailroom)
char *tail = rte_pktmbuf_append(mbuf, 4); // add 4 bytes at end; NULL if no tailroom
// Remove from front (advance data_off)
rte_pktmbuf_adj(mbuf, sizeof(struct rte_ether_hdr));
// Free mbuf back to pool
rte_pktmbuf_free(mbuf); // also frees chained segments
| ol_flags Bit | Direction | Meaning |
|---|---|---|
| RTE_MBUF_F_RX_RSS_HASH | Rx | NIC computed RSS hash — value in mbuf->hash.rss |
| RTE_MBUF_F_RX_IP_CKSUM_GOOD | Rx | NIC verified IP checksum — correct |
| RTE_MBUF_F_RX_IP_CKSUM_BAD | Rx | NIC verified IP checksum — bad (drop the packet) |
| RTE_MBUF_F_RX_VLAN | Rx | VLAN tag present — stripped to mbuf->vlan_tci |
| RTE_MBUF_F_TX_IPV4 | Tx | IPv4 packet — required when requesting Tx IP cksum offload |
| RTE_MBUF_F_TX_IP_CKSUM | Tx | Ask NIC to compute and insert IPv4 header checksum |
| RTE_MBUF_F_TX_TCP_CKSUM | Tx | Ask NIC to compute and insert TCP checksum |
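A short sketch of these flags in use: an Rx-side validity check, then a Tx-side IPv4 checksum-offload request. It assumes the port was configured with RTE_ETH_TX_OFFLOAD_IPV4_CKSUM and that the code sits inside a per-packet function (hence the bare return):
// Rx: drop packets whose IP checksum the NIC flagged as bad
if (mbuf->ol_flags & RTE_MBUF_F_RX_IP_CKSUM_BAD) {
    rte_pktmbuf_free(mbuf);
    return;
}

// Tx: ask the NIC to fill in the IPv4 header checksum
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(mbuf,
        struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
ip->hdr_checksum = 0;                      // must be zero before offload
mbuf->l2_len = sizeof(struct rte_ether_hdr);
mbuf->l3_len = sizeof(struct rte_ipv4_hdr);
mbuf->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM;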
Chained mbufs — For Jumbo Frames
A single mbuf data buffer holds 2048 bytes by default. Jumbo frames (e.g. a 9000-byte MTU) therefore require chained mbufs — a linked list of mbufs where mbuf->next points to the continuation segment. Only the first segment's pkt_len holds the total length; its nb_segs holds the segment count.
Chained mbuf Layout (jumbo frame example: 5000 bytes)
Segment 0 (head) Segment 1 Segment 2
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ pkt_len = 5000    │ │ pkt_len (ignored) │ │ pkt_len (ignored) │
│ data_len = 1920   │ │ data_len = 1920   │ │ data_len = 1160   │
│ nb_segs = 3       │ │ nb_segs (ignored) │ │ nb_segs (ignored) │
│ next ─────────────┼────►│ next ─────────────┼──►│ next = NULL │
│ [packet data...] │ │ [packet data...] │ │ [packet data...] │
└───────────────────┘ └───────────────────┘ └───────────────────┘
1920 bytes 1920 bytes 1160 bytes
Total: 1920 + 1920 + 1160 = 5000 bytes
// Check if mbuf is chained
if (mbuf->nb_segs > 1) {
// Walk the chain
struct rte_mbuf *seg = mbuf;
while (seg != NULL) {
uint8_t *data = rte_pktmbuf_mtod(seg, uint8_t *);
uint16_t len = seg->data_len;
/* process this segment */
seg = seg->next;
}
}
// Linearize (copy all segments into one flat buffer) — expensive, avoid on hot path
char buf[9000];
const void *data = rte_pktmbuf_read(mbuf, 0, mbuf->pkt_len, buf);
// rte_pktmbuf_read() returns a pointer to contiguous data: buf if a copy was
// needed, or directly into the mbuf if the range was already contiguous;
// NULL if the requested range exceeds the packet.
⚠️ Most DPDK applications avoid chained mbufs on the hot path. The preferred approach is to set RTE_ETH_RX_OFFLOAD_SCATTER and handle multi-segment mbufs only in the exception path. For performance-critical NFs, configure the MTU ≤ the single-segment buffer size and drop/reject jumbo frames at the port level.
MBUF CLONE vs REFERENCE COUNT
rte_pktmbuf_clone() — Sharing Without Copy
Cloning creates a new mbuf header that shares the same data buffer as the original. The data buffer's reference count (refcnt) is incremented. rte_pktmbuf_free() on either clone decrements refcnt — the data buffer is only returned to the pool when refcnt reaches zero. Use case: multicast — send the same packet out multiple ports without copying the data.
// Clone for multicast (zero-copy)
struct rte_mbuf *clone = rte_pktmbuf_clone(original, pool);
// original->refcnt: 1 → 2 (clone is an indirect mbuf sharing the data buffer)
rte_eth_tx_burst(port_a, 0, &original, 1); // refcnt 2→1 once the driver frees it
rte_eth_tx_burst(port_b, 0, &clone, 1);    // refcnt 1→0 → data buffer back to pool
// In real code check the tx_burst return value: unsent mbufs must be freed by the caller.
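A tiny verification sketch, assuming a valid pktmbuf pool handle named pool; rte_mbuf_refcnt_read() is the real accessor:
struct rte_mbuf *orig = rte_pktmbuf_alloc(pool);
// freshly allocated: rte_mbuf_refcnt_read(orig) == 1
struct rte_mbuf *c = rte_pktmbuf_clone(orig, pool);
// now rte_mbuf_refcnt_read(orig) == 2; c is an indirect mbuf
rte_pktmbuf_free(c);     // frees the clone header; orig refcnt back to 1
rte_pktmbuf_free(orig);  // refcnt 1 -> 0: data buffer returned to the pool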
Q: Why can't DPDK use normal 4KB pages for packet buffers?
Two reasons: (1) DMA instability — 4KB pages can be swapped out by the OS at any time, so the NIC's IOVA would become stale and DMA would write into wrong or freed memory. (2) TLB pressure — 1 GB of packet buffers needs 262,144 TLB entries with 4KB pages vs only 512 entries with 2MB hugepages; TLB misses on the hot path at 100G rates would dominate CPU time.
Q: What is the per-lcore cache in rte_mempool and why does it matter?
The per-lcore cache is a small, lcore-local stack of pre-fetched objects (typically 256 entries). Alloc/free to the lcore cache requires no atomic operations — it's just an array index increment/decrement. Only when the cache empties or overflows does it interact with the common pool ring (one CAS for a bulk transfer). This makes rte_mempool_get/put nearly as cheap as a stack pop on the hot path.
Q: What is rte_pktmbuf_mtod() and how does it work?
It's a macro: (type)(mbuf->buf_addr + mbuf->data_off). buf_addr is the pointer to the start of the data buffer. data_off is the byte offset to the first packet byte (defaults to RTE_PKTMBUF_HEADROOM = 128 bytes, leaving space to prepend headers). The result is a typed pointer directly into hugepage memory — no copy, no syscall.
Q: What is headroom in an mbuf and when is it used?
Headroom is a reserved region at the start of the data buffer, before the packet data. Default: 128 bytes (RTE_PKTMBUF_HEADROOM). It's used when your NF needs to prepend a header to an incoming packet — e.g., adding a VXLAN or GRE encapsulation header. Instead of copying the entire packet to a new buffer, you use rte_pktmbuf_prepend() which decrements data_off to expand into the headroom. Zero allocation, zero copy.
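A minimal sketch of the prepend pattern; struct outer_hdr here is a hypothetical encapsulation header, not a DPDK type, and the bare return implies a surrounding per-packet function:
struct outer_hdr {                  // hypothetical outer header
    struct rte_ether_hdr eth;
    struct rte_ipv4_hdr  ip;
} __rte_packed;

char *p = rte_pktmbuf_prepend(mbuf, sizeof(struct outer_hdr));
if (p == NULL) {
    rte_pktmbuf_free(mbuf);         // no headroom left: drop (or copy)
    return;
}
struct outer_hdr *outer = (struct outer_hdr *)p;
// fill in outer->eth and outer->ip here ...
// data_off shrank by sizeof(struct outer_hdr); pkt_len/data_len grew by it.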
Q: What happens when rte_mempool runs out of objects?
rte_mempool_get() returns -ENOBUFS (non-zero). For pktmbuf pools, the PMD reports this as stats.rx_nombuf and the packet is dropped by the NIC before it reaches the application. This is a critical metric to monitor — it means the application is not returning mbufs to the pool fast enough, or the pool is undersized. Fix: increase pool size, check for mbuf leaks (tx_burst without freeing unsent packets), or reduce processing latency.
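A small diagnostic sketch one might call periodically from a control-plane thread; check_pool_health is an illustrative name, not a DPDK API:
#include <inttypes.h>   // PRIu64
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>

static void
check_pool_health(struct rte_mempool *pool, uint16_t port_id)
{
    unsigned int avail  = rte_mempool_avail_count(pool);
    unsigned int in_use = rte_mempool_in_use_count(pool);

    struct rte_eth_stats st;
    if (rte_eth_stats_get(port_id, &st) != 0)
        return;

    printf("pool: %u avail / %u in use; rx_nombuf=%" PRIu64 "\n",
           avail, in_use, st.rx_nombuf);
    if (st.rx_nombuf > 0)
        printf("WARNING: NIC dropped packets for lack of mbufs\n");
}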
Q: What is the difference between pkt_len and data_len?
data_len: bytes of packet data in this segment only.
pkt_len: total bytes across all segments in the chain (only valid on the first segment / head mbuf).
For single-segment mbufs (the common case), both are equal. For chained mbufs (jumbo frames), pkt_len = the sum of data_len over all segments.
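The invariant is easy to assert in code; note that RTE_ASSERT compiles away unless RTE_ENABLE_ASSERT is defined:
// Head's pkt_len must equal the sum of data_len over the whole chain.
uint32_t sum = 0;
for (struct rte_mbuf *seg = mbuf; seg != NULL; seg = seg->next)
    sum += seg->data_len;
RTE_ASSERT(sum == mbuf->pkt_len);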
🔥 Lab 3: mbuf Inspector — Decode Every Field
Create a DPDK application that receives one burst of packets and prints every mbuf field. The goal is to see the real hardware values — RSS hash, ol_flags, pkt_len — not just theoretical values.
1. Create the mempool with rte_pktmbuf_pool_create() on the NIC's socket
2. Configure the port: enable RTE_ETH_RX_OFFLOAD_CHECKSUM and RTE_ETH_RX_OFFLOAD_RSS_HASH
3. Receive one burst: rte_eth_rx_burst(port, 0, pkts, 32)
4. For each received mbuf print: buf_addr, buf_iova, data_off, pkt_len, data_len, nb_segs, port, ol_flags (as hex), hash.rss, vlan_tci
5. Use rte_pktmbuf_mtod() to get the Ethernet header — print src and dst MAC
6. Verify RTE_MBUF_F_RX_IP_CKSUM_GOOD is set on a valid IPv4 packet
7. Check mempool stats afterwards: rte_mempool_avail_count(pool) should decrease by nb_rx
8. Free all mbufs with rte_pktmbuf_free(pkts[i]) — verify avail_count is restored
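A minimal sketch of steps 3–8 above, assuming EAL, port, and pool are already initialized; mbuf field names follow DPDK 21.11+ (src_addr/dst_addr) and error handling is trimmed:
#include <inttypes.h>   // PRIx64
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

struct rte_mbuf *pkts[32];
uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);

for (uint16_t i = 0; i < nb_rx; i++) {
    struct rte_mbuf *m = pkts[i];
    printf("mbuf %u: buf_addr=%p iova=0x%" PRIx64
           " data_off=%u pkt_len=%u data_len=%u nb_segs=%u"
           " port=%u ol_flags=0x%" PRIx64 " rss=0x%08x vlan=%u\n",
           i, m->buf_addr, (uint64_t)m->buf_iova,
           m->data_off, m->pkt_len, m->data_len, m->nb_segs,
           m->port, m->ol_flags, m->hash.rss, m->vlan_tci);

    // Step 5: typed pointer straight into the hugepage buffer
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    char src[RTE_ETHER_ADDR_FMT_SIZE], dst[RTE_ETHER_ADDR_FMT_SIZE];
    rte_ether_format_addr(src, sizeof(src), &eth->src_addr);
    rte_ether_format_addr(dst, sizeof(dst), &eth->dst_addr);
    printf("  %s -> %s\n", src, dst);

    rte_pktmbuf_free(m);   // step 8: return the mbuf to its pool
}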
🔥 Lab 4: Pool Exhaustion Experiment
Intentionally exhaust the mempool to observe the rx_nombuf counter. This teaches defensive mbuf management.
1. Create a small pool: 64 mbufs total
2. Receive packets in a loop — do not free them
3. After the pool empties: poll stats.rx_nombuf via rte_eth_stats_get() — observe it increment
4. Free all held mbufs — observe rx_nombuf stop incrementing
5. Lesson: every code path that receives mbufs MUST free them or hand them off to Tx. Mbuf leaks are the most common DPDK production bug.
MASTERY CHECKLIST
- Can explain the two reasons DPDK requires hugepages (DMA stability + TLB efficiency)
- Can draw the rte_mempool architecture including per-lcore cache and common ring
- Can draw the rte_mbuf memory layout with all key fields labeled
- Can explain what rte_pktmbuf_mtod() expands to and why it's zero-copy
- Can explain headroom: what it is, default value, and when prepend is used
- Can explain pkt_len vs data_len and when they differ
- Can explain what happens when the mempool runs out and how to diagnose it
- Can explain rte_pktmbuf_clone() reference counting semantics