DPDK MASTERY · PHASE 3 OF 3 · MODULE B
Packet Patterns, Tuning & Debugging
Prefetching · batching · CPU isolation · hugepage sizing · benchmarking · pitfall diagnosis
Ch 16 — Packet Processing Patterns
Ch 17 — Performance Tuning
Ch 18 — Debugging & Pitfalls
C · perf · pktgen · SASE-DP
Weeks 13–14+
CANONICAL PACKET PROCESSING PATTERNS
Pattern 1: Receive → Process → Transmit (Basic RTC)
The simplest pattern: each lcore handles one or more NIC queues. Good for stateless forwarding, filtering, and routing.
// Pattern 1: Basic receive-process-transmit
static int lcore_main(void *arg) {
    uint16_t port = (uintptr_t)arg;
    uint16_t queue = rte_lcore_id();   /* assumes lcore ids map 1:1 to queue ids */
    struct rte_mbuf *pkts[BURST_SIZE];

    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
        if (unlikely(nb_rx == 0))
            continue;
        for (uint16_t i = 0; i < nb_rx; i++)
            process_packet(pkts[i]);
        /* forward to the paired port (0↔1) */
        uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, queue, pkts, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);  /* free packets the NIC did not accept */
    }
    return 0;
}
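A possible launch sketch for this pattern, assuming EAL init, port setup, and one Rx/Tx queue per worker are already done (polling port 0 on every worker is an illustrative choice):
// Launch lcore_main on every worker lcore, then wait for them to exit
unsigned int lcore_id;
uint16_t port = 0;                         /* illustrative: all workers poll port 0 */
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    rte_eal_remote_launch(lcore_main, (void *)(uintptr_t)port, lcore_id);
}
rte_eal_mp_wait_lcore();                   /* block until all workers return */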
Pattern 2: Batch Processing with Classification
Classify the entire burst first, then process each category. This improves cache utilization: the same code path runs on many packets before switching to the next path, so the instruction cache stays warm.
// Pattern 2: classify burst → process by type
struct rte_mbuf *tcp_pkts[BURST_SIZE], *udp_pkts[BURST_SIZE], *other[BURST_SIZE];
uint16_t nb_tcp = 0, nb_udp = 0, nb_other = 0;

for (uint16_t i = 0; i < nb_rx; i++) {
    /* assumes IPv4 traffic; a real classifier checks ether_type first */
    struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(pkts[i], struct rte_ipv4_hdr *,
                                                      sizeof(struct rte_ether_hdr));
    if (ip->next_proto_id == IPPROTO_TCP)
        tcp_pkts[nb_tcp++] = pkts[i];
    else if (ip->next_proto_id == IPPROTO_UDP)
        udp_pkts[nb_udp++] = pkts[i];
    else
        other[nb_other++] = pkts[i];
}
process_tcp_batch(tcp_pkts, nb_tcp);      /* one code path, warm I-cache */
process_udp_batch(udp_pkts, nb_udp);
rte_pktmbuf_free_bulk(other, nb_other);   /* drop unknown */
Software Prefetching — The 4-Packet Lookahead
At 100G with 64B packets, the CPU has ~6.7 ns per packet. A cache miss served from L3 costs ~40 cycles (~13 ns at 3 GHz) — more than the entire packet budget. Prefetching hides this latency by telling the CPU to fetch data for a future packet while it processes the current one.
📌 Prefetch distance: typically 4 packets ahead. Too small = the cache miss still hurts; too large = cache pollution (prefetched data is evicted before it is used). 4 is the DPDK convention, validated across Intel E810, i40e, and mlx5 drivers.
// 4-packet prefetch pattern — the DPDK standard technique
for (uint16_t i = 0; i < nb_rx; i++) {
    /* Prefetch 4 packets ahead: fetch the packet data (headers) into cache */
    if (i + 4 < nb_rx)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 4], void *));
    /* Process current packet — the prefetch for i+4 is in flight */
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
    /* ... process packet[i] ... */
}
// rte_prefetch0 = prefetch to L1 cache (highest priority)
// rte_prefetch1 = prefetch to L2 cache
// rte_prefetch2 = prefetch to L3 cache
// Use prefetch0 for hot packet data — you'll access it very soon
What to Prefetch
- Packet data: rte_pktmbuf_mtod(pkts[i+4], void *) — the Ethernet/IP/TCP header bytes
- Flow table entry: if doing a hash lookup, prefetch the expected hash bucket for the next packet before doing the lookup for the current one (see the sketch after this list)
- mbuf metadata: rte_prefetch0(pkts[i+4]) — prefetch the mbuf struct itself if you access many of its fields
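A minimal sketch of the flow-table variant, assuming a hypothetical open-addressed table indexed by the low bits of the NIC-computed RSS hash (flow_table, flow_entry, and FLOW_TABLE_SIZE are illustrative names, not DPDK APIs):
// Prefetch the NEXT packet's flow bucket while touching the current one
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define FLOW_TABLE_SIZE 65536                /* power of two → cheap masking */
#define FLOW_TABLE_MASK (FLOW_TABLE_SIZE - 1)

struct flow_entry {                          /* illustrative entry */
    uint32_t hash;
    uint32_t pkt_count;
};
static struct flow_entry flow_table[FLOW_TABLE_SIZE];

static void count_flows(struct rte_mbuf **pkts, uint16_t nb_rx) {
    for (uint16_t i = 0; i < nb_rx; i++) {
        /* The NIC already wrote the RSS hash into the mbuf, so the next
         * packet's bucket address is known without touching its headers. */
        if (i + 1 < nb_rx)
            rte_prefetch0(&flow_table[pkts[i + 1]->hash.rss & FLOW_TABLE_MASK]);
        /* Current packet's bucket — prefetched on the previous iteration. */
        struct flow_entry *fe = &flow_table[pkts[i]->hash.rss & FLOW_TABLE_MASK];
        fe->hash = pkts[i]->hash.rss;
        fe->pkt_count++;
    }
}
A 1-packet lookahead suffices here because the bucket address depends only on the RSS hash already delivered by the NIC; deeper pipelines may prefetch further ahead.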
SYSTEM-LEVEL TUNING FOR DPDK
| Tuning Area | Command / Setting | Effect |
|---|---|---|
| CPU isolation | isolcpus=4-15 in kernel cmdline | Removes CPUs 4-15 from the OS scheduler — dedicated to DPDK polling lcores |
| IRQ affinity | irqbalance --banirq=<irqs> or /proc/irq/N/smp_affinity | Move NIC IRQs away from DPDK lcores (IRQs still fire on control plane CPUs) |
| CPU frequency scaling | cpupower frequency-set -g performance | Disable P-states / frequency scaling — DPDK needs consistent cycle budget |
| Turbo Boost | echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo | Disable turbo for consistent throughput (turbo causes frequency jumps → latency variance) |
| Hugepages at boot | hugepagesz=2M hugepages=2048 in kernel cmdline | Pre-allocate at boot — avoids fragmentation that makes runtime allocation fail |
| NUMA balancing | echo 0 > /proc/sys/kernel/numa_balancing | Disable automatic NUMA page migration — DPDK pages must stay pinned |
| Transparent hugepages | echo never > /sys/kernel/mm/transparent_hugepage/enabled | Disable THP — it can cause latency spikes when pages are promoted/demoted |
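The NUMA rows above can also be verified programmatically at startup. A minimal sketch using standard EAL/ethdev calls (warn-and-continue is an illustrative policy choice):
// Catch cross-NUMA misconfiguration before setting up queues
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_log.h>

static void check_numa_alignment(uint16_t port_id) {
    int nic_socket   = rte_eth_dev_socket_id(port_id);  /* -1 if unknown */
    int lcore_socket = (int)rte_socket_id();            /* socket of calling lcore */

    if (nic_socket >= 0 && nic_socket != lcore_socket)
        RTE_LOG(WARNING, USER1,
                "lcore %u is on socket %d but port %u is on socket %d — "
                "expect remote-memory penalties\n",
                rte_lcore_id(), lcore_socket, port_id, nic_socket);
}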
✅ Production checklist for 100G DPDK on a dual-socket server:
Isolate DPDK lcores with isolcpus → set performance governor → disable turbo → pre-allocate hugepages at boot → disable NUMA balancing → disable transparent hugepages → bind NIC to vfio-pci → verify all DPDK lcores and the mempool are on the same NUMA socket as the NIC.
BURST SIZE TUNING APPROACH
// Measure throughput vs latency at different burst sizes using rte_rdtsc()
uint64_t t0 = rte_rdtsc();
uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, burst_size);
uint64_t rx_cycles = rte_rdtsc() - t0;
uint64_t t1 = rte_rdtsc();
for (uint16_t i = 0; i < nb_rx; i++) process_packet(pkts[i]);
uint64_t proc_cycles = rte_rdtsc() - t1;
// Track per-packet cycle budget: proc_cycles / nb_rx
// At 3GHz, 6.7ns budget = 20 cycles per packet at 100G/64B
DPDK Testpmd — Built-In Benchmark Tool
dpdk-testpmd is DPDK's reference forwarding application for benchmarking NIC and PMD performance. Always establish a testpmd baseline before profiling your own application.
# Start testpmd in io forwarding mode (max throughput benchmark)
dpdk-testpmd -l 0-3 -n 4 -a 0000:03:00.0 -- \
--nb-cores=2 --rxq=2 --txq=2 \
--burst=32 --forward-mode=io \
--auto-start
# In testpmd CLI:
show port stats all # throughput + packet counts
show port xstats all # extended NIC counters (imissed, nombuf, etc.)
show fwd stats all # forwarding engine stats
clear port stats all # reset counters
KEY BENCHMARK METRICS
| Metric | Tool | Healthy Range | Concern |
|---|---|---|---|
| RX throughput (Mpps) | testpmd stats | Near line rate | More than 5% below line rate |
| imissed | show port xstats | 0 | Any non-zero = Rx ring full → NIC drops |
| rx_nombuf | show port xstats | 0 | Any non-zero = mbuf leak or pool too small |
| CPU utilization | top -H or htop | ~100% on DPDK lcores | Below ~100% = lcore being preempted or not polling — check CPU isolation |
| NUMA local memory % | numastat -p <pid> | >99% | High remote% = cross-NUMA allocation bug |
| Per-packet cycles | rte_rdtsc() delta / nb_rx | Depends on NF complexity | Compare to 100G budget: ~20 cycles/packet |
# Profile DPDK application with perf
perf stat -a -C 4,5,6,7 -e cycles,instructions,cache-misses,LLC-load-misses sleep 10
# (-a -C counts system-wide but only on the listed DPDK cores; sleep 10 sets the duration.
#  Note: perf stat ignores -C in per-process (-p) mode, so pin by CPU instead.)
# LLC-load-misses high → data not cache-resident → check NUMA alignment
# High IPC (instructions/cycle) → good — compute-bound, not memory-bound
# Low IPC → memory-bound → check hugepages, prefetching, NUMA
DPDK DEBUGGING TOOLKIT
Symptom: Application Stops Receiving Packets
Diagnosis steps (in order):
- Check stats.imissed — if non-zero: Rx ring full, application too slow. Increase nb_rx_desc or add lcores. (A stats-polling sketch follows this list.)
- Check stats.rx_nombuf — if non-zero: mempool exhausted. Find the mbuf leak.
- Check rte_mempool_avail_count(pool) over time — if it trends toward zero: leak confirmed.
- Check all code paths: every packet from rx_burst must eventually be rte_pktmbuf_free()'d or passed to tx_burst with the unsent tail freed.
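A minimal helper for the first two checks, using the standard rte_eth_stats_get() API (the USER1 logtype and once-per-second cadence are illustrative choices):
// Poll the two drop counters that distinguish "app too slow" (imissed)
// from "pool exhausted" (rx_nombuf). Call e.g. once per second.
#include <inttypes.h>
#include <rte_ethdev.h>
#include <rte_log.h>

static void log_drop_counters(uint16_t port_id) {
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port_id, &stats) != 0)
        return;
    if (stats.imissed > 0)
        RTE_LOG(WARNING, USER1, "port %u: imissed=%" PRIu64 " — Rx ring overflow\n",
                port_id, stats.imissed);
    if (stats.rx_nombuf > 0)
        RTE_LOG(WARNING, USER1, "port %u: rx_nombuf=%" PRIu64 " — mempool exhausted\n",
                port_id, stats.rx_nombuf);
}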
// Mbuf accounting helper — call periodically
#include <stdio.h>
#include <rte_mempool.h>
#include <rte_log.h>

void check_pool_health(struct rte_mempool *pool, const char *tag) {
    unsigned int avail = rte_mempool_avail_count(pool);
    unsigned int total = rte_mempool_in_use_count(pool) + avail;
    float used_pct = (float)(total - avail) * 100.0f / (float)total;

    printf("[%s] Pool avail: %u/%u (%.1f%% in use)\n", tag, avail, total, used_pct);
    if (used_pct > 90.0f)
        RTE_LOG(WARNING, USER1, "Pool nearly exhausted — check for mbuf leaks!\n");
}
DPDK LOGGING
// Log levels: EMERG(1) ALERT(2) CRIT(3) ERR(4) WARNING(5) NOTICE(6) INFO(7) DEBUG(8)
rte_log_set_level(RTE_LOGTYPE_USER1, RTE_LOG_DEBUG);
// Log from your application
RTE_LOG(INFO, USER1, "Port %u: %u packets received\n", port_id, nb_rx);
RTE_LOG(WARNING, USER1, "Tx ring full on port %u queue %u\n", port_id, queue_id);
RTE_LOG(ERR, USER1, "Mbuf pool exhausted: avail=%u\n", avail);
// Enable PMD debug logging at startup
// ./my_app --log-level=pmd:8 (debug for all PMDs)
// ./my_app --log-level=pmd.net.mlx5:8 (debug for mlx5 PMD only)
THE DPDK PRODUCTION PITFALL CATALOG
| # | Pitfall | Symptom | Root Cause | Fix |
|---|---|---|---|---|
| 1 | Mbuf leak | rx_nombuf increments; app stops receiving | tx_burst doesn't free unsent pkts; early-return without free | Always free pkts[nb_tx..n-1] after tx_burst; audit every return path |
| 2 | Non-power-of-2 workers | CPU load imbalance; throughput ceiling | RETA divided unevenly across workers | Always use power-of-2 worker count or manually program RETA |
| 3 | Cross-NUMA allocation | Lower throughput than testpmd baseline; high LLC misses | Mempool/ring/queue on wrong socket | Always use rte_eth_dev_socket_id(port) for pool and queue setup |
| 4 | Secondary calls pool_create | EEXIST error; secondary crashes at startup | Secondary tries to create pool already owned by primary | Always use rte_mempool_lookup() in secondary processes |
| 5 | Accessing mbuf after tx_burst | Random data corruption; segfaults | PMD frees mbuf asynchronously after tx_burst | Never access an mbuf after passing it to tx_burst |
| 6 | Hugepages not allocated at boot | EAL init fails; "Cannot reserve memory" error | Runtime hugepage allocation fails due to memory fragmentation | Always pre-allocate hugepages in kernel cmdline: hugepages=N |
| 7 | Small mempool + large descriptor ring | rx_nombuf immediately at startup | Pool smaller than ring × number of queues × burst_size | Pool size must be > (nb_rx_desc × nb_rx_queues × 2) — leave 2× headroom |
| 8 | Missing tx_free after ring full | Slow mbuf leak; intermittent rx_nombuf | tx_burst returns nb_tx < nb_pkts; caller doesn't free excess | Always check: if (nb_tx < nb_pkts) rte_pktmbuf_free_bulk(pkts+nb_tx, nb_pkts-nb_tx) |
| 9 | False sharing on per-lcore counters | Lower throughput than expected; perf shows cache-to-cache transfers | Per-lcore stats arrays not cache-line aligned | Use __rte_cache_aligned on per-lcore structs |
| 10 | CPU not isolated | High latency variance; p99 much higher than p50 | OS scheduler interrupts DPDK polling lcores | Add isolcpus=<dpdk-cores> to kernel cmdline |
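For pitfall #9, a minimal before/after sketch of per-lcore counters (the stats fields are illustrative):
// False sharing fix: give each lcore's counters their own cache line
#include <rte_common.h>   /* __rte_cache_aligned */
#include <rte_lcore.h>    /* RTE_MAX_LCORE */

/* BAD: 16-byte entries → 4 lcores share one 64-byte cache line, and every
 * counter increment bounces the line between cores (false sharing). */
struct lcore_stats_bad {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
};

/* GOOD: pad each lcore's slot to a full cache line. */
struct lcore_stats {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
} __rte_cache_aligned;

static struct lcore_stats stats[RTE_MAX_LCORE];
/* Each lcore writes only stats[rte_lcore_id()] — no line ever bounces;
 * a control thread aggregates by summing across the array. */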
⚠️ Pitfall #7 — Pool Sizing Formula: Minimum pool size = nb_rx_desc × nb_rx_queues + nb_tx_desc × nb_tx_queues + nb_lcores × cache_size + burst_size. Add a ~2× safety margin. Use 8191, not 8192: the optimum rte_mempool size is a power of two minus one, because of how the backing ring is sized internally.
Q: What is software prefetching in DPDK and how many packets ahead should you prefetch?
Software prefetching tells the CPU to load a cache line into cache before the data is needed, hiding the ~13 ns L3 access latency. The standard DPDK pattern prefetches 4 packets ahead: while processing packet i, issue rte_prefetch0(rte_pktmbuf_mtod(pkts[i+4], void *)). By the time processing of packets i+1 through i+3 completes, packet i+4's data is in L1 cache with near-zero access cost. Too small a distance (1-2) = the miss still hurts; too large (8+) = data is evicted before use.
Q: How do you diagnose and fix a high imissed counter?
imissed means the NIC dropped packets because the Rx ring had no empty descriptors — software was too slow to drain it. Diagnosis: confirm with rte_eth_stats_get() and observe it incrementing under load. Fixes (in order of impact): (1) Increase nb_rx_desc (bigger ring = more burst capacity); (2) Increase burst size so each rx_burst call drains more; (3) Reduce per-packet processing time; (4) Add more worker lcores via rte_distributor.
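A sketch of fix (1), assuming the port can be briefly stopped (4096 is an illustrative ring size; rte_eth_dev_adjust_nb_rx_tx_desc() clamps it to what the PMD supports):
// Grow the Rx ring so traffic bursts have more headroom before imissed.
// Queues can only be re-configured while the port is stopped.
#include <rte_ethdev.h>

static int grow_rx_ring(uint16_t port_id, uint16_t queue_id,
                        struct rte_mempool *pool) {
    uint16_t nb_rxd = 4096, nb_txd = 4096;   /* requested ring sizes */
    int ret;

    (void)rte_eth_dev_stop(port_id);
    ret = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);
    if (ret != 0)
        return ret;
    ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
                                 rte_eth_dev_socket_id(port_id),
                                 NULL /* default rx_conf */, pool);
    if (ret != 0)
        return ret;
    return rte_eth_dev_start(port_id);
}
Per pitfall #7, a larger ring also needs a proportionally larger mempool.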
Q: What is the DPDK pool sizing formula?
Minimum pool size = (nb_rx_desc × nb_rx_queues) + (nb_tx_desc × nb_tx_queues) + (nb_lcores × cache_size) + burst_size, multiplied by a safety margin of ~2×. The Rx descriptor slots need mbufs to refill; Tx descriptor slots hold mbufs until the NIC has sent them; each per-lcore cache keeps its own stash taken from the common pool. Example: 2 Rx queues × 1024 desc + 2 Tx queues × 1024 desc + 4 lcores × 256 cache + 32 burst ≈ 5,152 mbufs → round up with headroom to 8191 (2^13 − 1). Undersized pools cause rx_nombuf as soon as load arrives.
Q: What does CPU isolation (isolcpus) do and why does DPDK need it?
isolcpus=4-15 in the kernel boot parameters removes CPUs 4-15 from the OS scheduler's CPU pool. No kernel threads, IRQs, or user-space tasks will be scheduled on those CPUs without explicit affinity pinning. DPDK needs this because its polling loops must run continuously — even a 1 ms scheduler preemption loses ~148,000 packets at 100G/64B. With isolcpus, DPDK lcores run uninterrupted at 100% CPU consumption, which is intentional and correct.
Q: How do you find a mbuf leak in a DPDK application?
(1) Monitor rte_mempool_avail_count(pool) over time — a leak shows as a monotonic decrease toward zero. (2) Check stats.rx_nombuf — when it becomes non-zero, the pool is exhausted. (3) Audit every code path: every packet received via rx_burst must eventually be freed via rte_pktmbuf_free() or passed to tx_burst with unsent packets freed. Common sources: early return on error without freeing; tx_burst return value not checked; chained mbufs partially freed.
🔥 Lab 10: End-to-End URL Filter Dataplane Skeleton
Build a minimal version of the SASE-DP URL filter pipeline: RX → DNS extract → allow/block decision → TX or DROP. Apply all Phase 3 techniques.
1. Setup: Primary process owns the NIC, creates the pool and a distributor with 4 workers. Apply isolcpus tuning.
2. RX coordinator: rx_burst → set hash.usr = hash.rss → distributor_process()
3. Worker loop: distributor_get_pkt() → classify packet type (DNS/UDP/53, HTTP/TCP/80, HTTPS/TCP/443, other) → route to the matching processing function
4. DNS processing: parse the UDP payload → extract the queried domain name → check it against a simple blocked-domain hash table using rte_hash (see the sketch after this list)
5. Add prefetch: prefetch packet data 4 ahead in the worker loop
6. Add pool health monitoring: every 1M packets, log avail_count and verify there is no leak
7. Benchmark: run a dpdk-testpmd baseline, then your filter; compare throughput and imissed
8. Add an rte_flow rule: steer DNS traffic (UDP/53) to queue 0 in hardware — measure the CPU % reduction
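A minimal sketch for step 4's blocked-domain table, assuming fixed-size, already-lowercased domain keys (MAX_DOMAIN_LEN and the 4096-entry capacity are illustrative choices):
// Blocked-domain table on rte_hash: zero-padded fixed-length keys
#include <string.h>
#include <rte_hash.h>
#include <rte_jhash.h>

#define MAX_DOMAIN_LEN 64   /* illustrative fixed key size; pad/truncate domains */

static struct rte_hash *blocked_domains;

static int blocklist_init(int socket_id) {
    struct rte_hash_parameters params = {
        .name      = "blocked_domains",
        .entries   = 4096,               /* illustrative capacity */
        .key_len   = MAX_DOMAIN_LEN,
        .hash_func = rte_jhash,
        .socket_id = socket_id,          /* same socket as the workers */
    };
    blocked_domains = rte_hash_create(&params);
    return blocked_domains ? 0 : -1;
}

static void blocklist_add(const char *domain) {
    char key[MAX_DOMAIN_LEN] = {0};
    strncpy(key, domain, MAX_DOMAIN_LEN - 1);    /* zero-padded fixed key */
    rte_hash_add_key(blocked_domains, key);
}

/* Returns nonzero if the domain is blocked. */
static int blocklist_match(const char *domain) {
    char key[MAX_DOMAIN_LEN] = {0};
    strncpy(key, domain, MAX_DOMAIN_LEN - 1);
    return rte_hash_lookup(blocked_domains, key) >= 0;  /* >= 0: key found */
}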
FULL DPDK MASTERY CHECKLIST
Phase 1 — Foundation & Memory
- Explain 6 categories of kernel overhead and DPDK's solution for each
- Draw DPDK software stack from NIC hardware to user application
- Explain EAL init: hugepages, lcore pinning, PCI probe
- Explain hugepages: why needed, DMA stability, TLB efficiency
- Draw rte_mempool architecture: per-lcore cache + common ring
- Draw rte_mbuf layout: all key fields including buf_addr, data_off, pkt_len, ol_flags
- Explain rte_pktmbuf_mtod() — what it expands to, why it's zero-copy
Phase 2 — Core Mechanics
- Explain DD bit — what it is, why polling beats interrupts
- Draw Rx descriptor ring lifecycle (6 steps)
- Write safe tx_burst with unsent-packet free pattern
- List 10-step port configuration sequence in order
- Explain RSS: Toeplitz hash, RETA, symmetric key, power-of-2 requirement
- Draw rte_ring CAS protocol for MPMC enqueue
- Explain bulk vs burst semantics, SPSC vs MPMC tradeoffs
- Compare run-to-completion vs pipeline architectures
Phase 3 — Advanced & Production
- Explain primary/secondary model — who creates, who looks up, gotchas
- Write a complete rte_flow rule (pattern + action + validate + create)
- Explain NUMA remote access penalty and correct allocation pattern
- Explain false sharing and demonstrate __rte_cache_aligned fix
- Explain 4-packet prefetch pattern — what to prefetch, why 4 ahead
- Diagnose imissed vs rx_nombuf: different causes and different fixes
- Apply production tuning: isolcpus, performance governor, NUMA balancing off
- Identify and fix all 10 pitfalls in the production pitfall catalog