DPDK MASTERY · PHASE 3 OF 3 · MODULE B
Packet Patterns, Tuning & Debugging
Prefetching · batching · CPU isolation · hugepage sizing · benchmarking · pitfall diagnosis
Ch 16 — Packet Processing Patterns · Ch 17 — Performance Tuning · Ch 18 — Debugging & Pitfalls
Tools: C · perf · pktgen · SASE-DP · Weeks 13–14+

CANONICAL PACKET PROCESSING PATTERNS

Pattern 1: Receive → Process → Transmit (Basic RTC)

The simplest pattern. Each lcore handles one or more NIC queues. Good for stateless forwarding, filtering, and routing.
// Pattern 1: basic receive-process-transmit
static int
lcore_main(void *arg)
{
    uint16_t port = (uintptr_t)arg;
    uint16_t queue = rte_lcore_id();    /* one queue per lcore */
    struct rte_mbuf *pkts[BURST_SIZE];

    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
        if (unlikely(nb_rx == 0))
            continue;

        for (uint16_t i = 0; i < nb_rx; i++)
            process_packet(pkts[i]);

        uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, queue, pkts, nb_rx);

        /* tx_burst may send fewer than requested — free the unsent tail */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);
    }
    return 0;
}

Pattern 2: Batch Processing with Classification

Classify the entire burst first, then process each category. Better cache utilization — same code path runs on multiple packets before switching to the next path (instruction cache stays warm).
// Pattern 2: classify burst → process by type
// (assumes all packets are IPv4; a real classifier checks the ethertype first)
struct rte_mbuf *tcp_pkts[BURST_SIZE], *udp_pkts[BURST_SIZE], *other[BURST_SIZE];
uint16_t nb_tcp = 0, nb_udp = 0, nb_other = 0;

for (uint16_t i = 0; i < nb_rx; i++) {
    struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(pkts[i],
            struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
    if (ip->next_proto_id == IPPROTO_TCP)
        tcp_pkts[nb_tcp++] = pkts[i];
    else if (ip->next_proto_id == IPPROTO_UDP)
        udp_pkts[nb_udp++] = pkts[i];
    else
        other[nb_other++] = pkts[i];
}

process_tcp_batch(tcp_pkts, nb_tcp);    /* one code path, warm I-cache */
process_udp_batch(udp_pkts, nb_udp);
rte_pktmbuf_free_bulk(other, nb_other); /* drop unknown */

Software Prefetching — The 4-Packet Lookahead

At 100G with 64-byte packets, the CPU has ~6.7 ns per packet. Even a load served from L3 cache costs ~40 cycles (~13 ns at 3 GHz) — more than the entire packet budget — and a miss that goes all the way to DRAM costs several times that. Prefetching hides this latency by telling the CPU to fetch data for a future packet while it processes the current one.
📌 Prefetch distance: Typically 4 packets ahead. Too small = the cache miss still hurts. Too large = cache pollution (prefetched data is evicted before it is used). 4 is the DPDK convention, validated across the Intel ice (E810), i40e, and mlx5 drivers.
// 4-packet prefetch pattern — the standard DPDK technique
for (uint16_t i = 0; i < nb_rx; i++) {
    /* Prefetch 4 packets ahead: fetch the start of the packet data */
    if (i + 4 < nb_rx)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 4], void *));

    /* Process the current packet — the prefetch for i+4 is in flight */
    struct rte_ether_hdr *eth =
        rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
    /* ... process packet[i] ... */
}

// rte_prefetch0 = prefetch into L1 cache (highest priority)
// rte_prefetch1 = prefetch into L2 cache
// rte_prefetch2 = prefetch into L3 cache
// Use rte_prefetch0 for hot packet data — you will access it very soon.

What to Prefetch

  • Packet data: rte_pktmbuf_mtod(pkts[i+4], void*) — the Ethernet/IP/TCP header bytes
  • Flow table entry: if doing a hash lookup, prefetch the expected hash bucket for the next packet before doing the lookup for the current packet (see the sketch after this list)
  • mbuf metadata: rte_prefetch0(pkts[i+4]) — prefetch the mbuf struct itself if you access many fields
Do NOT prefetch unconditionally for every array position — only prefetch data you will actually access soon.
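As a concrete illustration of the flow-table bullet above, here is a minimal sketch of a two-stage lookup loop over a hypothetical power-of-two open-addressing table. flow_table, struct flow_entry, and FLOW_TABLE_MASK are illustrative names, not DPDK APIs; the NIC-computed RSS hash in mbuf->hash.rss is used as the index, which assumes RSS is enabled:

struct flow_entry {
    uint64_t key;
    uint32_t action;
};
extern struct flow_entry flow_table[];    /* hypothetical flow table */
#define FLOW_TABLE_MASK (65536 - 1)

/* While probing the bucket for packet i, start the fetch for packet
 * i+1's bucket so its memory latency overlaps useful work. (Packet 0's
 * probe is cold — a fuller version would prefetch it before the loop.) */
for (uint16_t i = 0; i < nb_rx; i++) {
    if (i + 1 < nb_rx) {
        uint32_t next_hash = pkts[i + 1]->hash.rss;
        rte_prefetch0(&flow_table[next_hash & FLOW_TABLE_MASK]);
    }
    struct flow_entry *e =
        &flow_table[pkts[i]->hash.rss & FLOW_TABLE_MASK];
    /* ... compare e->key against the packet's 5-tuple, apply e->action ... */
}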

SYSTEM-LEVEL TUNING FOR DPDK

Tuning Area | Command / Setting | Effect
CPU isolation | isolcpus=4-15 on the kernel cmdline | Removes CPUs 4-15 from the OS scheduler — dedicated to DPDK polling lcores
IRQ affinity | irqbalance --banirq=<irqs> or /proc/irq/N/smp_affinity | Moves NIC IRQs away from DPDK lcores (IRQs still fire on control-plane CPUs)
CPU frequency scaling | cpupower frequency-set -g performance | Disables P-states / frequency scaling — DPDK needs a consistent cycle budget
Turbo Boost | echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo | Disables turbo for consistent throughput (turbo causes frequency jumps → latency variance)
Hugepages at boot | hugepagesz=2M hugepages=2048 on the kernel cmdline | Pre-allocates pages at boot — avoids the fragmentation that makes runtime allocation fail
NUMA balancing | echo 0 > /proc/sys/kernel/numa_balancing | Disables automatic NUMA page migration — DPDK pages must stay pinned
Transparent hugepages | echo never > /sys/kernel/mm/transparent_hugepage/enabled | Disables THP — it can cause latency spikes when pages are promoted/demoted
Production checklist for 100G DPDK on a dual-socket server: isolate DPDK lcores with isolcpus → set the performance governor → disable turbo → pre-allocate hugepages at boot → disable NUMA balancing → disable transparent hugepages → bind the NIC to vfio-pci → verify that all DPDK lcores and the mempool are on the same NUMA socket as the NIC.
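Several of these settings can be applied together. A sketch of the corresponding boot-time and runtime configuration, assuming the core range and page counts from the table above (adjust both to your own topology):

# /etc/default/grub — boot-time settings from the table
GRUB_CMDLINE_LINUX="isolcpus=4-15 hugepagesz=2M hugepages=2048"
# apply with update-grub (or grub2-mkconfig) and reboot

# Runtime settings from the table, re-applied after every boot
cpupower frequency-set -g performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 0 > /proc/sys/kernel/numa_balancing
echo never > /sys/kernel/mm/transparent_hugepage/enabled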

BURST SIZE TUNING APPROACH

// Measure throughput vs. latency at different burst sizes using rte_rdtsc()
uint64_t t0 = rte_rdtsc();
uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, burst_size);
uint64_t rx_cycles = rte_rdtsc() - t0;

uint64_t t1 = rte_rdtsc();
for (uint16_t i = 0; i < nb_rx; i++)
    process_packet(pkts[i]);
uint64_t proc_cycles = rte_rdtsc() - t1;

// Track the per-packet cycle budget: proc_cycles / nb_rx
// At 3 GHz, the 6.7 ns budget at 100G/64B is ~20 cycles per packet

DPDK Testpmd — Built-In Benchmark Tool

dpdk-testpmd is DPDK's reference forwarding application for benchmarking NIC and PMD performance. Always establish a testpmd baseline before profiling your own application.
# Start testpmd in io forwarding mode (max-throughput benchmark)
dpdk-testpmd -l 0-3 -n 4 -a 0000:03:00.0 -- \
    --nb-cores=2 --rxq=2 --txq=2 \
    --burst=32 --forward-mode=io \
    --auto-start

# In the testpmd CLI:
show port stats all     # throughput + packet counts
show port xstats all    # extended NIC counters (imissed, rx_nombuf, etc.)
show fwd stats all      # forwarding-engine stats
clear port stats all    # reset counters

KEY BENCHMARK METRICS

Metric | Tool | Healthy Range | Concern
RX throughput (Mpps) | testpmd stats | Near line rate | More than 5% below line rate
imissed | show port xstats | 0 | Any non-zero value = Rx ring full → drops
rx_nombuf | show port xstats | 0 | Any non-zero value = mbuf leak or pool too small
CPU utilization | top -H or htop | ~100% on DPDK lcores | Below ~100% suggests the lcore is blocking or mis-pinned; 100% alone cannot distinguish idle polling from overload — check imissed
NUMA local memory % | numastat -p <pid> | >99% | A high remote % = cross-NUMA allocation bug
Per-packet cycles | rte_rdtsc() delta / nb_rx | Depends on NF complexity | Compare to the 100G budget: ~20 cycles/packet
# Profile a DPDK application with perf
perf stat -C 4,5,6,7 -e cycles,instructions,cache-misses,LLC-load-misses \
    -p $(pgrep my_dpdk_app) sleep 10

# High LLC-load-misses → data not cache-resident → check NUMA alignment
# High IPC (instructions/cycle) → good — compute-bound, not memory-bound
# Low IPC → memory-bound → check hugepages, prefetching, NUMA

DPDK DEBUGGING TOOLKIT

Symptom: Application Stops Receiving Packets

Diagnosis steps, in order (a helper that reads the counters from steps 1-2 follows the list):
  1. Check stats.imissed — if non-zero: ring full, application too slow. Increase nb_rx_desc or add lcores.
  2. Check stats.rx_nombuf — if non-zero: mempool exhausted. Find the mbuf leak.
  3. Check rte_mempool_avail_count(pool) over time — if it trends to zero: leak confirmed.
  4. Check all code paths: every rx_burst must eventually rte_pktmbuf_free() or tx_burst with free of unsent.
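A minimal sketch of steps 1-2 using rte_eth_stats_get(); port_id is assumed to be a valid, started port:

#include <inttypes.h>
#include <rte_ethdev.h>

/* Read the two tell-tale counters from diagnosis steps 1-2 */
static void
check_rx_drops(uint16_t port_id)
{
    struct rte_eth_stats st;

    if (rte_eth_stats_get(port_id, &st) != 0)
        return;
    if (st.imissed > 0)
        RTE_LOG(WARNING, USER1, "port %u: imissed=%" PRIu64
                " — Rx ring full, application too slow\n",
                port_id, st.imissed);
    if (st.rx_nombuf > 0)
        RTE_LOG(WARNING, USER1, "port %u: rx_nombuf=%" PRIu64
                " — mempool exhausted, check for leaks\n",
                port_id, st.rx_nombuf);
}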
// Mbuf accounting helper — call periodically
void
check_pool_health(struct rte_mempool *pool, const char *tag)
{
    unsigned int avail  = rte_mempool_avail_count(pool);
    unsigned int in_use = rte_mempool_in_use_count(pool);
    unsigned int total  = in_use + avail;
    float used_pct = (float)in_use * 100.0f / (float)total;

    printf("[%s] Pool avail: %u/%u (%.1f%% in use)\n",
           tag, avail, total, used_pct);
    if (used_pct > 90.0f)
        RTE_LOG(WARNING, USER1,
                "Pool nearly exhausted — check for mbuf leaks!\n");
}

DPDK LOGGING

// Log levels: EMERG(1) ALERT(2) CRIT(3) ERR(4) WARNING(5) NOTICE(6) INFO(7) DEBUG(8)
rte_log_set_level(RTE_LOGTYPE_USER1, RTE_LOG_DEBUG);

// Log from your application
RTE_LOG(INFO, USER1, "Port %u: %u packets received\n", port_id, nb_rx);
RTE_LOG(WARNING, USER1, "Tx ring full on port %u queue %u\n", port_id, queue_id);
RTE_LOG(ERR, USER1, "Mbuf pool exhausted: avail=%u\n", avail);

// Enable PMD debug logging at startup:
//   ./my_app --log-level=pmd:8            (debug for all PMDs)
//   ./my_app --log-level=pmd.net.mlx5:8   (debug for the mlx5 PMD only)

THE DPDK PRODUCTION PITFALL CATALOG

# | Pitfall | Symptom | Root Cause | Fix
1 | Mbuf leak | rx_nombuf increments; app stops receiving | tx_burst doesn't free unsent pkts; early return without free | Always free pkts[nb_tx..n-1] after tx_burst; audit every return path
2 | Non-power-of-2 workers | CPU load imbalance; throughput ceiling | RSS RETA divided unevenly across workers | Use a power-of-2 worker count or program the RETA manually
3 | Cross-NUMA allocation | Lower throughput than the testpmd baseline; high LLC misses | Mempool/ring/queue on the wrong socket | Always use rte_eth_dev_socket_id(port) for pool and queue setup
4 | Secondary calls pool_create | EEXIST error; secondary crashes at startup | The secondary tries to create a pool already owned by the primary | Always use rte_mempool_lookup() in secondary processes
5 | Accessing mbuf after tx_burst | Random data corruption; segfaults | The PMD frees the mbuf asynchronously after tx_burst | Never touch an mbuf after handing it to tx_burst
6 | Hugepages not allocated at boot | EAL init fails: "Cannot reserve memory" | Runtime hugepage allocation fails due to memory fragmentation | Always pre-allocate hugepages on the kernel cmdline: hugepages=N
7 | Small mempool + large descriptor ring | rx_nombuf immediately at startup | Pool smaller than the total descriptors across all queues plus in-flight mbufs | Size the pool > (nb_rx_desc × nb_rx_queues × 2) — leave 2× headroom
8 | Missing Tx free after ring full | Slow mbuf leak; intermittent rx_nombuf | tx_burst returns nb_tx < nb_pkts; the caller doesn't free the excess | Always check: if (nb_tx < nb_pkts) rte_pktmbuf_free_bulk(pkts + nb_tx, nb_pkts - nb_tx)
9 | False sharing on per-lcore counters | Lower throughput than expected; perf shows cache-to-cache transfers | Per-lcore stats arrays not cache-line aligned | Use __rte_cache_aligned on per-lcore structs
10 | CPU not isolated | High latency variance; p99 much higher than p50 | The OS scheduler preempts DPDK polling lcores | Add isolcpus=<dpdk-cores> to the kernel cmdline
⚠️ Pitfall #7 — Pool Sizing Formula: Minimum pool size = nb_rx_desc × nb_rx_queues + nb_tx_desc × nb_tx_queues + nb_lcores × cache_size + burst_size. Add a 2× safety margin. Prefer a power-of-two-minus-one count such as 8191 rather than 8192 — the mempool's underlying ring is sized up to the next power of two, so 2^q − 1 objects waste the least ring memory.
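Pitfalls 1, 8, and 9 compose into one small helper. A minimal sketch (send_burst and struct lcore_stats are illustrative names): every tx_burst is followed by a bulk free of whatever the ring refused, and the per-lcore counters are cache-line aligned so no two lcores write the same line:

#include <rte_branch_prediction.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Pitfall 9 fix: pad per-lcore counters to a full cache line */
struct lcore_stats {
    uint64_t tx_pkts;
    uint64_t tx_drops;
} __rte_cache_aligned;
static struct lcore_stats stats[RTE_MAX_LCORE];

/* Pitfalls 1 and 8 fix: never leak what tx_burst leaves behind */
static inline void
send_burst(uint16_t port, uint16_t queue, struct rte_mbuf **pkts,
           uint16_t nb_pkts)
{
    uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_pkts);

    stats[rte_lcore_id()].tx_pkts += nb_tx;
    if (unlikely(nb_tx < nb_pkts)) {
        /* Tx ring full — free the unsent tail or it leaks */
        rte_pktmbuf_free_bulk(pkts + nb_tx, nb_pkts - nb_tx);
        stats[rte_lcore_id()].tx_drops += nb_pkts - nb_tx;
    }
}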

Q: What is software prefetching in DPDK and how many packets ahead should you prefetch?

Software prefetching tells the CPU to load a cache line into cache before the data is needed, hiding memory latency that would otherwise stall the pipeline (~13 ns even for a load served from L3). The standard DPDK pattern prefetches 4 packets ahead: while processing packet i, issue rte_prefetch0(rte_pktmbuf_mtod(pkts[i+4], void*)). By the time packets i+1 through i+3 have been processed, packet i+4's data is in L1 cache at near-zero access cost. Too small a distance (1-2) = the miss still hurts; too large (8+) = data is evicted before use.

Q: How do you diagnose and fix high imissed counter?

imissed means the NIC dropped packets because the Rx ring had no empty descriptors — software was too slow to drain it. Diagnosis: confirm with rte_eth_stats_get() and observe it incrementing under load. Fixes (in order of impact): (1) Increase nb_rx_desc (bigger ring = more burst capacity); (2) Increase burst size so each rx_burst call drains more; (3) Reduce per-packet processing time; (4) Add more worker lcores via rte_distributor.
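Fix (1) is a change at queue-setup time. A sketch, assuming the usual init flow; rte_eth_dev_adjust_nb_rx_tx_desc() clamps the requested counts to the PMD's limits:

uint16_t nb_rxd = 4096;   /* was e.g. 1024 — a larger ring absorbs bursts */
uint16_t nb_txd = 4096;

rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);

/* NB: a bigger ring needs a bigger mempool — see the sizing formula below */
rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
                       rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);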

Q: What is the DPDK pool sizing formula?

Minimum pool size = (nb_rx_desc × nb_rx_queues) + (nb_tx_desc × nb_tx_queues) + (nb_lcores × cache_size) + burst_size, multiplied by a safety margin of ~2×. The Rx descriptor slots must always have mbufs available for refill; Tx descriptor slots hold mbufs until the NIC has sent them; each lcore's mempool cache holds up to cache_size objects taken from the common pool. An undersized pool causes rx_nombuf as soon as load arrives.
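A worked example with illustrative numbers — 2 Rx and 2 Tx queues of 512 descriptors, 4 lcores, cache_size 256, burst 32:

/* (512 × 2) + (512 × 2) + (4 × 256) + 32 = 3104 mbufs minimum
 * × 2 safety margin                      ≈ 6208
 * round up to 2^q − 1                    → 8191                */
#define NUM_MBUFS 8191

struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", NUM_MBUFS, 256 /* cache_size */,
        0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());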

Q: What does CPU isolation (isolcpus) do and why does DPDK need it?

isolcpus=4-15 in the kernel boot parameters removes CPUs 4-15 from the OS scheduler's CPU pool. No kernel threads, IRQs, or user-space tasks will be scheduled on those CPUs without explicit affinity pinning. DPDK needs this because its polling loops must run continuously — even a 1 ms scheduler preemption loses ~148,000 packets at 100G/64B. With isolcpus, DPDK lcores run uninterrupted at 100% CPU consumption, which is intentional and correct.

Q: How do you find a mbuf leak in a DPDK application?

(1) Monitor rte_mempool_avail_count(pool) over time — a leak shows as a monotonic decrease toward zero. (2) Check stats.rx_nombuf — when it becomes non-zero, the pool is exhausted. (3) Audit every code path: every packet received via rx_burst must eventually be freed via rte_pktmbuf_free() or passed to tx_burst with unsent packets freed. Common sources: early return on error without freeing; tx_burst return value not checked; chained mbufs partially freed.
🔥 Lab 10: End-to-End URL Filter Dataplane Skeleton

Build a minimal version of the SASE-DP URL filter pipeline: RX → DNS extract → allow/block decision → TX or DROP. Apply all Phase 3 techniques.

1. Setup: the primary process owns the NIC and creates the pool and a distributor with 4 workers. Apply the isolcpus tuning.
2. RX coordinator: rx_burst → set hash.usr = hash.rss → distributor_process()
3. Worker loop: distributor_get_pkt() → classify the packet type (DNS/UDP/53, HTTP/TCP/80, HTTPS/TCP/443, other) → route to the matching processing function
4. DNS processing: parse the UDP payload → extract the queried domain name → check it against a simple blocked-domain hash table using rte_hash (see the sketch after this list)
5. Add prefetch: prefetch packet data 4 ahead in the worker loop
6. Add pool health monitoring: every 1M packets, log avail_count and verify there is no leak
7. Benchmark: run a dpdk-testpmd baseline, then your filter; compare throughput and imissed
8. Add an rte_flow rule: steer DNS traffic (UDP/53) to queue 0 in hardware — measure the CPU % reduction
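For step 4, a minimal sketch of the blocked-domain table using the rte_hash API. The fixed-size, zero-padded domain key and the table size are illustrative choices, not part of the lab spec:

#include <stdbool.h>
#include <string.h>
#include <rte_hash.h>
#include <rte_jhash.h>

#define DOMAIN_KEY_LEN 64   /* fixed-size, zero-padded domain key */

static struct rte_hash *
create_blocklist(void)
{
    struct rte_hash_parameters params = {
        .name = "blocked_domains",
        .entries = 4096,
        .key_len = DOMAIN_KEY_LEN,
        .hash_func = rte_jhash,
        .socket_id = rte_socket_id(),
    };
    return rte_hash_create(&params);
}

static int
block_domain(struct rte_hash *h, const char *domain)
{
    char key[DOMAIN_KEY_LEN] = {0};
    strncpy(key, domain, DOMAIN_KEY_LEN - 1);
    return rte_hash_add_key(h, key);        /* >= 0 on success */
}

/* In the DNS worker: drop the query if its domain is in the table */
static bool
is_blocked(struct rte_hash *h, const char *domain)
{
    char key[DOMAIN_KEY_LEN] = {0};
    strncpy(key, domain, DOMAIN_KEY_LEN - 1);
    return rte_hash_lookup(h, key) >= 0;    /* negative = not found */
}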

FULL DPDK MASTERY CHECKLIST

Phase 1 — Foundation & Memory
Phase 2 — Core Mechanics
Phase 3 — Advanced & Production