DPDK MASTERY · PHASE 3 OF 3 · MODULE B
Packet Patterns, Tuning & Debugging
Prefetching · batching · CPU isolation · hugepage sizing · benchmarking · pitfall diagnosis
Ch 16 — Packet Processing Patterns
Ch 17 — Performance Tuning
Ch 18 — Debugging & Pitfalls
C · perf · pktgen · SASE-DP
Weeks 13–14+
CANONICAL PACKET PROCESSING PATTERNS
Pattern 1: Receive → Process → Transmit (Basic RTC)
The simplest pattern: each lcore handles one or more NIC queues. Good for stateless forwarding, filtering, and routing.
// Pattern 1: Basic receive-process-transmit
static int lcore_main(void *arg) {
    uint16_t port = (uintptr_t)arg;
    uint16_t queue = rte_lcore_id();   /* assumes lcore ids map 1:1 to queue ids */
    struct rte_mbuf *pkts[BURST_SIZE];

    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
        if (unlikely(nb_rx == 0))
            continue;
        for (uint16_t i = 0; i < nb_rx; i++)
            process_packet(pkts[i]);
        /* forward to the paired port (0↔1) */
        uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, queue, pkts, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);  /* free packets the NIC did not accept */
    }
    return 0;
}
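A possible launch sketch for this pattern, assuming EAL init, port setup, and one Rx/Tx queue per worker are already done (polling port 0 on every worker is an illustrative choice):
// Launch lcore_main on every worker lcore, then wait for them to exit
unsigned int lcore_id;
uint16_t port = 0;                         /* illustrative: all workers poll port 0 */
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    rte_eal_remote_launch(lcore_main, (void *)(uintptr_t)port, lcore_id);
}
rte_eal_mp_wait_lcore();                   /* block until all workers return */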
Pattern 2: Batch Processing with Classification
Classify the entire burst first, then process each category. This improves cache utilization: the same code path runs on many packets before switching to the next path, so the instruction cache stays warm.
// Pattern 2: classify burst → process by type
struct rte_mbuf *tcp_pkts[BURST_SIZE], *udp_pkts[BURST_SIZE], *other[BURST_SIZE];
uint16_t nb_tcp = 0, nb_udp = 0, nb_other = 0;

for (uint16_t i = 0; i < nb_rx; i++) {
    /* assumes IPv4 traffic; a real classifier checks ether_type first */
    struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(pkts[i], struct rte_ipv4_hdr *,
                                                      sizeof(struct rte_ether_hdr));
    if (ip->next_proto_id == IPPROTO_TCP)
        tcp_pkts[nb_tcp++] = pkts[i];
    else if (ip->next_proto_id == IPPROTO_UDP)
        udp_pkts[nb_udp++] = pkts[i];
    else
        other[nb_other++] = pkts[i];
}
process_tcp_batch(tcp_pkts, nb_tcp);      /* one code path, warm I-cache */
process_udp_batch(udp_pkts, nb_udp);
rte_pktmbuf_free_bulk(other, nb_other);   /* drop unknown */
Software Prefetching — The 4-Packet Lookahead
At 100G with 64B packets, the CPU has ~6.7 ns per packet. A cache miss served from L3 costs ~40 cycles (~13 ns at 3 GHz) — more than the entire packet budget. Prefetching hides this latency by telling the CPU to fetch data for a future packet while it processes the current one.
📌 Prefetch distance: typically 4 packets ahead. Too small = the cache miss still hurts; too large = cache pollution (prefetched data is evicted before it is used). 4 is the DPDK convention, validated across Intel E810, i40e, and mlx5 drivers.
// 4-packet prefetch pattern — the DPDK standard technique
for (uint16_t i = 0; i < nb_rx; i++) {
    /* Prefetch 4 packets ahead: fetch the packet data (headers) into cache */
    if (i + 4 < nb_rx)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 4], void *));
    /* Process current packet — the prefetch for i+4 is in flight */
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
    /* ... process packet[i] ... */
}
// rte_prefetch0 = prefetch to L1 cache (highest priority)
// rte_prefetch1 = prefetch to L2 cache
// rte_prefetch2 = prefetch to L3 cache
// Use prefetch0 for hot packet data — you'll access it very soon
What to Prefetch
- Packet data: rte_pktmbuf_mtod(pkts[i+4], void *) — the Ethernet/IP/TCP header bytes
- Flow table entry: if doing a hash lookup, prefetch the expected hash bucket for the next packet before doing the lookup for the current one (see the sketch after this list)
- mbuf metadata: rte_prefetch0(pkts[i+4]) — prefetch the mbuf struct itself if you access many of its fields
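A minimal sketch of the flow-table variant, assuming a hypothetical open-addressed table indexed by the low bits of the NIC-computed RSS hash (flow_table, flow_entry, and FLOW_TABLE_SIZE are illustrative names, not DPDK APIs):
// Prefetch the NEXT packet's flow bucket while touching the current one
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define FLOW_TABLE_SIZE 65536                /* power of two → cheap masking */
#define FLOW_TABLE_MASK (FLOW_TABLE_SIZE - 1)

struct flow_entry {                          /* illustrative entry */
    uint32_t hash;
    uint32_t pkt_count;
};
static struct flow_entry flow_table[FLOW_TABLE_SIZE];

static void count_flows(struct rte_mbuf **pkts, uint16_t nb_rx) {
    for (uint16_t i = 0; i < nb_rx; i++) {
        /* The NIC already wrote the RSS hash into the mbuf, so the next
         * packet's bucket address is known without touching its headers. */
        if (i + 1 < nb_rx)
            rte_prefetch0(&flow_table[pkts[i + 1]->hash.rss & FLOW_TABLE_MASK]);
        /* Current packet's bucket — prefetched on the previous iteration. */
        struct flow_entry *fe = &flow_table[pkts[i]->hash.rss & FLOW_TABLE_MASK];
        fe->hash = pkts[i]->hash.rss;
        fe->pkt_count++;
    }
}
A 1-packet lookahead suffices here because the bucket address depends only on the RSS hash already delivered by the NIC; deeper pipelines may prefetch further ahead.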
SYSTEM-LEVEL TUNING FOR DPDK
| Tuning Area | Command / Setting | Effect |
|---|---|---|
| CPU isolation | isolcpus=4-15 in kernel cmdline | Removes CPUs 4-15 from the OS scheduler — dedicated to DPDK polling lcores |
| IRQ affinity | irqbalance --banirq=<irqs> or /proc/irq/N/smp_affinity | Move NIC IRQs away from DPDK lcores (IRQs still fire on control plane CPUs) |
| CPU frequency scaling | cpupower frequency-set -g performance | Disable P-states / frequency scaling — DPDK needs consistent cycle budget |
| Turbo Boost | echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo | Disable turbo for consistent throughput (turbo causes frequency jumps → latency variance) |
| Hugepages at boot | hugepagesz=2M hugepages=2048 in kernel cmdline | Pre-allocate at boot — avoids fragmentation that makes runtime allocation fail |
| NUMA balancing | echo 0 > /proc/sys/kernel/numa_balancing | Disable automatic NUMA page migration — DPDK pages must stay pinned |
| Transparent hugepages | echo never > /sys/kernel/mm/transparent_hugepage/enabled | Disable THP — it can cause latency spikes when pages are promoted/demoted |
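The NUMA rows above can also be verified programmatically at startup. A minimal sketch using standard EAL/ethdev calls (warn-and-continue is an illustrative policy choice):
// Catch cross-NUMA misconfiguration before setting up queues
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_log.h>

static void check_numa_alignment(uint16_t port_id) {
    int nic_socket   = rte_eth_dev_socket_id(port_id);  /* -1 if unknown */
    int lcore_socket = (int)rte_socket_id();            /* socket of calling lcore */

    if (nic_socket >= 0 && nic_socket != lcore_socket)
        RTE_LOG(WARNING, USER1,
                "lcore %u is on socket %d but port %u is on socket %d — "
                "expect remote-memory penalties\n",
                rte_lcore_id(), lcore_socket, port_id, nic_socket);
}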
✅ Production checklist for 100G DPDK on a dual-socket server:
Isolate DPDK lcores with isolcpus → set performance governor → disable turbo → pre-allocate hugepages at boot → disable NUMA balancing → disable transparent hugepages → bind NIC to vfio-pci → verify all DPDK lcores and the mempool are on the same NUMA socket as the NIC.
BURST SIZE TUNING APPROACH
// Measure throughput vs latency at different burst sizes using rte_rdtsc()
uint64_t t0 = rte_rdtsc();
uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, burst_size);
uint64_t rx_cycles = rte_rdtsc() - t0;
uint64_t t1 = rte_rdtsc();
for (uint16_t i = 0; i < nb_rx; i++) process_packet(pkts[i]);
uint64_t proc_cycles = rte_rdtsc() - t1;
// Track per-packet cycle budget: proc_cycles / nb_rx
// At 3GHz, 6.7ns budget = 20 cycles per packet at 100G/64B
DPDK Testpmd — Built-In Benchmark Tool
dpdk-testpmd is DPDK's reference forwarding application for benchmarking NIC and PMD performance. Always establish a testpmd baseline before profiling your own application.
# Start testpmd in io forwarding mode (max throughput benchmark)
dpdk-testpmd -l 0-3 -n 4 -a 0000:03:00.0 -- \
--nb-cores=2 --rxq=2 --txq=2 \
--burst=32 --forward-mode=io \
--auto-start
# In testpmd CLI:
show port stats all # throughput + packet counts
show port xstats all # extended NIC counters (imissed, nombuf, etc.)
show fwd stats all # forwarding engine stats
clear port stats all # reset counters
KEY BENCHMARK METRICS
| Metric | Tool | Healthy Range | Concern |
|---|---|---|---|
| RX throughput (Mpps) | testpmd stats | Near line rate | More than 5% below line rate |
| imissed | show port xstats | 0 | Any non-zero = Rx ring full → NIC drops |
| rx_nombuf | show port xstats | 0 | Any non-zero = mbuf leak or pool too small |
| CPU utilization | top -H or htop | ~100% on DPDK lcores | Below ~100% = lcore being preempted or not polling — check CPU isolation |
| NUMA local memory % | numastat -p <pid> | >99% | High remote% = cross-NUMA allocation bug |
| Per-packet cycles | rte_rdtsc() delta / nb_rx | Depends on NF complexity | Compare to 100G budget: ~20 cycles/packet |
# Profile DPDK application with perf
perf stat -a -C 4,5,6,7 -e cycles,instructions,cache-misses,LLC-load-misses sleep 10
# (-a -C counts system-wide but only on the listed DPDK cores; sleep 10 sets the duration.
#  Note: perf stat ignores -C in per-process (-p) mode, so pin by CPU instead.)
# LLC-load-misses high → data not cache-resident → check NUMA alignment
# High IPC (instructions/cycle) → good — compute-bound, not memory-bound
# Low IPC → memory-bound → check hugepages, prefetching, NUMA
DPDK DEBUGGING TOOLKIT
Symptom: Application Stops Receiving Packets
Diagnosis steps (in order):
- Check stats.imissed — if non-zero: Rx ring full, application too slow. Increase nb_rx_desc or add lcores. (A stats-polling sketch follows this list.)
- Check stats.rx_nombuf — if non-zero: mempool exhausted. Find the mbuf leak.
- Check rte_mempool_avail_count(pool) over time — if it trends toward zero: leak confirmed.
- Check all code paths: every packet from rx_burst must eventually be rte_pktmbuf_free()'d or passed to tx_burst with the unsent tail freed.
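A minimal helper for the first two checks, using the standard rte_eth_stats_get() API (the USER1 logtype and once-per-second cadence are illustrative choices):
// Poll the two drop counters that distinguish "app too slow" (imissed)
// from "pool exhausted" (rx_nombuf). Call e.g. once per second.
#include <inttypes.h>
#include <rte_ethdev.h>
#include <rte_log.h>

static void log_drop_counters(uint16_t port_id) {
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port_id, &stats) != 0)
        return;
    if (stats.imissed > 0)
        RTE_LOG(WARNING, USER1, "port %u: imissed=%" PRIu64 " — Rx ring overflow\n",
                port_id, stats.imissed);
    if (stats.rx_nombuf > 0)
        RTE_LOG(WARNING, USER1, "port %u: rx_nombuf=%" PRIu64 " — mempool exhausted\n",
                port_id, stats.rx_nombuf);
}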
// Mbuf accounting helper — call periodically
#include <stdio.h>
#include <rte_mempool.h>
#include <rte_log.h>

void check_pool_health(struct rte_mempool *pool, const char *tag) {
    unsigned int avail = rte_mempool_avail_count(pool);
    unsigned int total = rte_mempool_in_use_count(pool) + avail;
    float used_pct = (float)(total - avail) * 100.0f / (float)total;

    printf("[%s] Pool avail: %u/%u (%.1f%% in use)\n", tag, avail, total, used_pct);
    if (used_pct > 90.0f)
        RTE_LOG(WARNING, USER1, "Pool nearly exhausted — check for mbuf leaks!\n");
}
DPDK LOGGING
// Log levels: EMERG(1) ALERT(2) CRIT(3) ERR(4) WARNING(5) NOTICE(6) INFO(7) DEBUG(8)
rte_log_set_level(RTE_LOGTYPE_USER1, RTE_LOG_DEBUG);
// Log from your application
RTE_LOG(INFO, USER1, "Port %u: %u packets received\n", port_id, nb_rx);
RTE_LOG(WARNING, USER1, "Tx ring full on port %u queue %u\n", port_id, queue_id);
RTE_LOG(ERR, USER1, "Mbuf pool exhausted: avail=%u\n", avail);
// Enable PMD debug logging at startup
// ./my_app --log-level=pmd:8 (debug for all PMDs)
// ./my_app --log-level=pmd.net.mlx5:8 (debug for mlx5 PMD only)
THE DPDK PRODUCTION PITFALL CATALOG
| # | Pitfall | Symptom | Root Cause | Fix |
|---|---|---|---|---|
| 1 | Mbuf leak | rx_nombuf increments; app stops receiving | tx_burst doesn't free unsent pkts; early-return without free | Always free pkts[nb_tx..n-1] after tx_burst; audit every return path |
| 2 | Non-power-of-2 workers | CPU load imbalance; throughput ceiling | RETA divided unevenly across workers | Always use power-of-2 worker count or manually program RETA |
| 3 | Cross-NUMA allocation | Lower throughput than testpmd baseline; high LLC misses | Mempool/ring/queue on wrong socket | Always use rte_eth_dev_socket_id(port) for pool and queue setup |
| 4 | Secondary calls pool_create | EEXIST error; secondary crashes at startup | Secondary tries to create pool already owned by primary | Always use rte_mempool_lookup() in secondary processes |
| 5 | Accessing mbuf after tx_burst | Random data corruption; segfaults | PMD frees mbuf asynchronously after tx_burst | Never access an mbuf after passing it to tx_burst |
| 6 | Hugepages not allocated at boot | EAL init fails; "Cannot reserve memory" error | Runtime hugepage allocation fails due to memory fragmentation | Always pre-allocate hugepages in kernel cmdline: hugepages=N |
| 7 | Small mempool + large descriptor ring | rx_nombuf immediately at startup | Pool smaller than ring × number of queues × burst_size | Pool size must be > (nb_rx_desc × nb_rx_queues × 2) — leave 2× headroom |
| 8 | Missing tx_free after ring full | Slow mbuf leak; intermittent rx_nombuf | tx_burst returns nb_tx < nb_pkts; caller doesn't free excess | Always check: if (nb_tx < nb_pkts) rte_pktmbuf_free_bulk(pkts+nb_tx, nb_pkts-nb_tx) |
| 9 | False sharing on per-lcore counters | Lower throughput than expected; perf shows cache-to-cache transfers | Per-lcore stats arrays not cache-line aligned | Use __rte_cache_aligned on per-lcore structs |
| 10 | CPU not isolated | High latency variance; p99 much higher than p50 | OS scheduler interrupts DPDK polling lcores | Add isolcpus=<dpdk-cores> to kernel cmdline |
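For pitfall #9, a minimal before/after sketch of per-lcore counters (the stats fields are illustrative):
// False sharing fix: give each lcore's counters their own cache line
#include <rte_common.h>   /* __rte_cache_aligned */
#include <rte_lcore.h>    /* RTE_MAX_LCORE */

/* BAD: 16-byte entries → 4 lcores share one 64-byte cache line, and every
 * counter increment bounces the line between cores (false sharing). */
struct lcore_stats_bad {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
};

/* GOOD: pad each lcore's slot to a full cache line. */
struct lcore_stats {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
} __rte_cache_aligned;

static struct lcore_stats stats[RTE_MAX_LCORE];
/* Each lcore writes only stats[rte_lcore_id()] — no line ever bounces;
 * a control thread aggregates by summing across the array. */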
⚠️ Pitfall #7 — Pool Sizing Formula: Minimum pool size = nb_rx_desc × nb_rx_queues + nb_tx_desc × nb_tx_queues + nb_lcores × cache_size + burst_size. Add a ~2× safety margin. Use 8191, not 8192: the optimum rte_mempool size is a power of two minus one, because of how the backing ring is sized internally.
Q: What is software prefetching in DPDK and how many packets ahead should you prefetch?
Software prefetching tells the CPU to load a cache line into cache before the data is needed, hiding the ~13 ns L3 access latency. The standard DPDK pattern prefetches 4 packets ahead: while processing packet i, issue rte_prefetch0(rte_pktmbuf_mtod(pkts[i+4], void *)). By the time processing of packets i+1 through i+3 completes, packet i+4's data is in L1 cache with near-zero access cost. Too small a distance (1-2) = the miss still hurts; too large (8+) = data is evicted before use.
Q: How do you diagnose and fix a high imissed counter?
imissed means the NIC dropped packets because the Rx ring had no empty descriptors — software was too slow to drain it. Diagnosis: confirm with rte_eth_stats_get() and observe it incrementing under load. Fixes (in order of impact): (1) Increase nb_rx_desc (bigger ring = more burst capacity); (2) Increase burst size so each rx_burst call drains more; (3) Reduce per-packet processing time; (4) Add more worker lcores via rte_distributor.
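A sketch of fix (1), assuming the port can be briefly stopped (4096 is an illustrative ring size; rte_eth_dev_adjust_nb_rx_tx_desc() clamps it to what the PMD supports):
// Grow the Rx ring so traffic bursts have more headroom before imissed.
// Queues can only be re-configured while the port is stopped.
#include <rte_ethdev.h>

static int grow_rx_ring(uint16_t port_id, uint16_t queue_id,
                        struct rte_mempool *pool) {
    uint16_t nb_rxd = 4096, nb_txd = 4096;   /* requested ring sizes */
    int ret;

    (void)rte_eth_dev_stop(port_id);
    ret = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);
    if (ret != 0)
        return ret;
    ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
                                 rte_eth_dev_socket_id(port_id),
                                 NULL /* default rx_conf */, pool);
    if (ret != 0)
        return ret;
    return rte_eth_dev_start(port_id);
}
Per pitfall #7, a larger ring also needs a proportionally larger mempool.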
Q: What is the DPDK pool sizing formula?
Minimum pool size = (nb_rx_desc × nb_rx_queues) + (nb_tx_desc × nb_tx_queues) + (nb_lcores × cache_size) + burst_size, multiplied by a safety margin of ~2×. The Rx descriptor slots need mbufs to refill; Tx descriptor slots hold mbufs until the NIC has sent them; each per-lcore cache keeps its own stash taken from the common pool. Example: 2 Rx queues × 1024 desc + 2 Tx queues × 1024 desc + 4 lcores × 256 cache + 32 burst ≈ 5,152 mbufs → round up with headroom to 8191 (2^13 − 1). Undersized pools cause rx_nombuf as soon as load arrives.
Q: What does CPU isolation (isolcpus) do and why does DPDK need it?
isolcpus=4-15 in the kernel boot parameters removes CPUs 4-15 from the OS scheduler's CPU pool. No kernel threads, IRQs, or user-space tasks will be scheduled on those CPUs without explicit affinity pinning. DPDK needs this because its polling loops must run continuously — even a 1 ms scheduler preemption loses ~148,000 packets at 100G/64B. With isolcpus, DPDK lcores run uninterrupted at 100% CPU consumption, which is intentional and correct.
Q: How do you find a mbuf leak in a DPDK application?
(1) Monitor rte_mempool_avail_count(pool) over time — a leak shows as a monotonic decrease toward zero. (2) Check stats.rx_nombuf — when it becomes non-zero, the pool is exhausted. (3) Audit every code path: every packet received via rx_burst must eventually be freed via rte_pktmbuf_free() or passed to tx_burst with unsent packets freed. Common sources: early return on error without freeing; tx_burst return value not checked; chained mbufs partially freed.
🔥 Lab 10: End-to-End URL Filter Dataplane Skeleton
Build a minimal version of the SASE-DP URL filter pipeline: RX → DNS extract → allow/block decision → TX or DROP. Apply all Phase 3 techniques.
1. Setup: Primary process owns the NIC, creates the pool and a distributor with 4 workers. Apply isolcpus tuning.
2. RX coordinator: rx_burst → set hash.usr = hash.rss → distributor_process()
3. Worker loop: distributor_get_pkt() → classify packet type (DNS/UDP/53, HTTP/TCP/80, HTTPS/TCP/443, other) → route to the matching processing function
4. DNS processing: parse the UDP payload → extract the queried domain name → check it against a simple blocked-domain hash table using rte_hash (see the sketch after this list)
5. Add prefetch: prefetch packet data 4 ahead in the worker loop
6. Add pool health monitoring: every 1M packets, log avail_count and verify there is no leak
7. Benchmark: run a dpdk-testpmd baseline, then your filter; compare throughput and imissed
8. Add an rte_flow rule: steer DNS traffic (UDP/53) to queue 0 in hardware — measure the CPU % reduction
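A minimal sketch for step 4's blocked-domain table, assuming fixed-size, already-lowercased domain keys (MAX_DOMAIN_LEN and the 4096-entry capacity are illustrative choices):
// Blocked-domain table on rte_hash: zero-padded fixed-length keys
#include <string.h>
#include <rte_hash.h>
#include <rte_jhash.h>

#define MAX_DOMAIN_LEN 64   /* illustrative fixed key size; pad/truncate domains */

static struct rte_hash *blocked_domains;

static int blocklist_init(int socket_id) {
    struct rte_hash_parameters params = {
        .name      = "blocked_domains",
        .entries   = 4096,               /* illustrative capacity */
        .key_len   = MAX_DOMAIN_LEN,
        .hash_func = rte_jhash,
        .socket_id = socket_id,          /* same socket as the workers */
    };
    blocked_domains = rte_hash_create(&params);
    return blocked_domains ? 0 : -1;
}

static void blocklist_add(const char *domain) {
    char key[MAX_DOMAIN_LEN] = {0};
    strncpy(key, domain, MAX_DOMAIN_LEN - 1);    /* zero-padded fixed key */
    rte_hash_add_key(blocked_domains, key);
}

/* Returns nonzero if the domain is blocked. */
static int blocklist_match(const char *domain) {
    char key[MAX_DOMAIN_LEN] = {0};
    strncpy(key, domain, MAX_DOMAIN_LEN - 1);
    return rte_hash_lookup(blocked_domains, key) >= 0;  /* >= 0: key found */
}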
FULL DPDK MASTERY CHECKLIST
Phase 1 — Foundation & Memory
- Explain 6 categories of kernel overhead and DPDK's solution for each
- Draw DPDK software stack from NIC hardware to user application
- Explain EAL init: hugepages, lcore pinning, PCI probe
- Explain hugepages: why needed, DMA stability, TLB efficiency
- Draw rte_mempool architecture: per-lcore cache + common ring
- Draw rte_mbuf layout: all key fields including buf_addr, data_off, pkt_len, ol_flags
- Explain rte_pktmbuf_mtod() — what it expands to, why it's zero-copy
Phase 2 — Core Mechanics
- Explain DD bit — what it is, why polling beats interrupts
- Draw Rx descriptor ring lifecycle (6 steps)
- Write safe tx_burst with unsent-packet free pattern
- List 10-step port configuration sequence in order
- Explain RSS: Toeplitz hash, RETA, symmetric key, power-of-2 requirement
- Draw rte_ring CAS protocol for MPMC enqueue
- Explain bulk vs burst semantics, SPSC vs MPMC tradeoffs
- Compare run-to-completion vs pipeline architectures
Phase 3 — Advanced & Production
- Explain primary/secondary model — who creates, who looks up, gotchas
- Write a complete rte_flow rule (pattern + action + validate + create)
- Explain NUMA remote access penalty and correct allocation pattern
- Explain false sharing and demonstrate __rte_cache_aligned fix
- Explain 4-packet prefetch pattern — what to prefetch, why 4 ahead
- Diagnose imissed vs rx_nombuf: different causes and different fixes
- Apply production tuning: isolcpus, performance governor, NUMA balancing off
- Identify and fix all 10 pitfalls in the production pitfall catalog