DPDK MASTERY · PHASE 2 OF 3 · MODULE A
Poll Mode Drivers & Port Config
PMD internals · NIC descriptor rings · rx_burst / tx_burst hot path · RSS · Port configuration sequence
Ch 7 — PMD Deep Dive
Ch 8 — Port Configuration
Ch 9 — RSS Deep Dive
C · ixgbe · mlx5 · Toeplitz
Weeks 6–8
What a PMD Is
A Poll Mode Driver (PMD) is a user-space NIC driver that replaces the kernel driver for a specific NIC model. It maps NIC BAR (Base Address Register) memory into user-space via VFIO/UIO and programs the NIC's hardware descriptor rings directly. It provides rte_eth_rx_burst() and rte_eth_tx_burst() implementations — called millions of times per second with zero system calls.
| PMD Type | Examples | Description | Notes |
|---|---|---|---|
| Physical NIC | ixgbe (X520), i40e (XL710), ice (E810), mlx5 (ConnectX) | Direct hardware driver — maximum performance | Requires device binding (except mlx5 bifurcated) |
| Virtual NIC | virtio (KVM), vmxnet3 (VMware), vhost-user | VM-facing PMD — communicates via shared memory | Lower performance than physical — no DMA bypass |
| Software (vdev) | net_ring, net_tap, net_pcap, net_null | Software-only — testing, kernel bridging, dev | No real NIC needed — great for unit testing |
| Bonding | net_bonding | Aggregates multiple physical PMDs into one logical port | LAG/LACP support; active-backup or LACP mode |
📌 PMD as function pointer table: Each PMD registers a set of function pointers (eth_rx_burst_t, eth_tx_burst_t, etc.) at probe time. When you call rte_eth_rx_burst(), it's a single indirect function call through this table — you land directly in the PMD-specific burst code, with no generic dispatch layer in between. This is why different NICs can coexist in one DPDK application.
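A minimal sketch of that dispatch mechanism — the type and field names here are simplified assumptions loosely modeled on DPDK's internal fast-path ops table, not the actual rte_ethdev layout:
typedef uint16_t (*eth_rx_burst_t)(void *rxq,
                                   struct rte_mbuf **rx_pkts,
                                   uint16_t nb_pkts);

struct eth_fp_ops {                    // one per port
    eth_rx_burst_t rx_pkt_burst;       // registered by the PMD at probe time
    void **rxq;                        // per-queue private contexts
};

static struct eth_fp_ops fp_ops[64];   // indexed by port_id
                                       // (RTE_MAX_ETHPORTS in real DPDK)
static inline uint16_t
sketch_rx_burst_dispatch(uint16_t port_id, uint16_t queue_id,
                         struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
    const struct eth_fp_ops *p = &fp_ops[port_id];
    // One indirect call lands directly in ixgbe/i40e/mlx5-specific code:
    return p->rx_pkt_burst(p->rxq[queue_id], rx_pkts, nb_pkts);
}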
NIC Hardware Descriptor Rings
The descriptor ring is a circular array in hugepage memory shared between the NIC hardware and the PMD software. It is the fundamental data transfer mechanism — no pipes, no queues, no kernel — just two pointers (the NIC's and the PMD's) into a shared ring.
Rx Descriptor Ring (in hugepage memory, DMA-accessible by NIC)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│desc0│desc1│desc2│desc3│desc4│desc5│desc6│desc7│ ← ring[nb_rx_desc]
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
↑ NIC write ptr ↑ CPU read ptr
Each descriptor contains:
- buf_addr — IOVA of the pre-allocated mbuf
- buf_len — size of the buffer
- status — including the DD (Descriptor Done) bit
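For concreteness, a simplified descriptor layout might look like the sketch below — field names and sizes are illustrative (loosely modeled on Intel's 16-byte legacy format), not any real NIC's register spec:
#define RX_DESC_STATUS_DD (1u << 0)   // Descriptor Done — set by the NIC

struct rx_desc {               // lives in hugepage memory, DMA-visible
    uint64_t buf_addr;         // IOVA of the pre-posted mbuf data buffer
    uint16_t buf_len;          // buffer size (often a per-queue constant)
    uint16_t pkt_len;          // write-back: actual received length
    uint16_t status;           // write-back: includes the DD bit
    uint16_t vlan_tci;         // write-back: stripped VLAN tag, if any
};

// PMD readiness check — the only synchronization with the NIC:
//   if (ring[i].status & RX_DESC_STATUS_DD) { /* packet ready */ }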
Flow:
1. CPU fills ring[i].buf_addr = IOVA of an empty mbuf (pre-loaded at setup)
2. NIC DMA writes packet data into that IOVA
3. NIC sets ring[i].status |= DD bit ← handshake signal
4. PMD polls: if the DD bit is set → packet is ready → read the mbuf
RX DESCRIPTOR LIFECYCLE
1. Setup: rte_eth_rx_queue_setup() pre-fills all ring slots with empty mbuf IOVAs from the mempool
2. Packet arrives: NIC DMA engine writes packet bytes into the mbuf at that IOVA — zero CPU involvement
3. NIC signals done: NIC sets the DD bit in the descriptor + writes pkt_len, ol_flags, RSS hash
4. PMD polls: rte_eth_rx_burst() checks the DD bit → mbuf is ready → copies metadata into mbuf fields
5. Refill: PMD allocates a fresh mbuf from the pool → puts its IOVA into the now-empty ring slot → NIC can reuse it
6. Return: PMD returns received mbufs to the application — total latency: ~20–50 ns from DD bit set
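Putting the six steps together, the core of a scalar rx_burst looks roughly like this sketch (reusing the hypothetical struct rx_desc from above; real PMDs batch refills, use vector instructions, and update the NIC tail register to advertise refilled slots):
struct rxq {
    struct rx_desc *ring;          // hardware descriptor ring
    struct rte_mbuf **sw_ring;     // mbuf currently backing each slot
    struct rte_mempool *mp;        // pool used for refills
    uint16_t nb_desc;              // ring size (power of 2)
    uint16_t next_to_read;         // PMD's read position
};

static uint16_t
sketch_rx_burst(struct rxq *q, struct rte_mbuf **pkts, uint16_t n)
{
    uint16_t nb_rx = 0;
    while (nb_rx < n) {
        uint16_t i = q->next_to_read;
        if (!(q->ring[i].status & RX_DESC_STATUS_DD))
            break;                         // NIC not done with this slot yet
        struct rte_mbuf *fresh = rte_pktmbuf_alloc(q->mp);
        if (fresh == NULL)                 // pool empty: a real PMD bumps the
            break;                         // rx_nombuf counter and retries later
        struct rte_mbuf *m = q->sw_ring[i];
        m->pkt_len = m->data_len = q->ring[i].pkt_len;   // copy NIC write-back
        pkts[nb_rx++] = m;
        q->sw_ring[i] = fresh;             // refill the slot
        q->ring[i].buf_addr = rte_mbuf_data_iova_default(fresh);
        q->ring[i].status = 0;             // slot belongs to the NIC again
        q->next_to_read = (i + 1) & (q->nb_desc - 1);
    }
    return nb_rx;                          // tail-register update omitted
}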
TX DESCRIPTOR LIFECYCLE
1. App calls: rte_eth_tx_burst(port, queue, mbufs[], n)
2. PMD fills Tx descriptors: writes each mbuf's IOVA + length + offload flags, updates the Tx tail pointer
3. NIC DMA: reads the packet from the mbuf buffer → sends it on the wire
4. NIC sets DD: on the completed descriptor (async — the NIC is already busy sending the next packets)
5. Lazy free: PMD frees completed Tx mbufs on the next tx_burst call or when tx_free_thresh is crossed
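A sketch of that lazy-free step, again with hypothetical queue fields (real PMDs clean in batches once tx_free_thresh descriptors are outstanding):
#define TX_DESC_STATUS_DD (1u << 0)   // set by the NIC after transmission

struct tx_desc {
    uint64_t buf_addr;      // IOVA the NIC reads the frame from
    uint16_t data_len;
    uint16_t status;        // includes the Tx DD bit
};

struct txq {
    struct tx_desc  *ring;
    struct rte_mbuf **sw_ring;    // mbuf behind each in-flight descriptor
    uint16_t nb_desc, nb_used, next_to_clean;
};

static void
sketch_tx_clean(struct txq *q)
{
    // Reclaim only descriptors the NIC has marked done
    while (q->nb_used > 0 &&
           (q->ring[q->next_to_clean].status & TX_DESC_STATUS_DD)) {
        rte_pktmbuf_free(q->sw_ring[q->next_to_clean]);  // safe to free now
        q->next_to_clean = (q->next_to_clean + 1) & (q->nb_desc - 1);
        q->nb_used--;
    }
}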
⚠️ Critical: rte_eth_tx_burst() returns the count of packets actually queued (may be less than n if the Tx ring is full). The caller MUST free any unsent packets: pkts[nb_tx..n-1]. Failing to do so causes mbuf leaks → mempool exhaustion → rx_burst returns 0 mbufs → the application appears to stop receiving packets.
RX_BURST — THE HOT PATH FUNCTION
// rte_eth_rx_burst signature
uint16_t rte_eth_rx_burst(
uint16_t port_id, // which NIC port
uint16_t queue_id, // which Rx queue on that port
struct rte_mbuf **rx_pkts, // output: array of received mbufs
uint16_t nb_pkts // max mbufs to receive (burst size)
);
// Returns: actual number of mbufs received (0 to nb_pkts)
// Canonical polling loop
struct rte_mbuf *pkts[BURST_SIZE];
while (1) {
uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
  for (uint16_t i = 0; i < nb_rx; i++)
    process_packet(pkts[i]);  // must free (or transmit) each mbuf, or the pool leaks
}
TX_BURST — SAFE TRANSMIT PATTERN
// rte_eth_tx_burst — ALWAYS check return value
uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_pkts);
// Free unsent packets (Tx ring was full)
if (unlikely(nb_tx < nb_pkts)) {
for (uint16_t i = nb_tx; i < nb_pkts; i++)
rte_pktmbuf_free(pkts[i]);
}
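Alternatively, ethdev ships a small Tx buffering helper that batches packets and, by default, frees anything it couldn't send. A sketch using the BURST_SIZE from the loop above (port, queue, and pkt as in the surrounding examples):
// One-time setup (per port/queue pair):
struct rte_eth_dev_tx_buffer *txb = rte_zmalloc_socket("txb",
        RTE_ETH_TX_BUFFER_SIZE(BURST_SIZE), 0,
        rte_eth_dev_socket_id(port));
rte_eth_tx_buffer_init(txb, BURST_SIZE);

// Hot path: buffer one mbuf; a full buffer triggers tx_burst internally,
// and unsent packets go to the default error callback (which frees them).
rte_eth_tx_buffer(port, queue, txb, pkt);

// End of each poll iteration: push out whatever is still buffered.
rte_eth_tx_buffer_flush(port, queue, txb);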
BURST SIZE TUNING
| Burst Size | Throughput | Latency | Cache Usage | Recommendation |
|---|---|---|---|---|
| 8 | Good for low load | Lowest | Minimal | Low-latency SLAs |
| 32 | Good balance | Moderate | Good I-cache reuse | Blaze/SASE-DP default |
| 64 | High throughput | Higher | Excellent | DPDK example default |
| 128+ | Marginal improvement | Higher | Diminishing returns | May exceed L1 cache |
🆕 Blaze/SASE-DP Real-World Finding: With 100G NIC and 8 workers, burst size 32 gave the best latency/throughput balance. At burst=64 throughput was ~3% higher but p99 latency increased ~15%. At burst=16, throughput dropped ~8%. Start with 32 — tune based on your latency SLA vs throughput target.
PORT CONFIGURATION — MANDATORY ORDER
1. rte_eal_init() — initialize EAL (hugepages, lcores, PCI probe)
2. rte_eth_dev_count_avail() — how many NIC ports are available?
3. rte_eth_dev_info_get() — query NIC capabilities (max queues, offload flags, descriptor limits)
4. rte_pktmbuf_pool_create() — create an mbuf pool on the NIC's NUMA socket
5. rte_eth_dev_configure() — configure the port: number of queues, offloads, RSS
6. rte_eth_rx_queue_setup() — set up each Rx queue (descriptor count, socket, pool)
7. rte_eth_tx_queue_setup() — set up each Tx queue (descriptor count, socket)
8. rte_eth_dev_start() — start the device (enables DMA, activates queues)
9. rte_eth_promiscuous_enable() — optional: receive all traffic regardless of dst MAC
10. rte_eth_link_get_nowait() — poll until the link is UP
// Full port configuration example
struct rte_eth_conf port_conf = {
  .rxmode = {
    .mq_mode = RTE_ETH_MQ_RX_RSS,   // required — RSS stays off without it
    .mtu = RTE_ETHER_MTU,           // 1500; rxmode.mtu is an MTU, not a frame length
    .offloads = RTE_ETH_RX_OFFLOAD_CHECKSUM |
                RTE_ETH_RX_OFFLOAD_RSS_HASH,
  },
.txmode = {
.mq_mode = RTE_ETH_MQ_TX_NONE,
.offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_CKSUM,
},
.rx_adv_conf.rss_conf = {
.rss_key = NULL, // use default 40-byte RSS key
.rss_hf = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
},
};
rte_eth_dev_configure(port_id, nb_rx_queues, nb_tx_queues, &port_conf);
for (uint16_t q = 0; q < nb_rx_queues; q++)
rte_eth_rx_queue_setup(port_id, q, 512, // nb_rx_desc
rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
for (uint16_t q = 0; q < nb_tx_queues; q++)
rte_eth_tx_queue_setup(port_id, q, 512,
rte_eth_dev_socket_id(port_id), NULL);
rte_eth_dev_start(port_id);
rte_eth_promiscuous_enable(port_id);
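Two hardening details worth adding to this sequence (shown as a sketch): descriptor counts must satisfy the NIC's min/max/alignment limits, so let the driver round them — and check every return code, since each call above can fail:
uint16_t nb_rxd = 512, nb_txd = 512;
// Driver rounds the requested counts to its own limits:
if (rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd) != 0)
    rte_exit(EXIT_FAILURE, "cannot adjust descriptor counts\n");
// Pass nb_rxd / nb_txd to the queue-setup calls instead of a hard-coded 512.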
| nb_rx_desc | Use Case | Trade-off |
|---|---|---|
| 256 | Low latency, light load | Small ring → NIC drops more under burst → imissed increments |
| 512 | Balanced — common default | Good balance of memory vs burst tolerance |
| 1024 | High throughput, bursty traffic | More memory, better burst handling |
| 4096 | Line-rate 100G with large bursts | Maximum burst tolerance — highest memory use |
| Offload Flag | Direction | Effect |
|---|---|---|
| RTE_ETH_RX_OFFLOAD_CHECKSUM | Rx | NIC verifies IP/TCP/UDP checksums. Sets RTE_MBUF_F_RX_*_CKSUM_GOOD/BAD flags. |
| RTE_ETH_RX_OFFLOAD_RSS_HASH | Rx | NIC computes the RSS hash. Sets mbuf->hash.rss and RTE_MBUF_F_RX_RSS_HASH. |
| RTE_ETH_RX_OFFLOAD_VLAN_STRIP | Rx | NIC strips the VLAN tag from the frame. Tag stored in mbuf->vlan_tci. |
| RTE_ETH_RX_OFFLOAD_SCATTER | Rx | Allow multi-segment mbufs (required for jumbo frames > buf_size). |
| RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | Tx | NIC computes and inserts the IPv4 header checksum. |
| RTE_ETH_TX_OFFLOAD_TCP_CKSUM | Tx | NIC computes and inserts the TCP checksum. |
| RTE_ETH_TX_OFFLOAD_VLAN_INSERT | Tx | NIC inserts a VLAN tag from mbuf->vlan_tci. |
| RTE_ETH_TX_OFFLOAD_TCP_TSO | Tx | TCP Segmentation Offload — NIC segments large TCP payloads into MTU-sized frames. |
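Requesting an offload the hardware lacks makes rte_eth_dev_configure() fail, so mask your wish list against the capabilities reported by dev_info — a sketch, reusing port_conf from the example above:
struct rte_eth_dev_info dev_info;
rte_eth_dev_info_get(port_id, &dev_info);

uint64_t want = RTE_ETH_RX_OFFLOAD_CHECKSUM | RTE_ETH_RX_OFFLOAD_RSS_HASH;
if ((dev_info.rx_offload_capa & want) != want)
    printf("warning: NIC lacks some requested Rx offloads\n");
port_conf.rxmode.offloads = want & dev_info.rx_offload_capa;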
PORT STATISTICS
// Read port statistics
struct rte_eth_stats stats;
rte_eth_stats_get(port_id, &stats);
printf("Rx: %lu pkts, %lu bytes, %lu missed, %lu errors\n",
stats.ipackets, stats.ibytes, stats.imissed, stats.ierrors);
printf("Tx: %lu pkts, %lu bytes, %lu errors\n",
stats.opackets, stats.obytes, stats.oerrors);
| Stat Field | Meaning | Action if Non-Zero |
|---|---|---|
| stats.imissed | Packets dropped by NIC hardware — Rx ring was full | Increase nb_rx_desc; increase burst size; reduce processing latency; add more worker lcores |
| stats.ierrors | Receive errors (bad FCS, oversized frames) | Check cable/NIC health; check MTU configuration |
| stats.rx_nombuf | Packets dropped — no free mbufs in the pool | Increase mempool size; check for mbuf leaks |
| stats.oerrors | Transmit errors | Check Tx configuration and offload flags |
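A minimal drop monitor, assuming a spare thread or lcore is available to run it: sample the counters once per second and alert on deltas:
uint64_t prev_missed = 0;
for (;;) {
    struct rte_eth_stats s;
    rte_eth_stats_get(port_id, &s);
    if (s.imissed > prev_missed)
        printf("Rx overflow: %lu pkts missed in the last second\n",
               s.imissed - prev_missed);
    prev_missed = s.imissed;
    sleep(1);
}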
⚠️ imissed vs rx_nombuf: These are different failure modes. imissed = the NIC couldn't write a packet because the Rx ring had no empty descriptors (ring full — software too slow to drain it). rx_nombuf = the PMD tried to refill the ring but the mempool had no free mbufs (mbuf leak). Both result in dropped packets but have different root causes and fixes.
RSS — Hardware Multi-Core Distribution
RSS (Receive Side Scaling) distributes incoming packets across multiple Rx queues using a hardware hash of the packet's 5-tuple. Each queue is serviced by one lcore. Because the same 5-tuple always maps to the same queue, all packets of a TCP connection always land on the same core — enabling lock-free per-flow state.
RSS MECHANISM
RSS — Packet to Queue Assignment (hardware path)
Packet arrives at NIC
↓
NIC parser extracts the 5-tuple from fixed byte offsets (untagged IPv4 shown):
Src IP @ bytes 26–29 Dst IP @ bytes 30–33
Src Port @ bytes 34–35 Dst Port @ bytes 36–37
Protocol @ byte 23
↓
Toeplitz Hash Unit (silicon logic — runs at wire speed):
Algorithm: for each input bit → if bit=1: hash XOR= key[i:i+32]
Same 5-tuple always → same 32-bit hash (deterministic)
Same 5-tuple → same hash → same queue → same lcore
hash = 0x3A7F1C
↓
RETA (Redirection Table) lookup:
queue = RETA[hash & (reta_size - 1)]
RETA[0x1C] = queue 3
↓
Packet DMA'd into Rx Queue 3 → lcore 3 picks it up
All packets of one TCP connection always land on the same lcore.
Per-flow state on one lcore — no locking needed on hot path.
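You can reproduce the hash in software with DPDK's rte_thash.h — handy for predicting which queue a flow will land on. A sketch, assuming rss_key is the 40-byte key the port was configured with and reta[]/reta_size mirror the NIC's table (tuple fields in host byte order):
union rte_thash_tuple t;
t.v4.src_addr = RTE_IPV4(10, 0, 0, 1);
t.v4.dst_addr = RTE_IPV4(10, 0, 0, 2);
t.v4.dport = 80;
t.v4.sport = 12345;

uint32_t hash = rte_softrss((uint32_t *)&t, RTE_THASH_V4_L4_LEN, rss_key);
uint16_t queue = reta[hash & (reta_size - 1)];   // same lookup the NIC does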
SYMMETRIC RSS KEY
The Symmetric RSS Problem
By default, RSS is asymmetric: hash(src=A, dst=B) ≠ hash(src=B, dst=A). For stateful NFs that process both directions of a flow, this means forward and return packets land on different cores — requiring cross-core state access. A symmetric Toeplitz key — the 2-byte pattern 0x6D5A repeated to fill the 40-byte key — fixes this: hash(A→B) == hash(B→A).
// Symmetric Toeplitz RSS key — 0x6D5A repeated 20 times.
// (The widely copied Microsoft default key is NOT symmetric.)
static uint8_t sym_rss_key[40] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};
// Use in rss_conf.rss_key — guarantees forward/return on same lcore
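Wiring the key into the Chapter 8 port_conf (a sketch — check dev_info.hash_key_size first; most NICs take 40-byte keys, i40e for example wants 52):
port_conf.rx_adv_conf.rss_conf.rss_key = sym_rss_key;
port_conf.rx_adv_conf.rss_conf.rss_key_len = sizeof(sym_rss_key);

// After dev_start, read back what the PMD actually programmed:
uint8_t cur_key[40];
struct rte_eth_rss_conf cur = { .rss_key = cur_key, .rss_key_len = 40 };
rte_eth_dev_rss_hash_conf_get(port_id, &cur);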
RETA IMBALANCE — THE POWER-OF-2 REQUIREMENT
⚠️ Blaze/SASE-DP Real-World Finding: With 6 workers (non-power-of-2) and RETA size = 128: 128/6 = 21.33 → uneven. Queues 0–1 got 22 entries each, queues 2–5 got 21 → cores 0–1 received ~5% more traffic. Under high load, cores 0–1 saturated first → throughput ceiling. Switching to 8 workers: all at ~91% utilization, throughput +12%. Rule: always use power-of-2 worker counts.
// Update RETA programmatically for even distribution
struct rte_eth_dev_info dev_info;
uint16_t reta_size;
rte_eth_dev_info_get(port_id, &dev_info);
reta_size = dev_info.reta_size; // typically 128 or 512
struct rte_eth_rss_reta_entry64 reta_conf[reta_size / RTE_ETH_RETA_GROUP_SIZE];
memset(reta_conf, 0, sizeof(reta_conf));
for (uint16_t i = 0; i < reta_size; i++) {
uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
uint16_t idx = i % RTE_ETH_RETA_GROUP_SIZE;
reta_conf[grp].mask = UINT64_MAX;
reta_conf[grp].reta[idx] = i % nb_workers; // nb_workers must be power-of-2
}
rte_eth_dev_rss_reta_update(port_id, reta_conf, reta_size);
Q: What is the DD bit and why does DPDK use it instead of interrupts?
The DD (Descriptor Done) bit is a status bit set by NIC hardware in a descriptor after it finishes with it — for Rx: after DMA'ing the packet; for Tx: after sending it. The PMD polls this bit in a tight loop instead of sleeping and waiting for an interrupt. At 100G/64B, ~148 Mpps would require 148M interrupts/sec — impossible. Polling eliminates interrupt latency and context switches entirely.
Q: What happens if rte_eth_tx_burst() returns less than nb_pkts?
The Tx ring was full — not all packets could be queued. The caller must free the unsent packets (pkts[nb_tx..nb_pkts-1]). Failing to do so causes a mbuf leak → mempool exhaustion → rx_burst can no longer refill Rx ring → rx_nombuf stat increments → application crashes or stops receiving.
Q: When are Tx mbufs actually freed?
NOT immediately after tx_burst. The NIC needs time to DMA the data. The PMD frees completed Tx mbufs lazily: either when the next tx_burst is called and the PMD reclaims descriptors, or when a configurable tx_free_thresh is crossed. Never access an mbuf after passing it to tx_burst — the mbuf may be freed by the PMD asynchronously.
Q: What is the order of port configuration API calls and why does it matter?
Order: dev_info_get → dev_configure → rx_queue_setup (each queue) → tx_queue_setup (each queue) → dev_start. This order is mandatory: dev_configure allocates internal resources; queue_setup allocates descriptor rings using those resources; dev_start enables DMA. Calling out of order returns EINVAL or silently fails.
Q: What does stats.imissed mean and how do you fix it?
imissed counts packets the NIC hardware dropped because the Rx ring had no empty descriptor slots — the application wasn't consuming packets fast enough. Fixes: (1) Increase nb_rx_desc; (2) Increase burst_size to drain more per call; (3) Reduce per-packet processing time; (4) Add more worker lcores.
Q: Why must worker count be a power of 2 for RSS?
RETA has a fixed size (typically 128 or 512). DPDK maps RETA entries evenly to queues: RETA[i] = i % nb_workers. If nb_workers is not a power of 2, the division is uneven — some queues get more RETA entries (more traffic) than others. Under load, the heavier queues saturate first, creating a throughput bottleneck. Power-of-2 counts guarantee exact even distribution.
🔥 Lab 5: L2 Forwarder with RSS Verification
Build the classic DPDK L2 forwarder (MAC swap + forward) and add RSS verification to confirm packets are landing on the expected lcore.
1. Configure the port with 4 Rx queues (power-of-2) and enable RSS on IP+TCP+UDP
2. Launch 4 worker lcores — each polls its own queue: rte_eth_rx_burst(port, lcore_id % 4, ...)
3. In the Rx loop: print mbuf->hash.rss and verify it's set (RTE_MBUF_F_RX_RSS_HASH in ol_flags)
4. MAC swap: swap src ↔ dst Ethernet addresses using rte_ether_addr_copy()
5. Transmit back on the same port/queue: rte_eth_tx_burst(port, queue, pkts, nb_rx)
6. Monitor stats: verify imissed == 0 and rx_nombuf == 0 under load
7. Extension: try 3 workers (non-power-of-2) — observe CPU imbalance in top -H
MASTERY CHECKLIST
- Can explain what the DD bit is and why DPDK polls it instead of using interrupts
- Can draw the full Rx descriptor lifecycle (6 steps from setup to application)
- Can write the canonical polling loop with correct Tx free pattern
- Can list the 10-step port configuration sequence in order and explain why order matters
- Can explain imissed vs rx_nombuf and what causes each
- Can explain RSS: Toeplitz hash, RETA, why same 5-tuple always lands on same lcore
- Can explain why worker count must be power-of-2 for even RSS distribution
- Can write RETA update code to manually control traffic distribution