DPDK MASTERY · PHASE 2 OF 3 · MODULE A
Poll Mode Drivers & Port Config
PMD internals · NIC descriptor rings · rx_burst / tx_burst hot path · RSS · Port configuration sequence
Ch 7 — PMD Deep Dive · Ch 8 — Port Configuration · Ch 9 — RSS Deep Dive · C · ixgbe · mlx5 · Toeplitz · Weeks 6–8

What a PMD Is

A Poll Mode Driver (PMD) is a user-space NIC driver that replaces the kernel driver for a specific NIC model. It maps NIC BAR (Base Address Register) memory into user-space via VFIO/UIO and programs the NIC's hardware descriptor rings directly. It provides rte_eth_rx_burst() and rte_eth_tx_burst() implementations — called millions of times per second with zero system calls.
| PMD Type | Examples | Description | Notes |
|---|---|---|---|
| Physical NIC | ixgbe (X520), i40e (XL710), ice (E810), mlx5 (ConnectX) | Direct hardware driver — maximum performance | Requires device binding (except mlx5's bifurcated driver) |
| Virtual NIC | virtio (KVM), vmxnet3 (VMware), vhost-user | VM-facing PMD — communicates via shared memory | Lower performance than physical — no DMA bypass |
| Software (vdev) | net_ring, net_tap, net_pcap, net_null | Software-only — testing, kernel bridging, dev | No real NIC needed — great for unit testing |
| Bonding | net_bonding | Aggregates multiple physical PMDs into one logical port | LAG/LACP support; active-backup or LACP mode |
📌 PMD as function pointer table: Each PMD registers a set of function pointers (eth_rx_burst_t, eth_tx_burst_t, etc.) at probe time. When you call rte_eth_rx_burst(), it's an indirect function call through this table — the PMD-specific burst routine, specialized per NIC type, runs directly in user space with no kernel involvement. This is why different NICs can coexist in one DPDK application.
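The dispatch mechanism can be sketched in a few lines of C. The names here (eth_ops, dev_table, sketch_rx_burst) are hypothetical, not DPDK's actual internal layout — but the indirect call through a per-port table is the same idea:

// Minimal sketch of PMD dispatch through a function-pointer table.
// Names are hypothetical; DPDK's real internals differ.
#include <stdint.h>

struct rte_mbuf;   /* opaque for this sketch */

typedef uint16_t (*eth_rx_burst_t)(void *rxq, struct rte_mbuf **pkts,
                                   uint16_t n);

struct eth_ops {                  /* one per probed device */
    eth_rx_burst_t rx_pkt_burst;  /* set by the PMD at probe time */
    void *rx_queues[16];          /* per-queue private state */
};

static struct eth_ops dev_table[32];   /* indexed by port_id */

/* What rte_eth_rx_burst() boils down to: one indirect call into
 * whatever burst routine the PMD registered for this port. */
static inline uint16_t
sketch_rx_burst(uint16_t port, uint16_t queue,
                struct rte_mbuf **pkts, uint16_t n)
{
    struct eth_ops *ops = &dev_table[port];
    return ops->rx_pkt_burst(ops->rx_queues[queue], pkts, n);
}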

NIC Hardware Descriptor Rings

The descriptor ring is a circular array in hugepage memory shared between the NIC hardware and the PMD software. It is the fundamental data transfer mechanism — no pipes, no queues, no kernel — just two pointers (NIC's and PMD's) into a shared ring.
Rx Descriptor Ring (in hugepage memory, DMA-accessible by NIC)

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│desc0│desc1│desc2│desc3│desc4│desc5│desc6│desc7│  ← ring[nb_rx_desc]
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   ↑ NIC write ptr                 ↑ CPU read ptr

Each descriptor contains:
  buf_addr — IOVA of a pre-allocated mbuf
  buf_len  — size of the buffer
  status   — including the DD (Descriptor Done) bit

Flow:
  1. CPU fills ring[i].buf_addr = IOVA of an empty mbuf (pre-loaded at setup)
  2. NIC DMA-writes packet data into that IOVA
  3. NIC sets ring[i].status |= DD  ← handshake signal
  4. PMD polls: if the DD bit is set → packet is ready → read the mbuf

RX DESCRIPTOR LIFECYCLE
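A minimal C sketch of one step of this lifecycle. The rx_desc layout, field names, and the alloc_buf() helper are hypothetical stand-ins — real NICs define their own descriptor formats and real PMDs refill from an rte_mempool — but the DD-bit handshake is the same:

// Hedged sketch of the Rx descriptor lifecycle (illustrative only).
#include <stdint.h>
#include <stdlib.h>

#define DD_BIT    (1u << 0)          /* Descriptor Done — set by the NIC */
#define RING_SIZE 512

struct rx_desc {                     /* hypothetical layout */
    uint64_t buf_addr;               /* IOVA of the posted buffer */
    uint16_t pkt_len;                /* written by the NIC after DMA */
    uint16_t status;                 /* DD bit lives here */
};

static struct rx_desc ring[RING_SIZE];
static void *bufs[RING_SIZE];        /* software copies of the buffer ptrs */
static uint16_t next_to_read;        /* PMD's cursor into the ring */

static void *alloc_buf(void) { return malloc(2048); } /* stands in for mempool */

/* One poll step — the core of what rx_burst() does per descriptor. */
static int rx_poll_one(void **pkt, uint16_t *len)
{
    struct rx_desc *d = &ring[next_to_read];

    if (!(d->status & DD_BIT))
        return 0;                    /* NIC not done — nothing to deliver */

    *pkt = bufs[next_to_read];       /* hand the filled buffer to the app */
    *len = d->pkt_len;

    bufs[next_to_read] = alloc_buf();                       /* refill the slot */
    d->buf_addr = (uint64_t)(uintptr_t)bufs[next_to_read];  /* real PMD: IOVA */
    d->status = 0;                   /* clear DD so the NIC can reuse the slot */
    next_to_read = (next_to_read + 1) % RING_SIZE;
    return 1;
}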

TX DESCRIPTOR LIFECYCLE
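The Tx side mirrors this: software fills the descriptor and bumps the tail, the NIC sets DD after the frame is on the wire, and the buffer is reclaimed lazily on a later call (see the Q&A below). A hedged sketch with hypothetical field names:

// Hedged sketch of the Tx descriptor lifecycle (illustrative only).
// Key point: the buffer is freed only after the NIC sets DD — never
// right after enqueue.
#include <stdint.h>
#include <stdlib.h>

#define DD_BIT    (1u << 0)
#define RING_SIZE 512

struct tx_desc {                     /* hypothetical layout */
    uint64_t buf_addr;               /* IOVA of the frame to transmit */
    uint16_t len;
    uint16_t status;                 /* NIC sets DD once the frame is sent */
};

static struct tx_desc txring[RING_SIZE];
static void *inflight[RING_SIZE];    /* buffers awaiting completion */
static uint16_t tx_tail, next_to_clean;

/* Enqueue one frame — what tx_burst() does per packet. */
static int tx_enqueue_one(void *buf, uint16_t len)
{
    if (inflight[tx_tail] != NULL)
        return 0;                    /* ring full — caller must free buf */
    txring[tx_tail].buf_addr = (uint64_t)(uintptr_t)buf; /* real PMD: IOVA */
    txring[tx_tail].len = len;
    txring[tx_tail].status = 0;
    inflight[tx_tail] = buf;
    tx_tail = (tx_tail + 1) % RING_SIZE;
    /* real PMD: write tx_tail to the NIC doorbell register here */
    return 1;
}

/* Lazy reclaim — run on a later tx_burst() call or past a threshold. */
static void tx_reclaim(void)
{
    while (inflight[next_to_clean] &&
           (txring[next_to_clean].status & DD_BIT)) {
        free(inflight[next_to_clean]);   /* real PMD: rte_pktmbuf_free() */
        inflight[next_to_clean] = NULL;
        next_to_clean = (next_to_clean + 1) % RING_SIZE;
    }
}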

⚠️ Critical: rte_eth_tx_burst() returns the count of packets actually queued (may be less than n if Tx ring is full). Caller MUST free any unsent packets: pkts[nb_tx..n-1]. Failing to do so causes mbuf leaks → mempool exhaustion → rx_burst returns 0 mbufs → application appears to stop receiving packets.

RX_BURST — THE HOT PATH FUNCTION

// rte_eth_rx_burst signature
uint16_t rte_eth_rx_burst(
    uint16_t port_id,           // which NIC port
    uint16_t queue_id,          // which Rx queue on that port
    struct rte_mbuf **rx_pkts,  // output: array of received mbufs
    uint16_t nb_pkts);          // max mbufs to receive (burst size)
// Returns: actual number of mbufs received (0 to nb_pkts)

// Canonical polling loop
struct rte_mbuf *pkts[BURST_SIZE];
while (1) {
    uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
    for (uint16_t i = 0; i < nb_rx; i++)
        process_packet(pkts[i]);
}

TX_BURST — SAFE TRANSMIT PATTERN

// rte_eth_tx_burst — ALWAYS check the return value
uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_pkts);

// Free unsent packets (Tx ring was full)
if (unlikely(nb_tx < nb_pkts)) {
    for (uint16_t i = nb_tx; i < nb_pkts; i++)
        rte_pktmbuf_free(pkts[i]);
}

BURST SIZE TUNING

| Burst Size | Throughput | Latency | Cache Usage | Recommendation |
|---|---|---|---|---|
| 8 | Good for low load | Lowest | Minimal | Low-latency SLAs |
| 32 | Good balance | Moderate | Good I-cache reuse | Blaze/SASE-DP default |
| 64 | High throughput | Higher | Excellent | DPDK example default |
| 128+ | Marginal improvement | Higher | Diminishing returns | May exceed L1 cache |
🆕 Blaze/SASE-DP Real-World Finding: With 100G NIC and 8 workers, burst size 32 gave the best latency/throughput balance. At burst=64 throughput was ~3% higher but p99 latency increased ~15%. At burst=16, throughput dropped ~8%. Start with 32 — tune based on your latency SLA vs throughput target.

PORT CONFIGURATION — MANDATORY ORDER

// Full port configuration example (DPDK 21.11+ field names)
struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = RTE_ETH_MQ_RX_RSS,   // required — rss_conf is ignored without it
        .mtu = RTE_ETHER_MTU,           // 1500; raise for jumbo frames
        .offloads = RTE_ETH_RX_OFFLOAD_CHECKSUM |
                    RTE_ETH_RX_OFFLOAD_RSS_HASH,
    },
    .txmode = {
        .mq_mode = RTE_ETH_MQ_TX_NONE,
        .offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
                    RTE_ETH_TX_OFFLOAD_TCP_CKSUM,
    },
    .rx_adv_conf.rss_conf = {
        .rss_key = NULL,                // use the PMD's default 40-byte RSS key
        .rss_hf = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
    },
};

// 1. Configure the device (queue counts + port_conf)
rte_eth_dev_configure(port_id, nb_rx_queues, nb_tx_queues, &port_conf);

// 2. Set up each Rx queue (descriptor count, NUMA socket, mbuf pool)
for (uint16_t q = 0; q < nb_rx_queues; q++)
    rte_eth_rx_queue_setup(port_id, q, 512,      // nb_rx_desc
                           rte_eth_dev_socket_id(port_id),
                           NULL, mbuf_pool);

// 3. Set up each Tx queue
for (uint16_t q = 0; q < nb_tx_queues; q++)
    rte_eth_tx_queue_setup(port_id, q, 512,      // nb_tx_desc
                           rte_eth_dev_socket_id(port_id), NULL);

// 4. Start the port (enables DMA), then enable promiscuous mode
rte_eth_dev_start(port_id);
rte_eth_promiscuous_enable(port_id);
| nb_rx_desc | Use Case | Trade-off |
|---|---|---|
| 256 | Low latency, light load | Small ring → NIC drops more under bursts → imissed increments |
| 512 | Balanced — common default | Good balance of memory vs burst tolerance |
| 1024 | High throughput, bursty traffic | More memory, better burst handling |
| 4096 | Line-rate 100G with large bursts | Maximum burst tolerance — highest memory use |
| Offload Flag | Direction | Effect |
|---|---|---|
| RTE_ETH_RX_OFFLOAD_CHECKSUM | Rx | NIC verifies IP/TCP/UDP checksums. Sets RTE_MBUF_F_RX_*_CKSUM_GOOD/BAD flags. |
| RTE_ETH_RX_OFFLOAD_RSS_HASH | Rx | NIC computes the RSS hash. Sets mbuf->hash.rss and RTE_MBUF_F_RX_RSS_HASH. |
| RTE_ETH_RX_OFFLOAD_VLAN_STRIP | Rx | NIC strips the VLAN tag from the frame. Tag stored in mbuf->vlan_tci. |
| RTE_ETH_RX_OFFLOAD_SCATTER | Rx | Allow multi-segment mbufs (required for jumbo frames > buf_size). |
| RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | Tx | NIC computes and inserts the IPv4 header checksum. |
| RTE_ETH_TX_OFFLOAD_TCP_CKSUM | Tx | NIC computes and inserts the TCP checksum. |
| RTE_ETH_TX_OFFLOAD_VLAN_INSERT | Tx | NIC inserts a VLAN tag from mbuf->vlan_tci. |
| RTE_ETH_TX_OFFLOAD_TCP_TSO | Tx | TCP Segmentation Offload — NIC segments large TCP payloads into MTU-sized frames. |
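In the datapath, Rx offload results are consumed (not computed) via ol_flags, and Tx offloads are requested per mbuf. A short sketch — the flag and field names are real DPDK (21.11+), the helper functions are mine, and it assumes plain IPv4+TCP:

// Sketch: consuming Rx checksum results and requesting Tx checksum offload.
#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>

static inline int rx_cksum_ok(const struct rte_mbuf *m)
{
    /* The NIC already verified the checksums — just read the flags. */
    if ((m->ol_flags & RTE_MBUF_F_RX_IP_CKSUM_MASK) == RTE_MBUF_F_RX_IP_CKSUM_BAD)
        return 0;
    if ((m->ol_flags & RTE_MBUF_F_RX_L4_CKSUM_MASK) == RTE_MBUF_F_RX_L4_CKSUM_BAD)
        return 0;
    return 1;
}

static inline void tx_request_cksum(struct rte_mbuf *m)
{
    /* Tx offloads need header lengths so the NIC knows where to write. */
    m->l2_len = sizeof(struct rte_ether_hdr);
    m->l3_len = sizeof(struct rte_ipv4_hdr);
    m->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM |
                   RTE_MBUF_F_TX_TCP_CKSUM;
    /* Note: most NICs also require the L4 pseudo-header checksum to be
     * pre-written into the TCP header (rte_ipv4_phdr_cksum()) — check
     * your PMD's offload documentation. */
}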

PORT STATISTICS

// Read port statistics
struct rte_eth_stats stats;
rte_eth_stats_get(port_id, &stats);

printf("Rx: %lu pkts, %lu bytes, %lu missed, %lu errors\n",
       stats.ipackets, stats.ibytes, stats.imissed, stats.ierrors);
printf("Tx: %lu pkts, %lu bytes, %lu errors\n",
       stats.opackets, stats.obytes, stats.oerrors);
| Stat Field | Meaning | Action if Non-Zero |
|---|---|---|
| stats.imissed | Packets dropped by NIC hardware — Rx ring was full | Increase nb_rx_desc; increase burst size; reduce processing latency; add more worker lcores |
| stats.ierrors | Receive errors (bad FCS, oversized frames) | Check cable/NIC health; check MTU configuration |
| stats.rx_nombuf | Packets dropped — no free mbufs in the pool | Increase mempool size; check for mbuf leaks |
| stats.oerrors | Transmit errors | Check Tx configuration and offload flags |
⚠️ imissed vs rx_nombuf: These are different failure modes. imissed = NIC couldn't write packet because the Rx ring had no empty descriptors (ring was full — software too slow to drain it). rx_nombuf = PMD tried to refill the ring but the mempool had no free mbufs (mbuf leak). Both result in dropped packets but have different root causes and fixes.

RSS — Hardware Multi-Core Distribution

RSS (Receive Side Scaling) distributes incoming packets across multiple Rx queues using a hardware hash of the packet's 5-tuple. Each queue is serviced by one lcore. Because the same 5-tuple always maps to the same queue, all packets of a TCP connection always land on the same core — enabling lock-free per-flow state.

RSS MECHANISM

RSS — Packet to Queue Assignment (hardware path)

Packet arrives at NIC
  ↓
NIC parser extracts the 5-tuple from fixed byte offsets:
  Src IP    @ bytes 26–29
  Dst IP    @ bytes 30–33
  Src Port  @ bytes 34–35
  Dst Port  @ bytes 36–37
  Protocol  @ byte 23
  ↓
Toeplitz hash unit (silicon logic — runs at wire speed):
  Algorithm: for each input bit → if bit = 1: hash ^= key[i : i+32]
  Same 5-tuple always → same 32-bit hash (deterministic)
  e.g. hash = 0x3A7F1C
  ↓
RETA (Redirection Table) lookup:
  queue = RETA[hash & (reta_size - 1)]
  e.g. RETA[0x1C] = queue 3
  ↓
Packet DMA'd into Rx Queue 3 → lcore 3 picks it up

Same 5-tuple → same hash → same queue → same lcore: all packets of one
TCP connection always land on the same lcore. Per-flow state lives on
one lcore — no locking needed on the hot path.
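The hash can be recomputed in software with DPDK's rte_thash helpers and compared against the hardware value in mbuf->hash.rss (available when RTE_ETH_RX_OFFLOAD_RSS_HASH is enabled) — useful for verifying queue assignment. A sketch assuming IPv4/TCP and the 40-byte key actually programmed on the port; the helper name is mine, and note my assumption that rte_softrss() wants the tuple in host byte order:

// Sketch: software Toeplitz hash for verification against mbuf->hash.rss.
#include <rte_thash.h>
#include <rte_byteorder.h>

static uint32_t
soft_hash_v4_tcp(uint32_t sip_be, uint32_t dip_be,
                 uint16_t sport_be, uint16_t dport_be,
                 const uint8_t *rss_key)   /* the 40-byte key in use */
{
    struct rte_ipv4_tuple t;

    /* Convert from network order — rte_softrss() works on host-order words. */
    t.src_addr = rte_be_to_cpu_32(sip_be);
    t.dst_addr = rte_be_to_cpu_32(dip_be);
    t.sport    = rte_be_to_cpu_16(sport_be);
    t.dport    = rte_be_to_cpu_16(dport_be);

    return rte_softrss((uint32_t *)&t, RTE_THASH_V4_L4_LEN, rss_key);
}

The queue then follows from the diagram above: queue = RETA[hash & (reta_size - 1)].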

SYMMETRIC RSS KEY

The Symmetric RSS Problem

By default, RSS is asymmetric: hash(src=A, dst=B) ≠ hash(src=B, dst=A). For stateful NFs that process both directions of a flow, this means forward and return packets land on different cores — requiring cross-core state access. A symmetric Toeplitz key — the byte pair 0x6D, 0x5A repeated across all 40 bytes — fixes this: hash(A→B) == hash(B→A).
// Symmetric Toeplitz RSS key: 0x6D5A repeated for all 40 bytes
static uint8_t sym_rss_key[40] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};
// Use as rss_conf.rss_key — forward/return packets hash to the same lcore
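To activate it, point the rss_conf at this key before calling rte_eth_dev_configure() — a minimal sketch reusing the port_conf variable from the configuration example above:

port_conf.rx_adv_conf.rss_conf.rss_key     = sym_rss_key;
port_conf.rx_adv_conf.rss_conf.rss_key_len = sizeof(sym_rss_key);  // 40 bytes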

RETA IMBALANCE — THE POWER-OF-2 REQUIREMENT

⚠️ Blaze/SASE-DP Real-World Finding: With 6 workers (non-power-of-2) and RETA size = 128: 128/6 = 21.33 → uneven. Queues 0–1 got 22 entries, queues 2–5 got 21 → cores 0–1 received ~5% more traffic. Under high load, the heavier cores saturated first → throughput ceiling. Switching to 8 workers: all cores at ~91% utilization, throughput +12%. Rule: always use power-of-2 worker counts.
// Update the RETA programmatically for even distribution
struct rte_eth_dev_info dev_info;
uint16_t reta_size;

rte_eth_dev_info_get(port_id, &dev_info);
reta_size = dev_info.reta_size;               // typically 128 or 512

struct rte_eth_rss_reta_entry64 reta_conf[reta_size / RTE_ETH_RETA_GROUP_SIZE];
memset(reta_conf, 0, sizeof(reta_conf));

for (uint16_t i = 0; i < reta_size; i++) {
    uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
    uint16_t idx = i % RTE_ETH_RETA_GROUP_SIZE;
    reta_conf[grp].mask = UINT64_MAX;
    reta_conf[grp].reta[idx] = i % nb_workers;  // nb_workers must be a power of 2
}
rte_eth_dev_rss_reta_update(port_id, reta_conf, reta_size);

Q: What is the DD bit and why does DPDK use it instead of interrupts?

The DD (Descriptor Done) bit is a status bit set by NIC hardware in a descriptor after it finishes with it — for Rx: after DMA'ing the packet; for Tx: after sending it. The PMD polls this bit in a tight loop instead of sleeping and waiting for an interrupt. At 100G with 64-byte frames (84 bytes on the wire including preamble and inter-frame gap), line rate is 100×10⁹ / (84 × 8) ≈ 148.8 Mpps — one interrupt per packet would mean ~148M interrupts/sec, which no CPU can service. Polling eliminates interrupt latency and context switches entirely.

Q: What happens if rte_eth_tx_burst() returns less than nb_pkts?

The Tx ring was full — not all packets could be queued. The caller must free the unsent packets (pkts[nb_tx..nb_pkts-1]). Failing to do so causes an mbuf leak → mempool exhaustion → the PMD can no longer refill the Rx ring → the rx_nombuf stat increments → the application crashes or stops receiving.

Q: When are Tx mbufs actually freed?

NOT immediately after tx_burst. The NIC needs time to DMA the data. The PMD frees completed Tx mbufs lazily: either when the next tx_burst is called and the PMD reclaims descriptors, or when a configurable tx_free_thresh is crossed. Never access an mbuf after passing it to tx_burst — the mbuf may be freed by the PMD asynchronously.
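The reclaim threshold is set per queue at setup time via rte_eth_txconf.tx_free_thresh. A sketch — the value 64 is illustrative, and exact reclaim semantics vary by PMD, so starting from the device defaults is the safe pattern:

// Sketch: tuning tx_free_thresh so completed Tx mbufs are reclaimed sooner.
#include <rte_ethdev.h>

static int setup_txq(uint16_t port_id, uint16_t queue_id, uint16_t nb_desc)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_txconf txconf;

    if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
        return -1;

    txconf = dev_info.default_txconf;   /* start from the PMD's defaults */
    txconf.tx_free_thresh = 64;         /* reclaim threshold (illustrative) */

    return rte_eth_tx_queue_setup(port_id, queue_id, nb_desc,
                                  rte_eth_dev_socket_id(port_id), &txconf);
}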

Q: What is the order of port configuration API calls and why does it matter?

Order: dev_info_get → dev_configure → rx_queue_setup (each queue) → tx_queue_setup (each queue) → dev_start. This order is mandatory: dev_configure allocates internal resources; queue_setup allocates descriptor rings using those resources; dev_start enables DMA. Calling out of order returns EINVAL or silently fails.

Q: What does stats.imissed mean and how do you fix it?

imissed counts packets the NIC hardware dropped because the Rx ring had no empty descriptor slots — the application wasn't consuming packets fast enough. Fixes: (1) Increase nb_rx_desc; (2) Increase burst_size to drain more per call; (3) Reduce per-packet processing time; (4) Add more worker lcores.

Q: Why must worker count be a power of 2 for RSS?

RETA has a fixed size (typically 128 or 512). DPDK maps RETA entries evenly to queues: RETA[i] = i % nb_workers. If nb_workers is not a power of 2, the division is uneven — some queues get more RETA entries (more traffic) than others. Under load, the heavier queues saturate first, creating a throughput bottleneck. Power-of-2 counts guarantee exact even distribution.
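The imbalance is easy to see by counting how many RETA entries each queue receives under the RETA[i] = i % nb_workers mapping — plain C arithmetic, no DPDK needed:

// Prints RETA entries per queue for worker counts 2..8, reta_size = 128.
#include <stdio.h>

int main(void)
{
    const int reta_size = 128;

    for (int workers = 2; workers <= 8; workers++) {
        printf("workers=%d:", workers);
        for (int q = 0; q < workers; q++)
            printf(" q%d=%d", q,
                   reta_size / workers + (q < reta_size % workers));
        printf("\n");   /* power-of-2 counts come out perfectly even */
    }
    return 0;
}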
🔥 Lab 5: L2 Forwarder with RSS Verification

Build the classic DPDK L2 forwarder (MAC swap + forward) and add RSS verification to confirm packets are landing on the expected lcore. A worker-loop sketch follows the step list.

1. Configure the port with 4 Rx queues (power-of-2) and enable RSS on IP+TCP+UDP
2. Launch 4 worker lcores — each polls its own queue: rte_eth_rx_burst(port, lcore_id % 4, ...)
3. In the Rx loop: print mbuf->hash.rss and verify it is set (RTE_MBUF_F_RX_RSS_HASH in ol_flags)
4. MAC swap: swap src ↔ dst Ethernet addresses using rte_ether_addr_copy()
5. Transmit back on the same port/queue: rte_eth_tx_burst(port, queue, pkts, nb_rx)
6. Monitor stats: verify imissed == 0 and rx_nombuf == 0 under load
7. Extension: try 3 workers (non-power-of-2) — observe CPU imbalance in top -H
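A sketch of the per-lcore worker covering steps 2–5. It assumes the port/queue setup from the configuration section; port 0, the queue-per-lcore mapping, and the BURST constant are my choices, and error handling is trimmed for brevity:

// Lab 5 worker sketch: poll one RSS queue, verify hash, MAC-swap, transmit back.
#include <stdio.h>
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

#define BURST 32

static int worker(void *arg)
{
    uint16_t port = 0;
    uint16_t queue = (uint16_t)(uintptr_t)arg;   /* one queue per lcore */
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST);

        for (uint16_t i = 0; i < nb_rx; i++) {
            struct rte_mbuf *m = pkts[i];

            /* Step 3: verify the NIC delivered an RSS hash. */
            if (m->ol_flags & RTE_MBUF_F_RX_RSS_HASH)
                printf("q%u rss=0x%08x\n", queue, m->hash.rss);

            /* Step 4: MAC swap — src <-> dst. */
            struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
            struct rte_ether_addr tmp;
            rte_ether_addr_copy(&eth->src_addr, &tmp);
            rte_ether_addr_copy(&eth->dst_addr, &eth->src_addr);
            rte_ether_addr_copy(&tmp, &eth->dst_addr);
        }

        /* Step 5: transmit back; free anything the Tx ring rejected. */
        uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);
    }
    return 0;
}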
