DPDK MASTERY · PHASE 2 OF 3 · MODULE A
Poll Mode Drivers & Port Config
PMD internals · NIC descriptor rings · rx_burst / tx_burst hot path · RSS · Port configuration sequence
Ch 7 — PMD Deep Dive
Ch 8 — Port Configuration
Ch 9 — RSS Deep Dive
C · ixgbe · mlx5 · Toeplitz
Weeks 6–8
What a PMD Is
A Poll Mode Driver (PMD) is a user-space NIC driver that replaces the kernel driver for a specific NIC model. It maps NIC BAR (Base Address Register) memory into user-space via VFIO/UIO and programs the NIC's hardware descriptor rings directly. It provides rte_eth_rx_burst() and rte_eth_tx_burst() implementations — called millions of times per second with zero system calls.
| PMD Type | Examples | Description | Notes |
|---|---|---|---|
| Physical NIC | ixgbe (X520), i40e (XL710), ice (E810), mlx5 (ConnectX) | Direct hardware driver — maximum performance | Requires device binding (except mlx5 bifurcated) |
| Virtual NIC | virtio (KVM), vmxnet3 (VMware), vhost-user | VM-facing PMD — communicates via shared memory | Lower performance than physical — no DMA bypass |
| Software (vdev) | net_ring, net_tap, net_pcap, net_null | Software-only — testing, kernel bridging, dev | No real NIC needed — great for unit testing |
| Bonding | net_bonding | Aggregates multiple physical PMDs into one logical port | LAG/LACP support; active-backup or LACP mode |
📌 PMD as function pointer table: Each PMD registers a set of function pointers (eth_rx_burst_t, eth_tx_burst_t, etc.) at probe time. When you call rte_eth_rx_burst(), it's a single indirect function call through this table — you land directly in the PMD-specific burst code, with no generic dispatch layer in between. This is why different NICs can coexist in one DPDK application.
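A minimal sketch of that dispatch mechanism — the type and field names here are simplified assumptions loosely modeled on DPDK's internal fast-path ops table, not the actual rte_ethdev layout:
typedef uint16_t (*eth_rx_burst_t)(void *rxq,
                                   struct rte_mbuf **rx_pkts,
                                   uint16_t nb_pkts);

struct eth_fp_ops {                    // one per port
    eth_rx_burst_t rx_pkt_burst;       // registered by the PMD at probe time
    void **rxq;                        // per-queue private contexts
};

static struct eth_fp_ops fp_ops[64];   // indexed by port_id
                                       // (RTE_MAX_ETHPORTS in real DPDK)
static inline uint16_t
sketch_rx_burst_dispatch(uint16_t port_id, uint16_t queue_id,
                         struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
    const struct eth_fp_ops *p = &fp_ops[port_id];
    // One indirect call lands directly in ixgbe/i40e/mlx5-specific code:
    return p->rx_pkt_burst(p->rxq[queue_id], rx_pkts, nb_pkts);
}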
NIC Hardware Descriptor Rings
The descriptor ring is a circular array in hugepage memory shared between the NIC hardware and the PMD software. It is the fundamental data transfer mechanism — no pipes, no queues, no kernel — just two pointers (the NIC's and the PMD's) into a shared ring.
Rx Descriptor Ring (in hugepage memory, DMA-accessible by NIC)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│desc0│desc1│desc2│desc3│desc4│desc5│desc6│desc7│ ← ring[nb_rx_desc]
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
↑ NIC write ptr ↑ CPU read ptr
Each descriptor contains:
- buf_addr — IOVA of the pre-allocated mbuf
- buf_len — size of the buffer
- status — including the DD (Descriptor Done) bit
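For concreteness, a simplified descriptor layout might look like the sketch below — field names and sizes are illustrative (loosely modeled on Intel's 16-byte legacy format), not any real NIC's register spec:
#define RX_DESC_STATUS_DD (1u << 0)   // Descriptor Done — set by the NIC

struct rx_desc {               // lives in hugepage memory, DMA-visible
    uint64_t buf_addr;         // IOVA of the pre-posted mbuf data buffer
    uint16_t buf_len;          // buffer size (often a per-queue constant)
    uint16_t pkt_len;          // write-back: actual received length
    uint16_t status;           // write-back: includes the DD bit
    uint16_t vlan_tci;         // write-back: stripped VLAN tag, if any
};

// PMD readiness check — the only synchronization with the NIC:
//   if (ring[i].status & RX_DESC_STATUS_DD) { /* packet ready */ }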
Flow:
1. CPU fills ring[i].buf_addr = IOVA of an empty mbuf (pre-loaded at setup)
2. NIC DMA writes packet data into that IOVA
3. NIC sets ring[i].status |= DD bit ← handshake signal
4. PMD polls: if the DD bit is set → packet is ready → read the mbuf
RX DESCRIPTOR LIFECYCLE
1. Setup: rte_eth_rx_queue_setup() pre-fills all ring slots with empty mbuf IOVAs from the mempool
2. Packet arrives: NIC DMA engine writes packet bytes into the mbuf at that IOVA — zero CPU involvement
3. NIC signals done: NIC sets the DD bit in the descriptor + writes pkt_len, ol_flags, RSS hash
4. PMD polls: rte_eth_rx_burst() checks the DD bit → mbuf is ready → copies metadata into mbuf fields
5. Refill: PMD allocates a fresh mbuf from the pool → puts its IOVA into the now-empty ring slot → NIC can reuse it
6. Return: PMD returns received mbufs to the application — total latency: ~20–50 ns from DD bit set
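Putting the six steps together, the core of a scalar rx_burst looks roughly like this sketch (reusing the hypothetical struct rx_desc from above; real PMDs batch refills, use vector instructions, and update the NIC tail register to advertise refilled slots):
struct rxq {
    struct rx_desc *ring;          // hardware descriptor ring
    struct rte_mbuf **sw_ring;     // mbuf currently backing each slot
    struct rte_mempool *mp;        // pool used for refills
    uint16_t nb_desc;              // ring size (power of 2)
    uint16_t next_to_read;         // PMD's read position
};

static uint16_t
sketch_rx_burst(struct rxq *q, struct rte_mbuf **pkts, uint16_t n)
{
    uint16_t nb_rx = 0;
    while (nb_rx < n) {
        uint16_t i = q->next_to_read;
        if (!(q->ring[i].status & RX_DESC_STATUS_DD))
            break;                         // NIC not done with this slot yet
        struct rte_mbuf *fresh = rte_pktmbuf_alloc(q->mp);
        if (fresh == NULL)                 // pool empty: a real PMD bumps the
            break;                         // rx_nombuf counter and retries later
        struct rte_mbuf *m = q->sw_ring[i];
        m->pkt_len = m->data_len = q->ring[i].pkt_len;   // copy NIC write-back
        pkts[nb_rx++] = m;
        q->sw_ring[i] = fresh;             // refill the slot
        q->ring[i].buf_addr = rte_mbuf_data_iova_default(fresh);
        q->ring[i].status = 0;             // slot belongs to the NIC again
        q->next_to_read = (i + 1) & (q->nb_desc - 1);
    }
    return nb_rx;                          // tail-register update omitted
}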
TX DESCRIPTOR LIFECYCLE
1. App calls: rte_eth_tx_burst(port, queue, mbufs[], n)
2. PMD fills Tx descriptors: writes each mbuf's IOVA + length + offload flags, updates the Tx tail pointer
3. NIC DMA: reads the packet from the mbuf buffer → sends it on the wire
4. NIC sets DD: on the completed descriptor (async — the NIC is already busy sending the next packets)
5. Lazy free: PMD frees completed Tx mbufs on the next tx_burst call or when tx_free_thresh is crossed
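A sketch of that lazy-free step, again with hypothetical queue fields (real PMDs clean in batches once tx_free_thresh descriptors are outstanding):
#define TX_DESC_STATUS_DD (1u << 0)   // set by the NIC after transmission

struct tx_desc {
    uint64_t buf_addr;      // IOVA the NIC reads the frame from
    uint16_t data_len;
    uint16_t status;        // includes the Tx DD bit
};

struct txq {
    struct tx_desc  *ring;
    struct rte_mbuf **sw_ring;    // mbuf behind each in-flight descriptor
    uint16_t nb_desc, nb_used, next_to_clean;
};

static void
sketch_tx_clean(struct txq *q)
{
    // Reclaim only descriptors the NIC has marked done
    while (q->nb_used > 0 &&
           (q->ring[q->next_to_clean].status & TX_DESC_STATUS_DD)) {
        rte_pktmbuf_free(q->sw_ring[q->next_to_clean]);  // safe to free now
        q->next_to_clean = (q->next_to_clean + 1) & (q->nb_desc - 1);
        q->nb_used--;
    }
}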
⚠️ Critical: rte_eth_tx_burst() returns the count of packets actually queued (may be less than n if the Tx ring is full). The caller MUST free any unsent packets: pkts[nb_tx..n-1]. Failing to do so causes mbuf leaks → mempool exhaustion → rx_burst returns 0 mbufs → the application appears to stop receiving packets.
RX_BURST — THE HOT PATH FUNCTION
// rte_eth_rx_burst signature
uint16_t rte_eth_rx_burst(
uint16_t port_id, // which NIC port
uint16_t queue_id, // which Rx queue on that port
struct rte_mbuf **rx_pkts, // output: array of received mbufs
uint16_t nb_pkts // max mbufs to receive (burst size)
);
// Returns: actual number of mbufs received (0 to nb_pkts)
// Canonical polling loop
struct rte_mbuf *pkts[BURST_SIZE];
while (1) {
uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
  for (uint16_t i = 0; i < nb_rx; i++)
    process_packet(pkts[i]);  // must free (or transmit) each mbuf, or the pool leaks
}
TX_BURST — SAFE TRANSMIT PATTERN
// rte_eth_tx_burst — ALWAYS check return value
uint16_t nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_pkts);
// Free unsent packets (Tx ring was full)
if (unlikely(nb_tx < nb_pkts)) {
for (uint16_t i = nb_tx; i < nb_pkts; i++)
rte_pktmbuf_free(pkts[i]);
}
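Alternatively, ethdev ships a small Tx buffering helper that batches packets and, by default, frees anything it couldn't send. A sketch using the BURST_SIZE from the loop above (port, queue, and pkt as in the surrounding examples):
// One-time setup (per port/queue pair):
struct rte_eth_dev_tx_buffer *txb = rte_zmalloc_socket("txb",
        RTE_ETH_TX_BUFFER_SIZE(BURST_SIZE), 0,
        rte_eth_dev_socket_id(port));
rte_eth_tx_buffer_init(txb, BURST_SIZE);

// Hot path: buffer one mbuf; a full buffer triggers tx_burst internally,
// and unsent packets go to the default error callback (which frees them).
rte_eth_tx_buffer(port, queue, txb, pkt);

// End of each poll iteration: push out whatever is still buffered.
rte_eth_tx_buffer_flush(port, queue, txb);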
BURST SIZE TUNING
| Burst Size | Throughput | Latency | Cache Usage | Recommendation |
|---|---|---|---|---|
| 8 | Good for low load | Lowest | Minimal | Low-latency SLAs |
| 32 | Good balance | Moderate | Good I-cache reuse | Blaze/SASE-DP default |
| 64 | High throughput | Higher | Excellent | DPDK example default |
| 128+ | Marginal improvement | Higher | Diminishing returns | May exceed L1 cache |
🆕 Blaze/SASE-DP Real-World Finding: With 100G NIC and 8 workers, burst size 32 gave the best latency/throughput balance. At burst=64 throughput was ~3% higher but p99 latency increased ~15%. At burst=16, throughput dropped ~8%. Start with 32 — tune based on your latency SLA vs throughput target.
PORT CONFIGURATION — MANDATORY ORDER
1. rte_eal_init() — initialize EAL (hugepages, lcores, PCI probe)
2. rte_eth_dev_count_avail() — how many NIC ports are available?
3. rte_eth_dev_info_get() — query NIC capabilities (max queues, offload flags, descriptor limits)
4. rte_pktmbuf_pool_create() — create an mbuf pool on the NIC's NUMA socket
5. rte_eth_dev_configure() — configure the port: number of queues, offloads, RSS
6. rte_eth_rx_queue_setup() — set up each Rx queue (descriptor count, socket, pool)
7. rte_eth_tx_queue_setup() — set up each Tx queue (descriptor count, socket)
8. rte_eth_dev_start() — start the device (enables DMA, activates queues)
9. rte_eth_promiscuous_enable() — optional: receive all traffic regardless of dst MAC
10. rte_eth_link_get_nowait() — poll until the link is UP
// Full port configuration example
struct rte_eth_conf port_conf = {
  .rxmode = {
    .mq_mode = RTE_ETH_MQ_RX_RSS,   // required — RSS stays off without it
    .mtu = RTE_ETHER_MTU,           // 1500; rxmode.mtu is an MTU, not a frame length
    .offloads = RTE_ETH_RX_OFFLOAD_CHECKSUM |
                RTE_ETH_RX_OFFLOAD_RSS_HASH,
  },
.txmode = {
.mq_mode = RTE_ETH_MQ_TX_NONE,
.offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_CKSUM,
},
.rx_adv_conf.rss_conf = {
.rss_key = NULL, // use default 40-byte RSS key
.rss_hf = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
},
};
rte_eth_dev_configure(port_id, nb_rx_queues, nb_tx_queues, &port_conf);
for (uint16_t q = 0; q < nb_rx_queues; q++)
rte_eth_rx_queue_setup(port_id, q, 512, // nb_rx_desc
rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
for (uint16_t q = 0; q < nb_tx_queues; q++)
rte_eth_tx_queue_setup(port_id, q, 512,
rte_eth_dev_socket_id(port_id), NULL);
rte_eth_dev_start(port_id);
rte_eth_promiscuous_enable(port_id);
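Two hardening details worth adding to this sequence (shown as a sketch): descriptor counts must satisfy the NIC's min/max/alignment limits, so let the driver round them — and check every return code, since each call above can fail:
uint16_t nb_rxd = 512, nb_txd = 512;
// Driver rounds the requested counts to its own limits:
if (rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd) != 0)
    rte_exit(EXIT_FAILURE, "cannot adjust descriptor counts\n");
// Pass nb_rxd / nb_txd to the queue-setup calls instead of a hard-coded 512.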
| nb_rx_desc | Use Case | Trade-off |
|---|---|---|
| 256 | Low latency, light load | Small ring → NIC drops more under burst → imissed increments |
| 512 | Balanced — common default | Good balance of memory vs burst tolerance |
| 1024 | High throughput, bursty traffic | More memory, better burst handling |
| 4096 | Line-rate 100G with large bursts | Maximum burst tolerance — highest memory use |
| Offload Flag | Direction | Effect |
|---|---|---|
| RTE_ETH_RX_OFFLOAD_CHECKSUM | Rx | NIC verifies IP/TCP/UDP checksums. Sets RTE_MBUF_F_RX_*_CKSUM_GOOD/BAD flags. |
| RTE_ETH_RX_OFFLOAD_RSS_HASH | Rx | NIC computes the RSS hash. Sets mbuf->hash.rss and RTE_MBUF_F_RX_RSS_HASH. |
| RTE_ETH_RX_OFFLOAD_VLAN_STRIP | Rx | NIC strips the VLAN tag from the frame. Tag stored in mbuf->vlan_tci. |
| RTE_ETH_RX_OFFLOAD_SCATTER | Rx | Allow multi-segment mbufs (required for jumbo frames > buf_size). |
| RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | Tx | NIC computes and inserts the IPv4 header checksum. |
| RTE_ETH_TX_OFFLOAD_TCP_CKSUM | Tx | NIC computes and inserts the TCP checksum. |
| RTE_ETH_TX_OFFLOAD_VLAN_INSERT | Tx | NIC inserts a VLAN tag from mbuf->vlan_tci. |
| RTE_ETH_TX_OFFLOAD_TCP_TSO | Tx | TCP Segmentation Offload — NIC segments large TCP payloads into MTU-sized frames. |
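Requesting an offload the hardware lacks makes rte_eth_dev_configure() fail, so mask your wish list against the capabilities reported by dev_info — a sketch, reusing port_conf from the example above:
struct rte_eth_dev_info dev_info;
rte_eth_dev_info_get(port_id, &dev_info);

uint64_t want = RTE_ETH_RX_OFFLOAD_CHECKSUM | RTE_ETH_RX_OFFLOAD_RSS_HASH;
if ((dev_info.rx_offload_capa & want) != want)
    printf("warning: NIC lacks some requested Rx offloads\n");
port_conf.rxmode.offloads = want & dev_info.rx_offload_capa;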
PORT STATISTICS
// Read port statistics
struct rte_eth_stats stats;
rte_eth_stats_get(port_id, &stats);
printf("Rx: %lu pkts, %lu bytes, %lu missed, %lu errors\n",
stats.ipackets, stats.ibytes, stats.imissed, stats.ierrors);
printf("Tx: %lu pkts, %lu bytes, %lu errors\n",
stats.opackets, stats.obytes, stats.oerrors);
| Stat Field | Meaning | Action if Non-Zero |
|---|---|---|
| stats.imissed | Packets dropped by NIC hardware — Rx ring was full | Increase nb_rx_desc; increase burst size; reduce processing latency; add more worker lcores |
| stats.ierrors | Receive errors (bad FCS, oversized frames) | Check cable/NIC health; check MTU configuration |
| stats.rx_nombuf | Packets dropped — no free mbufs in the pool | Increase mempool size; check for mbuf leaks |
| stats.oerrors | Transmit errors | Check Tx configuration and offload flags |
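A minimal drop monitor, assuming a spare thread or lcore is available to run it: sample the counters once per second and alert on deltas:
uint64_t prev_missed = 0;
for (;;) {
    struct rte_eth_stats s;
    rte_eth_stats_get(port_id, &s);
    if (s.imissed > prev_missed)
        printf("Rx overflow: %lu pkts missed in the last second\n",
               s.imissed - prev_missed);
    prev_missed = s.imissed;
    sleep(1);
}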
⚠️ imissed vs rx_nombuf: These are different failure modes. imissed = the NIC couldn't write a packet because the Rx ring had no empty descriptors (ring full — software too slow to drain it). rx_nombuf = the PMD tried to refill the ring but the mempool had no free mbufs (mbuf leak). Both result in dropped packets but have different root causes and fixes.
RSS — Hardware Multi-Core Distribution
RSS (Receive Side Scaling) distributes incoming packets across multiple Rx queues using a hardware hash of the packet's 5-tuple. Each queue is serviced by one lcore. Because the same 5-tuple always maps to the same queue, all packets of a TCP connection always land on the same core — enabling lock-free per-flow state.
RSS MECHANISM
RSS — Packet to Queue Assignment (hardware path)
Packet arrives at NIC
↓
NIC parser extracts the 5-tuple from fixed byte offsets (untagged IPv4 shown):
Src IP @ bytes 26–29 Dst IP @ bytes 30–33
Src Port @ bytes 34–35 Dst Port @ bytes 36–37
Protocol @ byte 23
↓
Toeplitz Hash Unit (silicon logic — runs at wire speed):
Algorithm: for each input bit → if bit=1: hash XOR= key[i:i+32]
Same 5-tuple always → same 32-bit hash (deterministic)
Same 5-tuple → same hash → same queue → same lcore
hash = 0x3A7F1C
↓
RETA (Redirection Table) lookup:
queue = RETA[hash & (reta_size - 1)]
RETA[0x1C] = queue 3
↓
Packet DMA'd into Rx Queue 3 → lcore 3 picks it up
All packets of one TCP connection always land on the same lcore.
Per-flow state on one lcore — no locking needed on hot path.
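You can reproduce the hash in software with DPDK's rte_thash.h — handy for predicting which queue a flow will land on. A sketch, assuming rss_key is the 40-byte key the port was configured with and reta[]/reta_size mirror the NIC's table (tuple fields in host byte order):
union rte_thash_tuple t;
t.v4.src_addr = RTE_IPV4(10, 0, 0, 1);
t.v4.dst_addr = RTE_IPV4(10, 0, 0, 2);
t.v4.dport = 80;
t.v4.sport = 12345;

uint32_t hash = rte_softrss((uint32_t *)&t, RTE_THASH_V4_L4_LEN, rss_key);
uint16_t queue = reta[hash & (reta_size - 1)];   // same lookup the NIC does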
SYMMETRIC RSS KEY
The Symmetric RSS Problem
By default, RSS is asymmetric: hash(src=A, dst=B) ≠ hash(src=B, dst=A). For stateful NFs that process both directions of a flow, this means forward and return packets land on different cores — requiring cross-core state access. A symmetric Toeplitz key — the 2-byte pattern 0x6D5A repeated to fill the 40-byte key — fixes this: hash(A→B) == hash(B→A).
// Symmetric Toeplitz RSS key — 0x6D5A repeated 20 times.
// (The widely copied Microsoft default key is NOT symmetric.)
static uint8_t sym_rss_key[40] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};
// Use in rss_conf.rss_key — guarantees forward/return on same lcore
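Wiring the key into the Chapter 8 port_conf (a sketch — check dev_info.hash_key_size first; most NICs take 40-byte keys, i40e for example wants 52):
port_conf.rx_adv_conf.rss_conf.rss_key = sym_rss_key;
port_conf.rx_adv_conf.rss_conf.rss_key_len = sizeof(sym_rss_key);

// After dev_start, read back what the PMD actually programmed:
uint8_t cur_key[40];
struct rte_eth_rss_conf cur = { .rss_key = cur_key, .rss_key_len = 40 };
rte_eth_dev_rss_hash_conf_get(port_id, &cur);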
RETA IMBALANCE — THE POWER-OF-2 REQUIREMENT
⚠️ Blaze/SASE-DP Real-World Finding: With 6 workers (non-power-of-2) and RETA size = 128: 128/6 = 21.33 → uneven. Queues 0–1 got 22 entries each, queues 2–5 got 21 → cores 0–1 received ~5% more traffic. Under high load, cores 0–1 saturated first → throughput ceiling. Switching to 8 workers: all at ~91% utilization, throughput +12%. Rule: always use power-of-2 worker counts.
// Update RETA programmatically for even distribution
struct rte_eth_dev_info dev_info;
uint16_t reta_size;
rte_eth_dev_info_get(port_id, &dev_info);
reta_size = dev_info.reta_size; // typically 128 or 512
struct rte_eth_rss_reta_entry64 reta_conf[reta_size / RTE_ETH_RETA_GROUP_SIZE];
memset(reta_conf, 0, sizeof(reta_conf));
for (uint16_t i = 0; i < reta_size; i++) {
uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
uint16_t idx = i % RTE_ETH_RETA_GROUP_SIZE;
reta_conf[grp].mask = UINT64_MAX;
reta_conf[grp].reta[idx] = i % nb_workers; // nb_workers must be power-of-2
}
rte_eth_dev_rss_reta_update(port_id, reta_conf, reta_size);
Q: What is the DD bit and why does DPDK use it instead of interrupts?
The DD (Descriptor Done) bit is a status bit set by NIC hardware in a descriptor after it finishes with it — for Rx: after DMA'ing the packet; for Tx: after sending it. The PMD polls this bit in a tight loop instead of sleeping and waiting for an interrupt. At 100G/64B, ~148 Mpps would require 148M interrupts/sec — impossible. Polling eliminates interrupt latency and context switches entirely.
Q: What happens if rte_eth_tx_burst() returns less than nb_pkts?
The Tx ring was full — not all packets could be queued. The caller must free the unsent packets (pkts[nb_tx..nb_pkts-1]). Failing to do so causes a mbuf leak → mempool exhaustion → rx_burst can no longer refill Rx ring → rx_nombuf stat increments → application crashes or stops receiving.
Q: When are Tx mbufs actually freed?
NOT immediately after tx_burst. The NIC needs time to DMA the data. The PMD frees completed Tx mbufs lazily: either when the next tx_burst is called and the PMD reclaims descriptors, or when a configurable tx_free_thresh is crossed. Never access an mbuf after passing it to tx_burst — the mbuf may be freed by the PMD asynchronously.
Q: What is the order of port configuration API calls and why does it matter?
Order: dev_info_get → dev_configure → rx_queue_setup (each queue) → tx_queue_setup (each queue) → dev_start. This order is mandatory: dev_configure allocates internal resources; queue_setup allocates descriptor rings using those resources; dev_start enables DMA. Calling out of order returns EINVAL or silently fails.
Q: What does stats.imissed mean and how do you fix it?
imissed counts packets the NIC hardware dropped because the Rx ring had no empty descriptor slots — the application wasn't consuming packets fast enough. Fixes: (1) Increase nb_rx_desc; (2) Increase burst_size to drain more per call; (3) Reduce per-packet processing time; (4) Add more worker lcores.
Q: Why must worker count be a power of 2 for RSS?
RETA has a fixed size (typically 128 or 512). DPDK maps RETA entries evenly to queues: RETA[i] = i % nb_workers. If nb_workers is not a power of 2, the division is uneven — some queues get more RETA entries (more traffic) than others. Under load, the heavier queues saturate first, creating a throughput bottleneck. Power-of-2 counts guarantee exact even distribution.
🔥 Lab 5: L2 Forwarder with RSS Verification
Build the classic DPDK L2 forwarder (MAC swap + forward) and add RSS verification to confirm packets are landing on the expected lcore.
1. Configure the port with 4 Rx queues (power-of-2) and enable RSS on IP+TCP+UDP
2. Launch 4 worker lcores — each polls its own queue: rte_eth_rx_burst(port, lcore_id % 4, ...)
3. In the Rx loop: print mbuf->hash.rss and verify it's set (RTE_MBUF_F_RX_RSS_HASH in ol_flags)
4. MAC swap: swap src ↔ dst Ethernet addresses using rte_ether_addr_copy()
5. Transmit back on the same port/queue: rte_eth_tx_burst(port, queue, pkts, nb_rx)
6. Monitor stats: verify imissed == 0 and rx_nombuf == 0 under load
7. Extension: try 3 workers (non-power-of-2) — observe CPU imbalance in top -H
MASTERY CHECKLIST
- Can explain what the DD bit is and why DPDK polls it instead of using interrupts
- Can draw the full Rx descriptor lifecycle (6 steps from setup to application)
- Can write the canonical polling loop with correct Tx free pattern
- Can list the 10-step port configuration sequence in order and explain why order matters
- Can explain imissed vs rx_nombuf and what causes each
- Can explain RSS: Toeplitz hash, RETA, why same 5-tuple always lands on same lcore
- Can explain why worker count must be power-of-2 for even RSS distribution
- Can write RETA update code to manually control traffic distribution