DPDK MASTERY · PHASE 2 OF 3 · MODULE B
rte_ring, Distributor & App Models
Lock-free ring internals · CAS mechanics · rte_distributor · Run-to-completion vs Pipeline
Ch 10 — rte_ring · Ch 11 — rte_distributor · Ch 12 — App Models · C · Lock-Free · MPMC · Weeks 8–10

rte_ring — The Inter-Core Packet Bus

rte_ring is DPDK's lock-free, fixed-size circular buffer. It passes object pointers (typically mbuf pointers) between cores with minimal overhead — no mutexes, no condition variables, no syscalls. It is the primitive that connects Rx cores, worker cores, and Tx cores in a pipeline architecture.
rte_ring Internal Layout (in hugepage memory, power-of-2 sized)

┌─────────────────────────────────────────────────────────────────┐
│ ring metadata: name, size (power-of-2), mask, flags             │
│ prod.head  prod.tail    (producer / enqueue side)               │
│ cons.head  cons.tail    (consumer / dequeue side)               │
├─────────────────────────────────────────────────────────────────┤
│ ring[0] │ ring[1] │ ring[2] │  ...  │ ring[size-1]              │
│ (void* pointer slots — contain mbuf pointers or other objects)  │
└─────────────────────────────────────────────────────────────────┘

Indices are free-running counters, masked (idx & mask) only on array access:

  used slots = prod.tail - cons.head
  free slots = capacity - (prod.head - cons.tail)     (capacity = size - 1)

Invariant: cons.tail ≤ cons.head ≤ prod.tail ≤ prod.head
  Producers fill the slots between prod.tail and prod.head;
  consumers only ever see data up to prod.tail (committed data).
📌 Power-of-2 size requirement: rte_ring uses mask = size - 1 for modulo via bitwise AND — idx & mask instead of idx % size. Bitwise AND is a single instruction vs division which can be 20–80 cycles. This is why ring size must always be a power of 2.
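A minimal sketch of the mask trick (slot_of() is a hypothetical helper, not a DPDK API):

#include <stdint.h>

/* Because the ring size is only known at runtime, the compiler cannot
 * turn idx % size into an AND by itself, so rte_ring precomputes
 * mask = size - 1 once at creation and uses it on every access. */
static inline uint32_t slot_of(uint32_t idx, uint32_t mask)
{
    return idx & mask;   /* same result as idx % (mask + 1), one instruction */
}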

Lock-Free MPMC via CAS (Compare-And-Swap)

rte_ring achieves multi-producer multi-consumer safety without mutexes by using atomic CAS operations. CAS atomically checks whether a memory location still holds an expected value and, only if it does, swaps in a new value; if another thread modified the location concurrently, the CAS fails and the operation retries.

MULTI-PRODUCER ENQUEUE — CAS PROTOCOL

Multi-Producer Enqueue (simplified — showing CAS retry)

Producer A and Producer B both want to enqueue simultaneously:

Step ①: Both read current prod.head = 10
Step ②: Both compute new_head = 10 + 1 = 11
Step ③: CAS(prod.head, old=10, new=11) — atomic operation
        → Producer A wins CAS: prod.head = 11, A owns slot[10]
        → Producer B loses CAS: prod.head already 11 → retry from ①
Step ④: Producer A writes object pointer into ring[10]
Step ⑤: Producer A waits for prod.tail to reach 10
        (if another producer owns an earlier slot, A must wait for it to commit)
Step ⑥: Producer A sets prod.tail = 11 → consumer can now see slot[10]

Key insight: CAS failure is not an error — it's the retry signal.
Under low contention:  CAS succeeds first try → near-zero overhead.
Under high contention: retries add latency → prefer SPSC when possible.
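The same protocol condensed into C using GCC/Clang atomic builtins. This is a teaching sketch of steps ①–⑥, not DPDK's actual implementation (which claims slots in bulk, tunes memory ordering per architecture, and uses rte_pause() in its spin loops):

#include <stdint.h>

struct toy_ring {
    uint32_t mask;        /* size - 1, size is a power of 2 */
    uint32_t prod_head;   /* next slot to claim             */
    uint32_t prod_tail;   /* last committed slot + 1        */
    uint32_t cons_head;   /* consumer claim point           */
    uint32_t cons_tail;   /* consumer commit point          */
    void *slots[];        /* pointer storage                */
};

/* Multi-producer enqueue of one object: steps ①–⑥ from above. */
static int toy_mp_enqueue(struct toy_ring *r, void *obj)
{
    uint32_t head, next;

    do {                                              /* ①②③: claim a slot */
        head = __atomic_load_n(&r->prod_head, __ATOMIC_RELAXED);
        if (head - __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE) >= r->mask)
            return -1;                                /* full (capacity = size - 1) */
        next = head + 1;
    } while (!__atomic_compare_exchange_n(&r->prod_head, &head, next,
                                          0, __ATOMIC_RELAXED, __ATOMIC_RELAXED));

    r->slots[head & r->mask] = obj;                   /* ④: write the owned slot */

    while (__atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED) != head)
        ;                                             /* ⑤: wait for earlier producers */

    __atomic_store_n(&r->prod_tail, next, __ATOMIC_RELEASE);   /* ⑥: publish */
    return 0;
}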

Why Wait-Free is Not the Same as Lock-Free

rte_ring is lock-free (no thread ever holds a lock, so a stalled thread cannot block the others and at least one thread always makes progress) but not wait-free (an individual thread may retry its CAS indefinitely under contention). In practice, under typical DPDK workloads with one producer and one consumer per ring (SPSC mode), there is no CAS at all — just atomic load/store, which is near-zero cost.
Mode                             | Enqueue               | Dequeue               | Overhead                          | Use Case
---------------------------------+-----------------------+-----------------------+-----------------------------------+-------------------------------------------------
SPSC (single prod, single cons)  | No CAS — direct index | No CAS — direct index | Minimum — just atomic load/store  | One Rx core → one worker; fastest possible ring
MPSC (multi prod, single cons)   | CAS on producer       | No CAS on consumer    | Low on consumer side              | Multiple cores feeding one consumer (fan-in)
SPMC (single prod, multi cons)   | No CAS on producer    | CAS on consumer       | Low on producer side              | One source, multiple workers (rare)
MPMC (multi prod, multi cons)    | CAS on both sides     | CAS on both sides     | Highest — most general            | Default mode; both sides have multiple cores
// Create ring with explicit mode flags
struct rte_ring *ring;

// SPSC — fastest (dedicate one producer and one consumer core)
ring = rte_ring_create("FAST_RING", 1024, rte_socket_id(),
                       RING_F_SP_ENQ | RING_F_SC_DEQ);

// MPMC — default (most general)
ring = rte_ring_create("WORK_RING", 4096, rte_socket_id(), 0);  // 0 = MPMC

// Check if creation succeeded
if (!ring)
    rte_exit(EXIT_FAILURE, "Ring create failed: %s\n", rte_strerror(rte_errno));
⚠️ Ring size must be a power of 2. If you pass a non-power-of-2 size, rte_ring_create() returns NULL. The actual usable capacity is size - 1 (one slot is always kept empty to distinguish full from empty). So a ring of size 1024 holds at most 1023 objects.
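The relationship can be confirmed at runtime (a small sketch using the WORK_RING created above):

unsigned int size = rte_ring_get_size(ring);       // 4096 — slots allocated
unsigned int cap  = rte_ring_get_capacity(ring);   // 4095 — objects it can hold
printf("ring size %u, usable capacity %u\n", size, cap);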

CORE ENQUEUE / DEQUEUE APIs

// Single object
int ret;
ret = rte_ring_enqueue(ring, obj_ptr);        // 0 = success, -ENOBUFS = full
ret = rte_ring_dequeue(ring, &obj_ptr);       // 0 = success, -ENOENT = empty

// Bulk — preferred: reduces CAS contention + better cache efficiency
unsigned int n_enq, n_deq;
n_enq = rte_ring_enqueue_bulk(ring, objs, n, &free_space);
// Returns n on success, 0 on failure (ring doesn't have n free slots)
n_deq = rte_ring_dequeue_bulk(ring, objs, n, &avail);
// Returns n on success, 0 on failure (ring doesn't have n objects)

// Burst — partial success (unlike bulk, which is all-or-nothing)
n_enq = rte_ring_enqueue_burst(ring, objs, n, &free_space);
// Returns 0..n: enqueued as many as possible
n_deq = rte_ring_dequeue_burst(ring, objs, n, &avail);
// Returns 0..n: dequeued as many as available

bulk vs burst — Which to Use?

  • bulk: all-or-nothing. Enqueues exactly n objects or fails. Use when you need atomic batch operations — e.g., pass a full burst of 32 packets to a worker core atomically.
  • burst: enqueues as many as possible (0 to n). Use for drain loops where partial success is acceptable — e.g., forwarding loop that drains whatever is available.
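Both patterns in one sketch: burst to drain the input (partial results are fine), bulk for an atomic hand-off downstream. drain_once(), the retry-then-drop policy, and WORKER_BURST are assumptions for illustration:

#define WORKER_BURST 32

static void drain_once(struct rte_ring *in_ring, struct rte_ring *out_ring)
{
    void *objs[WORKER_BURST];
    unsigned int n = rte_ring_dequeue_burst(in_ring, objs, WORKER_BURST, NULL);

    if (n == WORKER_BURST) {
        /* Full batch: all-or-nothing hand-off. Real code would bound the
         * retries and count drops instead of spinning forever. */
        while (rte_ring_enqueue_bulk(out_ring, objs, n, NULL) == 0)
            rte_pause();
    } else if (n > 0) {
        /* Partial batch: take whatever fits downstream, drop the rest. */
        unsigned int sent = rte_ring_enqueue_burst(out_ring, objs, n, NULL);
        for (unsigned int i = sent; i < n; i++)
            rte_pktmbuf_free((struct rte_mbuf *)objs[i]);
    }
}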

RING INSPECTION APIs

unsigned int count    = rte_ring_count(ring);       // objects currently in ring
unsigned int free_cnt = rte_ring_free_count(ring);  // empty slots available
int full  = rte_ring_full(ring);                    // 1 if no free slots
int empty = rte_ring_empty(ring);                   // 1 if no objects

// Named ring lookup (for multi-process — secondary finds ring created by primary)
struct rte_ring *ring = rte_ring_lookup("WORK_RING");
if (!ring)
    /* ring not yet created by primary */;

rte_distributor — One RX Core → N Workers

rte_distributor implements the fan-out pattern: one RX/coordinator lcore receives packets from the NIC and distributes them to a pool of worker lcores based on a flow tag. The key property: all packets with the same tag (e.g., RSS hash) are guaranteed to go to the same worker — enabling per-flow state without locking.
rte_distributor Architecture

                 ┌─────────────────┐
                 │ RX / Coordinator│  lcore 0
                 │ rte_eth_rx_burst│
                 │ rte_distributor_│
                 │    process()    │
                 └────────┬────────┘
                          │ distributes by mbuf->hash.rss
             ┌────────────┼────────────┐
             ▼            ▼            ▼
       ┌──────────┐ ┌──────────┐ ┌──────────┐
       │ Worker 1 │ │ Worker 2 │ │ Worker 3 │   lcores 1, 2, 3
       │ rte_dist_│ │ rte_dist_│ │ rte_dist_│
       │ get_pkt()│ │ get_pkt()│ │ get_pkt()│
       └──────────┘ └──────────┘ └──────────┘

All packets with same hash → same worker → per-flow state, no locks
// Coordinator lcore (lcore 0)
struct rte_distributor *dist = rte_distributor_create(
    "SASE_DIST",          // name
    rte_socket_id(),      // NUMA socket
    nb_workers,           // number of worker lcores
    RTE_DIST_ALG_BURST    // burst mode (preferred over single)
);

struct rte_mbuf *pkts[BURST_SIZE];
while (1) {
    uint16_t nb_rx = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);

    // Set flow tag for each packet — distributor uses this for affinity
    for (uint16_t i = 0; i < nb_rx; i++)
        pkts[i]->hash.usr = pkts[i]->hash.rss;   // use RSS hash as tag

    rte_distributor_process(dist, pkts, nb_rx);
}

// Worker lcore (each runs this function)
static int worker_loop(void *arg)
{
    struct rte_distributor *dist = arg;
    struct rte_mbuf *pkts[BURST_SIZE];
    // Worker IDs must be 0..nb_workers-1, NOT raw lcore IDs. With the
    // coordinator on lcore 0 and workers on lcores 1..N:
    unsigned int worker_id = rte_lcore_id() - 1;
    uint16_t nb;

    while (1) {
        nb = rte_distributor_get_pkt(dist, worker_id, pkts, NULL, 0);
        for (uint16_t i = 0; i < nb; i++) {
            process_packet(pkts[i]);
            rte_pktmbuf_free(pkts[i]);
        }
    }
    return 0;
}
🆕 Blaze/SASE-DP Context: The SASE-DP URL filter uses a distributor-based architecture: the RX core receives packets from a 100G NIC and distributes by RSS hash (= 5-tuple hash) to 8 worker cores. Each worker owns its portion of the flow table — no cross-core lookups, no locking on the hot path. Enterprise and mobility traffic classes are separated by RETA programming.

TWO FUNDAMENTAL DPDK APPLICATION ARCHITECTURES

Run-to-Completion (RTC)

Each lcore handles the entire processing pipeline for its packets: RX → process → TX. All processing for a packet happens on one core before the next packet is touched.

Pros: Simplest. No inter-core communication. Best cache locality — packet data stays in one core's cache throughout processing. Lowest latency for simple NFs.

Cons: Per-packet processing must fit within one core's cycle budget. Hard to balance load when packets have variable processing cost. One slow packet delays every packet queued behind it on that core.

Best for: Simple forwarding, L2/L3 routing, stateless NFs.
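The whole RTC model fits in one loop. A minimal sketch, assuming a queue-index-equals-lcore-id mapping and a hypothetical process_packet():

/* Run-to-completion: this lcore does everything for its queue. */
static int rtc_loop(void *arg)
{
    uint16_t port  = *(uint16_t *)arg;
    uint16_t queue = rte_lcore_id();   /* assumed: one Rx/Tx queue pair per lcore */
    struct rte_mbuf *pkts[32];

    while (1) {
        uint16_t nb = rte_eth_rx_burst(port, queue, pkts, 32);
        for (uint16_t i = 0; i < nb; i++)
            process_packet(pkts[i]);   /* the entire pipeline, on this core */
        uint16_t sent = rte_eth_tx_burst(port, queue, pkts, nb);
        while (sent < nb)              /* free what the NIC didn't accept */
            rte_pktmbuf_free(pkts[sent++]);
    }
    return 0;
}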

Pipeline Model

Different lcores handle different stages: lcore 0 → RX, lcore 1 → classify, lcore 2 → policy, lcore 3 → TX. Packets flow through stages via rte_ring queues.

Pros: Each stage runs at its own speed. Easier to scale specific bottleneck stages by adding more cores. Stages can be optimized independently.

Cons: Each ring hand-off adds ~50–100 ns latency. Higher total latency. More complex. Ring backpressure must be handled explicitly.

Best for: Complex NFs with multiple distinct processing stages (DPI, URL filter, stateful firewalls). SASE-DP uses a hybrid.
Run-to-Completion (RTC)

lcore 0: RX → process → TX   (all ports, all stages)
lcore 1: RX → process → TX   (different queue)
lcore 2: RX → process → TX   (different queue)
lcore 3: RX → process → TX   (different queue)

Ring traffic: NONE — no inter-core packets

Pipeline Model

lcore 0: NIC RX → ring_rx[] ──────────────────────────────►
lcore 1: ring_rx[] → classify → ring_classify[] ──────────►
lcore 2: ring_classify[] → policy → ring_policy[] ────────►
lcore 3: ring_policy[] → TX NIC

Hybrid (SASE-DP): RTC within each stage, distributor between RX and workers
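Every pipeline stage reduces to the same skeleton: dequeue from the upstream ring, process, enqueue downstream, and handle backpressure explicitly when the downstream ring is full. A sketch of the classify stage, where the ring names and classify_packet() are assumptions:

static int classify_stage(void *arg)
{
    struct rte_ring *in  = rte_ring_lookup("ring_rx");
    struct rte_ring *out = rte_ring_lookup("ring_classify");
    struct rte_mbuf *pkts[32];

    while (1) {
        unsigned int n = rte_ring_dequeue_burst(in, (void **)pkts, 32, NULL);
        for (unsigned int i = 0; i < n; i++)
            classify_packet(pkts[i]);
        unsigned int sent = rte_ring_enqueue_burst(out, (void **)pkts, n, NULL);
        for (unsigned int i = sent; i < n; i++)
            rte_pktmbuf_free(pkts[i]);   /* explicit backpressure policy: drop */
    }
    return 0;
}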
Criterion       | Run-to-Completion                      | Pipeline
----------------+----------------------------------------+--------------------------------------
Latency         | Lower (no ring hand-off)               | Higher (50–100 ns per ring)
Throughput      | Equal if compute-bound                 | Better if stages can parallelize
Complexity      | Simple                                 | Complex (backpressure, stage tuning)
Load balancing  | Harder with variable per-packet cost   | Easier — tune per stage
Cache behavior  | Excellent (packet stays in one cache)  | Cold cache per stage
Use case        | Simple forwarding, routing             | DPI, URL filter, stateful NFs

Q: How does rte_ring achieve lock-free MPMC operation?

Using CAS (Compare-And-Swap) atomic operations. Each producer atomically claims a slot by CAS'ing the producer head pointer. If the CAS fails (another producer claimed the slot concurrently), it retries. Once a producer owns a slot, it writes the object and then waits for the producer tail to reach its slot (to maintain order), then advances the tail. Consumers similarly CAS the consumer head. Under low contention, CAS succeeds on first try with near-zero overhead.

Q: Why must rte_ring size be a power of 2?

rte_ring uses bitwise AND for modulo: idx & (size-1) instead of idx % size. Bitwise AND is a single-cycle instruction; division can take 20–80 cycles. At millions of enqueue/dequeue operations per second, this difference matters. Power-of-2 also means the mask is simply size - 1 — computed once at creation time.

Q: What is the difference between rte_ring_enqueue_bulk and enqueue_burst?

bulk: all-or-nothing. Enqueues exactly n objects or fails entirely (returns 0). The ring must have at least n free slots. Use when atomicity is required — e.g., passing a full burst to a stage.
burst: partial success. Enqueues 0 to n objects — as many as the ring can accept. Returns the actual count. Use in drain loops where you want maximum throughput regardless of how many succeed.

Q: What is rte_distributor and when would you use it over rte_ring?

rte_distributor is a higher-level fan-out primitive: one coordinator distributes packets to N workers by flow tag (hash), guaranteeing all packets of the same flow go to the same worker. Use it when you need flow affinity — per-flow state on workers without cross-core locks. Use rte_ring directly when you have simpler FIFO queuing needs or want more control over the distribution logic.

Q: When should you choose pipeline over run-to-completion?

Pipeline is better when: (1) Processing stages have very different compute costs — pipeline lets you add more cores to the bottleneck stage. (2) Stages can be developed and optimized independently. (3) You need different security/isolation boundaries between stages (separate processes via shared rings). RTC is better when: latency is paramount, processing is simple and uniform, or the NF fits cleanly within a single lcore's budget.
🔥 Lab 6: Ring-Based Worker Pipeline

Implement a two-stage pipeline: RX lcore → rte_ring → Worker lcore → TX. Measure the latency added by the ring hand-off.

1. Create two SPSC rings: rte_ring_create("RX_TO_WORKER", 1024, socket, RING_F_SP_ENQ | RING_F_SC_DEQ) and a symmetric TX ring
2. RX lcore (lcore 0): rte_eth_rx_burst() → timestamp each mbuf (see the timestamp sketch below) → rte_ring_enqueue_burst()
3. Worker lcore (lcore 1): rte_ring_dequeue_burst() → compute latency = rte_rdtsc() - mbuf_timestamp → rte_eth_tx_burst()
4. Print ring latency statistics: min, max, avg, p99 in nanoseconds
5. Compare with RTC: move all processing to one lcore (no ring) — measure the latency difference
6. Extension: try MPMC ring with 2 producers and 2 consumers — observe CAS overhead in the latency numbers
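For step 2, one way to carry the timestamp inside the mbuf is a dynamic field. A sketch, with lab6_rdtsc_ts as a made-up field name:

#include <rte_mbuf_dyn.h>

static int ts_offset;   /* byte offset of our field inside the mbuf */

static void register_ts_field(void)
{
    static const struct rte_mbuf_dynfield desc = {
        .name  = "lab6_rdtsc_ts",          /* hypothetical name */
        .size  = sizeof(uint64_t),
        .align = __alignof__(uint64_t),
    };
    ts_offset = rte_mbuf_dynfield_register(&desc);
    if (ts_offset < 0)
        rte_exit(EXIT_FAILURE, "cannot register timestamp dynfield\n");
}

#define PKT_TS(m) (*RTE_MBUF_DYNFIELD((m), ts_offset, uint64_t *))

/* RX lcore:     PKT_TS(mbuf) = rte_rdtsc();
 * Worker lcore: uint64_t cycles = rte_rdtsc() - PKT_TS(mbuf);
 *               convert: ns = cycles * 1e9 / rte_get_tsc_hz();  */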
🔥 Lab 7: rte_distributor Flow Affinity Verification

Verify that the distributor routes all packets of the same 5-tuple to the same worker core.

1. Set up distributor with 4 workers using rte_distributor_create()
2. In coordinator: set pkts[i]->hash.usr = pkts[i]->hash.rss as flow tag
3. In each worker: maintain a per-worker hash map of rss_hash → count (see the sketch below)
4. Generate traffic with 8 distinct 5-tuples (e.g., using pktgen or scapy)
5. After 1M packets: verify each RSS hash value appears on exactly one worker lcore — never split
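With only 8 test flows, step 3 does not need a real hash map; a small per-worker linear table is enough. A sketch where MAX_WORKERS, MAX_FLOWS, and tally() are all assumptions:

#define MAX_WORKERS 4
#define MAX_FLOWS   64    /* plenty of headroom for 8 test flows */

struct flow_stat {
    uint32_t tag;         /* RSS hash value */
    uint64_t count;       /* packets seen   */
};

/* Per-worker tables: each worker touches only its own row, so no locking. */
static struct flow_stat flow_tab[MAX_WORKERS][MAX_FLOWS];

static void tally(unsigned int worker_id, uint32_t tag)
{
    struct flow_stat *tab = flow_tab[worker_id];
    for (int i = 0; i < MAX_FLOWS; i++) {
        if (tab[i].count == 0 || tab[i].tag == tag) {   /* empty slot or match */
            tab[i].tag = tag;
            tab[i].count++;
            return;
        }
    }
}

/* Verification after the run: every tag must appear in exactly one worker's row. */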
