Module 03 — Ring Buffer (SPSC, lock-free)
What you learn
How to implement a lock-free Single-Producer Single-Consumer (SPSC) ring
buffer in C — the manual equivalent of DPDK’s rte_ring.
A ring buffer is the primary inter-lcore communication channel in a DPDK application. In the DP application:
- RX lcore enqueues received
rte_mbuf*pointers into a ring - Worker lcores dequeue from the ring and run policy decisions
- Worker lcores enqueue processed packets into a TX ring
- TX lcore dequeues and calls
rte_eth_tx_burst()
Understanding this module makes rte_ring’s source code immediately
readable — the same three ideas appear in both:
- Power-of-2 size with bitmask indexing
- Separate head (producer) and tail (consumer) on separate cache lines
- Acquire/release memory barriers
Where this fits in the real application
RX lcore
│
│ rte_eth_rx_burst() → mbufs[]
│ rte_ring_enqueue_burst(rx_ring, mbufs, nb_rx)
│
▼
[ rx_ring ] ← rte_ring, created with ring_create() equivalent
│
│ rte_ring_dequeue_burst(rx_ring, mbufs, BURST_SIZE)
▼
Worker lcore
│ process_dns_for_group()
│ policy decision → DROP / ALLOW / SINKHOLE
│ rte_ring_enqueue_burst(tx_ring, fwd_mbufs, nb_fwd)
│
▼
[ tx_ring ]
│
▼
TX lcore
│ rte_eth_tx_burst()
Files
| File | Purpose |
|---|---|
ring.h |
Struct, API declarations, inline status helpers |
ring.c |
Implementation, functional tests, concurrent producer/consumer demo |
Makefile |
Build rules (-lpthread required for demo threads) |
Build and run
make
./ring
Expected output:
=== Module 03: Ring Buffer (SPSC) ===
ring_t size: 128 bytes
--- Test 1: basic enqueue/dequeue ---
PASS: FIFO order preserved
--- Test 2: ring full behaviour ---
PASS: RING_FULL returned correctly
--- Test 3: ring empty behaviour ---
PASS: RING_EMPTY returned on empty ring
--- Test 4: index wraparound ---
PASS: head/tail wrapped around correctly
--- Test 5: concurrent producer/consumer ---
[producer] Starting: will enqueue 1000 items in bursts of 16
[consumer] Starting: waiting for items
[producer] Done: enqueued 1000 items
[consumer] Done: received 1000 items
consumed_sum = 499500 expected = 499500
PASS: all items delivered, no drops, no corruption
All tests passed.
Key concepts in the code
1. Power-of-2 size + bitmask (instead of modulo)
r->mask = size - 1;
// instead of: index % size (division, ~20 cycles)
// we use: index & r->mask (AND, ~1 cycle)
This is one of the most common DPDK micro-optimisations. Every index
access in the inner loop benefits from it. rte_ring uses the same trick.
2. Sentinel slot — why capacity is size - 1
The ring uses one slot as a sentinel so head == tail unambiguously means
empty. If we allowed head == tail to also mean full, we’d need a
separate count field — which requires atomic updates from both producer
and consumer (contention). The sentinel avoids that.
size = 4, mask = 3
head=0 tail=0 → empty (head == tail)
head=1 tail=0 → 1 item
head=2 tail=0 → 2 items
head=3 tail=0 → FULL ((head-tail) & mask == size-1 == 3)
3. Memory barriers — why they matter
/* producer: write data THEN advance head */
r->entries[head & r->mask] = obj;
__atomic_store_n(&r->head, next, __ATOMIC_RELEASE); // barrier here
/* consumer: read head THEN read data */
head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE); // barrier here
*obj = r->entries[tail & r->mask];
Without RELEASE/ACQUIRE:
- The CPU or compiler can reorder the store to
entries[]to happen after the store tohead. - The consumer sees the new
headand reads a slot that hasn’t been written yet → silent data corruption.
RELEASE ensures all stores before it are visible before the store itself.
ACQUIRE ensures all loads after it see everything that happened before
the paired RELEASE.
In DPDK this is expressed as rte_smp_wmb() / rte_smp_rmb().
4. False sharing prevention
head and tail are aligned to separate cache lines (64 bytes each):
uint32_t head __attribute__((aligned(RING_CACHE_LINE_SIZE)));
uint32_t tail __attribute__((aligned(RING_CACHE_LINE_SIZE)));
Without this: every time the producer writes head, the CPU broadcasts
a cache line invalidation to the consumer’s core (which holds tail in
the same cache line). The consumer stalls waiting for the line.
With separate cache lines: each core owns its index without interference.
rte_ring uses __rte_cache_aligned for the same reason.
5. DPDK mapping
| This module | DPDK equivalent |
|---|---|
ring_create(size) |
rte_ring_create(name, size, socket_id, flags) |
ring_enqueue(r, obj) |
rte_ring_enqueue(r, obj) |
ring_dequeue(r, &obj) |
rte_ring_dequeue(r, &obj) |
ring_enqueue_burst(...) |
rte_ring_enqueue_burst(...) |
ring_dequeue_burst(...) |
rte_ring_dequeue_burst(...) |
mask = size - 1 |
identical in rte_ring source |
__ATOMIC_RELEASE |
rte_smp_wmb() |
__ATOMIC_ACQUIRE |
rte_smp_rmb() |
The main differences in rte_ring:
- MPMC support via CAS (
__atomic_compare_exchange) for multi-producer safety - NUMA-aware allocation (
rte_malloc_socketon the correct socket) - Named rings registered in a global table for cross-process access
Next module
Module 04 — Hash Map: A fixed-size open-addressing hash map in pure C —
the manual equivalent of DPDK’s rte_hash.
Source files
| File | Download |
|---|---|
ring.h |
ring.h |
ring.c |
ring.c |
Makefile |
Makefile |