Module 13 — Atomic Counters + Per-lcore Stats

Pure C — fully runnable without DPDK. Build with make, run with ./atomic_stats.

What you learn

How to implement lock-free per-lcore statistics in C using _Atomic / stdatomic.h — the exact pattern used for every counter in the DP application. Includes a measured demo of false sharing (why cache-line alignment matters), atomic vs mutex performance (why atomics are mandatory in the hot path), memory ordering choices, and a live rate calculation from concurrent workers.


Atomic variables in the real DP application project

From domain_scan.h:

extern atomic_ullong hs_db_compile_count;
extern atomic_ullong hs_scratch_alloc_count;
extern atomic_ulong  match_count;
extern atomic_ulong  dns_rx_count;
extern atomic_ulong  dns_proc_count;

From policy_cache.c:

atomic_ulong malicious_domain_count;

Build and run

make
./atomic_stats

Expected output (abbreviated):

=== Module 13: Atomic Counters + Per-lcore Stats ===

sizeof(lcore_stats_t) = 128 bytes  (must be multiple of 64)
Alignment check passed.

Demo 2: False sharing
  Bad  layout (shared cache line): 1.247 sec
  Good layout (aligned, separate): 0.312 sec
  Speedup: 4.0x

Demo 3: atomic vs mutex
  4 threads × 10M increments each:
  mutex:  2.841 sec  (71 ns/op)
  atomic: 0.624 sec  (16 ns/op)
  Speedup: 4.6x

Key concepts

1. memory_order_relaxed for stats counters

/* In the hot path — worker lcore increments its own counter */
atomic_fetch_add_explicit(&stats->pkt_rx, 1, memory_order_relaxed);

/* Main lcore reads for aggregation */
unsigned long rx = atomic_load_explicit(&stats->pkt_rx, memory_order_relaxed);

memory_order_relaxed guarantees atomicity (no torn reads/writes) but no ordering (no fence instruction emitted on x86). For a stat counter this is always sufficient. Using the default memory_order_seq_cst emits an MFENCE instruction — ~20 cycles overhead per increment in the hot path.

Ordering Fence Use for
relaxed none stats counters, ref counts
release/acquire store/load fence publishing data (flag after write)
seq_cst full fence (MFENCE) global ordering — almost never needed

2. __attribute__((aligned(64))) on lcore_stats_t

Without alignment:

lcore_stats_t stats[4]:
  stats[0]: bytes 0–47   → cache line 0 (bytes 0–63)   ← lcore 3 writes
  stats[1]: bytes 48–95  → spans lines 0–1             ← lcore 4 writes
                                                           SAME cache line as lcore 3!

Every time lcore 4 writes stats[1].pkt_rx, the CPU sends a cache invalidation to lcore 3’s core. This is false sharing.

With aligned(64):

  stats[0]: bytes 0–63    → cache line 0 (lcore 3 only)
  stats[1]: bytes 64–127  → cache line 1 (lcore 4 only)

Zero cross-core invalidations. The measured 4× speedup in the demo shows the real cost at 50M increments/sec.

3. Per-lcore vs global atomics

/* Per-lcore: only THIS lcore writes — zero contention */
atomic_fetch_add_explicit(&stats->pkt_rx, 1, memory_order_relaxed);
/* Cost: ~3–5 ns (just the atomic instruction, no cache miss) */

/* Global (cross-lcore): ALL lcores write — contention */
atomic_fetch_add_explicit(&dns_rx_count, 1, memory_order_relaxed);
/* Cost: ~10–20 ns (CAS loop if contended, cache line bounces between cores) */

Keep global atomics for infrequent events: DB compiled, scratch allocated, malicious domain loaded. Use per-lcore atomics for per-packet counters.

4. Stats aggregation pattern (main lcore)

while (running) {
    sleep_ms(1000);
    stats_aggregate(all_lcore_stats, NUM_WORKERS, &curr);

    double dt = (curr.timestamp_ns - prev.timestamp_ns) / 1e9;
    double rx_pps = (curr.pkt_rx - prev.pkt_rx) / dt;

    LOG_INFO("rx=%.0f/s  dns=%.0f/s", rx_pps, (curr.pkt_dns - prev.pkt_dns) / dt);
    prev = curr;
}

No lock needed: stats_aggregate reads atomics with relaxed ordering.

5. aligned_alloc for arrays

/* Array of N cache-aligned stats structs */
lcore_stats_t *all = aligned_alloc(CACHE_LINE_SIZE,
                                    N * sizeof(lcore_stats_t));

malloc() only guarantees 16-byte alignment. Without aligned_alloc, even if the struct has aligned(64), the first element of a heap-allocated array might start at an unaligned address — breaking the false-sharing protection.


Connection to DPDK’s atomic API

DPDK has its own atomic types (rte_atomic64_t, rte_atomic32_t) which predate C11 atomics. Modern DPDK (>= 21.x) recommends using C11 _Atomic directly — which is what the DP application uses.

/* Old DPDK style (deprecated): */
rte_atomic64_t counter;
rte_atomic64_add(&counter, 1);

/* Modern C11 (the DP application / this module): */
atomic_ulong counter;
atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);

Next module

Module 14 — NUMA-aware Memory Allocation: rte_malloc_socket, rte_zmalloc_socket, rte_memzone_reserve — allocating memory on the correct NUMA socket for NIC queues, hash tables, and Hyperscan databases.


Source files

File Download
atomic_stats.c atomic_stats.c
Makefile Makefile