NETWORKING MASTERY · PHASE 4 · MODULE 17 · WEEK 15
🚀 High-Performance Networking with DPDK
Poll mode drivers · mbuf and mempool · RX/TX burst API · hugepages · NUMA · RSS · DPDK pipelines
Advanced · Prerequisite: M14 Linux Stack · DPDK 23.x · Your Core Work Domain · 3 Labs

WHY DPDK — THE KERNEL IS NOT FAST ENOUGH

🚀 The Performance Problem DPDK Solves

You already have 2.5 years of DPDK experience — this module deepens that foundation with the theoretical underpinning of each performance technique, connecting your practical knowledge to the "why".

/* Why kernel forwarding is slow — root causes */

1. Interrupt overhead (eliminated by PMD polling):
   10G at 64B = 14.8M packets/s = 14.8M IRQs/s
   Each IRQ: context switch + cache invalidation ≈ 1000 cycles ≈ 14.8G cycles/s wasted

2. Memory allocation (eliminated by mempool):
   kmalloc()/kfree() per sk_buff → fragmentation, lock contention
   DPDK mempool: pre-allocated, lock-free, O(1)

3. Memory copies (eliminated by zero-copy design):
   NIC → DMA buffer → sk_buff → socket rcvbuf → userspace
   DPDK: NIC DMA → mbuf in hugepage → application (1 copy from NIC)

4. Cache misses (eliminated by hugepages + NUMA pinning):
   4KB pages: 1GB of packet buffers = 262,144 TLB entries → TLB thrash
   2MB hugepages: same 1GB = 512 TLB entries → fits in TLB

5. Lock contention (eliminated by per-core design):
   Kernel routing: locks on ARP cache, routing table, socket buffers
   DPDK: each core owns its own queues and mempools → no locks

/* Performance numbers */
Kernel stack:   ~1-3 Mpps per core (64B packets)
DPDK (Intel):   ~30-80 Mpps per core
DPDK (Mellanox/ConnectX): 100+ Mpps per core
Your servers: AMD EPYC + Mellanox — which PMD are you using?

EAL — ENVIRONMENT ABSTRACTION LAYER

⚙️ EAL Initialization and Configuration

/* DPDK EAL — abstracts OS and hardware */
/* Manages: hugepages, NUMA, CPU affinity, PCI devices, logging */

/* Minimal DPDK application skeleton */
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

int main(int argc, char **argv) {
    /* EAL init: parses EAL args, sets up hugepages, maps devices */
    int ret = rte_eal_init(argc, argv);
    if (ret < 0) rte_exit(EXIT_FAILURE, "EAL init failed\n");
    argc -= ret; argv += ret;  /* remaining args are app-specific */

    /* Check available ports */
    uint16_t nb_ports = rte_eth_dev_count_avail();
    printf("Available ports: %u\n", nb_ports);

    /* Create mempool (see Tab 2) */
    struct rte_mempool *mp = rte_pktmbuf_pool_create(
        "MBUF_POOL",          /* name */
        8192 * nb_ports,      /* n: total mbufs */
        256,                  /* cache_size: per-core cache */
        0,                    /* priv_size */
        RTE_MBUF_DEFAULT_BUF_SIZE,
        rte_socket_id());     /* NUMA socket of the calling core */
    if (mp == NULL)
        rte_exit(EXIT_FAILURE, "mempool creation failed\n");

    /* Configure each port */
    uint16_t port_id;
    RTE_ETH_FOREACH_DEV(port_id) {
        port_init(port_id, mp);
    }

    /* Launch worker on each lcore */
    rte_eal_mp_remote_launch(lcore_main, NULL, CALL_MAIN);
    rte_eal_mp_wait_lcore();  /* block until all lcores return */

    rte_eal_cleanup();
    return 0;
}

/* Key EAL command-line arguments */
-l 0-3          # use lcores 0,1,2,3 (logical CPU cores)
-n 4            # 4 memory channels
--socket-mem 2048  # 2GB hugepage memory on socket 0
--vdev net_pcap0,iface=eth0  # use pcap driver (for testing without real NIC)
-a 0000:01:00.0 # allow only this PCI device

/* Hugepage setup (required before DPDK runs) */
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

mbuf AND MEMPOOL — PACKET BUFFER MANAGEMENT

📦 rte_mbuf and rte_mempool

/* rte_mbuf — DPDK's packet buffer (analogous to sk_buff) */
struct rte_mbuf {
    /* Buffer addresses */
    void            *buf_addr;       /* virtual address of buffer start */
    rte_iova_t       buf_iova;       /* IO/DMA address */
    uint16_t         buf_len;        /* total buffer length */
    uint16_t         data_off;       /* offset to first byte of data (headroom) */
    uint16_t         data_len;       /* data length in THIS mbuf segment */
    uint32_t         pkt_len;        /* total packet length (all segments) */

    /* Segmentation (chained mbufs for large packets) */
    struct rte_mbuf *next;           /* next segment in chain (NULL if only one) */
    uint16_t         nb_segs;        /* number of segments */

    /* Offload flags */
    uint64_t         ol_flags;       /* RTE_MBUF_F_TX_IP_CKSUM, RTE_MBUF_F_RX_RSS_HASH, etc. */
    uint32_t         packet_type;    /* RTE_PTYPE_L3_IPV4, L4_TCP, etc. */
    union { uint32_t rss; } hash;    /* RSS hash from NIC (a union in the real struct) */

    /* Input port */
    uint16_t         port;
};

/* Accessing packet data */
struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m,
    struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
uint32_t src_ip = rte_be_to_cpu_32(ip->src_addr);
uint32_t dst_ip = rte_be_to_cpu_32(ip->dst_addr);

/* Prepend header (like skb_push) */
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)
    rte_pktmbuf_prepend(m, sizeof(struct rte_ether_hdr));

/* rte_mempool — pre-allocated, lock-free pool */
/* Pool has: global ring + per-lcore cache (avoids lock on common case) */

/* Allocate mbuf from pool */
struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
if (!m) { /* pool exhausted — back-pressure or drop */ }

/* Free mbuf back to pool */
rte_pktmbuf_free(m);  /* returns to per-lcore cache, then global ring */

/* Bulk allocate/free (amortizes per-packet pool overhead) */
struct rte_mbuf *mbufs[32];
if (rte_pktmbuf_alloc_bulk(mp, mbufs, 32) == 0)  /* returns 0 on success */
    rte_pktmbuf_free_bulk(mbufs, 32);  /* respects refcounts, unlike a raw mempool put */
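
Complementing the prepend example above, here is a minimal sketch (assuming the mempool mp from the EAL tab) that builds a packet from scratch: allocate an mbuf, grow the data area with rte_pktmbuf_append(), then fill the Ethernet header.

/* Build a packet from scratch; append() grows data_len and pkt_len */
struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
if (pkt != NULL) {
    struct rte_ether_hdr *eth = (struct rte_ether_hdr *)
        rte_pktmbuf_append(pkt, sizeof(struct rte_ether_hdr));
    if (eth != NULL) {
        eth->ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);
        /* pkt->data_len == pkt->pkt_len == 14 at this point */
    } else {
        rte_pktmbuf_free(pkt);  /* no tailroom left */
    }
}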

PMD AND BURST API — THE CORE FORWARDING LOOP

🔄 Poll Mode Driver and rte_eth_rx/tx_burst

/* Port initialization */
static int port_init(uint16_t port, struct rte_mempool *mp) {
    struct rte_eth_conf port_conf = {
        .rxmode = { .mtu = RTE_ETHER_MTU },  /* DPDK 21.11+ replaced max_rx_pkt_len */
        .txmode = { .mq_mode = RTE_ETH_MQ_TX_NONE },
    };
    const int rx_rings = 1, tx_rings = 1;
    uint16_t nb_rxd = 1024, nb_txd = 1024;

    rte_eth_dev_configure(port, rx_rings, tx_rings, &port_conf);
    rte_eth_dev_adjust_nb_rx_tx_desc(port, &nb_rxd, &nb_txd);

    /* Setup RX queue on NUMA-local socket */
    rte_eth_rx_queue_setup(port, 0, nb_rxd,
        rte_eth_dev_socket_id(port), NULL, mp);
    /* Setup TX queue */
    rte_eth_tx_queue_setup(port, 0, nb_txd,
        rte_eth_dev_socket_id(port), NULL);

    rte_eth_dev_start(port);
    rte_eth_promiscuous_enable(port);
    return 0;
}

/* Core forwarding loop — the main performance-critical loop */
static int lcore_main(void *arg) {
    uint16_t port;
    struct rte_mbuf *bufs[BURST_SIZE];  /* BURST_SIZE = 32 */

    RTE_ETH_FOREACH_DEV(port) {
        if (rte_eth_dev_socket_id(port) >= 0 &&  /* -1 means socket unknown */
            rte_eth_dev_socket_id(port) != (int)rte_socket_id())
            printf("WARNING: port %u is on remote NUMA\n", port);
    }

    printf("Core %u forwarding. [Ctrl+C to quit]\n", rte_lcore_id());

    while (1) {
        RTE_ETH_FOREACH_DEV(port) {
            /* POLL: pull up to BURST_SIZE packets from NIC RX queue */
            uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            if (nb_rx == 0) continue;  /* nothing received */

            /* Process each packet */
            for (uint16_t i = 0; i < nb_rx; i++) {
                process_packet(bufs[i]);  /* L3 lookup, NAT, filter... */
            }

            /* Burst transmit all processed packets on the paired port (0↔1, 2↔3) */
            uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0, bufs, nb_rx);
            /* Free any packets that failed to transmit */
            if (nb_tx < nb_rx)
                rte_pktmbuf_free_bulk(&bufs[nb_tx], nb_rx - nb_tx);
        }
    }
}
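
The loop above leaves process_packet() abstract. For the L2-forwarding case used in Lab 1, a minimal sketch looks like this (process_packet is our placeholder, not a DPDK API; header field names follow DPDK 21.11+):

/* MAC swap: the simplest useful process_packet() */
static void process_packet(struct rte_mbuf *m) {
    struct rte_ether_hdr *eth =
        rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    struct rte_ether_addr tmp;

    rte_ether_addr_copy(&eth->src_addr, &tmp);
    rte_ether_addr_copy(&eth->dst_addr, &eth->src_addr);
    rte_ether_addr_copy(&tmp, &eth->dst_addr);
}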

/* Why burst size matters for performance */
# Burst=1:  function call overhead dominates → low throughput
# Burst=32: amortize call overhead, fill cache with packet data → optimal
# Burst=128: diminishing returns, prefetch distance too large
# Empirically: 32 is optimal for most workloads
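
A rough microbenchmark sketch for measuring this yourself; RX_BURST and the 10M-packet sample size are arbitrary knobs of ours, not DPDK constants:

/* Measure cycles/packet at a given burst size (RX only, packets dropped) */
#define RX_BURST 32

static void bench_rx(uint16_t port) {
    struct rte_mbuf *bufs[RX_BURST];
    uint64_t pkts = 0;
    uint64_t start = rte_get_timer_cycles();

    while (pkts < 10000000) {            /* sample ~10M packets */
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, RX_BURST);
        pkts += n;
        rte_pktmbuf_free_bulk(bufs, n);  /* drop, to isolate RX cost */
    }
    printf("burst=%d: %.1f cycles/packet\n", RX_BURST,
           (double)(rte_get_timer_cycles() - start) / pkts);
}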

RSS AND FLOW DIRECTOR — HARDWARE PACKET STEERING

🎯 RSS Configuration and Flow Director

/* RSS — Receive Side Scaling */
/* NIC hashes packet 5-tuple → assigns to RX queue → specific lcore */
/* Ensures packets of same flow always go to same core (session affinity) */

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = RTE_ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,     /* NULL = use default 40-byte key */
            .rss_hf  = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
        },
    },
};

/* Per-packet RSS hash (computed by NIC hardware) */
if (m->ol_flags & RTE_MBUF_F_RX_RSS_HASH) {
    uint32_t hash = m->hash.rss;  /* use for flow table lookup */
}

/* Symmetric RSS — ensure fwd and return packets land on same core */
/* Standard RSS: hash(sIP,dIP,sPort,dPort) — fwd and return differ! */
/* One fix is XOR folding: hash(sIP^dIP, sPort^dPort) is symmetric.  */
/* With hardware Toeplitz, the same effect comes from a special key: */
/* the byte pair 0x6d,0x5a repeated across all 40 bytes.             */
static uint8_t sym_rss_key[40] = {
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
};
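
You can check the symmetry in software with rte_softrss() from rte_thash.h. A sketch, assuming the IPv4/TCP tuple layout of src IP, dst IP, then both ports packed into one 32-bit word:

/* With the repeated 0x6d5a key, hash(fwd tuple) == hash(rev tuple) */
#include <rte_thash.h>

static int rss_is_symmetric(uint32_t sip, uint32_t dip,
                            uint16_t sport, uint16_t dport) {
    uint32_t fwd[3] = { sip, dip, ((uint32_t)sport << 16) | dport };
    uint32_t rev[3] = { dip, sip, ((uint32_t)dport << 16) | sport };

    return rte_softrss(fwd, 3, sym_rss_key) ==
           rte_softrss(rev, 3, sym_rss_key);  /* 1 for this key */
}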

/* Flow Director — exact-match steering beyond RSS */
/* Program specific 5-tuples → specific queue */
struct rte_flow_attr attr = { .ingress = 1 };

/* Match IPv4 + TCP dst port 80 → send to queue 3 */
struct rte_flow_item_ipv4 ipv4_spec = { .hdr.dst_addr = RTE_BE32(0xc0a80001) };
struct rte_flow_item_tcp  tcp_spec  = { .hdr.dst_port = RTE_BE16(80) };
struct rte_flow_action_queue queue_action = { .index = 3 };

/* Pattern and action lists must be populated and END-terminated */
struct rte_flow_item pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_spec },
    { .type = RTE_FLOW_ITEM_TYPE_TCP,  .spec = &tcp_spec  },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};
struct rte_flow_action action[] = {
    { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

/* Create the flow rule */
struct rte_flow_error error;
struct rte_flow *flow = rte_flow_create(port_id, &attr,
    pattern, action, &error);
if (flow == NULL)
    printf("flow create failed: %s\n",
           error.message ? error.message : "(no message)");

DPDK PIPELINES — STRUCTURING COMPLEX DATA PLANES

🏭 Run-to-Completion vs Pipeline Models

/* Model 1: Run-to-Completion (RTC) */
/* Each lcore processes packets end-to-end: RX → all processing → TX */
/* Pros: no inter-core communication, lowest latency */
/* Cons: all processing must fit in one core's budget */

lcore 0: RX port 0 → classify → ACL → NAT → TX port 1
lcore 1: RX port 1 → classify → ACL → NAT → TX port 0
lcore 2: RX port 2 → classify → ACL → NAT → TX port 3

/* Model 2: Pipeline (assembly line) */
/* Different cores handle different stages; communicate via ring queues */
/* Pros: each stage specialised, cache-friendly per stage */
/* Cons: ring enqueue/dequeue latency, pipeline stalls */

lcore 0 (RX):      NIC → mbuf → enqueue to classify_ring
lcore 1 (Classify): dequeue → L3 parse → enqueue to acl_ring
lcore 2 (ACL):     dequeue → policy check → enqueue to nat_ring
lcore 3 (NAT+TX):  dequeue → NAT → NIC TX

/* rte_ring — lock-free SPSC/MPSC/SPMC/MPMC ring queue */
struct rte_ring *ring = rte_ring_create("MY_RING", 4096,
    rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);

/* Enqueue (producer side) */
rte_ring_enqueue_burst(ring, (void **)mbufs, nb_mbufs, NULL);

/* Dequeue (consumer side) */
uint16_t n = rte_ring_dequeue_burst(ring, (void **)mbufs,
    BURST_SIZE, NULL);
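
Putting the ring calls together, a sketch of one complete pipeline stage; classify_ring, acl_ring, and classify_packet() are assumed to be created and defined elsewhere:

/* One pipeline stage: dequeue burst → process → enqueue to next stage */
static int classify_stage(void *arg) {
    struct rte_mbuf *burst[BURST_SIZE];
    (void)arg;

    while (1) {
        unsigned int n = rte_ring_dequeue_burst(classify_ring,
            (void **)burst, BURST_SIZE, NULL);
        for (unsigned int i = 0; i < n; i++)
            classify_packet(burst[i]);       /* stage-local work only */

        unsigned int sent = rte_ring_enqueue_burst(acl_ring,
            (void **)burst, n, NULL);
        if (sent < n)  /* next stage is full: drop, or retry with backoff */
            rte_pktmbuf_free_bulk(&burst[sent], n - sent);
    }
    return 0;
}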

/* DPDK Graph framework (DPDK 20.11+) */
/* Modern way to build pipelines: graph of nodes, edges are rte_rings */
/* Nodes: ip4_lookup, ip4_rewrite, acl_classify, etc. */
/* Automatic vectorisation: processes batch of packets per node */

DPDK PERFORMANCE TUNING

Systematic Performance Optimisation

/* 1. CPU isolation — dedicate cores to DPDK */
# Kernel boot: isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7
# DPDK EAL: -l 4-7  (use cores 4-7)
# These cores will spin 100% polling — don't share with OS

/* 2. NUMA awareness — memory on same socket as NIC */
if (rte_eth_dev_socket_id(port) != (int)rte_socket_id()) {
    printf("NUMA mismatch: NIC on socket %d, core on socket %d\n",
           rte_eth_dev_socket_id(port), rte_socket_id());
    /* Cross-NUMA memory access: +60ns latency per access */
    /* Fix: pin workers to same NUMA node as their NIC */
}
/* Mempool MUST be on same NUMA as NIC: */
rte_pktmbuf_pool_create("MP", N, 256, 0, BUF_SIZE,
    rte_eth_dev_socket_id(port));  /* NOT rte_socket_id() */

/* 3. Prefetching — hide memory latency */
for (i = 0; i < nb_rx; i++) {
    if (i + 4 < nb_rx)
        rte_prefetch0(rte_pktmbuf_mtod(bufs[i + 4], void *));
    process_packet(bufs[i]);  /* by the time we process [i], [i+4] is in L1 */
}

/* 4. TX descriptor writeback — reduce PCIe round trips */
struct rte_eth_txconf txconf = {
    .tx_thresh = { .pthresh = 32, .hthresh = 0, .wthresh = 0 },
    .tx_free_thresh = 32,        /* free 32 at once, not 1 by 1 */
};

/* 5. Offloads — hardware helps software */
struct rte_eth_conf port_conf = {
    .rxmode.offloads = RTE_ETH_RX_OFFLOAD_CHECKSUM |  /* HW validates cksum */
                       RTE_ETH_RX_OFFLOAD_RSS_HASH,    /* HW computes RSS */
    .txmode.offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | /* HW fills IP cksum */
                       RTE_ETH_TX_OFFLOAD_TCP_CKSUM,   /* HW fills TCP cksum */
};

/* 6. Measuring performance */
uint64_t hz = rte_get_timer_hz();
uint64_t start = rte_get_timer_cycles();
/* ... process N packets ... */
uint64_t elapsed = rte_get_timer_cycles() - start;
double mpps = (double)N / ((double)elapsed / hz) / 1e6;
printf("%.2f Mpps (%.1f ns/packet)\n", mpps, 1e9 * elapsed / hz / N);
LAB 1

DPDK Packet Counter and L3 Forwarder

Objective: Build the classic DPDK "basicfwd" — receive packets, swap MAC addresses, transmit back. Add per-flow counters.

1. Set up hugepages and bind a test NIC (or use a --vdev pcap device): dpdk-devbind.py --bind=vfio-pci 0000:01:00.0. Build and run the DPDK skeleton from the EAL tab. Verify you can receive packets.
2. Implement L2 forwarding: for each received packet, swap src/dst MAC (rte_ether_addr_copy), update checksums if needed, and transmit on the other port. This is the DPDK equivalent of a wire — measure throughput with pktgen-dpdk.
3. Add an rte_hash table (DPDK's analogue of a BPF hash map): key = 5-tuple, value = packet count. For each received packet, parse the IP and TCP/UDP headers, look up or create the entry, and increment the count (see the sketch after this lab). Print the top-10 flows by packet count every 5 seconds.
4. Benchmark: measure Mpps with different burst sizes (1, 4, 8, 16, 32, 64). Plot throughput vs burst size and identify the optimal burst size for your hardware. Document what limits throughput (PCIe bandwidth? CPU cycles? Memory bandwidth?).
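
A sketch for step 3, using the position that rte_hash assigns to each key as an index into a plain counter array; the flow_key layout and table size here are our choices, not prescribed by DPDK:

/* Flow counting with rte_hash: the key position doubles as counter index */
#include <rte_hash.h>
#include <rte_jhash.h>

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
} __rte_packed;

static struct rte_hash *flow_table;
static uint64_t flow_count[65536];

static void flow_table_init(void) {
    struct rte_hash_parameters params = {
        .name = "flow_table",
        .entries = 65536,
        .key_len = sizeof(struct flow_key),
        .hash_func = rte_jhash,
        .socket_id = (int)rte_socket_id(),
    };
    flow_table = rte_hash_create(&params);
}

static void count_flow(const struct flow_key *key) {
    int32_t pos = rte_hash_lookup(flow_table, key);
    if (pos < 0)
        pos = rte_hash_add_key(flow_table, key);  /* new flow */
    if (pos >= 0)
        flow_count[pos]++;  /* pos is stable for the key's lifetime */
}
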
LAB 2

Multi-Core DPDK with RSS

Objective: Scale DPDK forwarding to multiple cores using RSS for flow distribution.

1. Configure 4 RX queues and 4 TX queues on your test NIC. Set RSS to distribute based on the IP+TCP 5-tuple. Launch 4 worker threads, one per queue: lcore 4 handles queue 0, lcore 5 handles queue 1, etc.
2. Generate flows from a traffic generator with varying 5-tuples. Verify RSS distribution by reading per-queue packet counters from rte_eth_stats_get(). Distribution should be roughly even (within 10%).
3. Compare symmetric vs asymmetric RSS: try the standard Toeplitz key, then the symmetric key. Verify that with symmetric RSS, forward and reverse flows of the same connection land on the same queue.
LAB 3

Performance Profiling Deep-Dive

Objective: Profile your DPDK application at the cycle level and identify bottlenecks.

1. Read the CPU's PMU counters around your forwarding loop: measure cycles per packet, LLC cache misses per packet, and DRAM accesses per packet using perf on the isolated cores: perf stat -e cycles,cache-misses,dTLB-load-misses -C 4 sleep 5.
2. Profile memory access patterns: add artificial 5-tuple lookups against a 10K-entry rte_hash table. Measure performance with the table in L3 cache (small table) vs DRAM (large table). Quantify the cost of a single cache miss in ns.
3. Test NUMA effects: move your mempool to the remote NUMA node (on a two-socket system, use socket_id = 1 - rte_eth_dev_socket_id(port)). Measure the throughput degradation and verify the ~60ns cross-NUMA penalty empirically.

M17 MASTERY CHECKLIST

When complete: Move to M18 - VPP and Data Plane Development — the final Phase 4 module, covering the vector packet processor your team actively uses for R&D.
