VPP MASTERY · PHASE 3A · WEEKS 9–10
🔌 DPDK Plugin Deep Dive
dpdk-input · zero-copy mbuf bridge · startup.conf DPDK stanza · Mellanox mlx5 · xstats

DPDK PLUGIN ARCHITECTURE

🏗️

How the DPDK Plugin Integrates

ARCHITECTURE

The DPDK plugin (dpdk_plugin.so) bridges DPDK's poll-mode driver (PMD) model and VPP's graph-node model. It is responsible for initialising the DPDK EAL, probing and configuring physical ports, polling RX queues, converting mbufs to vlib buffers, and transmitting vlib buffers back out through DPDK's TX burst API.

/* Plugin source layout: src/plugins/dpdk/ */
dpdk/
├── device/
│   ├── node.c       # dpdk-input node function - the RX hot path
│   ├── tx_func.c    # device TX function (per-interface tx node) - the TX hot path
│   ├── init.c       # EAL init, port setup, queue allocation
│   └── format.c     # CLI formatting: show dpdk interface
├── dpdk.h           # dpdk_main_t, dpdk_device_t - master structs
└── api/
    └── dpdk.api     # Binary API: set DPDK interface config etc

/* Key structs */
dpdk_main_t   - singleton: EAL args, device pool, per-worker tx queues
dpdk_device_t - per-port: port_id, n_rx_queues, rx/tx descriptors, stats
  • The DPDK plugin calls rte_eal_init() during VPP startup - before any graph nodes run
  • One dpdk_device_t exists per physical port; all ports live in the dm->devices vec, indexed by device_index (see the struct sketch after this list)
  • Each RX queue is polled by exactly one worker thread - the per-worker (device, queue) assignment lives in dm->devices_by_cpu[thread_index], exactly as the node code below iterates it
  • TX uses per-worker tx queue buffers to avoid locking: worker N uses tx queue N exclusively
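
A reduced sketch of the two structs, limited to the fields discussed here - field names and layout are simplified, not the literal definitions in dpdk.h:

/* Simplified sketch - not the literal definitions in dpdk.h */
typedef struct
{
  u32 device_index;          /* index of this port in dpdk_main_t.devices */
  u16 port_id;               /* DPDK ethdev port id used by rte_eth_* calls */
  u32 sw_if_index;           /* VPP software interface index */
  u16 n_rx_queues;           /* each RX queue is polled by exactly one worker */
  u16 n_tx_queues;           /* one TX queue per worker - lock-free TX */
  u16 n_rx_desc, n_tx_desc;  /* descriptor ring sizes */
  /* ... offload flags, per-port stats, link state ... */
} dpdk_device_t;

typedef struct
{
  dpdk_device_t *devices;                    /* vec of all DPDK ports */
  dpdk_device_and_queue_t **devices_by_cpu;  /* per-worker (device, queue) poll list */
  dpdk_per_thread_data_t *per_thread_data;   /* per-worker mbuf scratch arrays */
  /* ... EAL arguments, global config ... */
} dpdk_main_t;

extern dpdk_main_t dpdk_main;                /* the singleton */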

dpdk-input - THE RX HOT PATH

dpdk-input Node Internals

INTERNALS

dpdk-input is a VLIB_NODE_TYPE_INPUT node that polls DPDK RX queues. It is the entry point for all physical network traffic in VPP. The key performance insight is that it processes a burst of mbufs per queue per call - up to the RX burst size passed to rte_eth_rx_burst() (DPDK_RX_BURST_SZ in the sketch below) - and converts them all to vlib buffer indices before dispatching a full frame to the next graph node.

/* Simplified dpdk-input hot path (src/plugins/dpdk/device/node.c) */
VLIB_NODE_FN(dpdk_input_node)(vlib_main_t *vm, vlib_node_runtime_t *node,
                              vlib_frame_t *frame)
{
    dpdk_main_t *dm = &dpdk_main;
    dpdk_per_thread_data_t *ptd = vec_elt_at_index(dm->per_thread_data,
                                                    vm->thread_index);
    u32 n_rx_packets = 0;

    /* Poll each queue assigned to this worker */
    dpdk_device_and_queue_t *dq;
    vec_foreach(dq, dm->devices_by_cpu[vm->thread_index]) {
        dpdk_device_t *xd = vec_elt_at_index(dm->devices, dq->device_index);

        /* DPDK burst receive - fills ptd->mbufs[] */
        u32 n_rx = rte_eth_rx_burst(xd->port_id, dq->queue_id,
                                   ptd->mbufs, DPDK_RX_BURST_SZ);
        if (n_rx == 0) continue;

        /* Convert mbufs to vlib buffer indices + dispatch to ethernet-input */
        n_rx_packets += dpdk_process_rx_burst(vm, node, xd, dq->queue_id,
                                             ptd, n_rx);
    }
    return n_rx_packets;
}

/* What dpdk_process_rx_burst does: */
/* 1. For each mbuf: derive vlib_buffer_t pointer (they share memory) */
/* 2. Set vlib_buffer fields: current_data, current_length, sw_if_index */
/* 3. Copy DPDK offload flags to vlib_buffer flags (RSS hash, checksum) */
/* 4. Enqueue u32 buffer indices to ethernet-input frame */

💡 Key performance detail: dpdk-input does NOT call vlib_buffer_alloc(). Instead, vlib buffers and DPDK mbufs share the same memory pool - the vlib buffer header IS the mbuf's private data area. This zero-copy design means RX never allocates memory; the conversion from mbuf to vlib_buffer is a pointer offset calculation.
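
A condensed sketch of that conversion loop, stitching steps 1-4 together (the real dpdk_process_rx_burst is unrolled four-wide with prefetching; the helper name here is illustrative):

/* Illustrative helper - not the literal dpdk_process_rx_burst */
static_always_inline u32
dpdk_rx_burst_to_buffers (vlib_main_t *vm, dpdk_device_t *xd,
                          struct rte_mbuf **mbufs, u32 n_rx, u32 *buffers)
{
  for (u32 i = 0; i < n_rx; i++)
    {
      struct rte_mbuf *mb = mbufs[i];

      /* 1. mbuf and vlib_buffer_t share memory - pointer math, no allocation */
      vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (mb);

      /* 2. sync the metadata DPDK filled in */
      b->current_data = mb->data_off - RTE_PKTMBUF_HEADROOM;
      b->current_length = mb->data_len;
      vnet_buffer (b)->sw_if_index[VLIB_RX] = xd->sw_if_index;
      vnet_buffer (b)->sw_if_index[VLIB_TX] = ~0;

      /* 3. copy offload metadata, e.g. RSS hash -> flow id */
      if (mb->ol_flags & PKT_RX_RSS_HASH)
        b->flow_id = mb->hash.rss;

      /* 4. collect the buffer index for the ethernet-input frame */
      buffers[i] = vlib_get_buffer_index (vm, b);
    }
  return n_rx;
}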

MBUF ↔ VLIB_BUFFER MEMORY BRIDGE

🔗

Shared Memory Layout

ZERO-COPY

The DPDK plugin pre-allocates a single rte_mempool whose per-mbuf private data area is sized to hold a vlib_buffer_t. In each rte_mbuf of this pool, the private data region immediately after the rte_mbuf header is occupied by the vlib_buffer_t header - the two headers sit back-to-back in the same buffer allocation.
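
A minimal sketch of how such a pool could be created with the standard DPDK helper - the pool name, cache size and data-room size are illustrative, and VPP's real pool setup in the plugin's buffer code is more involved:

/* Illustrative only - assumes DPDK and VPP buffer headers are available */
#include <rte_mbuf.h>

static struct rte_mempool *
create_vpp_style_pool (unsigned n_mbufs, int socket_id)
{
  /* priv area must be big enough (and suitably aligned) to hold the
   * vlib_buffer_t header that sits right after the rte_mbuf header */
  uint16_t priv_size = RTE_ALIGN (sizeof (vlib_buffer_t), RTE_MBUF_PRIV_ALIGN);

  return rte_pktmbuf_pool_create ("vpp-mbuf-pool", n_mbufs,
                                  256,                        /* per-core cache */
                                  priv_size,
                                  RTE_MBUF_DEFAULT_BUF_SIZE,  /* data room */
                                  socket_id);
}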

/* Memory layout of a DPDK+VPP buffer */

+------------------------+-----------------------------+----------+
|  rte_mbuf (128 bytes)  |  vlib_buffer_t (128 bytes)  |  data[]  |
|  (DPDK header)         |  (VPP header = mbuf priv)   |          |
+------------------------+-----------------------------+----------+
                          ↑                             ↑
                          vlib_buffer ptr               packet data

/* Converting between the two */
/* mbuf → vlib_buffer */
vlib_buffer_t *b = vlib_buffer_from_rte_mbuf(mb);
/* equivalent to: (vlib_buffer_t *)RTE_PTR_ADD(mb, sizeof(struct rte_mbuf)) */

/* vlib_buffer → mbuf */
struct rte_mbuf *mb = rte_mbuf_from_vlib_buffer(b);
/* equivalent to: (struct rte_mbuf *)RTE_PTR_SUB(b, sizeof(struct rte_mbuf)) */

/* Fields are synced at RX entry and TX exit */
/* RX: DPDK fills mbuf, plugin copies to vlib_buffer fields */
b->current_data   = mb->data_off - RTE_PKTMBUF_HEADROOM;
b->current_length = mb->data_len;
if (mb->ol_flags & PKT_RX_RSS_HASH)
  b->flow_id = mb->hash.rss;              /* RSS hash exported as the flow id */
vnet_buffer(b)->sw_if_index[VLIB_RX] = xd->sw_if_index;
vnet_buffer(b)->sw_if_index[VLIB_TX] = ~0;  /* unknown at RX */

/* TX: vlib_buffer → mbuf */
mb->data_off = b->current_data + RTE_PKTMBUF_HEADROOM;
mb->data_len = b->current_length;
mb->pkt_len  = b->current_length;
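
The TX direction runs the same bridge in reverse. A simplified sketch, assuming a worker-owned TX queue (the real tx_func.c additionally handles ring-full retries, chained buffers and queue selection; the helper name is illustrative):

/* Illustrative TX sketch - the real hot path lives in device/tx_func.c */
static_always_inline void
dpdk_tx_burst_sketch (vlib_main_t *vm, dpdk_device_t *xd, u16 queue_id,
                      u32 *buffers, u32 n_pkts)
{
  struct rte_mbuf *mbufs[VLIB_FRAME_SIZE];

  for (u32 i = 0; i < n_pkts; i++)
    {
      vlib_buffer_t *b = vlib_get_buffer (vm, buffers[i]);
      struct rte_mbuf *mb = rte_mbuf_from_vlib_buffer (b);

      /* sync vlib_buffer fields back into the mbuf before handing it to DPDK */
      mb->data_off = b->current_data + RTE_PKTMBUF_HEADROOM;
      mb->data_len = b->current_length;
      mb->pkt_len = b->current_length;
      mbufs[i] = mb;
    }

  /* worker N owns TX queue N, so no lock is needed around the burst call */
  u16 n_sent = rte_eth_tx_burst (xd->port_id, queue_id, mbufs, n_pkts);
  if (n_sent < n_pkts)
    {
      /* ring full: the real code retries, then frees/drops what never fit */
    }
}
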
⚙️ DPDK KNOWLEDGE APPLIED
  • You know rte_mempool with custom private size - VPP uses exactly this to embed vlib_buffer_t in each mbuf's private data region
  • You know rte_mbuf.data_off is the offset from the mbuf start to packet data - VPP's current_data is the equivalent from the vlib_buffer start
  • RSS hash in mb->hash.rss is copied to b->flow_id - used for per-flow worker assignment in some configurations
  • DPDK scatter-gather (multi-segment mbufs) maps to VPP chained buffers via b->next_buffer - the DPDK plugin chains them during RX conversion, as sketched below
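
A hedged sketch of that multi-segment chaining during RX conversion (simplified; in the real plugin this happens inside the RX burst processing):

/* Illustrative: chain the extra DPDK segments onto the first vlib buffer */
static_always_inline void
dpdk_chain_segments_sketch (vlib_main_t *vm, struct rte_mbuf *mb,
                            vlib_buffer_t *first)
{
  vlib_buffer_t *prev = first;
  first->total_length_not_including_first_buffer = 0;

  for (struct rte_mbuf *seg = mb->next; seg != NULL; seg = seg->next)
    {
      /* each DPDK segment already carries its own embedded vlib_buffer_t */
      vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (seg);
      b->current_data = seg->data_off - RTE_PKTMBUF_HEADROOM;
      b->current_length = seg->data_len;

      prev->next_buffer = vlib_get_buffer_index (vm, b);
      prev->flags |= VLIB_BUFFER_NEXT_PRESENT;
      first->total_length_not_including_first_buffer += seg->data_len;
      prev = b;
    }

  if (mb->next)
    first->flags |= VLIB_BUFFER_TOTAL_LENGTH_VALID;
}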

MELLANOX mlx5 - YOUR NIC

🔬

mlx5 PMD Specifics for VPP

MELLANOX

Mellanox ConnectX-4/5/6 (mlx5 PMD) in VPP behaves differently from Intel NICs. Understanding the mlx5-specific behaviour prevents the most common VPP + Mellanox configuration issues.

Topic / mlx5 behaviour / action required:

  • Driver binding: mlx5 does NOT use vfio-pci as its primary driver; the PMD runs alongside the kernel mlx5_core + mlx5_ib drivers. Action: do NOT unbind from mlx5_core - the DPDK mlx5 PMD works on top of it via RDMA.
  • IOVA mode: requires virtual-address (VA) IOVA mode. Action: set iova-mode va in the startup.conf dpdk stanza.
  • Hugepages: mlx5 uses its own DMA mapping and works with both 2MB and 1GB pages. Action: either works; 1GB pages give fewer TLB misses at high load.
  • Multi-queue RSS: full RSS support - Toeplitz hash on IPv4/IPv6/TCP/UDP. Action: set num-rx-queues equal to the number of worker threads for full parallelism.
  • Checksum offload: full IPv4/TCP/UDP TX and RX checksum offload. Action: enable in the dpdk stanza with enable-tcp-udp-checksum.
  • TSO (TCP segmentation offload): supported on ConnectX-5 and later. Action: enable per-port in startup.conf if using the TCP session layer.
  • Multi-seg mbufs: mlx5 handles scatter-gather natively. Action: enable multi-seg in the dpdk stanza for jumbo frames.
  • VF / SR-IOV: create VFs on the PF; each VF gets its own PMD instance. Action: one VF per container - the standard SR-IOV workflow you know from DPDK.

# Example startup.conf for Mellanox ConnectX-5 with VPP
dpdk {
  dev 0000:03:00.0 {
    name eth0                       # human-readable name in VPP
    num-rx-queues 4                 # = number of worker threads
    num-tx-queues 4
    num-rx-desc 2048
    num-tx-desc 2048
    rss-fn 0x3c8                    # RSS on IPv4+IPv6+TCP+UDP
  }
  enable-tcp-udp-checksum           # TCP/UDP checksum offload (dpdk-stanza level option)
  uio-driver none                   # mlx5: no vfio-pci binding needed
  iova-mode va                      # REQUIRED for mlx5
  socket-mem 2048,0                 # 2GB on NUMA 0, 0 on NUMA 1
  log-level notice
}

# Verify mlx5 detection
# vppctl: show dpdk interface
# Should show: driver mlx5_pmd, link state up

DPDK STANZA REFERENCE

⚙️

Complete startup.conf DPDK Options

CONFIGURATION
Option / scope / description / recommendation for AMD+mlx5:

  • dev <PCI> { ... } (per-port): configure a specific DPDK device by PCI address. Recommended: required for each Mellanox port.
  • num-rx-queues N (per-port): number of RX queues; must be ≤ the number of worker threads. Recommended: set equal to the worker count.
  • num-tx-queues N (per-port): number of TX queues, one per worker. Recommended: set equal to the worker count.
  • num-rx-desc N (per-port): RX ring size, a power of 2, typically 1024-4096. Recommended: 2048 for high throughput.
  • num-tx-desc N (per-port): TX ring size, a power of 2. Recommended: 2048.
  • uio-driver <driver> (global): use vfio-pci for Intel/virtio NICs; for mlx5 use none. Recommended: uio-driver none.
  • iova-mode va (global): virtual-address IOVA mode, required for mlx5. Recommended: always set for mlx5.
  • socket-mem N,N (global): hugepage memory per NUMA socket in MB. Recommended: match your NUMA topology.
  • no-multi-seg (global): disable multi-segment mbufs (faster for small packets). Recommended: set unless using jumbo frames.
  • enable-tcp-udp-checksum (global, dpdk-stanza level): enable hardware TCP/UDP checksum offload. Recommended: enable on mlx5 ConnectX-5 and later.
  • log-level <level> (global): DPDK log verbosity - debug/info/notice/warning/error. Recommended: notice in production.
  • dev default { ... } (global): default settings applied to all DPDK devices. Recommended: use to avoid repeating per-port config (example below).
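
A short illustrative dev default snippet - PCI addresses and values here are placeholders:

# dev default applies to every DPDK port; per-dev blocks override it
dpdk {
  dev default {
    num-rx-queues 4
    num-tx-queues 4
    num-rx-desc 2048
    num-tx-desc 2048
  }
  dev 0000:03:00.0 { name eth0 }
  dev 0000:03:00.1 { name eth1 num-rx-queues 2 }   # per-port override
}
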
PROJECT 4

Interface Technology Comparison Lab

Objective: Quantitatively compare DPDK, memif, and TAP throughput using identical test traffic. Understand the performance cost of each interface type.

1
Set up three containers: Container A (testpmd sending 64B frames), Container B (VPP with all three interface types), Container C (testpmd receiving). Create: one DPDK-to-DPDK path, one memif path, one TAP path between the same endpoints.
2
Use testpmd's txonly mode to send at line rate (10 Gbps) on each path (see the example invocation after this list). Record: throughput (Mpps), latency (p50/p99 from dpdk-testpmd rxonly with timestamps), and CPU usage per worker thread.
3
Examine show run in the VPP container. Compare vectors/call and clocks/vector for dpdk-input vs memif-input vs af-packet-input. Build a table of results.
4
Check show dpdk interface xstats GigabitEthernet0/8/0 (or show hardware-interfaces <name> detail, which also prints the extended stats) for hardware-level counters: rx_missed_errors, rx_no_mbuf_errors, tx_errors. Non-zero values point to buffer-pool exhaustion or the RX descriptor ring filling faster than the worker drains it.
5
Identify the bottleneck in each path using the data collected. Write a 1-page analysis: when would you choose each interface type in a production deployment?
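
A hedged example of the testpmd invocations for step 2 - core lists and PCI addresses are placeholders, and exact option spelling can differ slightly between DPDK releases:

# Container A: txonly sender on the DPDK path
dpdk-testpmd -l 0-1 -n 4 -a 0000:03:00.0 -- \
  --forward-mode=txonly --txpkts=64 --txq=4 --rxq=4 --stats-period=1

# Container C: rxonly receiver on the far end
dpdk-testpmd -l 2-3 -n 4 -a 0000:03:00.1 -- \
  --forward-mode=rxonly --rxq=4 --stats-period=1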

P3A COMPLETION CHECKLIST

✅ Next: P3B - memif and shared-memory interfaces. This is where VPP shines for container-to-container connectivity.
