DPDK PLUGIN ARCHITECTURE
How the DPDK Plugin Integrates
The DPDK plugin (dpdk_plugin.so) bridges DPDK's poll-mode driver (PMD) model and VPP's graph-node model. It is responsible for initialising the DPDK EAL, binding physical ports, polling RX queues, converting mbufs to vlib buffers, and transmitting vlib buffers back out through DPDK's TX burst API.
```
/* Plugin source layout: src/plugins/dpdk/ */
dpdk/
├── device/
│   ├── node.c        # dpdk-input node function - the RX hot path
│   ├── tx_func.c     # dpdk-output / dpdk-tx - the TX hot path
│   ├── init.c        # EAL init, port setup, queue allocation
│   └── format.c      # CLI formatting: show dpdk interface
├── dpdk.h            # dpdk_main_t, dpdk_device_t - master structs
└── api/
    └── dpdk.api      # Binary API: set DPDK interface config etc.

/* Key structs */
dpdk_main_t   - singleton: EAL args, device pool, per-worker tx queues
dpdk_device_t - per-port: port_id, n_rx_queues, rx/tx descriptors, stats
```
- The DPDK plugin calls `rte_eal_init()` during VPP startup, before any graph nodes run
- One `dpdk_device_t` exists per physical port; stored in a vec indexed by `xd_index`
- Each RX queue is polled by exactly one worker thread; the assignment is in `dpdk_device_t.rx_queues[q].thread_index`
- TX uses per-worker tx queue buffers to avoid locking: worker N uses tx queue N exclusively (see the sketch below)
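Because the queue id is derived from the worker's thread index, the TX path needs no lock: each queue has exactly one writer. A minimal sketch of the idea, assuming one TX queue was configured per worker; `worker_tx_burst` is a hypothetical helper, not a VPP function:

```c
#include <rte_ethdev.h>

/* Hypothetical helper: worker N transmits only on TX queue N, so
 * rte_eth_tx_burst() can be called without any locking. */
static inline uint16_t
worker_tx_burst (uint16_t port_id, uint16_t thread_index,
                 struct rte_mbuf **mbufs, uint16_t n)
{
  /* queue id == thread index: no other worker ever touches this queue */
  return rte_eth_tx_burst (port_id, thread_index, mbufs, n);
}
```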
dpdk-input - THE RX HOT PATH
dpdk-input Node Internals
dpdk-input is a VLIB_NODE_TYPE_INPUT node that polls DPDK RX queues. It is the entry point for all physical network traffic in VPP. The key performance insight is that it receives bursts of up to DPDK_RX_BURST_SZ mbufs per call and converts them all to vlib buffer indices before dispatching to the next graph node.
```c
/* Simplified dpdk-input hot path (src/plugins/dpdk/device/node.c) */
VLIB_NODE_FN (dpdk_input_node) (vlib_main_t *vm, vlib_node_runtime_t *node,
                                vlib_frame_t *frame)
{
  dpdk_main_t *dm = &dpdk_main;
  dpdk_per_thread_data_t *ptd =
    vec_elt_at_index (dm->per_thread_data, vm->thread_index);
  u32 n_rx_packets = 0;

  /* Poll each queue assigned to this worker */
  dpdk_device_and_queue_t *dq;
  vec_foreach (dq, dm->devices_by_cpu[vm->thread_index])
    {
      dpdk_device_t *xd = vec_elt_at_index (dm->devices, dq->device_index);

      /* DPDK burst receive - fills ptd->mbufs[] */
      u32 n_rx = rte_eth_rx_burst (xd->port_id, dq->queue_id,
                                   ptd->mbufs, DPDK_RX_BURST_SZ);
      if (n_rx == 0)
        continue;

      /* Convert mbufs to vlib buffer indices + dispatch to ethernet-input */
      n_rx_packets += dpdk_process_rx_burst (vm, node, xd, dq->queue_id,
                                             ptd, n_rx);
    }
  return n_rx_packets;
}

/* What dpdk_process_rx_burst does:
 * 1. For each mbuf: derive the vlib_buffer_t pointer (they share memory)
 * 2. Set vlib_buffer fields: current_data, current_length, sw_if_index
 * 3. Copy DPDK offload flags to vlib_buffer flags (RSS hash, checksum)
 * 4. Enqueue u32 buffer indices to the ethernet-input frame */
```
💡 Key performance detail: dpdk-input does NOT call vlib_buffer_alloc(). Instead, vlib buffers and DPDK mbufs share the same memory pool - the vlib buffer header IS the mbuf's private data area. This zero-copy design means RX never allocates memory; the conversion from mbuf to vlib_buffer is a pointer offset calculation.
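The four numbered steps above can be made concrete with a short sketch. This is not the real dpdk_process_rx_burst: `rx_burst_enqueue` is a hypothetical helper and the field sync is abbreviated, but `vlib_get_next_frame` / `vlib_put_next_frame` are the standard VPP primitives for handing a vector of buffer indices to the next node:

```c
#include <vlib/vlib.h>
#include <rte_mbuf.h>
/* vlib_buffer_from_rte_mbuf() comes from the dpdk plugin headers */

static u32
rx_burst_enqueue (vlib_main_t *vm, vlib_node_runtime_t *node,
                  struct rte_mbuf **mbufs, u32 n_rx, u32 next_index)
{
  u32 *to_next, n_left_to_next, n_enq = 0;

  vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next);
  for (u32 i = 0; i < n_rx && n_left_to_next > 0; i++)
    {
      /* step 1: the vlib buffer lives at a fixed offset from the mbuf */
      vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (mbufs[i]);

      /* step 2: sync the fields DPDK filled in */
      b->current_data = mbufs[i]->data_off - RTE_PKTMBUF_HEADROOM;
      b->current_length = mbufs[i]->data_len;

      /* step 4: enqueue the buffer index (step 3, offload flags, omitted) */
      to_next[0] = vlib_get_buffer_index (vm, b);
      to_next++;
      n_left_to_next--;
      n_enq++;
    }
  vlib_put_next_frame (vm, node, next_index, n_left_to_next);
  return n_enq;
}
```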
MBUF ↔ VLIB_BUFFER MEMORY BRIDGE
Shared Memory Layout
The DPDK plugin pre-allocates a single rte_mempool with a private data size large enough to hold a vlib_buffer_t. In this pool, each rte_mbuf's private data area is occupied by the vlib_buffer_t header, so the two headers sit back-to-back in the same allocation.
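A sketch of how such a pool could be created with the standard DPDK helper; the pool name, counts, and cache size here are illustrative, and VPP's actual pool setup differs in detail:

```c
#include <vlib/vlib.h>      /* vlib_buffer_t */
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Illustrative only: a pktmbuf pool whose per-mbuf private area is
 * sized to hold the VPP buffer header. */
struct rte_mempool *
create_vpp_style_pool (void)
{
  return rte_pktmbuf_pool_create ("vpp-pool-0",
                                  16384,  /* number of mbufs (illustrative) */
                                  256,    /* per-core cache */
                                  sizeof (vlib_buffer_t), /* priv = VPP header */
                                  RTE_MBUF_DEFAULT_BUF_SIZE,
                                  0 /* NUMA socket */);
}
```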
```c
/* Memory layout of a DPDK+VPP buffer */

  +──────────────────────+───────────────────────────+────────────────+
  | rte_mbuf (128 bytes) | vlib_buffer_t (128 bytes) |     data[]     |
  | (DPDK header)        | (VPP header = mbuf priv)  |                |
  +──────────────────────+───────────────────────────+────────────────+
                         ↑                           ↑
                         vlib_buffer ptr             packet data

/* Converting between the two */

/* mbuf → vlib_buffer */
vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (mb);
/* equivalent to: (vlib_buffer_t *) RTE_PTR_ADD (mb, sizeof (struct rte_mbuf)) */

/* vlib_buffer → mbuf */
struct rte_mbuf *mb = rte_mbuf_from_vlib_buffer (b);
/* equivalent to: (struct rte_mbuf *) RTE_PTR_SUB (b, sizeof (struct rte_mbuf)) */

/* Fields are synced at RX entry and TX exit */

/* RX: DPDK fills the mbuf, the plugin copies to vlib_buffer fields */
b->current_data = mb->data_off - RTE_PKTMBUF_HEADROOM;
b->current_length = mb->data_len;
if (mb->ol_flags & PKT_RX_RSS_HASH)
  b->flow_id = mb->hash.rss;        /* RSS hash is carried as the flow id */
vnet_buffer (b)->sw_if_index[VLIB_RX] = xd->sw_if_index;
vnet_buffer (b)->sw_if_index[VLIB_TX] = ~0; /* unknown at RX */

/* TX: vlib_buffer → mbuf */
mb->data_off = b->current_data + RTE_PKTMBUF_HEADROOM;
mb->data_len = b->current_length;
mb->pkt_len = b->current_length;
```
- You know `rte_mempool` with a custom private size: VPP uses exactly this to embed vlib_buffer_t in each mbuf's private data region
- You know `rte_mbuf.data_off` is the offset from the start of the mbuf's data room to the packet data; VPP's `current_data` is the equivalent offset from the vlib_buffer start
- The RSS hash in `mb->hash.rss` is copied to `b->flow_id`, used for per-flow worker assignment in some configurations
- DPDK scatter-gather (multi-segment mbufs) maps to VPP chained buffers via `b->next_buffer`; the DPDK plugin chains them during RX conversion (see the sketch after this list)
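A hedged sketch of that chaining step, using the conversion helpers shown above; `chain_segments` itself is hypothetical, and real code would also fix up total-length accounting and flags on the head buffer:

```c
#include <vlib/vlib.h>
#include <rte_mbuf.h>
/* vlib_buffer_from_rte_mbuf() comes from the dpdk plugin headers */

/* Hypothetical sketch: walk a multi-segment mbuf chain and link the
 * co-located vlib buffers through next_buffer. */
static void
chain_segments (vlib_main_t *vm, struct rte_mbuf *first)
{
  for (struct rte_mbuf *seg = first; seg->next != NULL; seg = seg->next)
    {
      vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (seg);
      vlib_buffer_t *nb = vlib_buffer_from_rte_mbuf (seg->next);

      b->current_length = seg->data_len;      /* this segment only */
      b->next_buffer = vlib_get_buffer_index (vm, nb);
      b->flags |= VLIB_BUFFER_NEXT_PRESENT;   /* mark the chain */
    }
}
```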
MELLANOX mlx5 - YOUR NIC
mlx5 PMD Specifics for VPP
Mellanox ConnectX-4/5/6 (mlx5 PMD) behaves differently under VPP from Intel NICs. Understanding the mlx5-specific behaviour prevents the most common VPP + Mellanox configuration issues.
| Topic | mlx5 Behaviour | Action Required |
|---|---|---|
| Driver binding | mlx5 does NOT use vfio-pci as primary. Uses kernel mlx5_core + mlx5_ib alongside DPDK | Do NOT unbind from mlx5_core; the DPDK mlx5 PMD works on top of it via rdma-core (libibverbs) |
| IOVA mode | Requires Virtual Address (VA) IOVA mode | Set iova-mode va in startup.conf dpdk stanza |
| Hugepages | mlx5 uses DMA mapping - works with 2MB and 1GB pages | Both work; 1GB pages give fewer TLB misses at high load |
| Multi-queue RSS | Full RSS support: Toeplitz hash on IPv4/IPv6/TCP/UDP | Set num-rx-queues = num worker threads for full parallelism |
| Checksum offload | Full IPv4/TCP/UDP TX and RX checksum offload | Enable in dpdk stanza: enable-tcp-udp-checksum |
| TSO (TCP Segmentation) | Supported on ConnectX-5 and later | Enable per-port in startup.conf if using TCP session layer |
| Multi-seg mbufs | mlx5 handles scatter-gather natively | Enable multi-seg in dpdk stanza for jumbo frames |
| VF / SR-IOV | Create VFs on the PF, each VF gets its own PMD instance | One VF per container - standard SR-IOV workflow you know from DPDK |
```
# Correct startup.conf for Mellanox ConnectX-5 with VPP
dpdk {
  dev 0000:03:00.0 {
    name eth0                 # human-readable name in VPP
    num-rx-queues 4           # = number of worker threads
    num-tx-queues 4
    num-rx-desc 2048
    num-tx-desc 2048
    rss-fn 0x3c8              # RSS on IPv4+IPv6+TCP+UDP
    enable-tcp-udp-checksum   # TX checksum offload
  }
  uio-driver none             # mlx5: no vfio-pci binding needed
  iova-mode va                # REQUIRED for mlx5
  socket-mem 2048,0           # 2GB on NUMA 0, 0 on NUMA 1
  log-level notice
}

# Verify mlx5 detection
# vppctl: show dpdk interface
# Should show: driver mlx5_pmd, link state up
```
DPDK STANZA REFERENCE
Complete startup.conf DPDK Options
| Option | Scope | Description | Recommended for AMD+mlx5 |
|---|---|---|---|
| `dev <PCI> { ... }` | Per-port | Configure a specific DPDK device by PCI address | Required for each Mellanox port |
| `num-rx-queues N` | Per-port | Number of RX queues; must be ≤ the number of worker threads | Set equal to workers |
| `num-tx-queues N` | Per-port | Number of TX queues; one per worker | Set equal to workers |
| `num-rx-desc N` | Per-port | RX ring size; power of 2, typically 1024–4096 | 2048 for high throughput |
| `num-tx-desc N` | Per-port | TX ring size; power of 2 | 2048 |
| `uio-driver vfio-pci` | Global | Use vfio-pci for Intel/virtio; for mlx5 use none | `uio-driver none` |
| `iova-mode va` | Global | Virtual-address IOVA mode; required for mlx5 | Always set for mlx5 |
| `socket-mem N,N` | Global | Hugepage memory per NUMA socket in MB | Match your NUMA topology |
| `no-multi-seg` | Global | Disable multi-segment mbufs (faster for small packets) | Set unless using jumbo frames |
| `enable-tcp-udp-checksum` | Per-port | Enable HW TX checksum offload for TCP/UDP | Enable on mlx5 ConnectX-5+ |
| `log-level <level>` | Global | DPDK log verbosity: debug/info/notice/warning/error | `notice` in production |
| `dev default { ... }` | Global | Default settings applied to all DPDK devices | Use to avoid repeating per-port config (see example below) |
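A sketch of `dev default` in use; the PCI addresses and values are illustrative:

```
dpdk {
  dev default {
    num-rx-queues 4       # applied to every port below...
    num-tx-queues 4
    num-rx-desc 2048
  }
  dev 0000:03:00.0 { name eth0 }   # ...unless a port overrides it
  dev 0000:03:00.1 { name eth1 }
}
```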
Interface Technology Comparison Lab
Objective: Quantitatively compare DPDK, memif, and TAP throughput using identical test traffic. Understand the performance cost of each interface type.
- Drive each path with dpdk-testpmd in `txonly` mode to send at line rate (10 Gbps). Record: throughput (Mpps), latency (p50/p99 from dpdk-testpmd `rxonly` with timestamps), and CPU usage per worker thread.
- Run `show run` on each VPP instance. Compare vectors/call and clocks/vector for dpdk-input vs memif-input vs af-packet-input. Build a table of results.
- Check `show dpdk interface xstats GigabitEthernet0/8/0` for hardware-level counters: rx_missed_errors, rx_no_mbuf_errors, tx_errors. These indicate buffer exhaustion or descriptor ring underflow.

P3A COMPLETION CHECKLIST
- Know dpdk_plugin.so source layout: node.c (RX), tx_func.c (TX), init.c (setup), dpdk.h
- Understand dpdk-input's poll loop: rte_eth_rx_burst → convert → enqueue to ethernet-input
- Can explain the mbuf/vlib_buffer shared memory layout and the zero-copy design
- Know the offset conversion macros: vlib_buffer_from_rte_mbuf / rte_mbuf_from_vlib_buffer
- Know which mbuf fields are synced to vlib_buffer fields at RX (data_off, data_len, ol_flags)
- Understand mlx5 PMD specifics: no vfio-pci unbind, iova-mode va required, RSS configuration
- Can write a complete dpdk stanza for Mellanox ConnectX-5 with multi-queue, checksum offload
- Know the key dpdk stanza options and their effects (num-rx-queues, socket-mem, no-multi-seg)
- Can interpret show dpdk interface xstats: know what rx_missed_errors and rx_no_mbuf_errors mean
- Completed Project 4: interface technology comparison with quantitative results
✅ Next: P3B - memif and shared-memory interfaces. This is where VPP shines for container-to-container connectivity.