DPDK PLUGIN ARCHITECTURE
How the DPDK Plugin Integrates
The DPDK plugin (dpdk_plugin.so) bridges DPDK's poll-mode driver (PMD) model and VPP's graph-node model. It is responsible for initialising the DPDK EAL, binding physical ports, polling RX queues, converting mbufs to vlib buffers, and transmitting vlib buffers back out through DPDK's TX burst API.
```
/* Plugin source layout: src/plugins/dpdk/ */
dpdk/
├── device/
│   ├── node.c        # dpdk-input node function - the RX hot path
│   ├── tx_func.c     # dpdk-output / dpdk-tx - the TX hot path
│   ├── init.c        # EAL init, port setup, queue allocation
│   └── format.c      # CLI formatting: show dpdk interface
├── dpdk.h            # dpdk_main_t, dpdk_device_t - master structs
└── api/
    └── dpdk.api      # Binary API: set DPDK interface config etc.

/* Key structs */
dpdk_main_t   - singleton: EAL args, device pool, per-worker tx queues
dpdk_device_t - per-port: port_id, n_rx_queues, rx/tx descriptors, stats
```
- The DPDK plugin calls `rte_eal_init()` during VPP startup, before any graph nodes run
- One `dpdk_device_t` exists per physical port; stored in a vec indexed by `xd_index`
- Each RX queue is polled by exactly one worker thread; the assignment is in `dpdk_device_t.rx_queues[q].thread_index`
- TX uses per-worker tx queue buffers to avoid locking: worker N uses tx queue N exclusively (see the sketch below)
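Because the queue id is derived from the worker's thread index, the TX path needs no lock: each queue has exactly one writer. A minimal sketch of the idea, assuming one TX queue was configured per worker; `worker_tx_burst` is a hypothetical helper, not a VPP function:

```c
#include <rte_ethdev.h>

/* Hypothetical helper: worker N transmits only on TX queue N, so
 * rte_eth_tx_burst() can be called without any locking. */
static inline uint16_t
worker_tx_burst (uint16_t port_id, uint16_t thread_index,
                 struct rte_mbuf **mbufs, uint16_t n)
{
  /* queue id == thread index: no other worker ever touches this queue */
  return rte_eth_tx_burst (port_id, thread_index, mbufs, n);
}
```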
dpdk-input - THE RX HOT PATH
dpdk-input Node Internals
dpdk-input is a VLIB_NODE_TYPE_INPUT node that polls DPDK RX queues. It is the entry point for all physical network traffic in VPP. The key performance insight is that it receives bursts of up to DPDK_RX_BURST_SZ mbufs per call and converts them all to vlib buffer indices before dispatching to the next graph node.
```c
/* Simplified dpdk-input hot path (src/plugins/dpdk/device/node.c) */
VLIB_NODE_FN (dpdk_input_node) (vlib_main_t *vm, vlib_node_runtime_t *node,
                                vlib_frame_t *frame)
{
  dpdk_main_t *dm = &dpdk_main;
  dpdk_per_thread_data_t *ptd =
    vec_elt_at_index (dm->per_thread_data, vm->thread_index);
  u32 n_rx_packets = 0;

  /* Poll each queue assigned to this worker */
  dpdk_device_and_queue_t *dq;
  vec_foreach (dq, dm->devices_by_cpu[vm->thread_index])
    {
      dpdk_device_t *xd = vec_elt_at_index (dm->devices, dq->device_index);

      /* DPDK burst receive - fills ptd->mbufs[] */
      u32 n_rx = rte_eth_rx_burst (xd->port_id, dq->queue_id,
                                   ptd->mbufs, DPDK_RX_BURST_SZ);
      if (n_rx == 0)
        continue;

      /* Convert mbufs to vlib buffer indices + dispatch to ethernet-input */
      n_rx_packets += dpdk_process_rx_burst (vm, node, xd, dq->queue_id,
                                             ptd, n_rx);
    }
  return n_rx_packets;
}

/* What dpdk_process_rx_burst does:
 * 1. For each mbuf: derive the vlib_buffer_t pointer (they share memory)
 * 2. Set vlib_buffer fields: current_data, current_length, sw_if_index
 * 3. Copy DPDK offload flags to vlib_buffer flags (RSS hash, checksum)
 * 4. Enqueue u32 buffer indices to the ethernet-input frame */
```
💡 Key performance detail: dpdk-input does NOT call vlib_buffer_alloc(). Instead, vlib buffers and DPDK mbufs share the same memory pool - the vlib buffer header IS the mbuf's private data area. This zero-copy design means RX never allocates memory; the conversion from mbuf to vlib_buffer is a pointer offset calculation.
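The four numbered steps above can be made concrete with a short sketch. This is not the real dpdk_process_rx_burst: `rx_burst_enqueue` is a hypothetical helper and the field sync is abbreviated, but `vlib_get_next_frame` / `vlib_put_next_frame` are the standard VPP primitives for handing a vector of buffer indices to the next node:

```c
#include <vlib/vlib.h>
#include <rte_mbuf.h>
/* vlib_buffer_from_rte_mbuf() comes from the dpdk plugin headers */

static u32
rx_burst_enqueue (vlib_main_t *vm, vlib_node_runtime_t *node,
                  struct rte_mbuf **mbufs, u32 n_rx, u32 next_index)
{
  u32 *to_next, n_left_to_next, n_enq = 0;

  vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next);
  for (u32 i = 0; i < n_rx && n_left_to_next > 0; i++)
    {
      /* step 1: the vlib buffer lives at a fixed offset from the mbuf */
      vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (mbufs[i]);

      /* step 2: sync the fields DPDK filled in */
      b->current_data = mbufs[i]->data_off - RTE_PKTMBUF_HEADROOM;
      b->current_length = mbufs[i]->data_len;

      /* step 4: enqueue the buffer index (step 3, offload flags, omitted) */
      to_next[0] = vlib_get_buffer_index (vm, b);
      to_next++;
      n_left_to_next--;
      n_enq++;
    }
  vlib_put_next_frame (vm, node, next_index, n_left_to_next);
  return n_enq;
}
```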
MBUF ↔ VLIB_BUFFER MEMORY BRIDGE
Shared Memory Layout
The DPDK plugin pre-allocates a single rte_mempool with a private data size large enough to hold a vlib_buffer_t. In this pool, each rte_mbuf's private data area is occupied by the vlib_buffer_t header, so the two headers sit back-to-back in the same allocation.
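A sketch of how such a pool could be created with the standard DPDK helper; the pool name, counts, and cache size here are illustrative, and VPP's actual pool setup differs in detail:

```c
#include <vlib/vlib.h>      /* vlib_buffer_t */
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Illustrative only: a pktmbuf pool whose per-mbuf private area is
 * sized to hold the VPP buffer header. */
struct rte_mempool *
create_vpp_style_pool (void)
{
  return rte_pktmbuf_pool_create ("vpp-pool-0",
                                  16384,  /* number of mbufs (illustrative) */
                                  256,    /* per-core cache */
                                  sizeof (vlib_buffer_t), /* priv = VPP header */
                                  RTE_MBUF_DEFAULT_BUF_SIZE,
                                  0 /* NUMA socket */);
}
```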
```c
/* Memory layout of a DPDK+VPP buffer */

  +──────────────────────+───────────────────────────+────────────────+
  | rte_mbuf (128 bytes) | vlib_buffer_t (128 bytes) |     data[]     |
  | (DPDK header)        | (VPP header = mbuf priv)  |                |
  +──────────────────────+───────────────────────────+────────────────+
                         ↑                           ↑
                         vlib_buffer ptr             packet data

/* Converting between the two */

/* mbuf → vlib_buffer */
vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (mb);
/* equivalent to: (vlib_buffer_t *) RTE_PTR_ADD (mb, sizeof (struct rte_mbuf)) */

/* vlib_buffer → mbuf */
struct rte_mbuf *mb = rte_mbuf_from_vlib_buffer (b);
/* equivalent to: (struct rte_mbuf *) RTE_PTR_SUB (b, sizeof (struct rte_mbuf)) */

/* Fields are synced at RX entry and TX exit */

/* RX: DPDK fills the mbuf, the plugin copies to vlib_buffer fields */
b->current_data = mb->data_off - RTE_PKTMBUF_HEADROOM;
b->current_length = mb->data_len;
if (mb->ol_flags & PKT_RX_RSS_HASH)
  b->flow_id = mb->hash.rss;        /* RSS hash is carried as the flow id */
vnet_buffer (b)->sw_if_index[VLIB_RX] = xd->sw_if_index;
vnet_buffer (b)->sw_if_index[VLIB_TX] = ~0; /* unknown at RX */

/* TX: vlib_buffer → mbuf */
mb->data_off = b->current_data + RTE_PKTMBUF_HEADROOM;
mb->data_len = b->current_length;
mb->pkt_len = b->current_length;
```
- You know `rte_mempool` with a custom private size: VPP uses exactly this to embed vlib_buffer_t in each mbuf's private data region
- You know `rte_mbuf.data_off` is the offset from the start of the mbuf's data room to the packet data; VPP's `current_data` is the equivalent offset from the vlib_buffer start
- The RSS hash in `mb->hash.rss` is copied to `b->flow_id`, used for per-flow worker assignment in some configurations
- DPDK scatter-gather (multi-segment mbufs) maps to VPP chained buffers via `b->next_buffer`; the DPDK plugin chains them during RX conversion (see the sketch after this list)
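A hedged sketch of that chaining step, using the conversion helpers shown above; `chain_segments` itself is hypothetical, and real code would also fix up total-length accounting and flags on the head buffer:

```c
#include <vlib/vlib.h>
#include <rte_mbuf.h>
/* vlib_buffer_from_rte_mbuf() comes from the dpdk plugin headers */

/* Hypothetical sketch: walk a multi-segment mbuf chain and link the
 * co-located vlib buffers through next_buffer. */
static void
chain_segments (vlib_main_t *vm, struct rte_mbuf *first)
{
  for (struct rte_mbuf *seg = first; seg->next != NULL; seg = seg->next)
    {
      vlib_buffer_t *b = vlib_buffer_from_rte_mbuf (seg);
      vlib_buffer_t *nb = vlib_buffer_from_rte_mbuf (seg->next);

      b->current_length = seg->data_len;      /* this segment only */
      b->next_buffer = vlib_get_buffer_index (vm, nb);
      b->flags |= VLIB_BUFFER_NEXT_PRESENT;   /* mark the chain */
    }
}
```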
MELLANOX mlx5 - YOUR NIC
mlx5 PMD Specifics for VPP
Mellanox ConnectX-4/5/6 (mlx5 PMD) behaves differently under VPP from Intel NICs. Understanding the mlx5-specific behaviour prevents the most common VPP + Mellanox configuration issues.
| Topic | mlx5 Behaviour | Action Required |
|---|---|---|
| Driver binding | mlx5 does NOT use vfio-pci as primary. Uses kernel mlx5_core + mlx5_ib alongside DPDK | Do NOT unbind from mlx5_core; the DPDK mlx5 PMD works on top of it via rdma-core (libibverbs) |
| IOVA mode | Requires Virtual Address (VA) IOVA mode | Set iova-mode va in startup.conf dpdk stanza |
| Hugepages | mlx5 uses DMA mapping - works with 2MB and 1GB pages | Both work; 1GB pages give fewer TLB misses at high load |
| Multi-queue RSS | Full RSS support: Toeplitz hash on IPv4/IPv6/TCP/UDP | Set num-rx-queues = num worker threads for full parallelism |
| Checksum offload | Full IPv4/TCP/UDP TX and RX checksum offload | Enable in dpdk stanza: enable-tcp-udp-checksum |
| TSO (TCP Segmentation) | Supported on ConnectX-5 and later | Enable per-port in startup.conf if using TCP session layer |
| Multi-seg mbufs | mlx5 handles scatter-gather natively | Enable multi-seg in dpdk stanza for jumbo frames |
| VF / SR-IOV | Create VFs on the PF, each VF gets its own PMD instance | One VF per container - standard SR-IOV workflow you know from DPDK |
```
# Correct startup.conf for Mellanox ConnectX-5 with VPP
dpdk {
  dev 0000:03:00.0 {
    name eth0                 # human-readable name in VPP
    num-rx-queues 4           # = number of worker threads
    num-tx-queues 4
    num-rx-desc 2048
    num-tx-desc 2048
    rss-fn 0x3c8              # RSS on IPv4+IPv6+TCP+UDP
    enable-tcp-udp-checksum   # TX checksum offload
  }
  uio-driver none             # mlx5: no vfio-pci binding needed
  iova-mode va                # REQUIRED for mlx5
  socket-mem 2048,0           # 2GB on NUMA 0, 0 on NUMA 1
  log-level notice
}

# Verify mlx5 detection
# vppctl: show dpdk interface
# Should show: driver mlx5_pmd, link state up
```
DPDK STANZA REFERENCE
Complete startup.conf DPDK Options
| Option | Scope | Description | Recommended for AMD+mlx5 |
|---|---|---|---|
| `dev <PCI> { ... }` | Per-port | Configure a specific DPDK device by PCI address | Required for each Mellanox port |
| `num-rx-queues N` | Per-port | Number of RX queues; must be ≤ the number of worker threads | Set equal to workers |
| `num-tx-queues N` | Per-port | Number of TX queues; one per worker | Set equal to workers |
| `num-rx-desc N` | Per-port | RX ring size; power of 2, typically 1024–4096 | 2048 for high throughput |
| `num-tx-desc N` | Per-port | TX ring size; power of 2 | 2048 |
| `uio-driver vfio-pci` | Global | Use vfio-pci for Intel/virtio; for mlx5 use none | `uio-driver none` |
| `iova-mode va` | Global | Virtual-address IOVA mode; required for mlx5 | Always set for mlx5 |
| `socket-mem N,N` | Global | Hugepage memory per NUMA socket in MB | Match your NUMA topology |
| `no-multi-seg` | Global | Disable multi-segment mbufs (faster for small packets) | Set unless using jumbo frames |
| `enable-tcp-udp-checksum` | Per-port | Enable HW TX checksum offload for TCP/UDP | Enable on mlx5 ConnectX-5+ |
| `log-level <level>` | Global | DPDK log verbosity: debug/info/notice/warning/error | `notice` in production |
| `dev default { ... }` | Global | Default settings applied to all DPDK devices | Use to avoid repeating per-port config (see example below) |
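A sketch of `dev default` in use; the PCI addresses and values are illustrative:

```
dpdk {
  dev default {
    num-rx-queues 4       # applied to every port below...
    num-tx-queues 4
    num-rx-desc 2048
  }
  dev 0000:03:00.0 { name eth0 }   # ...unless a port overrides it
  dev 0000:03:00.1 { name eth1 }
}
```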
Interface Technology Comparison Lab
Objective: Quantitatively compare DPDK, memif, and TAP throughput using identical test traffic. Understand the performance cost of each interface type.
- Drive each path with dpdk-testpmd in `txonly` mode to send at line rate (10 Gbps). Record: throughput (Mpps), latency (p50/p99 from dpdk-testpmd `rxonly` with timestamps), and CPU usage per worker thread.
- Run `show run` on each VPP instance. Compare vectors/call and clocks/vector for dpdk-input vs memif-input vs af-packet-input. Build a table of results.
- Check `show dpdk interface xstats GigabitEthernet0/8/0` for hardware-level counters: rx_missed_errors, rx_no_mbuf_errors, tx_errors. These indicate buffer exhaustion or descriptor ring underflow.

P3A COMPLETION CHECKLIST
- Know dpdk_plugin.so source layout: node.c (RX), tx_func.c (TX), init.c (setup), dpdk.h
- Understand dpdk-input's poll loop: rte_eth_rx_burst → convert → enqueue to ethernet-input
- Can explain the mbuf/vlib_buffer shared memory layout and the zero-copy design
- Know the offset conversion macros: vlib_buffer_from_rte_mbuf / rte_mbuf_from_vlib_buffer
- Know which mbuf fields are synced to vlib_buffer fields at RX (data_off, data_len, ol_flags)
- Understand mlx5 PMD specifics: no vfio-pci unbind, iova-mode va required, RSS configuration
- Can write a complete dpdk stanza for Mellanox ConnectX-5 with multi-queue, checksum offload
- Know the key dpdk stanza options and their effects (num-rx-queues, socket-mem, no-multi-seg)
- Can interpret show dpdk interface xstats: know what rx_missed_errors and rx_no_mbuf_errors mean
- Completed Project 4: interface technology comparison with quantitative results
✅ Next: P3B - memif and shared-memory interfaces. This is where VPP shines for container-to-container connectivity.