THE FUNDAMENTAL SHIFT
Scalar vs Vector Packet Processing
Scalar (traditional stacks): One packet enters the stack, traverses all processing stages, exits. Then the next packet starts. Every packet re-warms the CPU instruction cache from scratch.
Vector (VPP's model): A batch of packets - the vector - enters a single graph node together. That node processes all N packets before any packet moves to the next node. The first packet in the batch warms the I-cache; every subsequent packet in the batch benefits at zero cost.
```
// Scalar processing - per-packet cache thrash
for each packet:
    ip4_lookup(pkt)        // I-cache warm
    ip4_rewrite(pkt)       // I-cache cold again
    ethernet_output(pkt)

// Vector processing - VPP's model
ip4_lookup(pkt[0..255])        // warm once, amortised over 256 pkts
ip4_rewrite(pkt[0..255])       // warm once, amortised over 256 pkts
ethernet_output(pkt[0..255])   // warm once, amortised over 256 pkts
```
This single architectural decision - processing a vector of packets per node invocation - gives VPP its performance edge. It enables prefetching, SIMD vectorisation, and well-trained branch prediction - optimisations that simply cannot happen one packet at a time.
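To see why, here is a minimal standalone C sketch (stage functions and names are invented for illustration, not VPP source) contrasting the two dispatch orders. The vector version also shows the next-packet prefetch that only batching makes possible:

```c
#include <stddef.h>

#define VEC_SZ 256

typedef struct { unsigned char data[64]; } pkt_t;

static void ip4_lookup (pkt_t *p)      { p->data[0] ^= 1; } /* stand-in work */
static void ip4_rewrite (pkt_t *p)     { p->data[1] ^= 1; }
static void ethernet_output (pkt_t *p) { p->data[2] ^= 1; }

/* Scalar: every packet runs all stages before the next packet starts,
 * so each stage's instructions are re-fetched per packet. */
static void scalar_run (pkt_t *pkts, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      ip4_lookup (&pkts[i]);
      ip4_rewrite (&pkts[i]);
      ethernet_output (&pkts[i]);
    }
}

/* Vector: one stage sweeps the whole batch. The first packet warms the
 * I-cache for all that follow, and prefetching packet i+1 while packet i
 * is processed hides the D-cache miss - impossible one packet at a time. */
static void vector_run (pkt_t *pkts, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      if (i + 1 < n)
        __builtin_prefetch (&pkts[i + 1]);
      ip4_lookup (&pkts[i]);
    }
  for (size_t i = 0; i < n; i++)
    ip4_rewrite (&pkts[i]);
  for (size_t i = 0; i < n; i++)
    ethernet_output (&pkts[i]);
}

int main (void)
{
  pkt_t batch[VEC_SZ] = { 0 };
  scalar_run (batch, VEC_SZ);
  vector_run (batch, VEC_SZ);
  return 0;
}
```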
- `rte_eth_rx_burst()` is VPP's equivalent of "get a vector of packets" - you already use burst RX for the same reason
- The PMD poll loop maps to VPP's input-node polling: both spin on hardware without interrupts
- The `rte_mbuf**` array from `rx_burst` ≈ VPP's `vlib_frame_t` of buffer indices - a batch of packet references processed together
- VPP generalises the single DPDK burst loop into a chain of N graph nodes, each processing the same batch (see the sketch after this list)
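For DPDK readers, the correspondence in code - a hedged sketch that assumes DPDK headers on the include path; the loop body is a placeholder:

```c
#include <stdint.h>
#include <rte_ethdev.h>  /* rte_eth_rx_burst(), struct rte_mbuf */

#define BURST 256

/* Classic DPDK app: one burst, then a per-packet loop over every stage. */
static void dpdk_style_poll (uint16_t port)
{
  struct rte_mbuf *pkts[BURST];
  uint16_t n = rte_eth_rx_burst (port, 0 /* queue */, pkts, BURST);

  for (uint16_t i = 0; i < n; i++)
    {
      /* lookup + rewrite + tx: all stages for pkts[i] before pkts[i+1].
       * VPP instead hands the whole n-packet frame to node 1, then the
       * same frame to node 2, and so on down the graph. */
    }
}
```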
The Packet Processing Graph - Core Mental Model
VPP's dataplane is a directed graph of processing nodes. Each node is a C function. Packets (as buffer indices) flow along graph edges. A single packet traversal from RX to TX typically looks like:
```
dpdk-input → ethernet-input → ip4-input → ip4-lookup (FIB lookup → next-hop)
  → ip4-rewrite (rewrite L2 header) → dpdk-output (TX to NIC)
```
The graph is not acyclic - a packet can re-visit ip4-lookup multiple times (e.g., MPLS label push/pop). For every packet, a node emits a next index that selects the outgoing edge.
- Nodes communicate via `vlib_frame_t`: arrays of u32 buffer indices, not pointers
- All nodes for a given phase run to completion before the next phase begins
- The graph dispatcher (`vlib_main_loop`) drives everything - you never write a main loop (a toy sketch of the dispatch mechanics follows this list)
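A toy model of those mechanics - all names invented for illustration (the real structures live in src/vlib/node.h): a node consumes a frame of u32 buffer indices and steers each packet onto an outgoing edge via a next index:

```c
#include <stdint.h>
#include <stdio.h>

enum { NEXT_REWRITE, NEXT_DROP, N_NEXT }; /* this node's outgoing edges */

/* A frame: a batch of buffer indices, as in vlib_frame_t. */
typedef struct { uint32_t buffers[4]; uint32_t n; } frame_t;

/* The node body: classify each packet and append its buffer index to the
 * frame of the chosen next node. */
static void ip4_lookup_node (frame_t *in, frame_t out[N_NEXT])
{
  for (uint32_t i = 0; i < in->n; i++)
    {
      /* hypothetical routing decision: odd buffer indices have no route */
      uint32_t next = (in->buffers[i] & 1) ? NEXT_DROP : NEXT_REWRITE;
      out[next].buffers[out[next].n++] = in->buffers[i];
    }
}

int main (void)
{
  frame_t in = { { 10, 11, 12, 13 }, 4 };
  frame_t out[N_NEXT] = { 0 };

  ip4_lookup_node (&in, out);
  printf ("to ip4-rewrite: %u pkts, to error-drop: %u pkts\n",
          (unsigned) out[NEXT_REWRITE].n, (unsigned) out[NEXT_DROP].n);
  return 0;
}
```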
💡 Key insight - why u32 indices, not pointers? A u32 is 4 bytes; a pointer is 8. A frame of 256 packet references is 1 KB with indices vs 2 KB with pointers, so the whole frame stays resident in far fewer cache lines. Buffer pool base address + index recovers the pointer at any time - the dereference costs one multiply and one add.
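The arithmetic in miniature - pool layout and stride are invented here; in VPP the real conversion is `vlib_get_buffer()` in src/vlib/buffer_funcs.h:

```c
#include <stdint.h>
#include <stdio.h>

#define BUF_STRIDE 2048                     /* hypothetical per-buffer spacing */

static unsigned char pool[16 * BUF_STRIDE]; /* stand-in for a buffer pool */

/* base + index * stride: the u32 index becomes a pointer with one
 * multiply (a shift, for power-of-two strides) and one add. */
static inline void *get_buffer (uint32_t bi)
{
  return pool + (size_t) bi * BUF_STRIDE;
}

int main (void)
{
  uint32_t bi = 3;
  printf ("buffer %u lives at %p\n", (unsigned) bi, get_buffer (bi));
  return 0;
}
```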
IMPLEMENTATION TAXONOMY
VPP is organised in five layers, top to bottom:

- **vpp** - Container application: the `vpp` binary itself. Ties all layers together, runs the main loop, loads plugins. Source: `src/vpp/`
- **plugins** - Shared libraries loaded at startup. DPDK, memif, NAT, ACL, GTP, QUIC - all of these are plugins, and your own features go here too. Source: `src/plugins/`
  Key plugins: `dpdk_plugin.so`, `memif_plugin.so`, `nat_plugin.so`, `acl_plugin.so`, `af_xdp_plugin.so`
- **vnet** - Networking layer: L2/L3/L4 graph nodes, interface abstraction (`sw_if_index`), FIB, ARP, neighbour tables, session layer. Source: `src/vnet/`
  Key subdirs: `src/vnet/ip/`, `src/vnet/ethernet/`, `src/vnet/fib/`, `src/vnet/devices/`
- **vlib** - Vector processing library: graph node scheduler, buffer management, cooperative threads (process nodes), CLI, packet tracing, counters. Source: `src/vlib/`
  Key files: `src/vlib/main.c` (dispatch loop), `src/vlib/node.h`, `src/vlib/buffer.h`
- **vppinfra** - Core library, VPP's "libc": memory allocators, vectors, pools, hash tables, ring buffers, format/unformat, timers. Everything else is built on top of this (see the sketch below). Source: `src/vppinfra/`
  Key files: `pool.h`, `vec.h`, `hash.h`, `bihash_8_8.h`, `clib.h`, `format.h`
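As a first taste of that bottom layer, a vppinfra vector in action - a sketch that compiles only inside a VPP build tree (vppinfra on the include path) and assumes the usual clib heap initialisation:

```c
#include <vppinfra/mem.h>
#include <vppinfra/vec.h>

int main (void)
{
  clib_mem_init (0, 64 << 20);  /* vppinfra allocators need a heap first */

  u32 *v = 0;          /* a vppinfra vector starts life as a null pointer */
  vec_add1 (v, 42);    /* append one element, reallocating as needed */
  vec_add1 (v, 43);
  ASSERT (vec_len (v) == 2);  /* the length lives in a header before v[0] */
  vec_free (v);
  return 0;
}
```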
Source Repository Layout
```
github.com/FDio/vpp
├── src/vppinfra/   # Core library: vec.h, pool.h, hash.h, bihash_*.h
├── src/vlib/       # Graph dispatcher: main.c, node.h, buffer.h, threads.c
├── src/vnet/       # Networking: ip/, ethernet/, fib/, devices/, feature/
├── src/plugins/    # Plugins: dpdk/, memif/, nat/, acl/, af_xdp/, linux-cp/
├── src/vpp/        # Container binary: app/vpe_cli.c
├── src/vpp-api/    # API bindings: python/vpp_papi/, .api.json files
├── src/svm/        # Shared virtual memory
├── src/examples/   # Sample plugin, handoff demo
└── test/           # Python test framework: test_*.py
```
When you explore a new VPP subsystem, start by reading the .h file - it contains the data structures and macro definitions. The .c file contains the implementations. API definitions live in .api files alongside each plugin.
BUILD FROM SOURCE
Building VPP
Always build from source for development. Binary packages hide important details. The VPP build system is CMake-based with a convenience Makefile wrapper.
```bash
# Clone the repo
git clone https://github.com/FDio/vpp.git && cd vpp

# Install build dependencies (Ubuntu 22.04)
make install-dep

# Debug build - has symbols, ASAN-compatible, slower
make build

# Release/optimised build - production performance
make build-release

# Run debug VPP interactively (reads /etc/vpp/startup.conf)
make run

# Run under GDB for debugging
make run-gdb

# Run full test suite
make test

# Run a specific test
make test TEST=test_nat
```
- Debug binary lives at: `build-root/install-vpp_debug-native/vpp/bin/vpp`
- Release binary: `build-root/install-vpp-native/vpp/bin/vpp`
- Plugins: compiled as `.so` files, loaded from the plugin directory at startup
DOCKER + AMD + MELLANOX SETUP
Container Setup for Mellanox Ports
Your environment: Docker containers on an AMD server with Mellanox Ethernet ports. VPP needs privileged access to hugepages, VFIO devices, and the PCI bus. The following setup gives VPP everything it needs.
```bash
# Step 1: Allocate hugepages on the host (2 MB pages)
echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
sudo mkdir -p /dev/hugepages
sudo mount -t hugetlbfs nodev /dev/hugepages

# Step 2: Bind ports to vfio-pci (use the PCI addresses from lspci).
# Note: mlx5 NICs can stay on mlx5_core - see the first bullet below.
sudo dpdk-devbind.py --status                     # find PCI addresses
sudo dpdk-devbind.py --bind vfio-pci 0000:03:00.0
sudo dpdk-devbind.py --bind vfio-pci 0000:03:00.1

# Step 3: Run the VPP container with all required resources
docker run --privileged --network host \
  -v /dev/hugepages:/dev/hugepages \
  -v /sys/bus/pci:/sys/bus/pci \
  -v /run/vpp:/run/vpp \
  -v /dev/vfio:/dev/vfio \
  -v /etc/vpp:/etc/vpp \
  -it ubuntu:22.04 /bin/bash
```
- mlx5 PMD: Mellanox ConnectX-4/5/6 use the mlx5 poll-mode driver, and VPP's DPDK plugin includes mlx5 support. No vfio-pci binding is needed for mlx5 - the PMD works through the kernel `mlx5_core` driver (DPDK's bifurcated model)
- IOVA mode: for Mellanox with DPDK, use VA mode - set `dpdk { iova-mode va }` in startup.conf
- SR-IOV VFs: for multi-container setups, create VFs on the PF and pass one VF per container - the same workflow as standard DPDK SR-IOV
- No KNI: VPP does not use DPDK KNI. Use a tap v2 interface or the linux-cp plugin for Linux kernel access
STARTUP CONFIGURATION
startup.conf - Every Stanza Explained
startup.conf is VPP's single configuration file, read at launch. It controls process behaviour, CPU pinning, DPDK ports, buffer pools, and plugin loading. Here is a production-annotated example for your environment:
```
unix {
  nodaemon                            # run in foreground (good for containers)
  log /var/log/vpp/vpp.log
  full-coredump                       # core dumps on crash
  cli-listen /run/vpp/cli.sock        # vppctl connects here
  startup-config /etc/vpp/setup.gate  # CLI commands run at startup
}

api-trace {
  on                                  # record API calls (for replay debugging)
}

cpu {
  main-core 0                         # pin main thread to core 0
  corelist-workers 2-5                # 4 workers on cores 2-5
  # corelist-workers 2,4,6,8          # non-contiguous cores also OK
}

dpdk {
  dev 0000:03:00.0 {                  # Mellanox port 0
    num-rx-queues 4                   # 1 queue per worker thread
    num-tx-queues 4
    num-rx-desc 1024
    num-tx-desc 1024
  }
  dev 0000:03:00.1 {                  # Mellanox port 1
    num-rx-queues 4
    num-tx-queues 4
  }
  uio-driver vfio-pci
  iova-mode va                        # required for Mellanox mlx5
  socket-mem 1024,1024                # 1 GB per NUMA socket
  no-multi-seg                        # disable jumbo unless needed
  log-level notice
}

buffers {
  buffers-per-numa 128000             # buffer pool size per NUMA node
  default-data-size 2048              # buffer data area in bytes
                                      # use 10240 for jumbo/MTU 9000
}

plugins {
  path /usr/lib/x86_64-linux-gnu/vpp_plugins
  plugin dpdk_plugin.so { enable }
  plugin memif_plugin.so { enable }
  # plugin some_plugin.so { disable }
}

statseg {
  size 128m                           # stats segment size
  per-node-counters on
}
```
Key rules:

- `corelist-workers` - size workers against RX queues: the total RX queue count across all interfaces should be a multiple of the worker count, so every worker polls the same number of queues (above: 2 ports × 4 queues = 8 queues over 4 workers, i.e. 2 queues per worker)
- `socket-mem` uses hugepages - they must be pre-allocated on the host before the container starts
- `buffers-per-numa` - if you see buffer allocation failures in the logs, increase this
- `startup-config` - put CLI commands here (set interface state, add routes) for auto-configuration at boot
ESSENTIAL CLI COMMANDS
vppctl - Your Primary Interface
vppctl connects to VPP's Unix socket (/run/vpp/cli.sock) and sends CLI commands. You can use it interactively or pipe commands:
```bash
vppctl                      # interactive shell
vppctl show version         # single command
echo "show run" | vppctl    # pipe
```
| Command | What It Shows / Does | Use When |
|---|---|---|
| `show version` | VPP version, build date, plugins loaded | First thing after starting VPP |
| `show plugins` | All loaded plugins with versions | Verify dpdk_plugin, memif_plugin loaded |
| `show interface` | All interfaces: state, RX/TX packet+byte counters, error counts | Check interface is up, count packets |
| `show run` | Per-node stats: calls, vectors processed, suspends, clocks/vector | Most important perf view - check vectors/call |
| `show buffers` | Buffer pool utilisation per NUMA node | Check for buffer starvation (free < 20%) |
| `show error` | Error counter table: which nodes are dropping and why | Debug drops - e.g. "ip4 source lookup miss" |
| `show ip fib` | FIB routing table: all prefixes and their DPO chains | Verify routes are programmed correctly |
| `show ip neighbors` | ARP/ND neighbour table | Check ARP resolution |
| `trace add dpdk-input 100` | Capture the next 100 packets entering from DPDK input | Start trace before sending test traffic |
| `show trace` | Full per-packet trace: every node the packet visited, with timestamps | After trace capture - shows complete packet path |
| `clear trace` | Clear the trace buffer | Before a new capture |
| `show interface rx-placement` | Which worker thread handles which interface RX queue | Verify NUMA-local queue assignments |
| `set interface rx-placement <if> queue 0 worker 0` | Assign an interface queue to a specific worker | Manual NUMA-aware pinning |
| `set interface state <if> up` | Bring an interface up | After creating an interface |
| `set interface ip address <if> 10.0.0.1/24` | Assign an IP address | Configure an L3 interface |
| `show dpdk interface` | DPDK-specific interface info: queues, link speed, driver | Verify mlx5 link is up at correct speed |
| `show dpdk interface xstats <if>` | Extended NIC statistics from the DPDK ethdev layer | Deep NIC-level counters |
| `show log` | VPP internal log messages | Troubleshoot startup and plugin errors |
| `event-logger on` | Enable the high-resolution event logger | Timing analysis - use with the g2 viewer |
💡 The most important command: show run - look at vectors/call for your input node. A value of 32–256 means VPP is batching well. A value of 1–4 means the system is lightly loaded or misconfigured. Clocks/vector is your per-packet CPU cost - lower is better.
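Worked example (assuming a 2 GHz worker core - a figure not from this guide): 100 clocks/vector ≈ 50 ns of CPU per packet, which caps that core at roughly 20 Mpps if this node were its only work. Watching clocks/vector fall is watching per-core headroom grow.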
VPP Container Lab - First Packet
Objective: Spin up a VPP instance inside Docker with Mellanox ports, configure two interfaces, send traffic, and fully trace the packet path through the graph.
1. Confirm with `show plugins` that `dpdk_plugin.so` is loaded.
2. Write a `startup.conf` with your Mellanox PCI addresses, 1 GB hugepages per socket, and 2 worker threads pinned to non-overlapping cores.
3. Check `show interface`. Both Mellanox ports should appear as GigabitEthernet or Ethernet devices. Bring them up: `set interface state <if> up`.
4. Add a route: `ip route add 192.168.2.0/24 via 192.168.1.2`.
5. Start a trace: `trace add dpdk-input 100`. Then send 10 ICMP pings to VPP's interface IP.
6. Inspect `show trace`. For each captured packet, identify every graph node it visited and the time spent (in clock ticks) at each node.
7. Check `show run`. Record: vectors/call for `dpdk-input`, clocks/vector for `ip4-lookup` and `ip4-rewrite`. This is your baseline performance fingerprint.
8. Increase the traffic rate and compare the `show run` output. Does throughput scale linearly?
9. Run `show error` and verify there are no unexpected drops. If there are, trace a dropped packet and identify the error node.

PHASE 1 COMPLETION CHECKLIST
- Can explain scalar vs vector processing and why vector processing improves I-cache utilisation
- Know the 5 VPP layers (VPPInfra, vlib, vnet, plugins, VPP binary) and what each is responsible for
- Can build VPP from source (`make build` and `make build-release`) and know where the binaries are
- Can run a VPP container on the AMD/Mellanox environment with correct hugepage and VFIO setup
- Can write a complete `startup.conf` from scratch with DPDK stanza, CPU pinning, and buffer sizing
- Know the difference between `main-core` and `corelist-workers` and how to size them for NIC queues
- Can use `vppctl` to bring up interfaces, assign IPs, add routes
- Can capture and interpret a packet trace - identify each graph node in the trace output
- Understand what `show run` shows: vectors/call, clocks/vector, and what good values look like
- Completed Mini-Project 1: first packet traced end-to-end through the VPP graph
✅ When complete: ready for Phase 2 - Core VPP Internals. Start with vppinfra - every data structure you'll use in plugins is defined there.