VPP — VECTOR PACKET PROCESSOR (FD.io)
What VPP Is and Why Your Team Uses It
VPP (Vector Packet Processor, the FD.io project originated by Cisco and hosted by the Linux Foundation) is a full-featured userspace network stack built on DPDK. Where DPDK is a toolkit for packet I/O, VPP is a complete forwarding engine with L2/L3/L4 processing, routing, NAT, ACL, GRE, VXLAN, MPLS, IPsec, and a plugin framework — forwarding tens of millions of packets per second per core, hundreds of millions aggregate across workers.
VPP is ideal for NGFW development because it provides the fast data plane and rich protocol support you'd otherwise spend years building from scratch, while leaving the door open for custom processing nodes via its plugin system.
| System | Mpps/core (64B) | Features available |
|---|---|---|
| Linux kernel | 1–3 | Everything, but slow |
| DPDK bare (basicfwd) | 30–80 | Only what you code |
| VPP (L3 forwarding) | 20–100 | Full routing, NAT, ACL, tunnels — built-in |
| VPP + ACL plugin | 15–60 | + stateful conntrack |
| VPP + IPsec | 5–20 | + encryption (DPDK crypto offload available) |
VPP Startup Configuration
```
# /etc/vpp/startup.conf — key sections
unix {
  nodaemon
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock          # vppctl connects here
}
dpdk {
  dev 0000:01:00.0 { name eth0 num-rx-queues 4 num-tx-queues 4 }
  dev 0000:01:00.1 { name eth1 num-rx-queues 4 num-tx-queues 4 }
  num-mbufs 131072
  socket-mem 2048,0                     # 2GB hugepages on NUMA 0
}
cpu {
  main-core 0                           # main thread (management)
  corelist-workers 2,3,4,5              # 4 worker threads
}
buffers {
  buffers-per-numa 131072
  default data-size 2048
}
```

```
# Start VPP
sudo systemctl start vpp
sudo vppctl show version
sudo vppctl show interface
```
VECTOR PROCESSING — VPP'S CORE INNOVATION
Why Processing Vectors Beats One-at-a-Time
VPP's central innovation is processing a batch (vector) of packets through each graph node at once, rather than processing each packet through all nodes in sequence. This exploits CPU microarchitecture in four ways:
I-Cache Efficiency
When the same code path executes for 32 packets in a row, the instruction cache stays warm throughout. With one-at-a-time processing, each packet traverses every node before the next packet is handled, so a node's instructions are evicted from the I-cache before they are needed again. In practice, VPP measures vector sizes of 16–64 packets as optimal.
Branch Predictor Accuracy
Processing 32 IPv4 packets in a row means the same branches (version==4, ihl==5, no options) execute with the same outcome repeatedly. The CPU branch predictor achieves near-100% accuracy across the vector.
Prefetch Pipelining
While processing packet N, you prefetch packet N+4. The 100ns DRAM latency is hidden behind actual computation. The canonical VPP 4x unrolled loop with prefetch is specifically designed to fill the memory latency gap.
SIMD Opportunity
Processing multiple identical structures (IP headers) in sequence creates opportunities for AVX2/AVX512 SIMD optimisation — operating on 4–8 headers simultaneously. The VPP checksum and hash inner loops exploit this.
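Purely as an illustration (this is not code from VPP itself), identical branch-free work applied to four headers in lockstep is the shape a compiler can auto-vectorise or that can be rewritten with AVX2 intrinsics; the function and variable names here are hypothetical:

```
#include <vnet/ip/ip4_packet.h>   /* ip4_header_t, ip4_address_t */

/* Illustrative only: compute a trivial flow hash for 4 IPv4 headers in
 * lockstep. Identical, branch-free arithmetic per packet is what creates
 * the SIMD opportunity described above. */
static inline void
flow_hash_x4 (const ip4_header_t *ip0, const ip4_header_t *ip1,
              const ip4_header_t *ip2, const ip4_header_t *ip3,
              u32 hash[4])
{
  hash[0] = ip0->src_address.as_u32 ^ ip0->dst_address.as_u32;
  hash[1] = ip1->src_address.as_u32 ^ ip1->dst_address.as_u32;
  hash[2] = ip2->src_address.as_u32 ^ ip2->dst_address.as_u32;
  hash[3] = ip3->src_address.as_u32 ^ ip3->dst_address.as_u32;
}
```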
```
/* Vector size measurement */
show run
# Thread 1 vpp_wk_0:
# Name          Calls   Vectors   Clocks   Vectors/Call
# dpdk-input      100      3200    8.7e3           32.0
# ip4-input       100      3200    1.9e3           32.0
# ip4-lookup      100      3200    2.8e3           32.0
# ip4-rewrite     100      3200    1.4e3           32.0
#
# Vectors/Call = average batch size (32 = optimal for most hardware)
# Clocks       = CPU cycles spent in this node per call;
#                divide by Vectors/Call for the per-packet cost
# ip4-lookup: 2800 clocks / 32 packets = 87.5 clocks/packet = ~30ns at 3GHz
```
GRAPH NODE FRAMEWORK — PACKET PIPELINE ARCHITECTURE
Nodes, Frames, and Packet Flow
```
/* VPP graph: directed acyclic graph of processing nodes */
/* Each edge carries a vlib_frame_t — an array of buffer indices */

Default IP4 forwarding path:
  dpdk-input → ethernet-input → ip4-input → ip4-lookup → ip4-rewrite → interface-output

With ACL and NAT inserted:
  dpdk-input → ethernet-input → ip4-input
    → [ip4-unicast feature arc]:
         acl-plugin-in-ip4-fa     (ingress ACL + conntrack)
         nat44-ed-in2out          (NAT inbound)
    → ip4-lookup → ip4-rewrite
    → [ip4-output feature arc]:
         nat44-ed-out2in-worker   (NAT outbound)
         acl-plugin-out-ip4-fa    (egress ACL)
    → interface-output

/* Node types */
VLIB_NODE_TYPE_INPUT:     Poll loop entry (dpdk-input, tap-inject)
VLIB_NODE_TYPE_INTERNAL:  Processing nodes (ip4-lookup, acl-plugin)
VLIB_NODE_TYPE_PRE_INPUT: Runs before INPUT (for scheduling)
VLIB_NODE_TYPE_PROCESS:   Background process threads

/* vlib_frame_t — the unit of work between nodes */
typedef struct {
  u16 n_vectors;       /* number of packets in this frame */
  u32 vector_offset;   /* offset to u32[] array of buffer indices */
} vlib_frame_t;

/* Get the array of buffer indices from a frame */
u32 *bufs = vlib_frame_vector_args(frame);
/* bufs[0..n_vectors-1] are indices into vlib_main.buffer_pool */

/* Get packet data from a buffer index */
vlib_buffer_t *b = vlib_get_buffer(vm, bufs[0]);
ip4_header_t *ip = vlib_buffer_get_current(b);
/* vlib_buffer_get_current(b) = b->data + b->current_data */

/* Key node commands */
show vlib graph              # all nodes and their next-node connections
show vlib graph ip4-input    # next nodes of ip4-input
show run                     # per-node performance (vectors, clocks)
show errors                  # error counters per node
```
WRITING A VPP PLUGIN — THE CANONICAL PATTERN
Minimal Plugin with the 4x Unroll Pattern
```
/* my_node.c — packet counter plugin with canonical 4x loop */
#include <vnet/vnet.h>
#include <vnet/plugin/plugin.h>
#include <vpp/app/version.h>

VLIB_PLUGIN_REGISTER() = {
  .version = VPP_BUILD_VER,
  .description = "Packet counter plugin",
};

typedef enum {
  MY_NEXT_IP4_LOOKUP,
  MY_NEXT_DROP,
  MY_N_NEXT,
} my_next_t;

typedef struct {
  u64 pkt_count[VLIB_MAX_CPUS];   /* per-thread, no locking */
} my_main_t;

my_main_t my_main;

VLIB_NODE_FN(my_counter_node)(vlib_main_t *vm, vlib_node_runtime_t *node,
                              vlib_frame_t *frame)
{
  u32 n_left = frame->n_vectors;
  u32 *from = vlib_frame_vector_args(frame);
  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;
  u64 pkts = 0;

  /* ── 4x unrolled loop with prefetch ─────────────────────────── */
  while (n_left >= 8)
    {
      /* Prefetch packet data 4 ahead */
      vlib_prefetch_buffer_with_index(vm, from[4], LOAD);
      vlib_prefetch_buffer_with_index(vm, from[5], LOAD);
      vlib_prefetch_buffer_with_index(vm, from[6], LOAD);
      vlib_prefetch_buffer_with_index(vm, from[7], LOAD);

      /* Get 4 buffers */
      vlib_buffer_t *b0 = vlib_get_buffer(vm, from[0]);
      vlib_buffer_t *b1 = vlib_get_buffer(vm, from[1]);
      vlib_buffer_t *b2 = vlib_get_buffer(vm, from[2]);
      vlib_buffer_t *b3 = vlib_get_buffer(vm, from[3]);
      (void)b0; (void)b1; (void)b2; (void)b3;

      next[0] = next[1] = next[2] = next[3] = MY_NEXT_IP4_LOOKUP;

      from += 4; next += 4; n_left -= 4; pkts += 4;
    }

  /* ── Scalar tail ────────────────────────────────────────────── */
  while (n_left > 0)
    {
      next[0] = MY_NEXT_IP4_LOOKUP;
      from++; next++; n_left--; pkts++;
    }

  my_main.pkt_count[vm->thread_index] += pkts;

  vlib_buffer_enqueue_to_next(vm, node, vlib_frame_vector_args(frame),
                              nexts, frame->n_vectors);
  return frame->n_vectors;
}

VLIB_REGISTER_NODE(my_counter_node) = {
  .name = "my-counter",
  .vector_size = sizeof(u32),
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_next_nodes = MY_N_NEXT,
  .next_nodes = {
    [MY_NEXT_IP4_LOOKUP] = "ip4-lookup",
    [MY_NEXT_DROP] = "error-drop",
  },
};

/* Insert into the ip4-unicast feature arc on an interface: */
/* vnet_feature_enable_disable("ip4-unicast", "my-counter",
                               sw_if_index, 1, 0, 0);          */
```

```
# CMakeLists.txt
# add_vpp_plugin(my_plugin SOURCES my_node.c API_FILES my_plugin.api)
# Plugins are auto-loaded from /usr/lib/vpp_plugins/ at VPP startup
```
💡 The 4x unroll + prefetch pattern is canonical VPP. Every performance-critical node in VPP core uses this exact structure. The prefetch distance of 4 is tuned for typical L1/L2 miss latency (~60–100ns) on Intel Xeon. Copy this template when writing your own DPI or NGFW nodes.
VPP FIB — THE MULTI-LAYER FORWARDING DATABASE
VPP FIB Architecture
```
/* VPP FIB is a three-layer structure */

Layer 1: IP4 FIB table (per VRF)
  Hash table → O(1) exact match for /32 host routes
  mtrie      → LPM for all other prefixes (16-8-8 multiway trie)

Layer 2: Load-Balance (LB) object
  Created when a prefix has multiple equal-cost next-hops (ECMP)
  Contains N hash buckets, each pointing to an adjacency
  Flow-hash over the 5-tuple selects the bucket (consistent per flow)

Layer 3: Adjacency
  Pre-built rewrite string: "dst_mac src_mac ethertype" (14 bytes)
  Stored as raw bytes — ip4-rewrite just memcpy's it directly into the packet
  Interface index for output

/* FIB inspection commands */
show ip fib                  # entire IPv4 FIB (can be huge)
show ip fib table 0          # VRF 0 (default)
show ip fib 10.0.0.0/8       # specific prefix details
show ip fib 8.8.8.8/32       # host route
show ip fib summary          # count of prefixes by length
show ip adjacency            # all adjacency objects
show ip adjacency 42         # specific adjacency: rewrite bytes, interface
show ip adjacency summary    # count by type (glean/rewrite/midchain)

/* Route management */
ip route add 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0
ip route del 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0

/* ECMP: add the same prefix twice = LB with 2 buckets */
ip route add 10.0.0.0/8 via 192.168.1.1 GigabitEthernet0/8/0
ip route add 10.0.0.0/8 via 192.168.1.2 GigabitEthernet0/8/1
show ip fib 10.0.0.0/8
# Displays: load-balance [index N] buckets 2
#   [0]: adj[via 192.168.1.1 GigE0/8/0]
#   [1]: adj[via 192.168.1.2 GigE0/8/1]

/* Null routes — blackhole */
ip route add 192.0.2.0/24 drop
ip route add 198.51.100.0/24 local    # deliver to local stack

/* Multiple VRFs (for tenant isolation in NGFW) */
ip table add 100
ip route add table 100 0.0.0.0/0 via 10.100.0.1 GigabitEthernet0/8/0
set interface ip table GigabitEthernet0/8/2 100   # assign interface to VRF 100
```
VAPI AND CLI — CONTROLLING VPP FROM CODE AND SCRIPTS
VPP Control Plane Interfaces
```
/* Three control interfaces */

1. vppctl CLI — interactive and scripted
vppctl show version
vppctl ip route add 0.0.0.0/0 via 10.0.0.1
echo "show ip fib summary" | vppctl
vppctl exec /etc/vpp/setup.vpp          # run a config file

2. Python API (vpp_papi) — programmatic automation
from vpp_papi import VPP

vpp = VPP(['/usr/share/vpp/api/vpe.api.json',
           '/usr/share/vpp/api/interface.api.json',
           '/usr/share/vpp/api/ip.api.json'])
vpp.connect('my-control-app')

# Show version
rv = vpp.api.show_version()
print(f"VPP version: {rv.version.decode()}")

# Add an IP route (default route via 10.0.0.1 out sw_if_index 1)
rv = vpp.api.ip_route_add_del(
    is_add=1,
    route={
        'prefix': {'address': {'af': 0, 'un': {'ip4': b'\x00\x00\x00\x00'}}, 'len': 0},
        'n_paths': 1,
        'paths': [{'nh': {'address': {'af': 0, 'un': {'ip4': b'\x0a\x00\x00\x01'}}},
                   'sw_if_index': 1, 'proto': 0}],
    })

# Create a loopback interface
rv = vpp.api.create_loopback()
sw_if_index = rv.sw_if_index

vpp.disconnect()

3. VAT2 — JSON-based API test tool
# vat2 show_version
# vat2 show_interface sw_if_index 0
```

```
/* Useful diagnostic commands */
show interface              # all interfaces, TX/RX stats
show hardware-interfaces    # NIC capabilities, link state
show run                    # node performance (vectors/call, clocks)
show run summary            # top CPU-consuming nodes
show errors                 # drop counters per node
show buffers                # mempool usage
show threads                # worker thread info and CPU pinning
show plugins                # loaded plugins
show log                    # VPP log buffer
```
VPP AS AN NGFW DATA PLANE
Building an NGFW Data Plane on VPP
VPP's feature arc system lets you insert custom processing nodes into the packet pipeline without modifying VPP core. The ip4-unicast arc is the primary insertion point for NGFW functions on inbound IPv4 traffic.
```
/* NGFW pipeline using VPP feature arcs */

ip4-input
   ↓
[ip4-unicast feature arc — ordered by feature weight]
   ├── acl-plugin-in-ip4-fa     (stateful conntrack + ACL rules)
   ├── nat44-ed-in2out          (DNAT / inbound NAT)
   ├── ipsec-input-ip4          (IPsec decrypt)
   └── YOUR-NGFW-DPI-NODE       (your custom DPI plugin)
   ↓
ip4-lookup → ip4-rewrite
   ↓
[ip4-output feature arc]
   ├── nat44-ed-out2in-worker   (SNAT / outbound NAT)
   └── acl-plugin-out-ip4-fa    (egress ACL)
   ↓
interface-output

/* Enable your plugin on an interface */
vnet_feature_enable_disable("ip4-unicast", "my-ngfw-dpi", sw_if_index, 1, 0, 0);

/* VPP ACL plugin — built-in stateful firewall */
# Create an ACL (permit HTTPS, permit HTTP, deny all)
acl_add_replace acl_index 0 r {is_permit 1 proto 6 dst_port 443 443 dst_ip 0.0.0.0/0},
                              {is_permit 1 proto 6 dst_port 80 80 dst_ip 0.0.0.0/0},
                              {is_permit 0}
# Apply to interface (input = filter traffic entering through that interface)
set acl-list interface GigabitEthernet0/8/0 input 0

/* VPP NAT44 — stateful NAT */
nat44 enable sessions 65536
set interface nat44 in GigabitEthernet0/8/0 out GigabitEthernet0/8/1
nat44 add interface address GigabitEthernet0/8/1

/* Connection tracking for a custom node */
/* Access conntrack state from within your node: */
clib_bihash_kv_16_8_t kv;
/* Key = 5-tuple; Value = session state struct */
if (!clib_bihash_search_16_8(&ngfw_main.session_table, &kv, &kv))
  {
    ngfw_session_t *s = (ngfw_session_t *)(uword)kv.value;
    /* session found — check state, increment counters */
  }
```
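For completeness, a sketch of the compile-time registration that makes a plugin node insertable on the ip4-unicast arc in the first place; the node name and the runs_before constraint below are illustrative assumptions, not requirements from this module:

```
#include <vnet/vnet.h>
#include <vnet/feature/feature.h>

/* Register a hypothetical "my-ngfw-dpi" node as a feature on the
 * ip4-unicast arc. runs_before/runs_after constrain where it lands
 * relative to other features on the same arc. */
VNET_FEATURE_INIT (my_ngfw_dpi_feature, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "my-ngfw-dpi",
  .runs_before = VNET_FEATURES ("ip4-lookup"),   /* assumed ordering */
};
```

At runtime, vnet_feature_enable_disable (shown above) then switches the registered feature on or off per interface.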
💡 VPP's clib_bihash is your primary data structure for session tables. It's a cache-friendly concurrent hash table (lock-free readers, per-bucket locking for writers) that VPP uses internally for neighbor/ARP tables, FIB host routes, and NAT/ACL session tracking. For an NGFW session table keyed on the 5-tuple, clib_bihash_16_8 (16-byte key = 5-tuple, 8-byte value = session index) achieves lookups on the order of 100ns even with millions of sessions — far cheaper than a kernel-side conntrack lookup.
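A minimal sketch of a clib_bihash_16_8-backed session table; the struct and function names (ngfw_main_t, ngfw_session_add, and so on) are hypothetical, and the bucket/heap sizing is only an example:

```
#include <vppinfra/bihash_16_8.h>   /* clib_bihash_16_8_t */
/* Note: one .c file in the plugin typically also includes
 * <vppinfra/bihash_template.c> to instantiate the functions. */

typedef struct {
  clib_bihash_16_8_t session_table;   /* 5-tuple → session index */
} ngfw_main_t;                        /* hypothetical plugin main struct */

static void
ngfw_session_table_init (ngfw_main_t *nm)
{
  /* 64K buckets, 64MB heap — sizing is workload-dependent */
  clib_bihash_init_16_8 (&nm->session_table, "ngfw sessions",
                         64 << 10, 64 << 20);
}

static void
ngfw_session_add (ngfw_main_t *nm, u64 key_lo, u64 key_hi, u64 session_index)
{
  clib_bihash_kv_16_8_t kv;
  kv.key[0] = key_lo;                 /* e.g. src/dst addresses packed */
  kv.key[1] = key_hi;                 /* e.g. ports + protocol packed */
  kv.value = session_index;
  clib_bihash_add_del_16_8 (&nm->session_table, &kv, 1 /* is_add */);
}

static int
ngfw_session_lookup (ngfw_main_t *nm, u64 key_lo, u64 key_hi, u64 *session_index)
{
  clib_bihash_kv_16_8_t kv, result;
  kv.key[0] = key_lo;
  kv.key[1] = key_hi;
  if (clib_bihash_search_16_8 (&nm->session_table, &kv, &result))
    return 0;                         /* not found */
  *session_index = result.value;
  return 1;
}
```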
VPP PERFORMANCE ANALYSIS TOOLS
Reading show run and Diagnosing Bottlenecks
```
/* show run output — interpreting the numbers */
vppctl show run
# Thread 1 vpp_wk_0 (lcore 2):
# Name          State    Calls   Vectors     Clocks   Vec/Call   Clk/Vec
# dpdk-input    active    1000     32000   8.70e+06       32.0       272
# ip4-input     active    1000     32000   1.92e+06       32.0        60
# ip4-lookup    active    1000     32000   2.84e+06       32.0        89
# ip4-rewrite   active    1000     32000   1.44e+06       32.0        45
# my-ngfw-dpi   active    1000     32000   9.60e+06       32.0       300

# Clk/Vec = CPU cycles per packet in this node (at 3GHz: 300 cycles = 100ns)
# Sum of all Clk/Vec = total cycles per packet through the pipeline
# my-ngfw-dpi is the bottleneck here (300 cycles vs 60-89 for built-ins)

/* Optimisation workflow */
1. Run: vppctl clear run; sleep 5; vppctl show run
2. Identify the highest Clk/Vec node (your bottleneck)
3. Check: are we prefetching? 4x unrolled? NUMA-local memory?
4. Profile: perf stat -e cycles,cache-misses -C 2 sleep 5
5. Check vector sizes: Vectors/Call < 8 = under-loaded (not batching enough)

/* show errors — drop counter diagnosis */
vppctl show errors
# ip4-input: ip4 src address is multicast      12
# ip4-input: ip4 spoofed local-address          5
# acl-plugin-in-ip4-fa: ACL deny packets     4821

/* Buffer pressure — detect mempool exhaustion */
vppctl show buffers
# If "allocated" approaches "total": mempool running low → increase num-mbufs

/* Per-interface counters */
vppctl show interface GigabitEthernet0/8/0
# RX packets/bytes, TX packets/bytes, drops, errors
vppctl clear interfaces     # reset counters

/* Packet capture in VPP (pcap dispatch trace) */
pcap dispatch trace on max 1000 file /tmp/vpp.pcap
# ... generate traffic ...
pcap dispatch trace off
# Open /tmp/vpp.pcap in Wireshark — shows the packet at each graph node!
```
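Using the example figures above, the per-core throughput ceiling follows directly from the summed per-packet cost: 272 + 60 + 89 + 45 + 300 ≈ 766 cycles per packet, so a 3 GHz core tops out at roughly 3.0e9 / 766 ≈ 3.9 Mpps. Cutting my-ngfw-dpi from 300 to 100 cycles would lift the ceiling to about 3.0e9 / 566 ≈ 5.3 Mpps — which is why the highest Clk/Vec node is always the first optimisation target.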
Lab 1: VPP from Zero to Forwarding Packets
Objective: Install VPP, configure interfaces and routing, verify packet forwarding, explore the FIB and graph.
1. Install: sudo apt install vpp vpp-plugin-core vpp-plugin-dpdk. Use tap interfaces for testing (no physical NIC required): create two tap interfaces in startup.conf using tuntap { dev tap0 }. Start VPP and verify: sudo vppctl show version.
2. Configure and route: vppctl set interface state tap0 up, vppctl set interface ip address tap0 10.1.0.1/24. Add a static route: vppctl ip route add 10.2.0.0/24 via 10.1.0.2 tap0. Inspect the FIB: vppctl show ip fib. Find the adjacency for your route: vppctl show ip adjacency.
3. Explore the graph: vppctl show vlib graph ip4-input — note the next nodes. Generate traffic (ping through the tap interface) and run vppctl show run. Identify which nodes execute and their Clk/Vec values. Calculate: at your measured Clk/Vec, what is the maximum Mpps per core?
4. Add a deny-all ACL: vppctl acl_add_replace acl_index 0 r {is_permit 0}, vppctl set acl-list interface tap0 input 0. Verify pings are dropped. Check: vppctl show errors — see the ACL deny counter increment.
Lab 2: Write a Custom VPP Counter Plugin
Objective: Write, build, and load a VPP plugin that counts packets per source IP using clib_bihash.
1. Install the dev package: sudo apt install vpp-dev. Create a plugin directory structure: my_plugin/CMakeLists.txt and my_plugin/my_node.c. Use the canonical plugin template from the plugin section above.
2. Add a per-source-IP counter using clib_bihash_8_8_t (key = src_ip as u64, value = pkt_count u64). In the processing loop, extract the source IP from the IP header, look up/insert in the hash table, and increment the count. Handle IPv4 only; pass all packets to ip4-lookup.
3. Add a CLI command: VLIB_CLI_COMMAND(show_top_sources_cmd, static) = { .path = "show ngfw top-sources", .function = show_top_sources_fn } — a sketch of the handler follows this list.
4. Build: mkdir build && cd build && cmake .. && make. Copy the .so to the VPP plugin directory. Restart VPP and verify the plugin loads: vppctl show plugins | grep my. Enable it on an interface, generate traffic, and run your CLI command.
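A minimal sketch of the CLI handler referenced in step 3, reusing the per-thread pkt_count array from the plugin template above; a full solution would instead walk your clib_bihash_8_8_t and print per-source-IP counts:

```
static clib_error_t *
show_top_sources_fn (vlib_main_t *vm, unformat_input_t *input,
                     vlib_cli_command_t *cmd)
{
  u64 total = 0;
  u32 i;

  /* Sum the per-thread counters (thread 0 = main, 1..N = workers) */
  for (i = 0; i <= vlib_num_workers (); i++)
    total += my_main.pkt_count[i];

  vlib_cli_output (vm, "total packets seen: %Lu", total);
  /* Lab exercise: iterate your clib_bihash_8_8_t here and output
     per-source-IP counts instead of a single total. */
  return 0;
}

VLIB_CLI_COMMAND (show_top_sources_cmd, static) = {
  .path = "show ngfw top-sources",
  .short_help = "show ngfw top-sources",
  .function = show_top_sources_fn,
};
```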
Lab 3: NGFW Prototype — ACL + NAT + Custom Node
Objective: Assemble a minimal NGFW data plane with VPP's ACL plugin, NAT44, and your custom counter node all operating in the same pipeline.
1. Enable NAT44 between the tap interfaces: nat44 enable sessions 1024, set interface nat44 in tap0 out tap1, nat44 add interface address tap1.
2. Apply an ACL (as in Lab 1) and check show errors for ACL deny counts.
3. Enable your counter node on the interfaces, run your show ngfw top-sources command and verify the counts. Use show run to confirm your node's Clk/Vec — compare it to the built-in ACL node.
4. Capture a dispatch trace: pcap dispatch trace on max 500 file /tmp/vpp.pcap. Open it in Wireshark and identify the same packet at different graph nodes. Observe: pre-NAT vs post-NAT IP addresses, confirming NAT rewrote the packet.
M18 MASTERY CHECKLIST
- Know VPP's position: full-featured userspace network stack on DPDK, not just I/O toolkit
- Know VPP performance range: 20–100 Mpps/core for L3 forwarding; 15–60 with ACL; 5–20 with IPsec
- Know the 4 CPU microarchitectural benefits of vector processing: I-cache, branch predictor, prefetch pipeline, SIMD
- Know what Vectors/Call and Clocks/Vector mean in show run output and how to use them for bottleneck diagnosis
- Know the VPP graph: nodes receive vlib_frame_t (array of buffer indices), process, dispatch to next nodes
- Know node types: INPUT (polling), INTERNAL (processing), PROCESS (background)
- Know the canonical 4x unroll + prefetch pattern: prefetch N+4 while processing N; the unrolled loop runs while ≥8 packets remain and consumes 4 per iteration
- Know feature arcs: ip4-unicast and ip4-output arcs allow inserting custom nodes without modifying VPP core
- Know VLIB_REGISTER_NODE fields: name, vector_size, type, n_next_nodes, next_nodes array
- Know VPP FIB three layers: FIB table (hash + mtrie) → load-balance object → adjacency (pre-built rewrite)
- Know vppctl route commands: ip route add/del, ECMP via multiple add of same prefix
- Know null routes: ip route add prefix drop/local
- Know VRF support: ip table add N; set interface ip table sw_if_index N
- Know three control interfaces: vppctl CLI, Python VAPI, VAT2
- Know key diagnostic commands: show run, show errors, show interface, show buffers, show plugins, pcap dispatch trace
- Know VPP ACL plugin: stateful conntrack + rule matching; set acl-list interface in/out
- Know VPP NAT44: set interface nat44 in/out; nat44 add interface address
- Know clib_bihash as the primary data structure for session tables in VPP plugins
- Completed Lab 1: installed VPP, configured tap interfaces, routing, ACL; read FIB and graph
- Completed Lab 2: wrote plugin with clib_bihash per-IP counter and CLI show command
- Completed Lab 3: assembled ACL + NAT44 + custom node pipeline; pcap-traced through all stages
🎉 Phase 4 Complete — Linux Networking and Socket Programming
You have completed all 5 modules of Phase 4: Linux Network Stack (M14), Socket Programming (M15), eBPF and XDP (M16), DPDK (M17), and VPP (M18). You now have a complete and deep understanding of the Linux networking toolkit from kernel internals to the most advanced data-plane frameworks. Move to Phase 5 — Security Protocols, starting with M19 - Cryptography Foundations.