THE FOUR NODE TYPES
VLIB_NODE_TYPE_INPUT
Polled by the dispatcher on every main loop iteration. Used for packet ingress (dpdk-input, pg-input, memif-input). Returns number of vectors processed - used to switch between polling and sleep modes.
VLIB_NODE_TYPE_INTERNAL
Called only when another node enqueues packets to it via vlib_frame_t. The vast majority of nodes: ip4-lookup, ip4-rewrite, ethernet-input, your custom processing nodes.
VLIB_NODE_TYPE_PROCESS
Cooperative coroutine - suspends and resumes via vlib_process_suspend() and vlib_process_wait_for_event(). Used for slow-path work: ARP resolution, control-plane responses. Never handles packet vectors directly (see the sketch after this list).
VLIB_NODE_TYPE_PRE_INPUT
Called before INPUT nodes on every loop. Used for global preprocessing. Rare - only a few VPP nodes use this type.
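To make the PROCESS type concrete, here is a minimal sketch of a process node that wakes on events or a timeout. The node name, event handling, and one-second interval are illustrative, not taken from a specific VPP source file.

static uword
my_process_fn (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f)
{
  uword event_type, *event_data = 0;

  while (1)
    {
      /* Suspend until an event arrives or 1.0 seconds elapse */
      vlib_process_wait_for_event_or_clock (vm, 1.0);
      event_type = vlib_process_get_events (vm, &event_data);

      switch (event_type)
        {
        case ~0:  /* timeout - do periodic housekeeping */
          break;
        default:  /* event signalled by another node or thread */
          break;
        }
      vec_reset_length (event_data);
    }
  return 0;
}

VLIB_REGISTER_NODE (my_process_node) = {
  .function = my_process_fn,
  .name = "my-process",
  .type = VLIB_NODE_TYPE_PROCESS,
};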
Registering a Node - VLIB_REGISTER_NODE
MACRO PATTERN

/* The node function - processes up to frame->n_vectors packets */
static uword
my_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  u32 n_left_from, *from;

  from = vlib_frame_vector_args (frame);   /* array of buffer indices */
  n_left_from = frame->n_vectors;

  /* ... process packets ... */

  return frame->n_vectors;                 /* always return vectors processed */
}

/* Error strings (for show error) - defined before the registration
   that references them */
static char *my_node_error_strings[] = {
#define _(n, s) s,
  foreach_my_node_error
#undef _
};

/* Node registration - at file scope, executed at startup */
VLIB_REGISTER_NODE (my_node) = {
  .function = my_node_fn,
  .name = "my-node",
  .vector_size = sizeof (u32),             /* buffer index size */
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = MY_NODE_N_ERROR,
  .error_strings = my_node_error_strings,
  .n_next_nodes = MY_NODE_N_NEXT,
  .next_nodes = {
    [MY_NODE_NEXT_IP4_LOOKUP] = "ip4-lookup",
    [MY_NODE_NEXT_DROP] = "error-drop",
  },
  /* Optional: used by 'show trace' to format this node's trace records */
  .format_trace = format_my_node_trace,
};
MAIN LOOP - src/vlib/main.c
vlib_main_loop - The Heart of VPP
ARCHITECTURE
The dispatcher lives in vlib_main_loop(). You never write a main loop in VPP - the framework calls your nodes. Understanding the loop explains VPP's performance model.
/* Simplified pseudocode of vlib_main_loop (src/vlib/main.c) */
while (1)
  {
    /* 0. Run PRE_INPUT nodes (rarely used) */
    foreach pre_input_node:
      pre_input_node.fn (vm, node, frame);

    /* 1. Poll all INPUT nodes */
    foreach input_node:
      vectors = input_node.fn (vm, node, frame);
      /* the vector count returned drives the adaptive polling rate */

    /* 2. Run INTERNAL nodes that have pending frames */
    while pending_frames:
      dispatch_node (next_pending_node);
      /* this may enqueue more frames to other nodes */

    /* 3. Run PROCESS nodes that are ready */
    foreach ready_process:
      resume_process (proc);

    /* 4. Adaptive sleep if no work (avoids busy-spin at 0 pps) */
    if (total_vectors == 0):
      sleep_us = min (sleep_us * 2, max_sleep_us);
      sleep (sleep_us);
    else:
      sleep_us = 0;   /* busy poll when traffic is present */
  }
Frame lifecycle: When an INPUT node receives packets, it allocates a vlib_frame_t and fills it with buffer indices. It then enqueues the frame toward its next nodes (via vlib_put_next_frame or vlib_buffer_enqueue_to_next), which schedules those INTERNAL nodes for dispatch. The dispatcher runs each INTERNAL node when its frame is non-empty. Each INTERNAL node can enqueue further frames - the graph unfolds batch by batch.
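As a concrete illustration of that hand-off, a node can also obtain a frame destined for another node by node index, fill it, and return it to the dispatcher. This is a hedged sketch: the target node and the single buffer index bi are placeholders assumed to come from earlier in the function.

/* Sketch: hand one buffer to ethernet-input via an explicit frame */
vlib_node_t *to = vlib_get_node_by_name (vm, (u8 *) "ethernet-input");
vlib_frame_t *f = vlib_get_frame_to_node (vm, to->index);
u32 *to_next = vlib_frame_vector_args (f);

to_next[0] = bi;       /* copy buffer indices into the frame */
f->n_vectors = 1;      /* number of valid indices */

vlib_put_frame_to_node (vm, to->index, f);   /* schedule the node for dispatch */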
vlib_frame_t - Passing Packets Between Nodes
DATA STRUCTURE

/* A frame is a batch of buffer indices destined for one next node */
typedef struct
{
  u16 frame_flags;
  u16 flags;
  u16 n_vectors;   /* number of valid buffer indices in this frame */
  /* followed by the vector arguments: u32 buffer_indices[n_vectors] */
} vlib_frame_t;

/* Get the buffer index array from a frame */
u32 *from = vlib_frame_vector_args (frame);

/* Classic enqueue API: vlib_get_next_frame is a macro that yields the
   destination index array and the remaining space - it does not return
   a frame pointer */
u32 *to_next, n_left_to_next;
vlib_get_next_frame (vm, node, MY_NEXT_INDEX, to_next, n_left_to_next);
to_next[0] = buf_index;
to_next += 1;
n_left_to_next -= 1;
vlib_put_next_frame (vm, node, MY_NEXT_INDEX, n_left_to_next);

/* Modern API: enqueue by next-index array (preferred for multi-next nodes) */
u16 nexts[VLIB_FRAME_SIZE];
nexts[i] = MY_NODE_NEXT_IP4_LOOKUP;
vlib_buffer_enqueue_to_next (vm, node, from, nexts, n_vectors);
💡 Frame size limit: VLIB_FRAME_SIZE = 256. No single node invocation processes more than 256 packets. This is by design - it bounds worst-case latency for other nodes. INPUT nodes should return early once they have 256 packets.
vlib_buffer_t - EVERY PACKET IS ONE OF THESE
Buffer Memory Layout
src/vlib/buffer.h

/* Simplified vlib_buffer_t (src/vlib/buffer.h) */
typedef struct
{
  /* ── Cache line 0: hot fields ──────────────────────── */
  CLIB_CACHE_LINE_ALIGN_MARK (cacheline0);
  i16 current_data;           /* signed offset from data[] to the current L2/L3 header
                                 (negative when a header has been prepended) */
  u16 current_length;         /* bytes of valid data from current_data onwards */
  u32 flags;                  /* VLIB_BUFFER_IS_TRACED, etc. */
  u32 flow_id;                /* per-packet flow identifier */
  u32 next_buffer;            /* chained buffer index (for multi-seg packets) */
  u32 current_config_index;   /* feature arc state */
  vlib_error_t error;         /* error code set by any node */
  u8 ref_count;               /* reference count for cloning */

  /* ── Cache line 1: opaque per-node scratch space ───── */
  CLIB_CACHE_LINE_ALIGN_MARK (cacheline1);
  vnet_buffer_opaque_t opaque;     /* vnet_buffer(b)->ip.adj_index, etc. */

  /* ── Cache line 2: second opaque area ──────────────── */
  CLIB_CACHE_LINE_ALIGN_MARK (cacheline2);
  vnet_buffer_opaque2_t opaque2;   /* for your plugin's scratch data */

  /* ── Packet data follows the metadata ──────────────── */
  u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE];   /* headroom for encap headers */
  u8 data[0];                               /* actual packet bytes start here */
} vlib_buffer_t;
Working With Buffers - Essential Macros
REFERENCE

/* Get pointer to current header (L2, L3, or wherever we are) */
void *hdr = vlib_buffer_get_current (b);

/* Advance past current header (e.g. past Ethernet to reach IP) */
vlib_buffer_advance (b, sizeof (ethernet_header_t));
/* current_data += sizeof(eth_hdr); current_length -= sizeof(eth_hdr) */

/* Step back (e.g. to prepend an encap header) - note the signed cast */
vlib_buffer_advance (b, -(word) sizeof (ip4_header_t));

/* Access the vnet buffer opaque (L3/L4 metadata) */
vnet_buffer_opaque_t *vo = vnet_buffer (b);
u32 adj_idx = vo->ip.adj_index[VLIB_TX];
u32 sw_if_idx = vo->sw_if_index[VLIB_RX];

/* Get buffer from index (O(1) - base + offset) */
vlib_buffer_t *b = vlib_get_buffer (vm, buf_index);

/* Get buffer index from pointer */
u32 bi = vlib_get_buffer_index (vm, b);

/* Prefetch the next buffer's packet data - critical for dual-loop perf */
vlib_buffer_t *p2 = vlib_get_buffer (vm, from[2]);
CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, LOAD);

/* Allocate and free buffers */
u32 bi;
vlib_buffer_alloc (vm, &bi, 1);   /* allocate 1 buffer */
vlib_buffer_free (vm, &bi, 1);    /* free 1 buffer */

/* Clone a buffer (reference counting) */
vlib_buffer_clone (vm, src_bi, &dst_bi, 1, head_end_offset);
- current_data ≈ rte_mbuf.data_off - both are byte offsets into the data area
- current_length ≈ rte_mbuf.data_len - both track the valid data span
- opaque / opaque2 ≈ rte_mbuf.udata64 / the private mbuf area - per-packet scratch space
- next_buffer ≈ rte_mbuf.next - both support chained multi-segment packets
- Key difference: VPP passes u32 indices between nodes, not pointers - index-to-pointer conversion is a single array offset
- Pre-data area: VPP reserves bytes before data[] for encap headers - you can prepend headers by moving current_data negative, without a new buffer allocation
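A minimal sketch of that prepend trick follows. The outer header values, tunnel_src, and tunnel_dst are placeholders for illustration only.

/* Prepend an outer IPv4 header into the pre-data headroom of buffer b0 */
ip4_header_t *outer;

vlib_buffer_advance (b0, -(word) sizeof (ip4_header_t));  /* current_data moves back */
outer = vlib_buffer_get_current (b0);                      /* now points at the headroom */

outer->ip_version_and_header_length = 0x45;
outer->ttl = 64;
outer->protocol = IP_PROTOCOL_IP_IN_IP;
outer->src_address.as_u32 = tunnel_src;   /* placeholder */
outer->dst_address.as_u32 = tunnel_dst;   /* placeholder */
outer->length = clib_host_to_net_u16 (vlib_buffer_length_in_chain (vm, b0));
outer->checksum = ip4_header_checksum (outer);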
THE DUAL-LOOP PERFORMANCE PATTERN
Why Dual-Loop?
PERFORMANCE
Memory latency is the bottleneck in packet processing. A 64-byte cache line takes roughly 200 cycles to load from DRAM; processing one packet at a time means the CPU stalls for those cycles on every packet. The dual-loop pattern hides the latency by prefetching packet N+2 while processing packet N.
Structure: an outer loop processes 2 packets per iteration and prefetches 2 ahead. When fewer than 4 packets remain, execution falls into a single-packet loop. This is the canonical VPP pattern, used in ip4-lookup, ip4-rewrite, and every other high-performance node.
Dual-Loop Template - Annotated
CANONICAL PATTERN · src/vnet/ip/ip4_forward.c

static uword
my_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  u32 n_left_from, *from;
  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;

  from = vlib_frame_vector_args (frame);
  n_left_from = frame->n_vectors;

  /* ── Prefetch the first 4 buffers before entering the loop ── */
  if (n_left_from >= 4)
    {
      vlib_buffer_t *p;
      p = vlib_get_buffer (vm, from[0]); vlib_prefetch_buffer_header (p, LOAD);
      p = vlib_get_buffer (vm, from[1]); vlib_prefetch_buffer_header (p, LOAD);
      p = vlib_get_buffer (vm, from[2]); vlib_prefetch_buffer_header (p, LOAD);
      p = vlib_get_buffer (vm, from[3]); vlib_prefetch_buffer_header (p, LOAD);
    }

  /* ── DUAL LOOP: 2 packets per iteration ─────────── */
  while (n_left_from >= 4)
    {
      vlib_buffer_t *b0, *b1;
      u32 bi0, bi1;

      /* Prefetch 2 buffers ahead (hides DRAM latency) */
      {
        vlib_buffer_t *p2 = vlib_get_buffer (vm, from[2]);
        vlib_buffer_t *p3 = vlib_get_buffer (vm, from[3]);
        vlib_prefetch_buffer_header (p2, LOAD);
        vlib_prefetch_buffer_header (p3, LOAD);
        CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, LOAD);
        CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, LOAD);
      }

      bi0 = from[0];
      bi1 = from[1];
      from += 2;
      n_left_from -= 2;

      b0 = vlib_get_buffer (vm, bi0);   /* already in cache - no stall */
      b1 = vlib_get_buffer (vm, bi1);

      /* ── YOUR PROCESSING LOGIC FOR b0 AND b1 ── */
      ip4_header_t *ip0 = vlib_buffer_get_current (b0);
      ip4_header_t *ip1 = vlib_buffer_get_current (b1);

      next[0] = classify_packet (ip0);  /* determine next node */
      next[1] = classify_packet (ip1);
      next += 2;
      /* ─────────────────────────────────────────── */
    }

  /* ── SINGLE LOOP: handle the remaining 0-3 packets ── */
  while (n_left_from > 0)
    {
      vlib_buffer_t *b0 = vlib_get_buffer (vm, from[0]);
      next[0] = classify_packet (vlib_buffer_get_current (b0));
      from++;
      next++;
      n_left_from--;
    }

  /* Enqueue all packets to their respective next nodes */
  vlib_buffer_enqueue_to_next (vm, node, vlib_frame_vector_args (frame),
                               nexts, frame->n_vectors);
  return frame->n_vectors;
}
Modern "qs" Pattern - vlib_get_buffers
VPP v22+
Newer VPP nodes use the "quad-single" helper pattern, which fetches all buffer pointers upfront using a SIMD-friendly bulk get:
/* Bulk-fetch all buffer pointers - the compiler can vectorise this */
vlib_buffer_t *bufs[VLIB_FRAME_SIZE];
vlib_get_buffers (vm, from, bufs, n_vectors);

/* Now iterate bufs[] directly */
for (int i = 0; i < n_vectors; i++)
  nexts[i] = my_classify (bufs[i]);

vlib_buffer_enqueue_to_next (vm, node, from, nexts, n_vectors);
Use the qs pattern for new nodes. Use the hand-written dual-loop when you need ultra-precise prefetch control for memory-intensive operations (e.g., FIB lookup with pointer chasing).
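For reference, a hedged sketch of the quad-single loop shape once bufs[] is populated - the per-packet work (my_classify) and the prefetch distance of 8 are illustrative, not mandated by VPP:

vlib_buffer_t **b = bufs;
u16 *next = nexts;
u32 n_left = n_vectors;

/* QUAD loop: 4 packets per iteration, prefetch positions 4-7 */
while (n_left >= 8)
  {
    vlib_prefetch_buffer_header (b[4], LOAD);
    vlib_prefetch_buffer_header (b[5], LOAD);
    vlib_prefetch_buffer_header (b[6], LOAD);
    vlib_prefetch_buffer_header (b[7], LOAD);

    next[0] = my_classify (b[0]);
    next[1] = my_classify (b[1]);
    next[2] = my_classify (b[2]);
    next[3] = my_classify (b[3]);

    b += 4; next += 4; n_left -= 4;
  }

/* SINGLE loop: the remaining 0-7 packets */
while (n_left > 0)
  {
    next[0] = my_classify (b[0]);
    b += 1; next += 1; n_left -= 1;
  }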
MULTI-THREADING MODEL
Per-Worker vlib_main_t
ARCHITECTURE
VPP uses a share-nothing threading model. Each worker thread has its own vlib_main_t, its own buffer caches, and its own copy of the graph node runtime data. There is no global lock on the fast path.
- Worker 0 handles RX queue 0 of each interface; Worker 1 handles RX queue 1; and so on
- Each worker thread runs an independent copy of vlib_main_loop
- Workers never share packet ownership - a packet assigned to Worker 0 stays on Worker 0 unless explicitly handed off
- The main thread (Thread 0) handles the control plane: PROCESS nodes, CLI, API requests
/* Get the current worker's vlib_main_t (in node function context) */
vlib_main_t *vm = ...;                 /* already passed to your node function */
u32 thread_index = vm->thread_index;   /* 0 = main, 1..N = workers */

/* Access another thread's vlib_main */
vlib_main_t *wm = vlib_get_main_by_index (worker_idx);

/* Per-worker data in your plugin - index by thread_index */
typedef struct
{
  my_flow_t *flow_pool;
  clib_bihash_8_8_t flow_table;
} my_worker_t;

my_main_t *mm = &my_main;
my_worker_t *w = vec_elt_at_index (mm->workers, vm->thread_index);
Handoff - Cross-Worker Packet Transfer
src/vlib/threads.c
Sometimes a packet must be processed by a specific worker - for example, if your plugin requires all packets of the same flow to be handled by the same thread (stateful processing). Use the handoff mechanism.
/* Enqueue buffers to a different worker's input queue */
u16 target_worker = compute_flow_worker (flow_id);   /* thread indices are u16 */

if (target_worker != vm->thread_index)
  vlib_buffer_enqueue_to_thread (vm, node,
                                 handoff_queue_index,  /* registered frame queue */
                                 &bi, &target_worker,
                                 1,    /* n_packets */
                                 1);   /* drop on congestion */
See src/examples/handoffdemo/ for a complete working example. The handoff node approach is also used by the NAT plugin to ensure symmetric flow handling.
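The handoff_queue_index used above must be registered once, typically at plugin init, against the node that will dequeue the handed-off packets. A minimal sketch, assuming a receiving node named "my-handoff" and an init function name that are both illustrative:

/* One-time registration of a frame queue for cross-worker handoff */
static clib_error_t *
my_handoff_init (vlib_main_t *vm)
{
  my_main_t *mm = &my_main;
  vlib_node_t *n = vlib_get_node_by_name (vm, (u8 *) "my-handoff");

  /* 0 => use the default number of frame-queue elements */
  mm->handoff_queue_index = vlib_frame_queue_main_init (n->index, 0);
  return 0;
}

VLIB_INIT_FUNCTION (my_handoff_init);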
⚠️ Avoid unnecessary handoffs. Each cross-worker transfer adds latency and overhead. Design your hashing strategy (startup.conf num-rx-queues + RSS hash type) so packets of the same flow arrive at the same worker naturally through NIC RSS. Handoff is the fallback, not the primary mechanism.
PACKET TRACING
Adding Trace Support to Your Node
DEBUGGING

/* Step 1: define your trace structure */
typedef struct
{
  u32 sw_if_index;
  u8 next_index;
  u8 error;
  u32 flow_id;
} my_node_trace_t;

/* Step 2: format function - called by 'show trace' */
static u8 *
format_my_node_trace (u8 *s, va_list *args)
{
  vlib_main_t *vm = va_arg (*args, vlib_main_t *);
  vlib_node_t *node = va_arg (*args, vlib_node_t *);
  my_node_trace_t *t = va_arg (*args, my_node_trace_t *);

  s = format (s, "MY-NODE: sw_if_index %d next %d flow 0x%x",
              t->sw_if_index, t->next_index, t->flow_id);
  return s;
}

/* Step 3: in your node function, check the trace flag and record */
if (PREDICT_FALSE (b0->flags & VLIB_BUFFER_IS_TRACED))
  {
    my_node_trace_t *t = vlib_add_trace (vm, node, b0, sizeof (*t));
    t->sw_if_index = vnet_buffer (b0)->sw_if_index[VLIB_RX];
    t->next_index = next0;
    t->flow_id = b0->flow_id;
  }

/* Step 4: in VLIB_REGISTER_NODE, set:
   .format_trace = format_my_node_trace */
Error Counters - show error
OBSERVABILITY

/* Define errors with a foreach macro (standard VPP convention) */
#define foreach_my_node_error          \
  _(PROCESSED, "packets processed")    \
  _(NO_FLOW,   "flow not found")       \
  _(CHECKSUM,  "checksum error")

typedef enum
{
#define _(n, s) MY_NODE_ERROR_##n,
  foreach_my_node_error
#undef _
  MY_NODE_N_ERROR,
} my_node_error_t;

static char *my_node_error_strings[] = {
#define _(n, s) s,
  foreach_my_node_error
#undef _
};

/* Increment a counter (per-node, per-thread - no locking needed) */
vlib_node_increment_counter (vm, my_node.index,
                             MY_NODE_ERROR_PROCESSED, n_processed);
Graph Node Inspector
Objective: Understand the dispatch loop and node statistics by observation - no code yet.
- Start a VPP instance with the packet generator (pg) plugin. Create a pg interface and configure it as an L3 interface with an IP address.
- Define a stream with packet-generator new { name pg0 limit 10000 ... }. Run show run before and after. Record vectors/call and clocks/vector for each active node.
- Enable tracing with trace add pg-input 50, send 50 packets, then show trace. Map each line of the trace to the corresponding node function in the source tree.
- Attach a debugger and set a breakpoint in ip4_lookup's node function. Inspect frame->n_vectors and the first 4 buffer indices in from[]. Dereference one buffer index and read current_data and current_length.
- Increase the offered load and re-run show run. Observe that vectors/call increases toward VLIB_FRAME_SIZE (256). Explain why.
Custom Buffer Inspector Node
Objective: Write your first VPP plugin - a simple node that reads buffer headers and emits trace output. No packet modification yet.
- Copy src/examples/sample-plugin/ to a new directory src/plugins/buffer-inspector/. Rename all symbols.
- Register an INTERNAL node whose next node is ip4-lookup. In the node function, implement the dual-loop pattern. For each packet, read current_data, current_length, and the first 4 bytes of the IP header.
- Add trace support with format_buffer_inspector_trace.
- Register the node as a feature on the ip4-unicast arc (see the sketch after this list). Test with set interface feature vpp0 buffer-inspector arc ip4-unicast.
- Define error counters and verify they appear in show error.
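For the feature-arc step above, the registration typically looks like the following sketch. The symbol and node names match the hypothetical buffer-inspector plugin, not an existing VPP source file.

/* Hook the node onto the ip4-unicast feature arc so that
   'set interface feature' can enable it per interface */
#include <vnet/feature/feature.h>

VNET_FEATURE_INIT (buffer_inspector_feature, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "buffer-inspector",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

Inside the node function, call vnet_feature_next (&next0, b0) so packets continue along the arc instead of hard-coding the next node.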
P2B COMPLETION CHECKLIST
- Know all four node types and when to use each (INPUT, INTERNAL, PROCESS, PRE_INPUT)
- Understand vlib_main_loop: poll INPUT → dispatch INTERNAL frames → run PROCESS nodes
- Know vlib_buffer_t layout: current_data, current_length, opaque, pre-data area
- Can convert between buffer index and pointer; know why indices are passed between nodes, not pointers
- Can implement the dual-loop pattern with correct 2-ahead prefetch
- Understand the modern vlib_get_buffers + vlib_buffer_enqueue_to_next pattern
- Know VPP's share-nothing threading model: per-worker vlib_main_t, no fast-path locks
- Understand handoff: when it's needed and the overhead cost
- Can add trace support to a node with a custom format_trace function
- Can define and increment error counters that appear in show error
- Completed Projects 2 and 3: graph inspector and buffer inspector node with traces + counters