VPP MASTERY · PHASE 2B · WEEKS 5–6
⚙️ vlib - Graph Dispatcher
Node Types · Dispatch Loop · Buffer Layout · Dual-Loop Pattern · Multi-Threading · Packet Tracing
src/vlib/main.c src/vlib/node.h src/vlib/buffer.h 2 Mini-Projects

THE FOUR NODE TYPES

VLIB_NODE_TYPE_INPUT

Polled by the dispatcher on every main loop iteration. Used for packet ingress (dpdk-input, pg-input, memif-input). Returns number of vectors processed - used to switch between polling and sleep modes.

VLIB_NODE_TYPE_INTERNAL

Called only when another node enqueues packets to it via vlib_frame_t. The vast majority of nodes: ip4-lookup, ip4-rewrite, ethernet-input, your custom processing nodes.

VLIB_NODE_TYPE_PROCESS

Cooperative coroutine - runs with vlib_process_suspend() and vlib_process_wait_for_event(). Used for slow-path: ARP resolution, control-plane responses. Never handles packets directly.

VLIB_NODE_TYPE_PRE_INPUT

Called before INPUT nodes on every loop. Used for global preprocessing. Rare - only a few VPP nodes use this type.

📝

Registering a Node - VLIB_REGISTER_NODE

MACRO PATTERN
/* The node function - processes up to n_vectors packets */
static uword
my_node_fn (vlib_main_t * vm,
            vlib_node_runtime_t * node,
            vlib_frame_t * frame)
{
  u32 n_left_from, *from;
  from        = vlib_frame_vector_args(frame);   /* array of buf indices */
  n_left_from = frame->n_vectors;

  /* ... process packets ... */

  return frame->n_vectors;   /* always return vectors processed */
}

/* Error strings (for show error) - define before the registration
   that references them; foreach_my_node_error is defined in the
   error-counter section below */
static char *my_node_error_strings[] = {
#define _(n, s) s,
  foreach_my_node_error
#undef _
};

/* Node registration - at file scope, wired up at startup */
VLIB_REGISTER_NODE (my_node) = {
  .function      = my_node_fn,
  .name          = "my-node",
  .vector_size   = sizeof(u32),   /* buffer index size */
  .type          = VLIB_NODE_TYPE_INTERNAL,
  .n_errors      = MY_NODE_N_ERROR,
  .error_strings = my_node_error_strings,
  .n_next_nodes  = MY_NODE_N_NEXT,
  .next_nodes    = {
    [MY_NODE_NEXT_IP4_LOOKUP] = "ip4-lookup",
    [MY_NODE_NEXT_DROP]       = "error-drop",
  },
  /* Optional: formats per-packet records for 'show trace' */
  .format_trace  = format_my_node_trace,
};

MAIN LOOP - src/vlib/main.c

🔄

vlib_main_loop - The Heart of VPP

ARCHITECTURE

The dispatcher lives in vlib_main_loop(). You never write a main loop in VPP - the framework calls your nodes. Understanding the loop explains VPP's performance model.

/* Simplified pseudocode of vlib_main_loop (src/vlib/main.c) */
while (1) {
  /* 1. Poll all INPUT nodes */
  foreach input_node:
    vectors = input_node.fn(vm, node, frame);
    /* vectors returned drives adaptive polling rate */

  /* 2. Run INTERNAL nodes that have pending frames */
  while pending_frames:
    dispatch_node(next_pending_node);
    /* this may enqueue more frames to other nodes */

  /* 3. Run PROCESS nodes that are ready */
  foreach ready_process:
    resume_process(proc);

  /* 4. Adaptive sleep if no work (avoids busy-spin at 0 pps) */
  if (total_vectors == 0):
    sleep_time = min(sleep_time * 2, max_sleep);  /* back off while idle */
    sleep(sleep_time);
  else:
    sleep_time = 0;  /* busy poll when traffic present */
}

Frame lifecycle: When an INPUT node receives packets, it allocates a vlib_frame_t and fills it with buffer indices. It calls vlib_frame_enqueue to schedule INTERNAL nodes. The dispatcher runs each INTERNAL node when its frame is non-empty. Each INTERNAL node can enqueue further frames - the graph unfolds packet by packet.

📦

vlib_frame_t - Passing Packets Between Nodes

DATA STRUCTURE
/* A frame is a batch of buffer indices destined for one next node */
typedef struct {
  u16  n_vectors;     /* number of valid buffer indices in this frame */
  u16  flags;
  u32  frame_flags;
  /* followed immediately by: u32 buffer_indices[n_vectors] */
} vlib_frame_t;

/* Get the buffer index array from a frame */
u32 *from = vlib_frame_vector_args(frame);

/* Enqueue packet(s) to a next node (classic macro API - note that
   vlib_get_next_frame is a macro filling to_next/n_left_to_next,
   not a function returning a frame pointer) */
u32 *to_next, n_left_to_next;
vlib_get_next_frame(vm, node, MY_NEXT_INDEX, to_next, n_left_to_next);
to_next[0] = buf_index;
to_next += 1; n_left_to_next -= 1;
vlib_put_next_frame(vm, node, MY_NEXT_INDEX, n_left_to_next);

/* Modern API: enqueue by next index array (preferred for multi-next nodes) */
u16 nexts[VLIB_FRAME_SIZE];
nexts[i] = MY_NODE_NEXT_IP4_LOOKUP;
vlib_buffer_enqueue_to_next(vm, node, from, nexts, n_vectors);

💡 Frame size limit: VLIB_FRAME_SIZE = 256. No single node invocation processes more than 256 packets. This is by design - it bounds worst-case latency for other nodes. INPUT nodes should return early once they have 256 packets.

vlib_buffer_t - EVERY PACKET IS ONE OF THESE

📋

Buffer Memory Layout

src/vlib/buffer.h
/* Simplified vlib_buffer_t (src/vlib/buffer.h) */
typedef struct {
  /* ── Cache line 0: hot fields ──────────────────────── */
  CLIB_CACHE_LINE_ALIGN_MARK(cacheline0);

  i16 current_data;    /* signed offset into the data area; negative = in pre_data */
  u16 current_length;  /* bytes of valid data from current_data onwards */
  u32 flags;           /* VLIB_BUFFER_IS_TRACED, VLIB_BUFFER_NEXT_PRESENT, etc. */

  u32 flow_id;         /* per-packet flow identifier */
  u32 next_buffer;     /* chained buffer index (for multi-seg packets) */
  u32 current_config_index; /* feature arc state */
  u8  error;           /* error code set by any node */
  u8  n_add_refs;      /* reference count for cloning */

  /* ── Cache line 1: opaque per-node scratch space ───── */
  CLIB_CACHE_LINE_ALIGN_MARK(cacheline1);
  vnet_buffer_opaque_t opaque;   /* vnet_buffer(b)->ip.adj_index, etc. */

  /* ── Cache line 2: second opaque area ──────────────── */
  CLIB_CACHE_LINE_ALIGN_MARK(cacheline2);
  vnet_buffer_opaque2_t opaque2; /* for your plugin's scratch data */

  /* ── Cache line 3+: packet data ─────────────────────  */
  u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE]; /* headroom for encap headers */
  u8 data[0];       /* actual packet bytes start here */
} vlib_buffer_t;
🔧

Working With Buffers - Essential Macros

REFERENCE
/* Get pointer to current header (L2, L3, or wherever we are) */
void *hdr = vlib_buffer_get_current(b);

/* Advance past current header (e.g. past Ethernet to reach IP) */
vlib_buffer_advance(b, sizeof(ethernet_header_t));
/* current_data += sizeof(eth_hdr); current_length -= sizeof(eth_hdr) */

/* Step back into pre_data (e.g. to prepend an encap header).
   Cast before negating - sizeof is unsigned */
vlib_buffer_advance(b, -(word) sizeof(ip4_header_t));

/* Access vnet buffer opaque (contains L3/L4 metadata) */
vnet_buffer_opaque_t *vo = vnet_buffer(b);
u32 adj_idx   = vo->ip.adj_index[VLIB_TX];
u32 sw_if_idx = vo->sw_if_index[VLIB_RX];

/* Get buffer from index (O(1) - base + offset) */
vlib_buffer_t *b = vlib_get_buffer(vm, buf_index);

/* Get buffer index from pointer */
u32 bi = vlib_get_buffer_index(vm, b);

/* Prefetch next buffer's header - critical for dual-loop perf */
vlib_buffer_t *p2 = vlib_get_buffer(vm, from[2]);
CLIB_PREFETCH(p2->data, CLIB_CACHE_LINE_BYTES, LOAD);

/* Allocate and free buffers */
u32 bi;
if (vlib_buffer_alloc(vm, &bi, 1) != 1)  /* returns # actually allocated */
  { /* pool exhausted - handle the shortfall */ }
vlib_buffer_free(vm, &bi, 1);     /* free 1 buffer */

/* Clone a buffer (reference counting) */
vlib_buffer_clone(vm, src_bi, &dst_bi, 1, head_end_offset);
⚙️ vlib_buffer_t vs rte_mbuf
  • current_data ↔ rte_mbuf.data_off - both are byte offsets into the data area
  • current_length ↔ rte_mbuf.data_len - both track the valid data span
  • opaque / opaque2 ↔ rte_mbuf.udata64 / private mbuf area - per-packet scratch space
  • next_buffer ↔ rte_mbuf.next - both support chained multi-segment packets
  • Key difference: VPP passes u32 indices between nodes, not pointers - index-to-pointer conversion is a single array offset
  • Pre-data area: VPP reserves bytes before the packet data for encap headers - you can prepend a header by advancing current_data negative, with no new buffer allocation

THE DUAL-LOOP PERFORMANCE PATTERN

Why Dual-Loop?

PERFORMANCE

Memory latency is the bottleneck in packet processing. A 64-byte cache line takes ~200 cycles to load from DRAM. Processing one packet at a time means those 200 cycles are wasted. The dual-loop pattern hides latency by prefetching packet N+2 while processing packet N.

Structure: an outer loop processes 2 packets per iteration (prefetch 2 ahead). When fewer than 4 remain, fall into a single loop. This is the canonical VPP pattern - used in ip4-lookup, ip4-rewrite, and every high-performance node.

🔧

Dual-Loop Template - Annotated

CANONICAL PATTERN · src/vnet/ip/ip4_forward.c
static uword
my_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node,
            vlib_frame_t *frame)
{
  u32 n_left_from, *from;
  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;

  from        = vlib_frame_vector_args(frame);
  n_left_from = frame->n_vectors;

  /* ── Prefetch first 4 buffers before entering loop ──
     (guarded: the frame may hold fewer than 4 packets) */
  if (n_left_from >= 4) {
    vlib_buffer_t *p;
    p = vlib_get_buffer(vm, from[0]); vlib_prefetch_buffer_header(p, LOAD);
    p = vlib_get_buffer(vm, from[1]); vlib_prefetch_buffer_header(p, LOAD);
    p = vlib_get_buffer(vm, from[2]); vlib_prefetch_buffer_header(p, LOAD);
    p = vlib_get_buffer(vm, from[3]); vlib_prefetch_buffer_header(p, LOAD);
  }

  /* ── DUAL LOOP: 2 packets per iteration ─────────── */
  while (n_left_from >= 4) {
    vlib_buffer_t *b0, *b1;
    u32 bi0, bi1;

    /* Prefetch 2 buffers ahead (hides DRAM latency) */
    {
      vlib_buffer_t *p2 = vlib_get_buffer(vm, from[2]);
      vlib_buffer_t *p3 = vlib_get_buffer(vm, from[3]);
      vlib_prefetch_buffer_header(p2, LOAD);
      vlib_prefetch_buffer_header(p3, LOAD);
      CLIB_PREFETCH(p2->data, CLIB_CACHE_LINE_BYTES, LOAD);
      CLIB_PREFETCH(p3->data, CLIB_CACHE_LINE_BYTES, LOAD);
    }

    bi0 = from[0]; bi1 = from[1];
    from += 2; n_left_from -= 2;

    b0 = vlib_get_buffer(vm, bi0);   /* now in cache - no stall */
    b1 = vlib_get_buffer(vm, bi1);

    /* ── YOUR PROCESSING LOGIC FOR b0 AND b1 ── */
    ip4_header_t *ip0 = vlib_buffer_get_current(b0);
    ip4_header_t *ip1 = vlib_buffer_get_current(b1);

    next[0] = classify_packet(ip0);   /* determine next node */
    next[1] = classify_packet(ip1);
    next += 2;
    /* ─────────────────────────────────────────── */
  }

  /* ── SINGLE LOOP: handle remaining 0-3 packets ── */
  while (n_left_from > 0) {
    vlib_buffer_t *b0 = vlib_get_buffer(vm, from[0]);
    next[0] = classify_packet(vlib_buffer_get_current(b0));
    from++; next++; n_left_from--;
  }

  /* Enqueue all packets to their respective next nodes */
  vlib_buffer_enqueue_to_next(vm, node, vlib_frame_vector_args(frame),
                              nexts, frame->n_vectors);
  return frame->n_vectors;
}
🚀

Modern "qs" Pattern - vlib_get_buffers

MODERN API

Newer VPP nodes use the "quad-single" helper which fetches all buffers upfront using SIMD-friendly bulk get:

/* Bulk fetch all buffer pointers - compiler can vectorise */
vlib_buffer_t *bufs[VLIB_FRAME_SIZE];
vlib_get_buffers(vm, from, bufs, n_vectors);

/* Now iterate bufs[] directly */
for (int i = 0; i < n_vectors; i++) {
  nexts[i] = my_classify(bufs[i]);
}

vlib_buffer_enqueue_to_next(vm, node, from, nexts, n_vectors);

Use the qs pattern for new nodes. Use the hand-written dual-loop when you need ultra-precise prefetch control for memory-intensive operations (e.g., FIB lookup with pointer chasing).

MULTI-THREADING MODEL

🧵

Per-Worker vlib_main_t

ARCHITECTURE

VPP uses a share-nothing threading model. Each worker thread has its own vlib_main_t, its own buffer pool, and its own set of graph nodes. There is no global lock on the fast path.

  • Worker 0 handles RX queue 0 of each interface; Worker 1 handles RX queue 1; etc.
  • Each worker thread runs an independent copy of vlib_main_loop
  • Workers never share packet ownership - a packet assigned to Worker 0 stays on Worker 0 unless explicitly handed off
  • The main thread (Thread 0) handles control-plane: PROCESS nodes, CLI, API requests
/* Get the current worker's vlib_main_t (in node function context) */
vlib_main_t *vm = ...;   /* already passed to your node function */
u32 thread_index = vm->thread_index;   /* 0 = main, 1..N = workers */

/* Access another thread's vlib_main */
vlib_main_t *wm = vlib_get_main_by_index(worker_idx);

/* Per-worker data in your plugin - index by thread_index */
typedef struct {
  my_flow_t       *flow_pool;
  clib_bihash_8_8_t flow_table;
} my_worker_t;

my_main_t *mm = &my_main;
my_worker_t *w = vec_elt_at_index(mm->workers, vm->thread_index);
🔀

Handoff - Cross-Worker Packet Transfer

src/vlib/threads.c

Sometimes a packet must be processed by a specific worker - for example, if your plugin requires all packets of the same flow to be handled by the same thread (stateful processing). Use the handoff mechanism.

/* Enqueue buffers to a different worker's frame queue
   (thread indices are u16; the last arg drops on queue congestion) */
u16 target_worker = compute_flow_worker(flow_id);
if (target_worker != vm->thread_index) {
  vlib_buffer_enqueue_to_thread(vm, node,
                                handoff_queue_index,  /* registered queue */
                                &bi, &target_worker,
                                1,    /* n_packets */
                                1);   /* drop on congestion */
}

See src/examples/handoffdemo/ for a complete working example. The handoff node approach is also used by the NAT plugin to ensure symmetric flow handling.

⚠️ Avoid unnecessary handoffs. Each cross-worker transfer adds latency and overhead. Design your hashing strategy (startup.conf num-rx-queues + RSS hash type) so packets of the same flow arrive at the same worker naturally through NIC RSS. Handoff is the fallback, not the primary mechanism.

PACKET TRACING

🔍

Adding Trace Support to Your Node

DEBUGGING
/* Step 1: define your trace structure */
typedef struct {
  u32 sw_if_index;
  u8  next_index;
  u8  error;
  u32 flow_id;
} my_node_trace_t;

/* Step 2: format function - called by 'show trace' */
static u8 *
format_my_node_trace (u8 *s, va_list *args) {
  vlib_main_t *vm = va_arg(*args, vlib_main_t *);
  vlib_node_t *node = va_arg(*args, vlib_node_t *);
  my_node_trace_t *t = va_arg(*args, my_node_trace_t *);
  s = format(s, "MY-NODE: sw_if_index %d next %d flow 0x%x",
             t->sw_if_index, t->next_index, t->flow_id);
  return s;
}

/* Step 3: in your node function, check trace flag and record */
if (PREDICT_FALSE(b0->flags & VLIB_BUFFER_IS_TRACED)) {
  my_node_trace_t *t = vlib_add_trace(vm, node, b0, sizeof(*t));
  t->sw_if_index = vnet_buffer(b0)->sw_if_index[VLIB_RX];
  t->next_index  = next0;
  t->flow_id     = b0->flow_id;
}

/* Step 4: in VLIB_REGISTER_NODE, set: */
/* .format_trace = format_my_node_trace */
📊

Error Counters - show error

OBSERVABILITY
/* Define errors with a foreach macro (standard VPP convention) */
#define foreach_my_node_error  \
  _(PROCESSED,   "packets processed") \
  _(NO_FLOW,     "flow not found")    \
  _(CHECKSUM,    "checksum error")

typedef enum {
#define _(n,s) MY_NODE_ERROR_##n,
  foreach_my_node_error
#undef _
  MY_NODE_N_ERROR,
} my_node_error_t;

static char * my_node_error_strings[] = {
#define _(n,s) s,
  foreach_my_node_error
#undef _
};

/* Increment a counter (per-thread internally, so safe from any worker) */
vlib_node_increment_counter(vm, my_node.index,
                            MY_NODE_ERROR_PROCESSED, n_processed);
PROJECT 2

Graph Node Inspector

Objective: Understand the dispatch loop and node statistics by observation - no code yet.

1
Start VPP with the packet generator (pg) plugin. Create a pg interface and configure it as an L3 interface with an IP address.
2
Generate traffic: packet-generator new { name pg0 limit 10000 ... }. Run show run before and after. Record vectors/call, clocks/vector for each active node.
3
Use trace add pg-input 50, send 50 packets, then show trace. Map each line of the trace to the corresponding node function in the source tree.
4
Set a breakpoint in GDB on ip4_lookup's node function. Inspect frame->n_vectors and the first 4 buffer indices in from[]. Dereference one buffer index and read current_data and current_length.
5
Increase pg traffic to 1M packets/sec. Re-run show run. Observe that vectors/call increases toward VLIB_FRAME_SIZE (256). Explain why.
PROJECT 3

Custom Buffer Inspector Node

Objective: Write your first VPP plugin - a simple node that reads buffer headers and emits trace output. No packet modification yet.

1
Copy src/examples/sample-plugin/ to a new directory src/plugins/buffer-inspector/. Rename all symbols.
2
Create an INTERNAL node with one next: ip4-lookup. In the node function, implement the dual-loop pattern. For each packet, read current_data, current_length, and the first 4 bytes of the IP header.
3
Add trace support with a struct that stores: sw_if_index, IP src addr, IP dst addr, protocol. Implement format_buffer_inspector_trace.
4
Add a feature arc registration so your node can be inserted into the ip4-unicast arc. Test with set interface feature vpp0 buffer-inspector arc ip4-unicast.
5
Add error counters for: packets seen, packets with TTL==1, packets with unknown protocol. Verify they appear correctly in show error.
6
Run under the VPP test framework: write a Python test that sends 100 packets through the interface and asserts that the "packets seen" counter equals 100.

P2B COMPLETION CHECKLIST

← P2A: vppinfra 🗺️ Roadmap Next: vnet →