THE FOUR NODE TYPES
VLIB_NODE_TYPE_INPUT
Polled by the dispatcher on every main loop iteration. Used for packet ingress (dpdk-input, pg-input, memif-input). Returns number of vectors processed - used to switch between polling and sleep modes.
VLIB_NODE_TYPE_INTERNAL
Called only when another node enqueues packets to it via vlib_frame_t. The vast majority of nodes: ip4-lookup, ip4-rewrite, ethernet-input, your custom processing nodes.
VLIB_NODE_TYPE_PROCESS
Cooperative coroutine - suspends and resumes via vlib_process_suspend() and vlib_process_wait_for_event(). Used for slow-path work: ARP resolution, control-plane responses. Never handles packet vectors directly (see the sketch after this list).
VLIB_NODE_TYPE_PRE_INPUT
Called before INPUT nodes on every loop. Used for global preprocessing. Rare - only a few VPP nodes use this type.
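To make the PROCESS type concrete, here is a minimal sketch of a process node that wakes on events or a timeout. The node name, event handling, and one-second interval are illustrative, not taken from a specific VPP source file.

static uword
my_process_fn (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f)
{
  uword event_type, *event_data = 0;

  while (1)
    {
      /* Suspend until an event arrives or 1.0 seconds elapse */
      vlib_process_wait_for_event_or_clock (vm, 1.0);
      event_type = vlib_process_get_events (vm, &event_data);

      switch (event_type)
        {
        case ~0:  /* timeout - do periodic housekeeping */
          break;
        default:  /* event signalled by another node or thread */
          break;
        }
      vec_reset_length (event_data);
    }
  return 0;
}

VLIB_REGISTER_NODE (my_process_node) = {
  .function = my_process_fn,
  .name = "my-process",
  .type = VLIB_NODE_TYPE_PROCESS,
};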
Registering a Node - VLIB_REGISTER_NODE
MACRO PATTERN

/* The node function - processes up to frame->n_vectors packets */
static uword
my_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  u32 n_left_from, *from;

  from = vlib_frame_vector_args (frame);   /* array of buffer indices */
  n_left_from = frame->n_vectors;

  /* ... process packets ... */

  return frame->n_vectors;                 /* always return vectors processed */
}

/* Error strings (for show error) - defined before the registration
   that references them */
static char *my_node_error_strings[] = {
#define _(n, s) s,
  foreach_my_node_error
#undef _
};

/* Node registration - at file scope, executed at startup */
VLIB_REGISTER_NODE (my_node) = {
  .function = my_node_fn,
  .name = "my-node",
  .vector_size = sizeof (u32),             /* buffer index size */
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = MY_NODE_N_ERROR,
  .error_strings = my_node_error_strings,
  .n_next_nodes = MY_NODE_N_NEXT,
  .next_nodes = {
    [MY_NODE_NEXT_IP4_LOOKUP] = "ip4-lookup",
    [MY_NODE_NEXT_DROP] = "error-drop",
  },
  /* Optional: used by 'show trace' to format this node's trace records */
  .format_trace = format_my_node_trace,
};
MAIN LOOP - src/vlib/main.c
vlib_main_loop - The Heart of VPP
ARCHITECTURE
The dispatcher lives in vlib_main_loop(). You never write a main loop in VPP - the framework calls your nodes. Understanding the loop explains VPP's performance model.
/* Simplified pseudocode of vlib_main_loop (src/vlib/main.c) */
while (1)
  {
    /* 0. Run PRE_INPUT nodes (rarely used) */
    foreach pre_input_node:
      pre_input_node.fn (vm, node, frame);

    /* 1. Poll all INPUT nodes */
    foreach input_node:
      vectors = input_node.fn (vm, node, frame);
      /* the vector count returned drives the adaptive polling rate */

    /* 2. Run INTERNAL nodes that have pending frames */
    while pending_frames:
      dispatch_node (next_pending_node);
      /* this may enqueue more frames to other nodes */

    /* 3. Run PROCESS nodes that are ready */
    foreach ready_process:
      resume_process (proc);

    /* 4. Adaptive sleep if no work (avoids busy-spin at 0 pps) */
    if (total_vectors == 0):
      sleep_us = min (sleep_us * 2, max_sleep_us);
      sleep (sleep_us);
    else:
      sleep_us = 0;   /* busy poll when traffic is present */
  }
Frame lifecycle: When an INPUT node receives packets, it allocates a vlib_frame_t and fills it with buffer indices. It then enqueues the frame toward its next nodes (via vlib_put_next_frame or vlib_buffer_enqueue_to_next), which schedules those INTERNAL nodes for dispatch. The dispatcher runs each INTERNAL node when its frame is non-empty. Each INTERNAL node can enqueue further frames - the graph unfolds batch by batch.
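As a concrete illustration of that hand-off, a node can also obtain a frame destined for another node by node index, fill it, and return it to the dispatcher. This is a hedged sketch: the target node and the single buffer index bi are placeholders assumed to come from earlier in the function.

/* Sketch: hand one buffer to ethernet-input via an explicit frame */
vlib_node_t *to = vlib_get_node_by_name (vm, (u8 *) "ethernet-input");
vlib_frame_t *f = vlib_get_frame_to_node (vm, to->index);
u32 *to_next = vlib_frame_vector_args (f);

to_next[0] = bi;       /* copy buffer indices into the frame */
f->n_vectors = 1;      /* number of valid indices */

vlib_put_frame_to_node (vm, to->index, f);   /* schedule the node for dispatch */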
vlib_frame_t - Passing Packets Between Nodes
DATA STRUCTURE

/* A frame is a batch of buffer indices destined for one next node */
typedef struct
{
  u16 frame_flags;
  u16 flags;
  u16 n_vectors;   /* number of valid buffer indices in this frame */
  /* followed by the vector arguments: u32 buffer_indices[n_vectors] */
} vlib_frame_t;

/* Get the buffer index array from a frame */
u32 *from = vlib_frame_vector_args (frame);

/* Classic enqueue API: vlib_get_next_frame is a macro that yields the
   destination index array and the remaining space - it does not return
   a frame pointer */
u32 *to_next, n_left_to_next;
vlib_get_next_frame (vm, node, MY_NEXT_INDEX, to_next, n_left_to_next);
to_next[0] = buf_index;
to_next += 1;
n_left_to_next -= 1;
vlib_put_next_frame (vm, node, MY_NEXT_INDEX, n_left_to_next);

/* Modern API: enqueue by next-index array (preferred for multi-next nodes) */
u16 nexts[VLIB_FRAME_SIZE];
nexts[i] = MY_NODE_NEXT_IP4_LOOKUP;
vlib_buffer_enqueue_to_next (vm, node, from, nexts, n_vectors);
💡 Frame size limit: VLIB_FRAME_SIZE = 256. No single node invocation processes more than 256 packets. This is by design - it bounds worst-case latency for other nodes. INPUT nodes should return early once they have 256 packets.
vlib_buffer_t - EVERY PACKET IS ONE OF THESE
Buffer Memory Layout
src/vlib/buffer.h

/* Simplified vlib_buffer_t (src/vlib/buffer.h) */
typedef struct
{
  /* ── Cache line 0: hot fields ──────────────────────── */
  CLIB_CACHE_LINE_ALIGN_MARK (cacheline0);
  i16 current_data;           /* signed offset from data[] to the current L2/L3 header
                                 (negative when a header has been prepended) */
  u16 current_length;         /* bytes of valid data from current_data onwards */
  u32 flags;                  /* VLIB_BUFFER_IS_TRACED, etc. */
  u32 flow_id;                /* per-packet flow identifier */
  u32 next_buffer;            /* chained buffer index (for multi-seg packets) */
  u32 current_config_index;   /* feature arc state */
  vlib_error_t error;         /* error code set by any node */
  u8 ref_count;               /* reference count for cloning */

  /* ── Cache line 1: opaque per-node scratch space ───── */
  CLIB_CACHE_LINE_ALIGN_MARK (cacheline1);
  vnet_buffer_opaque_t opaque;     /* vnet_buffer(b)->ip.adj_index, etc. */

  /* ── Cache line 2: second opaque area ──────────────── */
  CLIB_CACHE_LINE_ALIGN_MARK (cacheline2);
  vnet_buffer_opaque2_t opaque2;   /* for your plugin's scratch data */

  /* ── Packet data follows the metadata ──────────────── */
  u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE];   /* headroom for encap headers */
  u8 data[0];                               /* actual packet bytes start here */
} vlib_buffer_t;
Working With Buffers - Essential Macros
REFERENCE

/* Get pointer to current header (L2, L3, or wherever we are) */
void *hdr = vlib_buffer_get_current (b);

/* Advance past current header (e.g. past Ethernet to reach IP) */
vlib_buffer_advance (b, sizeof (ethernet_header_t));
/* current_data += sizeof(eth_hdr); current_length -= sizeof(eth_hdr) */

/* Step back (e.g. to prepend an encap header) - note the signed cast */
vlib_buffer_advance (b, -(word) sizeof (ip4_header_t));

/* Access the vnet buffer opaque (L3/L4 metadata) */
vnet_buffer_opaque_t *vo = vnet_buffer (b);
u32 adj_idx = vo->ip.adj_index[VLIB_TX];
u32 sw_if_idx = vo->sw_if_index[VLIB_RX];

/* Get buffer from index (O(1) - base + offset) */
vlib_buffer_t *b = vlib_get_buffer (vm, buf_index);

/* Get buffer index from pointer */
u32 bi = vlib_get_buffer_index (vm, b);

/* Prefetch the next buffer's packet data - critical for dual-loop perf */
vlib_buffer_t *p2 = vlib_get_buffer (vm, from[2]);
CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, LOAD);

/* Allocate and free buffers */
u32 bi;
vlib_buffer_alloc (vm, &bi, 1);   /* allocate 1 buffer */
vlib_buffer_free (vm, &bi, 1);    /* free 1 buffer */

/* Clone a buffer (reference counting) */
vlib_buffer_clone (vm, src_bi, &dst_bi, 1, head_end_offset);
- current_data ≈ rte_mbuf.data_off - both are byte offsets into the data area
- current_length ≈ rte_mbuf.data_len - both track the valid data span
- opaque / opaque2 ≈ rte_mbuf.udata64 / the private mbuf area - per-packet scratch space
- next_buffer ≈ rte_mbuf.next - both support chained multi-segment packets
- Key difference: VPP passes u32 indices between nodes, not pointers - index-to-pointer conversion is a single array offset
- Pre-data area: VPP reserves bytes before data[] for encap headers - you can prepend headers by moving current_data negative, without a new buffer allocation
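A minimal sketch of that prepend trick follows. The outer header values, tunnel_src, and tunnel_dst are placeholders for illustration only.

/* Prepend an outer IPv4 header into the pre-data headroom of buffer b0 */
ip4_header_t *outer;

vlib_buffer_advance (b0, -(word) sizeof (ip4_header_t));  /* current_data moves back */
outer = vlib_buffer_get_current (b0);                      /* now points at the headroom */

outer->ip_version_and_header_length = 0x45;
outer->ttl = 64;
outer->protocol = IP_PROTOCOL_IP_IN_IP;
outer->src_address.as_u32 = tunnel_src;   /* placeholder */
outer->dst_address.as_u32 = tunnel_dst;   /* placeholder */
outer->length = clib_host_to_net_u16 (vlib_buffer_length_in_chain (vm, b0));
outer->checksum = ip4_header_checksum (outer);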
THE DUAL-LOOP PERFORMANCE PATTERN
Why Dual-Loop?
PERFORMANCE
Memory latency is the bottleneck in packet processing. A 64-byte cache line takes roughly 200 cycles to load from DRAM; processing one packet at a time means the CPU stalls for those cycles on every packet. The dual-loop pattern hides the latency by prefetching packet N+2 while processing packet N.
Structure: an outer loop processes 2 packets per iteration and prefetches 2 ahead. When fewer than 4 packets remain, execution falls into a single-packet loop. This is the canonical VPP pattern, used in ip4-lookup, ip4-rewrite, and every other high-performance node.
Dual-Loop Template - Annotated
CANONICAL PATTERN · src/vnet/ip/ip4_forward.c

static uword
my_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  u32 n_left_from, *from;
  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;

  from = vlib_frame_vector_args (frame);
  n_left_from = frame->n_vectors;

  /* ── Prefetch the first 4 buffers before entering the loop ── */
  if (n_left_from >= 4)
    {
      vlib_buffer_t *p;
      p = vlib_get_buffer (vm, from[0]); vlib_prefetch_buffer_header (p, LOAD);
      p = vlib_get_buffer (vm, from[1]); vlib_prefetch_buffer_header (p, LOAD);
      p = vlib_get_buffer (vm, from[2]); vlib_prefetch_buffer_header (p, LOAD);
      p = vlib_get_buffer (vm, from[3]); vlib_prefetch_buffer_header (p, LOAD);
    }

  /* ── DUAL LOOP: 2 packets per iteration ─────────── */
  while (n_left_from >= 4)
    {
      vlib_buffer_t *b0, *b1;
      u32 bi0, bi1;

      /* Prefetch 2 buffers ahead (hides DRAM latency) */
      {
        vlib_buffer_t *p2 = vlib_get_buffer (vm, from[2]);
        vlib_buffer_t *p3 = vlib_get_buffer (vm, from[3]);
        vlib_prefetch_buffer_header (p2, LOAD);
        vlib_prefetch_buffer_header (p3, LOAD);
        CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, LOAD);
        CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, LOAD);
      }

      bi0 = from[0];
      bi1 = from[1];
      from += 2;
      n_left_from -= 2;

      b0 = vlib_get_buffer (vm, bi0);   /* already in cache - no stall */
      b1 = vlib_get_buffer (vm, bi1);

      /* ── YOUR PROCESSING LOGIC FOR b0 AND b1 ── */
      ip4_header_t *ip0 = vlib_buffer_get_current (b0);
      ip4_header_t *ip1 = vlib_buffer_get_current (b1);

      next[0] = classify_packet (ip0);  /* determine next node */
      next[1] = classify_packet (ip1);
      next += 2;
      /* ─────────────────────────────────────────── */
    }

  /* ── SINGLE LOOP: handle the remaining 0-3 packets ── */
  while (n_left_from > 0)
    {
      vlib_buffer_t *b0 = vlib_get_buffer (vm, from[0]);
      next[0] = classify_packet (vlib_buffer_get_current (b0));
      from++;
      next++;
      n_left_from--;
    }

  /* Enqueue all packets to their respective next nodes */
  vlib_buffer_enqueue_to_next (vm, node, vlib_frame_vector_args (frame),
                               nexts, frame->n_vectors);
  return frame->n_vectors;
}
Modern "qs" Pattern - vlib_get_buffers
VPP v22+
Newer VPP nodes use the "quad-single" helper pattern, which fetches all buffer pointers upfront using a SIMD-friendly bulk get:
/* Bulk-fetch all buffer pointers - the compiler can vectorise this */
vlib_buffer_t *bufs[VLIB_FRAME_SIZE];
vlib_get_buffers (vm, from, bufs, n_vectors);

/* Now iterate bufs[] directly */
for (int i = 0; i < n_vectors; i++)
  nexts[i] = my_classify (bufs[i]);

vlib_buffer_enqueue_to_next (vm, node, from, nexts, n_vectors);
Use the qs pattern for new nodes. Use the hand-written dual-loop when you need ultra-precise prefetch control for memory-intensive operations (e.g., FIB lookup with pointer chasing).
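For reference, a hedged sketch of the quad-single loop shape once bufs[] is populated - the per-packet work (my_classify) and the prefetch distance of 8 are illustrative, not mandated by VPP:

vlib_buffer_t **b = bufs;
u16 *next = nexts;
u32 n_left = n_vectors;

/* QUAD loop: 4 packets per iteration, prefetch positions 4-7 */
while (n_left >= 8)
  {
    vlib_prefetch_buffer_header (b[4], LOAD);
    vlib_prefetch_buffer_header (b[5], LOAD);
    vlib_prefetch_buffer_header (b[6], LOAD);
    vlib_prefetch_buffer_header (b[7], LOAD);

    next[0] = my_classify (b[0]);
    next[1] = my_classify (b[1]);
    next[2] = my_classify (b[2]);
    next[3] = my_classify (b[3]);

    b += 4; next += 4; n_left -= 4;
  }

/* SINGLE loop: the remaining 0-7 packets */
while (n_left > 0)
  {
    next[0] = my_classify (b[0]);
    b += 1; next += 1; n_left -= 1;
  }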
MULTI-THREADING MODEL
Per-Worker vlib_main_t
ARCHITECTURE
VPP uses a share-nothing threading model. Each worker thread has its own vlib_main_t, its own buffer caches, and its own copy of the graph node runtime data. There is no global lock on the fast path.
- Worker 0 handles RX queue 0 of each interface; Worker 1 handles RX queue 1; and so on
- Each worker thread runs an independent copy of vlib_main_loop
- Workers never share packet ownership - a packet assigned to Worker 0 stays on Worker 0 unless explicitly handed off
- The main thread (Thread 0) handles the control plane: PROCESS nodes, CLI, API requests
/* Get the current worker's vlib_main_t (in node function context) */
vlib_main_t *vm = ...;                 /* already passed to your node function */
u32 thread_index = vm->thread_index;   /* 0 = main, 1..N = workers */

/* Access another thread's vlib_main */
vlib_main_t *wm = vlib_get_main_by_index (worker_idx);

/* Per-worker data in your plugin - index by thread_index */
typedef struct
{
  my_flow_t *flow_pool;
  clib_bihash_8_8_t flow_table;
} my_worker_t;

my_main_t *mm = &my_main;
my_worker_t *w = vec_elt_at_index (mm->workers, vm->thread_index);
Handoff - Cross-Worker Packet Transfer
src/vlib/threads.c
Sometimes a packet must be processed by a specific worker - for example, if your plugin requires all packets of the same flow to be handled by the same thread (stateful processing). Use the handoff mechanism.
/* Enqueue buffers to a different worker's input queue */
u16 target_worker = compute_flow_worker (flow_id);   /* thread indices are u16 */

if (target_worker != vm->thread_index)
  vlib_buffer_enqueue_to_thread (vm, node,
                                 handoff_queue_index,  /* registered frame queue */
                                 &bi, &target_worker,
                                 1,    /* n_packets */
                                 1);   /* drop on congestion */
See src/examples/handoffdemo/ for a complete working example. The handoff node approach is also used by the NAT plugin to ensure symmetric flow handling.
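The handoff_queue_index used above must be registered once, typically at plugin init, against the node that will dequeue the handed-off packets. A minimal sketch, assuming a receiving node named "my-handoff" and an init function name that are both illustrative:

/* One-time registration of a frame queue for cross-worker handoff */
static clib_error_t *
my_handoff_init (vlib_main_t *vm)
{
  my_main_t *mm = &my_main;
  vlib_node_t *n = vlib_get_node_by_name (vm, (u8 *) "my-handoff");

  /* 0 => use the default number of frame-queue elements */
  mm->handoff_queue_index = vlib_frame_queue_main_init (n->index, 0);
  return 0;
}

VLIB_INIT_FUNCTION (my_handoff_init);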
⚠️ Avoid unnecessary handoffs. Each cross-worker transfer adds latency and overhead. Design your hashing strategy (startup.conf num-rx-queues + RSS hash type) so packets of the same flow arrive at the same worker naturally through NIC RSS. Handoff is the fallback, not the primary mechanism.
PACKET TRACING
Adding Trace Support to Your Node
DEBUGGING

/* Step 1: define your trace structure */
typedef struct
{
  u32 sw_if_index;
  u8 next_index;
  u8 error;
  u32 flow_id;
} my_node_trace_t;

/* Step 2: format function - called by 'show trace' */
static u8 *
format_my_node_trace (u8 *s, va_list *args)
{
  vlib_main_t *vm = va_arg (*args, vlib_main_t *);
  vlib_node_t *node = va_arg (*args, vlib_node_t *);
  my_node_trace_t *t = va_arg (*args, my_node_trace_t *);

  s = format (s, "MY-NODE: sw_if_index %d next %d flow 0x%x",
              t->sw_if_index, t->next_index, t->flow_id);
  return s;
}

/* Step 3: in your node function, check the trace flag and record */
if (PREDICT_FALSE (b0->flags & VLIB_BUFFER_IS_TRACED))
  {
    my_node_trace_t *t = vlib_add_trace (vm, node, b0, sizeof (*t));
    t->sw_if_index = vnet_buffer (b0)->sw_if_index[VLIB_RX];
    t->next_index = next0;
    t->flow_id = b0->flow_id;
  }

/* Step 4: in VLIB_REGISTER_NODE, set:
   .format_trace = format_my_node_trace */
Error Counters - show error
OBSERVABILITY

/* Define errors with a foreach macro (standard VPP convention) */
#define foreach_my_node_error          \
  _(PROCESSED, "packets processed")    \
  _(NO_FLOW,   "flow not found")       \
  _(CHECKSUM,  "checksum error")

typedef enum
{
#define _(n, s) MY_NODE_ERROR_##n,
  foreach_my_node_error
#undef _
  MY_NODE_N_ERROR,
} my_node_error_t;

static char *my_node_error_strings[] = {
#define _(n, s) s,
  foreach_my_node_error
#undef _
};

/* Increment a counter (per-node, per-thread - no locking needed) */
vlib_node_increment_counter (vm, my_node.index,
                             MY_NODE_ERROR_PROCESSED, n_processed);
Graph Node Inspector
Objective: Understand the dispatch loop and node statistics by observation - no code yet.
- Start a VPP instance with the packet generator (pg) plugin. Create a pg interface and configure it as an L3 interface with an IP address.
- Define a stream with packet-generator new { name pg0 limit 10000 ... }. Run show run before and after. Record vectors/call and clocks/vector for each active node.
- Enable tracing with trace add pg-input 50, send 50 packets, then show trace. Map each line of the trace to the corresponding node function in the source tree.
- Attach a debugger and set a breakpoint in ip4_lookup's node function. Inspect frame->n_vectors and the first 4 buffer indices in from[]. Dereference one buffer index and read current_data and current_length.
- Increase the offered load and re-run show run. Observe that vectors/call increases toward VLIB_FRAME_SIZE (256). Explain why.
Custom Buffer Inspector Node
Objective: Write your first VPP plugin - a simple node that reads buffer headers and emits trace output. No packet modification yet.
- Copy src/examples/sample-plugin/ to a new directory src/plugins/buffer-inspector/. Rename all symbols.
- Register an INTERNAL node whose next node is ip4-lookup. In the node function, implement the dual-loop pattern. For each packet, read current_data, current_length, and the first 4 bytes of the IP header.
- Add trace support with format_buffer_inspector_trace.
- Register the node as a feature on the ip4-unicast arc (see the sketch after this list). Test with set interface feature vpp0 buffer-inspector arc ip4-unicast.
- Define error counters and verify they appear in show error.
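For the feature-arc step above, the registration typically looks like the following sketch. The symbol and node names match the hypothetical buffer-inspector plugin, not an existing VPP source file.

/* Hook the node onto the ip4-unicast feature arc so that
   'set interface feature' can enable it per interface */
#include <vnet/feature/feature.h>

VNET_FEATURE_INIT (buffer_inspector_feature, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "buffer-inspector",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

Inside the node function, call vnet_feature_next (&next0, b0) so packets continue along the arc instead of hard-coding the next node.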
P2B COMPLETION CHECKLIST
- Know all four node types and when to use each (INPUT, INTERNAL, PROCESS, PRE_INPUT)
- Understand vlib_main_loop: poll INPUT → dispatch INTERNAL frames → run PROCESS nodes
- Know vlib_buffer_t layout: current_data, current_length, opaque, pre-data area
- Can convert between buffer index and pointer; know why indices are passed between nodes, not pointers
- Can implement the dual-loop pattern with correct 2-ahead prefetch
- Understand the modern vlib_get_buffers + vlib_buffer_enqueue_to_next pattern
- Know VPP's share-nothing threading model: per-worker vlib_main_t, no fast-path locks
- Understand handoff: when it's needed and the overhead cost
- Can add trace support to a node with a custom format_trace function
- Can define and increment error counters that appear in show error
- Completed Projects 2 and 3: graph inspector and buffer inspector node with traces + counters