VPP MASTERY · HOST STACK · BONUS MODULE
🌐 VPP Host Stack
TCP & Session Layers - Userspace networking, SVM FIFOs, VCL, Cut-through connections
src/vnet/session/ src/vnet/tcp/ src/svm/ src/vcl/ 200K CPS · 8 Gbps/core

THE PROBLEM IT SOLVES

🔍

Why Traditional Networking is Slow for High-Performance Apps

MOTIVATION

In the standard Linux model, every network operation - send(), recv(), connect() - crosses the kernel boundary. This means:

  • System call overhead: Each send() traps into the kernel, switches CPU context, runs kernel TCP code, copies data to a kernel socket buffer, then returns. At 10 Gbps line rate this becomes the bottleneck.
  • Data copies: Data moves from your app buffer → kernel socket buffer → NIC DMA buffer. Multiple copies per packet.
  • Cache pollution: Kernel code runs on the same CPU cores, evicting your application's data from L1/L2 cache.
  • Scheduling jitter: The kernel may deschedule your process at any moment, adding microseconds of latency.

VPP's host stack eliminates all of this. The entire TCP/IP stack runs inside VPP's userspace process. Applications communicate with it via shared memory FIFOs - no syscalls, no copies, no kernel crossing on the data path.

🐢

Traditional Model

SLOW PATH

App calls send() → syscall trap → kernel TCP → kernel socket buffer → NIC driver → DMA → wire

Data path crosses: user/kernel boundary × 2, 2–3 memory copies, context switches, scheduler jitter

Ceiling: ~1–5 Gbps per core with significant CPU load

VPP Host Stack

FAST PATH

App writes to TX FIFO → VPP reads shared memory → TCP → IP → DPDK → DMA → wire

Data path crosses: shared memory write only, zero copies, no kernel, no syscalls

Demonstrated: 8 Gbps/core, 200K connections/sec (2017 numbers - much higher on modern HW)

200K connections/sec · on a single Intel Xeon E5-2690
8 Gbps · throughput per core · normal TCP sessions
~120 Gbps · cut-through mode · memory bandwidth limit
0 syscalls on data path · pure shared memory

BEGINNER VOCABULARY - KNOW THESE BEFORE GOING DEEPER

userspace TCP stack Normally TCP/IP lives inside the Linux kernel. A "userspace stack" means VPP implements its own complete TCP state machine as a regular process - no kernel involvement in packet processing. Your application doesn't use kernel sockets at all.
shared memory Two processes map the same physical RAM pages into their own virtual address spaces. Process A writes bytes at address 0x7f000000; Process B reads those same bytes at whatever address it mapped them to. Zero copies, zero kernel involvement - just memory reads and writes.
SVM Shared Virtual Memory. VPP's allocator for shared memory regions. An SVM segment is a named, fixed-size chunk of shared memory. Multiple SVM segments can exist simultaneously - one per application namespace, or one per application. Source: src/svm/
FIFO First In First Out. A ring buffer - bytes written at the head are read from the tail in order. In VPP host stack, every TCP session has two FIFOs allocated inside a shared memory segment: one for data flowing VPP→App (RX FIFO) and one for App→VPP (TX FIFO).
lock-free The FIFO can be written and read simultaneously by two threads (VPP worker and app) without mutex locks. This works because each FIFO has exactly one writer and one reader - SPSC (single producer single consumer). Atomic operations on head/tail pointers ensure consistency without blocking.
Binary API VPP's control-plane message protocol. Structured binary messages sent over a Unix socket or shared memory queue. Used for control operations: create a session, bind a port, connect, set options. NOT used for packet data - that goes through SVM FIFOs. Like the difference between a REST API (control) and a database file (data).
session (vs connection) VPP uses "session" as the generic term for an endpoint-to-endpoint communication channel, regardless of transport protocol (TCP, UDP, TLS, QUIC). A TCP session wraps a TCP connection. The session layer manages all sessions uniformly; the transport layer (TCP) handles the specific protocol.
5-tuple The 5 fields that uniquely identify a TCP/UDP flow: source IP, source port, destination IP, destination port, protocol. VPP's session lookup table maps 5-tuple → session object in O(1) using a bihash. Every arriving packet is looked up by its 5-tuple to find the right session and FIFO.
VCL VPP Communications Library. A C library that applications link against. It provides POSIX-socket-like functions (vcl_connect, vcl_read, vcl_write, vcl_epoll_wait) that talk to VPP's session layer via Binary API and SVM FIFOs instead of calling into the kernel.
LD_PRELOAD A Linux environment variable that forces the dynamic linker to load a specified shared library before all others. VCL provides an LD_PRELOAD library (libvcl_ldpreload.so) that intercepts standard POSIX socket calls (connect, send, recv, epoll_wait) and redirects them to VPP - without modifying or recompiling the application. nginx can use VPP's stack with just LD_PRELOAD=libvcl_ldpreload.so nginx.
namespace VPP session namespaces isolate network resources between applications. Each namespace has its own local session lookup table and can be associated with a specific VRF (routing table). App A in namespace 1 cannot see App B's sessions in namespace 2, even though they share the same VPP instance. Think of it like Linux network namespaces, but inside VPP.
cut-through (redirect) When a server application advertises itself as a cut-through target, VPP can redirect a new client connection directly to the server's shared memory segment - bypassing TCP entirely for the data path. The client writes to what it thinks is a TCP socket; the bytes appear directly in the server's RX FIFO. Throughput is limited only by memory bandwidth (~120 Gbps).
NewReno A TCP congestion control algorithm. When packet loss is detected, NewReno reduces the sending window size and slowly increases it again. VPP implements NewReno as its baseline congestion control, plus SACK-based fast recovery. You don't need to tune this for most use cases, but knowing it exists matters for latency-sensitive workloads.
SACK Selective Acknowledgement. A TCP extension where the receiver tells the sender exactly which segments it has received (not just the highest in-order byte). This allows the sender to retransmit only the missing segments rather than everything after a loss. VPP's TCP implementation supports SACK for efficient loss recovery.

FULL STACK ARCHITECTURE

Application
Your process (nginx, iperf, custom app). Links against VCL or uses LD_PRELOAD.
↕  vcl_connect / vcl_read / vcl_write  ↕
VCL - VPP Communications Library
src/vcl/ · POSIX-like API · LD_PRELOAD intercept · epoll shim
↕  Binary API messages (control only)  ↕
- VPP process boundary - shared memory segment allocated here —
Session Layer
src/vnet/session/ · App state · SVM FIFO alloc · 5-tuple lookup · namespaces · pluggable transport
↕  session_tx_fifo / session_rx_fifo  ↕
TCP (clean-slate userspace)
src/vnet/tcp/ · Full state machine · NewReno · SACK · retransmit timers · checksum offload
↕  ip4-lookup / ip4-rewrite graph nodes  ↕
IP / vnet
FIB lookup · routing · adjacency rewrite
↕  dpdk-input / dpdk-output  ↕
DPDK / NIC driver
Physical NIC · hugepages · zero-copy mbuf RX/TX

💡 Key insight - two independent paths: The Binary API (control plane) and the SVM FIFOs (data plane) are completely separate. The Binary API is used only for setup: creating sessions, binding ports, setting options - think of it like the control socket. The SVM FIFOs are the actual data highway - once a session is established, the app and VPP only talk through shared memory reads/writes. No Binary API messages on the hot path.

SOURCE DIRECTORY MAP

📁

Where the Code Lives

SOURCE
src/vnet/session/   # Session layer: session.c, session_api.c, application.c
src/vnet/tcp/       # TCP implementation: tcp.c, tcp_input.c, tcp_output.c
                    #   tcp_cc.c (congestion control), tcp_timer.c
src/svm/            # Shared Virtual Memory: svm_fifo.c, fifo_segment.c
src/vcl/            # VCL library: vcl_private.c, vppcom.c, ldp.c (LD_PRELOAD)
src/vnet/session/   # Binary API: session.api

SESSION LAYER - src/vnet/session/

🗂️

What the Session Layer Manages

OVERVIEW

The session layer sits between transport protocols (TCP, UDP, TLS, QUIC) and applications. It is the broker that connects "application wants to receive data on port 8080" with "TCP received bytes for the flow matching (srcIP:srcPort → dstIP:8080)".

It owns five responsibilities:

  • Application registration: When an app attaches via Binary API, the session layer creates an application_t object and maps it to a namespace and set of permissions.
  • Session allocation: For every accepted TCP connection, one session_t object is allocated (from a pool), and two SVM FIFOs are allocated (RX and TX) inside the app's shared memory segment (a simplified session_t layout is sketched after this list).
  • Lookup tables: Two bihash tables - a global table (keyed by 5-tuple) for ingress matching, and a local table (keyed by local endpoint) for bind/accept matching.
  • Namespace isolation: Each namespace has its own local session table and can be pinned to a VRF. Apps in different namespaces cannot see each other's sessions.
  • Transport abstraction: The session layer defines a transport protocol interface (transport_proto_vft_t) - TCP, UDP, TLS, QUIC all register themselves. New transport protocols can be added as plugins.
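As a mental model for what gets allocated per connection, here is a simplified sketch of the per-session object. Field names loosely follow src/vnet/session/session_types.h, but the layout is abbreviated and illustrative, not the exact struct:

typedef struct session_sketch_ {
    svm_fifo_t *rx_fifo;        /* VPP enqueues received payload here */
    svm_fifo_t *tx_fifo;        /* app enqueues outgoing payload here */
    u32 session_index;          /* index into the owning worker's session pool */
    u32 connection_index;       /* transport-side object (e.g. tcp_connection_t) */
    u32 app_wrk_index;          /* which application worker owns this session */
    u8  thread_index;           /* VPP worker thread that owns the session */
    u8  session_state;          /* listening / accepting / ready / closing ... */
} session_sketch_t;

Accepting a connection therefore boils down to: allocate one of these from the worker's pool, allocate the two FIFOs from the app's segment, and link the session to its transport connection by index.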
🔍

Session Lookup Tables - Two-Table Design

INTERNALS
/* Two lookup tables - different key spaces */

/* Global session table: ingress packet → active session */
/* Key: full 5-tuple (src_ip, src_port, dst_ip, dst_port, proto) */
/* Value: session index */
/* Used by: transport input nodes (e.g. tcp4-input) on the data path */
session_table_t *global_table = &session_main.session_tables[fib_index];

/* Local session table: per-namespace, for listen/bind */
/* Key: local endpoint (dst_ip, dst_port, proto) */
/* Value: listen session index (which app is listening here?) */
/* Used by: TCP SYN processing - find who owns this port */
session_table_t *local_table = &ns->local_session_table;

/* Fast-path lookup (simplified signature) - called per packet on the TCP input path */
session_t *s = session_lookup_connection_wt4(fib_index,
    &ip4_hdr->src_address, &ip4_hdr->dst_address,
    tcp_hdr->src_port, tcp_hdr->dst_port,
    TRANSPORT_PROTO_TCP);

/* IPv4 sessions are keyed with a bihash_16_8 (16-byte key: addresses, ports, protocol) */
/* IPv6 sessions use a bihash_48_8 (48-byte key) */

The two-table design also supports session rules - filter rules attached to either table:

  • Local table rules - namespace-specific, used for egress filtering (which apps can connect out)
  • Global table rules - VRF-specific, used for ingress filtering (which connections are accepted into a namespace)
🔌

Transport Protocol Plugin Interface

EXTENSIBILITY

Any transport protocol registers itself with the session layer by implementing transport_proto_vft_t. This is a vtable of function pointers - the session layer calls these without knowing which protocol it's talking to.

typedef struct {
    /* Connection management */
    u32  (*open)  (transport_endpoint_cfg_t *tep);
    void (*close) (u32 conn_index, u32 thread_index);
    void (*reset) (u32 conn_index, u32 thread_index);

    /* Data transfer */
    u32  (*push_header) (transport_connection_t *tc, vlib_buffer_t **b, u32 n);
    u16  (*send_mss)    (transport_connection_t *tc);

    /* Introspection */
    transport_connection_t *(*get_connection)(u32 idx, u32 thread);
    u8  *(*format_connection)(u8 *s, va_list *args);
} transport_proto_vft_t;

/* Registration (in TCP plugin init) */
transport_register_protocol(TRANSPORT_PROTO_TCP, &tcp_proto, FIB_PROTOCOL_IP4, ~0);
transport_register_protocol(TRANSPORT_PROTO_TCP, &tcp_proto, FIB_PROTOCOL_IP6, ~0);

This is why VPP can support TCP, UDP, TLS (via OpenSSL/mbedTLS plugins), QUIC (via quicly), and custom protocols - they all plug into the same session layer infrastructure.
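To make the indirection concrete, here is a small sketch of how the session layer dispatches through the vtable. The lookup helper transport_protocol_get_vft() is assumed from src/vnet/session/transport.c; the wrapper function itself is hypothetical:

/* Hypothetical wrapper: resolve the registered vft by protocol number
 * and dispatch without knowing which transport is behind it. */
static inline u16
session_transport_mss (transport_connection_t *tc, u8 transport_proto)
{
    transport_proto_vft_t *vft = transport_protocol_get_vft (transport_proto);
    return vft->send_mss (tc);   /* for TCP sessions this lands in the TCP code */
}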

SVM FIFOs - src/svm/svm_fifo.c

🔄

What a FIFO Looks Like in Memory

INTERNALS

An SVM FIFO is a ring buffer allocated inside a shared memory segment. Two processes - VPP's dataplane worker and the application - access it simultaneously, with no locks. Safety comes from the SPSC (Single Producer Single Consumer) guarantee: exactly one writer advances the head, exactly one reader advances the tail.

[Diagram] RX FIFO for session 42 (64 KB ring): VPP enqueues at the head, the app dequeues from the tail. The occupied region (shown as 5 data segments) sits between tail and head; the remainder of the ring is empty.
typedef struct svm_fifo {
    CLIB_CACHE_LINE_ALIGN_MARK(cacheline0);
    atomic_u32 head;          /* consumer (app) advances this */
    atomic_u32 tail;          /* producer (VPP) advances this */
    u32  size;                /* ring capacity in bytes */
    u32  nitems;              /* number of items (size / chunk) */
    u8  *data;                /* pointer into shared memory region */
    u32  master_session_index;/* which session owns this FIFO */
    u8   master_thread_index; /* worker thread that manages it */
    svm_fifo_chunk_t *ooo_enqueues; /* out-of-order data list */
} svm_fifo_t;

Important design properties:

  • Fixed position in shared memory: Once allocated, a FIFO never moves. The app holds a pointer to it from the moment the session is established.
  • Out-of-order support: TCP can receive segments out of order. The FIFO supports enqueueing OOO data directly - VPP enqueues each TCP segment at its byte-offset position. When the gap fills, the FIFO contiguous range advances.
  • Lock-free dequeue with option to peek: The app can peek at data without consuming it - useful for protocols where you need to inspect a header before deciding how much to read.
  • Atomic size increment: Head/tail updates need only simple stores with acquire/release ordering (safe because of the SPSC guarantee), while the occupied-bytes counter uses an atomic increment so either side can safely inspect the fill level (see the sketch below).
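A minimal sketch of the SPSC discipline described above - this is not VPP's actual svm_fifo code, just the acquire/release head/tail pattern that makes one writer and one reader safe without locks:

#include <stdatomic.h>
#include <stdint.h>

/* Toy SPSC byte ring; VPP's real svm_fifo adds chunked memory,
 * out-of-order enqueue and event hooks on top of this idea. */
typedef struct {
    _Atomic uint32_t head;   /* advanced only by the producer (VPP worker)  */
    _Atomic uint32_t tail;   /* advanced only by the consumer (application) */
    uint32_t size;           /* ring capacity, power of two */
    uint8_t *data;
} toy_fifo_t;

static uint32_t
toy_enqueue (toy_fifo_t *f, const uint8_t *src, uint32_t len)
{
    uint32_t head = atomic_load_explicit (&f->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit (&f->tail, memory_order_acquire);
    uint32_t free_space = f->size - (head - tail);

    if (len > free_space)
        len = free_space;
    for (uint32_t i = 0; i < len; i++)
        f->data[(head + i) & (f->size - 1)] = src[i];

    /* release: the copied bytes must be visible before the new head value */
    atomic_store_explicit (&f->head, head + len, memory_order_release);
    return len;
}
/* toy_dequeue mirrors this with the roles swapped: load head with acquire,
 * copy out, then store the advanced tail with release. */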
📦

SVM Segment - The Shared Memory Container

MEMORY MODEL

FIFOs are allocated inside an SVM segment - a named, fixed-size region of shared memory created with shm_open + mmap. The segment is created by VPP and mapped into both the VPP process and the application process at session establishment time.

/* Segment creation (VPP side, triggered by app attach) */
fifo_segment_create_args_t a = {
    .segment_name = "app-42-segment",
    .segment_size = 64 << 20,   /* 64 MB */
    .segment_type = SSVM_SEGMENT_SHM,
};
fifo_segment_create(sm, &a);

/* App maps the segment (VCL side) */
ssvm_slave_init_shm(sh);  /* mmap's the segment into the app's VA space */

/* After mmap: both sides hold pointers to the same physical pages */
svm_fifo_t *rx_fifo = session->rx_fifo;   /* VPP's pointer */
svm_fifo_t *rx_fifo = vcl_session->rx_fifo; /* App's pointer - same memory */

/* Write (VPP, on packet receive): */
svm_fifo_enqueue(s->rx_fifo, b->current_length,
                 vlib_buffer_get_current(b));

/* Read (app, via VCL): */
n = svm_fifo_dequeue(vcl_s->rx_fifo, buf_len, buf);

⚠️ Segment size tuning: The segment is allocated at app attach time and its size is fixed. If your app's sessions have large buffers (e.g., 1 MB RX FIFO per connection) and you have many concurrent connections, segment exhaustion is a common issue. Tune the segment and FIFO sizes via the app's attach options or the VCL configuration (segment-size, add-segment-size).
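For VCL applications these knobs live in the VCL configuration file referenced by VCL_CONFIG. A sketch - the option names follow common VCL documentation and may differ slightly between releases:

# /etc/vpp/vcl.conf (illustrative)
vcl {
  segment-size 268435456      # 256 MB initial segment per app
  add-segment-size 134217728  # grow by 128 MB when a segment fills up
  rx-fifo-size 1048576        # 1 MB RX FIFO per session
  tx-fifo-size 1048576        # 1 MB TX FIFO per session
}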

VPP TCP - src/vnet/tcp/

📡

Clean-Slate TCP Implementation

OVERVIEW

VPP implements TCP from scratch - it does not use any Linux kernel TCP code. This was a deliberate choice: the kernel's TCP is optimised for general-purpose use across millions of different scenarios. VPP's TCP is optimised for high throughput with many concurrent connections in a polling dataplane.

What it implements:

  • Full state machine: all the RFC 793 states - CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED, FIN_WAIT_1/2, CLOSING, CLOSE_WAIT, LAST_ACK, TIME_WAIT - including error transitions
  • Flow control: Sliding window, window scaling (RFC 7323), receive window advertisement
  • Congestion control: NewReno (default), with a pluggable congestion control interface for cubic, BBR, etc. (see the sketch after this list)
  • Loss recovery: Fast retransmit, fast recovery, RTO-based retransmission, SACK-based selective retransmit
  • Timers: RTO (retransmission timeout), persist, keepalive, TIME_WAIT - all implemented on VPP's timer wheel infrastructure (tw_timer_*)
  • Checksum offloading: Delegates to DPDK TX offload when hardware supports it
  • PMTU discovery: Path MTU discovery via ICMP unreachable handling
  • TSO: TCP Segmentation Offload on supporting NICs
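The pluggable congestion-control interface mentioned above can be pictured roughly as follows. This is a hedged sketch: the hook names are abbreviated from VPP's tcp_cc_algorithm_t and tcp_cc_algo_register() in src/vnet/tcp/, and the exact field list differs by release:

/* Illustrative subset of the congestion-control vtable */
typedef struct {
    void (*init)         (tcp_connection_t *tc);  /* set initial cwnd / ssthresh */
    void (*rcv_ack)      (tcp_connection_t *tc);  /* in-order ACK: grow cwnd */
    void (*rcv_cong_ack) (tcp_connection_t *tc);  /* dup / partial ACK during recovery */
    void (*congestion)   (tcp_connection_t *tc);  /* loss detected: shrink cwnd */
    void (*recovered)    (tcp_connection_t *tc);  /* fast recovery finished */
} tcp_cc_algorithm_sketch_t;

/* NewReno registers itself at init time; an alternative algorithm
 * (cubic, BBR, ...) would register its own vtable the same way */
tcp_cc_algo_register (TCP_CC_NEWRENO, &newreno_cc_algo);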
📊

TCP in the VPP Graph - Node Chain

GRAPH NODES

TCP processing is decomposed into graph nodes like everything else in VPP. The RX and TX paths are separate node chains:

/* RX path (incoming segment) */
dpdk-input
  → ethernet-input
    → ip4-input
      → ip4-lookup            /* FIB lookup → local delivery */
        → ip4-local           /* dst is local → demux by proto */
          → tcp4-input        /* TCP header validation */
            → tcp4-established /* ESTABLISHED state: enqueue to rx_fifo */
            → tcp4-syn-sent    /* SYN_SENT state: process SYN-ACK */
            → tcp4-rcv-process /* other states: FIN, RST processing */
              → session-queue  /* notify app: data available on rx_fifo */

/* TX path (app writes to tx_fifo) */
session-queue                  /* reads from tx_fifo */
  → tcp4-output                /* build TCP segment, set headers */
    → ip4-rewrite              /* L3 rewrite */
      → dpdk-output

The session-queue node is the bridge between the session layer and the graph. On the RX side it notifies the application of new data. On the TX side it reads from the TX FIFO and passes data to TCP for segmentation and transmission.

⏱️

TCP Timers - Timer Wheel Integration

INTERNALS

VPP's TCP uses the tw_timer wheel infrastructure from vppinfra. Each TCP connection can have multiple concurrent timers (RTO, persist, keepalive, TIME_WAIT). The timer wheel fires callbacks at O(1) per tick regardless of how many active timers exist.

/* Timer types per connection */
typedef enum {
    TCP_TIMER_RETRANSMIT = 0,  /* RTO: retransmit if no ACK received */
    TCP_TIMER_DELACK,          /* delayed ACK: batch ACKs for efficiency */
    TCP_TIMER_PERSIST,         /* zero-window probe */
    TCP_TIMER_KEEPALIVE,       /* detect dead connections */
    TCP_TIMER_WAITCLOSE,       /* TIME_WAIT expiry */
    TCP_TIMER_RETRANSMIT_SYN,  /* SYN retransmit before connection est. */
    TCP_N_TIMERS,
} tcp_timers_e;

/* Starting a timer (inside TCP processing) */
tcp_timer_set(tc, TCP_TIMER_RETRANSMIT,
              clib_max(tc->rto * TCP_TO_TIMER_TICK, 1));
/* tc->rto is in ms; TCP_TO_TIMER_TICK converts to wheel ticks */

/* Timer callback (fires on expiry) */
static void tcp_timer_retransmit_handler(u32 conn_index) {
    tcp_connection_t *tc = tcp_connection_get(conn_index, vlib_get_thread_index());
    /* double RTO (exponential backoff), retransmit */
    tc->rto = clib_min(tc->rto << 1, TCP_RTO_MAX);
    tcp_retransmit_first_unacked(tc);
}

VCL - VPP COMMUNICATIONS LIBRARY - src/vcl/

🔌

What VCL Is and Why It Exists

OVERVIEW

VCL is the application-side library that hides all VPP host stack complexity behind a clean API. Without VCL, an application would need to: implement the Binary API wire format, manage its own shared memory segment mappings, handle FIFO enqueue/dequeue directly, and implement epoll on top of FIFO state. VCL does all of this.

VCL provides two integration modes:

  • Native VCL API: Application calls vppcom_* functions directly. Maximum control and performance. Requires modifying the application.
  • LD_PRELOAD (ldp.c): VCL intercepts POSIX socket calls at the dynamic linker level. Zero code changes needed - but some advanced socket features (like SO_REUSEPORT) may not be supported.
📋

Native VCL API Reference

API
VCL Function                POSIX Equivalent   Notes
vppcom_app_create()         -                  Attach to VPP via the Binary API. Must be called first.
vppcom_session_create()     socket()           Allocate a VCL session object. Returns a session handle (integer).
vppcom_session_bind()       bind()             Register a local endpoint with the session layer.
vppcom_session_listen()     listen()           Register a listen session in the local session table.
vppcom_session_accept()     accept()           Dequeue the next accepted connection. Blocks until one arrives (or non-blocking).
vppcom_session_connect()    connect()          Trigger a TCP SYN via the Binary API; waits for the session-established event.
vppcom_session_read()       read()             Dequeue bytes from the RX FIFO. Zero-copy if using peek + advance.
vppcom_session_write()      write()            Enqueue bytes into the TX FIFO. Returns bytes written.
vppcom_epoll_create()       epoll_create()     Create an epoll handle backed by VPP session events.
vppcom_epoll_ctl()          epoll_ctl()        Add/remove/modify session monitoring.
vppcom_epoll_wait()         epoll_wait()       Block until events (EPOLLIN, EPOLLOUT) on any monitored session.
vppcom_session_close()      close()            Close the session; sends a TCP FIN if established.
/* Minimal VCL server skeleton */
vppcom_app_create("my-server");

int ls = vppcom_session_create(VPPCOM_PROTO_TCP, 0 /* is_nonblocking */);
vppcom_session_bind(ls, &ep);     /* ep = { .is_ip4=1, .ip=..., .port=8080 } */
vppcom_session_listen(ls, 10);

while (1) {
    int cs = vppcom_session_accept(ls, &client_ep, 0);
    /* cs is the connected session handle */

    char buf[4096];
    int n = vppcom_session_read(cs, buf, sizeof(buf));
    /* buf now contains TCP payload - read directly from RX FIFO */
    
    vppcom_session_write(cs, response, resp_len);
    vppcom_session_close(cs);
}
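A matching client skeleton, for completeness - server_ep setup and error handling are elided, just as in the server example above:

/* Minimal VCL client skeleton */
vppcom_app_create("my-client");

int cs = vppcom_session_create(VPPCOM_PROTO_TCP, 0 /* is_nonblocking */);
vppcom_session_connect(cs, &server_ep);  /* server_ep = { .is_ip4=1, .ip=..., .port=8080 } */
/* Blocking connect returns once the session is ESTABLISHED and FIFOs exist */

vppcom_session_write(cs, request, req_len);   /* enqueue into TX FIFO */

char buf[4096];
int n = vppcom_session_read(cs, buf, sizeof(buf));  /* dequeue straight from RX FIFO */

vppcom_session_close(cs);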
🎭

LD_PRELOAD - Zero-Code-Change Integration

LD_PRELOAD

The LD_PRELOAD library (src/vcl/ldp.c) wraps every relevant POSIX socket function. When the dynamic linker loads a program, it loads this library first, so calls to connect(), send(), recv() etc. hit the VCL wrapper, not libc.

# Run nginx against VPP's stack - no code changes to nginx
export VCL_CONFIG=/etc/vpp/vcl.conf
LD_PRELOAD=/usr/lib/libvcl_ldpreload.so nginx -g "daemon off;"

# Run iperf3 as server against VPP's stack
LD_PRELOAD=/usr/lib/libvcl_ldpreload.so iperf3 -s -p 5201

# Run iperf3 as client, connecting to a VPP host-stack server
LD_PRELOAD=/usr/lib/libvcl_ldpreload.so iperf3 -c 10.0.0.1 -p 5201 -t 10
/* How the interception works (ldp.c, simplified) */
/* The shim re-defines connect() with the same signature as libc */
int connect(int fd, const struct sockaddr *addr, socklen_t len) {
    if (ldp_is_vcl_session(fd)) {
        /* fd belongs to VCL - translate addr into a vppcom endpoint
           (conversion omitted) and use VPP's stack */
        vppcom_endpt_t ep = { 0 };   /* filled in from addr */
        return vppcom_session_connect(fd - LDP_SID_BIT, &ep);
    } else {
        /* Regular fd - fall through to libc */
        return libc_connect(fd, addr, len);
    }
}

💡 Hybrid mode: LD_PRELOAD supports a hybrid model - sockets created for non-network purposes (local Unix sockets, pipes, files) continue to use the kernel. Only sockets on the configured IP address/port ranges are redirected to VPP. This allows apps that mix network I/O and file I/O to work without modification.

STEP-BY-STEP: SESSION ESTABLISHMENT & DATA TRANSFER

🤝

Phase 1 - App Attachment

SETUP

Before any session can be created, both client and server must attach to VPP:

App calls vppcom_app_create()
VCL sends an app_attach Binary API message to VPP. VPP creates an application_t object, assigns a namespace, and allocates a shared memory segment for this app's sessions.
Binary API → app_attach_reply
VPP returns segment fd
VPP replies with the shared memory segment descriptor. VCL calls ssvm_slave_init_shm() to mmap the segment into the app's address space. Now both processes share the same physical pages.
mmap(segment_fd) → shared memory mapped
📞

Phase 2 - Session Establishment (TCP Handshake)

CONNECT
Participants: Server App · VPP (server side) · Network · VPP (client side) · Client App

1. Server app: bind + listen → server-side VPP registers an entry in its local session table.
2. Client app: connect() → client-side VPP allocates a TCP connection and sends a SYN.
3. ── SYN ──→ server-side VPP enters SYN_RCVD and replies with a SYN-ACK.
4. ←─ SYN-ACK ── client-side VPP completes the handshake with an ACK and moves to ESTABLISHED.
5. ── ACK ──→ server-side VPP moves to ESTABLISHED; both sides allocate RX/TX FIFOs in their app's segment.
6. Server-side VPP notifies the server app of the new session: accept() returns. On the client, connect() returns.

Both apps now hold a session handle. Two FIFOs (RX + TX) are live in shared memory for each side.

Key point: the TCP handshake (SYN/SYN-ACK/ACK) is handled entirely inside VPP's graph nodes. The application is not involved until the handshake completes. Only then does VPP allocate the FIFOs and notify the app via an event message on the Binary API channel.

📦

Phase 3 - Data Transfer

DATA PATH
App writes to TX FIFO
App calls vppcom_session_write(cs, buf, len). VCL calls svm_fifo_enqueue(tx_fifo, len, buf) - a pure memory copy into the shared region. No syscall, no kernel crossing.
svm_fifo_enqueue(tx_fifo)
App sends TX write event
VCL sends a SESSION_IO_EVT_TX event to VPP's worker thread via the session event queue (a shared memory MPSC queue). This wakes up the session-queue node to process the TX FIFO.
session_event_queue → SESSION_IO_EVT_TX
VPP reads TX FIFO, builds TCP segments
The session-queue node dequeues the TX event, reads bytes from the TX FIFO, passes them to TCP output, which builds TCP segments respecting MSS, window size, and congestion window. Segments enter the graph at tcp4-output.
svm_fifo_dequeue(tx_fifo) → tcp4-output → dpdk-output
Remote VPP receives, enqueues to RX FIFO
On the receiving side, tcp4-established processes the segment, enqueues the payload directly into the session's RX FIFO, and sends a SESSION_IO_EVT_RX event to the app.
svm_fifo_enqueue(rx_fifo) → SESSION_IO_EVT_RX
Remote app reads from RX FIFO
App's epoll_wait wakes on the RX event. App calls vppcom_session_read() → svm_fifo_dequeue(rx_fifo). Data is now in the app's buffer. Total copies: 1 (FIFO → app buffer). Zero kernel crossings.
svm_fifo_dequeue(rx_fifo) → app buffer
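To tie the RX notification path back to application code, here is a sketch of an event-driven VCL server loop using the epoll shim from the API table above. Listen-session setup and error handling are omitted, and the epoll constants come from <sys/epoll.h>; treat it as a fragment in the same style as the earlier skeletons:

#include <stdint.h>
#include <sys/epoll.h>

/* Sketch: app-side event loop over VCL sessions */
int epfd = vppcom_epoll_create();

struct epoll_event ev = { .events = EPOLLIN, .data.u32 = listen_handle };
vppcom_epoll_ctl(epfd, EPOLL_CTL_ADD, listen_handle, &ev);

while (1) {
    struct epoll_event events[32];
    int n = vppcom_epoll_wait(epfd, events, 32, 10.0 /* wait_for_time */);
    for (int i = 0; i < n; i++) {
        uint32_t sh = events[i].data.u32;
        if (sh == (uint32_t) listen_handle) {
            /* New connection: accept and start watching it for data */
            int cs = vppcom_session_accept(listen_handle, &client_ep, 0);
            ev.events = EPOLLIN;
            ev.data.u32 = cs;
            vppcom_epoll_ctl(epfd, EPOLL_CTL_ADD, cs, &ev);
        } else {
            /* Data ready: dequeue straight from the session's RX FIFO */
            char buf[4096];
            int len = vppcom_session_read(sh, buf, sizeof(buf));
            /* ... process len bytes ... */
        }
    }
}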

CUT-THROUGH (REDIRECTED) CONNECTIONS

What is a Cut-Through Connection?

CONCEPT

A cut-through connection (also called "redirect" in the VPP source) takes the host stack to its logical extreme. When both the client and server are applications talking to the same VPP instance, VPP can skip TCP entirely for the data path and connect the two apps' shared memory FIFOs directly.

The flow works like this:

  • Server app calls bind + listen but also sends a redirect message indicating it wants cut-through connections.
  • Client app calls connect. VPP's session layer sees that both endpoints are local applications.
  • VPP redirects the connection: the client's TX FIFO becomes the server's RX FIFO. They share the same memory region.
  • Data written by the client appears in the server's buffer with zero copies and zero TCP overhead.

The throughput ceiling is no longer CPU or NIC speed - it is memory bandwidth (typically 80–150 GB/s on modern systems), which gives the ~120 Gbps figure quoted in the presentation.
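For VCL applications, cut-through requires both endpoints to register in the session layer's local scope. A hedged vcl.conf fragment - app-scope-local / app-scope-global as commonly documented; exact names and behavior may vary by release:

# vcl.conf fragment - both local apps need local scope for cut-through
vcl {
  app-scope-local     # register sessions in the namespace-local table (enables cut-through)
  app-scope-global    # also keep normal TCP reachability for remote peers
}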

🔵

Normal TCP Session

COMPARISON

App → TX FIFO → TCP segment → IP → NIC TX → wire → NIC RX → IP → TCP reassembly → RX FIFO → App

Each direction: 1 copy + TCP processing (header parsing, ACK, window management, congestion control)

Performance: ~8 Gbps/core

🟢

Cut-Through Connection

COMPARISON

App → shared FIFO → App

No TCP headers. No IP routing. No NIC. No ACK. No congestion control. Pure shared memory reads and writes.

Performance: ~120 Gbps (memory BW limited)

💡 When is cut-through useful? Any time two services on the same host need to pass large volumes of data between each other: a proxy and an origin server, a load balancer and an application, two stages of a data processing pipeline. In container/pod deployments where both ends run on the same physical node, cut-through gives you essentially in-process performance over a network-like API.

MULTI-THREADING MODEL

🧵

Session Layer with Multiple Worker Threads

THREADING

VPP's session layer follows the same per-worker model as the rest of VPP. Each worker thread owns a set of sessions - the sessions whose TCP connections are RSS-hashed to that worker's NIC RX queue. A session's FIFOs are only accessed by its owning worker on the VPP side, and by the application on the app side.

/* Session ownership - pinned to worker by RSS hash */
/* Worker 0 owns sessions hashed to NIC queue 0 */
/* Worker 1 owns sessions hashed to NIC queue 1 */
/* etc. */

/* Per-thread session pools */
session_t **sessions_by_thread;    /* sessions_by_thread[thread_idx] = pool */
tcp_connection_t **connections;    /* per-thread TCP connection pool */

/* Per-worker timer wheels - no shared state */
tw_timer_wheel_2t_1w_2048sl_t *timer_wheels; /* one per worker */

/* App event queues - one per app per thread */
/* VPP workers post RX/TX events here; app polls them */
svm_msg_q_t *app_event_queue[MAX_THREADS];

The multi-app, multi-thread picture (from the slide deck):

  • Core 0: App1 process + VPP TCP/IP/Session for App1's sessions
  • Core 1: Additional VPP worker handling different sessions (different NIC queue)
  • App1 and its VPP sessions may span both cores if RSS distributes its flows to both queues
  • Each core has its own FIFO pairs - no locking between cores on the data path
📬

App Event Queue - Cross-Boundary Notification

EVENTS

The app event queue (also in shared memory) is the notification channel from VPP workers to the application. It is an MPSC queue (multiple VPP workers can post, one app reads) - so it requires atomic operations, unlike the SPSC FIFOs.

typedef struct {
    u32  session_index;
    u8   event_type;   /* SESSION_IO_EVT_RX / TX / CLOSE / etc. */
} session_event_t;     /* simplified - the real struct carries more fields */

/* VPP worker posts an event when rx_fifo has new data:
   allocate a message slot in the shared-memory queue, fill it, then add it */
svm_msg_q_msg_t msg = svm_msg_q_alloc_msg(app->event_queue, sizeof(session_event_t));
session_event_t *evt = svm_msg_q_msg_data(app->event_queue, &msg);
evt->session_index = s->session_index;
evt->event_type    = SESSION_IO_EVT_RX;
svm_msg_q_add(app->event_queue, &msg, SVM_Q_NOWAIT);

/* App (VCL) polls for events */
while (1) {
    svm_msg_q_msg_t msg;
    if (svm_msg_q_sub(eq, &msg, SVM_Q_NOWAIT, 0) == 0) {
        session_event_t *e = svm_msg_q_msg_data(eq, &msg);
        if (e->event_type == SESSION_IO_EVT_RX)
            notify_app_readable(e->session_index);
        svm_msg_q_free_msg(eq, &msg);
    }
}

HOST STACK MASTERY CHECKLIST

✅ Host Stack module complete. Suggested next steps: run VCL iperf3 against a VPP instance (LD_PRELOAD=libvcl_ldpreload.so iperf3), inspect show session verbose and show tcp statistics while traffic is flowing, then explore src/vnet/tcp/tcp_input.c to trace a SYN through the state machine.
