NETWORKING MASTERY · PHASE 4 · MODULE 16 · WEEK 14
🔮 eBPF and XDP
eBPF virtual machine · BPF maps · Verifier · XDP hook · TC eBPF · AF_XDP · bpftool and libbpf
Advanced · Prerequisite: M14 Linux Stack · Kernel 5.x+ · Modern Networking Paradigm · 2 Labs

eBPF — PROGRAMMABLE KERNEL WITHOUT KERNEL MODULES

🔮

What eBPF Is

OVERVIEW

eBPF (extended Berkeley Packet Filter) is a revolutionary Linux kernel technology that allows you to run sandboxed programs inside the kernel without writing kernel modules or rebooting. eBPF programs are loaded from userspace, verified for safety by the kernel verifier, JIT-compiled to native machine code, and attached to hook points throughout the kernel.

Why eBPF transformed networking:

  • Performance — XDP eBPF programs run in the NIC driver, before sk_buff allocation. Drop speed: ~100ns per packet vs ~1µs in iptables
  • Safety — the verifier proves the program terminates, accesses only valid memory, and doesn't crash the kernel. Safer than kernel modules
  • Programmability — change packet processing logic at runtime without kernel recompile or reboot. Deploy new features in seconds
  • Observability — instrument any kernel function without overhead of traditional probes; used by tools like bpftrace, Cilium, Falco, Pixie

Who uses eBPF in production: Cloudflare uses XDP to drop DDoS traffic at 100+ Gbps. Facebook uses eBPF for load balancing (Katran). Google uses it for security policy enforcement. Cilium uses eBPF to replace iptables in Kubernetes.

📍

eBPF Hook Points in the Network Stack

HOOKS
Hook Type         | Location                      | Performance        | Capabilities
------------------|-------------------------------|--------------------|------------------------------------------------------------
XDP (Native)      | NIC driver, before sk_buff    | ~10-30 Mpps/core   | DROP, PASS, TX, REDIRECT. Modify packet bytes. No sk_buff access.
XDP (Generic)     | After sk_buff allocation      | ~5-10 Mpps/core    | Same actions; works on any NIC (no driver support needed)
TC (ingress)      | After sk_buff, before routing | ~5 Mpps/core       | Full sk_buff access, conntrack, modify headers, redirect to other interfaces
TC (egress)       | After routing, before NIC     | ~5 Mpps/core       | Modify outgoing packets, traffic shaping, redirect
socket filter     | Socket recv path              | Per-socket         | Filter which packets are delivered to a socket (classic tcpdump use)
cgroup/sock       | Per-cgroup socket operations  | Per-operation      | Control network access per container/cgroup (Cilium network policy)
kprobe/tracepoint | Any kernel function           | Observability only | Read kernel data structures, send to userspace via maps

eBPF ARCHITECTURE — VM, VERIFIER, JIT

🏗️

eBPF Virtual Machine

ARCHITECTURE
/* eBPF ISA (Instruction Set Architecture) */
64-bit RISC architecture
11 64-bit registers:
  r0:  return value (from helper calls and the program itself)
  r1-r5: function arguments (calling convention)
  r6-r9: callee-saved (preserved across helper calls)
  r10: read-only frame pointer (stack base)

512 bytes of stack space per eBPF program
Pointer arithmetic allowed but bounds-checked by verifier
No unbounded loops (kernel ≥5.3 allows bounded loops)
Max instruction count: 1 million (kernel ≥5.2)

/* eBPF program lifecycle */

1. Write eBPF program in C with restricted syntax
   (no unbounded loops or arbitrary kernel calls; BPF-to-BPF
    calls and global variables are supported on modern kernels)

2. Compile with clang + libbpf:
   clang -O2 -target bpf -c prog.c -o prog.o

3. Load into kernel via bpf() syscall:
   bpf(BPF_PROG_LOAD, &attr, sizeof(attr))

4. Verifier validates:
   - All code paths terminate (DAG, no infinite loops)
   - All memory accesses in bounds
   - Helper function signatures correct
   - Pointer arithmetic safe
   If verification fails: EACCES/EINVAL with verifier log

5. JIT compiler: eBPF bytecode → native x86-64 machine code
   Zero interpretation overhead at runtime

6. Attach to hook point:
   XDP: bpf_xdp_attach(ifindex, prog_fd, flags, NULL)
        (libbpf <0.8: bpf_set_link_xdp_fd(ifindex, prog_fd, flags))
   TC:  tc filter add dev eth0 ingress bpf obj prog.o

7. Program executes for every packet at hook point
   Returns action code (XDP_DROP, XDP_PASS, etc.)

/* eBPF helper functions */
# eBPF programs cannot call arbitrary kernel functions
# They call only whitelisted "helper functions"
bpf_map_lookup_elem()   # lookup in BPF map
bpf_map_update_elem()   # update BPF map
bpf_redirect()          # redirect packet to another interface
bpf_xdp_adjust_head()   # push/pop bytes at packet head
bpf_ktime_get_ns()      # current timestamp
bpf_trace_printk()      # debug print to /sys/kernel/debug/tracing/trace_pipe
bpf_perf_event_output() # send events to userspace

BPF MAPS — KERNEL-USERSPACE SHARED STATE

🗺️

BPF Map Types

MAPS

BPF maps are the primary mechanism for state sharing: eBPF programs (running in kernel) and userspace applications both access the same map. This enables per-flow counters, blocklists, connection tables, and configuration without stopping the packet processor.

/* BPF map types */
BPF_MAP_TYPE_HASH:       Hash table. Key→value lookup. Most common.
BPF_MAP_TYPE_ARRAY:      Fixed-size indexed array. Access by index.
BPF_MAP_TYPE_LPM_TRIE:   Longest Prefix Match — IP prefix tables (requires BPF_F_NO_PREALLOC)
BPF_MAP_TYPE_PERCPU_HASH: Per-CPU hash (no lock contention)
BPF_MAP_TYPE_PERF_EVENT_ARRAY: Send events to userspace perf ring
BPF_MAP_TYPE_RINGBUF:    Lock-free ring buffer (kernel 5.8+)
BPF_MAP_TYPE_DEVMAP:     Interface index map for XDP_REDIRECT

/* Defining a map in eBPF C program */
struct {
    __uint(type,        BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key,         __u32);     /* src IP */
    __type(value,       __u64);     /* packet count */
} pkt_count SEC(".maps");

/* Using the map in eBPF program */
__u32 src_ip = iph->saddr;
__u64 *count = bpf_map_lookup_elem(&pkt_count, &src_ip);
if (count)
    __sync_fetch_and_add(count, 1);
else {
    __u64 one = 1;
    bpf_map_update_elem(&pkt_count, &src_ip, &one, BPF_ANY);
}

/* Reading map from userspace (libbpf) */
struct bpf_object *obj = bpf_object__open("prog.o");
bpf_object__load(obj);
struct bpf_map *map = bpf_object__find_map_by_name(obj, "pkt_count");
int map_fd = bpf_map__fd(map);

__u32 key = inet_addr("192.168.1.5");
__u64 value;
if (bpf_map_lookup_elem(map_fd, &key, &value) == 0)
    printf("Packets from 192.168.1.5: %llu\n", value);

/* BPF LPM trie for IP blocklist */
struct lpm_key {
    __u32 prefixlen;
    __u8  data[4];  /* IPv4 address */
};
/* Insert 192.168.0.0/16 → drop */
struct lpm_key key16 = { .prefixlen = 16, .data = {192, 168, 0, 0} };
__u32 action = XDP_DROP;
bpf_map_update_elem(lpm_fd, &key16, &action, BPF_ANY);
/* Any packet with src in 192.168.0.0/16 matches! */

XDP PROGRAMMING — PACKET PROCESSING AT WIRE SPEED

Complete XDP Program — IP Firewall

XDP
// xdp_firewall.c — drop packets from blocked IPs using BPF hash map
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Map: blocked source IPs → 1 */
struct {
    __uint(type,        BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key,         __u32);  /* IPv4 src addr */
    __type(value,       __u8);   /* 1 = blocked */
} blocklist SEC(".maps");

/* Map: per-IP packet counters */
struct {
    __uint(type,        BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key,         __u32);
    __type(value,       __u64);
} pkt_stats SEC(".maps");

SEC("xdp")
int xdp_firewall_prog(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data     = (void *)(long)ctx->data;

    /* Parse Ethernet header */
    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end)
        return XDP_DROP;  /* malformed — drop */

    /* Only handle IPv4 */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    /* Parse IP header */
    struct iphdr *iph = data + sizeof(*eth);
    if (data + sizeof(*eth) + sizeof(*iph) > data_end)
        return XDP_DROP;

    __u32 src = iph->saddr;

    /* Update per-IP packet counter */
    __u64 *stat = bpf_map_lookup_elem(&pkt_stats, &src);
    if (stat) {
        __sync_fetch_and_add(stat, 1);
    } else {
        __u64 one = 1;
        bpf_map_update_elem(&pkt_stats, &src, &one, BPF_NOEXIST);
    }

    /* Check blocklist */
    __u8 *blocked = bpf_map_lookup_elem(&blocklist, &src);
    if (blocked && *blocked == 1)
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

/* Compile and load */
// clang -O2 -target bpf -c xdp_firewall.c -o xdp_firewall.o
// ip link set dev eth0 xdp obj xdp_firewall.o sec xdp
// ip link set dev eth0 xdp off  # detach

TC eBPF — FULL STACK ACCESS WITH sk_buff

🚦

TC BPF vs XDP

TC BPF

TC (traffic control) eBPF programs run later in the stack than XDP — after sk_buff allocation. This gives them access to richer metadata: conntrack state, socket information, routing decisions, VLAN tags. They can also generate new packets and redirect to sockets.

/* TC BPF key differences from XDP */

Access to sk_buff → can read:
  - skb->mark, skb->priority (for QoS)
  - skb->sk (associated socket — if known)
  - tunnel metadata (via helper bpf_skb_get_tunnel_key)
  - conntrack state (via CT lookup kfuncs on recent kernels)
  - Full packet headers (same as XDP) + can modify them
  - Can call bpf_sk_lookup_tcp() to find socket

Return values (different from XDP!):
  TC_ACT_OK (0):       pass to next TC filter/action
  TC_ACT_SHOT (2):     drop packet
  TC_ACT_REDIRECT (7): redirect to another interface or socket
  TC_ACT_STOLEN (4):   take ownership (used for skb→socket delivery)

/* TC BPF for packet marking (QoS) */
SEC("tc")
int mark_voip(struct __sk_buff *skb) {
    void *data_end = (void *)(long)skb->data_end;
    void *data     = (void *)(long)skb->data;
    struct iphdr *iph = data + sizeof(struct ethhdr);
    if ((__u8 *)iph + sizeof(*iph) > (__u8 *)data_end)
        return TC_ACT_OK;
    struct udphdr *udp = (void *)iph + iph->ihl * 4;
    if ((__u8 *)udp + sizeof(*udp) > (__u8 *)data_end)
        return TC_ACT_OK;
    /* Mark SIP (UDP 5060) and RTP (ports 10000-20000) for EF DSCP */
    __u16 dport = bpf_ntohs(udp->dest);
    if (iph->protocol == IPPROTO_UDP &&
        (dport == 5060 || (dport >= 10000 && dport <= 20000))) {
        /* Rewrite the IP tos byte, then fix the IPv4 header checksum */
        __u8 old_tos = iph->tos;
        __u8 new_tos = 46 << 2;  /* DSCP EF = 46, ECN bits clear */
        bpf_skb_store_bytes(skb, sizeof(struct ethhdr) +
                            offsetof(struct iphdr, tos),
                            &new_tos, 1, 0);
        bpf_l3_csum_replace(skb, sizeof(struct ethhdr) +
                            offsetof(struct iphdr, check),
                            old_tos, new_tos, 2);
    }
    return TC_ACT_OK;
}

/* Attach TC eBPF */
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 ingress bpf obj tc_qos.o sec tc direct-action
# tc filter show dev eth0 ingress

AF_XDP — ZERO-COPY USERSPACE PACKET PROCESSING

AF_XDP Architecture

AF_XDP

AF_XDP is a socket type that allows userspace applications to receive and send packets directly from/to NIC memory with zero kernel copies. Unlike DPDK, AF_XDP keeps the NIC under kernel control — only selected packet queues are redirected to userspace.

/* AF_XDP architecture */

NIC Queue N → [XDP program runs in driver] → XDP_REDIRECT → AF_XDP socket
NIC Queue 0 → [passes to kernel network stack normally]

/* UMEM — userspace memory region registered with kernel */
void *umem_area = mmap(NULL, UMEM_SIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

struct xsk_umem *umem;
xsk_umem__create(&umem, umem_area, UMEM_SIZE, &fill_ring, &comp_ring, NULL);

/* Four rings between kernel and userspace */
Fill ring   (userspace → kernel): "here are free buffers you can fill with RX packets"
Completion ring (kernel → userspace): "here are TX buffers I'm done with"
RX ring     (kernel → userspace): "here are received packets"
TX ring     (userspace → kernel): "here are packets to transmit"

/* Receive loop */
while (1) {
    rcvd = xsk_ring_cons__peek(&sock->rx, BATCH, &idx_rx);
    for (i = 0; i < rcvd; i++) {
        addr = xsk_ring_cons__rx_desc(&sock->rx, idx_rx + i)->addr;
        len  = xsk_ring_cons__rx_desc(&sock->rx, idx_rx + i)->len;
        pkt  = xsk_umem__get_data(sock->umem->buffer, addr);
        /* pkt points directly to NIC DMA buffer — zero copy! */
        process_packet(pkt, len);
    }
    xsk_ring_cons__release(&sock->rx, rcvd);
    /* Refill fill ring so kernel has buffers for next batch */
    replenish_fill_ring(sock, rcvd);
}

/* XDP program to steer traffic to AF_XDP socket */
struct {
    __uint(type,        BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, MAX_QUEUES);
    __type(key,         __u32);
    __type(value,       __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_to_xsk(struct xdp_md *ctx) {
    __u32 queue = ctx->rx_queue_index;
    if (bpf_map_lookup_elem(&xsks_map, &queue))
        return bpf_redirect_map(&xsks_map, queue, XDP_PASS);
    return XDP_PASS;
}

eBPF TOOLING — bpftool, libbpf, bpftrace

🔧

Essential eBPF Tools

TOOLING
/* bpftool — Swiss Army knife for eBPF */

# List all loaded eBPF programs
bpftool prog list
bpftool prog show id 42

# Dump eBPF bytecode (disassemble)
bpftool prog dump xlated id 42

# Show JIT-compiled machine code
bpftool prog dump jited id 42

# List all BPF maps
bpftool map list
bpftool map dump id 7         # dump all entries
bpftool map lookup id 7 key 0x01 0x02 0x03 0x04   # lookup specific key
bpftool map update id 7 key 0x01 0x02 0x03 0x04 value 0x01  # add to blocklist

# Show XDP programs attached to interfaces
bpftool net list
ip link show  # also shows "xdp" flag if XDP is attached

# Perf output from bpf_trace_printk()
cat /sys/kernel/debug/tracing/trace_pipe

/* bpftrace — high-level eBPF tracing language */

# Trace every TCP connection
bpftrace -e 'kprobe:tcp_connect { printf("connect: pid=%d\n", pid); }'

# Count received packets per device
bpftrace -e 'tracepoint:net:netif_receive_skb { @[str(args->name)] = count(); }'

# Track kernel networking function latency
bpftrace -e '
kprobe:ip_rcv { @start[tid] = nsecs; }
kretprobe:ip_rcv /@start[tid]/ {
  @latency = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'

/* Cilium's eBPF-based Kubernetes networking */
# cilium status — health of eBPF programs
# cilium monitor — real-time packet events
# cilium bpf ct list global — connection tracking table
LAB 1

Write and Load Your First XDP Program

Objective: Write a functional XDP program that counts packets per source IP and drops packets from a blocklist.

1. Install prerequisites: sudo apt install clang llvm libbpf-dev linux-headers-$(uname -r) bpftool. Verify: clang --version (need 10+) and bpftool version.
2. Write xdp_counter.c with a BPF_MAP_TYPE_PERCPU_HASH for per-IP counters. Implement the XDP program to increment the counter for each source IP. Compile: clang -O2 -target bpf -c xdp_counter.c -o xdp_counter.o.
3. Attach to a test interface (use veth from M14 Lab 2): sudo ip link set veth0 xdp obj xdp_counter.o sec xdp. Verify attachment: ip link show veth0 should show an "xdp" flag. Generate traffic (ping) and read counters: sudo bpftool map dump name pkt_count.
4. Add a blocklist map. Write a userspace control program (C with libbpf) that opens the loaded BPF object, finds the blocklist map by name, inserts a test IP, and verifies pings from that IP are dropped. Use bpftool map update as an alternative.
LAB 2

bpftrace Network Observability

Objective: Use bpftrace to instrument the kernel network stack without writing eBPF C code.

1. Install bpftrace: sudo apt install bpftrace. Count received packets per device: sudo bpftrace -e 'tracepoint:net:netif_receive_skb { @[str(args->name)] = count(); }'. While it runs, generate traffic and observe the per-interface counts.
2. Trace the TCP connection lifecycle: sudo bpftrace -e 'kprobe:tcp_connect { printf("pid=%d comm=%s\n", pid, comm); }'. Open several websites in a browser — you should see a connect event for each. Extend the script to also trace tcp_close.
3. Measure an ip_rcv latency histogram using the kprobe/kretprobe pattern from the TOOLING section. Run it while doing iperf3 and output the histogram. Identify the median and 99th-percentile kernel processing time per packet.

M16 MASTERY CHECKLIST

When complete: Move to M17 - High-Performance Networking with DPDK — your existing DPDK knowledge plus this eBPF foundation prepares you for the deepest performance engineering content in the curriculum.
