eBPF — PROGRAMMABLE KERNEL WITHOUT KERNEL MODULES
What eBPF Is
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that lets you run sandboxed programs inside the kernel without writing kernel modules or rebooting. eBPF programs are loaded from userspace, verified for safety by the kernel verifier, JIT-compiled to native machine code, and attached to hook points throughout the kernel.
Why eBPF transformed networking:
- Performance — XDP eBPF programs run in the NIC driver, before sk_buff allocation. Drop speed: ~100ns per packet vs ~1µs in iptables
- Safety — the verifier proves the program terminates, accesses only valid memory, and doesn't crash the kernel. Safer than kernel modules
- Programmability — change packet processing logic at runtime without kernel recompile or reboot. Deploy new features in seconds
- Observability — instrument any kernel function without overhead of traditional probes; used by tools like bpftrace, Cilium, Falco, Pixie
Who uses eBPF in production: Cloudflare uses XDP to drop DDoS traffic at 100+ Gbps. Facebook uses eBPF for load balancing (Katran). Google uses it for security policy enforcement. Cilium uses eBPF to replace iptables in Kubernetes.
eBPF Hook Points in the Network Stack
| Hook Type | Location | Performance | Capabilities |
|---|---|---|---|
| XDP (Native) | NIC driver, before sk_buff | ~10-30 Mpps/core | DROP, PASS, TX, REDIRECT. Modify packet bytes. No sk_buff access. |
| XDP (Generic) | After sk_buff allocation | ~5-10 Mpps/core | Same actions; works on any NIC (no driver support needed) |
| TC (ingress) | After sk_buff, before routing | ~5 Mpps/core | Full sk_buff access, conntrack, modify headers, redirect to other interfaces |
| TC (egress) | After routing, before NIC | ~5 Mpps/core | Modify outgoing packets, traffic shaping, redirect |
| socket filter | Socket recv path | Per-socket | Filter which packets delivered to socket (classic tcpdump use) |
| cgroup/sock | Per-cgroup socket operations | Per-operation | Control network access per container/cgroup (Cilium network policy) |
| kprobe/tracepoint | Any kernel function | Observability only | Read kernel data structures, send to userspace via maps |
eBPF ARCHITECTURE — VM, VERIFIER, JIT
eBPF Virtual Machine
```
/* eBPF ISA (Instruction Set Architecture) */
64-bit RISC architecture
11 64-bit registers:
    r0:    return value / function return
    r1-r5: function arguments (calling convention)
    r6-r9: callee-saved (preserved across helper calls)
    r10:   read-only frame pointer (stack base)
512 bytes of stack space per eBPF program
Pointer arithmetic allowed but bounds-checked by verifier
No unbounded loops (kernel ≥5.3 allows bounded loops)
Max instruction count: 1 million (kernel ≥5.2)

/* eBPF program lifecycle */
1. Write eBPF program in C with restricted syntax
   (No: user function calls, global vars, unbounded loops)
2. Compile with clang + libbpf:
   clang -O2 -target bpf -c prog.c -o prog.o
3. Load into kernel via bpf() syscall:
   bpf(BPF_PROG_LOAD, &attr, sizeof(attr))
4. Verifier validates:
   - All code paths terminate (DAG, no infinite loops)
   - All memory accesses in bounds
   - Helper function signatures correct
   - Pointer arithmetic safe
   If verification fails: EACCES/EINVAL with verifier log
5. JIT compiler: eBPF bytecode → native x86-64 machine code
   Zero interpretation overhead at runtime
6. Attach to hook point:
   XDP: bpf_set_link_xdp_fd(ifindex, prog_fd, flags)
   TC:  tc filter add dev eth0 ingress bpf obj prog.o
7. Program executes for every packet at hook point
   Returns action code (XDP_DROP, XDP_PASS, etc.)

/* eBPF helper functions */
# eBPF programs cannot call arbitrary kernel functions
# They call only whitelisted "helper functions"
bpf_map_lookup_elem()    # lookup in BPF map
bpf_map_update_elem()    # update BPF map
bpf_redirect()           # redirect packet to another interface
bpf_xdp_adjust_head()    # push/pop bytes at packet head
bpf_ktime_get_ns()       # current timestamp
bpf_trace_printk()       # debug print to /sys/kernel/debug/tracing/trace_pipe
bpf_perf_event_output()  # send events to userspace
```
BPF MAPS — KERNEL-USERSPACE SHARED STATE
BPF Map Types
BPF maps are the primary mechanism for state sharing: eBPF programs (running in kernel) and userspace applications both access the same map. This enables per-flow counters, blocklists, connection tables, and configuration without stopping the packet processor.
```
/* BPF map types */
BPF_MAP_TYPE_HASH:             Hash table. Key→value lookup. Most common.
BPF_MAP_TYPE_ARRAY:            Fixed-size indexed array. Access by index.
BPF_MAP_TYPE_LPM_TRIE:         Longest Prefix Match. For IP prefix tables!
BPF_MAP_TYPE_PERCPU_HASH:      Per-CPU hash (no lock contention)
BPF_MAP_TYPE_PERF_EVENT_ARRAY: Send events to userspace perf ring
BPF_MAP_TYPE_RINGBUF:          Lock-free ring buffer (kernel 5.8+)
BPF_MAP_TYPE_DEVMAP:           Interface index map for XDP_REDIRECT

/* Defining a map in an eBPF C program */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);    /* src IP */
    __type(value, __u64);  /* packet count */
} pkt_count SEC(".maps");

/* Using the map in the eBPF program */
__u32 src_ip = iph->saddr;
__u64 *count = bpf_map_lookup_elem(&pkt_count, &src_ip);
if (count) {
    __sync_fetch_and_add(count, 1);
} else {
    __u64 one = 1;
    bpf_map_update_elem(&pkt_count, &src_ip, &one, BPF_ANY);
}

/* Reading the map from userspace (libbpf) */
struct bpf_object *obj = bpf_object__open("prog.o");
bpf_object__load(obj);
struct bpf_map *map = bpf_object__find_map_by_name(obj, "pkt_count");
int map_fd = bpf_map__fd(map);

__u32 key = inet_addr("192.168.1.5");
__u64 value;
bpf_map_lookup_elem(map_fd, &key, &value);
printf("Packets from 192.168.1.5: %llu\n", value);

/* BPF LPM trie for an IP blocklist */
struct lpm_key {
    __u32 prefixlen;
    __u8  data[4];   /* IPv4 address */
};

/* Insert 192.168.0.0/16 → drop */
struct lpm_key key16 = { .prefixlen = 16, .data = {192, 168, 0, 0} };
__u32 action = XDP_DROP;
bpf_map_update_elem(lpm_fd, &key16, &action, BPF_ANY);
/* Any packet with src in 192.168.0.0/16 matches! */
```
XDP PROGRAMMING — PACKET PROCESSING AT WIRE SPEED
Complete XDP Program — IP Firewall
```c
// xdp_firewall.c — drop packets from blocked IPs using a BPF hash map
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Map: blocked source IPs → 1 */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);   /* IPv4 src addr */
    __type(value, __u8);  /* 1 = blocked */
} blocklist SEC(".maps");

/* Map: per-IP packet counters */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u64);
} pkt_stats SEC(".maps");

SEC("xdp")
int xdp_firewall_prog(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data     = (void *)(long)ctx->data;

    /* Parse Ethernet header */
    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end)
        return XDP_DROP;  /* malformed — drop */

    /* Only handle IPv4 */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    /* Parse IP header */
    struct iphdr *iph = data + sizeof(*eth);
    if (data + sizeof(*eth) + sizeof(*iph) > data_end)
        return XDP_DROP;

    __u32 src = iph->saddr;

    /* Update per-IP packet counter */
    __u64 *stat = bpf_map_lookup_elem(&pkt_stats, &src);
    if (stat) {
        __sync_fetch_and_add(stat, 1);
    } else {
        __u64 one = 1;
        bpf_map_update_elem(&pkt_stats, &src, &one, BPF_NOEXIST);
    }

    /* Check blocklist */
    __u8 *blocked = bpf_map_lookup_elem(&blocklist, &src);
    if (blocked && *blocked == 1)
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

/* Compile and load */
// clang -O2 -target bpf -c xdp_firewall.c -o xdp_firewall.o
// ip link set dev eth0 xdp obj xdp_firewall.o sec xdp
// ip link set dev eth0 xdp off   # detach
```
TC eBPF — FULL STACK ACCESS WITH sk_buff
TC BPF vs XDP
TC (traffic control) eBPF programs run later in the stack than XDP — after sk_buff allocation. This gives them access to richer metadata: conntrack state, socket information, routing decisions, VLAN tags. They can also generate new packets and redirect to sockets.
```
/* TC BPF key differences from XDP */
Access to sk_buff → can read:
  - skb->mark, skb->priority (for QoS)
  - skb->sk (associated socket — if known)
  - tunnel metadata (via helper bpf_skb_get_tunnel_key)
  - Full packet headers (same as XDP) + can modify them
  - Can call bpf_sk_lookup_tcp() to find a socket

Return values (different from XDP!):
  TC_ACT_OK (0):       pass to next TC filter/action
  TC_ACT_SHOT (2):     drop packet
  TC_ACT_STOLEN (4):   take ownership (used for skb→socket delivery)
  TC_ACT_REDIRECT (7): redirect to another interface or socket

/* TC BPF for packet marking (QoS) */
SEC("tc")
int mark_voip(struct __sk_buff *skb)
{
    void *data_end = (void *)(long)skb->data_end;
    void *data     = (void *)(long)skb->data;

    struct iphdr *iph = data + sizeof(struct ethhdr);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_OK;
    if (iph->protocol != IPPROTO_UDP)
        return TC_ACT_OK;

    struct udphdr *udp = (void *)iph + iph->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return TC_ACT_OK;

    /* Mark SIP (UDP 5060) and RTP (ports 10000-20000) with DSCP EF */
    __u16 dport = bpf_ntohs(udp->dest);
    if (dport == 5060 || (dport >= 10000 && dport <= 20000)) {
        __u8 tos = 46 << 2;  /* DSCP EF = 46, shifted past the ECN bits = 0xB8 */
        bpf_skb_store_bytes(skb, ETH_HLEN + offsetof(struct iphdr, tos),
                            &tos, 1, 0);
        /* NB: TOS is covered by the IP header checksum — a complete program
         * must also patch it, e.g. with bpf_l3_csum_replace(). */
    }
    return TC_ACT_OK;
}

/* Attach TC eBPF */
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 ingress bpf obj tc_qos.o sec tc direct-action
# tc filter show dev eth0 ingress
```
AF_XDP — ZERO-COPY USERSPACE PACKET PROCESSING
AF_XDP Architecture
AF_XDP is a socket type that allows userspace applications to receive and send packets directly from/to NIC memory with zero kernel copies. Unlike DPDK, AF_XDP keeps the NIC under kernel control — only selected packet queues are redirected to userspace.
```
/* AF_XDP architecture */
NIC Queue N → [XDP program runs in driver] → XDP_REDIRECT → AF_XDP socket
NIC Queue 0 → [passes to kernel network stack normally]

/* UMEM — userspace memory region registered with the kernel */
void *umem_area = mmap(NULL, UMEM_SIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
struct xsk_umem *umem;
xsk_umem__create(&umem, umem_area, UMEM_SIZE, &fill_ring, &comp_ring, NULL);

/* Four rings between kernel and userspace */
Fill ring       (userspace → kernel): "here are free buffers you can fill with RX packets"
Completion ring (kernel → userspace): "here are TX buffers I'm done with"
RX ring         (kernel → userspace): "here are received packets"
TX ring         (userspace → kernel): "here are packets to transmit"

/* Receive loop */
while (1) {
    rcvd = xsk_ring_cons__peek(&sock->rx, BATCH, &idx_rx);
    for (i = 0; i < rcvd; i++) {
        addr = xsk_ring_cons__rx_desc(&sock->rx, idx_rx + i)->addr;
        len  = xsk_ring_cons__rx_desc(&sock->rx, idx_rx + i)->len;
        pkt  = xsk_umem__get_data(sock->umem->buffer, addr);
        /* pkt points directly to the NIC DMA buffer — zero copy! */
        process_packet(pkt, len);
    }
    xsk_ring_cons__release(&sock->rx, rcvd);
    /* Refill the fill ring so the kernel has buffers for the next batch */
    replenish_fill_ring(sock, rcvd);
}

/* XDP program to steer traffic to the AF_XDP socket */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, MAX_QUEUES);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp_sock")
int xdp_redirect_to_xsk(struct xdp_md *ctx)
{
    __u32 queue = ctx->rx_queue_index;
    if (bpf_map_lookup_elem(&xsks_map, &queue))
        return bpf_redirect_map(&xsks_map, queue, XDP_PASS);
    return XDP_PASS;
}
```
eBPF TOOLING — bpftool, libbpf, bpftrace
Essential eBPF Tools
```shell
# ---- bpftool — Swiss Army knife for eBPF ----

# List all loaded eBPF programs
bpftool prog list
bpftool prog show id 42

# Dump eBPF bytecode (disassemble)
bpftool prog dump xlated id 42
# Show JIT-compiled machine code
bpftool prog dump jited id 42

# List all BPF maps
bpftool map list
bpftool map dump id 7                                        # dump all entries
bpftool map lookup id 7 key 0x01 0x02 0x03 0x04              # look up a specific key
bpftool map update id 7 key 0x01 0x02 0x03 0x04 value 0x01   # add to blocklist

# Show XDP programs attached to interfaces
bpftool net list
ip link show          # also shows an "xdp" flag if XDP is attached

# Debug output from bpf_trace_printk()
cat /sys/kernel/debug/tracing/trace_pipe

# ---- bpftrace — high-level eBPF tracing language ----

# Trace every TCP connection
bpftrace -e 'kprobe:tcp_connect { printf("connect: pid=%d\n", pid); }'

# Count received packets per device
bpftrace -e 'tracepoint:net:netif_receive_skb { @[str(args->name)] = count(); }'

# Histogram of ip_rcv() latency
bpftrace -e '
kprobe:ip_rcv { @start[tid] = nsecs; }
kretprobe:ip_rcv /@start[tid]/ {
    @latency = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'

# ---- Cilium's eBPF-based Kubernetes networking ----
cilium status                 # health of eBPF programs
cilium monitor                # real-time packet events
cilium bpf ct list global     # connection tracking table
```
LAB 1 — Write and Load Your First XDP Program
Objective: Write a functional XDP program that counts packets per source IP and drops packets from a blocklist.
1. Install the toolchain: sudo apt install clang llvm libbpf-dev linux-headers-$(uname -r) bpftool. Verify: clang --version (need 10+) and bpftool version.
2. Create xdp_counter.c with a BPF_MAP_TYPE_PERCPU_HASH for per-IP counters. Implement the XDP program to increment the counter for each source IP. Compile: clang -O2 -target bpf -c xdp_counter.c -o xdp_counter.o.
3. Attach: sudo ip link set veth0 xdp obj xdp_counter.o sec xdp. Verify attachment: ip link show veth0 should show the "xdp" flag. Generate traffic (ping) and read the counters: sudo bpftool map dump name pkt_count.
4. Add a blocklist map and drop matching source IPs. Populate it from a libbpf userspace loader, or use bpftool map update as an alternative.

LAB 2 — bpftrace Network Observability
Objective: Use bpftrace to instrument the kernel network stack without writing eBPF C code.
1. Install: sudo apt install bpftrace. Run a one-liner to count received packets per device: sudo bpftrace -e 'tracepoint:net:netif_receive_skb { @[str(args->name)] = count(); }'. While it runs, generate traffic and observe the output.
2. Trace new TCP connections: sudo bpftrace -e 'kprobe:tcp_connect { printf("pid=%d comm=%s\n", pid, comm); }'. Open several websites in a browser — you should see a connect event for each. Extend the script to also trace tcp_close.

M16 MASTERY CHECKLIST
- Know what eBPF is: sandboxed kernel programs, loaded from userspace, verified for safety, JIT compiled
- Know eBPF's 3 key properties: safety (verifier), performance (JIT, kernel execution), programmability (runtime updates)
- Know 7 eBPF hook types and their positions: XDP native/generic, TC ingress/egress, socket filter, cgroup/sock, kprobe/tracepoint
- Know eBPF VM: 11 registers (r0=return, r1-r5=args, r10=frame pointer), 512B stack, no unbounded loops
- Know eBPF program lifecycle: C source → clang (target bpf) → verifier → JIT → attach to hook
- Know what the verifier checks: all paths terminate, bounds-checked memory access, helper call validity
- Know eBPF helper functions: bpf_map_lookup/update_elem, bpf_redirect, bpf_xdp_adjust_head, bpf_trace_printk
- Know BPF map types: HASH, ARRAY, LPM_TRIE (IP prefix match!), PERCPU_HASH, PERF_EVENT_ARRAY, RINGBUF
- Know how maps enable kernel-userspace communication: both sides access same map via file descriptor
- Know XDP return codes: XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT
- Know XDP vs TC eBPF: XDP = before sk_buff (faster, less context); TC = has sk_buff (richer context, slower)
- Know TC return codes: TC_ACT_OK, TC_ACT_SHOT, TC_ACT_REDIRECT
- Know AF_XDP: NIC DMA → userspace UMEM (zero copy); 4 rings: fill, completion, RX, TX
- Know AF_XDP vs DPDK: AF_XDP keeps kernel driver control; DPDK takes exclusive NIC ownership
- Know bpftool: list/inspect programs and maps, dump bytecode, update map entries at runtime
- Know bpftrace: high-level tracing language, kprobe/tracepoint access, histogram output
- Completed Lab 1: wrote and loaded XDP packet counter + IP blocklist with libbpf
- Completed Lab 2: used bpftrace to trace TCP connections and measure ip_rcv latency
✅ When complete: Move to M17 - High-Performance Networking with DPDK — your existing DPDK knowledge plus this eBPF foundation prepares you for the deepest performance engineering content in the curriculum.