NETWORKING MASTERY · PHASE 4 · MODULE 14 · WEEK 12
🐧 Linux Network Stack
sk_buff · NIC RX/TX path · Netfilter/iptables · Namespaces · tc qdisc · RSS and XPS · Kernel bypass concepts
Advanced · Prerequisite: M05 TCP, M10 Routing · Kernel 5.x+ · Essential for DPDK/VPP Context · 3 Labs

THE LINUX NETWORK STACK — 5 MILLION LINES OF KERNEL CODE

🗺️

End-to-End Packet Journey Through the Kernel

OVERVIEW

When a packet arrives at a Linux machine, it traverses roughly a dozen distinct processing stages before reaching a userspace application. Understanding this path is foundational for DPDK/VPP work — the entire value proposition of kernel bypass is eliminating the overhead of these steps.

/* Inbound packet journey — NIC to application */

1. NIC hardware receives frame, places in RX ring buffer (DMA)
2. NIC raises hardware interrupt (IRQ)
3. NIC driver ISR: disable NIC IRQ, schedule NAPI poll (softirq NET_RX)
4. NAPI poll: driver pulls packets from RX ring → builds sk_buff objects
5. netif_receive_skb(): packet enters kernel network stack
6. Protocol demultiplexing: Ethernet → IP → TCP/UDP
7. Netfilter PREROUTING hook (before the routing decision)
8. IP routing: FIB lookup → local delivery (Netfilter INPUT hook) or forward (FORWARD → POSTROUTING)
9. Transport layer: TCP reassembly / UDP delivery
10. Socket receive queue: sk_buff queued on the socket (accounted against sk_rcvbuf)
11. Wakeup sleeping process (epoll/select/read)
12. copy_to_user(): kernel→userspace data copy

/* Where cycles are spent (approximate) */
Driver/NAPI:           ~5%   (hardware-accelerated on modern NICs)
sk_buff allocation:    ~15%  (alloc/free + cache misses)
Protocol processing:   ~20%  (IP/TCP checksum, state machine)
Netfilter:             ~25%  (each hook traverses rule list)
Memory copies:         ~35%  (copy_to_user: sk_buff data → application buffer)

/* DPDK bypass eliminates steps 2-12 entirely */
# Packet goes: NIC DMA → hugepage memory → userspace application
# Zero interrupts, zero copies, zero kernel involvement

sk_buff — THE KERNEL'S PACKET ABSTRACTION

📦

sk_buff Structure

sk_buff

The sk_buff (socket buffer) is the central data structure for all packets in the Linux kernel. Every packet in flight is represented as an sk_buff. Understanding it explains how the kernel avoids copying data as headers are added/removed.

/* sk_buff key fields (simplified from include/linux/skbuff.h) */
struct sk_buff {
    /* Pointers into the data buffer */
    unsigned char   *head;      /* start of allocated buffer */
    unsigned char   *data;      /* start of valid data (moves on push/pull) */
    unsigned char   *tail;      /* end of valid data */
    unsigned char   *end;       /* end of allocated buffer */

    /* len = tail - data = bytes of valid packet data */
    unsigned int     len;
    unsigned int     data_len;  /* bytes in page fragments (non-linear data) */

    /* Protocol info */
    __be16           protocol;  /* ETH_P_IP, ETH_P_IPV6, etc. */
    __u8             pkt_type;  /* PACKET_HOST, BROADCAST, MULTICAST */

    /* Device info */
    struct net_device *dev;     /* ingress/egress network interface */

    /* Checksums */
    __wsum           csum;
    __u8             ip_summed; /* CHECKSUM_NONE/PARTIAL/COMPLETE/UNNECESSARY */

    /* Netfilter connection tracking */
    struct nf_conntrack *nfct;

    /* Header locations (modern kernels store offsets from skb->head, not pointers) */
    __u16            transport_header;  /* L4: TCP/UDP header */
    __u16            network_header;    /* L3: IPv4/IPv6 header */
    __u16            mac_header;        /* L2: Ethernet header */
};

/* Header manipulation — NO data copy required */
skb_push(skb, hdr_len);  /* data -= hdr_len  (add header at front) */
skb_pull(skb, hdr_len);  /* data += hdr_len  (remove header at front) */
skb_put(skb,  data_len); /* tail += data_len (add data at end) */
skb_trim(skb, len);      /* tail = data + len (remove tail data) */

💡 Why sk_buff is efficient: When TCP adds a header to a payload, it calls skb_push() which just moves the data pointer backwards — no memcpy. The physical data stays in place. This is possible because the buffer was allocated with headroom specifically for headers. The same principle applies for all layers adding/removing headers as the packet traverses up/down the stack.
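
To make the headroom idea concrete, here is a minimal kernel-style sketch (not taken from any real driver; DEMO_HEADROOM and the function name are made up for illustration). It reserves headroom, copies the payload exactly once, and then prepends a UDP header purely by moving the data pointer:

#include <linux/skbuff.h>
#include <linux/udp.h>
#include <linux/string.h>

#define DEMO_HEADROOM 128        /* illustrative: room for L2/L3/L4 headers */

/* Sketch only: allocate an skb with headroom so later layers can prepend
 * headers with pointer arithmetic instead of memcpy. */
static struct sk_buff *build_skb_with_headroom(const void *payload, unsigned int plen)
{
    struct sk_buff *skb = alloc_skb(DEMO_HEADROOM + plen, GFP_ATOMIC);
    struct udphdr *uh;

    if (!skb)
        return NULL;

    skb_reserve(skb, DEMO_HEADROOM);            /* data/tail += 128: creates headroom  */
    memcpy(skb_put(skb, plen), payload, plen);  /* tail += plen: the only payload copy */

    uh = (struct udphdr *)skb_push(skb, sizeof(*uh));  /* data -= 8: header prepended   */
    uh->len = htons(plen + sizeof(*uh));               /* fields filled in place        */

    return skb;
}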

RECEIVE PATH — NIC INTERRUPT TO SOCKET BUFFER

📥

NAPI — New API for High-Speed Packet Reception

NAPI

The original interrupt-per-packet model fails at high packet rates: at 10 Gbps with 64-byte packets that is up to 14.8 million frames, and potentially interrupts, per second, enough to saturate a CPU just servicing interrupt handlers. NAPI (New API) solves this by masking the RX interrupt under load and switching to polling:

/* NAPI receive flow */

Packet arrives → NIC raises IRQ
  ↓
ISR (hard-IRQ context, must run fast):
  napi_schedule(&napi);          /* queue NAPI poll for the NET_RX softirq */
  mask NIC RX interrupts         /* device-specific register write; no further IRQs */
  ↓
NET_RX softirq (softirq context, may be deferred to the ksoftirqd thread):
  driver->poll(napi, budget=64);  /* pull up to 64 packets per poll */
    for each packet in RX ring:
        alloc sk_buff
        DMA: NIC buffer → sk_buff->data
        refill RX ring with new DMA buffer
        netif_receive_skb(skb)  → up the stack
    if ring empty:
        napi_complete();         /* re-enable NIC interrupts */
    if budget exhausted (ring still has packets):
        return budget;           /* reschedule next softirq tick */
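
The same flow in driver code looks roughly like the sketch below. The mydrv_* helpers and the mydrv_priv structure are hypothetical placeholders (stubbed out here so the sketch is self-contained); napi_schedule(), napi_complete_done() and netif_receive_skb() are the real kernel APIs being illustrated:

#include <linux/netdevice.h>
#include <linux/interrupt.h>
#include <linux/skbuff.h>

struct mydrv_priv {                             /* hypothetical driver state */
    struct napi_struct napi;
    /* RX ring pointers, register mappings, ... */
};

/* Hypothetical hardware helpers; a real driver implements these as register writes */
static inline void mydrv_disable_rx_irq(struct mydrv_priv *priv) { }
static inline void mydrv_enable_rx_irq(struct mydrv_priv *priv)  { }
static inline struct sk_buff *mydrv_pull_rx_descriptor(struct mydrv_priv *priv)
{
    return NULL;                                /* stub: a real driver reads the RX ring */
}

/* Top half: hard IRQ handler does almost nothing */
static irqreturn_t mydrv_isr(int irq, void *dev_id)
{
    struct mydrv_priv *priv = dev_id;

    mydrv_disable_rx_irq(priv);                 /* mask further RX interrupts */
    napi_schedule(&priv->napi);                 /* defer real work to the NET_RX softirq */
    return IRQ_HANDLED;
}

/* Bottom half: NAPI poll, called from the NET_RX softirq */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
    struct mydrv_priv *priv = container_of(napi, struct mydrv_priv, napi);
    int done = 0;

    while (done < budget) {
        struct sk_buff *skb = mydrv_pull_rx_descriptor(priv);
        if (!skb)
            break;                              /* ring drained */
        netif_receive_skb(skb);                 /* hand the packet up the stack */
        done++;
    }

    if (done < budget && napi_complete_done(napi, done))
        mydrv_enable_rx_irq(priv);              /* ring empty: re-arm the interrupt */

    return done;
}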

/* Interrupt coalescing (ethtool) */
ethtool -C eth0 rx-usecs 50      # coalesce for 50µs before interrupt
ethtool -C eth0 rx-frames 32     # or coalesce 32 frames
ethtool -S eth0 | grep -i drop   # NIC-level drop counters

/* RSS — Receive Side Scaling (multi-queue) */
# Modern NICs have multiple RX queues
# RSS hashes flow 5-tuple → assigns to queue
# Each queue has its own NAPI instance → different CPU core
# Enables true parallel packet processing
ethtool -l eth0          # show number of RX/TX queues
ethtool -L eth0 combined 8  # set 8 combined queues
cat /proc/interrupts | grep eth0  # shows per-queue IRQ counts
🔧

RX Ring Buffer and DMA

RX RING
/* NIC RX ring buffer structure */
The ring buffer is a circular array of DMA descriptors.
Each descriptor contains:
  - Physical address of a pre-allocated sk_buff data buffer
  - Buffer length
  - Status flags (owned by NIC vs owned by CPU)

NIC owns descriptor: fills buffer with incoming packet, sets status=done, raises IRQ
CPU owns descriptor: NAPI pulls packet, allocates new sk_buff, refills descriptor

/* Key: buffers pre-allocated before packet arrives */
# Driver pre-populates ring with empty sk_buffs on startup
# NIC writes directly into these buffers via DMA (zero-copy from NIC perspective)
# AFTER NAPI pulls the packet, driver allocates a NEW sk_buff to refill the slot
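
For illustration, a descriptor ring can be modelled roughly as below. Real NICs each define their own descriptor formats, so these struct and field names are assumptions rather than any specific device's layout:

#include <linux/types.h>
#include <linux/skbuff.h>

/* Illustrative RX descriptor (not a real NIC's format) */
struct rx_desc {
    __le64 buf_addr;    /* DMA address of the pre-allocated data buffer */
    __le16 buf_len;     /* size of that buffer */
    __le16 status;      /* "descriptor done" bit set by the NIC */
};

/* Illustrative ring bookkeeping kept by the driver */
struct rx_ring {
    struct rx_desc *desc;        /* array of descriptors in DMA-coherent memory */
    struct sk_buff **skb;        /* sk_buff backing each slot */
    unsigned int size;           /* number of entries, power of two */
    unsigned int next_to_clean;  /* next slot the CPU will harvest */
    unsigned int next_to_use;    /* next slot the driver will refill */
};

/* Circular advance: the index wraps with a mask because size is a power of two */
static inline unsigned int ring_next(const struct rx_ring *ring, unsigned int i)
{
    return (i + 1) & (ring->size - 1);
}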

/* Tuning the ring buffer size */
ethtool -g eth0                   # show current ring sizes
ethtool -G eth0 rx 4096 tx 4096  # set 4096-entry ring
# Larger ring: fewer drops under burst, more memory used
# Smaller ring: lower worst-case queueing latency, but more drops under bursts

/* Drop diagnosis */
cat /proc/net/dev                 # interface stats including drops
ip -s link show eth0              # TX/RX errors and drops
ethtool -S eth0 | grep drop       # NIC-level drop counters
ss -s                             # socket-level stats

TRANSMIT PATH — APPLICATION TO WIRE

📤

TX Path — Socket to NIC

TX PATH
/* TX path: application write() → NIC */

1. Application: write(fd, data, len)  or  send(fd, data, len, flags)
2. copy_from_user(): data copied from userspace to kernel sk_buff
3. TCP/UDP: segment, add transport header, update sequence numbers
4. IP: add IP header, route lookup (FIB), fragment if needed
5. Netfilter OUTPUT hook
6. IP routing OUTPUT: select egress interface
7. Netfilter POSTROUTING hook
8. Neighbour (ARP) cache lookup for next-hop MAC
9. L2: add Ethernet header (src MAC = interface MAC, dst = next-hop MAC)
10. qdisc (traffic control): enqueue to output queue
11. dev_hard_start_xmit(): hand to driver TX ring
12. NIC DMA: reads from TX ring, sends on wire
13. Interrupt: NIC signals TX complete → free sk_buff
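
Step 11 corresponds to the driver's ndo_start_xmit callback. Below is a minimal sketch of its shape; the mydrv_* helpers and mydrv_priv are hypothetical (stubbed for self-containment), while NETDEV_TX_OK/NETDEV_TX_BUSY, netdev_priv() and netif_stop_queue() are the real kernel interfaces:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct mydrv_priv;                              /* hypothetical driver state */

/* Hypothetical helpers a real driver would implement against its TX ring */
static inline bool mydrv_tx_ring_has_room(struct mydrv_priv *priv) { return true; }
static inline void mydrv_post_tx_descriptor(struct mydrv_priv *priv,
                                            struct sk_buff *skb) { }

static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct mydrv_priv *priv = netdev_priv(dev);

    if (!mydrv_tx_ring_has_room(priv)) {
        netif_stop_queue(dev);                  /* tell the qdisc layer to stop feeding us */
        return NETDEV_TX_BUSY;                  /* skb stays queued, retried later */
    }

    mydrv_post_tx_descriptor(priv, skb);        /* DMA-map the skb, write a TX descriptor */
    return NETDEV_TX_OK;                        /* skb freed on the TX-complete interrupt */
}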

/* XPS — Transmit Packet Steering */
# Like RSS for TX: map CPU cores to TX queues
# Ensures TX and RX of a flow use the same CPU → better cache locality
ls /sys/class/net/eth0/queues/tx-0/xps_cpus  # affinity mask for TX queue 0

/* TSO — TCP Segmentation Offload */
# Application writes large buffer (64KB)
# Without TSO: kernel segments into MTU-sized sk_buffs, adds TCP/IP hdr each
# With TSO: kernel sends one large sk_buff, NIC hardware segments
# Saves CPU: N segments → 1 kernel operation, NIC does N hardware operations
ethtool -K eth0 tso on    # enable TSO
ethtool -K eth0 gso on    # Generic Segmentation Offload (software TSO)
ethtool -K eth0 gro on    # Generic Receive Offload (coalesce on RX)

NETFILTER — KERNEL PACKET FILTERING FRAMEWORK

🔥

Netfilter Hooks and iptables

NETFILTER

Netfilter is the kernel framework for packet filtering, NAT, and connection tracking. iptables (and the modern nftables) is the userspace tool that configures Netfilter rules. Understanding hook points is essential for firewall development.

/* Netfilter hook points */

Incoming packet:
  NIC → [PREROUTING] → routing decision →
    if local:  [INPUT] → socket
    if forward:[FORWARD] → [POSTROUTING] → NIC

Outgoing packet:
  socket → [OUTPUT] → [POSTROUTING] → NIC

/* Five hook points */
NF_INET_PRE_ROUTING:   After L2 demux, before routing. Used for DNAT.
NF_INET_LOCAL_IN:      After routing, for locally-destined packets (iptables INPUT).
NF_INET_FORWARD:       For packets being forwarded (not local).
NF_INET_LOCAL_OUT:     Locally-generated packets, before routing (iptables OUTPUT).
NF_INET_POST_ROUTING:  After routing, before sending. Used for SNAT/masquerade.
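
These hook points are also available directly to kernel modules, not just to iptables. A minimal sketch (module init/exit boilerplate omitted; assumes a modern kernel with nf_register_net_hook()) that drops ICMP at the LOCAL_IN hook:

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/skbuff.h>

/* Drop all ICMP destined to this host; accept everything else */
static unsigned int drop_icmp_hook(void *priv, struct sk_buff *skb,
                                   const struct nf_hook_state *state)
{
    const struct iphdr *iph = ip_hdr(skb);

    if (iph->protocol == IPPROTO_ICMP)
        return NF_DROP;
    return NF_ACCEPT;
}

static const struct nf_hook_ops icmp_ops = {
    .hook     = drop_icmp_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_IN,          /* the INPUT hook point */
    .priority = NF_IP_PRI_FILTER,
};

/* In the module's init function: nf_register_net_hook(&init_net, &icmp_ops);
 * In its exit function:          nf_unregister_net_hook(&init_net, &icmp_ops); */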

/* iptables tables (each hooks into specific netfilter hooks) */
filter:   INPUT, FORWARD, OUTPUT — packet accept/drop decisions
nat:      PREROUTING (DNAT), OUTPUT (DNAT), POSTROUTING (SNAT)
mangle:   all 5 hooks — modify packet headers (TTL, TOS, marks)
raw:      PREROUTING, OUTPUT — bypass conntrack (NOTRACK)
security: INPUT, FORWARD, OUTPUT — SELinux mandatory access control

/* iptables command structure */
iptables -t TABLE -A CHAIN -m match --opt val -j TARGET

/* Common rules */
iptables -A INPUT -p tcp --dport 22 -j ACCEPT           # allow SSH
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT  # stateful accept
iptables -A INPUT -j DROP                               # default deny
iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -j MASQUERADE  # NAT
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to 10.0.0.5:8080

/* conntrack — connection tracking */
conntrack -L                  # list all tracked connections
conntrack -D -s 192.168.1.5   # delete connections from this source
cat /proc/sys/net/netfilter/nf_conntrack_count    # current count
cat /proc/sys/net/netfilter/nf_conntrack_max      # maximum
# conntrack table full → new connections are dropped (use raw/NOTRACK to exempt DoS-prone flows)

/* nftables — modern replacement for iptables */
nft list ruleset
nft add table inet filter
nft add chain inet filter input  { type filter hook input priority 0\; policy drop\; }
nft add rule inet filter input tcp dport 22 accept

NETWORK NAMESPACES — LINUX NETWORK VIRTUALISATION

🏗️

Network Namespaces

NAMESPACES

Linux network namespaces provide complete network stack isolation: each namespace has its own interfaces, routing table, iptables rules, ARP cache, sockets, and port space. This is the foundation of Docker container networking, Kubernetes pod networking, and network function testing.

/* Network namespace fundamentals */

# Create namespace
ip netns add ns1
ip netns add ns2

# Create a veth pair (virtual ethernet — always come in pairs)
ip link add veth0 type veth peer name veth1

# Move one end into each namespace
ip link set veth0 netns ns1
ip link set veth1 netns ns2

# Configure IPs in each namespace
ip netns exec ns1 ip addr add 10.0.0.1/24 dev veth0
ip netns exec ns1 ip link set veth0 up
ip netns exec ns2 ip addr add 10.0.0.2/24 dev veth1
ip netns exec ns2 ip link set veth1 up

# Test connectivity
ip netns exec ns1 ping 10.0.0.2

# Connect a namespace to the host network via a bridge
ip link add br0 type bridge
ip link set br0 up
ip link add veth-ext type veth peer name veth-br
ip link set veth-br master br0
ip link set veth-br up
ip link set veth-ext netns ns1
ip netns exec ns1 ip addr add 192.168.1.10/24 dev veth-ext
ip netns exec ns1 ip link set veth-ext up

# Run a process in a namespace
ip netns exec ns1 bash              # shell in ns1
ip netns exec ns1 tcpdump -i veth0  # capture in ns1

# Inspect
ip netns list
ip netns exec ns1 ip route show
ip netns exec ns1 ip link show

/* Docker uses namespaces internally */
# Each container gets its own netns
# docker inspect container | grep -i pid
# nsenter -t PID -n ip addr  → enter container's netns
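
Namespaces can also be created directly from C, which is roughly what ip netns and container runtimes do under the hood. A minimal userspace sketch (requires root / CAP_SYS_ADMIN; error handling kept short):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Detach this process into a brand-new, empty network namespace.
       Inside it, only a down loopback interface exists. */
    if (unshare(CLONE_NEWNET) != 0) {
        perror("unshare(CLONE_NEWNET)");
        return 1;
    }

    /* Everything run from here on sees the isolated stack: */
    system("ip link show");      /* prints only 'lo' */
    return 0;
}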

TRAFFIC CONTROL — QDISC AND SHAPING

🚦

Linux tc — Traffic Control

TC

Linux tc (traffic control) implements packet scheduling, shaping, and classification on the output path. It is the kernel's QoS subsystem and also serves as the attachment point for eBPF programs. Understanding qdiscs is important for both performance tuning and network emulation (netem).

/* Qdisc types */
pfifo_fast:   Default. Three-band FIFO based on IP TOS. Fast but simple.
fq_codel:     Fair Queue CoDel. Modern default. Fair per-flow + AQM.
tbf:          Token Bucket Filter. Rate limiting.
htb:          Hierarchical Token Bucket. Traffic shaping with classes.
netem:        Network Emulator. Add delay, loss, reorder, corrupt.
fq:           Fair Queue. Per-flow scheduling. Used with BBR.
cake:         Combined AQM and FQ. Best for home/edge routers.

/* netem — network emulation for testing */
# Add 100ms delay to all outgoing packets on eth0
tc qdisc add dev eth0 root netem delay 100ms

# Add delay + jitter (uniform distribution ±20ms)
tc qdisc add dev eth0 root netem delay 100ms 20ms

# Add 1% random packet loss
tc qdisc add dev eth0 root netem loss 1%

# Add 1% duplication + 0.5% corruption
tc qdisc add dev eth0 root netem duplicate 1% corrupt 0.5%

# Combine: 50ms delay + 10ms jitter + 0.5% loss
tc qdisc replace dev eth0 root netem delay 50ms 10ms loss 0.5%

# Remove
tc qdisc del dev eth0 root

/* HTB — rate limiting / shaping */
# Limit eth0 to 10Mbps
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 10mbit

/* View current qdisc */
tc qdisc show dev eth0
tc -s qdisc show dev eth0   # with statistics (packets, drops)

KERNEL BYPASS — WHY AND HOW

The Case for Kernel Bypass

BYPASS

The Linux kernel network stack was designed for generality, not for the highest possible forwarding performance. At line rate on a 100G NIC (148 Mpps for 64-byte packets), the overhead of interrupts, sk_buff allocation, netfilter traversal, and multiple memory copies becomes the bottleneck. Kernel bypass eliminates this overhead.

/* Performance comparison */
Linux kernel stack:      ~1-3 Mpps per core (64-byte packets)
DPDK (PMD polling):     ~30-80 Mpps per core
VPP (vector processing): ~30-100 Mpps per core
XDP (eBPF in driver):   ~10-30 Mpps per core (with kernel features)

/* Kernel bypass mechanisms */

1. DPDK (Data Plane Development Kit):
   - PMD (Poll Mode Driver) replaces kernel driver
   - Application polls NIC directly — no interrupts ever (see the RX loop sketch after this list)
   - Hugepage memory for packet buffers (no TLB misses)
   - Runs in userspace — full application control
   Con: NIC is dedicated to DPDK, kernel cannot use it

2. AF_XDP (eXpress Data Path socket):
   - Kernel socket family (available since 4.18; zero-copy support matured in 5.x)
   - Selective bypass: some queues to XDP, others to kernel
   - eBPF program in driver decides: XDP socket or kernel
   - Zero-copy between NIC and userspace possible
   - NIC still managed by kernel driver

3. XDP (eXpress Data Path):
   - eBPF program runs at NIC driver level (before sk_buff)
   - Can DROP, PASS, TX, REDIRECT
   - Native XDP: runs in the driver's NAPI poll path, before sk_buff allocation (fastest)
   - Generic XDP: runs after sk_buff allocation (slower, any NIC)
   - Use case: fast packet filtering, DDoS mitigation, load balancing

4. io_uring (for sockets):
   - Async I/O interface for socket operations
   - Reduces syscall overhead for high-connection-count servers
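
To make the DPDK model above concrete, here is the RX loop sketch referenced in item 1: a minimal poll-mode receive loop. It assumes the EAL, the port, and the mbuf mempool have already been initialised (which real DPDK code must do first); port 0 and the burst size of 32 are arbitrary illustrative choices.

/* Minimal DPDK-style poll loop (sketch; EAL/port/mempool setup omitted) */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32            /* arbitrary illustrative burst size */

static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {                                       /* busy-poll: no interrupts, ever */
        uint16_t n = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < n; i++) {
            /* parse/forward bufs[i] here, then return it to the mempool */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}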

/* XDP program example (simplified): drop all inbound ICMP at the driver */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_DROP;       /* truncated frame */
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (struct iphdr *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_DROP;         /* truncated IP header */
    if (ip->protocol == IPPROTO_ICMP) return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
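
Once compiled to BPF bytecode (typically clang -O2 -target bpf -c prog.c -o prog.o), the object can usually be attached with iproute2: ip link set dev eth0 xdp obj prog.o sec xdp (or xdpgeneric to force generic mode on NICs without native support) and detached with ip link set dev eth0 xdp off.
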
LAB 1

sk_buff Tracing and Stack Profiling

Objective: Use kernel tracing tools to observe the packet path in real time.

1
Trace packet path with perf: sudo perf stat -e net:net_dev_xmit,net:netif_receive_skb,napi:napi_poll ping -c 100 google.com. Count the kernel events fired per ping packet. Estimate the per-packet processing overhead.
2
Observe conntrack table: watch -n1 'cat /proc/sys/net/netfilter/nf_conntrack_count'. Run a web benchmark (ab -n 10000 http://localhost/) and watch the count grow. Observe TTL-based cleanup afterward.
3
Tune NAPI: ethtool -C eth0 rx-usecs 0 rx-frames 1 (minimum coalescing = one interrupt per packet). Measure latency with ping -i 0.01. Then set rx-usecs 1000 (batch). Measure throughput with iperf3. Document the latency vs throughput tradeoff.
4
Profile with perf top: sudo perf top -e cycles:k while running iperf3. Identify which kernel functions consume most cycles during heavy network load (look for napi_poll, __netif_receive_skb, ip_rcv, tcp_rcv_established).
LAB 2

Network Namespaces — Build a Virtual Network

Objective: Build a 3-namespace virtual network with a bridge router in the middle. Use this topology for all future protocol labs.

1
Create namespaces ns-client, ns-router, ns-server. Create veth pairs: veth-c0/veth-r0 (client↔router) and veth-r1/veth-s0 (router↔server). Move veth-c0 to ns-client, veth-s0 to ns-server, veth-r0 and veth-r1 to ns-router.
2
Assign IPs: ns-client: 10.1.0.2/24 on veth-c0; ns-router: 10.1.0.1/24 on veth-r0 and 10.2.0.1/24 on veth-r1; ns-server: 10.2.0.2/24 on veth-s0. Enable forwarding in ns-router: ip netns exec ns-router sysctl net.ipv4.ip_forward=1.
3
Add routes: ns-client default via 10.1.0.1; ns-server default via 10.2.0.1. Test: ip netns exec ns-client ping 10.2.0.2. Capture in ns-router to verify forwarding: ip netns exec ns-router tcpdump -i any icmp.
4
Add iptables rules in ns-router: allow ESTABLISHED/RELATED, allow ICMP, block TCP 23 (telnet), log dropped packets. Test each rule. This is your personal NGFW testbed — reuse for Phase 5/6 labs.
LAB 3

Network Emulation with netem

Objective: Use netem to simulate WAN conditions and measure TCP behaviour under loss and delay.

1
In your namespace topology from Lab 2, add 50ms delay to ns-router's veth-r0: ip netns exec ns-router tc qdisc add dev veth-r0 root netem delay 50ms. Run ping and iperf3. Record RTT and throughput.
2
Add progressive loss: 0%, 0.1%, 0.5%, 1%, 5%. For each, measure TCP throughput with iperf3 (-t 10 -P 4). Plot the results. At what loss rate does TCP throughput degrade significantly? Compare with QUIC if available.
3
Simulate packet reordering (typical with ECMP): netem delay 50ms 10ms distribution normal reorder 25% 50%. Observe TCP reorder counter: ss -ti | grep reord. Explain why reordering triggers spurious retransmits.

M14 MASTERY CHECKLIST

When complete: Move to M15 - Socket Programming — now that you understand the kernel stack that sockets sit on top of, the API will make much deeper sense.
