DPDK MASTERY · PHASE 3 OF 3 · MODULE A
Multi-Process, rte_flow & NUMA
Primary/secondary model · shared resources · hardware flow classification · NUMA-aware allocation
Ch 13 — Multi-Process · Ch 14 — rte_flow · Ch 15 — Multi-Core & NUMA · C · VFIO · FDIR · Cache-Line · Weeks 11–13

Primary / Secondary Process Model

DPDK supports multiple OS processes sharing the same NIC and hugepage memory. The primary process owns all resources; secondary processes attach and use them. This enables hot-restartable components, traffic class isolation, and separation of control and data planes.
Multi-Process Architecture

  Primary Process                      Secondary Processes
┌──────────────────────┐              ┌──────────────────────┐
│ rte_eal_init()       │              │ rte_eal_init()       │
│ (creates resources)  │              │ --proc-type=secondary│
│                      │   hugepage   │ (attaches)           │
│ hugepage memory      │◄──shared────►│ hugepage memory      │
│ rte_mempool (named)  │              │ rte_mempool_lookup() │
│ rte_ring (named)     │              │ rte_ring_lookup()    │
│ NIC: dev_configure   │              │ NIC: rx_burst only   │
│ NIC: dev_start       │              │ (cannot reconfigure) │
│                      │              └──────────────────────┘
│ Queues 0-3 ──────────┼──► Secondary-Enterprise
│ Queues 4-7 ──────────┼──► Secondary-Mobility
└──────────────────────┘

Jio SASE-DP production pattern: primary owns both 100G ports. Enterprise secondary handles queues 0-3 (URL filter). Mobility secondary handles queues 4-7 (5G/SCEF policy).

EAL ARGUMENTS FOR MULTI-PROCESS

# Primary process — creates all shared resources
./my_primary -l 0-3 -n 4 --proc-type=primary --file-prefix=sase \
    -a 0000:03:00.0 -- [app args]

# Secondary process — attaches to primary's shared memory
./my_secondary -l 4-7 -n 4 --proc-type=secondary --file-prefix=sase \
    -- [app args]

# auto: becomes primary if none exists, secondary otherwise
./my_app -l 0-3 -n 4 --proc-type=auto --file-prefix=sase
⚠️ Rules:
  • Secondary does NOT need -a (device allowlist) — it inherits device info from the primary.
  • Secondary CAN specify -l (lcores), but they must NOT overlap with the primary's lcores.
  • Both must use the same --file-prefix to share the same /dev/shm/ files.
Resource                       | Shared?                       | Access from Secondary
-------------------------------|-------------------------------|--------------------------------------------------
Hugepage memory segments       | YES — read/write              | All processes map the same physical hugepages
rte_mempool (named)            | YES — shared pool             | rte_mempool_lookup("MBUF_POOL")
rte_ring (named)               | YES — shared ring             | rte_ring_lookup("WORK_RING")
rte_hash / rte_lpm / rte_acl   | YES — if in named memzone     | rte_memzone_lookup("FLOW_TABLE")
NIC port state                 | YES — read-only for secondary | Can rx_burst/tx_burst but NOT reconfigure
Per-process heap (rte_malloc)  | NO — per-process              | Each process has its own heap allocation
lcores / thread pool           | NO — per-process              | Each process runs its own lcore threads
// Secondary process — find shared pool created by primary
struct rte_mempool *pool = rte_mempool_lookup("MBUF_POOL");
if (!pool)
    rte_exit(EXIT_FAILURE, "Cannot find mempool — is primary running?\n");

struct rte_ring *ring = rte_ring_lookup("WORK_RING");
if (!ring)
    rte_exit(EXIT_FAILURE, "Cannot find ring\n");

// Find a custom data structure placed in a named memzone by primary
const struct rte_memzone *mz = rte_memzone_lookup("FLOW_TABLE");
if (!mz)
    rte_exit(EXIT_FAILURE, "Cannot find memzone FLOW_TABLE\n");
struct my_flow_table *ftbl = (struct my_flow_table *)mz->addr;

// Receive packets using the NIC queue assigned to this secondary
struct rte_mbuf *pkts[32];
uint16_t nb = rte_eth_rx_burst(port_id, my_queue_id, pkts, 32);

Limitations and Gotchas

  • Secondary cannot call rte_eth_dev_configure() or rte_eth_dev_start() — NIC already owned by primary
  • All processes must use the same --socket-mem — mismatch causes attach failure
  • Pointer sharing: raw pointers in shared memory are only valid in the process that set them. Use IOVA offsets or named memzones, not raw pointers.
  • Primary exit kills shared memory — all secondaries lose access to pools and rings immediately → segfault
  • Cannot mix DPDK versions between primary and secondary — ABI must match exactly
🆕 Common mistake: Secondary calls rte_pktmbuf_pool_create() instead of rte_mempool_lookup(). This fails with EEXIST (name already taken by primary's pool). Always use _lookup() in secondary processes for resources created by primary.

rte_flow — Hardware Flow Classification

rte_flow allows applications to program the NIC's hardware to perform packet classification and queue steering in silicon — with zero CPU involvement for matched flows. Matched packets bypass RSS entirely and are sent directly to a specific queue. Non-matching packets continue through normal RSS.

HOW rte_flow WORKS

rte_flow Architecture

Application defines flow rule (generic DPDK format)
        ↓
rte_flow_create() — PMD validates rule and
translates it to NIC-specific hardware instructions
        ↓
NIC FDIR / flow table programmed (in NIC silicon)
        ↓
Packet arrives:
  Matching packets     → NIC classifies in hardware → steered to specific Rx queue
  Non-matching packets → go through the normal RSS pipeline

Key benefit: matched flows bypass RSS entirely — zero CPU for classification
Use case:    steer a specific traffic class to a dedicated queue/lcore
Example:     enterprise VPN subnet → queue 0 (enterprise secondary)
             5G/GTP traffic        → queue 4 (mobility secondary)

FLOW RULE STRUCTURE

Three Building Blocks

  1. Attributes: ingress/egress, priority, group number
  2. Pattern (match criteria): chain of item types — ETH, IPV4, TCP, UDP, VXLAN, GTP, etc. Each item specifies field values and masks.
  3. Actions (what to do with matched packets): QUEUE (steer to specific Rx queue), DROP, COUNT, MARK (tag mbuf), RSS (apply RSS to matched subset), JUMP (goto another group)
// Complete rte_flow example: steer all TCP port 443 traffic → queue 0
struct rte_flow_attr attr = {
    .ingress  = 1,   // match incoming packets
    .priority = 0,   // highest priority
};

// Pattern: ETH / IPV4 / TCP(dport=443) / END
struct rte_flow_item_tcp tcp_spec = { .hdr.dst_port = rte_cpu_to_be_16(443) };
struct rte_flow_item_tcp tcp_mask = { .hdr.dst_port = 0xFFFF };
struct rte_flow_item pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
    { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_spec, .mask = &tcp_mask },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};

// Action: steer to queue 0
struct rte_flow_action_queue queue_action = { .index = 0 };
struct rte_flow_action actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

// Validate (check the NIC supports this rule — no changes to HW)
struct rte_flow_error err;
int ret = rte_flow_validate(port_id, &attr, pattern, actions, &err);

// Create (programs the NIC hardware)
struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
if (!flow)
    rte_exit(EXIT_FAILURE, "Flow create failed: %s\n", err.message);

// Destroy when no longer needed
rte_flow_destroy(port_id, flow, &err);

COMMON FLOW RULE PATTERNS

Use Case                                      | Pattern Items                            | Action
----------------------------------------------|------------------------------------------|------------------------
Drop all traffic from an IP                   | ETH / IPV4(src=1.2.3.4/32)               | DROP
Steer VoIP (UDP 5060) to a dedicated queue    | ETH / IPV4 / UDP(dport=5060)             | QUEUE(index=2)
Count ARP packets                             | ETH(type=0x0806)                         | COUNT + QUEUE(index=0)
VXLAN tunnel traffic to a specific queue      | ETH / IPV4 / UDP(4789) / VXLAN(vni=100)  | QUEUE(index=3)
GTP-U (5G traffic) to the mobility queue      | ETH / IPV4 / UDP(2152) / GTP(teid=X)     | QUEUE(index=4)
Mark HTTPS packets (apply DPI only to marked) | ETH / IPV4 / TCP(dport=443)              | MARK(id=1) + RSS

rte_flow Groups and Priority

  • Priority: lower number = higher priority. If two rules match the same packet, the higher priority (lower number) wins.
  • Group: rules in group 0 are evaluated first. A JUMP action sends matching packets to another group for further classification — enabling multi-level classification trees.
  • Always call rte_flow_validate() before rte_flow_create() — different NICs support different item/action combinations. Validation catches unsupported combos before touching hardware.
⚠️ NIC capability check: Not all NICs support all flow item/action combinations. Intel i40e supports 5-tuple exact match (FDIR). mlx5 supports a much richer flow API including VXLAN, GTP inner headers. Always call rte_flow_validate() first — it returns an error with a descriptive message if the NIC cannot implement the rule.

NUMA — Non-Uniform Memory Access

In multi-socket servers, each CPU socket has local RAM. Accessing memory on the same socket (local) takes ~60 ns; accessing the other socket (remote) takes ~120 ns — 2× slower. DPDK makes NUMA awareness explicit throughout: every allocation API takes a socket_id parameter.

NUMA ALLOCATION RULES

Resource             | Correct Socket                                  | Why
---------------------|-------------------------------------------------|-----------------------------------------------------------------
mbuf pool            | rte_eth_dev_socket_id(port)                     | NIC DMA writes to local-socket RAM — remote access doubles latency
rte_ring             | rte_socket_id() of the producer/consumer lcore  | Ring data is read/written by lcores on that socket
rte_hash / rte_lpm   | rte_socket_id() of the lookup lcore             | Table entries are accessed at line rate — remote access is unacceptable
Rx/Tx queues         | rte_eth_dev_socket_id(port)                     | Queue descriptors are DMA'd between NIC and RAM — must be local
// NUMA-correct mempool creation
int nic_socket = rte_eth_dev_socket_id(port_id);
struct rte_mempool *pool = rte_pktmbuf_pool_create(
    "MBUF_POOL", N_MBUFS, CACHE_SZ, 0,
    RTE_MBUF_DEFAULT_BUF_SIZE,
    nic_socket   // MUST match the NIC's socket
);

// Check lcore-to-socket alignment
unsigned lcore_id;
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    unsigned lcore_socket = rte_lcore_to_socket_id(lcore_id);
    if (lcore_socket != nic_socket)
        printf("WARNING: lcore %u on socket %u, NIC on socket %u — cross-NUMA!\n",
               lcore_id, lcore_socket, nic_socket);
}

CACHE-LINE ALIGNMENT & FALSE SHARING

False Sharing — The Hidden Serializer

When two different variables on the same cache line (64 bytes) are written by different cores, every write invalidates the other core's cached copy — causing cache coherency traffic even though the cores access different variables. This can reduce throughput by 10–100×.
// WRONG — adjacent array entries share cache lines → false sharing
struct per_core_data {
    uint64_t rx_count;   // 8 bytes
    uint64_t tx_count;   // 8 bytes
    int      running;    // 4 bytes — entries pack into the same 64-byte line!
} cores[RTE_MAX_LCORE];  // core 0 and core 1 share a cache line

// CORRECT — pad each entry to a full cache line
struct per_core_data {
    uint64_t rx_count;
    uint64_t tx_count;
    int      running;
    uint8_t  _pad[64 - sizeof(uint64_t) * 2 - sizeof(int)];  // pad to 64 bytes
} __rte_cache_aligned cores[RTE_MAX_LCORE];  // each core gets its own cache line
📌 __rte_cache_aligned is a DPDK macro that expands to __attribute__((aligned(RTE_CACHE_LINE_SIZE))). Always use it for per-lcore data structures to prevent false sharing.

Q: What is the difference between primary and secondary DPDK processes?

Primary creates and owns all shared resources: hugepage memory, named mempools, rings, and NIC configuration. Secondary attaches to existing primary memory via --proc-type=secondary and --file-prefix matching, finds named objects via lookup APIs, and can use NIC queues but cannot reconfigure the NIC. Primary must be started first.

Q: What happens to secondary processes if the primary exits?

The shared hugepage memory is unmapped by the OS when the primary exits. Secondary processes lose access to all shared mempools, rings, and hash tables. Any access to those objects causes segfault. Production systems should monitor primary health and gracefully shut down secondaries before primary exits.

Q: Why must --file-prefix match between primary and secondary?

DPDK uses the file prefix to name shared memory files in /dev/shm/ (e.g., /dev/shm/sase_config). Primary creates these files; secondary maps them. If prefixes differ, secondary maps a different (empty) shared memory file — it finds no mempools or rings and fails to start.

Q: What is rte_flow and how does it differ from RSS?

RSS distributes packets across queues by hashing the 5-tuple — the NIC computes a hash and uses a lookup table (RETA) to pick the queue. rte_flow programs the NIC to match specific packet fields (exact values + masks) and steer matching packets directly to a specific queue — bypassing RSS entirely. rte_flow is more precise (5-tuple, VLAN, VXLAN VNI, GTP TEID…) but consumes NIC hardware resources (FDIR table entries). RSS is always-on and handles all traffic; rte_flow handles specific classified flows.

Q: What is false sharing and how does DPDK prevent it?

False sharing occurs when two cores write to different variables that happen to reside on the same 64-byte cache line. Each write forces the cache line to be transferred between cores via the coherency protocol — causing serialization even though the cores are touching different data. DPDK prevents this by padding per-lcore data structures to 64 bytes using __rte_cache_aligned, ensuring each core's data occupies its own cache line.

Q: Why must mempool be allocated on the NIC's NUMA socket?

NIC DMA writes packet data into mbuf buffers. If those buffers are on the remote NUMA socket, every DMA write crosses the QPI/UPI interconnect — ~120 ns instead of ~60 ns. At 100G/64B (148 Mpps), the interconnect bandwidth becomes the bottleneck. NUMA-local allocation keeps DMA writes on the same socket as the NIC → no interconnect crossing → maximum throughput.
🔥 Lab 8: Multi-Process SASE-DP Skeleton

Build a minimal primary/secondary DPDK application that mirrors the Jio SASE-DP architecture: primary owns the NIC, enterprise secondary handles traffic on queues 0-1.

  1. Primary: rte_eal_init() with --proc-type=primary --file-prefix=sase. Configure the NIC with 4 queues. Create named mempool "MBUF_POOL" and ring "RX_TO_ENTERPRISE".
  2. Primary RX loop: rx_burst on queues 0-1 → enqueue to the "RX_TO_ENTERPRISE" ring.
  3. Secondary: rte_eal_init() with --proc-type=secondary --file-prefix=sase. Look up "MBUF_POOL" and "RX_TO_ENTERPRISE".
  4. Secondary process loop: dequeue from the ring → process (print src IP) → free the mbuf.
  5. Run primary and secondary in separate terminals — verify packets flow through.
  6. Kill the secondary — verify the primary keeps running. Kill the primary — observe secondary behavior (segfault or graceful exit).
🔥 Lab 9: rte_flow Hardware Classifier

Program the NIC to steer specific traffic to queue 0 and observe zero-CPU classification.

  1. Configure the port with 4 queues and start the device.
  2. Call rte_flow_validate() for a TCP/443 rule — check that the NIC supports it.
  3. Create flow rule: ETH / IPV4 / TCP(dport=443) → QUEUE(0).
  4. Send mixed traffic: HTTPS (443), HTTP (80), DNS (53).
  5. Verify only HTTPS packets appear on queue 0; HTTP/DNS go through RSS to other queues.
  6. Add a second rule: DROP all traffic from 10.0.0.0/8 (test with spoofed packets).
  7. Destroy the rules and verify traffic reverts to pure RSS distribution.

MASTERY CHECKLIST
