DPDK MASTERY · PHASE 3 OF 3 · MODULE A
Multi-Process, rte_flow & NUMA
Primary/secondary model · shared resources · hardware flow classification · NUMA-aware allocation
Ch 13 — Multi-Process
Ch 14 — rte_flow
Ch 15 — Multi-Core & NUMA
C · VFIO · FDIR · Cache-Line
Weeks 11–13
Primary / Secondary Process Model
DPDK supports multiple OS processes sharing the same NIC and hugepage memory. The primary process owns all resources; secondary processes attach and use them. This enables hot-restartable components, traffic-class isolation, and separation of control and data planes.

Multi-Process Architecture
Primary Process                        Secondary Processes
┌──────────────────────┐               ┌───────────────────────┐
│ rte_eal_init()       │               │ rte_eal_init()        │
│ (creates resources)  │               │ --proc-type=secondary │
│                      │   hugepage    │ (attaches)            │
│ hugepage memory      │◄── shared ──► │ hugepage memory       │
│ rte_mempool (named)  │               │ rte_mempool_lookup()  │
│ rte_ring (named)     │               │ rte_ring_lookup()     │
│ NIC: dev_configure   │               │ NIC: rx_burst only    │
│ NIC: dev_start       │               │ (cannot reconfigure)  │
│                      │               └───────────────────────┘
│ Queues 0-3 ──────────┼──►              Secondary-Enterprise
│ Queues 4-7 ──────────┼──►            ┌───────────────────────┐
└──────────────────────┘               │ Secondary-Mobility    │
                                       └───────────────────────┘
Jio SASE-DP production pattern: primary owns both 100G ports.
Enterprise secondary handles queues 0-3 (URL filter).
Mobility secondary handles queues 4-7 (5G/SCEF policy).
EAL ARGUMENTS FOR MULTI-PROCESS
# Primary process — creates all shared resources
./my_primary -l 0-3 -n 4 --proc-type=primary --file-prefix=sase \
-a 0000:03:00.0 -- [app args]
# Secondary process — attaches to primary's shared memory
./my_secondary -l 4-7 -n 4 --proc-type=secondary --file-prefix=sase \
-- [app args]
# auto: becomes primary if none exists, secondary otherwise
./my_app -l 0-3 -n 4 --proc-type=auto --file-prefix=sase
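Once the EAL arguments line up, a secondary's startup is just attach and look up. Below is a minimal sketch of the secondary side, assuming the primary created objects named "MBUF_POOL" and "RX_TO_ENTERPRISE" (the names used in Lab 8 below); error handling is abbreviated.

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>
#include <rte_mempool.h>
#include <rte_ring.h>

int main(int argc, char **argv)
{
    // EAL consumes --proc-type / --file-prefix and attaches to the
    // primary's hugepage memory
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    // With --proc-type=auto, confirm which role this process actually got
    if (rte_eal_process_type() != RTE_PROC_SECONDARY)
        rte_exit(EXIT_FAILURE, "expected to run as secondary\n");

    // Secondaries never create shared objects; they look them up by name
    struct rte_mempool *pool = rte_mempool_lookup("MBUF_POOL");
    struct rte_ring *ring = rte_ring_lookup("RX_TO_ENTERPRISE");
    if (pool == NULL || ring == NULL)
        rte_exit(EXIT_FAILURE, "shared objects not found; is the primary "
                 "running with the same --file-prefix?\n");

    // ... dequeue from ring, process, rte_pktmbuf_free() ...
    return 0;
}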
⚠️ Rules:
- Secondary does NOT need -a (device allowlist) — it inherits device information from the primary.
- Secondary CAN specify -l (lcores), but they must NOT overlap with the primary's lcores.
- Both must use the same --file-prefix so they map the same shared runtime files.

rte_flow — Hardware Flow Classification
rte_flow allows applications to program the NIC's hardware to perform packet classification and queue steering in silicon — with zero CPU involvement for matched flows. Matched packets bypass RSS entirely and are sent directly to a specific queue. Non-matching packets continue through normal RSS.
HOW rte_flow WORKS
rte_flow Architecture
Application defines flow rule (generic DPDK format)
↓
rte_flow_create() — PMD validates rule
↓
PMD translates to NIC-specific hardware instructions
↓
NIC FDIR / Flow Table programmed (in NIC silicon)
↓
Packet arrives:
Matching packets → NIC classifies in hardware → steered to specific Rx queue
Non-matching packets → go through normal RSS pipeline
Key benefit: matched flows bypass RSS entirely — zero CPU for classification
Use case: steer specific traffic class to dedicated queue/lcore
Example: steer all traffic from enterprise VPN subnet → queue 0 (enterprise secondary)
steer all 5G/GTP traffic → queue 4 (mobility secondary)
FLOW RULE STRUCTURE
Three Building Blocks
- Attributes: ingress/egress, priority, group number
- Pattern (match criteria): chain of item types — ETH, IPV4, TCP, UDP, VXLAN, GTP, etc. Each item specifies field values and masks.
- Actions (what to do with matched packets): QUEUE (steer to specific Rx queue), DROP, COUNT, MARK (tag mbuf), RSS (apply RSS to matched subset), JUMP (goto another group)
// Complete rte_flow example: steer all TCP port 443 traffic → queue 0
struct rte_flow_attr attr = {
    .ingress = 1,  // match incoming packets
    .priority = 0, // highest priority
};

// Pattern: ETH / IPV4 / TCP(dport=443) / END
struct rte_flow_item_tcp tcp_spec = { .hdr.dst_port = rte_cpu_to_be_16(443) };
struct rte_flow_item_tcp tcp_mask = { .hdr.dst_port = 0xFFFF };
struct rte_flow_item pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
    { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_spec, .mask = &tcp_mask },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};

// Action: steer to queue 0
struct rte_flow_action_queue queue_action = { .index = 0 };
struct rte_flow_action actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

// Validate (check NIC supports this rule — no changes to HW)
struct rte_flow_error err;
int ret = rte_flow_validate(port_id, &attr, pattern, actions, &err);
if (ret != 0)
    rte_exit(EXIT_FAILURE, "Rule not supported: %s\n", err.message);

// Create (programs the NIC hardware)
struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
if (!flow)
    rte_exit(EXIT_FAILURE, "Flow create failed: %s\n", err.message);

// Destroy when no longer needed
rte_flow_destroy(port_id, flow, &err);
COMMON FLOW RULE PATTERNS
| Use Case | Pattern Items | Action |
|---|---|---|
| Drop all traffic from IP | ETH / IPV4(src=1.2.3.4/32) | DROP |
| Steer VoIP (UDP 5060) to dedicated queue | ETH / IPV4 / UDP(dport=5060) | QUEUE(index=2) |
| Count ARP packets | ETH(type=0x0806) | COUNT + QUEUE(index=0) |
| VXLAN tunnel traffic to specific queue | ETH / IPV4 / UDP(4789) / VXLAN(vni=100) | QUEUE(index=3) |
| GTP-U (5G traffic) to mobility queue | ETH / IPV4 / UDP(2152) / GTP(teid=X) | QUEUE(index=4) |
| Mark HTTPS packets (apply DPI only to marked) | ETH / IPV4 / TCP(dport=443) | MARK(id=1) + RSS |
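The first table row as code: a hedged sketch of a DROP rule for a /8 source range (10.0.0.0/8 here, matching Lab 9 step 6); port_id is assumed to be a configured, started port, and attr/validate/create follow the pattern of the earlier example.

// Sketch: drop everything sourced from 10.0.0.0/8 (the mask selects the top 8 bits).
// RTE_IPV4() builds a host-order address; IPv4 header fields are big-endian on the wire.
struct rte_flow_item_ipv4 ip_spec = {
    .hdr.src_addr = rte_cpu_to_be_32(RTE_IPV4(10, 0, 0, 0)),
};
struct rte_flow_item_ipv4 ip_mask = {
    .hdr.src_addr = rte_cpu_to_be_32(0xFF000000), // /8 prefix mask
};
struct rte_flow_item drop_pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec, .mask = &ip_mask },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};
struct rte_flow_action drop_actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_DROP }, // DROP takes no .conf
    { .type = RTE_FLOW_ACTION_TYPE_END },
};
struct rte_flow_attr drop_attr = { .ingress = 1 };
struct rte_flow_error drop_err;
if (rte_flow_validate(port_id, &drop_attr, drop_pattern, drop_actions, &drop_err) == 0)
    rte_flow_create(port_id, &drop_attr, drop_pattern, drop_actions, &drop_err);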
rte_flow Groups and Priority
- Priority: lower number = higher priority. If two rules match the same packet, the rule with the lower priority number wins.
- Group: rules in group 0 are evaluated first. A JUMP action sends matching packets to another group for further classification — enabling multi-level classification trees (see the sketch after this list).
- Always call rte_flow_validate() before rte_flow_create() — different NICs support different item/action combinations. Validation catches unsupported combinations before touching hardware.
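A sketch of a two-level tree using JUMP, assuming a PMD that supports non-zero groups (mlx5 does; many NICs do not): group 0 matches IPv4 broadly and jumps to group 1, where a narrower rule steers TCP/443 to queue 0. Validate both rules before creating either.

// Level 1 (group 0): all IPv4 → jump to group 1 for finer classification
struct rte_flow_attr attr_g0 = { .group = 0, .ingress = 1 };
struct rte_flow_item pat_ipv4[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};
struct rte_flow_action_jump jump = { .group = 1 };
struct rte_flow_action act_jump[] = {
    { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

// Level 2 (group 1): TCP/443 within group 1 → queue 0
struct rte_flow_attr attr_g1 = { .group = 1, .ingress = 1 };
struct rte_flow_item_tcp tcp443_spec = { .hdr.dst_port = rte_cpu_to_be_16(443) };
struct rte_flow_item_tcp tcp443_mask = { .hdr.dst_port = 0xFFFF };
struct rte_flow_item pat_tcp[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
    { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp443_spec, .mask = &tcp443_mask },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};
struct rte_flow_action_queue q0 = { .index = 0 };
struct rte_flow_action act_q0[] = {
    { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &q0 },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

struct rte_flow_error jerr;
if (rte_flow_validate(port_id, &attr_g0, pat_ipv4, act_jump, &jerr) == 0 &&
    rte_flow_validate(port_id, &attr_g1, pat_tcp, act_q0, &jerr) == 0) {
    rte_flow_create(port_id, &attr_g0, pat_ipv4, act_jump, &jerr);
    rte_flow_create(port_id, &attr_g1, pat_tcp, act_q0, &jerr);
}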
⚠️ NIC capability check: Not all NICs support all flow item/action combinations. Intel i40e supports 5-tuple exact match (FDIR). mlx5 supports a much richer flow API including VXLAN and GTP inner headers. Always call rte_flow_validate() first — it returns an error with a descriptive message if the NIC cannot implement the rule.

NUMA — Non-Uniform Memory Access
In multi-socket servers, each CPU socket has its own local RAM. Accessing memory on the same socket (local) takes ~60 ns; accessing memory on the other socket (remote) takes ~120 ns — 2× slower. DPDK makes NUMA awareness explicit throughout: every allocation API takes a socket_id parameter.
NUMA ALLOCATION RULES
| Resource | Correct Socket | Why |
|---|---|---|
| mbuf pool | rte_eth_dev_socket_id(port) | NIC DMA writes to local socket RAM — remote access doubles latency |
| rte_ring | rte_socket_id() of the producer/consumer lcore | Ring data read/written by lcores on that socket |
| rte_hash / rte_lpm | rte_socket_id() of the lookup lcore | Hash table entries accessed at line rate — remote access unacceptable |
| Rx/Tx queues | rte_eth_dev_socket_id(port) | Queue descriptors DMA'd between NIC and RAM — must be local |
// NUMA-correct mempool creation
int nic_socket = rte_eth_dev_socket_id(port_id); // NIC's NUMA node (-1 if unknown)
struct rte_mempool *pool = rte_pktmbuf_pool_create(
    "MBUF_POOL", N_MBUFS, CACHE_SZ, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
    nic_socket); // MUST match NIC's socket

// Check lcore-to-socket alignment
unsigned lcore_id;
RTE_LCORE_FOREACH_WORKER(lcore_id) {
    unsigned lcore_socket = rte_lcore_to_socket_id(lcore_id);
    if (lcore_socket != nic_socket)
        printf("WARNING: lcore %u on socket %u, NIC on socket %u — cross-NUMA!\n",
               lcore_id, lcore_socket, nic_socket);
}
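The same socket_id discipline applies to rings and hash tables. A sketch, where worker_lcore_id, the object names, and the sizes are illustrative (headers: rte_ring.h, rte_hash.h, rte_jhash.h):

// Place the ring and hash table on the socket of the lcores that touch them
unsigned worker_socket = rte_lcore_to_socket_id(worker_lcore_id);

struct rte_ring *work_ring = rte_ring_create("RX_TO_WORKER", 4096,
    worker_socket, RING_F_SP_ENQ | RING_F_SC_DEQ); // single producer/consumer

struct rte_hash_parameters hash_params = {
    .name = "FLOW_TABLE",
    .entries = 1 << 20,
    .key_len = 13,                   // 5-tuple: 2×IPv4 + 2×port + proto
    .hash_func = rte_jhash,
    .socket_id = (int)worker_socket, // lookups run on this socket's lcores
};
struct rte_hash *flow_table = rte_hash_create(&hash_params);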
CACHE-LINE ALIGNMENT & FALSE SHARING
False Sharing — The Hidden Serializer
When two different variables on the same cache line (64 bytes) are written by different cores, every write invalidates the other core's cached copy — causing cache coherency traffic even though the cores access different variables. This can reduce throughput by 10–100×.

// WRONG — counter and flag on same cache line → false sharing
struct per_core_data {
    uint64_t rx_count; // 8 bytes
    uint64_t tx_count; // 8 bytes
    int running;       // 4 bytes — on same 64-byte line!
} cores[RTE_MAX_LCORE]; // 20-byte entries: core 0 and core 1 share a cache line

// CORRECT — pad each entry to a full cache line
struct per_core_data {
    uint64_t rx_count;
    uint64_t tx_count;
    int running;
    uint8_t _pad[64 - sizeof(uint64_t)*2 - sizeof(int)]; // pad to 64 bytes
} __rte_cache_aligned cores[RTE_MAX_LCORE]; // each core gets its own cache line
📌 __rte_cache_aligned is a DPDK macro that expands to __attribute__((aligned(RTE_CACHE_LINE_SIZE))). Always use it for per-lcore data structures to prevent false sharing.

Q: What is the difference between primary and secondary DPDK processes?
Primary creates and owns all shared resources: hugepage memory, named mempools, rings, and NIC configuration. Secondary attaches to the primary's existing memory via --proc-type=secondary and a matching --file-prefix, finds named objects via the lookup APIs, and can use NIC queues but cannot reconfigure the NIC. The primary must be started first.
Q: What happens to secondary processes if the primary exits?
The shared hugepage memory is unmapped by the OS when the primary exits. Secondary processes lose access to all shared mempools, rings, and hash tables; any access to those objects then segfaults. Production systems should monitor primary health and gracefully shut down secondaries before the primary exits.

Q: Why must --file-prefix match between primary and secondary?
DPDK uses the file prefix to name its shared runtime files (e.g., /var/run/dpdk/sase/config on recent releases). The primary creates these files; the secondary maps them. If the prefixes differ, the secondary looks for a different set of runtime files, finds no mempools or rings, and fails to start.
Q: What is rte_flow and how does it differ from RSS?
RSS distributes packets across queues by hashing the 5-tuple — the NIC computes a hash and uses a lookup table (RETA) to pick the queue. rte_flow programs the NIC to match specific packet fields (exact values + masks) and steer matching packets directly to a specific queue — bypassing RSS entirely. rte_flow is more precise (5-tuple, VLAN, VXLAN VNI, GTP TEID…) but consumes NIC hardware resources (FDIR table entries). RSS is always-on and handles all traffic; rte_flow handles specific classified flows.

Q: What is false sharing and how does DPDK prevent it?
False sharing occurs when two cores write to different variables that happen to reside on the same 64-byte cache line. Each write forces the cache line to be transferred between cores via the coherency protocol — causing serialization even though the cores are touching different data. DPDK prevents this by padding per-lcore data structures to 64 bytes using __rte_cache_aligned, ensuring each core's data occupies its own cache line.
Q: Why must mempool be allocated on the NIC's NUMA socket?
NIC DMA writes packet data into mbuf buffers. If those buffers are on the remote NUMA socket, every DMA write crosses the QPI/UPI interconnect — ~120 ns instead of ~60 ns. At 100G/64B (148 Mpps), the interconnect bandwidth becomes the bottleneck. NUMA-local allocation keeps DMA writes on the same socket as the NIC → no interconnect crossing → maximum throughput.

🔥 Lab 8: Multi-Process SASE-DP Skeleton
Build a minimal primary/secondary DPDK application that mirrors the Jio SASE-DP architecture: primary owns the NIC, enterprise secondary handles traffic on queues 0-1.
1. Primary: rte_eal_init() with --proc-type=primary --file-prefix=sase. Configure the NIC with 4 queues. Create named mempool "MBUF_POOL" and ring "RX_TO_ENTERPRISE" (a minimal sketch of steps 1-2 follows this list).
2. Primary RX loop: rx_burst on queues 0-1 → enqueue to the "RX_TO_ENTERPRISE" ring.
3. Secondary: rte_eal_init() with --proc-type=secondary --file-prefix=sase. Look up "MBUF_POOL" and "RX_TO_ENTERPRISE".
4. Secondary process loop: dequeue from ring → process (print src IP) → free mbuf.
5. Run primary and secondary in separate terminals — verify packets flow through.
6. Kill the secondary — verify the primary keeps running. Kill the primary — observe secondary behavior (segfault or graceful exit).
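One possible shape for steps 1-2 on the primary side (port configuration, startup, and error handling omitted; pool and burst sizes are illustrative):

// Step 1: create the named shared objects (mempool on the NIC's socket)
struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
    8191, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_eth_dev_socket_id(port_id));
struct rte_ring *ring = rte_ring_create("RX_TO_ENTERPRISE",
    1024, rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);

// Step 2: RX loop — queues 0-1 feed the enterprise secondary via the ring
struct rte_mbuf *bufs[32];
for (;;) {
    for (uint16_t q = 0; q <= 1; q++) {
        uint16_t n = rte_eth_rx_burst(port_id, q, bufs, 32);
        if (n == 0)
            continue;
        unsigned sent = rte_ring_enqueue_burst(ring, (void **)bufs, n, NULL);
        while (sent < n) // ring full: drop the remainder
            rte_pktmbuf_free(bufs[sent++]);
    }
}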
🔥 Lab 9: rte_flow Hardware Classifier
Program the NIC to steer specific traffic to queue 0 and observe zero-CPU classification.
1. Configure the port with 4 queues and start the device.
2. Call rte_flow_validate() for a TCP/443 rule — check that the NIC supports it.
3. Create the flow rule: ETH / IPV4 / TCP(dport=443) → QUEUE(0).
4. Send mixed traffic: HTTPS (443), HTTP (80), DNS (53).
5. Verify only HTTPS packets appear on queue 0; HTTP/DNS go through RSS to other queues (see the counting sketch after this list).
6. Add a second rule: DROP all traffic from 10.0.0.0/8 (test with spoofed packets).
7. Destroy the rules and verify traffic reverts to pure RSS distribution.
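For step 5, one simple way to see where packets land is to tally arrivals per queue in the RX loop (rx_q_counts is illustrative; some PMDs also expose per-queue counters via rte_eth_stats_get()):

// Poll all 4 queues and count arrivals per queue
uint64_t rx_q_counts[4] = { 0 };
struct rte_mbuf *bufs[32];
for (;;) {
    for (uint16_t q = 0; q < 4; q++) {
        uint16_t n = rte_eth_rx_burst(port_id, q, bufs, 32);
        rx_q_counts[q] += n;
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);
    }
    // Expect HTTPS only in rx_q_counts[0]; HTTP/DNS spread over 1-3 by RSS
}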
MASTERY CHECKLIST
- Can explain primary vs secondary: who creates, who looks up, who can't reconfigure NIC
- Can explain what --file-prefix does and what happens if primaries/secondaries use different prefixes
- Can write a secondary process that finds a named mempool and ring created by a primary
- Can explain what rte_flow does that RSS cannot
- Can write a complete rte_flow rule with pattern + action + validate + create + destroy
- Can explain NUMA remote memory access penalty and how to avoid it
- Can explain false sharing and demonstrate the __rte_cache_aligned fix
- Can identify the NUMA socket for a given NIC port and allocate resources on it