Module 15 — Hyperscan: Pattern Compilation

Requires Hyperscan (libhs) installed. See setup instructions below.

What you learn

How to compile Hyperscan pattern databases in both regex (hs_compile_multi) and literal (hs_compile_lit_multi) modes, using the parseFlags(), parseFile(), and hs_create_db() implementations from domain_scan.c in the DP application. The module also covers querying database info, serializing a database for persistence, and handling compile errors.


Hyperscan’s role in the DP application

Two databases, two purposes:

1. domainsPatternDB (global, regex):
   Compiled once at startup from patterns.txt + patterns2.txt.
   Patterns match the structure of TLS ClientHello (SNI extension header),
   HTTP Host headers, and IP addresses in URLs.
   IDs: TLS=1, HTTP_IPV4=2, HTTP_DOMAIN=3, HTTP_IPV6=4

2. group->database (per-group, literal):
   Compiled once per enterprise group when policy syncs from Kafka.
   Contains exact domain names: "google.com", "malware.ru", etc.
   Used as the Hyperscan fallback when rte_hash exact lookup misses.

Setup

# RedHat 8 / Rocky Linux 8
dnf install hyperscan hyperscan-devel

# Ubuntu 22.04+
apt-get install libhyperscan-dev

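# From source (latest version):
git clone https://github.com/intel/hyperscan
cd hyperscan && mkdir build && cd build
# -G Ninja is required; without it cmake generates Makefiles and ninja finds no build.ninja
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..
ninja && sudo ninja install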

Build and run

make
./hs_compile

Files

File                  Purpose
hs_compile.c          parseFlags, parseFile, hs_create_db, 6 demos
sample_patterns.txt   Example patterns.txt in the real format (ID:/pattern/flags)
Makefile              Links with -lhs (pkg-config aware)

Key concepts

1. hs_compile_multi vs hs_compile_lit_multi

/* REGEX mode — for patterns.txt (TLS/HTTP patterns) */
hs_compile_multi(
    patterns,    /* char *[] of regex strings */
    flags,       /* unsigned[] of HS_FLAG_* per pattern */
    ids,         /* unsigned[] of IDs per pattern */
    count,
    HS_MODE_BLOCK,
    NULL,        /* platform: NULL = current CPU */
    &db, &err
);

/* LITERAL mode — for domain policy (no regex engine) */
hs_compile_lit_multi(
    patterns,          /* char *[] of byte strings */
    flags,
    ids,
    lens,              /* size_t[] — length of each literal */
    count,
    HS_MODE_BLOCK,
    NULL,
    &db, &err
);

When to use literal: domain names are exact strings, not patterns. In literal mode every byte is matched as-is, so the "." in "google.com" matches only a dot and needs no escaping. Literal mode is also significantly faster to compile and scan because Hyperscan uses SIMD byte-comparison algorithms instead of building a full NFA/DFA. For 10,000 domain literals, literal compile takes roughly 50 ms versus roughly 2 s for regex.
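The parallel arrays that hs_compile_lit_multi expects can be prepared as in the sketch below. It is a minimal illustration, not the domain_scan.c code: demo_fill_literal_arrays and its first_id parameter are hypothetical, and the fallback HS_FLAG_* definitions only exist so the sketch builds without libhs (real code includes <hs/hs.h>).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fallback definitions so the sketch builds without libhs.
 * Real code gets these from <hs/hs.h>. */
#ifndef HS_FLAG_CASELESS
#define HS_FLAG_CASELESS    1
#define HS_FLAG_SINGLEMATCH 8
#endif

/* Hypothetical helper: fill the flags/ids/lens arrays for `count` domains.
 * IDs are assigned sequentially from first_id (an assumption; the real
 * application may use a different ID scheme). */
static void demo_fill_literal_arrays(const char *const *domains, unsigned count,
                                     unsigned first_id, unsigned *flags,
                                     unsigned *ids, size_t *lens)
{
    for (unsigned i = 0; i < count; i++) {
        assert(domains[i] != NULL);
        flags[i] = HS_FLAG_CASELESS | HS_FLAG_SINGLEMATCH;
        ids[i]   = first_id + i;
        lens[i]  = strlen(domains[i]);  /* length excludes the NUL terminator */
    }
}
```

The filled arrays are then passed as the flags, ids, and lens arguments of hs_compile_lit_multi, with the domains array as the patterns argument. Passing explicit lengths is what allows literals to contain embedded zero bytes.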

2. HS_FLAG_SINGLEMATCH — the most important performance flag

flags[i] = HS_FLAG_CASELESS | HS_FLAG_SINGLEMATCH;

Without SINGLEMATCH: Hyperscan fires the match callback for every end offset at which the pattern matches, so one pattern can fire many times in a single buffer.

With SINGLEMATCH: Hyperscan fires at most once per pattern per scan. Since the DP application only needs to know IF a pattern matched, SINGLEMATCH eliminates redundant callbacks and is always used.

3. Pattern IDs — the callback dispatch mechanism

/* In the on_hs_match callback (domain_scan.c): */
int on_hs_match(unsigned int id, unsigned long long from,
              unsigned long long to, unsigned int flags, void *ctx)
{
    switch (id) {
    case HS_PATTERN_ID_TLS:         /* 1 */
        /* read SNI at from+7 / from+9 (Module 07 pattern) */
        break;
    case HS_PATTERN_ID_HTTP_DOMAIN: /* 3 */
        /* extract Host: header domain */
        break;
    }
    return 0;  /* 0 = continue scanning, non-zero = stop */
}
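A callback usually needs to hand its result back to the caller, which is what the ctx pointer is for. The sketch below assumes a hypothetical struct demo_match_ctx and DEMO_ID_* constants (using the ID values listed earlier), and it stops at the first recognized match, which is a design assumption rather than what domain_scan.c necessarily does. The function has the same parameter list as a Hyperscan match callback but does not need libhs to compile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; values follow the ID list given for domainsPatternDB. */
enum {
    DEMO_ID_TLS         = 1,
    DEMO_ID_HTTP_IPV4   = 2,
    DEMO_ID_HTTP_DOMAIN = 3,
    DEMO_ID_HTTP_IPV6   = 4
};

/* Hypothetical per-scan result, passed to the scan call as the context. */
struct demo_match_ctx {
    unsigned matched_id;        /* 0 = nothing matched yet */
    unsigned long long end;     /* offset just past the end of the match */
};

static int demo_on_match(unsigned int id, unsigned long long from,
                         unsigned long long to, unsigned int flags, void *ctx)
{
    struct demo_match_ctx *m = ctx;
    (void)from;   /* start offset is only reported when SOM is requested */
    (void)flags;  /* currently unused by Hyperscan */
    assert(m != NULL);

    switch (id) {
    case DEMO_ID_TLS:
    case DEMO_ID_HTTP_IPV4:
    case DEMO_ID_HTTP_DOMAIN:
    case DEMO_ID_HTTP_IPV6:
        m->matched_id = id;
        m->end = to;
        return 1;   /* non-zero: stop scanning, first classification wins */
    default:
        return 0;   /* unknown ID: keep scanning */
    }
}
```

When a callback returns non-zero, hs_scan returns HS_SCAN_TERMINATED rather than HS_SUCCESS, so the caller should treat that value as a normal outcome and not as a failure.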

4. parseFile — pattern file format

# Comment
ID:/regex_or_literal/flags

Example:
4:/\x00\x00\x00\x00\x00/H
6:/Host: [a-zA-Z0-9._-]+/iH
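Splitting one line of this format can be sketched as follows. This is not the parseFile from domain_scan.c: demo_parse_line and demo_parse_flags are hypothetical, and the flag letters (i, m, s, H) are assumed to follow the convention of Hyperscan's own example programs. The fallback HS_FLAG_* definitions let the sketch build without libhs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fallback definitions so the sketch builds without libhs. */
#ifndef HS_FLAG_CASELESS
#define HS_FLAG_CASELESS    1
#define HS_FLAG_DOTALL      2
#define HS_FLAG_MULTILINE   4
#define HS_FLAG_SINGLEMATCH 8
#endif

/* Translate a flag string such as "iH" into a bitmask.
 * Returns 0 on success, -1 on an unknown flag letter. */
static int demo_parse_flags(const char *s, unsigned *out)
{
    unsigned f = 0;
    for (; *s != '\0'; s++) {
        switch (*s) {
        case 'i': f |= HS_FLAG_CASELESS;    break;
        case 'm': f |= HS_FLAG_MULTILINE;   break;
        case 's': f |= HS_FLAG_DOTALL;      break;
        case 'H': f |= HS_FLAG_SINGLEMATCH; break;
        case '\r': case '\n':               break;  /* ignore line ending */
        default:  return -1;
        }
    }
    *out = f;
    return 0;
}

/* Split "ID:/pattern/flags". Returns 1 if parsed, 0 for a blank or
 * comment line, -1 if the line is malformed or the pattern does not fit. */
static int demo_parse_line(const char *line, unsigned *id,
                           char *pat, size_t pat_cap, unsigned *flags)
{
    assert(line != NULL && pat != NULL);
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\r' || line[0] == '\0')
        return 0;

    char *end;
    unsigned long v = strtoul(line, &end, 10);
    if (end == line || end[0] != ':' || end[1] != '/')
        return -1;

    const char *start = end + 2;
    const char *last = strrchr(start, '/');  /* pattern may itself contain '/' */
    if (last == NULL)
        return -1;

    size_t n = (size_t)(last - start);
    if (n == 0 || n >= pat_cap)
        return -1;
    memcpy(pat, start, n);
    pat[n] = '\0';

    if (demo_parse_flags(last + 1, flags) != 0)
        return -1;
    *id = (unsigned)v;
    return 1;
}
```

Searching for the last slash rather than the next one keeps patterns that contain "/" intact, at the cost of requiring that flag strings never contain a slash.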

5. Serialization for fast restart

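/* After compilation: save to disk.
 * hs_serialize_database() allocates the buffer and reports its size. */
char *buf = NULL;
size_t sz = 0;
if (hs_serialize_database(db, &buf, &sz) == HS_SUCCESS) {
    fwrite(buf, 1, sz, fp);
    free(buf);  /* malloc'd unless hs_set_misc_allocator() was used */
}

/* On restart: load instead of recompiling.
 * sz must be the size of the saved file. */
buf = malloc(sz);
if (buf != NULL && fread(buf, 1, sz, fp) == sz &&
    hs_deserialize_database(buf, sz, &db) == HS_SUCCESS) {
    /* db is ready to use */
}
free(buf);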

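For the global DB with 100+ patterns, compilation takes ~500ms. With serialization, restart takes ~5ms (essentially a memory copy). A serialized database is tied to the Hyperscan version and target platform that produced it, so if hs_deserialize_database() fails after an upgrade or on different hardware, fall back to recompiling.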
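The file side of this round trip can be sketched without libhs by treating the serialized database as an opaque byte buffer. demo_save_blob and demo_load_blob are hypothetical helpers, not part of hs_compile.c; the point they illustrate is that the loader has to recover the size from the file itself, since nothing else records it across a restart.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write sz bytes to path. Returns 0 on success, -1 on failure. */
static int demo_save_blob(const char *path, const char *buf, size_t sz)
{
    assert(path != NULL && buf != NULL);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(buf, 1, sz, fp);
    int closed = fclose(fp);
    return (written == sz && closed == 0) ? 0 : -1;
}

/* Read a whole file into a malloc'd buffer and store its size in *sz.
 * Returns NULL on failure or for an empty file; the caller frees. */
static char *demo_load_blob(const char *path, size_t *sz)
{
    assert(path != NULL && sz != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    char *buf = NULL;
    long len = 0;
    if (fseek(fp, 0, SEEK_END) == 0 && (len = ftell(fp)) > 0 &&
        fseek(fp, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)len);
        if (buf != NULL && fread(buf, 1, (size_t)len, fp) != (size_t)len) {
            free(buf);      /* short read: treat the file as unusable */
            buf = NULL;
        }
    }
    fclose(fp);
    if (buf != NULL)
        *sz = (size_t)len;
    return buf;
}
```

The buffer and size returned by demo_load_blob are what hs_deserialize_database takes as its first two arguments; a NULL return is the signal to compile from the pattern files instead.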

6. Compile error handling

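hs_compile_error_t *err = NULL;
hs_error_t r = hs_compile_multi(..., &db, &err);

if (r != HS_SUCCESS) {
    if (err) {
        /* err->expression is the zero-based index of the failing pattern
         * in the input arrays (not its ID); it is negative when the error
         * is not tied to a single pattern. */
        LOG_ERROR("Pattern index %d failed: %s", err->expression, err->message);
        hs_free_compile_error(err);  /* MANDATORY: free the error struct */
    }
    return -1;
}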

7. HS_MODE_BLOCK — always use this in the DP application

HS_MODE_BLOCK:    Scan a complete buffer at once.
                  Most efficient when the whole payload is already in one buffer, as with a single packet.

HS_MODE_STREAM:   Data arrives in chunks (e.g., TCP stream reassembly).
                  Would be needed if scanning fragmented DNS over TCP.

HS_MODE_VECTORED: Scan multiple non-contiguous buffers in one call.
                  Not used in the DP application.

Next module

Module 16 — Hyperscan: Scratch + Scan: Allocate scratch space (hs_alloc_scratch), clone per-lcore scratch (hs_clone_scratch), and call hs_scan() with the onMatch callback.

