RE-AI/ANTI-TAMPER-TAXONOMY.md
John Smith 895514bd93 docs(anti-tamper-taxonomy): Pattern B references license-activation bucket
Post-run follow-up to the 2026-06-06-r01 stress test
(Output/2026-06-06-r01/gap-analysis.md). The C1 catalog refactor
split 'activation' into 'ue-component-activation' and
'license-activation'; ANTI-TAMPER-TAXONOMY.md's Pattern B fire
rule was still reading 'activation.count' which now points to
the (much smaller) license-activation bucket. The 615
false-positive hits in P3R.exe's UE component vocabulary no
longer trip the Pattern B threshold of 50 strings.

CHANGELOG.md [2.5.1] entry: full release notes for the Cycle 2
post-run follow-up (14 tool-bug fixes + 6 catalog refactors
+ 1 new leak category + 1 KSY backport, no new MCP servers,
no new skills).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 15:57:00 -04:00


Anti-Tamper Taxonomy

Status: Public-facing reference. Vendor-neutral. Audience: Reverse engineers using RE-AI on binaries wrapped in anti-tamper or VM-based protection.

What this document is

RE-AI's data/drm-indicators.yaml catalog records the observable signatures of anti-tamper and VM-based protection schemes — section names, import sets, byte patterns, PEB reads — and the categories of protection those signatures suggest. This document explains the taxonomy: what categories the toolchain recognizes, the inference chain from a binary's observable features to a category, and the negative space (what RE-AI explicitly does not do).

The catalog is vendor-neutral. The patterns in data/drm-indicators.yaml and the skills in skills/ describe categories of protection (encrypted-VM bytecode, MBA-obfuscated arithmetic, legacy disc-based protection, etc.) — not specific commercial products. The user supplies vendor attribution based on their context.

The categories

Pattern C — encrypted-VM bytecode interpreter (proprietary-engine target)

Added 2026-06-05 per the Sample B finding. The proprietary-engine target family uses a distinct section set (.arch, .link, .xcode, .xtext, .sbss) and the encrypted body lives in .rodata (often 100x the size of .text).

  • Section table: .arch / .link / .xcode / .xtext / .sbss is the marker. The encrypted body is in .rodata (high entropy, large size), not in a .vmp0 / .xtls / .ecode style dedicated section.
  • Real native code: .text is small and normal-entropy. The decrypted dispatch + handler body is what runs at runtime.
  • Anti-debug: standard PEB.BeingDebugged / NtQueryInformationProcess / RDTSC / CPUID patterns, but wrapped by the proprietary engine's own anti-tamper stub (not a vendor-attributed wrapper).
  • Distinct from Pattern A: Pattern A is the Unity IL2CPP target variant, with a GameAssembly.dll + global-metadata.dat pairing, the .xtls / .didata / .ecode / .xdata / .xpdata / .udata / .00cfg section set, and the encrypted body in .xtls (the highest-entropy region, 7.85+ entropy). Pattern C is the proprietary-engine target variant: no GameAssembly.dll, no .xtls-style dedicated section, encrypted body in .rodata.

The pattern_indicators.mappings entry "encrypted-VM bytecode, proprietary-engine target" carries a confidence: Medium-High because the .arch / .link / .xcode / .xtext / .sbss section set is rare outside this family — but the encrypted body in .rodata is also seen in packed-but-not-protected binaries, so confirm with a dispatcher + lazy-decrypt-stub detection before publishing.
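The static half of the Pattern C check can be sketched as follows. This is a minimal sketch, not the catalog's actual rule: the function name, the input shape (a plain dict of section name to raw size), and both thresholds are assumptions made for illustration. It covers only the section set and the .rodata-to-.text size ratio; the dispatcher and lazy-decrypt-stub confirmation still has to happen separately.

```python
# Section names taken from the Pattern C description above.
PATTERN_C_SECTIONS = {".arch", ".link", ".xcode", ".xtext", ".sbss"}


def pattern_c_candidate(sections, min_markers=4, min_ratio=10.0):
    """Flag a binary as a Pattern C candidate.

    sections: dict mapping section name -> raw size in bytes (assumed shape).
    min_markers / min_ratio: illustrative thresholds, not catalog values.
    """
    markers = PATTERN_C_SECTIONS & set(sections)
    text_size = sections.get(".text", 0)
    rodata_size = sections.get(".rodata", 0)
    if text_size:
        ratio = rodata_size / text_size
    else:
        # No .text at all: treat any .rodata as infinitely larger.
        ratio = float("inf") if rodata_size else 0.0
    return {
        "markers": sorted(markers),
        "rodata_to_text": ratio,
        "candidate": len(markers) >= min_markers and ratio >= min_ratio,
    }
```

A hit here is only a prompt for the follow-up check, because a large high-entropy .rodata also shows up in packed-but-not-protected binaries.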

Pattern D — publisher telemetry pipeline leak

Added 2026-06-05 per the Sample A + B findings. This is an attack-surface category, not a protection category. The binary's string table contains publisher operational infrastructure that has no business being shipped:

  • Sentry DSN: https://<public-key>@<host>/<project-id> form. The public key alone is enough to submit forged crash reports.
  • Logstash / log-ingestion URL: internal observability endpoint. Often POST-only, but the URL leaks the host.
  • Confluence wiki page: internal engineering docs / secrets. Often link-only, but still information disclosure.
  • Google Drive document URL: publisher-internal design docs (the bulk of Sample A's 16,236 string matches).
  • Long-lived credentials: AWS access key IDs (AKIA…, ASIA…), Slack tokens (xox[bpaeors]-…).

Detected by re-leak-scan (regex pass over the string table) and bucketed into string_categories::telemetry_leak in re-lief.categorize_strings. The re-telemetry-extract skill adds an active HTTP probe (verify_sentry_dsn, verify_confluence_url) to confirm each endpoint is still live.
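A regex pass of this kind might look like the sketch below. It is not the re-leak-scan implementation. The pattern names, the scan_strings helper, and the exact regexes are assumptions: the Sentry pattern assumes a 32-hex-character public key, and the Slack pattern assumes a tail of at least ten token characters after the documented xox[bpaeors]- prefix.

```python
import re

# Illustrative patterns only; the real catalog regexes may differ.
LEAK_PATTERNS = {
    "sentry_dsn": re.compile(r"https://[0-9a-f]{32}@[A-Za-z0-9.-]+/\d+"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "slack_token": re.compile(r"\bxox[bpaeors]-[0-9A-Za-z-]{10,}"),
}


def scan_strings(strings):
    """Return (kind, matched_text) pairs for every leak-shaped substring."""
    findings = []
    for s in strings:
        for kind, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(s):
                findings.append((kind, match.group(0)))
    return findings
```

The sketch is purely passive. It never contacts the endpoints it finds, which keeps it separate from the active probe step that re-telemetry-extract adds.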

This is not a DRM / anti-tamper pattern — the encrypted-VM bytecode wrapper does not prevent these leaks because the URL strings are typically not encrypted. The skill output should report the leak as a separate finding, not bundle it with the encrypted-VM finding.

| Category | Description | Recognizable by |
| --- | --- | --- |
| encrypted-VM bytecode interpreter | The binary's real x86 code is replaced by a register-based VM; a dispatcher fetches handlers from a table. | High-entropy encrypted-TLS section; tiny .text; massive .idata or similar; ordinal-only ws2_32 imports; HWID-vector API set; PDB filename with a vendor tag |
| MBA-obfuscated arithmetic | Arithmetic operations rewritten using bitwise identities to defeat pattern matching. | re-triton.solve_constraint finds Z3 proofs for mba == original |
| legacy disc-based protection | Old-style CD/DVD check or kernel-driver protection. | Section names with securom / .sdc; co-located *.sys drivers; high-frequency DeviceIoControl |
| hardware-fingerprinting routine | Static imports of HWID-vector APIs, regardless of whether the binary is also VM-pack-wrapped. | Imports ≥ 2 of {CryptAcquireContextW, CryptGenRandom, GetUserNameW, GetAdaptersAddresses, GetNetworkParams, GetComputerNameW, GetVolumeInformationW, NtQuerySystemInformation}; direct PEB reads |
| anti-debug / anti-emulation | Static or dynamic checks for a debugger / VM / sandbox. | IsDebuggerPresent / CheckRemoteDebuggerPresent / OutputDebugString imports; 0F 31 (RDTSC) / CD 2D (INT 2D) / many CC (int3) byte patterns; KUSER_SHARED_DATA reads at TickCountLow / InterruptTime |
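As an illustration of the "Imports ≥ 2 of {…}" rule in the hardware-fingerprinting row, the check reduces to a set intersection. The sketch below assumes the import table has already been flattened to a list of API names; the function name and return shape are hypothetical.

```python
# API set copied from the hardware-fingerprinting row of the table.
HWID_VECTOR_APIS = frozenset({
    "CryptAcquireContextW", "CryptGenRandom", "GetUserNameW",
    "GetAdaptersAddresses", "GetNetworkParams", "GetComputerNameW",
    "GetVolumeInformationW", "NtQuerySystemInformation",
})


def hwid_import_hits(imports, threshold=2):
    """Return (sorted matching APIs, whether the threshold is met)."""
    hits = sorted(HWID_VECTOR_APIS & set(imports))
    return hits, len(hits) >= threshold
```

The direct-PEB-read half of the row is not covered here, since it needs disassembly rather than an import list.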

Recognizing the patterns in arbitrary binaries

The categories in the table above are abstract — they describe a shape, not a feature you can grep for. This section shows the observable composition that a future agent should be able to recognize from first principles on a binary it has never seen before, using only the catalog and the MCP tools. Every observation below is generic — it describes a category, not a specific commercial product.

Pattern A — encrypted-VM bytecode interpreter (Unity IL2CPP target)

A register-based bytecode VM that has replaced the binary's real x86 code. The observable composition that fires together (all seven are diagnostic; any four is a strong signal):

  1. The PE's section table contains at least four of the seven section-name regexes \.xtls, \.didata, \.ecode, \.xdata, \.xpdata, \.udata, \.00cfg (defined in data/drm-indicators.yaml::section_indicators.rules). The .xtls section is typically the highest-entropy region (entropy 7.85+).
  2. The largest code-bearing section violates W^X: it carries CNT_CODE | MEM_EXECUTE | MEM_READ | MEM_WRITE permissions simultaneously, so it is both writable and executable. A 100+ MB .idata carrying all four flags is the canonical example.
  3. The canonical .text section has virtual_size >> raw_size (e.g. 2.2 MB virtual, 512 bytes raw on disk). This is the large_section_with_tiny_text rule.
  4. A small (under 200 bytes) .ecode section sits at the PE entry point and contains a lazy-decrypt stub — a 2-instruction walk over the bytecode range that fires on first call, not at load time, gated by a one-byte "done" flag in the section.
  5. The PE debug directory references a PDB filename that embeds a vendor tag (a name fragment that's not the binary's own basename). Vendor-neutral translation: presence of any non-matching tag in the PDB reference is the signal.
  6. The exports table ends with a single late-bound entry — a stub the game calls after the interpreter is initialized. The interpreter is "armed but inert" until this export returns.
  7. The import table shows 8+ of the 12 APIs in drm-indicators.yaml::hwid_apis.high_signal — the fingerprint-vector set is unusual for a non-DRM Unity IL2CPP game.

When all seven fire, the confidence is Medium-High for the encrypted-VM bytecode interpreter category. re-lief.categorize_strings will populate the obfuscation bucket (with the dispatch, handler, lookup, vm_entry keywords) and the hwid bucket with the imported APIs.
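Two of the observations above, the section entropy in item 1 and the writable-and-executable code section in item 2, are easy to compute by hand. The sketch below is one way to do it, assuming the section bytes and the raw Characteristics field have already been read from the PE header. The flag values are the standard IMAGE_SCN_* constants from the PE format; the function names are illustrative.

```python
import math
from collections import Counter

# Standard PE section characteristic flags.
IMAGE_SCN_CNT_CODE = 0x00000020
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

RWX_CODE = (IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_EXECUTE
            | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE)


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, 8.0 maximum)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())


def is_rwx_code(characteristics: int) -> bool:
    """True when a section is code and readable, writable, and executable."""
    return (characteristics & RWX_CODE) == RWX_CODE
```

An ordinary .text section (code, execute, read) has characteristics 0x60000020 and does not trip is_rwx_code; adding the write bit gives 0xE0000020, which does.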

Pattern B — hardware-fingerprinting routine + anti-debug, in a third-party launcher activation library

A small native DLL sitting alongside the main game binary, gating launch on a license-server round-trip + host fingerprint. The observable composition that fires together:

  1. A small (1-3 MB) native DLL with ordinal-only exports (@100, @101 — no symbol names). Exports are deliberately stripped.
  2. The launcher .exe imports only 2-3 ordinals from this DLL (entry point + setup/teardown). Nothing else. The DLL is opaque to the launcher.
  3. The activation DLL statically links a recognizable crypto library — the catalog's signal is the .\crypto\... path fragments (1,000+ of them in .rdata). OpenSSL is the most common (look for EVP_*, RSA_*, X509*, PKCS*, BIO_*, PEM_* substrings). re-lief.categorize_strings populates the crypto bucket with 500+ matches on a 3 MB binary.
  4. The import table shows WinHTTP (WinHttpOpen, WinHttpConnect, WinHttpOpenRequest, WinHttpSendRequest, WinHttpReceiveResponse, WinHttpQueryHeaders, WinHttpReadData) plus the X.509 / Authenticode APIs (CryptQueryObject, PFXImportCertStore, WinVerifyTrust). The network bucket populates accordingly.
  5. The import table shows 8+ of the 12 APIs in drm-indicators.yaml::hwid_apis.high_signal (GetComputerNameW, GetUserNameW, GetVolumeInformationW, CryptAcquireContextW, CryptGenRandom, GetAdaptersAddresses, etc.). The hwid bucket populates accordingly.
  6. The import table shows the catalog's anti-debug primitives (IsDebuggerPresent, OutputDebugStringW, NtQueryInformationProcess). The anti_debug bucket populates (Cycle 2 fix 2026-06-06: each anti_debug_indicators .checks[] entry now carries a confirmation: field of import_only / requires_disasm / requires_xref; the categorizer drops string-table hits that aren't backed by an import or disasm confirmation, eliminating the 48+ false positives on UE / Unity binaries that the prior string-only-equal filter produced). Important: the anti-debug surface is split between the activation DLL and the encrypted-VM-wrapped game DLL — typically the activation DLL has the Win32 anti-debug APIs and the game DLL has the VM-encrypted anti-debug.
  7. The strings dump shows the license-activation and obfuscation categories from re-lief.categorize_strings with non-trivial counts (typically 50-200 strings each on a 3 MB binary). (Cycle 2 fix 2026-06-06: the prior activation bucket was split into ue-component-activation (Unity/UE component-lifecycle noise) and license-activation (the real license-gate vocabulary); Pattern B now reads license-activation.count, not activation.count.)

When all seven fire, the confidence is Medium-High for the hardware-fingerprinting routine + anti-debug category layered with a third-party launcher activation library. The activation library is a separate layer from the main game DLL; the encrypted-VM interpreter does the game-DLL work, the activation DLL does the license-gate work, and the launcher .exe is the glue.
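Items 1 and 3 of the list above can be approximated with two small helpers. This is a sketch under assumptions: the export table is taken to be a list of (ordinal, name) pairs with name set to None for unnamed exports, and the marker tuple is simply the substrings named in item 3, not the catalog's crypto vocabulary.

```python
# Substrings named in item 3: OpenSSL API prefixes plus the source-path fragment.
CRYPTO_MARKERS = ("EVP_", "RSA_", "X509", "PKCS", "BIO_", "PEM_", ".\\crypto\\")


def count_crypto_strings(strings):
    """Count strings that contain at least one crypto-library marker."""
    return sum(1 for s in strings if any(m in s for m in CRYPTO_MARKERS))


def ordinal_only_exports(exports):
    """True when there is at least one export and none of them has a name.

    exports: list of (ordinal, name_or_None) pairs (assumed shape).
    """
    return bool(exports) and all(name is None for _, name in exports)
```

An empty export table returns False on purpose, so a DLL with no exports at all is not mistaken for a stripped activation library.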

How to detect the patterns

The MCP tool re-lief.categorize_strings drives the static detection. Call it on every DLL and the launcher .exe in the target. The categorizer buckets strings into {anti_debug, hwid, crypto, network, registry, process, file, fingerprint, ue-component-activation, license-activation, obfuscation, telemetry_leak, misc} using the keyword vocabularies in data/drm-indicators.yaml::string_categories. The two seed categories (anti_debug, hwid) inherit their keyword lists from the existing anti_debug_indicators.checks[].name and hwid_apis.high_signal[].api lists via a seed_from: YAML pointer. When a future agent adds a new HWID API to hwid_apis.high_signal, the hwid category picks it up on the next MCP-server reload with no Python change.
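The seed_from: mechanism can be pictured with the sketch below. The pointer syntax ("list.path[].field"), the keywords key, and both function names are assumptions for illustration; the real catalog loader may resolve pointers differently. The catalog is shown as an already-parsed dict so the example needs no YAML library.

```python
def resolve_seed(catalog, pointer):
    """Resolve a pointer such as 'hwid_apis.high_signal[].api' to a list of values."""
    list_path, _, field = pointer.partition("[].")
    node = catalog
    for key in list_path.split("."):
        node = node[key]
    return [item[field] for item in node]


def category_keywords(catalog, category):
    """Own keywords of a string category plus anything it inherits via seed_from."""
    spec = catalog["string_categories"][category]
    keywords = list(spec.get("keywords", []))
    if "seed_from" in spec:
        keywords += resolve_seed(catalog, spec["seed_from"])
    return keywords


# Hypothetical, heavily trimmed catalog.
catalog = {
    "hwid_apis": {"high_signal": [{"api": "GetComputerNameW"},
                                  {"api": "GetUserNameW"}]},
    "string_categories": {"hwid": {"seed_from": "hwid_apis.high_signal[].api"}},
}
```

Because the keyword list is resolved from the catalog on every load, appending an entry to hwid_apis.high_signal changes the hwid category without touching the resolver.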

The patterns above are the combinations that fire together:

  • Pattern A fires when obfuscation.count >= 5 AND hwid.count >= 5 AND the section table contains at least four of the seven \.xtls|\.didata|\.ecode|\.xdata|\.xpdata|\.udata|\.00cfg names AND the .text section has the large_section_with_tiny_text shape.
  • Pattern B fires when license-activation.count >= 50 AND crypto.count >= 100 AND the DLL has ordinal-only exports AND the import table shows 8+ of the 12 HWID APIs. (Prior versions of this doc referenced activation.count; the 2026-06-06 Cycle 2 catalog refactor split the activation bucket into ue-component-activation (Unity/UE component-lifecycle noise) and license-activation (the real license-gate vocabulary). Pattern B now reads license-activation.count to avoid the 615 false-positive hits in P3R.exe's UE component vocabulary.)
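The two fire rules translate directly into boolean predicates. The sketch below assumes the categorizer output has been reduced to a dict of bucket name to count, and that the section, export, and import observations are passed in by the caller; the function and parameter names are hypothetical.

```python
import re

# The seven section-name regexes from the Pattern A rule.
PATTERN_A_SECTION_RES = [re.compile(p) for p in (
    r"\.xtls", r"\.didata", r"\.ecode", r"\.xdata",
    r"\.xpdata", r"\.udata", r"\.00cfg",
)]


def pattern_a_fires(counts, section_names, tiny_text_shape):
    """Pattern A: obfuscation and hwid counts, 4 of 7 sections, tiny-.text shape."""
    matched = sum(
        1 for rx in PATTERN_A_SECTION_RES
        if any(rx.fullmatch(name) for name in section_names)
    )
    return bool(
        counts.get("obfuscation", 0) >= 5
        and counts.get("hwid", 0) >= 5
        and matched >= 4
        and tiny_text_shape
    )


def pattern_b_fires(counts, ordinal_only, hwid_import_hits):
    """Pattern B: reads license-activation, never ue-component-activation."""
    return bool(
        counts.get("license-activation", 0) >= 50
        and counts.get("crypto", 0) >= 100
        and ordinal_only
        and hwid_import_hits >= 8
    )
```

With counts such as {"ue-component-activation": 615, "license-activation": 3, "crypto": 120}, pattern_b_fires returns False, which is the behaviour the bucket split was meant to produce.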

The categorizer is deterministic and idempotent for a given catalog: the YAML is the single source of truth for both the indicator set that re-drm-fingerprint reads and the keyword set that the categorizer reads, so the static analysis and the string analysis give consistent answers.

The inference chain

A reverse engineer using RE-AI typically goes:

  1. Run re-static-triage on the binary. This produces a section list, import table, and a capa capability report.

  2. Run re-drm-fingerprint to score the binary against the catalog. The skill returns a confidence (Low / Medium / High) and a pattern indicator (the category from the table above).

  3. Match the indicator against the user's context:

    • The user knows which protection their target uses → the indicator is just confirmation.
    • The user is triaging an unknown binary → the indicator is the signal; the user supplies the vendor attribution (e.g. "this is a commercial encrypted-VM bytecode product shipping with this Unity target").
  4. Use the right follow-up skill:

    • re-vm-reverse for VM bytecode interpreters (lift the dispatcher)
    • re-mba-deobfuscate for MBA-obfuscated arithmetic (Z3 proofs)
    • re-il2cpp-decompile for Unity IL2CPP class-graph recovery (post-protection, only the symbol table is readable)
    • re-decompile for function-level disassembly + decompilation
    • re-dynamic-analysis (gdb/GEF) for runtime breakpoint / stepping
    • re-symbolic-exec (Triton) for constraint solving on a single function
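Step 4 is essentially a lookup from pattern indicator to follow-up skill. A minimal sketch of that routing, assuming the indicator arrives as one of the category names used in this document, is shown below. The table, the is_il2cpp flag, and the function name are hypothetical; the skill names are the ones listed in step 4.

```python
# Category-specific follow-ups (keys are category names from this document).
FOLLOW_UP_SKILLS = {
    "encrypted-VM bytecode interpreter": ["re-vm-reverse"],
    "MBA-obfuscated arithmetic": ["re-mba-deobfuscate"],
}

# Skills that apply regardless of the detected category.
GENERAL_SKILLS = ["re-decompile", "re-dynamic-analysis", "re-symbolic-exec"]


def next_skills(indicator, is_il2cpp=False):
    """Return the follow-up skills for a pattern indicator, most specific first."""
    skills = list(FOLLOW_UP_SKILLS.get(indicator, []))
    if is_il2cpp:
        skills.append("re-il2cpp-decompile")
    return skills + GENERAL_SKILLS
```

An unrecognized indicator falls through to the general skills only, which matches the triage case where the binary is still unknown.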

The negative space

RE-AI explicitly does not:

  • Name a specific commercial vendor in any of its tools, data, or generated output. The pattern indicators are descriptive; vendor attribution is the user's call. (The gitignored docs/ and Output/ directories contain historical reports that do name vendors; those are not shipped.)
  • Crack or bypass the protection. The skills identify, lift, and document. The user decides what to do with the result.
  • Compare two binaries' protection schemes for vendor attribution. (re-lief.normalize_for_diff does structural comparison; vendor attribution is orthogonal.)
  • Produce YARA rules for the protection scheme (v2 candidate).

Why vendor-neutral?

Three reasons:

  1. The patterns are observable facts. The section names, import sets, and byte patterns are real bytes in real binaries. Anyone familiar with a commercial protection product will recognize the patterns. Naming the product in our public-facing tools adds nothing — the inference chain is in the catalog.

  2. Avoiding vendor attribution makes the toolchain durable. A new protection product that ships next year is recognizable by the same patterns; we don't have to update every skill to add a new vendor. The catalog's pattern_indicators.mappings is the only place that needs new entries.

  3. The reverse-engineering community is small enough that attribution is redundant. Anyone using this toolchain against a real target already knows what protection their target uses; the pattern indicator confirms it. Anyone using it against an unknown target can apply the inference chain themselves.

Adding a new pattern to the catalog

When you encounter a new anti-tamper scheme with a public analysis:

  1. Add the section-name regex to data/drm-indicators.yaml::section_indicators.rules (keep flags: any).
  2. Add the HWID-vector API to hwid_apis.high_signal (or medium_signal if the signal is weaker).
  3. Add the anti-debug check to anti_debug_indicators.checks.
  4. Add a new entry to pattern_indicators.mappings with a generic descriptor: field and an indicators: list of the observable signatures.

The vendor: field is gone. If you need a vendor-tagged entry (e.g. "the catalog author has confirmed this pattern is from product X"), add a note: to the mapping explaining the observation, not the attribution.
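The shape of a vendor-neutral mapping entry, and a check that rejects a leftover vendor: field, could look like the sketch below. The entry is shown as a Python dict rather than YAML, and its exact field layout is an assumption built from the field names mentioned in this document (descriptor, confidence, indicators, note); validate_mapping is a hypothetical helper, not part of the toolchain.

```python
REQUIRED_FIELDS = ("descriptor", "indicators")


def validate_mapping(entry):
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if not entry.get(field)]
    if "vendor" in entry:
        problems.append("vendor field is not allowed; describe the observation in note")
    return problems


# Illustrative entry modelled on the Pattern C description.
example_entry = {
    "descriptor": "encrypted-VM bytecode, proprietary-engine target",
    "confidence": "Medium-High",
    "indicators": [
        "section set .arch / .link / .xcode / .xtext / .sbss",
        "high-entropy encrypted body in .rodata",
    ],
    "note": "confirm with dispatcher + lazy-decrypt-stub detection",
}
```

Running a check like this over every entry before a catalog change lands is one way to keep attribution out of shipped data.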

Glossary

  • anti-tamper — Software that detects and prevents tampering (debugging, patching, hooking, dumping). The "DRM" in the original drm-indicators.yaml is a legacy term from when anti-tamper was primarily about copy protection; today's anti-tamper covers all reverse-engineering defenses.
  • encrypted-VM bytecode — A bytecode interpreter where the bytecode is stored encrypted and decrypted on-the-fly by a VM dispatcher. The original x86 code is replaced by the VM.
  • MBA (Mixed-Boolean-Arithmetic) — A class of obfuscation that rewrites arithmetic using bitwise identities. Semantically equivalent to the original; just harder to read.
  • HWID (Hardware ID) — A fingerprint of the host machine, used for license binding. The HWID-vector API set is the set of Windows APIs most commonly read to assemble the fingerprint.
  • dispatcher — The function in a VM that fetches the next handler from a table and jumps to it. The hottest function in a VM by call count.
  • PEB (Process Environment Block) — A user-mode data structure in Windows that DRM schemes read to detect a debugger. See data/drm-indicators.yaml::peb.
  • KUSER_SHARED_DATA — A kernel-mapped page that user code can read without a syscall. DRM schemes read fields here as part of the host fingerprint. See data/drm-indicators.yaml::kuser_shared_data.
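To make the MBA entry concrete, one well-known identity rewrites addition as x + y = (x ^ y) + 2 * (x & y). The sketch below checks it exhaustively at 8 bits. Brute force stands in here for the Z3 proofs that re-triton.solve_constraint produces and only scales to small widths; the function names are illustrative.

```python
def add_plain(x, y, bits=8):
    """Ordinary addition, wrapped to the given bit width."""
    return (x + y) & ((1 << bits) - 1)


def add_mba(x, y, bits=8):
    """The same addition written as a mixed-boolean-arithmetic expression."""
    mask = (1 << bits) - 1
    return ((x ^ y) + 2 * (x & y)) & mask


def equivalent(bits=8):
    """Exhaustively compare both forms over every pair of inputs."""
    limit = 1 << bits
    return all(add_plain(x, y, bits) == add_mba(x, y, bits)
               for x in range(limit) for y in range(limit))
```

The identity holds because x ^ y is the carry-less sum and x & y marks the bit positions that generate a carry, which is why the obfuscated form is semantically equivalent and only harder to read.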