mirror of https://github.com/Heretek-AI/RE-AI.git synced 2026-07-01 01:37:55 -04:00

John Smith 895514bd93 docs(anti-tamper-taxonomy): Pattern B references license-activation bucket

Post-run follow-up to the 2026-06-06-r01 stress test
(Output/2026-06-06-r01/gap-analysis.md). The C1 catalog refactor
split 'activation' into 'ue-component-activation' and
'license-activation'; ANTI-TAMPER-TAXONOMY.md's Pattern B fire
rule was still reading 'activation.count' which now points to
the (much smaller) license-activation bucket. The 615
false-positive hits in P3R.exe's UE component vocabulary no
longer trip the Pattern B threshold of 50 strings.

CHANGELOG.md [2.5.1] entry: full release notes for the Cycle 2
post-run follow-up (14 tool-bug fixes + 6 catalog refactors
+ 1 new leak category + 1 KSY backport, no new MCP servers,
no new skills).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-06 15:57:00 -04:00


Anti-Tamper Taxonomy

Status: Public-facing reference. Vendor-neutral. Audience: Reverse engineers using RE-AI on binaries wrapped in anti-tamper or VM-based protection.

What this document is

RE-AI's data/drm-indicators.yaml catalog records the observable signatures of anti-tamper and VM-based protection schemes — section names, import sets, byte patterns, PEB reads — and the categories of protection those signatures suggest. This document explains the taxonomy: what categories the toolchain recognizes, the inference chain from a binary's observable features to a category, and the negative space (what RE-AI explicitly does not do).

The catalog is vendor-neutral. The patterns in data/drm-indicators.yaml and the skills in skills/ describe categories of protection (encrypted-VM bytecode, MBA-obfuscated arithmetic, legacy disc-based protection, etc.) — not specific commercial products. The user supplies vendor attribution based on their context.

The categories

Pattern C — encrypted-VM bytecode interpreter (proprietary-engine target)

Added 2026-06-05 per the Sample B finding. The proprietary-engine target family uses a distinct section set (.arch, .link, .xcode, .xtext, .sbss) and the encrypted body lives in .rodata (often 100x the size of .text).

Section table — .arch / .link / .xcode / .xtext / .sbss is the marker. The encrypted body is in .rodata (high entropy, large size), not in a .vmp0 / .xtls / .ecode style dedicated section.
Real native code — .text is small and normal-entropy. The decrypted dispatch + handler body is what runs at runtime.
Anti-debug — standard PEB.BeingDebugged / NtQueryInformationProcess / RDTSC / CPUID patterns, but wrapped by the proprietary engine's own anti-tamper stub (not a vendor-attributed wrapper).
Distinct from Pattern A — Pattern A is the Unity IL2CPP target variant: GameAssembly.dll + global-metadata.dat pairing, .xtls / .didata / .ecode / .xdata / .xpdata / .udata / .00cfg section set, encrypted body in .xtls (the highest-entropy region, 7.85+ entropy). Pattern C is the proprietary-engine target variant: no GameAssembly.dll, no .xtls-style dedicated section, encrypted body in .rodata.

The pattern_indicators.mappings entry "encrypted-VM bytecode, proprietary-engine target" carries a confidence: Medium-High because the .arch / .link / .xcode / .xtext / .sbss section set is rare outside this family — but the encrypted body in .rodata is also seen in packed-but-not-protected binaries, so confirm with a dispatcher + lazy-decrypt-stub detection before publishing.
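The Pattern C observations above can be sketched as a single predicate. This is a minimal illustration, not the catalog's implementation: the function name, the input shape, and both thresholds (a 10x size ratio and an entropy floor of 7.0) are assumptions chosen for the example. Only the five section names and the "large, high-entropy .rodata next to a small .text" shape come from the text.

```python
# Section names taken from the Pattern C description; everything else is hypothetical.
PATTERN_C_SECTIONS = {".arch", ".link", ".xcode", ".xtext", ".sbss"}


def looks_like_pattern_c(sections, min_ratio=10.0, min_entropy=7.0):
    """Return True when the section table has the Pattern C shape.

    sections: dict mapping section name -> (raw_size_bytes, entropy).
    min_ratio and min_entropy are illustrative thresholds, not catalog values.
    """
    # Assumption: require the full five-name marker set.
    if not PATTERN_C_SECTIONS <= set(sections):
        return False
    if ".rodata" not in sections or ".text" not in sections:
        return False
    rodata_size, rodata_entropy = sections[".rodata"]
    text_size, _ = sections[".text"]
    # The encrypted body lives in .rodata: much larger than .text and high entropy.
    return rodata_size >= min_ratio * text_size and rodata_entropy >= min_entropy
```

A positive result is only a starting point. As noted above, a high-entropy .rodata also appears in packed-but-not-protected binaries, so the dispatcher and lazy-decrypt-stub confirmation is still required.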

Pattern D — publisher telemetry pipeline leak

Added 2026-06-05 per the Sample A + B findings. This is an attack-surface category, not a protection category. The binary's string table contains publisher operational infrastructure that has no business being shipped:

Sentry DSN — https://<public-key>@<host>/<project-id> form. The public key alone is enough to submit forged crash reports.
Logstash / log-ingestion URL — internal observability endpoint. Often POST-only but the URL leaks the host.
Confluence wiki page — internal engineering docs / secrets, often link-only but still information disclosure.
Google Drive document URL — publisher-internal design docs (the bulk of Sample A's 16,236 string matches).
Long-lived credentials — AWS access key IDs (AKIA…, ASIA…), Slack tokens (xox[bpaeors]-…).

Detected by re-leak-scan (regex pass over the string table) and bucketed into string_categories::telemetry_leak in re-lief.categorize_strings. The re-telemetry-extract skill adds an active HTTP probe (verify_sentry_dsn, verify_confluence_url) to confirm each endpoint is still live.
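As a rough illustration of the regex pass described above, the sketch below scans a list of strings for three of the leak shapes. The regexes are assumptions written for this example (re-leak-scan's actual expressions are not shown in this document), and the `scan_strings` name is hypothetical. The 32-hex-digit public key in the Sentry pattern reflects the common DSN form and may be too strict for some deployments.

```python
import re

# Illustrative patterns only; not the expressions re-leak-scan ships with.
LEAK_PATTERNS = {
    # https://<public-key>@<host>/<project-id>
    "sentry_dsn": re.compile(r"https://[0-9a-f]{32}@[A-Za-z0-9.-]+/\d+"),
    # AWS access key IDs: AKIA or ASIA followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # Slack tokens: xox[bpaeors]- prefix followed by the token body.
    "slack_token": re.compile(r"\bxox[bpaeors]-[0-9A-Za-z-]{10,}"),
}


def scan_strings(strings):
    """Return {leak_kind: [matched substrings]} for every pattern that hits."""
    hits = {}
    for s in strings:
        for kind, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(s):
                hits.setdefault(kind, []).append(match.group(0))
    return hits
```

The function only matches text. Confirming that an endpoint is still live is the job of the active probe in re-telemetry-extract.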

This is not a DRM / anti-tamper pattern — the encrypted-VM bytecode wrapper does not prevent these leaks because the URL strings are typically not encrypted. The skill output should report the leak as a separate finding, not bundle it with the encrypted-VM finding.

| Category | Description | Recognizable by |
| --- | --- | --- |
| encrypted-VM bytecode interpreter | The binary's real x86 code is replaced by a register-based VM; a dispatcher fetches handlers from a table. | High-entropy encrypted-TLS section; tiny `.text`; massive `.idata` or similar; ordinal-only `ws2_32` imports; HWID-vector API set; PDB filename with a vendor tag |
| MBA-obfuscated arithmetic | Arithmetic operations rewritten using bitwise identities to defeat pattern matching. | `re-triton.solve_constraint` finds Z3 proofs for `mba == original` |
| legacy disc-based protection | Old-style CD/DVD check or kernel-driver protection. | Section names with `securom` / `.sdc`; co-located `*.sys` drivers; high-frequency `DeviceIoControl` |
| hardware-fingerprinting routine | Static imports of HWID-vector APIs, regardless of whether the binary is also VM-pack-wrapped. | Imports ≥ 2 of `{CryptAcquireContextW, CryptGenRandom, GetUserNameW, GetAdaptersAddresses, GetNetworkParams, GetComputerNameW, GetVolumeInformationW, NtQuerySystemInformation}`; direct PEB reads |
| anti-debug / anti-emulation | Static or dynamic checks for a debugger / VM / sandbox. | `IsDebuggerPresent` / `CheckRemoteDebuggerPresent` / `OutputDebugString` imports; `0F 31` (RDTSC) / `CD 2D` (INT 2D) / many `CC` (int3) byte patterns; KUSER_SHARED_DATA reads at `TickCountLow` / `InterruptTime` |
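The "Imports ≥ 2 of" rule for the hardware-fingerprinting row reduces to a set intersection. The sketch below assumes the import names have already been extracted into a flat list; the function name and return shape are hypothetical, while the eight API names and the threshold of 2 are taken from the table.

```python
# API names copied from the hardware-fingerprinting row of the table.
HWID_VECTOR_APIS = frozenset({
    "CryptAcquireContextW", "CryptGenRandom", "GetUserNameW",
    "GetAdaptersAddresses", "GetNetworkParams", "GetComputerNameW",
    "GetVolumeInformationW", "NtQuerySystemInformation",
})


def hwid_import_hits(imported_names, threshold=2):
    """Return (sorted matching APIs, whether the threshold is met)."""
    hits = sorted(HWID_VECTOR_APIS & set(imported_names))
    return hits, len(hits) >= threshold
```

For example, a binary importing `GetUserNameW`, `GetComputerNameW`, and `CreateFileW` would meet the threshold with two matches. A binary importing only `GetUserNameW` would not.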

Recognizing the patterns in arbitrary binaries

The categories in the table above are abstract — they describe a shape, not a feature you can grep for. This section shows the observable composition that a future agent should be able to recognize from first principles on a binary it has never seen before, using only the catalog and the MCP tools. Every observation below is generic — it describes a category, not a specific commercial product.

Pattern A — encrypted-VM bytecode interpreter (Unity IL2CPP target)

A register-based bytecode VM that has replaced the binary's real x86 code. The observable composition that fires together (all seven are diagnostic; any four is a strong signal):

The PE's section table contains at least four of the seven section-name regexes \.xtls, \.didata, \.ecode, \.xdata, \.xpdata, \.udata, \.00cfg (defined in data/drm-indicators.yaml::section_indicators.rules). The .xtls section is typically the highest-entropy region (entropy 7.85+).
The largest code-bearing section violates W^X: it carries CNT_CODE | MEM_EXECUTE | MEM_READ | MEM_WRITE permissions simultaneously, so it is writable and executable at once. A 100+ MB .idata carrying all four is the canonical example.
The canonical .text section has virtual_size >> raw_size (e.g. 2.2 MB virtual, 512 bytes raw on disk). This is the large_section_with_tiny_text rule.
A small (under 200 bytes) .ecode section sits at the PE entry point and contains a lazy-decrypt stub — a 2-instruction walk over the bytecode range that fires on first call, not at load time, gated by a one-byte "done" flag in the section.
The PE debug directory references a PDB filename that embeds a vendor tag (a name fragment that's not the binary's own basename). Vendor-neutral translation: presence of any non-matching tag in the PDB reference is the signal.
The exports table ends with a single late-bound entry — a stub the game calls after the interpreter is initialized. The interpreter is "armed but inert" until this export returns.
The import table shows 8+ of the 12 APIs in drm-indicators.yaml::hwid_apis.high_signal — the fingerprint-vector set is unusual for a non-DRM Unity IL2CPP game.

When all seven fire, the confidence is Medium-High for the encrypted-VM bytecode interpreter category. re-lief.categorize_strings will populate the obfuscation bucket (with the dispatch, handler, lookup, vm_entry keywords) and the hwid bucket with the imported APIs.
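Three of the observations above (the writable-and-executable section, the virtual_size >> raw_size shape, and the entropy figure) are simple to compute once the section headers and bytes are in hand. The sketch below is illustrative: the function names and the 100x ratio are assumptions, while the characteristic flag values are the standard PE `IMAGE_SCN_*` constants.

```python
import math
from collections import Counter

# Standard PE section characteristic flags.
IMAGE_SCN_CNT_CODE = 0x00000020
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def is_writable_and_executable(characteristics):
    """True when a section is both writable and executable (a W^X violation)."""
    mask = IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_WRITE
    return (characteristics & mask) == mask


def has_tiny_text_shape(virtual_size, raw_size, ratio=100):
    """True when virtual_size dwarfs raw_size; the ratio is an assumed threshold."""
    return virtual_size >= ratio * max(raw_size, 1)


def shannon_entropy(data):
    """Shannon entropy of a bytes object, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )
```

Under these assumptions, the example from the list (2.2 MB virtual against 512 bytes raw) satisfies `has_tiny_text_shape`, and a section whose entropy is at or above 7.85 is a candidate for the encrypted body.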

Pattern B — hardware-fingerprinting routine + anti-debug, in a third-party launcher activation library

A small native DLL sitting alongside the main game binary, gating launch on a license-server round-trip + host fingerprint. The observable composition that fires together:

A small (1-3 MB) native DLL with ordinal-only exports (@100, @101 — no symbol names). Exports are deliberately stripped.
The launcher .exe imports only 2-3 ordinals from this DLL (entry point + setup/teardown). Nothing else. The DLL is opaque to the launcher.
The activation DLL statically links a recognizable crypto library — the catalog's signal is the .\crypto\... path fragments (1,000+ of them in .rdata). OpenSSL is the most common (look for EVP_*, RSA_*, X509*, PKCS*, BIO_*, PEM_* substrings). re-lief.categorize_strings populates the crypto bucket with 500+ matches on a 3 MB binary.
The import table shows WinHTTP (WinHttpOpen, WinHttpConnect, WinHttpOpenRequest, WinHttpSendRequest, WinHttpReceiveResponse, WinHttpQueryHeaders, WinHttpReadData) plus the X.509 / Authenticode APIs (CryptQueryObject, PFXImportCertStore, WinVerifyTrust). The network bucket populates accordingly.
The import table shows 8+ of the 12 APIs in drm-indicators.yaml::hwid_apis.high_signal (GetComputerNameW, GetUserNameW, GetVolumeInformationW, CryptAcquireContextW, CryptGenRandom, GetAdaptersAddresses, etc.). The hwid bucket populates accordingly.
The import table shows the catalog's anti-debug primitives (IsDebuggerPresent, OutputDebugStringW, NtQueryInformationProcess). The anti_debug bucket populates. (Cycle 2 fix 2026-06-06: each anti_debug_indicators.checks[] entry now carries a confirmation: field of import_only / requires_disasm / requires_xref; the categorizer drops string-table hits that aren't backed by an import or disasm confirmation, eliminating the 48+ false positives on UE / Unity binaries that the prior string-only-equal filter produced.) Important: the anti-debug surface is split between the activation DLL and the encrypted-VM-wrapped game DLL. Typically the activation DLL has the Win32 anti-debug APIs and the game DLL has the VM-encrypted anti-debug.
The strings dump shows the license-activation and obfuscation categories from re-lief.categorize_strings with non-trivial counts (typically 50-200 strings each on a 3 MB binary). (Cycle 2 fix 2026-06-06: the prior activation bucket was split into ue-component-activation (Unity/UE component-lifecycle noise) and license-activation (the real license-gate vocabulary); Pattern B now reads license-activation.count, not activation.count.)

When all seven fire, the confidence is Medium-High for the hardware-fingerprinting routine + anti-debug category layered with a third-party launcher activation library. The activation library is a separate layer from the main game DLL; the encrypted-VM interpreter does the game-DLL work, the activation DLL does the license-gate work, and the launcher .exe is the glue.
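Two of the Pattern B observations lend themselves to a short sketch: the ordinal-only export table and the statically linked crypto library. The helper names and input shapes below are hypothetical; the substring list and the `.\crypto\` path fragment are taken from the list above.

```python
# Substrings named in the Pattern B description as OpenSSL markers.
OPENSSL_MARKERS = ("EVP_", "RSA_", "X509", "PKCS", "BIO_", "PEM_")
CRYPTO_PATH_FRAGMENT = ".\\crypto\\"


def exports_are_ordinal_only(exports):
    """exports: iterable of (ordinal, name) pairs; name is None or '' when stripped."""
    exports = list(exports)
    # An empty export table is not "ordinal-only"; it is just empty.
    return bool(exports) and all(not name for _, name in exports)


def count_crypto_markers(strings):
    """Count strings that carry an OpenSSL marker or a .\\crypto\\ path fragment."""
    return sum(
        1
        for s in strings
        if CRYPTO_PATH_FRAGMENT in s or any(marker in s for marker in OPENSSL_MARKERS)
    )
```

Substring matching of this kind is deliberately loose, so a count from this sketch should be read as an upper bound rather than as the crypto bucket's exact value.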

How to detect the patterns

The MCP tool re-lief.categorize_strings drives the static detection. Call it on every DLL and the launcher .exe in the target. The categorizer buckets strings into {anti_debug, hwid, crypto, network, registry, process, file, fingerprint, ue-component-activation, license-activation, obfuscation, telemetry_leak, misc} using the keyword vocabularies in data/drm-indicators.yaml::string_categories. The two seed categories (anti_debug, hwid) inherit their keyword lists from the existing anti_debug_indicators.checks[].name and hwid_apis.high_signal[].api lists via a seed_from: YAML pointer. When a future agent adds a new HWID API to hwid_apis.high_signal, the hwid category picks it up on the next MCP-server reload with no Python change.
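The seed_from: mechanism can be pictured as a pointer that is resolved against the loaded catalog each time the keyword list is built. The sketch below works on an already-parsed catalog (a plain dict) and assumes a pointer syntax of `<key>.<key>[].<field>`, modeled on the notation used in this section. The real schema and resolver may differ, and `resolve_keywords` is a hypothetical name.

```python
def resolve_keywords(catalog, category):
    """Return the keyword list for a string category, following seed_from if set."""
    entry = catalog["string_categories"][category]
    keywords = list(entry.get("keywords", []))
    pointer = entry.get("seed_from")
    if pointer:
        # Assumed pointer form: "hwid_apis.high_signal[].api"
        path, field = pointer.rsplit("[].", 1)
        node = catalog
        for key in path.split("."):
            node = node[key]
        keywords.extend(item[field] for item in node)
    return keywords


# Tiny stand-in for the parsed YAML; the layout is an assumption.
catalog = {
    "hwid_apis": {
        "high_signal": [{"api": "GetComputerNameW"}, {"api": "GetUserNameW"}],
    },
    "string_categories": {
        "hwid": {"seed_from": "hwid_apis.high_signal[].api"},
        "crypto": {"keywords": ["EVP_"]},
    },
}
```

Because the pointer is resolved at load time, appending an entry to `hwid_apis.high_signal` changes what `resolve_keywords(catalog, "hwid")` returns without touching the resolver.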

The patterns above are the combinations that fire together:

Pattern A fires when obfuscation.count >= 5 AND hwid.count >= 5 AND the section table contains at least four of the seven \.xtls|\.didata|\.ecode|\.xdata|\.xpdata|\.udata|\.00cfg names AND the .text section has the large_section_with_tiny_text shape.
Pattern B fires when license-activation.count >= 50 AND crypto.count >= 100 AND the DLL has ordinal-only exports AND the import table shows 8+ of the 12 HWID APIs. (Prior versions of this doc referenced activation.count; the 2026-06-06 Cycle 2 catalog refactor split the activation bucket into ue-component-activation (Unity/UE component-lifecycle noise) and license-activation (the real license-gate vocabulary). Pattern B now reads license-activation.count to avoid the 615 false-positive hits in P3R.exe's UE component vocabulary.)
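The two fire rules above can be written as plain predicates. This is a sketch under stated assumptions: the function names and argument shapes are hypothetical, and the bucket counts are assumed to arrive as a dict keyed by bucket name. The thresholds and the section-name alternation are copied from the rules.

```python
import re

# Alternation copied from the Pattern A rule.
PATTERN_A_SECTION_RE = re.compile(
    r"\.xtls|\.didata|\.ecode|\.xdata|\.xpdata|\.udata|\.00cfg"
)


def pattern_a_fires(counts, section_names, tiny_text_shape):
    """counts: bucket name -> count; tiny_text_shape: large_section_with_tiny_text result."""
    matched = {name for name in section_names if PATTERN_A_SECTION_RE.fullmatch(name)}
    return bool(
        counts.get("obfuscation", 0) >= 5
        and counts.get("hwid", 0) >= 5
        and len(matched) >= 4
        and tiny_text_shape
    )


def pattern_b_fires(counts, ordinal_only_exports, hwid_import_count):
    """Reads license-activation, not the old activation bucket."""
    return bool(
        counts.get("license-activation", 0) >= 50
        and counts.get("crypto", 0) >= 100
        and ordinal_only_exports
        and hwid_import_count >= 8
    )
```

With this shape, a binary that has 615 hits in ue-component-activation but only a handful in license-activation does not fire Pattern B, which is the behavior the catalog refactor was meant to produce.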

The categorizer is deterministic and idempotent, and it stays in step with the catalog: the YAML is the single source of truth for both the indicator set that re-drm-fingerprint reads and the keyword set that the categorizer reads. As a result, the static analysis and the string analysis give consistent answers.

The inference chain

A reverse engineer using RE-AI typically goes:

1. Run re-static-triage on the binary. This produces a section list, import table, and a capa capability report.
2. Run re-drm-fingerprint to score the binary against the catalog. The skill returns a confidence (Low / Medium / High) and a pattern indicator (the category from the table above).
3. Match the indicator against the user's context:
   - The user knows which protection their target uses → the indicator is just confirmation.
   - The user is triaging an unknown binary → the indicator is the signal; the user supplies the vendor attribution (e.g. "this is a commercial encrypted-VM bytecode product shipping with this Unity target").
4. Use the right follow-up skill:
   - re-vm-reverse for VM bytecode interpreters (lift the dispatcher)
   - re-mba-deobfuscate for MBA-obfuscated arithmetic (Z3 proofs)
   - re-il2cpp-decompile for Unity IL2CPP class-graph recovery (post-protection, only the symbol table is readable)
   - re-decompile for function-level disassembly + decompilation
   - re-dynamic-analysis (gdb/GEF) for runtime breakpoint / stepping
   - re-symbolic-exec (Triton) for constraint solving on a single function
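The last step of the chain amounts to a lookup from the pattern indicator to a follow-up skill. The sketch below is one hypothetical way to encode it. The skill names and category names come from this document, but the routing function, its `unity_il2cpp` flag, and the choice to always append the general-purpose skills are assumptions.

```python
# Category-specific skills, keyed by the category names used in the table.
CATEGORY_SKILLS = {
    "encrypted-VM bytecode interpreter": "re-vm-reverse",
    "MBA-obfuscated arithmetic": "re-mba-deobfuscate",
}

# Skills that apply regardless of the indicator (assumed ordering).
GENERAL_SKILLS = ["re-decompile", "re-dynamic-analysis", "re-symbolic-exec"]


def follow_up_skills(indicator, unity_il2cpp=False):
    """Return the follow-up skills to consider for a pattern indicator."""
    skills = []
    if indicator in CATEGORY_SKILLS:
        skills.append(CATEGORY_SKILLS[indicator])
    if unity_il2cpp:
        # Class-graph recovery only makes sense on Unity IL2CPP targets.
        skills.append("re-il2cpp-decompile")
    return skills + GENERAL_SKILLS
```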

The negative space

RE-AI explicitly does not:

Name a specific commercial vendor in any of its tools, data, or generated output. The pattern indicators are descriptive; vendor attribution is the user's call. (The gitignored docs/ and Output/ directories contain historical reports that do name vendors; those are not shipped.)
Crack or bypass the protection. The skills identify, lift, and document. The user decides what to do with the result.
Compare two binaries' protection schemes for vendor attribution. (re-lief.normalize_for_diff does structural comparison; vendor attribution is orthogonal.)
Produce YARA rules for the protection scheme (v2 candidate).

Why vendor-neutral?

Three reasons:

The patterns are observable facts. The section names, import sets, and byte patterns are real bytes in real binaries. Anyone familiar with a commercial protection product will recognize the patterns. Naming the product in our public-facing tools adds nothing — the inference chain is in the catalog.
Avoiding vendor attribution makes the toolchain durable. A new protection product that ships next year is recognizable by the same patterns; we don't have to update every skill to add a new vendor. The catalog's pattern_indicators.mappings is the only place that needs new entries.
The reverse-engineering community is small enough that attribution is redundant. Anyone using this toolchain against a real target already knows what protection their target uses; the pattern indicator confirms it. Anyone using it against an unknown target can apply the inference chain themselves.

Adding a new pattern to the catalog

When you encounter a new anti-tamper scheme with a public analysis:

Add the section-name regex to data/drm-indicators.yaml::section_indicators.rules (keep flags: any).
Add the HWID-vector API to hwid_apis.high_signal (or medium_signal if the signal is weaker).
Add the anti-debug check to anti_debug_indicators.checks.
Add a new entry to pattern_indicators.mappings with a generic descriptor: and the observable indicators:.

The vendor: field is gone. If you need a vendor-tagged entry (e.g. "the catalog author has confirmed this pattern is from product X"), add a note: to the mapping explaining the observation, not the attribution.
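A new pattern_indicators.mappings entry can be sanity-checked before it is committed. The sketch below treats a mapping as a plain dict and uses the field names mentioned above (descriptor, indicators, note, vendor). The dict layout, the sample indicator strings, and the `validate_mapping` helper are assumptions for illustration, not the catalog's own validation.

```python
def validate_mapping(entry):
    """Return a list of problems with a mapping entry; empty means it looks fine."""
    problems = []
    for field in ("descriptor", "indicators"):
        if not entry.get(field):
            problems.append(f"missing {field}")
    if "vendor" in entry:
        # Attribution is the user's call; observations go in note: instead.
        problems.append("vendor field is no longer accepted; use note")
    return problems


# Hypothetical entry shaped after the Pattern C description.
example_entry = {
    "descriptor": "encrypted-VM bytecode, proprietary-engine target",
    "indicators": [
        "section set .arch / .link / .xcode / .xtext / .sbss",
        "large high-entropy .rodata next to a small .text",
    ],
    "confidence": "Medium-High",
    "note": "Encrypted body in .rodata is also seen in packed-but-not-protected binaries.",
}
```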

Glossary

anti-tamper — Software that detects and prevents tampering (debugging, patching, hooking, dumping). The "DRM" in the original drm-indicators.yaml is a legacy term from when anti-tamper was primarily about copy protection; today's anti-tamper covers all reverse-engineering defenses.
encrypted-VM bytecode — A bytecode interpreter where the bytecode is stored encrypted and decrypted on-the-fly by a VM dispatcher. The original x86 code is replaced by the VM.
MBA (Mixed-Boolean-Arithmetic) — A class of obfuscation that rewrites arithmetic using bitwise identities. Semantically equivalent to the original; just harder to read.
HWID (Hardware ID) — A fingerprint of the host machine, used for license binding. The HWID-vector API set is the set of Windows APIs most commonly read to assemble the fingerprint.
dispatcher — The function in a VM that fetches the next handler from a table and jumps to it. The hottest function in a VM by call count.
PEB (Process Environment Block) — A user-mode data structure in Windows that DRM schemes read to detect a debugger. See data/drm-indicators.yaml::peb.
KUSER_SHARED_DATA — A kernel-mapped page that user code can read without a syscall. DRM schemes read fields here as part of the host fingerprint. See data/drm-indicators.yaml::kuser_shared_data.
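To make the MBA entry concrete, the sketch below shows one well-known identity, x + y = (x ^ y) + 2 * (x & y), and checks it exhaustively over all pairs of 8-bit inputs. This brute-force check is a stand-in chosen for brevity; it is not the Z3 proof that re-triton.solve_constraint produces, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit wraparound arithmetic


def add_plain(x, y):
    """Ordinary 32-bit addition."""
    return (x + y) & MASK32


def add_mba(x, y):
    """The same addition rewritten as a mixed Boolean-arithmetic expression.

    XOR gives the carry-less sum and AND marks the carry positions,
    so x + y == (x ^ y) + 2 * (x & y).
    """
    return ((x ^ y) + 2 * (x & y)) & MASK32


def equivalent_on_bytes():
    """Exhaustively compare both forms over all pairs of 8-bit inputs."""
    return all(
        add_plain(x, y) == add_mba(x, y)
        for x in range(256)
        for y in range(256)
    )
```

An exhaustive check only scales to small input widths, which is why a solver-backed proof is the practical route for full 32-bit or 64-bit expressions.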
