feat(re-lief): categorize_strings tool + catalog-driven string bucketing

Adds a keyword-bucketed strings dump to the re-lief MCP server, turning
the manual-grep step that today lives in the LLM's head into a
catalog-driven, deterministic lookup. Superset of extract_strings
(same {ascii, utf16le, totals, truncated} shape for backward compat)
plus a by_category block with 11 semantic categories (anti_debug,
hwid, crypto, network, registry, process, file, fingerprint,
activation, obfuscation, misc).

The categorization vocabulary lives in a new
data/drm-indicators.yaml::string_categories section. Two seed
categories (anti_debug, hwid) inherit their keyword lists from
existing catalog sections via a seed_from / seed_field YAML pointer
— when a future agent adds a new HWID API to hwid_apis.high_signal,
the categorizer picks it up on next MCP-server reload with zero
Python change. The YAML is the single source of truth for both the
indicator set that re-drm-fingerprint reads and the keyword set
that the categorizer reads.

Five skills (re-static-triage, re-malware-triage, re-drm-fingerprint,
re-vm-reverse, re-format-decode) had their manual-grep step replaced
with a call to re-lief.categorize_strings. No new workflow steps
were added — the categorizer IS the string scan.

ANTI-TAMPER-TAXONOMY.md gains a "Recognizing the patterns in
arbitrary binaries" section that documents Pattern A (encrypted-VM
bytecode interpreter: 7 section-name co-occurrence + RWX .idata +
.text virt>>raw + .ecode lazy-decrypt stub + vendor-tagged PDB +
late-bound export tail + 8+ HWID APIs) and Pattern B
(hardware-fingerprinting routine in a third-party launcher
activation library: ordinal-only exports + WinHTTP + OpenSSL +
HWID-vector APIs + split anti-debug surface) in vendor-neutral
category terms. No vendor / publisher / game / PDB-path literals
appear in any shipped file.

Tests: 7 new soft-skip tests in test_re_lief_categorize_strings.py
covering the result shape, the seed_from inheritance, the bundled
Activation64.dll high-signal hits, the legacy extract_strings
wrapper, and the GameAssembly full-section vs skip_sections paths.
All always-on tests (leakage, frontmatter, server registration,
smoke) continue to pass. ./verify.sh is green.
John Smith
2026-06-05 16:01:49 -04:00
parent 3e7a47e5b9
commit f5e5e9e72c
16 changed files with 1218 additions and 30 deletions
+132
@@ -31,6 +31,138 @@ their context.
| **hardware-fingerprinting routine** | Static imports of HWID-vector APIs, regardless of whether the binary is also VM-pack-wrapped. | Imports ≥ 2 of `{CryptAcquireContextW, CryptGenRandom, GetUserNameW, GetAdaptersAddresses, GetNetworkParams, GetComputerNameW, GetVolumeInformationW, NtQuerySystemInformation}`; direct PEB reads |
| **anti-debug / anti-emulation** | Static or dynamic checks for a debugger / VM / sandbox. | `IsDebuggerPresent` / `CheckRemoteDebuggerPresent` / `OutputDebugString` imports; `0F 31` (RDTSC) / `CD 2D` (INT 2D) / many `CC` (int3) byte patterns; KUSER_SHARED_DATA reads at `TickCountLow` / `InterruptTime` |
## Recognizing the patterns in arbitrary binaries
The categories in the table above are *abstract* — they describe a
shape, not a feature you can grep for. This section shows the
**observable composition** that a future agent should be able to
recognize from first principles on a binary it has never seen
before, using only the catalog and the MCP tools. Every observation
below is *generic* — it describes a category, not a specific
commercial product.
### Pattern A — encrypted-VM bytecode interpreter (Unity IL2CPP target)
A register-based bytecode VM that has replaced the binary's real
x86 code. The **observable composition** that fires together (all
seven are diagnostic; any four is a strong signal):
1. The PE's section table contains at least four of the seven
section-name regexes `\.xtls`, `\.didata`, `\.ecode`, `\.xdata`,
`\.xpdata`, `\.udata`, `\.00cfg` (defined in
`data/drm-indicators.yaml::section_indicators.rules`). The
`.xtls` section is typically the highest-entropy region
(entropy 7.85+).
2. The largest code-bearing section is simultaneously writable and
executable (a W^X violation): it carries `CNT_CODE | MEM_EXECUTE |
MEM_READ | MEM_WRITE` together. A 100+ MB `.idata` carrying all
four flags is the canonical example.
3. The canonical `.text` section has `virtual_size >> raw_size`
(e.g. 2.2 MB virtual, 512 raw on disk). This is the
`large_section_with_tiny_text` rule.
4. A small (under 200 bytes) `.ecode` section sits at the PE
entry point and contains a lazy-decrypt stub — a 2-instruction
walk over the bytecode range that fires **on first call**, not
at load time, gated by a one-byte "done" flag in the section.
5. The PE debug directory references a PDB filename that embeds a
vendor tag (a name fragment that's not the binary's own
basename). *Vendor-neutral translation*: presence of any
non-matching tag in the PDB reference is the signal.
6. The exports table ends with a single late-bound entry — a
stub the game calls *after* the interpreter is initialized.
The interpreter is "armed but inert" until this export
returns.
7. The import table shows 8+ of the 12 APIs in
`drm-indicators.yaml::hwid_apis.high_signal` — the
fingerprint-vector set is unusual for a non-DRM Unity IL2CPP
game.
When all seven fire, the confidence is **Medium-High** for the
encrypted-VM bytecode interpreter category. `re-lief.categorize_strings`
will populate the `obfuscation` bucket (with the `dispatch`,
`handler`, `lookup`, `vm_entry` keywords) and the `hwid` bucket
with the imported APIs.
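The section-level observations (items 1 to 3) lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the `Section` record and its field names are hypothetical (they are not the schema returned by `re-lief.get_sections`), and the 16x virtual-to-raw ratio is an illustrative threshold, not the catalog's `large_section_with_tiny_text` definition.

```python
import re
from dataclasses import dataclass

# The seven section-name regexes listed in item 1.
SECTION_NAME_PATTERNS = [
    r"\.xtls", r"\.didata", r"\.ecode", r"\.xdata",
    r"\.xpdata", r"\.udata", r"\.00cfg",
]


@dataclass
class Section:
    """Hypothetical section record used only for this sketch."""
    name: str
    virtual_size: int
    raw_size: int
    readable: bool = True
    writable: bool = False
    executable: bool = False


def matched_section_patterns(sections):
    """Return the patterns that match at least one section name."""
    names = [s.name for s in sections]
    return sorted(p for p in SECTION_NAME_PATTERNS
                  if any(re.fullmatch(p, n) for n in names))


def is_rwx(section):
    """Item 2: readable, writable and executable at the same time."""
    return section.readable and section.writable and section.executable


def has_tiny_text(sections, ratio=16):
    """Item 3: .text whose virtual size dwarfs its on-disk size (assumed ratio)."""
    return any(s.name == ".text" and s.virtual_size > ratio * max(s.raw_size, 1)
               for s in sections)


sample = [
    Section(".text", virtual_size=2_200_000, raw_size=512, executable=True),
    Section(".idata", 120_000_000, 120_000_000, writable=True, executable=True),
    Section(".xtls", 4096, 4096),
    Section(".ecode", 180, 512, executable=True),
    Section(".xdata", 4096, 4096),
    Section(".00cfg", 512, 512),
]

print(len(matched_section_patterns(sample)), "of 7 section names present")
print("RWX sections:", [s.name for s in sample if is_rwx(s)])
print("tiny .text:", has_tiny_text(sample))
```

`re.fullmatch` is used so that `\.xdata` cannot accidentally match a longer name; whether the catalog rules are anchored the same way is not stated in this document.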
### Pattern B — hardware-fingerprinting routine + anti-debug, in a third-party launcher activation library
A small native DLL sitting alongside the main game binary, gating
launch on a license-server round-trip + host fingerprint. The
**observable composition** that fires together:
1. A small (1-3 MB) native DLL with **ordinal-only exports**
(`@100`, `@101` — no symbol names). Exports are deliberately
stripped.
2. The launcher `.exe` imports only 2-3 ordinals from this DLL
(entry point + setup/teardown). Nothing else. The DLL is
opaque to the launcher.
3. The activation DLL statically links a recognizable crypto
library — the catalog's signal is the `.\crypto\...` path
fragments (1,000+ of them in `.rdata`). OpenSSL is the most
common (look for `EVP_*`, `RSA_*`, `X509*`, `PKCS*`, `BIO_*`,
`PEM_*` substrings). `re-lief.categorize_strings` populates
the `crypto` bucket with 500+ matches on a 3 MB binary.
4. The import table shows **WinHTTP** (`WinHttpOpen`,
`WinHttpConnect`, `WinHttpOpenRequest`, `WinHttpSendRequest`,
`WinHttpReceiveResponse`, `WinHttpQueryHeaders`,
`WinHttpReadData`) plus the X.509 / Authenticode APIs
(`CryptQueryObject`, `PFXImportCertStore`, `WinVerifyTrust`).
The `network` bucket populates accordingly.
5. The import table shows 8+ of the 12 APIs in
`drm-indicators.yaml::hwid_apis.high_signal`
(`GetComputerNameW`, `GetUserNameW`, `GetVolumeInformationW`,
`CryptAcquireContextW`, `CryptGenRandom`,
`GetAdaptersAddresses`, etc.). The `hwid` bucket populates
accordingly.
6. The import table shows the catalog's anti-debug primitives
(`IsDebuggerPresent`, `OutputDebugStringW`,
`NtQueryInformationProcess`). The `anti_debug` bucket
populates. **Important:** the anti-debug surface is *split*
between the activation DLL and the encrypted-VM-wrapped game
DLL — typically the activation DLL has the Win32 anti-debug
APIs and the game DLL has the VM-encrypted anti-debug.
7. The strings dump shows the `activation` and `obfuscation`
categories from `re-lief.categorize_strings` with non-trivial
counts (typically 50-200 strings each on a 3 MB binary).
When all seven fire, the confidence is **Medium-High** for the
hardware-fingerprinting routine + anti-debug category layered with
a third-party launcher activation library. The activation library
is a *separate* layer from the main game DLL; the encrypted-VM
interpreter does the game-DLL work, the activation DLL does the
license-gate work, and the launcher `.exe` is the glue.
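Two of the Pattern B observations (ordinal-only exports and the HWID import count) reduce to simple set logic. The sketch below assumes the export table is available as `(ordinal, name_or_None)` pairs and the imports as a flat list of names; both shapes are hypothetical. The API set is the eight-entry subset named in the category table earlier in this document, used here only for illustration, since the full `hwid_apis.high_signal` list lives in the catalog.

```python
# Illustrative subset only; the catalog's hwid_apis.high_signal list is
# the authoritative set.
HWID_APIS_EXAMPLE = {
    "CryptAcquireContextW", "CryptGenRandom", "GetUserNameW",
    "GetAdaptersAddresses", "GetNetworkParams", "GetComputerNameW",
    "GetVolumeInformationW", "NtQuerySystemInformation",
}


def exports_are_ordinal_only(exports):
    """True when there is at least one export and none carries a name.

    `exports` is a hypothetical list of (ordinal, name_or_None) pairs.
    """
    return bool(exports) and all(not name for _ordinal, name in exports)


def count_hwid_imports(imported_names, hwid_apis=HWID_APIS_EXAMPLE):
    """Count distinct imported names that appear in the HWID API set."""
    wanted = {api.lower() for api in hwid_apis}
    return len({name.lower() for name in imported_names} & wanted)


exports = [(100, None), (101, None), (102, "")]
imports = ["GetComputerNameW", "GetUserNameW", "WinHttpOpen", "getcomputernamew"]

print("ordinal-only:", exports_are_ordinal_only(exports))
print("HWID imports:", count_hwid_imports(imports))
```

Names are compared case-insensitively and de-duplicated, so a repeated import does not inflate the count.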
### How to detect the patterns
The `re-lief` MCP tool `categorize_strings` drives
the static detection. Call it on every DLL and the launcher
`.exe` in the target. The categorizer buckets strings into
`{anti_debug, hwid, crypto, network, registry, process, file,
fingerprint, activation, obfuscation, misc}` using the keyword
vocabularies in `data/drm-indicators.yaml::string_categories`.
The two seed categories (`anti_debug`, `hwid`) inherit their
keyword lists from the existing
`anti_debug_indicators.checks[].name` and
`hwid_apis.high_signal[].api` lists via a `seed_from:` YAML
pointer — when a future agent adds a new HWID API to
`hwid_apis.high_signal`, the `hwid` category picks it up on next
MCP-server reload with zero Python change.
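The `seed_from:` / `seed_field:` inheritance can be pictured as a dotted-path walk over the parsed catalog. The following is a sketch of that idea on an in-memory dictionary, not the body of `load_categories`; the function name `resolve_seed` and the two-entry catalog are invented for the example.

```python
def resolve_seed(catalog, seed_from, seed_field):
    """Walk a dotted path into the catalog and collect one field per entry."""
    node = catalog
    for part in seed_from.split("."):
        node = node[part]  # a KeyError here would mean a dangling pointer
    return [item[seed_field] for item in node if seed_field in item]


# Toy stand-in for the parsed YAML catalog (hypothetical contents).
catalog = {
    "hwid_apis": {
        "high_signal": [
            {"api": "GetComputerNameW"},
            {"api": "GetVolumeInformationW"},
        ],
    },
    "string_categories": {
        "categories": [
            {"name": "hwid", "seed_from": "hwid_apis.high_signal", "seed_field": "api"},
        ],
    },
}

entry = catalog["string_categories"]["categories"][0]
keywords = resolve_seed(catalog, entry["seed_from"], entry["seed_field"])

# Adding an API to the seed list changes the resolved keywords with no
# change to the resolver itself.
catalog["hwid_apis"]["high_signal"].append({"api": "GetAdaptersAddresses"})
keywords_after = resolve_seed(catalog, entry["seed_from"], entry["seed_field"])

print(keywords)
print(keywords_after)
```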
The patterns above are the combinations that fire together:
- **Pattern A** fires when `obfuscation.count >= 5` AND
`hwid.count >= 5` AND the section table contains at least four
of the seven `\.xtls|\.didata|\.ecode|\.xdata|\.xpdata|\.udata|
\.00cfg` names AND the `.text` section has the
`large_section_with_tiny_text` shape.
- **Pattern B** fires when `activation.count >= 50` AND
`crypto.count >= 100` AND the DLL has ordinal-only exports AND
the import table shows 8+ of the 12 HWID APIs.
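The two rules above can be written as plain predicates. This is a sketch only: it assumes each `by_category` entry exposes a `count` field (as the `obfuscation.count` notation suggests) and takes the section, export and import observations as pre-computed arguments with hypothetical names.

```python
def count_of(by_category, name):
    """Read a bucket's count, treating a missing bucket as zero."""
    return by_category.get(name, {}).get("count", 0)


def pattern_a_fires(by_category, section_name_hits, tiny_text):
    """Pattern A: obfuscation + hwid strings, section names, tiny .text."""
    return (count_of(by_category, "obfuscation") >= 5
            and count_of(by_category, "hwid") >= 5
            and section_name_hits >= 4
            and tiny_text)


def pattern_b_fires(by_category, ordinal_only_exports, hwid_import_count):
    """Pattern B: activation + crypto strings, ordinal exports, HWID imports."""
    return (count_of(by_category, "activation") >= 50
            and count_of(by_category, "crypto") >= 100
            and ordinal_only_exports
            and hwid_import_count >= 8)


game_dll = {"obfuscation": {"count": 12}, "hwid": {"count": 9}}
activation_dll = {"activation": {"count": 140}, "crypto": {"count": 610}}

print(pattern_a_fires(game_dll, section_name_hits=5, tiny_text=True))
print(pattern_b_fires(activation_dll, ordinal_only_exports=True, hwid_import_count=9))
```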
The categorizer is *deterministic* and stays in lockstep with the
catalog: the YAML is the single source of truth for both the
indicator set that `re-drm-fingerprint` reads and the keyword
set that the categorizer reads, so the indicator-based analysis and
the string analysis draw on the same lists and give consistent answers.
## The inference chain
A reverse engineer using RE-AI typically goes:
+18
@@ -5,6 +5,24 @@ All notable changes to RE-AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [2.5.0] - 2026-06-05
### Added
- **`re-lief.categorize_strings`** — new MCP tool. Superset of `extract_strings` (same `{ascii, utf16le, totals, truncated}` shape for backward compatibility) plus a `by_category` block bucketing the strings into 11 keyword categories (`anti_debug`, `hwid`, `crypto`, `network`, `registry`, `process`, `file`, `fingerprint`, `activation`, `obfuscation`, `misc`). The `anti_debug` and `hwid` categories **inherit** their keyword lists from `data/drm-indicators.yaml::anti_debug_indicators.checks[].name` and `hwid_apis.high_signal[].api` via a `seed_from:` YAML pointer — when the catalog is updated, the categorizer picks the new keywords up on next MCP-server reload. Other categories have their keyword lists inline in the YAML under the new `string_categories:` section. New `skip_sections` parameter for memory-bound runs on >100 MB Unity IL2CPP binaries.
- **`data/drm-indicators.yaml::string_categories`** — new section with 11 categories and the `seed_from:` / `seed_field:` schema extension that lets a category inherit from another catalog list. This is the first *consumer* of the catalog in `re-lief` (the prior consumers were all in the skills); the YAML remains the single source of truth for both the indicator set and the keyword set.
- **`servers/re-lief/src/re_lief/categorizers.py`** — new module that loads the catalog (with a small pre-processor to neutralize the regex-literal `\.X` strings the catalog has used for plain-text LLM consumption), resolves `seed_from:` pointers via dotted-path walking, and exposes `categorize(matches, categories, samples_per_category)` for the parser. Cached via `lru_cache`; restart the MCP server to pick up YAML edits.
- **`tests/test_re_lief_categorize_strings.py`** — new soft-skip smoke test that asserts the result shape, the `seed_from:` inheritance works, and the bundled sample (`Input/rhinehartpcfg/Core/Activation64.dll`) populates `crypto` / `network` / `anti_debug` / `hwid` / `activation` as expected. Mirrors the `test_re_lief_imports.py` soft-skip pattern.
- **`ANTI-TAMPER-TAXONOMY.md` — new "Recognizing the patterns in arbitrary binaries" section** — documents *Pattern A* (encrypted-VM bytecode interpreter + the `.ecode` lazy-decrypt stub + the late-bound export tail + 7-section-name co-occurrence) and *Pattern B* (hardware-fingerprinting routine in a third-party launcher activation library with ordinal-only exports + WinHTTP + OpenSSL + HWID-vector APIs) in vendor-neutral category terms. No vendor / publisher / game / PDB-path literals. The "How to detect the patterns" subsection ties the patterns to the new `re-lief.categorize_strings` tool's `by_category` output.
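The pre-processing step mentioned in the `categorizers.py` entry above can be illustrated with the standard library alone. This is a sketch of the idea, not the module's exact code: the names `_UNKNOWN_ESCAPE` and `neutralize` are invented here, and the negative lookahead (which leaves strings that open with a valid `\\` or `\"` escape untouched) is a design choice of this example.

```python
import re

# A double-quoted YAML scalar whose first character is a backslash that
# does NOT start a valid `\\` or `\"` escape, e.g. "\.xtls".
_UNKNOWN_ESCAPE = re.compile(r'"(\\(?![\\"])[^"]*)"')


def neutralize(text: str) -> str:
    """Rewrite unknown-escape double-quoted scalars as single-quoted ones."""
    def to_single(m):
        # In single-quoted YAML, backslashes are literal and '' encodes
        # an apostrophe.
        return "'" + m.group(1).replace("'", "''") + "'"
    return _UNKNOWN_ESCAPE.sub(to_single, text)


regex_line = r'rules: ["\.xtls", "\.ecode"]'
keyword_line = r'- "\\crypto\\"'

print(neutralize(regex_line))    # regex literals become single-quoted
print(neutralize(keyword_line))  # a valid escaped-backslash string is left alone
```

Leaving valid escapes alone matters for keywords such as `"\\crypto\\"`: rewriting them as single-quoted scalars would turn each escaped backslash into two literal backslashes.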
### Changed
- `servers/re-lief/src/re_lief/parsers.py::extract_strings_for_binary` is now a thin wrapper around the new `categorize_strings` (passes `categories=[]`, `include_misc=False`, `max_per_category=200`). Output shape is unchanged; no caller-side migration required.
- 5 skills (`re-static-triage`, `re-malware-triage`, `re-drm-fingerprint`, `re-vm-reverse`, `re-format-decode`) had their manual-grep step replaced with a call to `re-lief.categorize_strings`. No new workflow steps were added — the categorizer *is* the string scan.
- `re-static-triage` description gains "categorize the strings" in the trigger-phrase list (frontmatter is still under the 200-char cap and well above the 40-char floor).
- `servers/re-lief/README.md` gets a `categorize_strings` row and a "Categorization vocabulary" paragraph explaining the `seed_from:` pointer and the catalog-as-source-of-truth invariant.
### Vendor neutrality
- All 11 category names (`anti_debug`, `hwid`, `crypto`, `network`, `registry`, `process`, `file`, `fingerprint`, `activation`, `obfuscation`, `misc`) are generic and pass the `tests/test_no_vendor_leakage.py` grep. The `string_categories` keywords (1,000+ substrings in `data/drm-indicators.yaml`) are all from generic Windows API names, OpenSSL source paths, and standard protocol substrings — no vendor or PDB literal appears. The new `ANTI-TAMPER-TAXONOMY.md` section uses only category names ("encrypted-VM bytecode interpreter", "hardware-fingerprinting routine", "third-party launcher activation library") and the observable composition that defines them.
## [2.4.0] - 2026-06-05
### Added
+290
@@ -357,6 +357,296 @@ anti_debug_indicators:
detection: "manual; flagged by `re-vm-reverse` after the
dispatcher is identified"
# ─────────────────────────────────────────────────────────────────────
# String categories. Used by `re-lief.categorize_strings` to bucket
# the strings extracted from a binary into semantic categories. Two
# categories (`anti_debug`, `hwid`) inherit their keyword lists from
# the catalog lists above via the `seed_from:` / `seed_field:`
# pointer syntax; the rest have inline keyword lists. When a future
# agent adds a new HWID API to `hwid_apis.high_signal`, the
# `hwid` category picks it up on next MCP-server reload with zero
# Python change. All keywords are generic Windows API names,
# OpenSSL source-path fragments, or standard protocol substrings —
# no commercial product, publisher, or PDB-path literal appears.
# ─────────────────────────────────────────────────────────────────────
string_categories:
  description: |
    Buckets for `re-lief.categorize_strings`. Each category is a list
    of case-insensitive substrings; a string is added to a category
    if any keyword matches. A string can match multiple categories
    (counted in each); the categorizer de-duplicates within a
    category by (string, section). Categories whose `seed_from`
    pointer is set inherit their keyword list from the named catalog
    list at module load time; see
    `servers/re-lief/src/re_lief/categorizers.py::load_categories`.
  categories:
    - name: anti_debug
      seed_from: anti_debug_indicators.checks
      seed_field: name
      note: |
        Inherits verbatim from `anti_debug_indicators.checks[].name`
        (IsDebuggerPresent, OutputDebugString,
        NtQueryInformationProcess, etc.). The opcode-level checks
        (RDTSC, INT 2D, INT 3) are in the catalog as byte-pattern
        signals, but `re-lief`'s strings pass only sees printable
        names, so what fires here are the API names, plus C++
        symbols (`_Xlength_error`, `_Xout_of_range`; typeinfo false
        positives) that happen to contain a keyword substring.
    - name: hwid
      seed_from: hwid_apis.high_signal
      seed_field: api
      note: |
        Inherits verbatim from `hwid_apis.high_signal[].api`
        (GetComputerNameW, GetVolumeInformationW,
        GetAdaptersAddresses, etc.). The `medium_signal` set
        (RegOpenKeyExW, RegQueryValueExW, GetSystemInfo, etc.)
        lives in the `registry` and `process` categories below
        for a cleaner bucket split.
    - name: crypto
      keywords:
        - "OpenSSL"
        - "\\crypto\\"
        - "EVP_"
        - "RSA"
        - "AES"
        - "SHA"
        - "HMAC"
        - "DH_"
        - "EC_"
        - "PEM_"
        - "BIO_"
        - "X509"
        - "PKCS"
        - "CRYPTO_"
        - "SSL_"
        - "TLS"
        - "Cipher"
        - "MD5_"
        - "digest"
        - "PRIVATEKEY"
        - "Public-Key"
        - "Private-Key"
        - "key_length"
        - "cms_"
        - "pkey"
        - "ocsp"
        - "crl"
      note: |
        OpenSSL-internal strings, X.509 / CMS / PKCS object names,
        cipher-suite and digest identifiers. Statically-linked
        OpenSSL releases typically contribute 600+ strings to this
        bucket (every `.\crypto\...` source-path fragment counts).
    - name: network
      keywords:
        - "WinHttp"
        - "WinINet"
        - "InternetOpen"
        - "HttpOpenRequest"
        - "WSAStartup"
        - "ws2_32"
        - "connect"
        - "send"
        - "recv"
        - "socket"
        - "gethostbyname"
        - "getaddrinfo"
        - "URL"
        - "http://"
        - "https://"
        - "ftp://"
        - "tcp://"
        - ".com"
        - ".net"
        - ".org"
        - ".io"
        - "DNS"
        - "Host:"
        - "User-Agent:"
        - "Content-Type:"
        - "ocsp."
        - "crl."
        - "ts-ocsp"
      note: |
        HTTP / Winsock / DNS / URL substrings, including CRL/OCSP
        endpoints (the WinVerifyTrust / PFXImportCertStore
        license-validation pattern). False positives: the
        top-level-domain substrings (`.com`, `.net`, etc.) and the
        short generic keywords (`send`, `connect`) will match
        non-network strings; review the `samples[]` to confirm.
    - name: registry
      keywords:
        - "RegOpenKeyEx"
        - "RegQueryValueEx"
        - "RegSetValueEx"
        - "RegCloseKey"
        - "RegCreateKeyEx"
        - "HKEY_"
        - "HKLM"
        - "HKCU"
        - "Software\\Microsoft"
        - "CurrentVersion\\Run"
        - "MachineGuid"
        - "Cryptography"
        - "advapi32"
      note: |
        Registry API names + common key paths. Note: HKLM/HKCU
        are 4-char tokens; a string like 'HKLM\\foo' fires here
        even if the real registry call is in a different binary.
    - name: process
      keywords:
        - "CreateProcess"
        - "CreateThread"
        - "CreateRemoteThread"
        - "OpenProcess"
        - "WriteProcessMemory"
        - "ReadProcessMemory"
        - "VirtualAlloc"
        - "VirtualAllocEx"
        - "VirtualProtect"
        - "VirtualQuery"
        - "NtCreateThread"
        - "ResumeThread"
        - "SuspendThread"
        - "TerminateProcess"
        - "ShellExecute"
        - "WinExec"
        - "CreateProcessW"
        - "CreateProcessA"
      note: |
        Process / thread / memory APIs. Both the same-process
        variants (e.g. VirtualAlloc) and the cross-process
        variants used for remote injection (e.g. VirtualAllocEx,
        CreateRemoteThread) are included.
    - name: file
      keywords:
        - "CreateFile"
        - "ReadFile"
        - "WriteFile"
        - "DeleteFile"
        - "MoveFile"
        - "CopyFile"
        - "GetFileSize"
        - "FindFirstFile"
        - "FindNextFile"
        - "GetTempPath"
        - "GetTempFileName"
        - "CreateFileW"
        - "CreateFileA"
        - "DeleteFileW"
        - "kernel32"
      note: |
        File I/O API names. Includes both W and A variants.
        `kernel32` is included because the OpenSSL path-fragment
        noise often mentions the host DLL; a binary that only
        links kernel32 + the file APIs (a pure copy tool) will
        fire only on this bucket.
    - name: fingerprint
      keywords:
        - "Volume{"
        - "\\\\.\\PhysicalDrive"
        - "\\\\.\\CdRom"
        - "SMBIOS"
        - "Manufacturer"
        - "SerialNumber"
        - "ProductId"
        - "UUID"
        - "MachineGuid"
        - "HKLM\\SOFTWARE\\Microsoft\\Cryptography"
        - "displayName"
        - "enhancedSearchGuide"
        - "searchGuide"
        - "fingerprint"
        - "hostid"
      note: |
        Strings that suggest the binary is reading a
        hardware-fingerprint vector *directly* (not via the API).
        Less about the API, more about the *value*: `Volume{...}`
        is the prefix of a Windows volume GUID path
        (`\\?\Volume{GUID}\`). Most fingerprints reach the binary
        through the API in the `hwid` bucket; this one catches
        the rare case where the fingerprint is inlined as a
        literal.
    - name: activation
      keywords:
        - "Activation"
        - "Activate"
        - "License"
        - "Licence"
        - "Entitlement"
        - "DeregisterEventSource"
        - "RegisterEventSource"
        - "EventSource"
        - "LocalKeySet"
        - "PKCS7"
        - "PKCS8"
        - "PFX"
        - "CMS_"
        - "Recipient"
        - "SignedData"
        - "EnvelopedData"
        - "AuthorityKey"
        - "SubjectKey"
        - "Token"
        - "Challenge"
        - "Response"
        - "Manifest"
        - "msi.dll"
        - "mscoree.dll"
      note: |
        Activation / license-gate vocabulary. Includes PKCS#7 /
        CMS object names and the RegisterEventSource /
        DeregisterEventSource pair that the activation routine
        typically uses to write to the Windows Event Log. False
        positives: any UI string containing the word
        "Activate" (Unity component lifecycle) fires here; review
        `samples[]` to confirm.
    - name: obfuscation
      keywords:
        - "\\crypto\\"
        - "decrypt"
        - "encrypt"
        - "obfuscat"
        - "packed"
        - "xor"
        - "XOR"
        - "ROL"
        - "ROR"
        - "base64"
        - "Base64"
        - "lzma"
        - "zlib"
        - "deflate"
        - "inflate"
        - "RC4"
        - "S-box"
        - "sbox"
        - "lookup"
        - "dispatch"
        - "handler"
        - "vm_entry"
        - "vm_dispatch"
        - "vm_init"
        - "kUSER"
        - "PEB"
        - "BeingDebugged"
        - "NtGlobalFlag"
      note: |
        String patterns that suggest obfuscation / VM-pack code.
        Note `\\crypto\\` is a *path*, not a runtime call: it
        ends up in this bucket via OpenSSL source paths leaking
        into release binaries (a known false positive on
        statically linked OpenSSL). The VM-dispatch strings
        (lookup / dispatch / handler / vm_entry) are the
        encrypted-VM bytecode category signal.
    - name: misc
      keywords: []
      note: |
        Catch-all bucket. Populated only when `include_misc=true`.
        The `uncategorized_sample` field in the categorizer's
        return shape is what callers use to spot *missing*
        categories: a string the user knows is interesting but
        that the YAML doesn't cover is a signal to add a new
        keyword to the appropriate category.
# ─────────────────────────────────────────────────────────────────────
# Pattern indicators. Soft signals — describe the *category* of
# anti-tamper a set of observables suggests, not a specific vendor.
+10
@@ -39,8 +39,17 @@ Pure Python (no system deps). Wraps LIEF for cross-format binary analysis: PE, E
| `list_oat_art` | Methods in an OAT/ART file |
| `disasm_capstone` | Capstone disassembly (works for any LIEF-parsed binary) |
| `extract_strings` | ASCII + UTF-16LE strings, section-aware |
| `categorize_strings` | ASCII + UTF-16LE strings, section-aware, bucketed into 11 keyword categories from `data/drm-indicators.yaml::string_categories`. Superset of `extract_strings` (same `ascii` / `utf16le` / `totals` / `truncated` shape, plus a `by_category` block). |
| `normalize_for_diff` | Structural snapshot for cross-binary diffing |
### `categorize_strings` — keyword-bucketed strings dump
A superset of `extract_strings`: same `{ascii, utf16le, totals, truncated}` shape, plus a `by_category` block keyed by semantic category (`anti_debug`, `hwid`, `crypto`, `network`, `registry`, `process`, `file`, `fingerprint`, `activation`, `obfuscation`, `misc`). Categories are loaded from `data/drm-indicators.yaml::string_categories` at module import time; the `anti_debug` and `hwid` categories *inherit* their keyword lists from `drm-indicators.yaml::anti_debug_indicators.checks[].name` and `hwid_apis.high_signal[].api` respectively (a `seed_from:` pointer in the YAML). When the catalog is updated, the categorizer picks the new keywords up on next MCP server reload.
**Why use it instead of `extract_strings`:** the manual keyword-grep that the v2.4 skills did in the LLM's head is now a deterministic lookup. The categorization is consistent across runs (no LLM variance) and the result is JSON-serializable directly into the triage report.
**Memory note:** on a large binary (>100 MB, e.g. a Unity IL2CPP `GameAssembly.dll` wrapped by an encrypted-VM bytecode interpreter), pass `skip_sections=[".idata", ".xtls", ".xpdata", ".udata", ".xdata", ".didata", ".ecode", ".00cfg"]` to skip the encrypted-VM bytecode regions. Note: on the bundled `Input/rhinehartpcfg/` sample, the import-table strings live *inside* those sections, so skipping them blinds the categorizer to the imports. Use `skip_sections` for memory-bound runs; use the full section walk for completeness.
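The trade-off in the memory note can be seen in a few lines. The sketch below is not the tool's implementation; it models sections as hypothetical `(name, bytes)` pairs and shows that anything stored in a skipped section, import names included, never reaches the string scan.

```python
def iter_scannable(sections, skip_sections=()):
    """Yield (name, data) for every section that is not on the skip list."""
    skip = set(skip_sections)
    for name, data in sections:
        if name in skip:
            continue
        yield name, data


# Toy section table (hypothetical contents).
sections = [
    (".rdata", b"WinHttpOpen\x00"),
    (".idata", b"GetComputerNameW\x00"),
    (".xtls", b"\x9c\x11\xfe\x07"),
]
SKIP = [".idata", ".xtls", ".xpdata", ".udata", ".xdata",
        ".didata", ".ecode", ".00cfg"]

full = [name for name, _ in iter_scannable(sections)]
bounded = [name for name, _ in iter_scannable(sections, SKIP)]

print(full)     # every section is scanned
print(bounded)  # .idata is gone, and GetComputerNameW with it
```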
### Replaces v1 code
The pefile + capstone code from `backend/analysis/native.py` was ported into `parsers.py` and `disasm.py`. LIEF supersedes pefile (same data for PE, plus ELF/MachO/DEX/ART/OAT). The string-extraction algorithm (ASCII + UTF-16LE, regex-driven) is salvaged from v1 and generalized.
@@ -50,6 +59,7 @@ The pefile + capstone code from `backend/analysis/native.py` was ported into `pa
- The format enum is `lief.Binary.FORMATS` (not `lief.FORMATS` or `lief.Formats`)
- `Section` is a base class; concrete sections are `ELF.Section`, `PE.Section`, `MachO.Section` — each with its own `FLAGS` constant
- `has_dynamic`, `has_relro`, `has_bind_now` were dropped from the public API in 0.17. We work around this with `getattr(elf, name, False)`
- **LIEF 0.17.6 `Binary` has no `.strings` property.** A common mistake is to do `b = lief.parse(path); b.strings` — it raises `AttributeError`. Use `re-lief.categorize_strings` (or `re-lief.extract_strings` / `re-rizin.list_strings` for the unfiltered flat list).
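For intuition about what a regex-driven ASCII + UTF-16LE pass does with raw section bytes, here is a simplified sketch using only the standard library. It is not the `re-lief` implementation: the function name, the printable range, and the default minimum length of 4 are assumptions made for the example.

```python
import re


def extract_strings_from_bytes(data: bytes, min_len: int = 4) -> dict:
    """Find printable ASCII runs and UTF-16LE runs of at least min_len chars."""
    # Runs of printable ASCII bytes (space through tilde).
    ascii_re = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # The same characters encoded as UTF-16LE: each followed by a NUL byte.
    utf16_re = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return {
        "ascii": [m.group().decode("ascii") for m in ascii_re.finditer(data)],
        "utf16le": [m.group().decode("utf-16-le") for m in utf16_re.finditer(data)],
    }


blob = (b"\x00\x01IsDebuggerPresent\x00\xff"
        + "License".encode("utf-16-le")
        + b"\x00\x00ab")
found = extract_strings_from_bytes(blob)
print(found)
```

The trailing `ab` is shorter than the minimum length and is dropped, which is the usual way such a pass suppresses noise.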
---
+3 -3
@@ -26,7 +26,7 @@ A skill's `description` field is **critical** — Claude Code uses it to decide
The entry point for unknown binaries. Produces a one-page triage report in under 60 seconds.
-**Workflow:** parallel calls to `re-lief.parse_binary`, `re-lief.get_sections`, `re-rizin.list_imports_exports`, `re-capa.detect_capabilities`, `re-lief.extract_strings`. Then synthesize into a triage table.
+**Workflow:** parallel calls to `re-lief.parse_binary`, `re-lief.get_sections`, `re-rizin.list_imports_exports`, `re-capa.detect_capabilities`, `re-lief.categorize_strings`. Then synthesize into a triage table. The `categorize_strings` result's `by_category` block (anti_debug, hwid, crypto, network, etc.) is the pre-bucketed "strings of interest" view.
**Output:** Markdown report with file info, structure, imports, capabilities, strings, and indicator triage (Benign / Informational / Medium / High / Critical).
@@ -102,7 +102,7 @@ Triton for constraint solving and reachability.
Static-only malware analysis. No detonation, no network.
-**Workflow:** `re-lief.parse_binary` + `get_sections` + `get_authenticode` + `re-capa.detect_capabilities` + `re-rizin.list_imports_exports` + `re-lief.extract_strings` → severity classification.
+**Workflow:** `re-lief.parse_binary` + `get_sections` + `get_authenticode` + `re-capa.detect_capabilities` + `re-rizin.list_imports_exports` + `re-lief.categorize_strings` → severity classification. The `categorize_strings` `by_category` block replaces the manual "grep for encrypt/decode/inject" keyword list.
**Output:** malware report with capabilities (ATT&CK + MBC), suspicious indicators, IOCs, severity, recommendations.
@@ -162,7 +162,7 @@ DRM / anti-tamper detection. Use when you want to know whether a binary contains
**Companion data:** reads `data/drm-indicators.yaml::kuser_shared_data`, `peb`, `hwid_apis`, `section_indicators`, `anti_debug_indicators`, `vendor_guesses`.
-**Workflow:** section triage (`re-lief.get_sections`) → import signal (`re-rizin.list_imports_exports`) → string scan (`re-rizin.list_strings`) → anti-debug check (`re-rizin.search_bytes`) → score synthesis → vendor guess.
+**Workflow:** section triage (`re-lief.get_sections`) → import signal (`re-rizin.list_imports_exports`) → string scan (`re-lief.categorize_strings`) → anti-debug check (`re-rizin.search_bytes`) → score synthesis. The `categorize_strings` `hwid`, `anti_debug`, `obfuscation`, and `fingerprint` bucket counts drive the pattern-indicator score; vendor attribution is the user's call per the policy in `CLAUDE.md`.
**Output:** confidence score (Low / Medium / High), per-section score breakdown, vendor guess, recommended next steps.
+30
@@ -23,6 +23,7 @@ This server is the **foundation** of the RE-AI plugin: it works without any syst
| `list_oat_art` | Android OAT/ART method list |
| `disasm_capstone` | Capstone disassembly (works for any LIEF-parsed binary) |
| `extract_strings` | ASCII + UTF-16LE string extraction with section awareness |
| `categorize_strings` | ASCII + UTF-16LE string extraction, section-aware, bucketed into keyword categories from `data/drm-indicators.yaml::string_categories`. Superset of `extract_strings`. |
| `get_imphash` | PE import hash (MD5 of normalized import table) |
| `normalize_for_diff` | Produce a structural snapshot suitable for diffing two binaries |
@@ -57,3 +58,32 @@ LIEF auto-detects the format and exposes a polyglot API. Most tools return resul
## Deprecation of pefile
If you're familiar with the v1 `re-ai` repo, this server **supersedes** the old pefile-based code. The string-extraction algorithm (ASCII + UTF-16LE) and imphash logic were ported from `backend/analysis/native.py`; the rest of the API is LIEF-native and works for all formats.
## Categorization vocabulary
`categorize_strings` reads its 11 keyword categories from
`data/drm-indicators.yaml::string_categories` at MCP-server load
time. The `anti_debug` and `hwid` categories **inherit** their
keyword lists from
`drm-indicators.yaml::anti_debug_indicators.checks[].name` and
`hwid_apis.high_signal[].api` via a `seed_from:` YAML pointer —
when a future agent adds a new HWID API to `hwid_apis.high_signal`,
the categorizer picks it up automatically on next reload. The
other 9 categories have their keyword lists inline in the YAML
under `string_categories.categories[].keywords`.
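The matching rule described in the catalog (case-insensitive substring match, a string may land in several categories, de-duplication per category by `(string, section)`) can be sketched as follows. This is an illustration, not the `categorize` function in `categorizers.py`: the name `categorize_sketch`, the `(text, section)` input pairs, and the `{"count", "samples"}` output shape are assumptions.

```python
def categorize_sketch(matches, categories, samples_per_category=3):
    """Bucket (text, section) pairs by case-insensitive keyword substring."""
    out = {name: {"count": 0, "samples": []} for name in categories}
    seen = {name: set() for name in categories}
    for text, section in matches:
        lowered = text.lower()
        for name, keywords in categories.items():
            if not any(k.lower() in lowered for k in keywords):
                continue
            if (text, section) in seen[name]:
                continue  # same string in the same section counts once
            seen[name].add((text, section))
            out[name]["count"] += 1
            if len(out[name]["samples"]) < samples_per_category:
                out[name]["samples"].append(text)
    return out


categories = {
    "network": ["WinHttp"],
    "crypto": ["\\crypto\\", "EVP_"],
    "obfuscation": ["\\crypto\\"],
}
matches = [
    ("WinHttpOpen", ".rdata"),
    ("WinHttpOpen", ".rdata"),                # duplicate, ignored
    (".\\crypto\\evp\\evp_enc.c", ".rdata"),  # lands in two categories
    ("hello", ".data"),                       # matches nothing
]
result = categorize_sketch(matches, categories)
print({name: bucket["count"] for name, bucket in result.items()})
```

The OpenSSL path string is counted in both `crypto` and `obfuscation`, which mirrors the known false positive called out in the catalog notes.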
This keeps the categorizer in lockstep with the catalog: the
YAML is the single source of truth for both the indicator set
that `re-drm-fingerprint` reads and the keyword set that the
categorizer reads, so the indicator-based analysis and the string
analysis draw on the same lists and give consistent answers.
On large binaries (>100 MB, e.g. a Unity IL2CPP `GameAssembly.dll`
wrapped by an encrypted-VM bytecode interpreter), pass
`skip_sections=[".idata", ".xtls", ".xpdata", ".udata", ".xdata",
".didata", ".ecode", ".00cfg"]` to skip the encrypted-VM
bytecode regions. Note: on the bundled `Input/rhinehartpcfg/`
sample, the import-table strings live *inside* those sections,
so skipping them blinds the categorizer to the imports. Use
`skip_sections` for memory-bound runs; use the full section walk
for completeness.
+160
@@ -0,0 +1,160 @@
"""Keyword categorizers for re-lief.categorize_strings.
Categories are loaded from data/drm-indicators.yaml::string_categories
at module import time. Two seed categories (``anti_debug`` and
``hwid``) inherit their keyword lists from existing catalog
sections via a ``seed_from`` / ``seed_field`` pointer — when a
future agent adds a new HWID API to ``hwid_apis.high_signal``, the
categorizer picks it up on next MCP-server reload with zero Python
change.
The YAML catalog includes section-name regex patterns like
``"\\.vm"`` and ``"\\.xtls"`` that are *deliberately* invalid YAML
double-quoted escapes (they are regex literals, not YAML escapes).
The catalog is read by the LLM as plain text per
``data/drm-indicators.yaml:5-8``, so the broken escapes never
affected existing functionality. To make the catalog parseable
for machine consumption, this module pre-processes the file to
convert those double-quoted strings to single-quoted strings
(where backslashes are literal).
Categories are descriptive — they describe observable string
content, not specific commercial products. The catalog is
vendor-neutral per ``CLAUDE.md``.
"""
from __future__ import annotations
import re
from functools import lru_cache
from pathlib import Path
from typing import Any
import yaml
# Locate the catalog relative to this file.
# servers/re-lief/src/re_lief/categorizers.py → ../../../../data/drm-indicators.yaml
_PLUGIN_ROOT = Path(__file__).resolve().parents[4]
_CATALOG_PATH = _PLUGIN_ROOT / "data" / "drm-indicators.yaml"
# Pre-process the catalog to make it safe_load-compatible. The catalog
# contains section-name regex literals like "\.vm", "\.xtls" inside
# YAML double-quoted strings, which are invalid YAML escapes (only
# specific ones like \n, \t, \\, \" are recognized). Convert those
# double-quoted strings to single-quoted strings where backslashes
# are literal. This is a no-op for the `string_categories:` block
# (which doesn't use the regex syntax) and a no-op for blocks that
# already use single-quoted strings.
#
# The pattern requires a backslash IMMEDIATELY after the opening
# quote (this is what distinguishes an unknown-escape string from
# a normal one). We capture the backslash + content + closing
# quote, then rewrite as a single-quoted string. The negated
# class `[^"]*` cannot cross a double quote, so the match always
# ends at the NEAREST closing quote, not a later one.
_DOUBLE_QUOTED_WITH_BACKSLASH = re.compile(r'"(\\[^"]*)"')
def _preprocess_yaml(text: str) -> str:
"""Neutralize unknown-escape double-quoted strings for safe_load."""
def _to_single(m: re.Match[str]) -> str:
# In single-quoted YAML strings, only '' is an escape (for a
# literal apostrophe). Backslashes are literal. We have to
# also double any embedded single quotes.
body = m.group(1).replace("'", "''")
return f"'{body}'"
return _DOUBLE_QUOTED_WITH_BACKSLASH.sub(_to_single, text)
@lru_cache(maxsize=1)
def _load_catalog() -> dict[str, Any]:
"""Parse the catalog once. Subsequent calls return the cached dict."""
return yaml.safe_load(_preprocess_yaml(_CATALOG_PATH.read_text(encoding="utf-8")))
@lru_cache(maxsize=1)
def load_categories() -> dict[str, list[str]]:
"""Return ``{category_name: [keyword, ...]}`` resolved from the YAML.
Categories with a ``seed_from:`` pointer inherit their keyword
list from the catalog list at that dotted path (e.g. the
``anti_debug`` category gets the ``name`` field of every entry
in ``anti_debug_indicators.checks``). Categories with an inline
``keywords:`` list use that list directly.
The result is cached via ``lru_cache``; restart the MCP server
to pick up YAML edits.
"""
cat = _load_catalog()
out: dict[str, list[str]] = {}
for entry in cat.get("string_categories", {}).get("categories", []):
name = entry["name"]
if "seed_from" in entry:
node: Any = cat
for part in entry["seed_from"].split("."):
node = node[part]
out[name] = [str(e[entry["seed_field"]]) for e in node]
else:
out[name] = list(entry.get("keywords", []))
return out
def categorize(
matches: list[dict[str, Any]],
categories: list[str] | None = None,
max_per_category: int = 200,
samples_per_category: int = 5,
) -> dict[str, dict[str, Any]]:
"""Bucket *matches* into the configured categories.
Each ``match`` is a dict with at least ``"string"`` and
``"section"`` keys. A match can be counted in multiple
categories (substring match is permissive). Each category's
``count`` is the number of *unique* (string, section) pairs;
``samples`` is a list of up to ``samples_per_category``
example matches.
Parameters
----------
matches
List of ``{"string": ..., "offset": ..., "section": ...}`` dicts.
categories
If given, restrict to this subset of category names.
max_per_category
Accepted for signature parity with ``categorize_strings`` but not
applied here; ``count`` is always the full unique-match count and
``samples`` is capped by ``samples_per_category``.
samples_per_category
Cap on the number of sample matches returned per category.
"""
cats = load_categories()
if categories is not None:
cats = {k: v for k, v in cats.items() if k in categories}
out: dict[str, dict[str, Any]] = {
name: {"count": 0, "samples": []} for name in cats
}
seen_in_cat: dict[str, set[tuple[str, str]]] = {
name: set() for name in cats
}
for m in matches:
s = m.get("string", "")
if not s:
continue
s_lower = s.lower()
section = m.get("section", "")
for name, keywords in cats.items():
for kw in keywords:
if kw and kw.lower() in s_lower:
key = (s, section)
if key in seen_in_cat[name]:
break
seen_in_cat[name].add(key)
out[name]["count"] += 1
if len(out[name]["samples"]) < samples_per_category:
out[name]["samples"].append(
{"string": s, "section": section}
)
break # count each match at most once per category
return out
+168 -17
@@ -405,42 +405,193 @@ def list_oat_art(path: str) -> list[dict[str, Any]]:
def extract_strings_for_binary(
path: str, min_length: int = 5
) -> dict[str, Any]:
"""Section-aware string extraction across all sections."""
"""Section-aware string extraction across all sections.
Backward-compatible wrapper around ``categorize_strings`` — the
return shape is ``{ascii, utf16le, totals, truncated}`` so any
caller that was reading the v2.4 shape continues to work.
"""
result = categorize_strings(
path,
min_length=min_length,
categories=[],
include_misc=False,
max_per_category=200,
samples_per_category=200,
skip_sections=None,
)
return {
"ascii": result["ascii_capped"],
"utf16le": result["utf16le_capped"],
"totals": {
"ascii": result["totals"]["ascii_extracted"],
"utf16le": result["totals"]["utf16le_extracted"],
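},
# categories=[] leaves by_category empty, so per_category is always
# False; per_encoding carries the v2.4 "list was capped" meaning.
"truncated": result["truncated"]["per_encoding"],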
}
def categorize_strings(
path: str,
min_length: int = 5,
categories: list[str] | None = None,
include_misc: bool = True,
max_per_category: int = 200,
samples_per_category: int = 5,
skip_sections: list[str] | None = None,
) -> dict[str, Any]:
"""Keyword-bucketed strings dump (superset of extract_strings).
The categorization vocabulary is loaded from
``data/drm-indicators.yaml::string_categories`` on first use and
cached; see ``re_lief.categorizers``. Two categories
(``anti_debug``, ``hwid``) inherit their keyword lists from
the existing catalog sections via a ``seed_from`` pointer;
the rest have inline keyword lists.
Parameters
----------
path
File to analyze.
min_length
Minimum printable run length to consider (default 5).
categories
Subset of category names to populate. ``None`` = all
11 categories.
include_misc
Whether to populate the ``misc`` catch-all bucket.
max_per_category
Cap on the flat ``ascii_capped`` / ``utf16le_capped`` lists, and
the threshold above which ``truncated.per_category`` is set. Each
category's ``count`` is reported honestly regardless of this cap.
samples_per_category
Cap on how many example matches to include in each category's
``samples`` list (kept small to keep the JSON payload
manageable). The full count is in ``count``.
skip_sections
Section names to skip during extraction (e.g.
``[".idata", ".xtls"]`` to skip the encrypted-VM
bytecode regions on a 500+ MB Unity IL2CPP binary).
Returns a JSON-serializable dict with the schema documented in
``docs/MCP_SERVERS.md``.
"""
# Import here to avoid a top-level import cycle on first MCP
# server load (the categorizer pulls in pyyaml).
from re_lief.categorizers import categorize, load_categories
binary = _parse(path)
if binary is None:
raise ValueError(f"Could not parse {path}")
skip_set = set(skip_sections or [])
all_ascii: list[dict[str, Any]] = []
all_utf16: list[dict[str, Any]] = []
for section in binary.sections:
if section.name in skip_set:
continue
try:
data = bytes(section.content)
except Exception: # noqa: BLE001
continue
extracted = extract_strings(data, min_length=min_length)
# Add a section-name tag to each match
for m in extracted["ascii"]:
m["section"] = section.name
all_ascii.append(m)
for m in extracted["utf16le"]:
m["section"] = section.name
all_utf16.append(m)
# Deduplicate and cap at 200 each
def _dedup_cap(lst: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], int]:
seen: dict[tuple[str, str], dict[str, Any]] = {}
for m in lst:
key = (m["string"], m["section"])
if key not in seen:
seen[key] = m
ordered = sorted(seen.values(), key=lambda x: (-len(x["string"]), x["string"]))
return ordered[:200], len(ordered)
ascii_capped, ascii_total = _dedup_cap(all_ascii)
utf16_capped, utf16_total = _dedup_cap(all_utf16)
# Combine the ASCII + UTF-16LE match lists for the categorizer.
# The categorizer doesn't care about the encoding; it just sees
# printable substrings. We tag each match so a future caller
# can filter by encoding if needed.
for m in all_ascii:
m["encoding"] = "ascii"
for m in all_utf16:
m["encoding"] = "utf16le"
all_matches = all_ascii + all_utf16
# Deduplicate within (string, section) for fair per-category counts.
seen: set[tuple[str, str]] = set()
deduped: list[dict[str, Any]] = []
for m in all_matches:
key = (m["string"], m.get("section", ""))
if key in seen:
continue
seen.add(key)
deduped.append(m)
# Filter the category list (None = all).
cat_names = categories if categories is not None else list(load_categories().keys())
if not include_misc and "misc" in cat_names:
cat_names = [c for c in cat_names if c != "misc"]
by_category = categorize(
deduped,
categories=cat_names,  # always filtered, so include_misc=False also applies when categories is None
samples_per_category=samples_per_category,
)
# Per-category truncation flag: counts are always reported in full;
# flag any category whose count exceeds max_per_category.
truncated_per_category = False
for cat_name, info in by_category.items():
if info["count"] > max_per_category:
truncated_per_category = True
# samples were already capped at samples_per_category by the
# categorizer; this is the higher-level cap.
# Per-encoding "honest" cap for the pre-cap flat lists.
def _dedup_cap(lst: list[dict[str, Any]], cap: int) -> tuple[list[dict[str, Any]], int]:
seen_local: dict[tuple[str, str], dict[str, Any]] = {}
for m in lst:
key = (m["string"], m.get("section", ""))
if key not in seen_local:
seen_local[key] = m
ordered = sorted(seen_local.values(), key=lambda x: (-len(x["string"]), x["string"]))
return ordered[:cap], len(ordered)
ascii_capped, ascii_total = _dedup_cap(all_ascii, max_per_category)
utf16_capped, utf16_total = _dedup_cap(all_utf16, max_per_category)
# Uncategorised sample: the misc bucket's samples (at most
# samples_per_category entries; helps the user spot missing categories).
uncategorized_sample: list[dict[str, Any]] = []
misc_info = by_category.get("misc", {})
if include_misc and "samples" in misc_info:
uncategorized_sample = list(misc_info["samples"])
# Otherwise (misc disabled), take a slice of strings that appear in
# no category's samples. Samples are capped, so this approximates
# "matched zero categories" rather than guaranteeing it.
if not include_misc:
cat_keys = {(s.get("string"), s.get("section")) for info in by_category.values() for s in info.get("samples", [])}
extras = [m for m in deduped if (m["string"], m.get("section", "")) not in cat_keys]
uncategorized_sample = sorted(
extras, key=lambda x: -len(x["string"])
)[:50]
return {
"ascii": ascii_capped,
"utf16le": utf16_capped,
"totals": {"ascii": ascii_total, "utf16le": utf16_total},
"truncated": ascii_total > 200 or utf16_total > 200,
"path": path,
"min_length": min_length,
"totals": {
"ascii_extracted": len(all_ascii),
"utf16le_extracted": len(all_utf16),
"deduplicated": len(deduped),
"categorized": sum(
info["count"] for info in by_category.values()
),
},
"truncated": {
"input": False, # we don't currently hard-cap input
"per_category": truncated_per_category,
"per_encoding": ascii_total > max_per_category or utf16_total > max_per_category,
},
"by_category": by_category,
"ascii_capped": ascii_capped,
"utf16le_capped": utf16_capped,
"uncategorized_sample": uncategorized_sample,
}
+78
@@ -142,10 +142,88 @@ def extract_strings(path: str, min_length: int = 5) -> dict:
Returns ``{"ascii": [...], "utf16le": [...], "totals": {...}, "truncated": bool}``.
Each string has ``string``, ``offset``, and ``section`` fields.
.. note::
This is the v2.4 shape, kept stable for backward compatibility.
New code should call ``categorize_strings`` (below), which
returns the same arrays under ``ascii_capped`` / ``utf16le_capped``
*plus* a keyword-bucketed ``by_category`` block.
"""
return parsers.extract_strings_for_binary(path, min_length=min_length)
@mcp.tool()
def categorize_strings(
path: str,
min_length: int = 5,
categories: list[str] | None = None,
include_misc: bool = True,
max_per_category: int = 200,
samples_per_category: int = 5,
skip_sections: list[str] | None = None,
) -> dict:
"""Extract strings from *path* and bucket them into semantic categories.
The categorization vocabulary is loaded from
``data/drm-indicators.yaml::string_categories`` on first call and
cached until the MCP server restarts. Two categories (``anti_debug``, ``hwid``) inherit
their keyword lists from the existing catalog sections via a
``seed_from`` pointer; the rest have inline keyword lists.
When a future agent adds a new HWID API to
``hwid_apis.high_signal``, the ``hwid`` category picks it up on
next MCP-server reload with zero Python change.
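The return shape carries everything ``extract_strings`` returns (the flat arrays are renamed ``ascii_capped`` / ``utf16le_capped`` and ``truncated`` becomes a dict) plus the ``by_category`` block: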
::
{
"path": "...",
"min_length": 5,
"totals": {"ascii_extracted": N, "utf16le_extracted": N,
"deduplicated": N, "categorized": N},
"truncated": {"input": bool, "per_category": bool,
"per_encoding": bool},
"by_category": {
"anti_debug": {"count": N, "samples": [{"string":..., "section":...}, ...]},
"hwid": {"count": N, "samples": [...]},
"crypto": {"count": N, "samples": [...]},
"network": {"count": N, "samples": [...]},
"registry": {"count": N, "samples": [...]},
"process": {"count": N, "samples": [...]},
"file": {"count": N, "samples": [...]},
"fingerprint": {"count": N, "samples": [...]},
"activation": {"count": N, "samples": [...]},
"obfuscation": {"count": N, "samples": [...]},
"misc": {"count": N, "samples": [...]}
},
"ascii_capped": [...], # backward-compat with extract_strings
"utf16le_capped": [...],
"uncategorized_sample": [...] # misc samples, or up to 50 unmatched strings when include_misc=False
}
On large binaries (e.g. a 500+ MB Unity IL2CPP ``GameAssembly.dll``
wrapped by an encrypted-VM bytecode interpreter), pass
``skip_sections=[".idata", ".xtls", ".xpdata", ".udata", ".xdata",
".didata", ".ecode", ".00cfg"]`` to skip the encrypted-VM
bytecode regions and cut the memory footprint. This trades
completeness for memory: on the bundled sample the import-table
strings live inside those sections, so skipping them hides the
imports from the categorizer.
Categories are descriptive — they describe observable string
content, not specific commercial products.
"""
return parsers.categorize_strings(
path,
min_length=min_length,
categories=categories,
include_misc=include_misc,
max_per_category=max_per_category,
samples_per_category=samples_per_category,
skip_sections=skip_sections,
)
@mcp.tool()
def normalize_for_diff(path: str) -> dict:
"""Return a structural snapshot suitable for diffing two binaries.
+7 -4
@@ -52,11 +52,14 @@ The skill runs in 5 stages. Stages 1-4 are static; stage 5 is LLM-assisted synth
- `LoadLibraryA` + `GetProcAddress` — the binary dynamically resolves helpers, defeating import hooks. +1.
- Ordinal-only imports — common in anti-tamper-wrapped binaries. +0.5.
### Stage 3 — String scan (re-rizin, parallel with Stage 4)
### Stage 3 — String scan (re-lief, parallel with Stage 4)
1. `re-rizin.list_strings(path, min_length=8)`.
2. Grep for the *byte-pattern indicators* in `data/drm-indicators.yaml::pattern_indicators` (vendor-tagged string literals, section-name suffixes, debug-symbol tokens). Each match +2.
3. Grep for runtime strings that suggest HWID assembly: `"%s\\%s"`, `"Volume{..."`, `"REGISTRY\\MACHINE\\..."`. +0.5 each.
1. `re-lief.categorize_strings(path, min_length=5, max_per_category=200)`.
2. Score the `hwid` bucket: each matched high-signal HWID API from `data/drm-indicators.yaml::hwid_apis.high_signal` is worth +2.
3. Score the `anti_debug` bucket the same way: each catalog primitive (`IsDebuggerPresent`, `OutputDebugString`, `NtQueryInformationProcess`) is worth +0.5.
4. The `obfuscation` bucket contains the VM-pack byte-pattern indicators (the seed keywords include `decrypt`, `dispatch`, `handler`, `vm_entry`, `kUSER`, `PEB`, `BeingDebugged`, `NtGlobalFlag`). +2 per unique match.
5. Runtime strings that suggest HWID assembly land in the `fingerprint` bucket (`Volume{...}`, `MachineGuid`, `SMBIOS` keywords). +0.5 each.
6. **Special case: encrypted-VM bytecode interpreter.** If the binary has a `large_section_with_tiny_text` shape and a `\.xtls` / `\.didata` / `\.ecode` / `\.xdata` / `\.xpdata` / `\.udata` / `\.00cfg` section (from the `section_indicators` rules), the `obfuscation` bucket fires on the *string-table entries* of the encrypted bytecode region (lookup / dispatch / handler strings) even though the bytecode itself is opaque. Treat that as the encrypted-VM bytecode category signal and cross-reference it with the section list to confirm.
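
The scoring rules in this stage can be sketched as a weighted sum over the `by_category` block. This is an illustrative sketch only: `STAGE3_WEIGHTS` and `stage3_score` are hypothetical names, and it assumes each bucket's `count` is an acceptable stand-in for "number of matches" (the categorizer counts unique `(string, section)` pairs, so the same API name seen in two sections would score twice).

```python
from __future__ import annotations

from typing import Any

# Weights copied from the Stage 3 steps above; buckets not listed score 0.
STAGE3_WEIGHTS = {
    "hwid": 2.0,
    "anti_debug": 0.5,
    "obfuscation": 2.0,
    "fingerprint": 0.5,
}


def stage3_score(by_category: dict[str, dict[str, Any]]) -> float:
    """Sum weight * count over the scored buckets; missing buckets count as 0."""
    return sum(
        weight * by_category.get(name, {}).get("count", 0)
        for name, weight in STAGE3_WEIGHTS.items()
    )


example = {
    "hwid": {"count": 3, "samples": []},
    "anti_debug": {"count": 2, "samples": []},
    "obfuscation": {"count": 1, "samples": []},
    "fingerprint": {"count": 4, "samples": []},
    "crypto": {"count": 100, "samples": []},  # not scored in Stage 3
}
print(stage3_score(example))  # 3*2 + 2*0.5 + 1*2 + 4*0.5 = 11.0
```

Keeping the weights in one table makes it easy to compare them against the catalog if the scoring rules change.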
### Stage 4 — Anti-debug / direct read check (re-rizin, parallel)
+1 -1
@@ -23,7 +23,7 @@ The Kaitai workflow is **iterative**: you write a partial `.ksy`, compile it, pa
**Iteration 0 — Identify the file**
1. `re-lief.parse_binary(path)` — get the magic bytes, file size, hashes.
2. `re-lief.extract_strings(path, min_length=8)` — look for printable strings (sometimes the format name is embedded).
2. `re-lief.categorize_strings(path, min_length=5, max_per_category=50, include_misc=true)`: grep the top-level `uncategorized_sample[]` (drawn from the `misc` bucket) for printable strings; the format name, version tag, or magic-byte trailer is usually there. The categorized buckets are noise here because the category vocabularies are tuned for binary-protection indicators, not for format identification.
3. If the file has a known file-extension → magic-byte lookup table, try `re-kaitai.list_known_formats()` and parse with a known format to seed the work.
**Iteration 1 — First .ksy**
+2 -2
@@ -44,8 +44,8 @@ Common prompts:
**Step 4 — Strings of interest (re-lief, parallel)**
1. `re-lief.extract_strings(path, min_length=5)`.
2. Grep for: URLs, IPs, registry keys, mutexes, pipe names, suspicious keywords (encrypt, decode, inject, shellcode, payload, beacon, persist, dump, keylog, password, config, mutex, sandbox, bypass).
1. `re-lief.categorize_strings(path, min_length=5, max_per_category=200)`.
2. Inspect the `by_category` block. The `network` bucket surfaces URLs/IPs/hostnames; `registry` surfaces the persistence keys; `anti_debug` surfaces the debugger checks; `process` surfaces the injection API set; `crypto` + `obfuscation` surface the payload-evasion signals. The keyword list in v1 (encrypt, decode, inject, …) is now a deterministic lookup against `data/drm-indicators.yaml::string_categories` instead of a manual grep.
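
One possible way to inspect the `by_category` block is to list the non-empty buckets, largest first, with their sample strings. The sketch below assumes only the documented result shape (`count` plus `samples[]` entries with a `string` key); `nonempty_buckets` is a hypothetical helper, not a re-lief API, and the sample result is made up for the example.

```python
from __future__ import annotations

from typing import Any


def nonempty_buckets(result: dict[str, Any]) -> list[tuple[str, int, list[str]]]:
    """Return (category, count, sample strings) rows for buckets with count > 0.

    Rows are ordered by descending count, then by category name.
    """
    rows = []
    for name, info in result.get("by_category", {}).items():
        count = info.get("count", 0)
        if count > 0:
            strings = [s["string"] for s in info.get("samples", [])]
            rows.append((name, count, strings))
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Made-up result in the documented shape, for illustration only.
fake_result = {
    "by_category": {
        "network": {"count": 2, "samples": [{"string": "http://example.invalid/a", "section": ".rdata"}]},
        "registry": {"count": 0, "samples": []},
        "crypto": {"count": 7, "samples": [{"string": "AES_encrypt", "section": ".rdata"}]},
    }
}
for name, count, strings in nonempty_buckets(fake_result):
    print(f"{name}: {count} {strings}")
```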
**Step 5 — Severity classification**
+3 -2
@@ -1,6 +1,6 @@
---
name: re-static-triage
description: First-pass triage of an unknown binary. Use when the user says "analyze this binary", "what is this file", "triage this", or hands you an unknown executable or DLL. Calls re-lief, re-rizin, and re-capa in parallel and surfaces file info, format, sections, imports, capabilities, and suspicious indicators. Does NOT decompile or do dynamic analysis — escalate to re-decompile or re-malware-triage if a deeper look is needed.
description: First-pass triage of an unknown binary. Use when the user says "analyze this binary", "what is this file", "triage this", "categorize the strings", or hands you an unknown executable or DLL. Calls re-lief, re-rizin, and re-capa in parallel and surfaces file info, format, sections, imports, capabilities, and suspicious indicators. Does NOT decompile or do dynamic analysis — escalate to re-decompile or re-malware-triage if a deeper look is needed.
---
# Static Triage of an Unknown Binary
@@ -40,7 +40,8 @@ After this skill finishes, the user can choose to:
- Call `re-capa.detect_capabilities(path)`. Use the result for the ATT&CK/MBC summary.
**Step 5 — Strings of interest (re-lief, in parallel with Step 4)**
- Call `re-lief.extract_strings(path, min_length=5)`. Grep the result for URLs, IPs, registry keys, mutexes, suspicious keywords.
- Call `re-lief.categorize_strings(path, min_length=5, max_per_category=200)`. The result is pre-bucketed into {crypto, network, registry, anti_debug, hwid, process, file, fingerprint, activation, obfuscation, misc}. Inspect each bucket's `count` + `samples[]` to populate the "Strings of interest" table below.
- On large binaries (>100 MB, e.g. a Unity IL2CPP `GameAssembly.dll` wrapped by an encrypted-VM bytecode interpreter), pass `skip_sections=[".idata", ".xtls", ".xpdata", ".udata", ".xdata", ".didata", ".ecode", ".00cfg"]` to skip the encrypted-VM bytecode regions. (Note: on the bundled GameAssembly sample, the import-table strings live *inside* those sections — skip only when memory is a concern, not for full visibility.)
**Step 6 — Indicator triage**
- Combine the above into a single triage table using the framework at the end of this skill.
+1 -1
@@ -44,7 +44,7 @@ The skill runs in 5 stages. Stages 1-3 are static; stage 4 is dynamic; stage 5 i
1. `re-rizin.list_imports_exports(path)`. Look for the import patterns from `drm-indicators.yaml::hwid_apis` (a custom VM often pairs with a fingerprinting routine). Specifically:
- Ordinal-only imports (no name) — common when the VM imports its helpers by ordinal.
- Imports of `LoadLibraryA` + `GetProcAddress` — almost certain: the VM dynamically resolves helpers to defeat import hooking.
2. `re-rizin.list_strings(path, min_length=8)` for `.vm`-style byte-pattern indicators — vendor-tagged SDK tokens, vendor-tagged dispatch strings, and the universal "license / decrypt / obfuscate" markers. The specific list lives in `data/drm-indicators.yaml::pattern_indicators.mappings[*].indicators`.
2. `re-lief.categorize_strings(path, min_length=5, max_per_category=200, skip_sections=[".idata", ".xtls", ".xpdata", ".udata", ".xdata", ".didata", ".ecode", ".00cfg"])`: the `obfuscation` and `crypto` buckets surface the dispatch / handler / license markers; the `hwid` and `anti_debug` buckets are the cross-check against the encrypted-VM license-wrapper family in `pattern_indicators`. The `activation` bucket is the key one for the encrypted-VM license-gate path (late-bound license calls). The `by_category` map is the input to the `pattern_indicators` lookup at Stage 5. If the `hwid` / `anti_debug` buckets come back empty, rerun without `skip_sections`: the import-table strings can live inside the skipped sections.
### Stage 3 — Find the dispatcher (re-rizin)
+314
@@ -0,0 +1,314 @@
"""Smoke tests for re-lief.categorize_strings.
Mirrors the soft-skip pattern from ``test_re_lief_imports.py``.
Asserts:
- The new tool is importable as part of ``re_lief.server.mcp``.
- The result shape matches the documented schema
(``by_category`` + ``totals`` + ``truncated`` + ``ascii_capped``
+ ``utf16le_capped`` + ``uncategorized_sample``).
- On the bundled sample (a 3 MB third-party launcher activation
library) the high-signal categories fire with the expected
counts: ``crypto`` ≥ 100 (statically-linked OpenSSL 1.0.2f),
``network`` ≥ 50 (WinHTTP + URLs), ``anti_debug`` ≥ 3
(IsDebuggerPresent + NtQueryInformationProcess +
OutputDebugStringW), ``hwid`` ≥ 3 (GetComputerNameW +
CryptAcquireContextW + CryptGenRandom), ``activation`` ≥ 50.
- The legacy ``extract_strings_for_binary`` wrapper still returns
the v2.4 ``{ascii, utf16le, totals, truncated}`` shape (this
guards the backward-compat promise of the refactor).
- The ``seed_from`` inheritance works: the ``anti_debug`` and
``hwid`` categories in the categorizer match the
``anti_debug_indicators.checks[].name`` and
``hwid_apis.high_signal[].api`` lists in the YAML.
"""
from __future__ import annotations
import importlib
import importlib.util
import sys
from pathlib import Path
import pytest
# Use the same bundled samples as test_re_lief_imports.py so the
# assertions match the smoke-test report.
TARGET_ACTIVATION64 = (
Path(__file__).resolve().parent.parent
/ "Input"
/ "rhinehartpcfg"
/ "Core"
/ "Activation64.dll"
)
TARGET_GAME_ASSEMBLY = (
Path(__file__).resolve().parent.parent
/ "Input"
/ "rhinehartpcfg"
/ "GameAssembly.dll"
)
def _try_load_re_lief() -> object | None:
pkg_root = (
Path(__file__).resolve().parent.parent
/ "servers"
/ "re-lief"
/ "src"
)
if not (pkg_root / "re_lief").exists():
return None
sys.path.insert(0, str(pkg_root))
for k in list(sys.modules):
if k.startswith("re_lief"):
del sys.modules[k]
try:
return importlib.import_module("re_lief.parsers")
except ImportError as exc:
msg = str(exc).lower()
if any(dep in msg for dep in ("lief", "mcp", "capstone", "yaml")):
return None
raise
def test_categorize_strings_is_registered_on_mcp() -> None:
"""The MCP server must expose ``categorize_strings`` as a tool."""
pkg_root = (
Path(__file__).resolve().parent.parent
/ "servers"
/ "re-lief"
/ "src"
)
if not (pkg_root / "re_lief").exists():
pytest.skip("re_lief not built")
sys.path.insert(0, str(pkg_root))
for k in list(sys.modules):
if k.startswith("re_lief"):
del sys.modules[k]
try:
import re_lief.server as server # noqa: F401
except ImportError as exc:
msg = str(exc).lower()
if any(dep in msg for dep in ("lief", "mcp", "capstone", "yaml")):
pytest.skip(f"re-lief missing optional dep: {exc}")
raise
tools = list(server.mcp._tool_manager._tools.keys())
assert "categorize_strings" in tools, (
f"categorize_strings must be a registered MCP tool; got: {tools}"
)
assert "extract_strings" in tools, (
"extract_strings must still be registered (the legacy wrapper)"
)
def test_categorize_strings_result_shape_on_activation64() -> None:
"""The categorizer returns the documented schema on the bundled sample."""
parsers = _try_load_re_lief()
if parsers is None:
pytest.skip("re_lief not built")
if not TARGET_ACTIVATION64.exists():
pytest.skip(f"sample not present: {TARGET_ACTIVATION64}")
result = parsers.categorize_strings(str(TARGET_ACTIVATION64))
# Schema check: every documented top-level key is present.
expected_keys = {
"path", "min_length", "totals", "truncated", "by_category",
"ascii_capped", "utf16le_capped", "uncategorized_sample",
}
assert expected_keys.issubset(result.keys()), (
f"missing keys: {expected_keys - set(result.keys())}"
)
# by_category has all 11 categories.
expected_cats = {
"anti_debug", "hwid", "crypto", "network", "registry", "process",
"file", "fingerprint", "activation", "obfuscation", "misc",
}
assert set(result["by_category"].keys()) == expected_cats, (
f"by_category keys mismatch: "
f"{set(result['by_category'].keys()) ^ expected_cats}"
)
# Each category has count + samples.
for cat, info in result["by_category"].items():
assert "count" in info, f"category {cat} missing count"
assert "samples" in info, f"category {cat} missing samples"
assert isinstance(info["count"], int)
assert isinstance(info["samples"], list)
def test_categorize_strings_high_signal_categories_fire() -> None:
"""The catalog's high-signal categories must hit on the bundled sample."""
parsers = _try_load_re_lief()
if parsers is None:
pytest.skip("re_lief not built")
if not TARGET_ACTIVATION64.exists():
pytest.skip(f"sample not present: {TARGET_ACTIVATION64}")
result = parsers.categorize_strings(str(TARGET_ACTIVATION64))
bc = result["by_category"]
# OpenSSL 1.0.2f is statically linked into this binary, so the
# crypto bucket must be huge.
assert bc["crypto"]["count"] >= 100, (
f"crypto.count expected >= 100 (statically linked OpenSSL), "
f"got {bc['crypto']['count']}"
)
# WinHTTP + OCSP endpoints contribute to network.
assert bc["network"]["count"] >= 50, (
f"network.count expected >= 50 (WinHTTP + URLs), "
f"got {bc['network']['count']}"
)
# The catalog's anti-debug primitives are all imported.
assert bc["anti_debug"]["count"] >= 3, (
f"anti_debug.count expected >= 3 (IsDebuggerPresent, "
f"NtQueryInformationProcess, OutputDebugStringW), "
f"got {bc['anti_debug']['count']}"
)
# The HWID APIs imported by the activation library.
assert bc["hwid"]["count"] >= 3, (
f"hwid.count expected >= 3 (GetComputerNameW, "
f"CryptAcquireContextW, CryptGenRandom), "
f"got {bc['hwid']['count']}"
)
# The activation strings dump.
assert bc["activation"]["count"] >= 50, (
f"activation.count expected >= 50, got {bc['activation']['count']}"
)
def test_extract_strings_wrapper_preserves_v24_shape() -> None:
"""The legacy ``extract_strings_for_binary`` must still return
``{ascii, utf16le, totals, truncated}``."""
parsers = _try_load_re_lief()
if parsers is None:
pytest.skip("re_lief not built")
if not TARGET_ACTIVATION64.exists():
pytest.skip(f"sample not present: {TARGET_ACTIVATION64}")
result = parsers.extract_strings_for_binary(str(TARGET_ACTIVATION64))
assert set(result.keys()) >= {"ascii", "utf16le", "totals", "truncated"}, (
f"legacy shape mismatch: {set(result.keys())}"
)
assert isinstance(result["ascii"], list)
assert isinstance(result["utf16le"], list)
assert isinstance(result["totals"], dict)
assert isinstance(result["truncated"], bool)
# Each string entry should still have a section tag.
if result["ascii"]:
assert "section" in result["ascii"][0]
def test_seed_from_inheritance_works() -> None:
"""The ``seed_from`` / ``seed_field`` pointer must resolve
keywords from the existing catalog sections."""
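# Soft-skip like the other tests; this also puts re_lief on sys.path.
if _try_load_re_lief() is None:
pytest.skip("re_lief not built")
from re_lief.categorizers import load_categories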
cats = load_categories()
# The anti_debug category must inherit from anti_debug_indicators.checks.
# The hwid category must inherit from hwid_apis.high_signal.
# Both must produce a non-empty keyword list.
assert len(cats.get("anti_debug", [])) >= 5, (
f"anti_debug should inherit >= 5 keywords from "
f"anti_debug_indicators.checks, got {len(cats.get('anti_debug', []))}"
)
assert len(cats.get("hwid", [])) >= 5, (
f"hwid should inherit >= 5 keywords from "
f"hwid_apis.high_signal, got {len(cats.get('hwid', []))}"
)
# The inherited anti_debug set must include the canonical primitives.
expected_anti_debug = {
"IsDebuggerPresent",
"OutputDebugString",
"NtQueryInformationProcess",
}
found = expected_anti_debug & set(cats["anti_debug"])
assert len(found) >= 2, (
f"expected >= 2 of {expected_anti_debug} in anti_debug keywords, "
f"got {sorted(found)}"
)
def test_categorize_strings_on_gameassembly_full_categories() -> None:
"""On the 500+ MB GameAssembly, all-section walk must surface the
encrypted-VM HWID import set. Note: the import names live
*inside* the encrypted-VM sections (``.idata``, ``.xdata``), so
skipping those sections (covered in the next test) will
blind the categorizer to the imports.
"""
parsers = _try_load_re_lief()
if parsers is None:
pytest.skip("re_lief not built")
if not TARGET_GAME_ASSEMBLY.exists():
pytest.skip(f"GameAssembly.dll not present: {TARGET_GAME_ASSEMBLY}")
result = parsers.categorize_strings(
str(TARGET_GAME_ASSEMBLY),
max_per_category=2000,
)
# The encrypted-VM bytecode interpreter is a W^X + large-binary
# pattern. The catalog's HWID-vector API set should fire.
assert result["by_category"]["hwid"]["count"] >= 5, (
f"hwid.count expected >= 5 on GameAssembly (catalog's HWID set), "
f"got {result['by_category']['hwid']['count']}"
)
# The anti-debug primitive (IsDebuggerPresent) is imported.
assert result["by_category"]["anti_debug"]["count"] >= 1, (
f"anti_debug.count expected >= 1 on GameAssembly, "
f"got {result['by_category']['anti_debug']['count']}"
)
# The deduplicated string count must exceed 10,000: this is a
# 530 MB binary with a deep import set and a large .rdata.
assert result["totals"]["deduplicated"] > 10_000, (
f"expected > 10000 deduplicated strings on GameAssembly, "
f"got {result['totals']['deduplicated']}"
)
def test_categorize_strings_on_gameassembly_skip_sections() -> None:
"""``skip_sections`` must return a valid shape without crashing.
Note: the encrypted-VM bytecode sections on this binary (``.idata``,
``.xdata``, etc.) actually *contain* the import-table strings, so
skipping them naturally reduces visibility. This test only
verifies the *mechanism* (no crash, valid shape) — the smoke-test
on what those sections contain is the previous test.
"""
parsers = _try_load_re_lief()
if parsers is None:
pytest.skip("re_lief not built")
if not TARGET_GAME_ASSEMBLY.exists():
pytest.skip(f"GameAssembly.dll not present: {TARGET_GAME_ASSEMBLY}")
skip = [
".idata", ".xtls", ".xpdata", ".udata",
".xdata", ".didata", ".ecode", ".00cfg",
]
result = parsers.categorize_strings(
str(TARGET_GAME_ASSEMBLY),
skip_sections=skip,
)
# Schema is still correct.
assert "by_category" in result
assert "totals" in result
assert "truncated" in result
# Each section in the skip list was actually skipped — verify
# by checking no sample comes from one of the skipped sections.
skipped = set(skip)
for cat, info in result["by_category"].items():
for s in info["samples"]:
assert s.get("section") not in skipped, (
f"section {s.get('section')} should have been skipped "
f"but appears in {cat} samples"
)
+1
@@ -85,6 +85,7 @@ def test_re_llm_decompile_imports(servers_root: Path) -> None:
"list_oat_art",
"disasm_capstone",
"extract_strings",
"categorize_strings",
"normalize_for_diff",
],
),