feat(skill): re-encrypted-vm-tamper + re-archive-author

re-encrypted-vm-tamper: unified encrypted-VM bytecode detection
+ family identification + lazy-decrypt-stub characterization.
The plan §"Cycle 4" keeps the static half in this skill; the
dynamic half (Wine + re-winedbg) lives in re-vm-reverse.

Workflow: get_sections -> match against drm-indicators.yaml ::
section_indicators.rules -> identify the family -> look for the
lazy-decrypt stub -> disassemble the dispatcher.

re-archive-author: guided Kaitai authoring for proprietary
archive formats. 3-5 iteration workflow:
  1. Magic-byte analysis
  2. Hex walk (entropy + length-field candidates)
  3. Draft v0.1 .ksy
  4. compile + parse + diff
  5. Finalize + commit to data/ksy/

The standalone re-archive-author MCP server was deferred in
favor of this pure-skill helper — lower effort, equal capability
for the 3-5 iteration workflow. re-kaitai.compile_format +
re-kaitai.parse_with_format do the heavy lifting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
John Smith
2026-06-05 23:26:55 -04:00
parent ce4eef6640
commit ac14c89717
2 changed files with 417 additions and 0 deletions
+204
@@ -0,0 +1,204 @@
---
name: re-archive-author
description: Guided Kaitai authoring for proprietary archive formats (3-5 iterations with auto-suggestions). Use when the user says "author a KSY for this archive", "what format is this binary", "I have a custom .paz / .pak / .dat file — reverse the format", or hands you a binary whose first 16-64 bytes don't match any known format. Pairs with re-kaitai.compile_format + re-kaitai.parse_with_format. Iteratively produces a working .ksy in data/ksy/.
---
# Proprietary Archive Format Authoring
## When to use
Use this skill when the analyst encounters a binary whose first
16-64 bytes don't match any known format (PE, ELF, MachO, ZIP,
PNG, UnityFS, etc.) and the user wants to reverse the format.
The output is a working `.ksy` file in `data/ksy/` that
`re-kaitai.compile_format` + `re-kaitai.parse_with_format` can
parse end-to-end.
**What this skill returns**:
1. **Magic-byte analysis** — comparison of the first 16-64 bytes
against the known-magic catalog
2. **Hex walk** — entropy map, length-field candidates, record
boundaries
3. **Draft `.ksy`** — version 0.1, with placeholder field names
4. **Parse iteration** — compile + parse + diff
5. **Final `.ksy`** — the committed format spec, with
`vendor-neutral: true` metadata
## What this skill does NOT do
- **Does not produce a runtime trace.** Archive format
reverse-engineering is a static problem; you only need
to read the bytes.
- **Does not produce a loader.** The KSY is a *spec*, not a
loader. To read a sample file with a working loader, use
`re-kaitai.parse_with_format` after the KSY is committed.
- **Does not produce a writer.** KSY is a read-side spec; the
write-side is a separate concern.
## Workflow
The workflow is 3-5 iterations. Iterations 1 and 2 are read-only
analysis; from iteration 3 onward each pass ends with a compiled +
parsed .ksy, and the analyst adjusts the spec based on what the
parser found (and what it didn't).
**Iteration 1 — Magic-byte analysis**
```
xxd <file> | head -8
```
Compare the first 16 bytes against the known-magic catalog:
PE (`MZ`), ELF (`\x7fELF`), MachO (`\xcf\xfa\xed\xfe` or
`\xfe\xed\xfa\xce`), ZIP (`PK\x03\x04`), PNG
(`\x89PNG\r\n\x1a\n`), UnityFS (`UnityFS\x00`),
Unity-raw (`UnityRaw\x00`), 7z, RAR, etc.
If the magic matches a known format, escalate to that
format's skill. If not, the file is this skill's territory.
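The catalog comparison can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `KNOWN_MAGICS`, `identify_magic`, and `read_head` are hypothetical names, and the table only contains the signatures spelled out above (7z, RAR, and the rest would need to be added).

```python
# Known-magic catalog taken from the list above; extend as needed.
KNOWN_MAGICS = {
    "PE": b"MZ",
    "ELF": b"\x7fELF",
    "MachO (64-bit LE)": b"\xcf\xfa\xed\xfe",
    "MachO (32-bit BE)": b"\xfe\xed\xfa\xce",
    "ZIP": b"PK\x03\x04",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "UnityFS": b"UnityFS\x00",
    "UnityRaw": b"UnityRaw\x00",
}


def identify_magic(head: bytes):
    """Return the name of the first catalog entry that prefixes `head`, else None."""
    for name, magic in KNOWN_MAGICS.items():
        if head.startswith(magic):
            return name
    return None


def read_head(path: str, length: int = 64) -> bytes:
    """Read the first `length` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(length)
```

A `None` result from `identify_magic(read_head(path))` is the signal to continue with iteration 2.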
**Iteration 2 — Hex walk**
```
re-kaitai.walk_header(path, length=256)
```
(or use `xxd` / `hexdump` directly)
Identify:
- **Magic** (4-16 bytes at offset 0)
- **Version field** (4 bytes at offset N) — usually a uint32 LE
- **Length field** (4-8 bytes at offset M) — usually a uint32 LE
- **Count field** (4 bytes at offset K) — usually a uint32 LE
- **First record** (starts after the header)
The header is usually 16-64 bytes. The first record's start
gives you a candidate "header size" constant.
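One way to shortlist length, count, and header-size candidates is to read the header as aligned little-endian 32-bit words and keep the values that could plausibly index into the file. The sketch below assumes 4-byte alignment and little-endian fields, which is common but not guaranteed; `u32le_candidates` is a hypothetical helper, not part of `re-kaitai`.

```python
import struct


def u32le_candidates(header: bytes, file_size: int):
    """List (offset, value) for each aligned u32 LE in `header` that is
    nonzero and no larger than the file, i.e. plausible as a length,
    count, or offset field."""
    out = []
    for off in range(0, len(header) - 3, 4):
        (value,) = struct.unpack_from("<I", header, off)
        if 0 < value <= file_size:
            out.append((off, value))
    return out
```

ASCII magics usually decode to very large integers and drop out on their own; zero-filled reserved words are skipped by the `0 <` bound.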
**Iteration 3 — Draft the .ksy**
Create `data/ksy/<format>.ksy` with:
```yaml
meta:
  id: my_format          # ids must be lowercase with underscores
  title: My Format
  license: MIT
  vendor-neutral: true
seq:
  - id: magic
    contents: [4 bytes]  # placeholder
  - id: version
    type: u4le
  - id: header_size
    type: u4le
  - id: record_count
    type: u4le
  - id: records
    type: record
    repeat: expr
    repeat-expr: record_count
types:
  record:
    seq:
      - id: index
        type: u4le
      - id: data
        size: ???        # placeholder; refine in iteration 4
```
**Iteration 4 — Compile + parse + diff**
```bash
kaitai-struct-compiler --target python --outdir data/ksy/_compiled data/ksy/<format>.ksy
```
Then:
```python
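import sys
sys.path.insert(0, "data/ksy/_compiled")  # where the compiler wrote my_format.py
from my_format import MyFormat  # module and class names derive from meta.id

with open(path, "rb") as f:
    parsed = MyFormat.from_bytes(f.read())
print(parsed.version, parsed.record_count, len(parsed.records))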
```
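Look at what the parser found and what it didn't. Refine
the placeholder fields: `size: ???` becomes `size: 64` if every
record is a fixed 64 bytes, or `size: <length field>` if each
record carries its own length. Use `size-eos: true` only for a
field that runs to the end of the stream; inside a repeated
record it would make the first record swallow the rest of the file.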
**Iteration 5 — Finalize + commit**
Once the parser handles every record, add a top-level `doc` key
with the field meanings, run the leakage test, and commit:
```yaml
meta:
  id: my_format
  title: My Format
  license: MIT
  vendor-neutral: true
  endian: le
doc: |
  Format reverse-engineered from <source>. Field meanings:
  - magic: 4-byte ASCII tag
  - version: file format version (currently 1)
  - header_size: total header size in bytes (currently 32)
  - record_count: number of records that follow
  - records: array of <record_count> record entries
```
Run `./verify.sh` to confirm the leakage test passes (the KSY
must not name any specific commercial product).
## Output report format
```markdown
# Archive Format Reverse — <format>
## Magic analysis
- First 16 bytes: ...
- Closest known format: ...
- No known match — proprietary format.
## Hex walk
- Offset 0x00: magic (4 bytes, ASCII "....")
- Offset 0x04: version (u4le, observed: 1)
- Offset 0x08: header_size (u4le, observed: 32)
- Offset 0x0C: record_count (u4le, observed: 42)
- Offset 0x10..0x20: reserved
- Offset 0x20: first record (record[0])
## Draft KSY
- Path: data/ksy/my_format.ksy
- Status: parses 42/42 records cleanly
## Open questions
- record[0].data ends with what looks like a length prefix —
variable-length record, or fixed-length with a different
size?
- The last record's data extends to EOF — is record[41] a
terminator?
## Limitations
- The KSY is read-only. A write-side spec is a separate
concern.
- Strings inside records are not auto-decoded; the analyst
may need to add `type: str` or `type: strz` after a
parse-and-look pass.
```
## Pairing with other skills
- `re-static-triage` — for the first-pass "what format is this"
call. If `re-lief.parse_binary` returns a known format, this
skill's territory is over before it starts.
- `re-format-decode` — for the read-side runtime after the KSY
is committed. `re-kaitai.parse_with_format` calls the compiled
KSY and returns the parsed tree.
- `re-leak-scan` — for the string-side analysis of the
archive's content (after the KSY lets you extract the
per-record data).
- `re-decompile` — for the binary-side analysis of the
loader that *reads* this format. The KSY is the read-side;
the loader is the consumer.
+213
@@ -0,0 +1,213 @@
---
name: re-encrypted-vm-tamper
description: Unified encrypted-VM bytecode detection + family identification + lazy-decrypt-stub characterization. Use when the user says "is this binary protected by an encrypted VM", "what family of bytecode protection is this", "characterize the encrypted-VM handler", "where's the lazy-decrypt stub", or hands you a binary whose section table has unusual names (.vmp0, .xtls, .arch, .themida, .ecode, etc.). Calls re-lief.get_sections + re-rizin.disassemble_function + re-llm-decompile.decompile_function and produces a per-family characterization. Pairs with re-vm-reverse (which adds the dynamic Wine-trace half) and re-drm-fingerprint (which adds the broader catalog score).
---
# Encrypted-VM Bytecode Tampering Analysis
## When to use
Use this skill when the analyst's first read of a binary's section
table shows encrypted-VM-bytecode indicators: unusual section names
(`.vmp0`, `.vmp1`, `.xtls`, `.didata`, `.ecode`, `.xdata`,
`.xpdata`, `.udata`, `.00cfg`, `.arch`, `.link`, `.xcode`,
`.xtext`, `.sbss`), sections that are both writable and executable
(a W^X violation), or a `.rodata` that is suspiciously large and
high-entropy.
The user gives you a binary path (or a section name from a prior
analysis) and asks for "what kind of encrypted-VM bytecode is
this" or "where does it decrypt itself at runtime". The output is
a per-family characterization — no specific commercial product
named.
**What this skill returns** (a Markdown report):
1. **Header** — file path, section count, suspected family
2. **Section table** — every section, its permissions, size,
entropy, and the family it suggests
3. **Family identification** — the closest matching entry from
`data/drm-indicators.yaml::pattern_indicators.mappings`
4. **Lazy-decrypt-stub detection** — whether the binary has a
1-bit done-flag + page-walk pattern at startup
5. **Disassembly excerpts** — the entry of the dispatcher +
the first 3 handler entries
6. **Limitations** — what this skill did NOT recover (handler
semantic inference, dynamic trace)
## What this skill does NOT do
- **Does not produce a runtime trace.** The encrypted-VM bytecode
body is decrypted on first use; for a runtime trace, escalate
to `re-vm-reverse` (which uses Wine + `re-winedbg`).
- **Does not name a commercial product.** Family identification
is descriptive (the "encrypted-VM bytecode, IL2CPP target"
category, the "encrypted-VM bytecode, proprietary-engine target"
category, etc.) — not a vendor attribution.
- **Does not crack the bytecode.** The handlers are reported as
raw disassembly; the analyst's job is to map them to
virtual-instruction semantics over time.
## Workflow
**Step 1 — Section table + entropy**
```
re-lief.get_sections(path)
```
For each section, note:
- **Name** — match against the `section_indicators.rules` in
`data/drm-indicators.yaml` (the rules cover all known families).
- **Permissions** — a section that is both writable and executable
(W+X, which violates W^X) is a strong signal that the section is
decrypted in place at runtime.
- **Entropy** — high-entropy (>7.5) read-only sections are
likely the encrypted body; low-entropy `.text` is the real
native code.
- **Size ratio** — if `.rodata` is 100x larger than `.text`,
the encrypted body is in `.rodata`.
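The entropy figure used here is Shannon entropy in bits per byte, so it ranges from 0.0 (constant data) to 8.0 (uniformly random bytes). If the section tool does not report it, a minimal sketch like the following can compute it from the raw section bytes; `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())
```

Compressed data also scores close to 8.0, so a high value alone does not distinguish encryption from compression; combine it with the name and permission signals above.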
**Step 2 — Family identification**
Cross-reference the section table against
`data/drm-indicators.yaml::pattern_indicators.mappings`:
- `.xtls / .didata / .ecode / .xdata / .xpdata / .udata / .00cfg`
→ encrypted-VM bytecode, Unity IL2CPP target
- `.arch / .link / .xcode / .xtext / .sbss` (with `.rodata` as
encrypted body) → encrypted-VM bytecode, proprietary-engine target
- `.vmp0 / .vmp1` (with `VMP` handler prefixes) → encrypted-VM
bytecode (alternative dispatcher variant)
- `.themida / .winlice` → encrypted-VM bytecode (WinLicense-family)
- `.code` mapped writable + executable (W+X) → encrypted-VM bytecode (CISC-dispatch variant)
The `confidence` field in each mapping is the heuristic strength,
not a guarantee. For low-confidence matches, confirm with a
deeper pass.
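The cross-reference can be sketched as a set intersection per family. The table below is an illustrative, partial copy of the name-only mappings listed above, hard-coded because the schema of `data/drm-indicators.yaml` is not shown here; `FAMILY_SECTIONS` and `rank_families` are hypothetical names, and the hit count is a stand-in for, not a replacement of, the `confidence` field.

```python
# Illustrative copy of some mappings listed above; the real source of truth
# is data/drm-indicators.yaml, whose schema is not reproduced here.
FAMILY_SECTIONS = {
    "encrypted-VM bytecode, Unity IL2CPP target":
        {".xtls", ".didata", ".ecode", ".xdata", ".xpdata", ".udata", ".00cfg"},
    "encrypted-VM bytecode, proprietary-engine target":
        {".arch", ".link", ".xcode", ".xtext", ".sbss"},
    "encrypted-VM bytecode (alternative dispatcher variant)":
        {".vmp0", ".vmp1"},
}


def rank_families(section_names):
    """Return (family, hit_count) pairs, best match first, dropping zero-hit families."""
    present = set(section_names)
    hits = [(family, len(names & present)) for family, names in FAMILY_SECTIONS.items()]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: h[1], reverse=True)
```

Families with a single hit are the low-confidence matches that deserve the deeper pass.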
**Step 3 — Lazy-decrypt-stub detection**
The "lazy decrypt stub" is a startup-time routine that decrypts
one page of the encrypted body and sets a "done" flag. The
canonical pattern is:
```
movzx eax, byte ptr [done_flag] ; read the done flag (stored as a byte)
test al, al ; already decrypted?
jnz skip_decrypt
; (decrypt one page here)
mov byte ptr [done_flag], 1 ; mark decrypted
skip_decrypt:
; (continue with the real entry point)
```
To detect: disassemble the entry function and look for a
test-and-set on a global byte (a read, then a later store of 1)
combined with a conditional jump over the decrypt block. The
decrypt block itself is small, covers one 4 KB page, and ends
with a memory barrier or serializing instruction.
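As a rough first filter, the pattern can be searched for in textual disassembly. The sketch below is a heuristic under several assumptions: the input is a list of Intel-syntax lines (as a disassembler might print them), the flag is addressed by the same operand text in both the read and the store, and the store writes the literal 1. `find_done_flag` is a hypothetical helper; a hit is a lead to inspect by hand, not a confirmation.

```python
import re

MEM = re.compile(r"\[([^\]]+)\]")  # memory operand, e.g. [done_flag]
COND_JUMP = re.compile(r"^\s*j(n?z|n?e)\b")
STORE_ONE = re.compile(r"^\s*mov\s+(?:byte ptr\s+)?\[([^\]]+)\]\s*,\s*(?:0x)?1\b")


def find_done_flag(lines, window=3):
    """Return the memory operand that is read shortly before a conditional
    jump and later set to 1, or None if no such pattern is present."""
    stripped = [ln.split(";")[0] for ln in lines]  # drop trailing comments
    for i, line in enumerate(stripped):
        if not COND_JUMP.match(line):
            continue
        # Operands referenced in the few instructions before the jump.
        read = set()
        for prev in stripped[max(0, i - window):i]:
            read.update(MEM.findall(prev))
        # A later store of 1 to one of those operands completes the pattern.
        for later in stripped[i + 1:]:
            m = STORE_ONE.match(later)
            if m and m.group(1) in read:
                return m.group(1)
    return None
```

Real stubs often address the flag RIP-relative or through a register, so the operand-text comparison will miss some cases; widening `window` or normalizing addresses first are the obvious refinements.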
**Step 4 — Disassemble the dispatcher**
The dispatcher is the function that, on every VM-step, reads a
byte from the bytecode stream and jumps to the corresponding
handler. Look for an indirect-jump with a register index:
```
jmp [reg + rax*8] ; or similar — the handler table lookup
```
Disassemble the dispatcher and the first 3 handler entries:
```
re-rizin.disassemble_function(path, function="<dispatcher_addr>")
```
The handler bodies are the encrypted-VM bytecode's virtual
instructions. They are typically small (10-30 native instructions
each) and use only a small register set (rax, rcx, rdx, rsi, rdi).
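Once the handler table address is known, the first handler entries can be read straight out of the table bytes. The sketch below assumes the table holds absolute 64-bit little-endian pointers, which matches the `rax*8` scale in the example above; tables of 32-bit relative offsets are also common and would need a different format string and a base address. `read_handler_table` is a hypothetical helper.

```python
import struct


def read_handler_table(table: bytes, count: int = 3):
    """Decode the first `count` 8-byte little-endian handler addresses."""
    return list(struct.unpack_from(f"<{count}Q", table, 0))
```

Each returned address is a candidate argument for the disassembly call shown above.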
**Step 5 — LLM decompile (optional, high-value)**
The handler bodies are often 10-30 instructions of obfuscated
arithmetic. Run them through `re-llm-decompile.decompile_function`
for a higher-level reading. The LLM decompiler tends to give more
readable C-like pseudocode than rizin's `pdc` on short,
arithmetic-heavy sequences.
**Step 6 — Cross-reference the dynamic half**
The lazy-decrypt-stub tells you where the body is decrypted. The
dispatcher tells you where the handlers live. To map the
handlers to virtual-instruction semantics, you need a runtime
trace: escalate to `re-vm-reverse` for the Wine + `re-winedbg`
half.
## Output report format
```markdown
# Encrypted-VM Bytecode Analysis — <path>
## Header
- File: ...
- Section count: N
- Suspected family: "encrypted-VM bytecode, proprietary-engine target"
- Confidence: Medium-High
## Section table (encrypted-VM-relevant only)
| Section | Flags | Size | Entropy | Family signal |
|---|---|---|---|---|
| .text | RX | 1.6 MB | 6.2 | real native code |
| .rodata | R | 300 MB | 7.95 | encrypted body |
| .arch | R | 200 KB | 5.1 | proprietary-engine target |
| .link | R | 80 KB | 4.8 | proprietary-engine target |
| .xcode | RWX | 1 MB | 7.6 | encrypted-VM bytecode body |
| .xtext | RX | 200 KB | 5.5 | proprietary-engine target |
| .sbss | RW | 4 KB | 0.0 | proprietary-engine target |
## Family identification
- Closest match: "encrypted-VM bytecode, proprietary-engine target"
- Confidence: Medium-High
- Other candidates (in order):
- "encrypted-VM bytecode, Unity IL2CPP target" (Low — no
GameAssembly.dll / global-metadata.dat pairing)
- "encrypted-VM bytecode (CISC-dispatch variant)" (Low — no
writable + executable .code)
## Lazy-decrypt stub
- Found: yes, at 0x180001234
- Done-flag: byte at 0x180020000
- Decrypts: one page of .xcode
## Dispatcher disassembly
- Address: 0x180005678
- Handler table: 0x180100000
- First 3 handler entries (raw disassembly):
- handler[0]: 0x180105000
- handler[1]: 0x180105080
- handler[2]: 0x180105100
## Limitations
- The handler bodies are short, arithmetic-heavy sequences.
The LLM decompiler produces a C-like reading but the
underlying virtual-instruction semantics are not yet
recovered.
- The dynamic half (which virtual instruction corresponds to
which handler index) is not in this report. Run
`re-vm-reverse` for the runtime trace.
```
## Pairing with other skills
- `re-drm-fingerprint` — for the broader catalog score. The
fingerprint skill consumes `data/drm-indicators.yaml` and
reports the matches across all families.
- `re-vm-reverse` — for the dynamic Wine + `re-winedbg` half.
The encrypted-VM bytecode body is decrypted on first use;
a runtime trace is the only way to map the handlers to
virtual-instruction semantics.
- `re-mba-deobfuscate` — for the MBA-obfuscated arithmetic
inside individual handlers. `re-triton.solve_constraint` is
the entry point (after the z3.BitVec fix in Cycle 1 / T1.4).
- `re-llm-decompile` — for the higher-level reading of
individual handler bodies.