mirror of
https://github.com/Heretek-AI/RE-AI.git
synced 2026-07-01 01:37:55 -04:00
feat(skill): re-encrypted-vm-tamper + re-archive-author
re-encrypted-vm-tamper: unified encrypted-VM bytecode detection + family identification + lazy-decrypt-stub characterization. The plan §"Cycle 4" keeps the static half in this skill; the dynamic half (Wine + re-winedbg) lives in re-vm-reverse.

Workflow: get_sections -> match against drm-indicators.yaml :: section_indicators.rules -> identify the family -> look for the lazy-decrypt stub -> disassemble the dispatcher.

re-archive-author: guided Kaitai authoring for proprietary archive formats. 3-5 iteration workflow:

1. Magic-byte analysis
2. Hex walk (entropy + length-field candidates)
3. Draft v0.1 .ksy
4. compile + parse + diff
5. Finalize + commit to data/ksy/

The standalone re-archive-author MCP server was deferred in favor of this pure-skill helper (lower effort, equal capability for the 3-5 iteration workflow). re-kaitai.compile_format + re-kaitai.parse_with_format do the heavy lifting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,204 @@
|
||||
---
|
||||
name: re-archive-author
|
||||
description: Guided Kaitai authoring for proprietary archive formats (3-5 iterations with auto-suggestions). Use when the user says "author a KSY for this archive", "what format is this binary", "I have a custom .paz / .pak / .dat file — reverse the format", or hands you a binary whose first 16-64 bytes don't match any known format. Pairs with re-kaitai.compile_format + re-kaitai.parse_with_format. Iteratively produces a working .ksy in data/ksy/.
|
||||
---
|
||||
|
||||
# Proprietary Archive Format Authoring
|
||||
|
||||
## When to use
|
||||
|
||||
Use this skill when the analyst encounters a binary whose first
|
||||
16-64 bytes don't match any known format (PE, ELF, MachO, ZIP,
|
||||
PNG, UnityFS, etc.) and the user wants to reverse the format.
|
||||
The output is a working `.ksy` file in `data/ksy/` that
|
||||
`re-kaitai.compile_format` + `re-kaitai.parse_with_format` can
|
||||
parse end-to-end.
|
||||
|
||||
**What this skill returns**:
|
||||
|
||||
1. **Magic-byte analysis** — comparison of the first 16-64 bytes
|
||||
against the known-magic catalog
|
||||
2. **Hex walk** — entropy map, length-field candidates, record
|
||||
boundaries
|
||||
3. **Draft `.ksy`** — version 0.1, with placeholder field names
|
||||
4. **Parse iteration** — compile + parse + diff
|
||||
5. **Final `.ksy`** — the committed format spec, with
|
||||
`vendor-neutral: true` metadata
|
||||
|
||||
## What this skill does NOT do
|
||||
|
||||
- **Does not produce a runtime trace.** Archive format
|
||||
reverse-engineering is a static problem; you only need
|
||||
to read the bytes.
|
||||
- **Does not produce a loader.** The KSY is a *spec*, not a
|
||||
loader. To read a sample file with a working loader, use
|
||||
`re-kaitai.parse_with_format` after the KSY is committed.
|
||||
- **Does not produce a writer.** KSY is a read-side spec; the
|
||||
write-side is a separate concern.
|
||||
|
||||
## Workflow
|
||||
|
||||
The workflow is 3-5 iterations. Each iteration ends with a
|
||||
compiled + parsed .ksy; the analyst adjusts the spec based on
|
||||
what the parser found (and what it didn't).
|
||||
|
||||
**Iteration 1 — Magic-byte analysis**
|
||||
|
||||
```
xxd <file> | head -8
```
|
||||
|
||||
Compare the first 16 bytes against the known-magic catalog:
|
||||
PE (`MZ`), ELF (`\x7fELF`), MachO (`\xcf\xfa\xed\xfe` or
|
||||
`\xfe\xed\xfa\xce`), ZIP (`PK\x03\x04`), PNG
|
||||
(`\x89PNG\r\n\x1a\n`), UnityFS (`UnityFS\x00`),
|
||||
Unity-raw (`UnityRaw\x00`), 7z, RAR, etc.
|
||||
|
||||
If the magic matches a known format, escalate to that
format's skill. If not, the file is this skill's territory.
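
The catalog comparison can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: the `KNOWN_MAGICS` table only holds the magics listed above, and the helper names `identify_magic` and `read_head` are hypothetical.

```python
# Hypothetical stand-in for the known-magic catalog; entries are the
# byte sequences listed in the text above. Extend as needed.
KNOWN_MAGICS = {
    "PE": b"MZ",
    "ELF": b"\x7fELF",
    "MachO (64-bit, little-endian)": b"\xcf\xfa\xed\xfe",
    "MachO (32-bit, big-endian)": b"\xfe\xed\xfa\xce",
    "ZIP": b"PK\x03\x04",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "UnityFS": b"UnityFS\x00",
    "UnityRaw": b"UnityRaw\x00",
}


def identify_magic(head: bytes):
    """Return the first catalog name whose magic is a prefix of `head`, else None."""
    for name, magic in KNOWN_MAGICS.items():
        if head.startswith(magic):
            return name
    return None


def read_head(path: str, length: int = 64) -> bytes:
    """Read the first `length` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(length)
```

A `None` result from `identify_magic(read_head(path))` means the file falls through to Iteration 2. Short magics such as `MZ` can match by coincidence, so a hit is a prompt to confirm with the format's own skill rather than a verdict.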
|
||||
|
||||
**Iteration 2 — Hex walk**
|
||||
|
||||
```
re-kaitai.walk_header(path, length=256)
```
|
||||
|
||||
(or use `xxd` / `hexdump` directly)
|
||||
|
||||
Identify:
|
||||
- **Magic** (4-16 bytes at offset 0)
- **Version field** (4 bytes at offset N): usually a uint32 LE
- **Length field** (4 or 8 bytes at offset M): usually a uint32 LE, occasionally a uint64 LE
- **Count field** (4 bytes at offset K): usually a uint32 LE
- **First record** (starts after the header)
|
||||
|
||||
The header is usually 16-64 bytes. The first record's start
|
||||
gives you a candidate "header size" constant.
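
Length-field candidates can also be shortlisted mechanically before reading the hex dump by eye. The sketch below is an assumption-laden heuristic, not what `re-kaitai.walk_header` does: it assumes 4-byte-aligned little-endian `u32` fields, and the function name `length_field_candidates` is hypothetical.

```python
import struct


def length_field_candidates(data: bytes, header_len: int = 64):
    """Return (offset, value, reason) for aligned u32le values that look like sizes.

    Heuristic only: assumes little-endian, 4-byte-aligned header fields.
    """
    total = len(data)
    scan_end = min(header_len, total)
    candidates = []
    for off in range(0, scan_end - 3, 4):
        (value,) = struct.unpack_from("<I", data, off)
        if value == 0:
            continue  # zero padding is never a useful size
        if value == total:
            candidates.append((off, value, "equals file size"))
        elif value == total - off - 4:
            candidates.append((off, value, "equals bytes remaining after this field"))
        elif off < value <= scan_end and value % 4 == 0:
            candidates.append((off, value, "plausible header size / first-record offset"))
    return candidates
```

Each hit is only a hypothesis to test in the draft `.ksy`; a value that happens to equal the file size may still be unrelated.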
|
||||
|
||||
**Iteration 3 — Draft the .ksy**
|
||||
|
||||
Create `data/ksy/<format>.ksy` with:
|
||||
|
||||
```yaml
meta:
  id: my_format  # Kaitai ids are lower_snake_case (no hyphens)
  title: My Format
  license: MIT
  vendor-neutral: true
seq:
  - id: magic
    contents: [4 bytes]  # placeholder
  - id: version
    type: u4le
  - id: header_size
    type: u4le
  - id: record_count
    type: u4le
  - id: records
    type: record
    repeat: expr
    repeat-expr: record_count
types:
  record:
    seq:
      - id: index
        type: u4le
      - id: data
        size: ???  # placeholder, refine in iteration 4
```
|
||||
|
||||
**Iteration 4 — Compile + parse + diff**
|
||||
|
||||
```bash
kaitai-struct-compiler --target python --outdir data/ksy/_compiled data/ksy/<format>.ksy
```
|
||||
|
||||
Then:
|
||||
|
||||
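```python
import sys

sys.path.insert(0, "data/ksy/_compiled")  # where the compiler wrote my_format.py
import my_format  # generated module; needs the `kaitaistruct` runtime package

with open(path, "rb") as f:
    parsed = my_format.MyFormat.from_bytes(f.read())
print(parsed)
```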
|
||||
|
||||
Look at what the parser found and what it didn't. Refine
the placeholder fields: `size: ???` becomes `size: 64` if
every record is a fixed 64 bytes, or a `size:` expression that
references a preceding length field if records are
variable-length. Use `size-eos: true` only for a field that
runs to the end of its stream.
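
One quick way to decide between the fixed-size and variable-length readings is to check whether the payload divides evenly by the record count. The helper below is a hypothetical sketch; it assumes the records are the only content after the header, which may not hold if the file has a trailer or padding.

```python
def infer_fixed_record_size(file_size: int, header_size: int, record_count: int):
    """Return the per-record size if the payload divides evenly, else None.

    Assumes nothing but records follows the header (no trailer, no padding).
    """
    if record_count <= 0 or header_size > file_size:
        return None
    payload = file_size - header_size
    size, remainder = divmod(payload, record_count)
    if remainder != 0 or size == 0:
        return None
    return size
```

With the draft layout, the returned value covers the whole record including its 4-byte `index`, so the `data` field would be that value minus 4. A `None` result suggests variable-length records or extra content after the last record.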
|
||||
|
||||
**Iteration 5 — Finalize + commit**
|
||||
|
||||
Once the parser handles every record, add a docstring
|
||||
with the field meanings, run the leakage test, and commit:
|
||||
|
||||
```yaml
meta:
  id: my_format
  title: My Format
  license: MIT
  vendor-neutral: true
  endian: le
doc: |
  Format reverse-engineered from <source>. Field meanings:
  - magic: 4-byte ASCII tag
  - version: file format version (currently 1)
  - header_size: total header size in bytes (currently 32)
  - record_count: number of records that follow
  - records: array of <record_count> record entries
```
|
||||
|
||||
Run `./verify.sh` to confirm the leakage test passes (the KSY
|
||||
must not name any specific commercial product).
|
||||
|
||||
## Output report format
|
||||
|
||||
```markdown
|
||||
# Archive Format Reverse — <format>
|
||||
|
||||
## Magic analysis
|
||||
- First 16 bytes: ...
|
||||
- Closest known format: ...
|
||||
- No known match — proprietary format.
|
||||
|
||||
## Hex walk
|
||||
- Offset 0x00: magic (4 bytes, ASCII "....")
|
||||
- Offset 0x04: version (u4le, observed: 1)
|
||||
- Offset 0x08: header_size (u4le, observed: 32)
|
||||
- Offset 0x0C: record_count (u4le, observed: 42)
|
||||
- Offset 0x10..0x20: reserved
|
||||
- Offset 0x20: first record (record[0])
|
||||
|
||||
## Draft KSY
|
||||
- Path: data/ksy/my_format.ksy
|
||||
- Status: parses 42/42 records cleanly
|
||||
|
||||
## Open questions
|
||||
- record[0].data ends with what looks like a length prefix —
|
||||
variable-length record, or fixed-length with a different
|
||||
size?
|
||||
- The last record's data extends to EOF — is record[41] a
|
||||
terminator?
|
||||
|
||||
## Limitations
|
||||
- The KSY is read-only. A write-side spec is a separate
|
||||
concern.
|
||||
- Strings inside records are not auto-decoded; the analyst
|
||||
may need to add `type: str` or `type: strz` after a
|
||||
parse-and-look pass.
|
||||
```
|
||||
|
||||
## Pairing with other skills
|
||||
|
||||
- `re-static-triage` — for the first-pass "what format is this"
|
||||
call. If `re-lief.parse_binary` returns a known format, this
|
||||
skill's territory is over before it starts.
|
||||
- `re-format-decode` — for the read-side runtime after the KSY
|
||||
is committed. `re-kaitai.parse_with_format` calls the compiled
|
||||
KSY and returns the parsed tree.
|
||||
- `re-leak-scan` — for the string-side analysis of the
|
||||
archive's content (after the KSY lets you extract the
|
||||
per-record data).
|
||||
- `re-decompile` — for the binary-side analysis of the
|
||||
loader that *reads* this format. The KSY is the read-side;
|
||||
the loader is the consumer.
|
||||
@@ -0,0 +1,213 @@
|
||||
---
|
||||
name: re-encrypted-vm-tamper
|
||||
description: Unified encrypted-VM bytecode detection + family identification + lazy-decrypt-stub characterization. Use when the user says "is this binary protected by an encrypted VM", "what family of bytecode protection is this", "characterize the encrypted-VM handler", "where's the lazy-decrypt stub", or hands you a binary whose section table has unusual names (.vmp0, .xtls, .arch, .themida, .ecode, etc.). Calls re-lief.get_sections + re-rizin.disassemble_function + re-llm-decompile.decompile_function and produces a per-family characterization. Pairs with re-vm-reverse (which adds the dynamic Wine-trace half) and re-drm-fingerprint (which adds the broader catalog score).
|
||||
---
|
||||
|
||||
# Encrypted-VM Bytecode Tampering Analysis
|
||||
|
||||
## When to use
|
||||
|
||||
Use this skill when the analyst's first read of a binary's section
|
||||
table shows encrypted-VM-bytecode indicators: unusual section names
|
||||
(`.vmp0`, `.vmp1`, `.xtls`, `.didata`, `.ecode`, `.xdata`,
|
||||
`.xpdata`, `.udata`, `.00cfg`, `.arch`, `.link`, `.xcode`,
|
||||
`.xtext`, `.sbss`), sections that are both writable and
executable (a W^X violation), or
a `.rodata` that is suspiciously large and high-entropy.
|
||||
|
||||
The user gives you a binary path (or a section name from a prior
|
||||
analysis) and asks for "what kind of encrypted-VM bytecode is
|
||||
this" or "where does it decrypt itself at runtime". The output is
|
||||
a per-family characterization — no specific commercial product
|
||||
named.
|
||||
|
||||
**What this skill returns** (a Markdown report):
|
||||
|
||||
1. **Header** — file path, section count, suspected family
|
||||
2. **Section table** — every section, its permissions, size,
|
||||
entropy, and the family it suggests
|
||||
3. **Family identification** — the closest matching entry from
|
||||
`data/drm-indicators.yaml::pattern_indicators.mappings`
|
||||
4. **Lazy-decrypt-stub detection** — whether the binary has a
|
||||
1-bit done-flag + page-walk pattern at startup
|
||||
5. **Disassembly excerpts** — the entry of the dispatcher +
|
||||
the first 3 handler entries
|
||||
6. **Limitations** — what this skill did NOT recover (handler
|
||||
semantic inference, dynamic trace)
|
||||
|
||||
## What this skill does NOT do
|
||||
|
||||
- **Does not produce a runtime trace.** The encrypted-VM bytecode
|
||||
body is decrypted on first use; for a runtime trace, escalate
|
||||
to `re-vm-reverse` (which uses Wine + `re-winedbg`).
|
||||
- **Does not name a commercial product.** Family identification
|
||||
is descriptive (the "encrypted-VM bytecode, IL2CPP target"
|
||||
category, the "encrypted-VM bytecode, proprietary-engine target"
|
||||
category, etc.) — not a vendor attribution.
|
||||
- **Does not crack the bytecode.** The handlers are reported as
|
||||
raw disassembly; the analyst's job is to map them to
|
||||
virtual-instruction semantics over time.
|
||||
|
||||
## Workflow
|
||||
|
||||
**Step 1 — Section table + entropy**
|
||||
|
||||
```
re-lief.get_sections(path)
```
|
||||
|
||||
For each section, note:
|
||||
- **Name**: match against the `section_indicators.rules` in
  `data/drm-indicators.yaml` (the rules cover every family in the catalog).
- **Permissions**: a section that is both writable and executable
  (W + X, a W^X violation) is a strong signal that the
  section is decrypted in place at runtime.
- **Entropy**: high-entropy (>7.5 bits per byte) read-only sections are
  likely the encrypted body; low-entropy `.text` is the real
  native code.
- **Size ratio**: if `.rodata` is 100x larger than `.text`,
  the encrypted body is most likely in `.rodata`.
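
The permission and entropy checks above can be expressed as a short sketch. It operates on raw section bytes and a flag string such as `"RWX"`; that input shape and the helper names are assumptions for illustration, not the return format of `re-lief.get_sections`.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def section_signals(flags: str, data: bytes, threshold: float = 7.5):
    """Return the encrypted-VM signals a section raises.

    `flags` is a permission string such as "R", "RX" or "RWX" (assumed shape).
    """
    signals = []
    if "W" in flags and "X" in flags:
        signals.append("writable+executable")
    if "W" not in flags and shannon_entropy(data) > threshold:
        signals.append("high-entropy read-only")
    return signals
```

The 7.5 threshold mirrors the figure in the list above. Compressed data also scores high, so entropy alone does not distinguish encryption from compression.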
|
||||
|
||||
**Step 2 — Family identification**
|
||||
|
||||
Cross-reference the section table against
|
||||
`data/drm-indicators.yaml::pattern_indicators.mappings`:
|
||||
|
||||
- `.xtls / .didata / .ecode / .xdata / .xpdata / .udata / .00cfg`
|
||||
→ encrypted-VM bytecode, Unity IL2CPP target
|
||||
- `.arch / .link / .xcode / .xtext / .sbss` (with `.rodata` as
|
||||
encrypted body) → encrypted-VM bytecode, proprietary-engine target
|
||||
- `.vmp0 / .vmp1` (with `VMP` handler prefixes) → encrypted-VM
|
||||
bytecode (alternative dispatcher variant)
|
||||
- `.themida / .winlice` → encrypted-VM bytecode (WinLicense-family)
|
||||
- `.code` with W + X permissions → encrypted-VM bytecode (CISC-dispatch variant)
|
||||
|
||||
The `confidence` field in each mapping is the heuristic strength,
|
||||
not a guarantee. For low-confidence matches, confirm with a
|
||||
deeper pass.
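
As an illustration of the cross-reference, the sketch below ranks families by the fraction of their indicator sections present in the binary. The `FAMILY_SECTIONS` table is a hard-coded stand-in built from three of the mappings listed above; the real schema of `data/drm-indicators.yaml` is not shown here, and the score is a hypothetical measure, not the catalog's `confidence` field.

```python
# Hypothetical stand-in for part of the catalog, using section names from the list above.
FAMILY_SECTIONS = {
    "encrypted-VM bytecode, Unity IL2CPP target": {
        ".xtls", ".didata", ".ecode", ".xdata", ".xpdata", ".udata", ".00cfg",
    },
    "encrypted-VM bytecode, proprietary-engine target": {
        ".arch", ".link", ".xcode", ".xtext", ".sbss",
    },
    "encrypted-VM bytecode (alternative dispatcher variant)": {".vmp0", ".vmp1"},
}


def rank_families(section_names):
    """Return (family, fraction_of_indicators_present, matched_names), best match first."""
    names = set(section_names)
    ranked = []
    for family, indicators in FAMILY_SECTIONS.items():
        hits = names & indicators
        if hits:
            ranked.append((family, len(hits) / len(indicators), sorted(hits)))
    return sorted(ranked, key=lambda entry: entry[1], reverse=True)
```

An empty result means no section-name indicator fired; permissions and entropy from Step 1 may still point to an encrypted body.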
|
||||
|
||||
**Step 3 — Lazy-decrypt-stub detection**
|
||||
|
||||
The "lazy decrypt stub" is a startup-time routine that decrypts
|
||||
one page of the encrypted body and sets a "done" flag. The
|
||||
canonical pattern is:
|
||||
|
||||
```
    movzx eax, byte [done_flag]   ; read the done flag (bit 0)
    test  al, 1                   ; already decrypted?
    jnz   skip_decrypt
    ; (decrypt one page here)
    mov   byte [done_flag], 1     ; mark decrypted
skip_decrypt:
    ; (continue with the real entry point)
```
|
||||
|
||||
To detect: disassemble the entry function and look for a
read of a global flag byte, a conditional jump over
the decrypt block, and a later write that sets the flag.
The decrypt block's code is small, processes one page (4 KB)
per call, and often ends with a memory barrier or
serializing instruction.
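
The detection pattern can be approximated over a linear instruction listing. This is a rough sketch under stated assumptions: instructions arrive as `(mnemonic, operands)` tuples in Intel syntax (not the actual output shape of `re-rizin.disassemble_function`), and the helper name `find_done_flag` is hypothetical.

```python
import re

MEM = re.compile(r"\[([^\]]+)\]")                  # memory operand, e.g. [done_flag]
IMM = re.compile(r"(0x[0-9a-fA-F]+|\d+)")          # decimal or hex immediate
COND_JUMPS = {"jnz", "jne", "jz", "je"}


def find_done_flag(instructions, window=4):
    """Return the memory operand used as a done-flag, or None.

    Looks for: a read of [X], a conditional jump within `window`
    instructions, and a later `mov [X], <immediate>`.
    """
    for i, (mnem, ops) in enumerate(instructions):
        match = MEM.search(ops)
        if match is None or mnem not in {"mov", "movzx", "cmp", "test"}:
            continue
        dest = ops.split(",", 1)[0]
        if mnem in {"mov", "movzx"} and "[" in dest:
            continue  # a store, not the initial flag read
        flag = match.group(1).strip()
        following = instructions[i + 1 : i + 1 + window]
        if not any(m in COND_JUMPS for m, _ in following):
            continue
        for later_mnem, later_ops in instructions[i + 1 :]:
            later_dest, _, later_src = later_ops.partition(",")
            dest_mem = MEM.search(later_dest)
            if (
                later_mnem == "mov"
                and dest_mem is not None
                and dest_mem.group(1).strip() == flag
                and IMM.fullmatch(later_src.strip())
            ):
                return flag
    return None
```

A match is only a lead: ordinary one-time initialization guards have the same shape, so confirm that the guarded block actually writes into the suspected encrypted section.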
|
||||
|
||||
**Step 4 — Disassemble the dispatcher**
|
||||
|
||||
The dispatcher is the function that, on every VM-step, reads a
|
||||
byte from the bytecode stream and jumps to the corresponding
|
||||
handler. Look for an indirect-jump with a register index:
|
||||
|
||||
```
jmp [reg + rax*8] ; or similar: the handler table lookup
```
|
||||
|
||||
Disassemble the dispatcher and the first 3 handler entries:
|
||||
|
||||
```
re-rizin.disassemble_function(path, function="<dispatcher_addr>")
```
|
||||
|
||||
The handler bodies implement the VM's virtual
instructions. They are typically small (10-30 native instructions
each) and use only a small register set (rax, rcx, rdx, rsi, rdi).
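
Candidate dispatch sites can be shortlisted by scanning disassembly text for a table jump with a scaled index. The sketch assumes Intel-syntax lines with any address prefix already stripped; the regex and the name `find_dispatch_jumps` are illustrative and cover only the `base + index*scale` form.

```python
import re

# Matches e.g. "jmp [rbx + rax*8]" or "jmp qword ptr [rbx+rax*8]".
INDIRECT_TABLE_JMP = re.compile(
    r"^jmp\s+(?:qword\s+)?(?:ptr\s+)?"
    r"\[\s*(?P<base>\w+)\s*\+\s*(?P<index>\w+)\s*\*\s*(?P<scale>[48])\s*\]",
    re.IGNORECASE,
)


def find_dispatch_jumps(disasm_lines):
    """Return (line_index, base, index_register, scale) for each table-style indirect jmp."""
    hits = []
    for lineno, line in enumerate(disasm_lines):
        match = INDIRECT_TABLE_JMP.match(line.strip())
        if match:
            hits.append(
                (lineno, match.group("base"), match.group("index"), int(match.group("scale")))
            )
    return hits
```

Compiler-generated `switch` jump tables produce the same instruction shape, so a hit needs the surrounding bytecode fetch (a load from the bytecode stream feeding the index register) before it counts as the dispatcher.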
|
||||
|
||||
**Step 5 — LLM decompile (optional, high-value)**
|
||||
|
||||
The handler bodies are often 10-30 instructions of obfuscated
arithmetic. Run them through `re-llm-decompile.decompile_function`
for a higher-level reading. On short, arithmetic-heavy
sequences the LLM decompiler tends to produce more readable
C-like pseudocode than rizin's `pdc`.
|
||||
|
||||
**Step 6 — Cross-reference the dynamic half**
|
||||
|
||||
The lazy-decrypt-stub tells you where the body is decrypted. The
|
||||
dispatcher tells you where the handlers live. To map the
|
||||
handlers to virtual-instruction semantics, you need a runtime
|
||||
trace: escalate to `re-vm-reverse` for the Wine + `re-winedbg`
|
||||
half.
|
||||
|
||||
## Output report format
|
||||
|
||||
```markdown
|
||||
# Encrypted-VM Bytecode Analysis — <path>
|
||||
|
||||
## Header
|
||||
- File: ...
|
||||
- Section count: N
|
||||
- Suspected family: "encrypted-VM bytecode, proprietary-engine target"
|
||||
- Confidence: Medium-High
|
||||
|
||||
## Section table (encrypted-VM-relevant only)
|
||||
|
||||
| Section | Flags | Size | Entropy | Family signal |
|---|---|---|---|---|
| .text | RX | 1.6 MB | 6.2 | real native code |
| .rodata | R | 300 MB | 7.95 | encrypted body |
| .arch | R | 200 KB | 5.1 | proprietary-engine target |
| .link | R | 80 KB | 4.8 | proprietary-engine target |
| .xcode | RWX | 1 MB | 7.6 | encrypted-VM bytecode body |
| .xtext | RX | 200 KB | 5.5 | proprietary-engine target |
| .sbss | RW | 4 KB | 0.0 | proprietary-engine target |
|
||||
|
||||
## Family identification
|
||||
- Closest match: "encrypted-VM bytecode, proprietary-engine target"
|
||||
- Confidence: Medium-High
|
||||
- Other candidates (in order):
|
||||
- "encrypted-VM bytecode, Unity IL2CPP target" (Low — no
|
||||
GameAssembly.dll / global-metadata.dat pairing)
|
||||
- "encrypted-VM bytecode (CISC-dispatch variant)" (Low — no
|
||||
.code section with W + X)
|
||||
|
||||
## Lazy-decrypt stub
|
||||
- Found: yes, at 0x180001234
|
||||
- Done-flag: byte at 0x180020000
|
||||
- Decrypts: one page of .xcode
|
||||
|
||||
## Dispatcher disassembly
|
||||
- Address: 0x180005678
|
||||
- Handler table: 0x180100000
|
||||
- First 3 handler entries (raw disassembly):
|
||||
- handler[0]: 0x180105000
|
||||
- handler[1]: 0x180105080
|
||||
- handler[2]: 0x180105100
|
||||
|
||||
## Limitations
|
||||
- The handler bodies are short, arithmetic-heavy sequences.
|
||||
The LLM decompiler produces a C-like reading but the
|
||||
underlying virtual-instruction semantics are not yet
|
||||
recovered.
|
||||
- The dynamic half (which virtual instruction corresponds to
|
||||
which handler index) is not in this report. Run
|
||||
`re-vm-reverse` for the runtime trace.
|
||||
```
|
||||
|
||||
## Pairing with other skills
|
||||
|
||||
- `re-drm-fingerprint` — for the broader catalog score. The
|
||||
fingerprint skill consumes `data/drm-indicators.yaml` and
|
||||
reports the matches across all families.
|
||||
- `re-vm-reverse` — for the dynamic Wine + `re-winedbg` half.
|
||||
The encrypted-VM bytecode body is decrypted on first use;
|
||||
a runtime trace is the only way to map the handlers to
|
||||
virtual-instruction semantics.
|
||||
- `re-mba-deobfuscate` — for the MBA-obfuscated arithmetic
|
||||
inside individual handlers. `re-triton.solve_constraint` is
|
||||
the entry point (after the z3.BitVec fix in Cycle 1 / T1.4).
|
||||
- `re-llm-decompile` — for the higher-level reading of
|
||||
individual handler bodies.
|
||||