feat(skill): re-encrypted-vm-tamper + re-archive-author

re-encrypted-vm-tamper: unified encrypted-VM bytecode detection + family identification + lazy-decrypt-stub characterization. The plan §"Cycle 4" keeps the static half in this skill; the dynamic half (Wine + re-winedbg) lives in re-vm-reverse.

Workflow: get_sections -> match against drm-indicators.yaml :: section_indicators.rules -> identify the family -> look for the lazy-decrypt stub -> disassemble the dispatcher.

re-archive-author: guided Kaitai authoring for proprietary archive formats. 3-5 iteration workflow:

1. Magic-byte analysis
2. Hex walk (entropy + length-field candidates)
3. Draft v0.1 .ksy
4. compile + parse + diff
5. Finalize + commit to data/ksy/

The standalone re-archive-author MCP server was deferred in favor of this pure-skill helper: lower effort, equal capability for the 3-5 iteration workflow. re-kaitai.compile_format + re-kaitai.parse_with_format do the heavy lifting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 01:37:55 -04:00 · 2026-06-05 23:26:55 -04:00
parent ce4eef6640
commit ac14c89717
2 changed files with 417 additions and 0 deletions
@@ -0,0 +1,204 @@
+---
+name: re-archive-author
+description: Guided Kaitai authoring for proprietary archive formats (3-5 iterations with auto-suggestions). Use when the user says "author a KSY for this archive", "what format is this binary", "I have a custom .paz / .pak / .dat file — reverse the format", or hands you a binary whose first 16-64 bytes don't match any known format. Pairs with re-kaitai.compile_format + re-kaitai.parse_with_format. Iteratively produces a working .ksy in data/ksy/.
+---
+
+# Proprietary Archive Format Authoring
+
+## When to use
+
+Use this skill when the analyst encounters a binary whose first
+16-64 bytes don't match any known format (PE, ELF, MachO, ZIP,
+PNG, UnityFS, etc.) and the user wants to reverse the format.
+The output is a working `.ksy` file in `data/ksy/` that
+`re-kaitai.compile_format` + `re-kaitai.parse_with_format` can
+parse end-to-end.
+
+**What this skill returns**:
+
+1. **Magic-byte analysis** — comparison of the first 16-64 bytes
+   against the known-magic catalog
+2. **Hex walk** — entropy map, length-field candidates, record
+   boundaries
+3. **Draft `.ksy`** — version 0.1, with placeholder field names
+4. **Parse iteration** — compile + parse + diff
+5. **Final `.ksy`** — the committed format spec, with
+   `vendor-neutral: true` metadata
+
+## What this skill does NOT do
+
+- **Does not produce a runtime trace.** Archive format
+  reverse-engineering is a static problem; you only need
+  to read the bytes.
+- **Does not produce a loader.** The KSY is a *spec*, not a
+  loader. To read a sample file with a working loader, use
+  `re-kaitai.parse_with_format` after the KSY is committed.
+- **Does not produce a writer.** KSY is a read-side spec; the
+  write-side is a separate concern.
+
+## Workflow
+
+The workflow is 3-5 iterations. From iteration 3 onward, each
+iteration ends with a compiled + parsed .ksy; the analyst
+adjusts the spec based on what the parser found (and what it
+didn't).
+
+**Iteration 1 — Magic-byte analysis**
+
+```
+xxd <file> | head -8
+```
+
+Compare the first 16 bytes against the known-magic catalog:
+PE (`MZ`), ELF (`\x7fELF`), MachO (`\xcf\xfa\xed\xfe` or
+`\xfe\xed\xfa\xce`), ZIP (`PK\x03\x04`), PNG
+(`\x89PNG\r\n\x1a\n`), UnityFS (`UnityFS\x00`),
+Unity-raw (`UnityRaw\x00`), 7z, RAR, etc.
+
+If the magic matches a known format, escalate to that
+format's skill. If not, the file is this skill's territory.
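A minimal sketch of the catalog comparison described above, assuming a small in-memory table. `KNOWN_MAGICS` and `identify_magic` are hypothetical names for illustration; the project's actual known-magic catalog is presumably larger and lives elsewhere.

```python
from typing import Optional

# Hypothetical mini-catalog built from the magics listed above (7z and RAR omitted).
KNOWN_MAGICS = {
    "PE": [b"MZ"],
    "ELF": [b"\x7fELF"],
    "MachO": [b"\xcf\xfa\xed\xfe", b"\xfe\xed\xfa\xce"],
    "ZIP": [b"PK\x03\x04"],
    "PNG": [b"\x89PNG\r\n\x1a\n"],
    "UnityFS": [b"UnityFS\x00"],
    "UnityRaw": [b"UnityRaw\x00"],
}


def identify_magic(head: bytes) -> Optional[str]:
    """Return the first catalog entry whose magic is a prefix of `head`, else None."""
    for name, magics in KNOWN_MAGICS.items():
        if any(head.startswith(magic) for magic in magics):
            return name
    return None


print(identify_magic(b"PK\x03\x04\x14\x00"))        # ZIP
print(identify_magic(b"PAZ\x00\x01\x00\x00\x00"))  # None
```

A `None` result is the signal to continue with the hex walk. Short magics such as `MZ` can match by coincidence, so a hit is a prompt to confirm with the format's own parser, not a verdict.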
+
+**Iteration 2 — Hex walk**
+
+```
+re-kaitai.walk_header(path, length=256)
+```
+
+(or use `xxd` / `hexdump` directly)
+
+Identify:
+- **Magic** (4-16 bytes at offset 0)
+- **Version field** (4 bytes at offset N), usually a uint32 LE
+- **Length field** (4-8 bytes at offset M), usually a uint32 or uint64 LE
+- **Count field** (4 bytes at offset K), usually a uint32 LE
+- **First record** (starts after the header)
+
+The header is usually 16-64 bytes. The first record's start
+gives you a candidate "header size" constant.
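A minimal sketch of how length and count candidates can be surfaced during the hex walk, assuming 4-byte alignment and little-endian fields. `u32_candidates` is a hypothetical helper, not part of `re-kaitai`; it keeps only values that could plausibly be a size, offset, or count for a file of the given size.

```python
import struct


def u32_candidates(header: bytes, file_size: int, max_offset: int = 64):
    """List (offset, value) pairs for aligned u32 LE values in 1..file_size."""
    out = []
    end = min(max_offset, len(header) - 3)  # last offset with 4 readable bytes
    for off in range(0, end, 4):
        (value,) = struct.unpack_from("<I", header, off)
        if 0 < value <= file_size:
            out.append((off, value))
    return out


# Synthetic header: 4-byte magic, version=1, header_size=32, record_count=42, 16 reserved bytes.
header = b"ABCD" + struct.pack("<III", 1, 32, 42) + bytes(16)
for off, value in u32_candidates(header, file_size=32 + 42 * 64):
    print(f"0x{off:02x}: {value}")
```

On the synthetic header this prints the offsets 0x04, 0x08, and 0x0c; the ASCII magic decodes to a value far larger than the file and the zeroed reserved bytes are dropped. Which candidate is the version, the header size, or the count is still the analyst's call.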
+
+**Iteration 3 — Draft the .ksy**
+
+Create `data/ksy/<format>.ksy` with:
+
+```yaml
+meta:
+  id: my_format  # ksc ids are lower_snake_case; hyphens are rejected
+  title: My Format
+  license: MIT
+  vendor-neutral: true
+seq:
+  - id: magic
+    contents: [0x00, 0x00, 0x00, 0x00]  # placeholder: the 4 observed magic bytes
+  - id: version
+    type: u4le
+  - id: header_size
+    type: u4le
+  - id: record_count
+    type: u4le
+  - id: records
+    type: record
+    repeat: expr
+    repeat-expr: record_count
+types:
+  record:
+    seq:
+      - id: index
+        type: u4le
+      - id: data
+        size: ???  # placeholder: refine in iteration 4 (will not compile as-is)
+```
+
+**Iteration 4 — Compile + parse + diff**
+
+```bash
+kaitai-struct-compiler --target python --outdir data/ksy/_compiled data/ksy/<format>.ksy
+```
+
+Then:
+
+```python
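+import sys
+sys.path.insert(0, "data/ksy/_compiled")  # ksc wrote my_format.py here
+from my_format import MyFormat
+
+with open(path, "rb") as f:  # path: the sample archive
+    parsed = MyFormat.from_bytes(f.read())
+# Print fields, not the object: the default repr is only a memory address.
+print(parsed.version, parsed.header_size, parsed.record_count, len(parsed.records))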
+```
+
+Look at what the parser found and what it didn't. Refine
+the placeholder fields: `size: ???` becomes `size: 64` if
+every record is a fixed 64 bytes, or `size: <length field>`
+if each record carries its own length prefix. Reserve
+`size-eos: true` for a final blob that runs to end-of-stream.
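One quick arithmetic check can help decide how to refine the `size` placeholder: if the bytes after the header divide evenly among the records, a fixed record size is a reasonable first hypothesis. This is a sketch under that assumption; `fixed_record_size` is a hypothetical helper, and an even division does not prove the records are fixed-length.

```python
def fixed_record_size(file_size: int, header_size: int, record_count: int):
    """Return the per-record size if the body divides evenly, else None."""
    body = file_size - header_size
    if record_count <= 0 or body <= 0 or body % record_count:
        return None
    return body // record_count


print(fixed_record_size(2720, 32, 42))  # 64
print(fixed_record_size(2725, 32, 42))  # None
```

Note that the returned size covers the whole record, including the 4-byte `index` field in the draft, so the `data` size would be the result minus 4.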
+
+**Iteration 5 — Finalize + commit**
+
+Once the parser handles every record, add a docstring
+with the field meanings, run the leakage test, and commit:
+
+```yaml
+meta:
+  id: my_format
+  title: My Format
+  license: MIT
+  vendor-neutral: true
+  endian: le
+doc: |
+  Format reverse-engineered from <source>. Field meanings:
+  - magic: 4-byte ASCII tag
+  - version: file format version (currently 1)
+  - header_size: total header size in bytes (currently 32)
+  - record_count: number of records that follow
+  - records: array of <record_count> record entries
+```
+
+Run `./verify.sh` to confirm the leakage test passes (the KSY
+must not name any specific commercial product).
+
+## Output report format
+
+```markdown
+# Archive Format Reverse — <format>
+
+## Magic analysis
+- First 16 bytes: ...
+- Closest known format: ...
+- No known match — proprietary format.
+
+## Hex walk
+- Offset 0x00: magic (4 bytes, ASCII "....")
+- Offset 0x04: version (u4le, observed: 1)
+- Offset 0x08: header_size (u4le, observed: 32)
+- Offset 0x0C: record_count (u4le, observed: 42)
+- Offset 0x10..0x20: reserved
+- Offset 0x20: first record (record[0])
+
+## Draft KSY
+- Path: data/ksy/my_format.ksy
+- Status: parses 42/42 records cleanly
+
+## Open questions
+- record[0].data ends with what looks like a length prefix —
+  variable-length record, or fixed-length with a different
+  size?
+- The last record's data extends to EOF — is record[41] a
+  terminator?
+
+## Limitations
+- The KSY is read-only. A write-side spec is a separate
+  concern.
+- Strings inside records are not auto-decoded; the analyst
+  may need to add `type: str` or `type: strz` after a
+  parse-and-look pass.
+```
+
+## Pairing with other skills
+
+- `re-static-triage` — for the first-pass "what format is this"
+  call. If `re-lief.parse_binary` returns a known format, this
+  skill's territory is over before it starts.
+- `re-format-decode` — for the read-side runtime after the KSY
+  is committed. `re-kaitai.parse_with_format` calls the compiled
+  KSY and returns the parsed tree.
+- `re-leak-scan` — for the string-side analysis of the
+  archive's content (after the KSY lets you extract the
+  per-record data).
+- `re-decompile` — for the binary-side analysis of the
+  loader that *reads* this format. The KSY is the read-side;
+  the loader is the consumer.
@@ -0,0 +1,213 @@
+---
+name: re-encrypted-vm-tamper
+description: Unified encrypted-VM bytecode detection + family identification + lazy-decrypt-stub characterization. Use when the user says "is this binary protected by an encrypted VM", "what family of bytecode protection is this", "characterize the encrypted-VM handler", "where's the lazy-decrypt stub", or hands you a binary whose section table has unusual names (.vmp0, .xtls, .arch, .themida, .ecode, etc.). Calls re-lief.get_sections + re-rizin.disassemble_function + re-llm-decompile.decompile_function and produces a per-family characterization. Pairs with re-vm-reverse (which adds the dynamic Wine-trace half) and re-drm-fingerprint (which adds the broader catalog score).
+---
+
+# Encrypted-VM Bytecode Tampering Analysis
+
+## When to use
+
+Use this skill when the analyst's first read of a binary's section
+table shows encrypted-VM-bytecode indicators: unusual section names
+(`.vmp0`, `.vmp1`, `.xtls`, `.didata`, `.ecode`, `.xdata`,
+`.xpdata`, `.udata`, `.00cfg`, `.arch`, `.link`, `.xcode`,
+`.xtext`, `.sbss`), W+X (writable and executable, i.e. a W^X
+violation) permissions, or a `.rodata` that is suspiciously
+large and high-entropy.
+
+The user gives you a binary path (or a section name from a prior
+analysis) and asks for "what kind of encrypted-VM bytecode is
+this" or "where does it decrypt itself at runtime". The output is
+a per-family characterization — no specific commercial product
+named.
+
+**What this skill returns** (a Markdown report):
+
+1. **Header** — file path, section count, suspected family
+2. **Section table** — every section, its permissions, size,
+   entropy, and the family it suggests
+3. **Family identification** — the closest matching entry from
+   `data/drm-indicators.yaml::pattern_indicators.mappings`
+4. **Lazy-decrypt-stub detection** — whether the binary has a
+   done-flag byte + page-walk pattern at startup
+5. **Disassembly excerpts** — the entry of the dispatcher +
+   the first 3 handler entries
+6. **Limitations** — what this skill did NOT recover (handler
+   semantic inference, dynamic trace)
+
+## What this skill does NOT do
+
+- **Does not produce a runtime trace.** The encrypted-VM bytecode
+  body is decrypted on first use; for a runtime trace, escalate
+  to `re-vm-reverse` (which uses Wine + `re-winedbg`).
+- **Does not name a commercial product.** Family identification
+  is descriptive (the "encrypted-VM bytecode, IL2CPP target"
+  category, the "encrypted-VM bytecode, proprietary-engine target"
+  category, etc.) — not a vendor attribution.
+- **Does not crack the bytecode.** The handlers are reported as
+  raw disassembly; the analyst's job is to map them to
+  virtual-instruction semantics over time.
+
+## Workflow
+
+**Step 1 — Section table + entropy**
+
+```
+re-lief.get_sections(path)
+```
+
+For each section, note:
+- **Name**: match against the `section_indicators.rules` in
+  `data/drm-indicators.yaml` (the rules cover all known families).
+- **Permissions**: W+X (writable and executable, a W^X
+  violation) is a strong signal that the section is decrypted
+  in place at runtime.
+- **Entropy**: read-only sections with high Shannon entropy
+  (>7.5 bits per byte) are likely the encrypted body;
+  low-entropy `.text` is the real native code.
+- **Size ratio**: if `.rodata` is 100x larger than `.text`,
+  the encrypted body is most likely in `.rodata`.
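For reference, the entropy figure used in Step 1 is Shannon entropy in bits per byte, which ranges from 0.0 to 8.0. The sketch below shows one way to compute it over raw section bytes; how `re-lief.get_sections` computes or reports entropy is not shown in this skill, so `shannon_entropy` and `looks_encrypted` are illustrative helpers only.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Apply the >7.5 bits-per-byte cut-off from Step 1."""
    return shannon_entropy(data) > threshold


print(shannon_entropy(bytes(range(256)) * 16))  # 8.0
print(shannon_entropy(b"\x00" * 4096))          # 0.0
```

Compressed data also scores near 8.0, so high entropy alone does not distinguish encryption from compression; the permissions and size-ratio signals carry the rest of the weight.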
+
+**Step 2 — Family identification**
+
+Cross-reference the section table against
+`data/drm-indicators.yaml::pattern_indicators.mappings`:
+
+- `.xtls / .didata / .ecode / .xdata / .xpdata / .udata / .00cfg`
+  → encrypted-VM bytecode, Unity IL2CPP target
+- `.arch / .link / .xcode / .xtext / .sbss` (with `.rodata` as
+  encrypted body) → encrypted-VM bytecode, proprietary-engine target
+- `.vmp0 / .vmp1` (with `VMP` handler prefixes) → encrypted-VM
+  bytecode (alternative dispatcher variant)
+- `.themida / .winlice` → encrypted-VM bytecode (WinLicense-family)
+- `.code` with W+X → encrypted-VM bytecode (CISC-dispatch variant)
+
+The `confidence` field in each mapping is the heuristic strength,
+not a guarantee. For low-confidence matches, confirm with a
+deeper pass.
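A minimal sketch of the cross-reference in Step 2, using only the section-name groups listed above. `FAMILY_SECTIONS` and `rank_families` are hypothetical; the real mappings and their `confidence` values come from `data/drm-indicators.yaml`, and the overlap fraction computed here is not that field. The `.code` rule is left out because it depends on permissions, not just names.

```python
# Section-name groups copied from the Step 2 list; labels are the family categories.
FAMILY_SECTIONS = {
    "encrypted-VM bytecode, Unity IL2CPP target":
        {".xtls", ".didata", ".ecode", ".xdata", ".xpdata", ".udata", ".00cfg"},
    "encrypted-VM bytecode, proprietary-engine target":
        {".arch", ".link", ".xcode", ".xtext", ".sbss"},
    "encrypted-VM bytecode (alternative dispatcher variant)":
        {".vmp0", ".vmp1"},
    "encrypted-VM bytecode (WinLicense-family)":
        {".themida", ".winlice"},
}


def rank_families(section_names):
    """Rank families by the fraction of their indicator sections that are present."""
    present = set(section_names)
    scores = []
    for family, indicators in FAMILY_SECTIONS.items():
        hits = present & indicators
        if hits:
            scores.append((len(hits) / len(indicators), family, sorted(hits)))
    return sorted(scores, reverse=True)


sections = [".text", ".rodata", ".arch", ".link", ".xcode", ".xtext", ".sbss"]
for score, family, hits in rank_families(sections):
    print(f"{score:.2f}  {family}  {hits}")
```

Several of these names also occur in unprotected binaries, so a single hit with a low fraction is weak evidence on its own.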
+
+**Step 3 — Lazy-decrypt-stub detection**
+
+The "lazy decrypt stub" is a startup-time routine that decrypts
+one page of the encrypted body and sets a "done" flag. The
+canonical pattern is:
+
+```
+movzx eax, byte [done_flag]  ; read the one-byte done flag
+test  eax, eax               ; already decrypted?
+jnz   skip_decrypt
+; (decrypt one page here)
+mov   byte [done_flag], 1    ; mark decrypted
+skip_decrypt:
+; (continue with the real entry point)
+```
+
+To detect: disassemble the entry function and look for a
+"load and test of a global byte" + "conditional jump over
+the decrypt block" pattern, with a store to the same byte
+after the block. The decrypt block itself is short, processes
+one 4 KB page, and ends with a memory barrier or serializing
+instruction.
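A rough sketch of the detection heuristic above, applied to disassembly text. It assumes Intel-style operands with absolute addresses in lowercase hex (for example `byte [0x180020000]`); real output from `re-rizin.disassemble_function` may use RIP-relative operands or symbol names, in which case the patterns need adjusting. `find_done_flag` is a hypothetical helper.

```python
import re

# Assumed operand syntax; an optional size keyword precedes the memory operand.
SIZE = r"(?:(?:byte|word|dword|qword)\s+)?"
LOAD = re.compile(r"^(?:mov|movzx)\s+\w+,\s*" + SIZE + r"\[(0x[0-9a-f]+)\]$")
STORE = re.compile(r"^mov\s+" + SIZE + r"\[(0x[0-9a-f]+)\],\s*(?:1|0x1)$")
JCC = re.compile(r"^j(?:nz|ne|z|e)\s")


def find_done_flag(lines, window=3):
    """Return the flag address when a load, a nearby conditional jump,
    and a later store of 1 all involve the same global; else None."""
    lines = [line.strip().lower() for line in lines]
    for i, line in enumerate(lines):
        load = LOAD.match(line)
        if not load:
            continue
        flag = load.group(1)
        guarded = any(JCC.match(l) for l in lines[i + 1:i + 1 + window])
        stores = (STORE.match(l) for l in lines[i + 1:])
        if guarded and any(s and s.group(1) == flag for s in stores):
            return flag
    return None


stub = [
    "movzx eax, byte [0x180020000]",
    "test eax, eax",
    "jnz 0x180001260",
    "call 0x180001300",
    "mov byte [0x180020000], 1",
]
print(find_done_flag(stub))  # 0x180020000
```

A match only narrows the search: ordinary one-time initializers use the same guard shape, so the decrypt block between the jump and the store still needs to be read by hand.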
+
+**Step 4 — Disassemble the dispatcher**
+
+The dispatcher is the function that, on every VM-step, reads a
+byte from the bytecode stream and jumps to the corresponding
+handler. Look for an indirect-jump with a register index:
+
+```
+jmp  [reg + rax*8]         ; or similar — the handler table lookup
+```
+
+Disassemble the dispatcher and the first 3 handler entries:
+
+```
+re-rizin.disassemble_function(path, function="<dispatcher_addr>")
+```
+
+Each handler body implements one virtual instruction of the
+encrypted-VM bytecode. Handlers are typically small (10-30 native
+instructions each) and use only a small register set (rax, rcx,
+rdx, rsi, rdi).
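A small sketch of the indirect-jump search from Step 4, run over disassembly text. The operand syntax (`jmp qword [base + index*8]`) is an assumption about the disassembler's output, and `find_dispatch_jumps` is a hypothetical helper; it only flags candidates for the analyst to inspect.

```python
import re

# Matches e.g. "jmp qword [rcx + rax*8]" or "jmp [r10+rdx*8]"; scale 4 covers 32-bit tables.
INDIRECT_JMP = re.compile(
    r"^jmp\s+(?:qword\s+|dword\s+)?"
    r"\[\s*(?P<base>\w+)\s*\+\s*(?P<index>\w+)\s*\*\s*(?P<scale>[48])\s*\]$"
)


def find_dispatch_jumps(lines):
    """Return (line_number, base, index, scale) for each table-style indirect jmp."""
    hits = []
    for n, line in enumerate(lines):
        m = INDIRECT_JMP.match(line.strip().lower())
        if m:
            hits.append((n, m.group("base"), m.group("index"), int(m.group("scale"))))
    return hits


listing = [
    "movzx eax, byte [rsi]",    # fetch the next opcode byte
    "inc rsi",                  # advance the bytecode pointer
    "jmp qword [rcx + rax*8]",  # handler table lookup
]
print(find_dispatch_jumps(listing))  # [(2, 'rcx', 'rax', 8)]
```

Compiler-generated `switch` statements produce the same jump shape, so a hit is stronger when the index register was loaded from a byte stream just before, as in the listing.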
+
+**Step 5 — LLM decompile (optional, high-value)**
+
+The handler bodies are often 10-30 instructions of obfuscated
+arithmetic. Run them through `re-llm-decompile.decompile_function`
+for a higher-level reading. On short, arithmetic-heavy sequences
+the LLM decompiler tends to produce more readable C-like
+pseudocode than `pdc`.
+
+**Step 6 — Cross-reference the dynamic half**
+
+The lazy-decrypt-stub tells you where the body is decrypted. The
+dispatcher tells you where the handlers live. To map the
+handlers to virtual-instruction semantics, you need a runtime
+trace: escalate to `re-vm-reverse` for the Wine + `re-winedbg`
+half.
+
+## Output report format
+
+```markdown
+# Encrypted-VM Bytecode Analysis — <path>
+
+## Header
+- File: ...
+- Section count: N
+- Suspected family: "encrypted-VM bytecode, proprietary-engine target"
+- Confidence: Medium-High
+
+## Section table (encrypted-VM-relevant only)
+
+| Section | Flags | Size | Entropy | Family signal |
+|---|---|---|---|---|
+| .text | RX | 1.6 MB | 6.2 | real native code |
+| .rodata | R | 300 MB | 7.95 | encrypted body |
+| .arch | R | 200 KB | 5.1 | proprietary-engine target |
+| .link | R | 80 KB | 4.8 | proprietary-engine target |
+| .xcode | RWX | 1 MB | 7.6 | encrypted-VM bytecode body |
+| .xtext | RX | 200 KB | 5.5 | proprietary-engine target |
+| .sbss | RW | 4 KB | 0.0 | proprietary-engine target |
+
+## Family identification
+- Closest match: "encrypted-VM bytecode, proprietary-engine target"
+- Confidence: Medium-High
+- Other candidates (in order):
+  - "encrypted-VM bytecode, Unity IL2CPP target" (Low — no
+    GameAssembly.dll / global-metadata.dat pairing)
+  - "encrypted-VM bytecode (CISC-dispatch variant)" (Low — no
+    .code W+X)
+
+## Lazy-decrypt stub
+- Found: yes, at 0x180001234
+- Done-flag: byte at 0x180020000
+- Decrypts: one page of .xcode
+
+## Dispatcher disassembly
+- Address: 0x180005678
+- Handler table: 0x180100000
+- First 3 handler entries (raw disassembly):
+  - handler[0]: 0x180105000
+  - handler[1]: 0x180105080
+  - handler[2]: 0x180105100
+
+## Limitations
+- The handler bodies are short, arithmetic-heavy sequences.
+  The LLM decompiler produces a C-like reading but the
+  underlying virtual-instruction semantics are not yet
+  recovered.
+- The dynamic half (which virtual instruction corresponds to
+  which handler index) is not in this report. Run
+  `re-vm-reverse` for the runtime trace.
+```
+
+## Pairing with other skills
+
+- `re-drm-fingerprint` — for the broader catalog score. The
+  fingerprint skill consumes `data/drm-indicators.yaml` and
+  reports the matches across all families.
+- `re-vm-reverse` — for the dynamic Wine + `re-winedbg` half.
+  The encrypted-VM bytecode body is decrypted on first use;
+  a runtime trace is the only way to map the handlers to
+  virtual-instruction semantics.
+- `re-mba-deobfuscate` — for the MBA-obfuscated arithmetic
+  inside individual handlers. `re-triton.solve_constraint` is
+  the entry point (after the z3.BitVec fix in Cycle 1 / T1.4).
+- `re-llm-decompile` — for the higher-level reading of
+  individual handler bodies.