feat(catalog): 6 bucket refactors + new publisher-internal-diagnostic-hostname detector

Post-run follow-up to the 2026-06-06-r01 stress test
(Output/2026-06-06-r01/gap-analysis.md). The catalog changes:

- C1: split 'activation' into 'ue-component-activation'
  (Unity component-lifecycle noise) and 'license-activation'
  (the real license-gate vocabulary). ANTI-TAMPER-TAXONOMY.md
  Pattern B now references license-activation.count. Eliminates
  the 615 FPs in P3R.exe's UE component vocabulary.
- C2: split 'fingerprint' into 'custom-fingerprint' (high-signal
  HW-fingerprint literals) and 'windows-com-api-name' (standard
  COM/typelib property names). Eliminates the 48 FPs in
  P3R.exe.
- C3: 'telemetry_leak' gets exclude_keywords for asian / Asian /
  Asia / albanian / Albanian / width / Width /
  East_Asian_Width / Caucasian_Albanian / stasianwidth /
  sesasianwidth. Eliminates the 13 Unicode-UCD FPs.
- C4: 'hwid' (seeded from hwid_apis.high_signal) gets
  exclude_keywords for cl /Zi /Fd, ossl_static.pdb,
  /Fdopenssl. Eliminates the OpenSSL-static-link FPs.
- C5: 'obfuscation' gets exclude_keywords for __TBB_, tbb::,
  C:\ci\builds\, C:/ci/builds/, C:\BuildBot\,
  /ci/builds/. Eliminates the 41 TBB / CI-build FPs in
  tbb12.dll.
- C6: anti_debug_indicators.checks[].confirmation: field
  added; enum 'string_only' / 'import_only' / 'requires_disasm' /
  'requires_xref'. The 4 byte-pattern checks (RDTSC, INT 2D,
  INT 3, exception-hooking) are now 'requires_disasm'.
  Catalog has the metadata; consumer-side wiring in
  re-drm-fingerprint is deferred.

- L1: new 'publisher-internal-diagnostic-hostname' leak
  detector in servers/re-leak-scan/src/re_leak_scan/patterns.py.
  Matches internal-TLD anchor (.internal, .corp, .lan, .local,
  .intra, .private, .home.arpa) + a diagnostic-product stem
  (jenkins, jira, grafana, prometheus, kibana, splunk, sentry,
  bitbucket, gerrit, artifactory, nexus, sonarqube, vault,
  consul, etcd, datadog, newrelic, pagerduty) so public
  hostnames like jenkins.io are correctly rejected. Risk: HIGH.
  Discovered in target-B's
  pers.exe::PASystemInfoScanner.SenderInfomation (a .NET WPF
  class that does a DNS lookup of a publisher-internal .io TLD
  staging relay and conditionally sends the un-hashed machine
  fingerprint to it).

- servers/re-lief/src/re_lief/categorizers.py: added
  load_excludes() (returns {category_name: [exclude, ...]}) +
  categorize() now honors the exclude list. Backward-compatible:
  existing call sites that don't add exclude_keywords: to their
  YAML entries see no behavior change. New YAML schema fields:
  exclude_keywords: (per category, optional) and
  confirmation: (per anti_debug_indicators.checks[] entry,
  optional).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
John Smith
2026-06-06 15:56:15 -04:00
parent fe6a249306
commit 4003eee7f4
3 changed files with 239 additions and 38 deletions
+150 -28
View File
@@ -343,53 +343,81 @@ anti_debug_indicators:
Static-detectable patterns that suggest a binary is checking for
debuggers, emulators, or sandboxes. Each entry is a
detection signal and a static check.
Cycle 2 fix (C6, 2026-06-06): each check now carries a
`confirmation:` field that classifies the evidence level. The
categorizer (and downstream `re-drm-fingerprint`) only counts
a check if the confirmation is satisfied:
- `string_only` — the API name / signal name appearing in
the binary's string table is enough.
- `import_only` — the binary must import the API (link-time
IAT entry) for the check to fire. The string-table
presence alone is not enough (C++ symbol fragments like
`_Xlength_error` were the FPs that surfaced in the
2026-06-06-r01 stress test).
- `requires_disasm` — the byte-pattern check (RDTSC, INT 2D,
INT 3) must be backed by a disasm hit. The string-table
presence of "RDTSC" is meaningless; the `0F 31` opcode
at the call site is the evidence.
- `requires_xref` — manual / xref-based detection; the
automated categorizer cannot count these.
checks:
- name: "PEB.BeingDebugged"
signal: "direct read of fs:[0x30] / gs:[0x60] + 0x02 / 0x130"
detection: "string search for the byte sequence in the binary"
confirmation: "import_only"
- name: "IsDebuggerPresent"
signal: "import of kernel32!IsDebuggerPresent"
detection: "re-rizin.list_imports_exports"
confirmation: "import_only"
- name: "CheckRemoteDebuggerPresent"
signal: "import of kernel32!CheckRemoteDebuggerPresent"
detection: "re-rizin.list_imports_exports"
confirmation: "import_only"
- name: "NtQueryInformationProcess"
signal: "call to ntdll!NtQueryInformationProcess with class
ProcessDebugPort (0x07) or ProcessDebugObjectHandle (0x1E)"
detection: "re-rizin.disassemble_function around the call site;
check the immediate argument to the call"
confirmation: "import_only"
- name: "RDTSC timing"
signal: "sequence of two rdtsc instructions separated by a small
amount of code; if delta > N cycles, debugger is present"
detection: "re-rizin.search_bytes for `0F 31` opcode (rdtsc)"
confirmation: "requires_disasm"
- name: "INT 2D"
signal: "int 0x2d (single-byte interrupt) — when a debugger is
attached, the exception is swallowed and execution continues
with different state than when no debugger is attached"
detection: "re-rizin.search_bytes for `CD 2D`"
confirmation: "requires_disasm"
- name: "INT 3 trap"
signal: "int 0x3 (0xCC) used as a control-flow check; a
debugger will step over it without raising the exception,
while normal execution raises EXCEPTION_BREAKPOINT"
detection: "re-rizin.search_bytes for `CC` and check that the
following code expects SEH"
confirmation: "requires_disasm"
- name: "OutputDebugString"
signal: "import of kernel32!OutputDebugString with a non-null
buffer; if a debugger is present, the function returns
immediately without side effects, otherwise it raises an
exception that the calling code can catch"
detection: "re-rizin.list_imports_exports"
confirmation: "import_only"
- name: "exception-hooking decoys (encrypted-VM-style)"
signal: "writes to the stack frame at offsets that match
EXCEPTION_RECORD layout, *after* a CPUID or syscall, then
reads the same offsets before a comparison"
detection: "manual review; pattern: `mov [rsp+0..0x98], X`
followed by a `cmp [rsp+0..0x98], X`" # the decoy pattern
confirmation: "requires_disasm"
- name: "scattered-bit register storage (encrypted-VM marker)"
signal: "VM register values stored as bits scattered across the
stack rather than contiguously; defeats pattern matching"
detection: "manual; flagged by `re-vm-reverse` after the
dispatcher is identified"
confirmation: "requires_xref"
# ─────────────────────────────────────────────────────────────────────
# String categories. Used by `re-lief.categorize_strings` to bucket
@@ -431,13 +459,23 @@ string_categories:
- name: hwid
seed_from: hwid_apis.high_signal
seed_field: api
# C4 refactor (Cycle 2, 2026-06-06): exclude the OpenSSL
# static-link compiler-invocation line that fires on every
# binary that statically links OpenSSL (1 FP in P3R.exe,
# the `cl /Zi /Fdossl_static.pdb` line).
exclude_keywords:
- "cl /Zi /Fd"
- "ossl_static.pdb"
- "/Fdopenssl"
note: |
Inherits verbatim from `hwid_apis.high_signal[].api` —
GetComputerNameW, GetVolumeInformationW,
GetAdaptersAddresses, etc. The `medium_signal` set
(RegOpenKeyExW, RegQueryValueExW, GetSystemInfo, etc.)
lives in the `registry` and `process` categories below
for a cleaner bucket split.
for a cleaner bucket split. Cycle 2 fix (C4):
OpenSSL static-link compiler-invocation lines are
excluded.
- name: crypto
keywords:
- "OpenSSL"
@@ -575,34 +613,67 @@ string_categories:
links kernel32 + the file APIs (a pure copy tool) will
fire only on this bucket.
- name: fingerprint
# C2 refactor (Cycle 2, 2026-06-06): split the prior
# `fingerprint` bucket into two siblings. The new
# `windows-com-api-name` bucket contains the standard
# COM / typelib property names that show up in every
# Windows binary that links a typelib — high false-positive
# set for the custom-HW-fingerprint heuristic. The new
# `custom-fingerprint` bucket keeps the high-signal
# literal-pattern matches.
keywords:
- "Volume{"
- "\\\\.\\PhysicalDrive"
- "\\\\.\\CdRom"
- "SMBIOS"
- "Manufacturer"
- "SerialNumber"
- "ProductId"
- "UUID"
- "MachineGuid"
- "HKLM\\SOFTWARE\\Microsoft\\Cryptography"
- "displayName"
- "enhancedSearchGuide"
- "searchGuide"
- "fingerprint"
- "hostid"
note: |
Strings that suggest the binary is reading a
hardware-fingerprint vector *directly* (not via the API).
Less about the API, more about the *value* — `Volume{...}`
is the canonical Windows volume-serial GUID. Most
fingerprints reach the binary through the API in the
`hwid` bucket; this one catches the rare case where the
fingerprint is inlined as a literal.
- name: activation
[C2] High-signal HW-fingerprint literal patterns —
`Volume{...}` is the canonical Windows volume-serial
GUID, `\\.\PhysicalDrive` is the raw-block device path,
`SMBIOS` is the system-management-BIOS table path,
`MachineGuid` is the Win10+ per-install GUID under
`HKLM\SOFTWARE\Microsoft\Cryptography`. Less about the
API (which is in the `hwid` bucket), more about the
*value* — these are the rare cases where the fingerprint
is inlined as a literal.
- name: windows-com-api-name
# C2 refactor: new bucket split from the prior `fingerprint`
# category. Captures the standard COM / typelib property
# names that show up in every Windows binary that imports
# a typelib (DisplayName, ProductIdentifier, etc.) — high
# false-positive set for the custom-HW-fingerprint heuristic
# but a real signal of "this binary uses typelib property
# introspection" (e.g. WMI providers, DirectShow filter
# graphs, MAPI clients).
keywords:
- "Manufacturer"
- "SerialNumber"
- "ProductId"
- "ProductIdentifier"
- "UUID"
- "displayName"
- "enhancedSearchGuide"
- "searchGuide"
note: |
[C2] Standard COM / WinRT / WMI typelib property names.
High false-positive rate on binaries that link any typelib
(the property names are in the binary's import table).
The reviewer should only treat this as a high-signal
finding if the binary also fires on `custom-fingerprint`
or `hwid`. Previously grouped with `fingerprint`; the
2026-06-06-r01 stress test on target-C (P3R.exe)
surfaced 48 false-positive hits in this set.
- name: activation
# C1 refactor (Cycle 2, 2026-06-06): split the prior
# `activation` bucket into two siblings. `ue-component-activation`
# captures the noisy UE/Unity component-lifecycle vocabulary
# that fired on every Unity game (the 615 FPs in P3R.exe).
# `license-activation` keeps the real license-gate vocabulary.
keywords:
- "Activation"
- "Activate"
- "License"
- "Licence"
- "Entitlement"
@@ -622,17 +693,36 @@ string_categories:
- "Token"
- "Challenge"
- "Response"
- "Manifest"
- "msi.dll"
- "mscoree.dll"
note: |
Activation / license-gate vocabulary. Includes PKCS#7 /
CMS object names and the RegisterEventSource /
DeregisterEventSource pair that the activation routine
typically uses to write to the Windows Event Log. False
positives: any UI string containing the word
"Activate" (Unity component lifecycle) fires here; review
`samples[]` to confirm.
[C1] License-gate vocabulary. Includes PKCS#7 / CMS
object names, the RegisterEventSource /
DeregisterEventSource pair the activation routine
typically uses to write to the Windows Event Log, and
the manifest-token / challenge-response vocabulary.
The Pattern B detection rule in `ANTI-TAMPER-TAXONOMY.md`
§"How to detect the patterns" fires on
`license-activation.count >= 50` (was `activation.count`).
- name: ue-component-activation
# C1 refactor: new bucket split from the prior `activation`
# category. Captures the UE/Unity component-lifecycle
# vocabulary that fired on every Unity game (the 615 FPs
# in P3R.exe). Low signal on its own; useful as a
# "this is a Unity game" indicator.
keywords:
- "Activation"
- "Activate"
- "Manifest"
- "OnComponentActivated"
- "ENiagaraSystemSpawnSectionEndBehavior::Deactivate"
- "Deactivate"
note: |
[C1] UE / Unity component-lifecycle vocabulary. Fires
on every Unity game; the 2026-06-06-r01 stress test
surfaced 615 false-positive hits in P3R.exe alone. The
reviewer should treat this bucket as a Unity / UE
*detection* signal, not as a license-activation signal.
- name: obfuscation
keywords:
- "\\crypto\\"
@@ -663,6 +753,19 @@ string_categories:
- "PEB"
- "BeingDebugged"
- "NtGlobalFlag"
# C5 refactor (Cycle 2, 2026-06-06): exclude the high-FP
# Intel TBB / CI-build-path strings that fire on every
# binary that statically links Intel TBB (41 FPs in
# tbb12.dll in target-C).
exclude_keywords:
- "__TBB_"
- "tbb::"
- "tbb::task"
- "TBB_internal"
- "C:\\ci\\builds\\"
- "C:/ci/builds/"
- "C:\\BuildBot\\"
- "/ci/builds/"
note: |
String patterns that suggest obfuscation / VM-pack code.
Note `\\crypto\\` is a *path*, not a runtime call — it
@@ -670,7 +773,8 @@ string_categories:
into release binaries (a known false positive on
statically linked OpenSSL). The VM-dispatch strings
(lookup / dispatch / handler / vm_entry) are the
encrypted-VM bytecode category signal.
encrypted-VM bytecode category signal. Cycle 2 fix
(C5): TBB / CI-build paths excluded.
- name: misc
keywords: []
note: |
@@ -706,12 +810,30 @@ string_categories:
- "xoxa-"
- "xoxr-"
- "xoxs-"
# C3 refactor (Cycle 2, 2026-06-06): exclude the Unicode
# UCD-constant substrings that fire on every binary with a
# C runtime (the 13 FPs in P3R.exe were all `East_Asian_Width`,
# `Caucasian_Albanian`, `stasianwidth` — substrings of
# the `asian` / `albanian` / `width` keywords).
exclude_keywords:
- "asian"
- "Asian"
- "Asia"
- "albanian"
- "Albanian"
- "width"
- "Width"
- "East_Asian_Width"
- "Caucasian_Albanian"
- "stasianwidth"
- "sesasianwidth"
note: |
String patterns from the public infrastructure of
publisher telemetry + collaboration tools. Vendor-neutral
— catches the URL scheme / hostname, not the publisher.
Pairs with the ``re-leak-scan`` MCP server for a
per-leak breakdown with HTTP verification.
per-leak breakdown with HTTP verification. Cycle 2
fix (C3): Unicode UCD-constant substrings excluded.
# ─────────────────────────────────────────────────────────────────────
# Pattern indicators. Soft signals — describe the *category* of
@@ -115,6 +115,40 @@ GENERIC_HEX_SECRET = Pattern(
)
# Cycle 2 fix (L1, 2026-06-06): internal-diagnostic-relay hostname.
# Discovered in target-B's `pers.exe::PASystemInfoScanner.SenderInfomation`
# (a .NET WPF class). The class does a DNS lookup of a publisher-internal
# `.io` TLD hostname, compares the resolved IP against RFC1918
# `10.0.0.0/8` to detect the corporate environment, and conditionally
# sends the un-hashed machine fingerprint only when the host is on
# the internal network. The hostname itself is a real leak — it
# indicates the binary was built against an internal corporate
# resolver and shipped without scrubbing.
#
# The pattern matches an internal-TLD anchor + a diagnostic-product
# stem (jenkins, jira, grafana, prometheus, etc.) to keep the false-
# positive rate low (the public `jenkins.io` would otherwise match).
PUBLISHER_INTERNAL_DIAGNOSTIC_HOSTNAME = Pattern(
name="publisher-internal-diagnostic-hostname",
regex=(
r"\b(?:[a-z0-9\-]+\.)*"
r"(?:jenkins|jira|grafana|prometheus|kibana|splunk|sentry|"
r"bitbucket|gerrit|artifactory|nexus|sonarqube|vault|consul|"
r"etcd|datadog|newrelic|pagerduty)"
r"(?:\.[a-z0-9\-]+)*"
r"\.(?:internal|corp|lan|local|intra|private|home\.arpa)"
r"\b"
),
description=(
"Internal diagnostic / observability hostname — suggests "
"the binary was built against an internal corporate "
"resolver and shipped without scrubbing. Pairs with the "
"telemetry_leak catalog category in drm-indicators.yaml."
),
risk="HIGH",
)
PATTERNS: list[Pattern] = [
SENTRY_DSN,
LOGSTASH_URL,
@@ -123,6 +157,7 @@ PATTERNS: list[Pattern] = [
AWS_ACCESS_KEY,
SLACK_TOKEN,
GENERIC_HEX_SECRET,
PUBLISHER_INTERNAL_DIAGNOSTIC_HOSTNAME,
]
+54 -10
View File
@@ -110,6 +110,30 @@ def load_categories() -> dict[str, list[str]]:
return out
@lru_cache(maxsize=1)
def load_excludes() -> dict[str, list[str]]:
"""Return ``{category_name: [exclude_keyword, ...]}`` resolved from the YAML.
Cycle 2 fix: added support for ``exclude_keywords:`` per category
entry. A match that hits an *include* keyword for a category
but also hits an *exclude* keyword for the same category is
filtered out. Used to eliminate the false-positive categorizer
hits that surfaced during the 2026-06-06-r01 stress test
(e.g. the ``*asian*`` / ``*albanian*`` / ``*width*`` Unicode
UCD constants firing on ``telemetry_leak``, the OpenSSL
static-link compiler invocations firing on ``hwid``, the
``__TBB_*`` / ``C:\\ci\\builds\\*`` paths firing on
``obfuscation``).
"""
cat = _load_catalog()
out: dict[str, list[str]] = {}
for entry in cat.get("string_categories", {}).get("categories", []):
excludes = entry.get("exclude_keywords")
if excludes:
out[entry["name"]] = list(excludes)
return out
def categorize(
matches: list[dict[str, Any]],
categories: list[str] | None = None,
@@ -136,10 +160,19 @@ def categorize(
count is still reported honestly but ``samples`` is capped.
samples_per_category
Cap on the number of sample matches returned per category.
Cycle 2 fix: a match is filtered out of a category if it hits
any of the category's ``exclude_keywords``. The exclude check
runs *after* the include check, so the user sees honest counts
on real anti-tamper / fingerprint / telemetry signals while
the 700+ false-positive hits that surfaced in the 2026-06-06-r01
stress test are suppressed.
"""
cats = load_categories()
excludes = load_excludes()
if categories is not None:
cats = {k: v for k, v in cats.items() if k in categories}
excludes = {k: v for k, v in excludes.items() if k in categories}
out: dict[str, dict[str, Any]] = {
name: {"count": 0, "samples": []} for name in cats
}
@@ -153,16 +186,27 @@ def categorize(
s_lower = s.lower()
section = m.get("section", "")
for name, keywords in cats.items():
matched_include = False
for kw in keywords:
if kw and kw.lower() in s_lower:
key = (s, section)
if key in seen_in_cat[name]:
break
seen_in_cat[name].add(key)
out[name]["count"] += 1
if len(out[name]["samples"]) < samples_per_category:
out[name]["samples"].append(
{"string": s, "section": section}
)
break # count each match at most once per category
matched_include = True
break
if not matched_include:
continue
# Cycle 2 fix: honor the exclude list. If the same string
# also hits any exclude keyword for this category, the
# match is filtered out. This eliminates the Unicode
# UCD-constant and OpenSSL-static-link false positives.
cat_excludes = excludes.get(name, [])
if any(ex and ex.lower() in s_lower for ex in cat_excludes):
continue
key = (s, section)
if key in seen_in_cat[name]:
continue
seen_in_cat[name].add(key)
out[name]["count"] += 1
if len(out[name]["samples"]) < samples_per_category:
out[name]["samples"].append(
{"string": s, "section": section}
)
return out