mirror of
https://github.com/Heretek-AI/RE-AI.git
synced 2026-07-01 01:37:55 -04:00
feat(catalog): 6 bucket refactors + new publisher-internal-diagnostic-hostname detector
Post-run follow-up to the 2026-06-06-r01 stress test
(Output/2026-06-06-r01/gap-analysis.md). The catalog changes:
- C1: split 'activation' into 'ue-component-activation'
(UE/Unity component-lifecycle noise) and 'license-activation'
(the real license-gate vocabulary). ANTI-TAMPER-TAXONOMY.md
Pattern B now references license-activation.count. Eliminates
the 615 FPs in P3R.exe's UE component vocabulary.
- C2: split 'fingerprint' into 'custom-fingerprint' (high-signal
HW-fingerprint literals) and 'windows-com-api-name' (standard
COM/typelib property names). Eliminates the 48 FPs in
P3R.exe.
- C3: 'telemetry_leak' gets exclude_keywords for asian / Asian /
Asia / albanian / Albanian / width / Width /
East_Asian_Width / Caucasian_Albanian / stasianwidth /
sesasianwidth. Eliminates the 13 Unicode-UCD FPs.
- C4: 'hwid' (seeded from hwid_apis.high_signal) gets
exclude_keywords for cl /Zi /Fd, ossl_static.pdb,
/Fdopenssl. Eliminates the OpenSSL-static-link FPs.
- C5: 'obfuscation' gets exclude_keywords for __TBB_, tbb::,
C:\ci\builds\, C:/ci/builds/, C:\BuildBot\,
/ci/builds/. Eliminates the 41 TBB / CI-build FPs in
tbb12.dll.
- C6: anti_debug_indicators.checks[].confirmation: field
added; enum 'string_only' / 'import_only' / 'requires_disasm' /
'requires_xref'. The 4 byte-pattern checks (RDTSC, INT 2D,
INT 3, exception-hooking) are now 'requires_disasm'.
Catalog has the metadata; consumer-side wiring in
re-drm-fingerprint is deferred.
- L1: new 'publisher-internal-diagnostic-hostname' leak
detector in servers/re-leak-scan/src/re_leak_scan/patterns.py.
Matches internal-TLD anchor (.internal, .corp, .lan, .local,
.intra, .private, .home.arpa) + a diagnostic-product stem
(jenkins, jira, grafana, prometheus, kibana, splunk, sentry,
bitbucket, gerrit, artifactory, nexus, sonarqube, vault,
consul, etcd, datadog, newrelic, pagerduty) so public
hostnames like jenkins.io are correctly rejected. Risk: HIGH.
Discovered in target-B's
pers.exe::PASystemInfoScanner.SenderInfomation (a .NET WPF
class that does a DNS lookup of a publisher-internal .io TLD
staging relay and conditionally sends the un-hashed machine
fingerprint to it).
- servers/re-lief/src/re_lief/categorizers.py: added
load_excludes() (returns {category_name: [exclude, ...]}) +
categorize() now honors the exclude list. Backward-compatible:
existing call sites that don't add exclude_keywords: to their
YAML entries see no behavior change. New YAML schema fields:
exclude_keywords: (per category, optional) and
confirmation: (per anti_debug_indicators.checks[] entry,
optional).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+150
-28
@@ -343,53 +343,81 @@ anti_debug_indicators:
     Static-detectable patterns that suggest a binary is checking for
     debuggers, emulators, or sandboxes. Each entry is a
     detection signal and a static check.
+
+    Cycle 2 fix (C6, 2026-06-06): each check now carries a
+    `confirmation:` field that classifies the evidence level. The
+    categorizer (and downstream `re-drm-fingerprint`) only counts
+    a check if the confirmation is satisfied:
+    - `string_only` — the API name / signal name appearing in
+      the binary's string table is enough.
+    - `import_only` — the binary must import the API (link-time
+      IAT entry) for the check to fire. The string-table
+      presence alone is not enough (C++ symbol fragments like
+      `_Xlength_error` were the FPs that surfaced in the
+      2026-06-06-r01 stress test).
+    - `requires_disasm` — the byte-pattern check (RDTSC, INT 2D,
+      INT 3) must be backed by a disasm hit. The string-table
+      presence of "RDTSC" is meaningless; the `0F 31` opcode
+      at the call site is the evidence.
+    - `requires_xref` — manual / xref-based detection; the
+      automated categorizer cannot count these.
   checks:
     - name: "PEB.BeingDebugged"
       signal: "direct read of fs:[0x30] / gs:[0x60] + 0x02 / 0x130"
       detection: "string search for the byte sequence in the binary"
+      confirmation: "import_only"
     - name: "IsDebuggerPresent"
       signal: "import of kernel32!IsDebuggerPresent"
       detection: "re-rizin.list_imports_exports"
+      confirmation: "import_only"
     - name: "CheckRemoteDebuggerPresent"
       signal: "import of kernel32!CheckRemoteDebuggerPresent"
       detection: "re-rizin.list_imports_exports"
+      confirmation: "import_only"
     - name: "NtQueryInformationProcess"
       signal: "call to ntdll!NtQueryInformationProcess with class
         ProcessDebugPort (0x07) or ProcessDebugObjectHandle (0x1E)"
       detection: "re-rizin.disassemble_function around the call site;
         check the immediate argument to the call"
+      confirmation: "import_only"
     - name: "RDTSC timing"
       signal: "sequence of two rdtsc instructions separated by a small
         amount of code; if delta > N cycles, debugger is present"
       detection: "re-rizin.search_bytes for `0F 31` opcode (rdtsc)"
+      confirmation: "requires_disasm"
     - name: "INT 2D"
       signal: "int 0x2d (single-byte interrupt) — when a debugger is
         attached, the exception is swallowed and execution continues
         with different state than when no debugger is attached"
       detection: "re-rizin.search_bytes for `CD 2D`"
+      confirmation: "requires_disasm"
     - name: "INT 3 trap"
       signal: "int 0x3 (0xCC) used as a control-flow check; a
         debugger will step over it without raising the exception,
         while normal execution raises EXCEPTION_BREAKPOINT"
       detection: "re-rizin.search_bytes for `CC` and check that the
         following code expects SEH"
+      confirmation: "requires_disasm"
     - name: "OutputDebugString"
       signal: "import of kernel32!OutputDebugString with a non-null
         buffer; if a debugger is present, the function returns
         immediately without side effects, otherwise it raises an
         exception that the calling code can catch"
       detection: "re-rizin.list_imports_exports"
+      confirmation: "import_only"
     - name: "exception-hooking decoys (encrypted-VM-style)"
       signal: "writes to the stack frame at offsets that match
         EXCEPTION_RECORD layout, *after* a CPUID or syscall, then
         reads the same offsets before a comparison"
       detection: "manual review; pattern: `mov [rsp+0..0x98], X`
         followed by a `cmp [rsp+0..0x98], X`" # the decoy pattern
+      confirmation: "requires_disasm"
     - name: "scattered-bit register storage (encrypted-VM marker)"
       signal: "VM register values stored as bits scattered across the
         stack rather than contiguously; defeats pattern matching"
       detection: "manual; flagged by `re-vm-reverse` after the
         dispatcher is identified"
+      confirmation: "requires_xref"

 # ─────────────────────────────────────────────────────────────────────
 # String categories. Used by `re-lief.categorize_strings` to bucket
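The commit message notes that consumer-side wiring of `confirmation:` is deferred. As a rough illustration of the four levels described in the hunk above, the sketch below gates a check on the kind of evidence available. The `EVIDENCE` layout, the `is_confirmed` helper, and the `string_only` default for entries without the field are all assumptions made for this example; none of them come from `re-drm-fingerprint`.

```python
# Hypothetical sketch: gate anti-debug checks on their `confirmation:` level.
# The evidence model and helper names are illustrative assumptions only.

CHECKS = [
    {"name": "IsDebuggerPresent", "confirmation": "import_only"},
    {"name": "RDTSC timing", "confirmation": "requires_disasm"},
    {
        "name": "scattered-bit register storage (encrypted-VM marker)",
        "confirmation": "requires_xref",
    },
]

# Assumed shape: which check names were seen in strings, imports, and disasm.
EVIDENCE = {
    "strings": {"IsDebuggerPresent", "RDTSC timing"},
    "imports": {"IsDebuggerPresent"},
    "disasm_hits": set(),
}


def is_confirmed(check: dict, evidence: dict) -> bool:
    """Return True only if the evidence meets the check's confirmation level."""
    # Assumption: entries without the optional field behave as string_only.
    level = check.get("confirmation", "string_only")
    name = check["name"]
    if level == "string_only":
        return name in evidence["strings"]
    if level == "import_only":
        return name in evidence["imports"]
    if level == "requires_disasm":
        return name in evidence["disasm_hits"]
    # requires_xref (or an unknown level): never counted automatically.
    return False


counted = [c["name"] for c in CHECKS if is_confirmed(c, EVIDENCE)]
print(counted)
```

Here the RDTSC check is not counted even though its name appears in the string set, because no disassembly hit backs it.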
@@ -431,13 +459,23 @@ string_categories:
     - name: hwid
       seed_from: hwid_apis.high_signal
       seed_field: api
+      # C4 refactor (Cycle 2, 2026-06-06): exclude the OpenSSL
+      # static-link compiler-invocation line that fires on every
+      # binary that statically links OpenSSL (1 FP in P3R.exe,
+      # the `cl /Zi /Fdossl_static.pdb` line).
+      exclude_keywords:
+        - "cl /Zi /Fd"
+        - "ossl_static.pdb"
+        - "/Fdopenssl"
       note: |
         Inherits verbatim from `hwid_apis.high_signal[].api` —
         GetComputerNameW, GetVolumeInformationW,
         GetAdaptersAddresses, etc. The `medium_signal` set
         (RegOpenKeyExW, RegQueryValueExW, GetSystemInfo, etc.)
         lives in the `registry` and `process` categories below
-        for a cleaner bucket split.
+        for a cleaner bucket split. Cycle 2 fix (C4):
+        OpenSSL static-link compiler-invocation lines are
+        excluded.
     - name: crypto
       keywords:
         - "OpenSSL"
@@ -575,34 +613,67 @@ string_categories:
         links kernel32 + the file APIs (a pure copy tool) will
         fire only on this bucket.
     - name: fingerprint
+      # C2 refactor (Cycle 2, 2026-06-06): split the prior
+      # `fingerprint` bucket into two siblings. The new
+      # `windows-com-api-name` bucket contains the standard
+      # COM / typelib property names that show up in every
+      # Windows binary that links a typelib — high false-positive
+      # set for the custom-HW-fingerprint heuristic. The new
+      # `custom-fingerprint` bucket keeps the high-signal
+      # literal-pattern matches.
       keywords:
         - "Volume{"
         - "\\\\.\\PhysicalDrive"
         - "\\\\.\\CdRom"
         - "SMBIOS"
-        - "Manufacturer"
-        - "SerialNumber"
-        - "ProductId"
-        - "UUID"
         - "MachineGuid"
         - "HKLM\\SOFTWARE\\Microsoft\\Cryptography"
-        - "displayName"
-        - "enhancedSearchGuide"
-        - "searchGuide"
         - "fingerprint"
         - "hostid"
       note: |
-        Strings that suggest the binary is reading a
-        hardware-fingerprint vector *directly* (not via the API).
-        Less about the API, more about the *value* — `Volume{...}`
-        is the canonical Windows volume-serial GUID. Most
-        fingerprints reach the binary through the API in the
-        `hwid` bucket; this one catches the rare case where the
-        fingerprint is inlined as a literal.
-    - name: activation
+        [C2] High-signal HW-fingerprint literal patterns —
+        `Volume{...}` is the canonical Windows volume-serial
+        GUID, `\\.\PhysicalDrive` is the raw-block device path,
+        `SMBIOS` is the system-management-BIOS table path,
+        `MachineGuid` is the Win10+ per-install GUID under
+        `HKLM\SOFTWARE\Microsoft\Cryptography`. Less about the
+        API (which is in the `hwid` bucket), more about the
+        *value* — these are the rare cases where the fingerprint
+        is inlined as a literal.
+    - name: windows-com-api-name
+      # C2 refactor: new bucket split from the prior `fingerprint`
+      # category. Captures the standard COM / typelib property
+      # names that show up in every Windows binary that imports
+      # a typelib (DisplayName, ProductIdentifier, etc.) — high
+      # false-positive set for the custom-HW-fingerprint heuristic
+      # but a real signal of "this binary uses typelib property
+      # introspection" (e.g. WMI providers, DirectShow filter
+      # graphs, MAPI clients).
+      keywords:
+        - "Manufacturer"
+        - "SerialNumber"
+        - "ProductId"
+        - "ProductIdentifier"
+        - "UUID"
+        - "displayName"
+        - "enhancedSearchGuide"
+        - "searchGuide"
+      note: |
+        [C2] Standard COM / WinRT / WMI typelib property names.
+        High false-positive rate on binaries that link any typelib
+        (the property names are in the binary's import table).
+        The reviewer should only treat this as a high-signal
+        finding if the binary also fires on `custom-fingerprint`
+        or `hwid`. Previously grouped with `fingerprint`; the
+        2026-06-06-r01 stress test on target-C (P3R.exe)
+        surfaced 48 false-positive hits in this set.
+    - name: activation
+      # C1 refactor (Cycle 2, 2026-06-06): split the prior
+      # `activation` bucket into two siblings. `ue-component-activation`
+      # captures the noisy UE/Unity component-lifecycle vocabulary
+      # that fired on every Unity game (the 615 FPs in P3R.exe).
+      # `license-activation` keeps the real license-gate vocabulary.
       keywords:
-        - "Activation"
-        - "Activate"
         - "License"
         - "Licence"
         - "Entitlement"
@@ -622,17 +693,36 @@ string_categories:
         - "Token"
         - "Challenge"
         - "Response"
-        - "Manifest"
         - "msi.dll"
         - "mscoree.dll"
       note: |
-        Activation / license-gate vocabulary. Includes PKCS#7 /
-        CMS object names and the RegisterEventSource /
-        DeregisterEventSource pair that the activation routine
-        typically uses to write to the Windows Event Log. False
-        positives: any UI string containing the word
-        "Activate" (Unity component lifecycle) fires here; review
-        `samples[]` to confirm.
+        [C1] License-gate vocabulary. Includes PKCS#7 / CMS
+        object names, the RegisterEventSource /
+        DeregisterEventSource pair the activation routine
+        typically uses to write to the Windows Event Log, and
+        the manifest-token / challenge-response vocabulary.
+        The Pattern B detection rule in `ANTI-TAMPER-TAXONOMY.md`
+        §"How to detect the patterns" fires on
+        `license-activation.count >= 50` (was `activation.count`).
+    - name: ue-component-activation
+      # C1 refactor: new bucket split from the prior `activation`
+      # category. Captures the UE/Unity component-lifecycle
+      # vocabulary that fired on every Unity game (the 615 FPs
+      # in P3R.exe). Low signal on its own; useful as a
+      # "this is a Unity game" indicator.
+      keywords:
+        - "Activation"
+        - "Activate"
+        - "Manifest"
+        - "OnComponentActivated"
+        - "ENiagaraSystemSpawnSectionEndBehavior::Deactivate"
+        - "Deactivate"
+      note: |
+        [C1] UE / Unity component-lifecycle vocabulary. Fires
+        on every Unity game; the 2026-06-06-r01 stress test
+        surfaced 615 false-positive hits in P3R.exe alone. The
+        reviewer should treat this bucket as a Unity / UE
+        *detection* signal, not as a license-activation signal.
     - name: obfuscation
       keywords:
         - "\\crypto\\"
@@ -663,6 +753,19 @@ string_categories:
         - "PEB"
         - "BeingDebugged"
         - "NtGlobalFlag"
+      # C5 refactor (Cycle 2, 2026-06-06): exclude the high-FP
+      # Intel TBB / CI-build-path strings that fire on every
+      # binary that statically links Intel TBB (41 FPs in
+      # tbb12.dll in target-C).
+      exclude_keywords:
+        - "__TBB_"
+        - "tbb::"
+        - "tbb::task"
+        - "TBB_internal"
+        - "C:\\ci\\builds\\"
+        - "C:/ci/builds/"
+        - "C:\\BuildBot\\"
+        - "/ci/builds/"
       note: |
         String patterns that suggest obfuscation / VM-pack code.
         Note `\\crypto\\` is a *path*, not a runtime call — it
@@ -670,7 +773,8 @@ string_categories:
         into release binaries (a known false positive on
         statically linked OpenSSL). The VM-dispatch strings
         (lookup / dispatch / handler / vm_entry) are the
-        encrypted-VM bytecode category signal.
+        encrypted-VM bytecode category signal. Cycle 2 fix
+        (C5): TBB / CI-build paths excluded.
     - name: misc
       keywords: []
       note: |
@@ -706,12 +810,30 @@ string_categories:
         - "xoxa-"
         - "xoxr-"
         - "xoxs-"
+      # C3 refactor (Cycle 2, 2026-06-06): exclude the Unicode
+      # UCD-constant substrings that fire on every binary with a
+      # C runtime (the 13 FPs in P3R.exe were all `East_Asian_Width`,
+      # `Caucasian_Albanian`, `stasianwidth` — substrings of
+      # the `asian` / `albanian` / `width` keywords).
+      exclude_keywords:
+        - "asian"
+        - "Asian"
+        - "Asia"
+        - "albanian"
+        - "Albanian"
+        - "width"
+        - "Width"
+        - "East_Asian_Width"
+        - "Caucasian_Albanian"
+        - "stasianwidth"
+        - "sesasianwidth"
       note: |
         String patterns from the public infrastructure of
         publisher telemetry + collaboration tools. Vendor-neutral
         — catches the URL scheme / hostname, not the publisher.
         Pairs with the ``re-leak-scan`` MCP server for a
-        per-leak breakdown with HTTP verification.
+        per-leak breakdown with HTTP verification. Cycle 2
+        fix (C3): Unicode UCD-constant substrings excluded.

 # ─────────────────────────────────────────────────────────────────────
 # Pattern indicators. Soft signals — describe the *category* of
@@ -115,6 +115,40 @@ GENERIC_HEX_SECRET = Pattern(
 )


+# Cycle 2 fix (L1, 2026-06-06): internal-diagnostic-relay hostname.
+# Discovered in target-B's `pers.exe::PASystemInfoScanner.SenderInfomation`
+# (a .NET WPF class). The class does a DNS lookup of a publisher-internal
+# `.io` TLD hostname, compares the resolved IP against RFC1918
+# `10.0.0.0/8` to detect the corporate environment, and conditionally
+# sends the un-hashed machine fingerprint only when the host is on
+# the internal network. The hostname itself is a real leak — it
+# indicates the binary was built against an internal corporate
+# resolver and shipped without scrubbing.
+#
+# The pattern matches an internal-TLD anchor + a diagnostic-product
+# stem (jenkins, jira, grafana, prometheus, etc.) to keep the false-
+# positive rate low (the public `jenkins.io` would otherwise match).
+PUBLISHER_INTERNAL_DIAGNOSTIC_HOSTNAME = Pattern(
+    name="publisher-internal-diagnostic-hostname",
+    regex=(
+        r"\b(?:[a-z0-9\-]+\.)*"
+        r"(?:jenkins|jira|grafana|prometheus|kibana|splunk|sentry|"
+        r"bitbucket|gerrit|artifactory|nexus|sonarqube|vault|consul|"
+        r"etcd|datadog|newrelic|pagerduty)"
+        r"(?:\.[a-z0-9\-]+)*"
+        r"\.(?:internal|corp|lan|local|intra|private|home\.arpa)"
+        r"\b"
+    ),
+    description=(
+        "Internal diagnostic / observability hostname — suggests "
+        "the binary was built against an internal corporate "
+        "resolver and shipped without scrubbing. Pairs with the "
+        "telemetry_leak catalog category in drm-indicators.yaml."
+    ),
+    risk="HIGH",
+)
+
+
 PATTERNS: list[Pattern] = [
     SENTRY_DSN,
     LOGSTASH_URL,
@@ -123,6 +157,7 @@ PATTERNS: list[Pattern] = [
     AWS_ACCESS_KEY,
     SLACK_TOKEN,
     GENERIC_HEX_SECRET,
+    PUBLISHER_INTERNAL_DIAGNOSTIC_HOSTNAME,
 ]


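To see how the internal-TLD anchor and the product stem combine, the regex from the `PUBLISHER_INTERNAL_DIAGNOSTIC_HOSTNAME` hunk can be tried with the standard `re` module. This is only a sketch: the sample hostnames are made up, and it assumes the pattern is compiled without flags, since the commit does not show how `Pattern` compiles its `regex`.

```python
import re

# Regex copied from the PUBLISHER_INTERNAL_DIAGNOSTIC_HOSTNAME definition.
# Assumption: compiled with no flags (so matching is lowercase-only).
HOSTNAME_RE = re.compile(
    r"\b(?:[a-z0-9\-]+\.)*"
    r"(?:jenkins|jira|grafana|prometheus|kibana|splunk|sentry|"
    r"bitbucket|gerrit|artifactory|nexus|sonarqube|vault|consul|"
    r"etcd|datadog|newrelic|pagerduty)"
    r"(?:\.[a-z0-9\-]+)*"
    r"\.(?:internal|corp|lan|local|intra|private|home\.arpa)"
    r"\b"
)

# Hypothetical sample strings, not taken from any scanned binary.
SAMPLES = [
    "grafana.internal",            # stem + internal TLD
    "jenkins.build.corp",          # stem, extra label, internal TLD
    "https://www.jenkins.io/doc",  # public host: stem but no internal TLD
    "relay.staging.example.io",    # no product stem, no internal TLD
]

hits = [s for s in SAMPLES if HOSTNAME_RE.search(s)]
print(hits)
```

Both conditions have to hold, so a hostname under `.io` is not flagged by this pattern whether or not it contains a product stem.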
@@ -110,6 +110,30 @@ def load_categories() -> dict[str, list[str]]:
     return out


+@lru_cache(maxsize=1)
+def load_excludes() -> dict[str, list[str]]:
+    """Return ``{category_name: [exclude_keyword, ...]}`` resolved from the YAML.
+
+    Cycle 2 fix: added support for ``exclude_keywords:`` per category
+    entry. A match that hits an *include* keyword for a category
+    but also hits an *exclude* keyword for the same category is
+    filtered out. Used to eliminate the false-positive categorizer
+    hits that surfaced during the 2026-06-06-r01 stress test
+    (e.g. the ``*asian*`` / ``*albanian*`` / ``*width*`` Unicode
+    UCD constants firing on ``telemetry_leak``, the OpenSSL
+    static-link compiler invocations firing on ``hwid``, the
+    ``__TBB_*`` / ``C:\\ci\\builds\\*`` paths firing on
+    ``obfuscation``).
+    """
+    cat = _load_catalog()
+    out: dict[str, list[str]] = {}
+    for entry in cat.get("string_categories", {}).get("categories", []):
+        excludes = entry.get("exclude_keywords")
+        if excludes:
+            out[entry["name"]] = list(excludes)
+    return out
+
+
 def categorize(
     matches: list[dict[str, Any]],
     categories: list[str] | None = None,
@@ -136,10 +160,19 @@ def categorize(
         count is still reported honestly but ``samples`` is capped.
     samples_per_category
         Cap on the number of sample matches returned per category.
+
+    Cycle 2 fix: a match is filtered out of a category if it hits
+    any of the category's ``exclude_keywords``. The exclude check
+    runs *after* the include check, so the user sees honest counts
+    on real anti-tamper / fingerprint / telemetry signals while
+    the 700+ false-positive hits that surfaced in the 2026-06-06-r01
+    stress test are suppressed.
     """
     cats = load_categories()
+    excludes = load_excludes()
     if categories is not None:
         cats = {k: v for k, v in cats.items() if k in categories}
+        excludes = {k: v for k, v in excludes.items() if k in categories}
     out: dict[str, dict[str, Any]] = {
         name: {"count": 0, "samples": []} for name in cats
     }
@@ -153,16 +186,27 @@ def categorize(
         s_lower = s.lower()
         section = m.get("section", "")
         for name, keywords in cats.items():
+            matched_include = False
             for kw in keywords:
                 if kw and kw.lower() in s_lower:
-                    key = (s, section)
-                    if key in seen_in_cat[name]:
-                        break
-                    seen_in_cat[name].add(key)
-                    out[name]["count"] += 1
-                    if len(out[name]["samples"]) < samples_per_category:
-                        out[name]["samples"].append(
-                            {"string": s, "section": section}
-                        )
-                    break  # count each match at most once per category
+                    matched_include = True
+                    break
+            if not matched_include:
+                continue
+            # Cycle 2 fix: honor the exclude list. If the same string
+            # also hits any exclude keyword for this category, the
+            # match is filtered out. This eliminates the Unicode
+            # UCD-constant and OpenSSL-static-link false positives.
+            cat_excludes = excludes.get(name, [])
+            if any(ex and ex.lower() in s_lower for ex in cat_excludes):
+                continue
+            key = (s, section)
+            if key in seen_in_cat[name]:
+                continue
+            seen_in_cat[name].add(key)
+            out[name]["count"] += 1
+            if len(out[name]["samples"]) < samples_per_category:
+                out[name]["samples"].append(
+                    {"string": s, "section": section}
+                )
     return out
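The include-then-exclude order in `categorize()` can be tried in isolation with the simplified sketch below. It is not the function from `categorizers.py`: the keyword maps are passed in directly instead of being loaded from the catalog, the name `categorize_sketch` is hypothetical, and the sample strings are invented. The include and exclude keywords are borrowed from the `obfuscation` entry in the YAML hunks.

```python
from typing import Any


def categorize_sketch(
    matches: list[dict[str, Any]],
    includes: dict[str, list[str]],
    excludes: dict[str, list[str]],
    samples_per_category: int = 5,
) -> dict[str, dict[str, Any]]:
    """Count each (string, section) once per category, honoring excludes."""
    out = {name: {"count": 0, "samples": []} for name in includes}
    seen: dict[str, set] = {name: set() for name in includes}
    for m in matches:
        s = m.get("string", "")
        s_lower = s.lower()
        section = m.get("section", "")
        for name, keywords in includes.items():
            # Include check first: case-insensitive substring match.
            if not any(kw and kw.lower() in s_lower for kw in keywords):
                continue
            # Exclude check second: any exclude hit drops the match.
            if any(ex and ex.lower() in s_lower for ex in excludes.get(name, [])):
                continue
            key = (s, section)
            if key in seen[name]:
                continue
            seen[name].add(key)
            out[name]["count"] += 1
            if len(out[name]["samples"]) < samples_per_category:
                out[name]["samples"].append({"string": s, "section": section})
    return out


INCLUDES = {"obfuscation": ["\\crypto\\", "NtGlobalFlag"]}
EXCLUDES = {"obfuscation": ["__TBB_", "C:\\ci\\builds\\"]}
# Hypothetical extracted strings.
MATCHES = [
    {"string": "C:\\ci\\builds\\openssl\\crypto\\evp\\evp_enc.c", "section": ".rdata"},
    {"string": "NtGlobalFlag", "section": ".rdata"},
    {"string": "NtGlobalFlag", "section": ".rdata"},  # duplicate, counted once
]

result = categorize_sketch(MATCHES, INCLUDES, EXCLUDES)
print(result["obfuscation"]["count"])
```

The CI build path hits the `\crypto\` include keyword but is dropped by the `C:\ci\builds\` exclude, so only the `NtGlobalFlag` string is counted.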