Improve PII scrubbing: Use recursive field name matching

- Changed scrub_trace.sh to match field names recursively at any depth
- Works with arrays and nested objects (e.g., "content" finds all content fields)
- Simpler interface: just field names instead of dotted paths
- Safer for PII: catches sensitive data in unexpected locations
- Updated README with new usage examples and field list

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Stephen Chu
2025-12-23 18:42:56 -05:00
parent d785edb9eb
commit 5b31b79331
2 changed files with 31 additions and 34 deletions
+13 -12
@@ -13,8 +13,8 @@ Customers extract their trace and scrub sensitive data before sending:
export LANGSMITH_API_KEY='your-api-key'
./extract_trace.sh 00000000-0000-0000-f319-b36446ca3f23
-# 2. Scrub PII
-./scrub_trace.sh trace_00000000-0000-0000-f319-b36446ca3f23.json "inputs.messages,inputs.email"
+# 2. Scrub PII (recursively redacts field names)
+./scrub_trace.sh trace_00000000-0000-0000-f319-b36446ca3f23.json "content,email"
# 3. Review scrubbed file manually
@@ -50,7 +50,7 @@ Extract a trace by ID.
### `scrub_trace.sh`
-Redact PII fields from trace.
+Redact PII fields from trace using recursive field name matching.
```bash
./scrub_trace.sh <trace_file> "<field1>,<field2>,..."
@@ -59,15 +59,16 @@ Redact PII fields from trace.
**Output:** `<trace_file>.scrubbed.json`
**Common fields to redact:**
-- `inputs.messages` - User messages
-- `inputs.email` - Email addresses
-- `inputs.query` - Search queries
-- `outputs.text` - Generated text
-- `extra.metadata.session_id` - Session IDs
-- `extra.metadata.user_id` - User IDs
-- `extra.metadata.api_key` - API keys
+- `content` - Message content (finds all content fields)
+- `email` - Email addresses
+- `messages` - Entire message arrays
+- `query` - Search queries
+- `text` - Generated text
+- `session_id` - Session IDs
+- `user_id` - User IDs
+- `api_key` - API keys
-**Handles nested fields:** Use dot notation (e.g., `extra.metadata.api_key`)
+**Recursive matching:** Field names are matched at any depth in the JSON structure, including inside arrays and nested objects. For example, specifying `content` will redact all fields named `content` anywhere in the trace.
### `upload_trace.sh`
@@ -88,7 +89,7 @@ export LANGSMITH_API_KEY='lsv2_pt_...'
# Scrub
./scrub_trace.sh trace_a1b2c3d4-5678-90ab-cdef-1234567890ab.json \
-"inputs.messages,inputs.email,extra.metadata.session_id"
+"content,email,session_id"
# Review and send trace_a1b2c3d4-5678-90ab-cdef-1234567890ab.scrubbed.json to support
```
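The recursive matching described in the README can be sketched as follows. This is a minimal sketch, assuming the filter is assembled the same way the `scrub_trace.sh` changes in this commit assemble it; `build_filter` is a hypothetical helper name used only for illustration, and the block only prints the resulting jq filter rather than running jq.

```shell
#!/usr/bin/env bash
# Sketch: assemble a recursive jq redaction filter from a comma-separated
# list of field names. build_filter is a hypothetical helper, not part of
# scrub_trace.sh.
build_filter() {
  local fields="$1" field filter
  local -a field_list
  IFS=',' read -ra field_list <<< "$fields"

  # walk() visits every value; only objects are checked for the listed keys.
  filter='walk(if type == "object" then'
  for field in "${field_list[@]}"; do
    # Trim leading and trailing whitespace around each name.
    field="${field#"${field%%[![:space:]]*}"}"
    field="${field%"${field##*[![:space:]]}"}"
    filter="$filter if has(\"$field\") then .\"$field\" = \"[REDACTED]\" else . end |"
  done

  # Drop the trailing pipe and close the outer if/walk.
  printf '%s\n' "${filter% |} else . end)"
}

build_filter "content, email"
```

The printed filter could then be passed to jq, for example `jq "$(build_filter "content,email")" trace.json`. Note that in this sketch field names are interpolated into the filter unescaped, so a name containing a double quote would produce an invalid filter.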
+18 -22
@@ -13,24 +13,27 @@ USAGE:
ARGUMENTS:
trace_file - JSON file with extracted trace
-fields - Comma-separated field paths to redact
+fields - Comma-separated field names to redact (recursively)
OUTPUT:
Creates <trace_file>.scrubbed.json
EXAMPLES:
-# Redact messages and email
-$0 trace.json "inputs.messages,inputs.email"
+# Redact all 'content' and 'email' fields anywhere in the trace
+$0 trace.json "content,email"
-# Redact nested metadata
-$0 trace.json "extra.metadata.api_key,outputs.user_data"
+# Redact nested metadata fields
+$0 trace.json "api_key,session_id,user_id"
COMMON FIELDS:
-inputs.messages - Chat messages
-inputs.email - Email addresses
-outputs.text - Output text
-extra.metadata.session_id - Session IDs
-extra.metadata.user_id - User IDs
+content - Message content (finds all content fields)
+email - Email addresses
+messages - Entire messages arrays
+session_id - Session IDs
+user_id - User IDs
+api_key - API keys
+NOTE: Fields are matched recursively at any depth, including inside arrays.
EOF
}
@@ -67,28 +70,21 @@ echo "Output: $OUTPUT_FILE"
echo "Fields: $FIELDS"
echo ""
-# Build jq filter for nested redaction
+# Build jq filter for recursive redaction
IFS=',' read -ra FIELD_LIST <<< "$FIELDS"
-JQ_FILTER='walk(if type == "object" then ('
+JQ_FILTER='walk(if type == "object" then'
for field in "${FIELD_LIST[@]}"; do
# Trim whitespace
field="${field#"${field%%[![:space:]]*}"}" # trim leading
field="${field%"${field##*[![:space:]]}"}" # trim trailing
-# Build path array for getpath/setpath
-IFS='.' read -ra PARTS <<< "$field"
-JQ_PATH="["
-for part in "${PARTS[@]}"; do
-JQ_PATH="$JQ_PATH\"$part\","
-done
-JQ_PATH="${JQ_PATH%,}]"
-JQ_FILTER="$JQ_FILTER if getpath($JQ_PATH) then setpath($JQ_PATH; \"[REDACTED]\") else . end |"
+# Add recursive field check
+JQ_FILTER="$JQ_FILTER if has(\"$field\") then .\"$field\" = \"[REDACTED]\" else . end |"
done
# Remove trailing pipe and close
-JQ_FILTER="${JQ_FILTER% |}) else . end)"
+JQ_FILTER="${JQ_FILTER% |} else . end)"
# Apply redactions
if ! jq "$JQ_FILTER" "$TRACE_FILE" > "$OUTPUT_FILE"; then