Cory Waddingham f9c22ad3ea Add Module 4: Troubleshooting & Incident Response
This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop,
focused on teaching operators how to diagnose issues under pressure, collect
evidence, and resolve incidents efficiently.

Documentation:
- docs/modules/module-4.md: Complete module documentation covering incident
  reality, common failure modes, diagnostics collection, debugging methodology,
  and working with Support
- docs/shared/incident_first_10_minutes.md: Quick reference checklist for
  critical initial incident response steps
- docs/shared/support_escalation_template.md: Copy-paste template for
  escalating issues to LangChain Support with all necessary information

Notebooks:
- notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment
  validation and setup for Module 4 labs
- notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first"
  discipline with comprehensive cluster state capture and canonical diagnostics
  script execution
- notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL
  connectivity failure lab with failure injection, diagnostics, and remediation
- notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue
  failure lab focusing on intermittent ingestion and worker backlog issues
- notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace
  storage failure lab covering missing traces and insert errors
- notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage
  configuration failure lab demonstrating ClickHouse pressure from large payloads
- notebooks/module-4/README.md: Module overview and notebook descriptions

Key Features:
- All failure labs follow consistent structure: baseline → inject → observe →
  collect → triage → remediate → recover
- Cloud-agnostic implementation using shared cloud helpers
- Safe-by-default failure injections with backup/restore mechanisms
- Integration with canonical LangChain diagnostics script
- Secrets-safe: credentials are never printed, and secret values are redacted
- Comprehensive Support escalation guidance for each service
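
As an illustration of the safe-by-default and secrets-safe points above, here is a
minimal Python sketch. It is not the workshop's actual helper code: the names
`redact`, `injected_failure`, `SENSITIVE_MARKERS`, and the in-memory `postgres`
dictionary are hypothetical stand-ins chosen for this example, and a real lab
would act on cluster resources rather than a dictionary.

```python
import copy
from contextlib import contextmanager

# Assumption: a setting is treated as secret if its name contains one of these.
SENSITIVE_MARKERS = ("password", "secret", "token", "key")


def redact(config):
    """Return a copy of config with sensitive-looking values masked."""
    return {
        name: "***REDACTED***"
        if any(marker in name.lower() for marker in SENSITIVE_MARKERS)
        else value
        for name, value in config.items()
    }


@contextmanager
def injected_failure(config, overrides):
    """Back up config, apply the failure overrides, and always restore on exit."""
    backup = copy.deepcopy(config)
    config.update(overrides)
    try:
        yield config
    finally:
        # Restore runs even if the observe/triage step raises.
        config.clear()
        config.update(backup)


# Hypothetical in-memory stand-in for a service's connection settings.
postgres = {"host": "postgres.internal", "port": 5432, "password": "s3cr3t"}

with injected_failure(postgres, {"host": "postgres.invalid"}) as broken:
    # Inject and observe: only the redacted view is ever printed.
    print(redact(broken))

# Recover: the original settings are back in place.
print(redact(postgres))
```

Placing the restore in a `finally` clause is what makes the injection safe by
default in this sketch: the backup is reapplied whether the lab step succeeds
or fails.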

Each failure lab includes:
- Explanation of the service's role and importance
- Documentation of expected symptoms
- Multiple failure injection options (subtle vs. obvious)
- Guided triage steps with automatic checks
- Support escalation requirements
- Lessons learned and common mistakes

This module completes the core operator workshop curriculum, enabling
operators to confidently troubleshoot and respond to incidents in production
LangSmith deployments.
2026-01-02 10:47:33 -08:00