mirror of
https://github.com/langchain-ai/langsmith-self-hosted-workshops.git
synced 2026-07-01 20:44:14 -04:00
f9c22ad3ea
This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop, focused on teaching operators how to diagnose issues under pressure, collect evidence, and resolve incidents efficiently.

Documentation:
- docs/modules/module-4.md: Complete module documentation covering incident reality, common failure modes, diagnostics collection, debugging methodology, and working with Support
- docs/shared/incident_first_10_minutes.md: Quick reference checklist for critical initial incident response steps
- docs/shared/support_escalation_template.md: Copy-paste template for escalating issues to LangChain Support with all necessary information

Notebooks:
- notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment validation and setup for Module 4 labs
- notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first" discipline with comprehensive cluster state capture and canonical diagnostics script execution
- notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL connectivity failure lab with failure injection, diagnostics, and remediation
- notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue failure lab focusing on intermittent ingestion and worker backlog issues
- notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace storage failure lab covering missing traces and insert errors
- notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage configuration failure lab demonstrating ClickHouse pressure from large payloads
- notebooks/module-4/README.md: Module overview and notebook descriptions

Key Features:
- All failure labs follow consistent structure: baseline → inject → observe → collect → triage → remediate → recover
- Cloud-agnostic implementation using shared cloud helpers
- Safe-by-default failure injections with backup/restore mechanisms
- Integration with canonical LangChain diagnostics script
- Secrets-safe: no credentials printed, all values redacted
- Comprehensive Support escalation guidance for each service

Each failure lab includes:
- Service role and importance explanation
- Expected symptoms documentation
- Multiple failure injection options (subtle vs. obvious)
- Guided triage steps with automatic checks
- Support escalation requirements
- Lessons learned and common mistakes

This module completes the core operator workshop curriculum, enabling operators to confidently troubleshoot and respond to incidents in production LangSmith deployments.
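
As a rough illustration of the "secrets-safe" feature listed above, a redaction helper might look like the following minimal Python sketch. The function name `redact`, the key patterns, the mask string, and the sample values are all assumptions made for illustration; the notebooks' actual helpers are not shown in this commit message.

```python
import re

# Hypothetical key-name patterns; the workshop's real redaction rules may differ.
SENSITIVE_KEY = re.compile(r"(password|secret|token|api[_-]?key)", re.IGNORECASE)

MASK = "***REDACTED***"  # assumed placeholder, not taken from the workshop


def redact(config: dict) -> dict:
    """Return a copy of config with values under sensitive-looking keys masked."""
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Recurse so nested sections are masked as well.
            redacted[key] = redact(value)
        elif SENSITIVE_KEY.search(str(key)):
            redacted[key] = MASK
        else:
            redacted[key] = value
    return redacted


if __name__ == "__main__":
    # Made-up sample values, not real deployment settings.
    sample = {
        "host": "postgres.internal",
        "POSTGRES_PASSWORD": "example-only",
        "redis": {"auth_token": "example-only", "port": 6379},
    }
    print(redact(sample))
```

Masking by key name rather than by value keeps the check cheap and predictable, at the cost of missing secrets stored under unexpected keys.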