Add Module 4: Troubleshooting & Incident Response

This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop, focused on teaching operators how to diagnose issues under pressure, collect evidence, and resolve incidents efficiently.

Documentation:
- docs/modules/module-4.md: Complete module documentation covering incident reality, common failure modes, diagnostics collection, debugging methodology, and working with Support
- docs/shared/incident_first_10_minutes.md: Quick reference checklist for critical initial incident response steps
- docs/shared/support_escalation_template.md: Copy-paste template for escalating issues to LangChain Support with all necessary information

Notebooks:
- notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment validation and setup for Module 4 labs
- notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first" discipline with comprehensive cluster state capture and canonical diagnostics script execution
- notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL connectivity failure lab with failure injection, diagnostics, and remediation
- notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue failure lab focusing on intermittent ingestion and worker backlog issues
- notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace storage failure lab covering missing traces and insert errors
- notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage configuration failure lab demonstrating ClickHouse pressure from large payloads
- notebooks/module-4/README.md: Module overview and notebook descriptions

Key Features:
- All failure labs follow consistent structure: baseline → inject → observe → collect → triage → remediate → recover
- Cloud-agnostic implementation using shared cloud helpers
- Safe-by-default failure injections with backup/restore mechanisms
- Integration with canonical LangChain diagnostics script
- Secrets-safe: no credentials printed, all values redacted
- Comprehensive Support escalation guidance for each service

Each failure lab includes:
- Service role and importance explanation
- Expected symptoms documentation
- Multiple failure injection options (subtle vs. obvious)
- Guided triage steps with automatic checks
- Support escalation requirements
- Lessons learned and common mistakes

This module completes the core operator workshop curriculum, enabling operators to confidently troubleshoot and respond to incidents in production LangSmith deployments.
2026-07-01 20:44:14 -04:00 · 2026-01-02 10:47:33 -08:00
parent af1e7a840c
commit f9c22ad3ea
10 changed files with 5170 additions and 0 deletions
@@ -0,0 +1,426 @@
+# Module 4: Troubleshooting & Incident Response
+
+**Goal:** Teach operators how to diagnose LangSmith self-hosted issues under pressure, collect the right evidence, and resolve incidents efficiently—either independently or with LangChain Support.
+
+**Duration:** ~3-4 hours (with optional full incident drill)  
+**Audience:** On-call engineers, platform owners, SREs, and anyone responsible for keeping LangSmith healthy  
+**Prerequisites:**
+- Module 1 complete: LangSmith deployed and reachable
+- Module 2 complete: Authentication configured
+- Module 3 complete: Production operations concepts understood
+- Participants own day-2 operations
+
+---
+
+## Overview
+
+Module 4 is hands-on: learners will introduce subtle but noticeable failures and debug them using standard tools and the canonical diagnostics bundle. This module builds the muscle memory needed for real incidents.
+
+**What you'll accomplish:**
+- Understand common failure modes and their symptoms
+- Master the "first 10 minutes" incident response checklist
+- Learn to collect canonical diagnostics bundles
+- Practice debugging with guided failure labs
+- Know when and how to escalate to Support
+
+**What this module avoids:**
+- Deep dives into specific monitoring tools (only basic kubectl and helm familiarity is assumed)
+- Performance optimization (covered in Module 3)
+- Infrastructure provisioning (covered in Module 1)
+- Authentication configuration (covered in Module 2)
+
+---
+
+## Section 1: Incident Reality Check
+
+### The Mindset
+
+**Incidents happen.** Even with perfect configuration, production systems fail. The difference between a 30-minute incident and a 4-hour outage is often preparation and process.
+
+**Key principles:**
+1. **Collect evidence first.** Don't redeploy, restart, or reconfigure until you understand what's wrong.
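+2. **Time is evidence.** Every minute that passes without collecting diagnostics is lost information: Kubernetes events expire (after one hour by default), and a pod's logs disappear when the pod is deleted.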
+3. **Symptoms are clues.** The same root cause can manifest differently depending on load, timing, and configuration.
+4. **Support needs context.** A good diagnostics bundle is worth more than a perfect description.
+
+### What Makes Incidents Hard
+
+**Pressure:**
+- Users are impacted
+- Management is asking for updates
+- You're on-call and tired
+- Multiple systems are involved
+
+**Complexity:**
+- Distributed systems have many moving parts
+- Failures cascade (one service fails, others follow)
+- Symptoms don't always point to root cause
+- Configuration drift accumulates over time
+
+**Tooling:**
+- Too many tools (which one shows the truth?)
+- Too few tools (missing critical information)
+- Tools that hide the problem (aggregation, sampling)
+
+**This module prepares you for all of these.**
+
+---
+
+## Section 2: Common Failure Modes
+
+### Ingestion & Tracing Failures
+
+**Symptoms:**
+- Traces appear delayed or missing
+- Worker pods show errors in logs
+- ClickHouse insert errors
+- Queue backlogs
+
+**Common causes:**
+- ClickHouse connectivity issues (network, credentials, resource limits)
+- Blob storage misconfiguration (large payloads fail)
+- Worker resource exhaustion (CPU/memory limits)
+- Redis connectivity (job queue backing up)
+
+**What to check first:**
+- Worker pod logs
+- ClickHouse pod status and logs
+- Redis connectivity and latency
+- Blob storage configuration
+
+### UI & API Failures
+
+**Symptoms:**
+- UI returns 5xx errors
+- API endpoints timeout
+- Login fails or redirects loop
+- Specific features don't work
+
+**Common causes:**
+- Database connectivity (PostgreSQL unreachable)
+- Authentication misconfiguration (OIDC/SAML)
+- Ingress/load balancer issues
+- API pod crashes or resource limits
+
+**What to check first:**
+- API pod logs
+- Database connectivity
+- Ingress status and configuration
+- Authentication configuration (Module 2 validation)
+
+### Authentication Failures
+
+**Symptoms:**
+- Users can't log in
+- Redirect loops
+- 403 errors after successful login
+- Session timeouts
+
+**Common causes:**
+- IdP connectivity issues
+- OIDC/SAML configuration drift
+- Secret rotation without updating LangSmith
+- Network policies blocking egress
+
+**What to check first:**
+- Auth pod logs
+- IdP connectivity (curl to issuer URL)
+- OIDC/SAML configuration (Module 2 validation)
+- Network policies
+
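For the IdP connectivity check above, OIDC providers publish a discovery document at a well-known path under the issuer URL. The sketch below builds that URL and optionally fetches it. The function names and the example issuer are illustrative assumptions, not part of LangSmith or the workshop helpers. It applies to OIDC only (SAML has no equivalent well-known path), and running it from a workstation exercises the workstation's network path, not the pod's egress.

```python
import json
import urllib.request
from urllib.parse import urlsplit


def discovery_url(issuer: str) -> str:
    """Build the OIDC discovery document URL for an issuer URL."""
    parts = urlsplit(issuer)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"issuer must be an https URL: {issuer!r}")
    # OpenID Connect Discovery: strip any trailing slash, then append the well-known path.
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def fetch_discovery(issuer: str, timeout: float = 5.0) -> dict:
    """Fetch and parse the discovery document; raises on network or HTTP errors."""
    with urllib.request.urlopen(discovery_url(issuer), timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical issuer, used only to show the URL shape.
print(discovery_url("https://idp.example.com/realms/main/"))
```

A timeout or connection error from `fetch_discovery` points toward network or DNS problems, while a successful fetch whose `issuer` field differs from the configured value points toward configuration drift.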
+---
+
+## Section 3: First 10 Minutes Checklist
+
+**The first 10 minutes of an incident are critical.** This is when you collect the most valuable evidence and make decisions that determine how long the incident lasts.
+
+### What NOT to Do
+
+**Resist the urge to:**
+- Run `helm upgrade` or `kubectl rollout restart`
+- Delete pods "to see if they come back"
+- Scale resources up/down
+- Change configuration
+
+**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
+
+### The Checklist
+
+See [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md) for the complete reference.
+
+**Quick summary:**
+1. **Minute 0-2:** Triage & scope (what's broken, who's impacted)
+2. **Minute 2-5:** Quick health check (pods, events, ingress)
+3. **Minute 5-8:** Collect diagnostics bundle (canonical script + snapshots)
+4. **Minute 8-10:** Identify likely root cause (symptoms → checks)
+
+**Key insight:** This checklist is not about fixing the issue—it's about collecting evidence and making informed decisions.
+
+---
+
+## Section 4: Standard Diagnostics Collection
+
+### The Canonical Script
+
+LangChain provides an official diagnostics script that captures everything Support needs:
+
+**Location:**
+```
+https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
+```
+
+**What it captures:**
+- Pod logs (all containers)
+- Events (sorted by timestamp)
+- Resource usage (CPU, memory)
+- Configuration (deployments, services, ingress)
+- Storage (PVCs, storage classes)
+- Network (services, endpoints)
+
+**How to use it:**
+```bash
+curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
+chmod +x get_k8s_debugging_info.sh
+./get_k8s_debugging_info.sh <namespace>
+```
+
+**Important:** Always run this script before making changes. The bundle it creates is your evidence.
+
+### What Good Debugging Looks Like
+
+**Good debugging:**
+- Starts with a baseline (what was working before)
+- Collects evidence systematically (checklist-driven)
+- Documents hypotheses and tests them
+- Preserves evidence (saves diagnostics bundles)
+- Escalates with context (diagnostics + timeline)
+
+**Bad debugging:**
+- Changes things without understanding
+- Doesn't collect evidence
+- Jumps to conclusions
+- Destroys evidence (redeploys, deletes)
+- Escalates without context ("it's broken, fix it")
+
+**The difference:** Good debugging produces a clear root cause and fix. Bad debugging produces more incidents.
+
+---
+
+## Section 5: Working with Support
+
+### What Speeds Up Support
+
+**Good escalation includes:**
+- Diagnostics bundle (canonical script output)
+- Timeline (when did it start, what changed)
+- Symptoms (what's broken, who's impacted)
+- What you've tried (investigation steps, results)
+- Environment details (versions, configuration)
+
+**Use the [Support Escalation Template](../shared/support_escalation_template.md).**
+
+### What Slows Down Support
+
+**Poor escalation includes:**
+- No diagnostics bundle ("just look at it")
+- Vague symptoms ("it's slow")
+- No timeline ("it broke")
+- No environment details ("it's on Kubernetes")
+- Secrets in logs (security risk)
+
+**Result:** Support has to ask for information you could have provided, delaying resolution.
+
+### Required Metadata
+
+**Support will always ask for:**
+1. Diagnostics bundle (canonical script)
+2. Helm chart version
+3. Image tags (if known)
+4. Recent changes (deployments, config, infrastructure)
+5. Cloud provider and region
+6. Kubernetes version
+7. What you've tried and results
+
+**Provide this upfront to speed resolution.**
+
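One way to make "provide this upfront" checkable is a small pre-flight check on the escalation draft. This is a minimal sketch: the field names are hypothetical keys chosen to mirror the list above, not a format that Support defines. Image tags are treated as optional because the list marks them "if known".

```python
# Hypothetical field names mirroring the "Required Metadata" list.
REQUIRED_FIELDS = (
    "diagnostics_bundle",
    "helm_chart_version",
    "recent_changes",
    "cloud_provider",
    "region",
    "kubernetes_version",
    "steps_tried",
)


def missing_fields(escalation: dict) -> list[str]:
    """Return required fields that are absent, None, or blank, in list order."""
    missing = []
    for key in REQUIRED_FIELDS:
        value = escalation.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


draft = {
    "diagnostics_bundle": "bundle.tgz",
    "helm_chart_version": "  ",  # blank counts as missing
    "cloud_provider": "AWS",
    "region": "us-east-1",
    "kubernetes_version": None,
    "steps_tried": "checked API pod logs",
}
print(missing_fields(draft))
```

Running the check before sending the escalation turns the list into a gate rather than a reminder.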
+---
+
+## Section 6: Preventing Repeat Incidents
+
+### Post-Incident Review
+
+**After an incident is resolved:**
+1. **Document the root cause** (what actually broke)
+2. **Identify contributing factors** (what made it worse)
+3. **List what worked** (what helped you debug)
+4. **List what didn't work** (what slowed you down)
+5. **Create action items** (what to change to prevent recurrence)
+
+**Key questions:**
+- Could we have detected this earlier? (monitoring, alerts)
+- Could we have prevented this? (configuration, testing)
+- Could we have fixed it faster? (runbooks, tooling)
+- What did we learn? (new failure mode, new tool)
+
+### Common Patterns
+
+**Configuration drift:**
+- Secrets rotate, but LangSmith config isn't updated
+- Infrastructure changes, but Helm values aren't updated
+- IdP settings change, but OIDC/SAML config isn't updated
+
+**Prevention:** Automated validation (Module 2, Module 3 notebooks), configuration as code, regular audits.
+
+**Resource exhaustion:**
+- ClickHouse runs out of disk
+- PostgreSQL hits connection limits
+- Workers hit CPU/memory limits
+
+**Prevention:** Monitoring (Module 3), autoscaling (Module 3), capacity planning.
+
+**Network issues:**
+- Egress blocked by NetworkPolicy
+- Load balancer misconfiguration
+- DNS resolution failures
+
+**Prevention:** Network policy testing, ingress validation (Module 1), DNS checks.
+
+---
+
+## Section 7: Hands-on Failure Labs
+
+**This is where you practice.** Each lab follows the same pattern:
+
+1. **Baseline snapshot:** Capture what "good" looks like
+2. **Introduce failure:** Apply a subtle but noticeable fault
+3. **Observe symptoms:** See how the failure manifests
+4. **Collect diagnostics:** Run the canonical script and gather evidence
+5. **Hypothesize root cause:** Based on symptoms, identify likely cause
+6. **Verify with targeted checks:** Confirm your hypothesis
+7. **Remediate:** Revert the failure
+8. **Confirm recovery:** Verify everything is working again
+9. **Capture lessons learned:** Document what you discovered
+
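Steps 1 and 3 only pay off if the baseline can be compared against the failing state. The sketch below shows one possible comparison, assuming both snapshots were saved from `kubectl get deployments -n <namespace> -o json`. The helper names and the deployment names in the sample data are hypothetical; the lab notebooks may compare state differently.

```python
def deployment_readiness(deploy_list: dict) -> dict[str, str]:
    """Map deployment name to 'ready/desired' from a parsed deployment list."""
    readiness = {}
    for item in deploy_list.get("items", []):
        # readyReplicas is omitted from the API object when it is zero.
        ready = item.get("status", {}).get("readyReplicas", 0)
        desired = item.get("spec", {}).get("replicas", 0)
        readiness[item["metadata"]["name"]] = f"{ready}/{desired}"
    return readiness


def diff_states(baseline: dict[str, str], current: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (before, after)} for every entry that changed."""
    changes = {}
    for name in sorted(baseline.keys() | current.keys()):
        before, after = baseline.get(name), current.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Sample data with hypothetical deployment names.
baseline = {"items": [
    {"metadata": {"name": "api"}, "spec": {"replicas": 2}, "status": {"readyReplicas": 2}},
    {"metadata": {"name": "queue"}, "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},
]}
incident = {"items": [
    {"metadata": {"name": "api"}, "spec": {"replicas": 2}, "status": {}},
    {"metadata": {"name": "queue"}, "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},
]}
print(diff_states(deployment_readiness(baseline), deployment_readiness(incident)))
```

Comparing by deployment rather than by pod name keeps the diff stable, since pod names change whenever a pod is recreated.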
+### Lab Structure
+
+**Each failure lab includes:**
+- **What this service does for LangSmith:** Context on the service's role
+- **Expected symptoms when it fails:** What you'll see when it breaks
+- **Failure injection options:** Two levels (subtle vs. obvious)
+- **Do the drill:** Step-by-step debugging process
+- **What Support will ask for:** Service-specific evidence
+
+### Available Labs
+
+1. **PostgreSQL Failure Lab** (`10_failure_lab_postgres.ipynb`)
+   - Connection failures, wrong credentials, network isolation
+   - Symptoms: API 5xx, login failures, connection exhaustion
+
+2. **Redis Failure Lab** (`20_failure_lab_redis.ipynb`)
+   - Connectivity issues, wrong credentials
+   - Symptoms: Intermittent ingestion, latency spikes, worker backlog
+
+3. **ClickHouse Failure Lab** (`30_failure_lab_clickhouse.ipynb`)
+   - Endpoint misconfiguration, network isolation, resource limits
+   - Symptoms: Traces delayed/missing, insert errors, UI loads but traces don't appear
+
+4. **Blob Storage Failure Lab** (`40_failure_lab_blob_storage.ipynb`)
+   - Credential misconfiguration, bucket name errors
+   - Symptoms: Large payload traces degrade ClickHouse, warnings in logs
+
+5. **Full Incident Drill** (`90_full_incident_drill.ipynb`) (optional)
+   - Combined failure + timeline pressure
+   - Practice "first 10 minutes" checklist
+   - Produce incident summary using escalation template
+
+---
+
+## Section 8: Workshop Wrap-up
+
+### What You've Learned
+
+- How to respond to incidents systematically
+- How to collect canonical diagnostics bundles
+- How to debug common failure modes
+- How to escalate effectively to Support
+- How to prevent repeat incidents
+
+### Next Steps
+
+**Immediate:**
+- Run through failure labs to build muscle memory
+- Customize the "first 10 minutes" checklist for your environment
+- Set up monitoring and alerts (Module 3)
+
+**Ongoing:**
+- Practice incident response regularly (drills)
+- Keep diagnostics script updated
+- Document your own failure modes and fixes
+- Share learnings with your team
+
+### Resources
+
+- [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)
+- [Support Escalation Template](../shared/support_escalation_template.md)
+- [Canonical Diagnostics Script](https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh)
+- Module 1: Deployment & Baseline Validation
+- Module 2: Identity & Authentication
+- Module 3: Production Operations & Scaling
+
+---
+
+## Artifacts
+
+**Participants leave with:**
+- A working incident response process
+- Experience debugging real failure modes
+- A diagnostics bundle collection workflow
+- An escalation template customized for their environment
+- Confidence to handle incidents independently
+
+---
+
+## Common Pitfalls
+
+**Don't:**
+- Skip the baseline snapshot (you need "before" to compare to "after")
+- Redeploy before collecting evidence (destroys diagnostics)
+- Ignore error messages (they're clues)
+- Escalate without diagnostics bundle (slows Support)
+- Delete evidence (you'll need it for post-incident review)
+
+**Do:**
+- Follow the checklist (it's battle-tested)
+- Collect diagnostics early (time is evidence)
+- Document your investigation (helps you and Support)
+- Test your process (run drills)
+- Learn from each incident (prevent repeats)
+
+---
+
+## Troubleshooting
+
+**"The diagnostics script fails":**
+- Check kubectl access and namespace
+- Verify script is up-to-date (check GitHub)
+- Run with shell tracing (`bash -x get_k8s_debugging_info.sh <namespace>`) to see which command is failing
+
+**"I can't reproduce the failure":**
+- Check that failure injection was applied correctly
+- Verify symptoms match expected behavior
+- Try a different failure injection method (the more obvious Level 2 injection if the subtle Level 1 injection produced no symptoms)
+
+**"The remediation doesn't work":**
+- Verify you reverted the exact change you made
+- Check for cascading failures (one failure caused another)
+- Collect post-remediation diagnostics to compare
+
+**"I don't understand the symptoms":**
+- Review the service's role in LangSmith (lab introduction)
+- Check logs for error patterns
+- Compare to baseline snapshot (what changed?)
+
+---
+
+**Remember:** Incident response is a skill. Practice makes perfect. The more you drill, the better you'll be when real incidents happen.
+
@@ -0,0 +1,163 @@
+# First 10 Minutes: Incident Response Checklist
+
+**When:** You detect or are alerted to a LangSmith self-hosted issue.
+
+**Goal:** Collect evidence, stabilize if possible, and prepare for escalation—without making things worse.
+
+---
+
+## ⚠️ Critical: Do NOT Redeploy
+
+**Resist the urge to:**
+- Run `helm upgrade` or `kubectl rollout restart`
+- Delete pods "to see if they come back"
+- Scale resources up/down
+- Change configuration
+
+**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
+
+---
+
+## Minute 0-2: Triage & Scope
+
+- [ ] **Confirm the issue:** What's broken? (UI down, API 5xx, traces missing, auth failing)
+- [ ] **Check who's impacted:** All users, specific endpoints, specific features?
+- [ ] **Note the time:** Record detection time and any recent changes (deployments, config changes, infrastructure changes)
+- [ ] **Check basic connectivity:**
+  ```bash
+  kubectl cluster-info
+  kubectl get nodes
+  kubectl get pods -n <namespace>
+  ```
+
+---
+
+## Minute 2-5: Quick Health Check
+
+- [ ] **Pod status:**
+  ```bash
+  kubectl get pods -n <namespace> -o wide
+  ```
+  Look for: CrashLoopBackOff, Pending, Error states
+
+- [ ] **Recent events:**
+  ```bash
+  kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
+  ```
+  Look for: Failed scheduling, image pull errors, resource limits
+
+- [ ] **Ingress/Load Balancer:**
+  ```bash
+  kubectl get ingress -n <namespace>
+  ```
+  Check if endpoint is reachable (curl or browser)
+
+- [ ] **Key deployments:**
+  ```bash
+  kubectl get deployments -n <namespace>
+  kubectl describe deployment <deployment-name> -n <namespace>
+  ```
+
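The pod status check above can also be done programmatically, which helps when a namespace has many pods. This is a minimal sketch, assuming the input is the parsed output of `kubectl get pods -n <namespace> -o json`. The function name and the pod names in the sample data are hypothetical. It flags containers in a waiting state (for example CrashLoopBackOff) and pods in the Pending, Failed, or Unknown phase; it does not inspect terminated containers.

```python
def unhealthy_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, reason) pairs for pods that need attention."""
    findings = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        reason = None
        # A crash-looping pod usually reports phase Running, so check containers first.
        for container in status.get("containerStatuses", []):
            waiting = (container.get("state") or {}).get("waiting") or {}
            if waiting.get("reason"):
                reason = waiting["reason"]
                break
        if reason is None and status.get("phase") in ("Pending", "Failed", "Unknown"):
            reason = status["phase"]
        if reason:
            findings.append((name, reason))
    return findings


# Sample data with hypothetical pod names.
sample = {"items": [
    {"metadata": {"name": "api-1"},
     "status": {"phase": "Running", "containerStatuses": [{"state": {"running": {}}}]}},
    {"metadata": {"name": "worker-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "clickhouse-0"}, "status": {"phase": "Pending"}},
]}
print(unhealthy_pods(sample))
```

To use it against a live cluster, save the `kubectl get pods -o json` output to a file and pass `json.load(...)` of that file to `unhealthy_pods`; saving the file first also preserves the evidence.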
+---
+
+## Minute 5-8: Collect Diagnostics Bundle
+
+- [ ] **Run canonical diagnostics script:**
+  ```bash
+  # Download and run the official script
+  curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
+  chmod +x get_k8s_debugging_info.sh
+  ./get_k8s_debugging_info.sh <namespace>
+  ```
+  This captures: pod logs, events, resource usage, configuration
+
+- [ ] **Save timestamped snapshot:**
+  ```bash
+  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
+  mkdir -p artifacts/incident-$TIMESTAMP
+  
+  kubectl get all -n <namespace> -o yaml > artifacts/incident-$TIMESTAMP/all-resources.yaml
+  kubectl get events -n <namespace> --sort-by='.lastTimestamp' > artifacts/incident-$TIMESTAMP/events.txt
+  ```
+
+- [ ] **Check logs for obvious errors:**
+  ```bash
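+  # Check LangSmith API pod logs (not the Kubernetes API server).
+  # If these label selectors match nothing, list the actual labels with:
+  #   kubectl get pods -n <namespace> --show-labels
+  kubectl logs -n <namespace> -l app=langsmith-api --tail=100
+  
+  # Check worker logs
+  kubectl logs -n <namespace> -l app=langsmith-worker --tail=100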
+  ```
+  Look for: connection errors, timeouts, authentication failures, resource exhaustion
+
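For operators working from the notebooks, the timestamped snapshot step can be sketched in Python as well. The function names are hypothetical and are not part of the workshop's shared helpers. Unlike the shell snippet above, which uses local time, this sketch stamps the directory in UTC so that it lines up with the UTC timestamps requested in the escalation template.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_plan(namespace: str, now: datetime, root: Path = Path("artifacts")):
    """Return the incident directory and the kubectl command for each output file."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%SZ")
    out_dir = root / f"incident-{stamp}"
    commands = {
        out_dir / "all-resources.yaml": ["kubectl", "get", "all", "-n", namespace, "-o", "yaml"],
        out_dir / "events.txt": ["kubectl", "get", "events", "-n", namespace,
                                 "--sort-by=.lastTimestamp"],
    }
    return out_dir, commands


def capture(namespace: str) -> Path:
    """Run the planned commands and write their output (requires kubectl access)."""
    out_dir, commands = snapshot_plan(namespace, datetime.now(timezone.utc))
    out_dir.mkdir(parents=True, exist_ok=True)
    for path, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        # Keep stderr when a command fails: the failure itself is evidence.
        path.write_text(result.stdout or result.stderr)
    return out_dir
```

Separating the plan from the execution keeps the naming logic testable without a cluster, and `capture` writes read-only `kubectl get` output, so it does not change cluster state.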
+---
+
+## Minute 8-10: Identify Likely Root Cause
+
+Based on symptoms, check the most likely culprits:
+
+### If UI/API is down:
+- [ ] Check ingress/load balancer status (via cloud helper or kubectl)
+- [ ] Check API pod logs for startup errors
+- [ ] Verify external services (PostgreSQL, Redis) are reachable
+
+### If traces are missing/delayed:
+- [ ] Check ClickHouse connectivity and logs
+- [ ] Check worker pod logs for insert errors
+- [ ] Verify blob storage configuration (if large payloads)
+
+### If authentication fails:
+- [ ] Check OIDC/SAML configuration (Module 2 validation)
+- [ ] Check IdP connectivity
+- [ ] Review auth-related pod logs
+
+### If ingestion is slow:
+- [ ] Check Redis connectivity and latency
+- [ ] Check worker pod resource usage
+- [ ] Look for queue backlogs
+
+---
+
+## After 10 Minutes: Decision Point
+
+**If you've identified and can safely fix the issue:**
+- Document what you changed
+- Verify recovery
+- Collect post-recovery diagnostics
+
+**If you need help:**
+- Use the [Support Escalation Template](../shared/support_escalation_template.md)
+- Include the diagnostics bundle
+- Note what you've tried and the results
+
+**If the issue is critical and escalating:**
+- Continue collecting evidence every 5-10 minutes
+- Document timeline of symptoms
+- Prepare escalation with all evidence
+
+---
+
+## What NOT to Do
+
+- ❌ Don't delete namespaces or persistent volumes
+- ❌ Don't change database passwords or connection strings
+- ❌ Don't scale resources without understanding the bottleneck
+- ❌ Don't ignore error messages—they're evidence
+- ❌ Don't skip the diagnostics bundle—Support will ask for it
+
+---
+
+## Quick Reference: Common Failure Patterns
+
+| Symptom | Likely Cause | First Check |
+|---------|--------------|-------------|
+| All pods CrashLoopBackOff | Config error, missing secret | `kubectl describe pod` |
+| API 5xx errors | Database/Redis connection | Pod logs, service endpoints |
+| Traces not appearing | ClickHouse connectivity | ClickHouse pod logs |
+| Slow ingestion | Redis latency, worker backlog | Worker logs, Redis metrics |
+| Auth redirect loop | OIDC/SAML misconfiguration | Auth pod logs, IdP connectivity |
+
+---
+
+**Remember:** The goal is evidence collection and safe triage, not immediate resolution. A good diagnostics bundle is worth more than a hasty fix.
+
@@ -0,0 +1,185 @@
+# Support Escalation Template
+
+**Use this template when escalating an incident to LangChain Support.**
+
+Copy and fill in each section. Include the diagnostics bundle and any relevant evidence.
+
+---
+
+## Incident Summary
+
+**Start Time:** `YYYY-MM-DD HH:MM:SS UTC`  
+**Detection Time:** `YYYY-MM-DD HH:MM:SS UTC`  
+**Current Status:** `[Investigating / Escalating / Resolved]`
+
+**Brief Description:**
+```
+[One-sentence summary of the issue]
+```
+
+---
+
+## Symptoms
+
+**Who is impacted:**
+- [ ] All users
+- [ ] Specific user(s) or workspace(s)
+- [ ] Specific endpoints or features
+- [ ] Internal operations only
+
+**What's broken:**
+- [ ] UI is unreachable or returns errors
+- [ ] API endpoints return 5xx errors
+- [ ] Traces are missing or delayed
+- [ ] Authentication/authorization failures
+- [ ] Ingestion is slow or failing
+- [ ] Other: `[describe]`
+
+**Error messages observed:**
+```
+[Paste relevant error messages, redacting any secrets]
+```
+
+**User-facing impact:**
+```
+[Describe what users experience]
+```
+
+---
+
+## Recent Changes
+
+**Deployments/Releases:**
+- [ ] Helm upgrade/chart change: `[version/date]`
+- [ ] Configuration change: `[what changed]`
+- [ ] Infrastructure change: `[what changed]`
+- [ ] No recent changes
+
+**Timeline:**
+```
+[Chronological list of changes leading up to the incident]
+```
+
+---
+
+## Environment Details
+
+**Cloud Provider:** `[AWS / Azure / GCP / Other]`  
+**Region/Location:** `[region]`  
+**Kubernetes Service:** `[EKS / AKS / GKE / Other]`  
+**Cluster Name:** `[cluster-name]`  
+**Namespace:** `[namespace]`
+
+**LangSmith Version:**
+- Helm Chart Version: `[version]`
+- Image Tags: `[if known]`
+- Deployment Method: `[Helm / kubectl / Other]`
+
+**Infrastructure:**
+- PostgreSQL: `[RDS / Azure Database / In-cluster / Other]`
+- Redis: `[ElastiCache / Azure Cache / In-cluster / Other]`
+- ClickHouse: `[Managed / In-cluster]`
+- Blob Storage: `[S3 / Azure Blob / GCS / Other]`
+
+---
+
+## Diagnostics Bundle
+
+**Bundle Location:** `[path or URL to diagnostics bundle]`
+
+**Bundle Contents:**
+- [ ] Canonical diagnostics script output (`get_k8s_debugging_info.sh`)
+- [ ] `kubectl get all -o yaml` snapshot
+- [ ] Recent events (`kubectl get events`)
+- [ ] Pod logs (API, workers, ClickHouse)
+- [ ] Resource usage snapshot (`kubectl top pods/nodes`)
+- [ ] Ingress/load balancer configuration
+- [ ] Helm values (redacted)
+
+**Bundle Timestamp:** `YYYY-MM-DD HH:MM:SS UTC`
+
+---
+
+## What We've Tried
+
+**Investigation Steps:**
+1. `[What you checked and what you found]`
+2. `[Next step and result]`
+3. `[Continue as needed]`
+
+**Remediation Attempts:**
+- [ ] Restarted pods: `[which pods, result]`
+- [ ] Checked external service connectivity: `[result]`
+- [ ] Verified configuration: `[result]`
+- [ ] Other: `[describe]`
+
+**Current Hypothesis:**
+```
+[Your best guess at the root cause, with evidence]
+```
+
+---
+
+## Evidence & Logs
+
+**Key Log Excerpts (redact secrets):**
+```
+[Paste relevant log lines with timestamps]
+```
+
+**Error Patterns:**
+```
+[Describe patterns you've observed]
+```
+
+**Metrics/Signals:**
+```
+[Any metrics or signals that indicate the issue]
+```
+
+---
+
+## Questions for Support
+
+1. `[Your question]`
+2. `[Another question]`
+3. `[Continue as needed]`
+
+---
+
+## Additional Context
+
+**Related Issues:**
+- Previous similar incidents: `[reference]`
+- Known limitations: `[describe]`
+- Custom configurations: `[describe, redact secrets]`
+
+**Priority:**
+- [ ] Critical (service down, all users impacted)
+- [ ] High (major feature broken, many users impacted)
+- [ ] Medium (degraded performance, some users impacted)
+- [ ] Low (minor issue, workaround available)
+
+---
+
+## Next Steps
+
+**What we need from Support:**
+- [ ] Root cause analysis
+- [ ] Remediation steps
+- [ ] Configuration guidance
+- [ ] Performance optimization
+- [ ] Other: `[describe]`
+
+**Our availability:**
+- Timezone: `[timezone]`
+- Best time to contact: `[time range]`
+- Escalation contact: `[name/email]`
+
+---
+
+**Template Version:** 1.0  
+**Last Updated:** `[date]`
+
+**Note:** Always redact secrets, API keys, passwords, and connection strings before sharing. Use `[REDACTED]` or similar markers.
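
A first redaction pass over log excerpts can be automated before the manual review. The sketch below is best-effort and rests on assumptions: the patterns and the function name are illustrative, they cover only bearer tokens, `key=value` or `key: value` pairs with sensitive-looking keys, and passwords embedded in connection URLs, and they err on the side of over-redacting. It does not replace reading the excerpt before sharing it.

```python
import re

_PATTERNS = [
    # Bearer tokens in authorization headers.
    (re.compile(r"\b(bearer\s+)\S+", re.IGNORECASE), r"\1[REDACTED]"),
    # key=value or key: value where the key name looks sensitive.
    (re.compile(r"\b([\w.-]*(?:password|passwd|secret|token|api[_-]?key)[\w.-]*)(\s*[=:]\s*)(\S+)",
                re.IGNORECASE),
     r"\1\2[REDACTED]"),
    # Passwords embedded in URLs: scheme://user:password@host
    (re.compile(r"\b([a-z][a-z0-9+.-]*://[^\s:/@]+:)[^\s/@]+(@)", re.IGNORECASE),
     r"\1[REDACTED]\2"),
]


def redact(text: str) -> str:
    """Replace likely secrets in text with the [REDACTED] marker."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("POSTGRES_PASSWORD=hunter2"))
print(redact("dsn postgres://langsmith:s3cr3t@db.internal:5432/langsmith"))
```

Secrets in other shapes (quoted JSON values, base64 blobs, certificates) pass through unchanged, so treat the output as a starting point for review.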
+
@@ -0,0 +1,420 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Setup or Resume Environment\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This notebook helps you prepare for Module 4 (Troubleshooting & Incident Response). It validates that your LangSmith environment is running and accessible, or directs you to deploy it using Module 1.\n",
+    "\n",
+    "**Prerequisites:**\n",
+    "- Module 1 notebooks available (for deployment if needed)\n",
+    "- kubectl configured (if environment exists)\n",
+    "- Cloud provider credentials (if deploying)\n",
+    "\n",
+    "**What This Notebook Does:**\n",
+    "1. Checks if LangSmith is already deployed\n",
+    "2. If not, provides links to Module 1 deployment notebooks\n",
+    "3. If yes, validates the environment is healthy and reachable\n",
+    "4. Confirms prerequisites for Module 4 failure labs\n",
+    "\n",
+    "**Estimated time:** 10-15 minutes\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path so we can import shared as a package\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,  # If cwd is module-4, go up one level to notebooks\n",
+    "    Path.cwd(),  # If cwd is already notebooks\n",
+    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration\n",
+    "\n",
+    "Load and validate configuration from environment variables.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env, ok, warn\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region\n",
+    "\n",
+    "# Required configuration\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "\n",
+    "print(\"### Loading Configuration\\n\")\n",
+    "\n",
+    "config = {}\n",
+    "missing = []\n",
+    "\n",
+    "for var in required_vars:\n",
+    "    value = os.environ.get(var, \"\").strip()\n",
+    "    if not value:\n",
+    "        missing.append(var)\n",
+    "    config[var] = value\n",
+    "\n",
+    "if missing:\n",
+    "    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
+    "                      f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
+    "\n",
+    "# Optional but recommended\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
+    "\n",
+    "# Show cloud provider info\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "\n",
+    "print(f\"Cloud Provider: {provider.upper()}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "print(f\"Namespace: {config['NAMESPACE']}\")\n",
+    "print(f\"Cluster: {config['CLUSTER_NAME']}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "    print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Check if Environment Exists\n",
+    "\n",
+    "We'll check if LangSmith is already deployed. If not, we'll provide instructions to deploy using Module 1.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "from shared._cloud_helpers import cluster_exists, configure_kubectl, get_kubernetes_service_name\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "cluster_name = config[\"CLUSTER_NAME\"]\n",
+    "k8s_service = get_kubernetes_service_name()\n",
+    "\n",
+    "print(f\"### Checking {k8s_service} Cluster\\n\")\n",
+    "\n",
+    "# Check if cluster exists\n",
+    "if cluster_exists(cluster_name):\n",
+    "    ok(f\"Cluster '{cluster_name}' exists\")\n",
+    "    \n",
+    "    # Configure kubectl\n",
+    "    print(\"\\n### Configuring kubectl\\n\")\n",
+    "    try:\n",
+    "        configure_kubectl(cluster_name, region)\n",
+    "        ok(\"kubectl configured\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not configure kubectl: {e}\")\n",
+    "        print(\"💡 Make sure you have proper cloud provider credentials\")\n",
+    "        raise\n",
+    "else:\n",
+    "    warn(f\"Cluster '{cluster_name}' not found\")\n",
+    "    print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
+    "    print(\"   See the 'Deploy Environment' section below.\")\n",
+    "    raise RuntimeError(\"Cluster not found. Deploy using Module 1 first.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Verify Namespace and Helm Release\n",
+    "\n",
+    "Check that the LangSmith namespace exists and the Helm release is installed.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "helm_release = config[\"HELM_RELEASE\"]\n",
+    "\n",
+    "print(\"### Checking Namespace\\n\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Namespace '{namespace}' exists\")\n",
+    "else:\n",
+    "    warn(f\"Namespace '{namespace}' not found\")\n",
+    "    print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
+    "    print(\"   See the 'Deploy Environment' section below.\")\n",
+    "    raise RuntimeError(\"Namespace not found. Deploy using Module 1 first.\")\n",
+    "\n",
+    "print(\"\\n### Checking Helm Release\\n\")\n",
+    "result = run(\n",
+    "    [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    releases = json.loads(result.stdout)\n",
+    "    release = next((r for r in releases if r.get(\"name\") == helm_release), None)\n",
+    "    \n",
+    "    if release:\n",
+    "        ok(f\"Helm release '{helm_release}' found\")\n",
+    "        # `helm list -o json` already includes status, chart, and app_version,\n",
+    "        # so no second Helm call is needed to report release details\n",
+    "        print(f\"   Status: {release.get('status', 'unknown')}\")\n",
+    "        print(f\"   Chart: {release.get('chart', 'unknown')}\")\n",
+    "        print(f\"   App Version: {release.get('app_version', 'unknown')}\")\n",
+    "    else:\n",
+    "        warn(f\"Helm release '{helm_release}' not found in namespace '{namespace}'\")\n",
+    "        print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
+    "        print(\"   See the 'Deploy Environment' section below.\")\n",
+    "        raise RuntimeError(\"Helm release not found. Deploy using Module 1 first.\")\n",
+    "else:\n",
+    "    warn(\"Could not list Helm releases\")\n",
+    "    print(\"💡 Make sure Helm is installed and kubectl is configured correctly\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Verify Ingress Endpoint\n",
+    "\n",
+    "Check that the LangSmith ingress is configured and reachable.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "from urllib.parse import urlparse\n",
+    "\n",
+    "print(\"### Checking Ingress\\n\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "ingress_found = False\n",
+    "ingress_host = None\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ingresses = json.loads(result.stdout)\n",
+    "    for ingress in ingresses.get(\"items\", []):\n",
+    "        rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
+    "        for rule in rules:\n",
+    "            host = rule.get(\"host\", \"\")\n",
+    "            if host:\n",
+    "                ingress_found = True\n",
+    "                ingress_host = host\n",
+    "                print(f\"   Found ingress with host: {host}\")\n",
+    "                break\n",
+    "        if ingress_found:\n",
+    "            break  # Stop at the first ingress that declares a host\n",
+    "\n",
+    "if not ingress_found:\n",
+    "    warn(\"No ingress found\")\n",
+    "    print(\"💡 Ingress may still be provisioning. Check Module 1 validation notebook.\")\n",
+    "else:\n",
+    "    ok(f\"Ingress configured with host: {ingress_host}\")\n",
+    "    \n",
+    "    # Try to reach the endpoint\n",
+    "    if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "        test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
+    "    elif ingress_host:\n",
+    "        test_url = f\"https://{ingress_host}\"\n",
+    "    else:\n",
+    "        test_url = None\n",
+    "    \n",
+    "    if test_url:\n",
+    "        print(f\"\\n### Testing Endpoint Reachability\\n\")\n",
+    "        print(f\"Testing: {test_url}\")\n",
+    "        try:\n",
+    "            # Allow redirects, don't verify SSL (may be self-signed)\n",
+    "            response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
+    "            if response.status_code in [200, 302, 401, 403]:\n",
+    "                ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
+    "            else:\n",
+    "                warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
+    "        except requests.exceptions.SSLError as e:\n",
+    "            # verify=False already skips certificate validation, so an SSLError here\n",
+    "            # points to a TLS handshake problem rather than a self-signed certificate\n",
+    "            warn(f\"TLS handshake failed: {e}\")\n",
+    "            print(\"💡 Check the ingress TLS configuration. In production, use proper TLS certificates.\")\n",
+    "        except requests.exceptions.RequestException as e:\n",
+    "            warn(f\"Could not reach endpoint: {e}\")\n",
+    "            print(\"💡 Ingress may still be provisioning. Wait a few minutes and try again.\")\n",
+    "    else:\n",
+    "        warn(\"No domain configured for testing\")\n",
+    "        print(\"💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Quick Health Check\n",
+    "\n",
+    "Verify that key deployments are running.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Checking Key Deployments\\n\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    deployments = json.loads(result.stdout)\n",
+    "    deployment_items = deployments.get(\"items\", [])\n",
+    "    \n",
+    "    if deployment_items:\n",
+    "        ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
+    "        print(\"\\nDeployment Status:\")\n",
+    "        for deployment in deployment_items:\n",
+    "            name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            spec_replicas = deployment.get(\"spec\", {}).get(\"replicas\", 0)\n",
+    "            status_replicas = deployment.get(\"status\", {}).get(\"replicas\", 0)\n",
+    "            ready_replicas = deployment.get(\"status\", {}).get(\"readyReplicas\", 0)\n",
+    "            available_replicas = deployment.get(\"status\", {}).get(\"availableReplicas\", 0)\n",
+    "            \n",
+    "            status_icon = \"✅\" if ready_replicas == spec_replicas and available_replicas == spec_replicas else \"⚠️\"\n",
+    "            print(f\"   {status_icon} {name}: {ready_replicas}/{spec_replicas} ready, {available_replicas}/{spec_replicas} available\")\n",
+    "    else:\n",
+    "        warn(\"No deployments found\")\n",
+    "        print(\"💡 LangSmith may not be fully deployed. Check Module 1 validation notebook.\")\n",
+    "else:\n",
+    "    warn(\"Could not list deployments\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ✅ Environment Ready\n",
+    "\n",
+    "Your LangSmith environment is running and accessible. You're ready to proceed with Module 4 failure labs.\n",
+    "\n",
+    "**Next Steps:**\n",
+    "1. Run `01_diagnostics_baseline.ipynb` to capture a baseline snapshot\n",
+    "2. Proceed with failure labs (10, 20, 30, 40)\n",
+    "3. Optionally run `90_full_incident_drill.ipynb` for a complete incident simulation\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 📝 Important Reminder\n",
+    "\n",
+    "**When finished with Module 4, run Module 1's `99_teardown.ipynb` to delete the environment and avoid ongoing cloud costs.**\n",
+    "\n",
+    "The teardown notebook will:\n",
+    "- Remove Helm release\n",
+    "- Destroy Terraform-managed infrastructure (Kubernetes cluster, database, cache, blob storage, etc.)\n",
+    "- Clean up any remaining resources\n",
+    "\n",
+    "**Location:** `../module-1/99_teardown.ipynb`\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🚀 Deploy Environment (If Not Already Deployed)\n",
+    "\n",
+    "If your environment is not running, follow these steps to deploy LangSmith using Module 1:\n",
+    "\n",
+    "### Step 1: Preflight Checks\n",
+    "Run `../module-1/01_preflight.ipynb` to validate your environment.\n",
+    "\n",
+    "### Step 2: Provision Infrastructure\n",
+    "Run `../module-1/02_terraform_apply.ipynb` to deploy cloud infrastructure (Kubernetes cluster, database, cache, blob storage).\n",
+    "\n",
+    "### Step 3: Install LangSmith\n",
+    "Run `../module-1/03_helm_install_langsmith.ipynb` to install LangSmith using Helm.\n",
+    "\n",
+    "### Step 4: Validate Deployment\n",
+    "Run `../module-1/04_validate_ingress_and_ui.ipynb` to verify everything is working.\n",
+    "\n",
+    "### Step 5: Return Here\n",
+    "Once deployment is complete, return to this notebook and re-run the cells above to verify your environment is ready for Module 4.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "**Note:** If you encounter errors during deployment, refer to Module 1 documentation and troubleshooting guides.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,600 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Diagnostics Baseline\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This notebook teaches \"baseline first\" discipline.** Before introducing failures or debugging issues, you must capture what \"good\" looks like. This baseline becomes your reference point for all troubleshooting.\n",
+    "\n",
+    "**What This Notebook Does:**\n",
+    "1. Captures cluster state snapshot (pods, services, deployments)\n",
+    "2. Collects recent events and resource usage\n",
+    "3. Runs the canonical diagnostics script\n",
+    "4. Performs basic health checks\n",
+    "5. Saves everything to a timestamped directory\n",
+    "\n",
+    "**Why This Matters:**\n",
+    "- You need \"before\" to compare to \"after\"\n",
+    "- Support will ask for baseline diagnostics\n",
+    "- Good debugging starts with understanding normal state\n",
+    "- Evidence collection is time-sensitive\n",
+    "\n",
+    "**Estimated time:** 15-20 minutes\n",
+    "\n",
+    "**Important:** Run this notebook BEFORE starting any failure labs. It's your evidence baseline.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Add notebooks directory to path so we can import shared as a package\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,  # If cwd is module-4, go up one level to notebooks\n",
+    "    Path.cwd(),  # If cwd is already notebooks\n",
+    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "\n",
+    "# Create timestamped directory for this baseline\n",
+    "timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
+    "baseline_dir = artifacts_dir / \"module-4\" / f\"baseline-{timestamp}\"\n",
+    "baseline_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"\\nBaseline directory: {baseline_dir}\")\n",
+    "print(f\"All diagnostics will be saved here.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration\n",
+    "\n",
+    "Load and validate configuration from environment variables.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import ok, warn\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region\n",
+    "\n",
+    "# Required configuration\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "\n",
+    "print(\"### Loading Configuration\\n\")\n",
+    "\n",
+    "config = {}\n",
+    "missing = []\n",
+    "\n",
+    "for var in required_vars:\n",
+    "    value = os.environ.get(var, \"\").strip()\n",
+    "    if not value:\n",
+    "        missing.append(var)\n",
+    "    config[var] = value\n",
+    "\n",
+    "if missing:\n",
+    "    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
+    "                      f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
+    "\n",
+    "# Optional but recommended\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "# Show cloud provider info\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "\n",
+    "print(f\"Cloud Provider: {provider.upper()}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "    print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Cluster State Snapshot\n",
+    "\n",
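+    "Capture a snapshot of the workload resources in the namespace. This is your \"before\" picture. Note that `kubectl get all` covers pods, services, deployments, and similar workload types, but not ingresses, ConfigMaps, Secrets, or PVCs.\n"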
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Capturing Cluster State Snapshot\\n\")\n",
+    "\n",
+    "# Get all resources\n",
+    "print(\"1. Collecting all resources...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    snapshot_file = baseline_dir / \"all-resources.txt\"\n",
+    "    with open(snapshot_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved resource snapshot to {snapshot_file.name}\")\n",
+    "    print(f\"   Resources captured: {len(result.stdout.splitlines())} lines\")\n",
+    "else:\n",
+    "    warn(\"Could not capture resource snapshot\")\n",
+    "\n",
+    "# Get all resources as YAML (more detailed)\n",
+    "print(\"\\n2. Collecting detailed YAML...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    yaml_file = baseline_dir / \"all-resources.yaml\"\n",
+    "    with open(yaml_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved detailed YAML to {yaml_file.name}\")\n",
+    "else:\n",
+    "    warn(\"Could not capture detailed YAML\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Key Deployments Description\n",
+    "\n",
+    "Get detailed information about key deployments.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Describing Key Deployments\\n\")\n",
+    "\n",
+    "# Get list of deployments\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    deployments = json.loads(result.stdout)\n",
+    "    deployment_items = deployments.get(\"items\", [])\n",
+    "    \n",
+    "    if deployment_items:\n",
+    "        ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
+    "        \n",
+    "        # Describe each deployment\n",
+    "        for deployment in deployment_items:\n",
+    "            name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            print(f\"\\n3. Describing deployment: {name}\")\n",
+    "            \n",
+    "            result = run(\n",
+    "                [\"kubectl\", \"describe\", \"deployment\", name, \"-n\", namespace],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            \n",
+    "            if result.returncode == 0:\n",
+    "                desc_file = baseline_dir / f\"deployment-{name}.txt\"\n",
+    "                with open(desc_file, \"w\") as f:\n",
+    "                    f.write(result.stdout)\n",
+    "                print(f\"   ✅ Saved description to {desc_file.name}\")\n",
+    "            else:\n",
+    "                warn(f\"Could not describe deployment {name}\")\n",
+    "    else:\n",
+    "        warn(\"No deployments found\")\n",
+    "else:\n",
+    "    warn(\"Could not list deployments\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Recent Events\n",
+    "\n",
+    "Capture recent events sorted by timestamp. Events often contain the first clues about what's happening.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Collecting Recent Events\\n\")\n",
+    "\n",
+    "# Get events sorted by timestamp\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    events_file = baseline_dir / \"events.txt\"\n",
+    "    with open(events_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved events to {events_file.name}\")\n",
+    "    \n",
+    "    # Count events (the first line of kubectl output is the column header)\n",
+    "    lines = result.stdout.strip().split(\"\\n\")\n",
+    "    if len(lines) > 1:  # Header + events\n",
+    "        event_count = len(lines) - 1\n",
+    "        print(f\"   Captured {event_count} event(s)\")\n",
+    "        \n",
+    "        # Show the last few events, excluding the header row\n",
+    "        print(\"\\n   Last 5 events:\")\n",
+    "        for line in lines[1:][-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "    else:\n",
+    "        print(\"   No events found (this is normal for a healthy cluster)\")\n",
+    "else:\n",
+    "    warn(\"Could not collect events\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Resource Usage\n",
+    "\n",
+    "Capture resource usage (CPU, memory) if metrics are available.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Collecting Resource Usage\\n\")\n",
+    "\n",
+    "# Top pods\n",
+    "print(\"1. Checking pod resource usage...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    top_pods_file = baseline_dir / \"top-pods.txt\"\n",
+    "    with open(top_pods_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved pod resource usage to {top_pods_file.name}\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Could not get pod resource usage (metrics server may not be available)\")\n",
+    "    print(\"   💡 This is OK - metrics are optional for baseline collection\")\n",
+    "\n",
+    "# Top nodes (if available)\n",
+    "print(\"\\n2. Checking node resource usage...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"nodes\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    top_nodes_file = baseline_dir / \"top-nodes.txt\"\n",
+    "    with open(top_nodes_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved node resource usage to {top_nodes_file.name}\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Could not get node resource usage (metrics server may not be available)\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Canonical Diagnostics Script\n",
+    "\n",
+    "**This is the most important step.** Run the official LangChain diagnostics script that Support expects.\n",
+    "\n",
+    "The script captures:\n",
+    "- Pod logs (all containers)\n",
+    "- Events (sorted by timestamp)\n",
+    "- Resource usage (CPU, memory)\n",
+    "- Configuration (deployments, services, ingress)\n",
+    "- Storage (PVCs, storage classes)\n",
+    "- Network (services, endpoints)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "import subprocess\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "# URL to the canonical script\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = baseline_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "print(f\"1. Downloading script from: {script_url}\")\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    ok(f\"Downloaded script to {script_path.name}\")\n",
+    "    \n",
+    "    # Make executable\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    # Run the script\n",
+    "    print(f\"\\n2. Running diagnostics script for namespace: {namespace}\")\n",
+    "    print(\"   (This may take a few minutes...)\")\n",
+    "    \n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True  # Stream output so user can see progress\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed successfully\")\n",
+    "        \n",
+    "        # The script creates a tarball - find it\n",
+    "        diagnostics_tarball = None\n",
+    "        for file in baseline_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                diagnostics_tarball = file\n",
+    "                break\n",
+    "        \n",
+    "        if diagnostics_tarball:\n",
+    "            # Move it to our baseline directory\n",
+    "            target_path = baseline_dir / diagnostics_tarball.name\n",
+    "            diagnostics_tarball.rename(target_path)\n",
+    "            ok(f\"Diagnostics bundle saved to: {target_path.name}\")\n",
+    "            print(f\"   Size: {target_path.stat().st_size / 1024 / 1024:.2f} MB\")\n",
+    "        else:\n",
+    "            warn(\"Could not find diagnostics tarball (check script output above)\")\n",
+    "    else:\n",
+    "        warn(f\"Diagnostics script returned non-zero exit code: {result.returncode}\")\n",
+    "        print(\"   Check the output above for errors\")\n",
+    "        print(\"   💡 The script may still have collected useful information\")\n",
+    "        \n",
+    "except urllib.request.URLError as e:\n",
+    "    warn(f\"Could not download diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can download it manually and run it:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Error running diagnostics script: {e}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Basic Health Check\n",
+    "\n",
+    "Perform a basic HTTP check to verify the LangSmith endpoint is reachable.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import urllib3\n",
+    "\n",
+    "# Disable SSL warnings for self-signed certs\n",
+    "urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
+    "\n",
+    "print(\"### Testing Endpoint Reachability\\n\")\n",
+    "\n",
+    "# Determine endpoint URL (stays None if no domain or ingress host is found)\n",
+    "test_url = None\n",
+    "if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "    test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
+    "else:\n",
+    "    # Try to get from ingress\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ingresses = json.loads(result.stdout)\n",
+    "        for ingress in ingresses.get(\"items\", []):\n",
+    "            rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
+    "            for rule in rules:\n",
+    "                host = rule.get(\"host\", \"\")\n",
+    "                if host:\n",
+    "                    test_url = f\"https://{host}\"\n",
+    "                    break\n",
+    "            if test_url:\n",
+    "                break  # Use the first ingress host found\n",
+    "\n",
+    "if test_url:\n",
+    "    print(f\"Testing: {test_url}\")\n",
+    "    try:\n",
+    "        response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
+    "        \n",
+    "        health_file = baseline_dir / \"endpoint-health.txt\"\n",
+    "        with open(health_file, \"w\") as f:\n",
+    "            f.write(f\"URL: {test_url}\\n\")\n",
+    "            f.write(f\"Status Code: {response.status_code}\\n\")\n",
+    "            f.write(f\"Response Headers:\\n{json.dumps(dict(response.headers), indent=2)}\\n\")\n",
+    "        \n",
+    "        if response.status_code in [200, 302, 401, 403]:\n",
+    "            ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
+    "            print(f\"   Response saved to {health_file.name}\")\n",
+    "        else:\n",
+    "            warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
+    "    except requests.exceptions.SSLError as e:\n",
+    "        # verify=False already skips certificate validation, so this is a TLS handshake problem\n",
+    "        warn(f\"TLS handshake failed: {e}\")\n",
+    "        print(\"   💡 Check the ingress TLS configuration. In production, use proper TLS certificates.\")\n",
+    "    except requests.exceptions.RequestException as e:\n",
+    "        warn(f\"Could not reach endpoint: {e}\")\n",
+    "        print(\"   💡 Endpoint may still be provisioning or DNS not configured\")\n",
+    "else:\n",
+    "    warn(\"No endpoint URL available for testing\")\n",
+    "    print(\"   💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. What Good Looks Like\n",
+    "\n",
+    "Quick validation checks to confirm the baseline is healthy.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "print(\"### Quick Health Validation\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "healthy_pods = 0\n",
+    "unhealthy_pods = []\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        phase = pod.get(\"status\", {}).get(\"phase\", \"\")\n",
+    "        container_statuses = pod.get(\"status\", {}).get(\"containerStatuses\", [])\n",
+    "        \n",
+    "        is_ready = True\n",
+    "        for cs in container_statuses:\n",
+    "            if not cs.get(\"ready\", False):\n",
+    "                is_ready = False\n",
+    "                break\n",
+    "        \n",
+    "        if phase == \"Running\" and is_ready:\n",
+    "            healthy_pods += 1\n",
+    "        else:\n",
+    "            unhealthy_pods.append((name, phase, is_ready))\n",
+    "    \n",
+    "    if unhealthy_pods:\n",
+    "        warn(f\"Found {len(unhealthy_pods)} pod(s) that are not healthy:\")\n",
+    "        for name, phase, ready in unhealthy_pods:\n",
+    "            print(f\"   - {name}: phase={phase}, ready={ready}\")\n",
+    "    else:\n",
+    "        ok(f\"All {healthy_pods} pod(s) are healthy and ready\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
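+    "# Check for CrashLoopBackOff (a container waiting reason, not a pod phase)\n",
+    "if unhealthy_pods:\n",
+    "    crash_loops = [\n",
+    "        p.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        for p in pods.get(\"items\", [])\n",
+    "        if any(\n",
+    "            (cs.get(\"state\", {}).get(\"waiting\") or {}).get(\"reason\") == \"CrashLoopBackOff\"\n",
+    "            for cs in p.get(\"status\", {}).get(\"containerStatuses\", [])\n",
+    "        )\n",
+    "    ]\n",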
+    "    if crash_loops:\n",
+    "        warn(f\"Found {len(crash_loops)} pod(s) in CrashLoopBackOff:\")\n",
+    "        for name in crash_loops:\n",
+    "            print(f\"   - {name}\")\n",
+    "        print(\"   💡 Check pod logs to understand why they're crashing\")\n",
+    "\n",
+    "# Check for Pending pods\n",
+    "pending = [name for name, phase, _ in unhealthy_pods if phase == \"Pending\"]\n",
+    "if pending:\n",
+    "    warn(f\"Found {len(pending)} pod(s) in Pending state:\")\n",
+    "    for name in pending:\n",
+    "        print(f\"   - {name}\")\n",
+    "    print(\"   💡 Check events and resource availability\")\n",
+    "\n",
+    "print(\"\\n### Baseline Summary\\n\")\n",
+    "print(f\"✅ Baseline captured at: {timestamp}\")\n",
+    "print(f\"📁 Baseline directory: {baseline_dir}\")\n",
+    "print(\"📊 Resources captured:\")\n",
+    "print(\"   - Cluster state snapshot\")\n",
+    "print(\"   - Deployment descriptions\")\n",
+    "print(\"   - Recent events\")\n",
+    "print(\"   - Resource usage (if available)\")\n",
+    "print(\"   - Canonical diagnostics bundle\")\n",
+    "print(\"   - Endpoint health check\")\n",
+    "\n",
+    "ok(\"Baseline collection complete!\")\n",
+    "print(\"\\n💡 Use this baseline as your reference point for all failure labs.\")\n",
+    "print(\"   Compare future diagnostics to this baseline to identify what changed.\")\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,820 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - PostgreSQL\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug PostgreSQL connectivity failures in LangSmith.**\n",
+    "\n",
+    "PostgreSQL is LangSmith's primary metadata store. It holds:\n",
+    "- User accounts and workspaces\n",
+    "- Project definitions\n",
+    "- API keys and permissions\n",
+    "- Trace metadata (not the traces themselves, which go to ClickHouse)\n",
+    "\n",
+    "**When PostgreSQL fails, you'll see:**\n",
+    "- API endpoints return 5xx errors\n",
+    "- Login/authentication may fail\n",
+    "- UI may load but actions fail\n",
+    "- Connection exhaustion patterns in logs\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how PostgreSQL failures manifest\n",
+    "2. Practice collecting diagnostics for database issues\n",
+    "3. Learn to identify connection vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "PostgreSQL is LangSmith's **primary metadata store**. It holds:\n",
+    "\n",
+    "- **User accounts and authentication data**\n",
+    "- **Workspaces and projects** (organizational structure)\n",
+    "- **API keys and permissions** (access control)\n",
+    "- **Trace metadata** (not the trace data itself, which goes to ClickHouse)\n",
+    "- **Evaluation results and feedback**\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without PostgreSQL, users can't log in\n",
+    "- API calls fail (no authentication, no project lookups)\n",
+    "- UI loads but can't perform actions\n",
+    "- All LangSmith functionality depends on it\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Connection string stored in Kubernetes Secrets\n",
+    "- Connection pool managed by application\n",
+    "- Connection limits matter: PostgreSQL enforces a server-side `max_connections` limit, so the combined connection pools of all LangSmith pods must fit under it\n"
+   ]
+  },
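+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The connection-limit point above comes down to simple arithmetic: worst-case client connections versus what the server allows. The sketch below is illustrative only. `connection_budget` is a hypothetical helper (not part of the workshop's shared library), and the service names, replica counts, and pool sizes are placeholders rather than LangSmith defaults.\n",
+    "\n",
+    "```python\n",
+    "def connection_budget(services, max_connections, reserved=3):\n",
+    "    '''Compare worst-case client connections with the server limit.\n",
+    "\n",
+    "    services: dict of name -> (replicas, pool_size_per_replica)\n",
+    "    max_connections: the PostgreSQL max_connections setting\n",
+    "    reserved: slots held back for superusers (superuser_reserved_connections)\n",
+    "    '''\n",
+    "    worst_case = sum(replicas * pool for replicas, pool in services.values())\n",
+    "    available = max_connections - reserved\n",
+    "    return {\n",
+    "        'worst_case': worst_case,\n",
+    "        'available': available,\n",
+    "        'headroom': available - worst_case,\n",
+    "        'at_risk': worst_case > available,\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "# Hypothetical deployment: every number here is a placeholder\n",
+    "example = {'api': (3, 20), 'workers': (4, 10), 'platform': (2, 10)}\n",
+    "print(connection_budget(example, max_connections=100))\n",
+    "```\n",
+    "\n",
+    "A negative `headroom` means the pods can collectively request more connections than the server will grant, which is one way the connection exhaustion patterns described in this lab can arise.\n"
+   ]
+  },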
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When PostgreSQL Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **API 5xx errors:**\n",
+    "   - `/api/v1/...` endpoints return 500 or 503\n",
+    "   - Error messages mention \"database\" or \"connection\"\n",
+    "\n",
+    "2. **Login failures:**\n",
+    "   - Users can't authenticate\n",
+    "   - OIDC/SAML may work (redirects) but session creation fails\n",
+    "\n",
+    "3. **UI loads but actions fail:**\n",
+    "   - Pages render (static content)\n",
+    "   - API calls fail (can't load projects, traces, etc.)\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Connection timeout errors\n",
+    "   - \"connection refused\" or \"connection reset\"\n",
+    "   - \"too many connections\" or \"too many clients\" (if the server's connection limit is reached)\n",
+    "   - \"authentication failed\" (if credentials wrong)\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms appear within seconds of failure\n",
+    "- API calls start failing immediately\n",
+    "- Existing connections may work briefly, then fail\n"
+   ]
+  },
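+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The log patterns listed above map fairly directly onto the credential vs. network vs. capacity distinction this lab practices. The following is a minimal sketch of that mapping. The patterns follow common PostgreSQL client error wording and are a starting point, not an exhaustive list. The function names and sample lines are made up for illustration.\n",
+    "\n",
+    "```python\n",
+    "import re\n",
+    "\n",
+    "# Ordered (category, pattern) pairs: the first match wins\n",
+    "PATTERNS = [\n",
+    "    ('credentials', re.compile(r'password authentication failed|authentication failed', re.I)),\n",
+    "    ('capacity', re.compile(r'too many (connections|clients)', re.I)),\n",
+    "    ('dns', re.compile(r'could not translate host name|name or service not known', re.I)),\n",
+    "    ('network', re.compile(r'connection refused|connection reset|timed out|timeout|no route to host', re.I)),\n",
+    "]\n",
+    "\n",
+    "\n",
+    "def classify(line):\n",
+    "    '''Return the first matching failure category, or None.'''\n",
+    "    for category, pattern in PATTERNS:\n",
+    "        if pattern.search(line):\n",
+    "            return category\n",
+    "    return None\n",
+    "\n",
+    "\n",
+    "def summarize(lines):\n",
+    "    '''Count categorized lines so the dominant failure type stands out.'''\n",
+    "    counts = {}\n",
+    "    for line in lines:\n",
+    "        category = classify(line)\n",
+    "        if category:\n",
+    "            counts[category] = counts.get(category, 0) + 1\n",
+    "    return counts\n",
+    "\n",
+    "\n",
+    "# Made-up log lines for illustration\n",
+    "sample = [\n",
+    "    'FATAL: password authentication failed for user langsmith',\n",
+    "    'could not translate host name db.invalid to address',\n",
+    "    'connection to server failed: Connection refused',\n",
+    "    'FATAL: sorry, too many clients already',\n",
+    "    'GET /health 200',\n",
+    "]\n",
+    "print(summarize(sample))\n",
+    "```\n",
+    "\n",
+    "A single dominant category is a useful first hypothesis: credentials point at the Secret, DNS and network at the host value or a NetworkPolicy, and capacity at pool sizing.\n"
+   ]
+  },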
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle; Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong Database Password**\n",
+    "- Modify the PostgreSQL password in the Kubernetes Secret\n",
+    "- Symptoms: \"password authentication failed\" errors (the server is still reachable, so the connection itself is not refused)\n",
+    "\n",
+    "**Option B: Wrong Database Host**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures\n",
+    "\n",
+    "**Option C: Network Isolation (if NetworkPolicy supported)**\n",
+    "- Apply NetworkPolicy blocking egress to PostgreSQL\n",
+    "- Symptoms: Connection timeout, no route to host\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option D: Remove Secret Entirely**\n",
+    "- Delete the PostgreSQL connection secret\n",
+    "- Symptoms: Pods crash on startup, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
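+    "    # A crash-looping pod still reports phase Running, so also require ready containers\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\"\n",
+    "                  and all(cs.get(\"ready\", False)\n",
+    "                          for cs in p.get(\"status\", {}).get(\"containerStatuses\", [])))\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running and ready\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running and ready - check baseline first\")\n",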
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for PostgreSQL secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "postgres_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"postgres\" in name.lower() or \"database\" in name.lower() or \"db\" in name.lower():\n",
+    "            postgres_secrets.append(name)\n",
+    "\n",
+    "if postgres_secrets:\n",
+    "    ok(f\"Found {len(postgres_secrets)} PostgreSQL-related secret(s)\")\n",
+    "    for secret_name in postgres_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No PostgreSQL secrets found\")\n",
+    "    print(\"   💡 PostgreSQL connection may be configured differently\")\n"
+   ]
+  },
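+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The quick check above gives a single pass/fail signal. When comparing an incident snapshot against the baseline, a per-pod diff is often easier to read. The sketch below assumes both snapshots are parsed `kubectl get pods -o json` output. `pod_summary`, `diff_summaries`, `make_pod`, and the pod name `api-0` are hypothetical names used only for this example.\n",
+    "\n",
+    "```python\n",
+    "def pod_summary(pods_json):\n",
+    "    '''Map pod name -> (phase, ready containers, total restarts).'''\n",
+    "    summary = {}\n",
+    "    for pod in pods_json.get('items', []):\n",
+    "        name = pod.get('metadata', {}).get('name', '')\n",
+    "        statuses = pod.get('status', {}).get('containerStatuses', [])\n",
+    "        summary[name] = (\n",
+    "            pod.get('status', {}).get('phase', ''),\n",
+    "            sum(1 for cs in statuses if cs.get('ready')),\n",
+    "            sum(cs.get('restartCount', 0) for cs in statuses),\n",
+    "        )\n",
+    "    return summary\n",
+    "\n",
+    "\n",
+    "def diff_summaries(before, after):\n",
+    "    '''Return pods whose summary changed, appeared, or disappeared.'''\n",
+    "    changed = {}\n",
+    "    for name in sorted(set(before) | set(after)):\n",
+    "        if before.get(name) != after.get(name):\n",
+    "            changed[name] = {'before': before.get(name), 'after': after.get(name)}\n",
+    "    return changed\n",
+    "\n",
+    "\n",
+    "def make_pod(name, ready, restarts):\n",
+    "    '''Build a minimal pod object shaped like a kubectl JSON item.'''\n",
+    "    return {\n",
+    "        'metadata': {'name': name},\n",
+    "        'status': {\n",
+    "            'phase': 'Running',\n",
+    "            'containerStatuses': [{'ready': ready, 'restartCount': restarts}],\n",
+    "        },\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "# Made-up snapshots for illustration\n",
+    "baseline = {'items': [make_pod('api-0', True, 0)]}\n",
+    "incident = {'items': [make_pod('api-0', False, 4)]}\n",
+    "print(diff_summaries(pod_summary(baseline), pod_summary(incident)))\n",
+    "```\n",
+    "\n",
+    "Pods owned by a Deployment get new names after a rollout, so they show up here as one pod disappearing and another appearing. That is expected and is itself a sign that a restart happened.\n"
+   ]
+  },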
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong Database Password\n",
+    "# This cell modifies the PostgreSQL password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find PostgreSQL secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "postgres_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Common patterns: postgres, database, db, langsmith-db\n",
+    "        if any(keyword in name.lower() for keyword in [\"postgres\", \"database\", \"db\"]):\n",
+    "            # Check if it has password-related keys\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]):\n",
+    "                postgres_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not postgres_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find PostgreSQL secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found PostgreSQL secret: {postgres_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
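+    "# Parse the secret and drop server-managed metadata. A stale resourceVersion\n",
+    "# can make a later `kubectl apply` of the backup fail with a conflict.\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "for field in [\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"]:\n",
+    "    secret_data.get(\"metadata\", {}).pop(field, None)\n",
+    "\n",
+    "# Save original secret for restoration (owner-only permissions: it holds credentials)\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"postgres-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    yaml.safe_dump(secret_data, f)\n",
+    "backup_file.chmod(0o600)\n",
+    "\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",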
+    "\n",
+    "# Find password key (could be password, POSTGRES_PASSWORD, DB_PASSWORD, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"postgres-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {postgres_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
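+   "source": [
+    "## 7. Do the Drill - Step 2 (continued): Confirm and Apply\n",
+    "\n",
+    "**The next cell is commented out on purpose.** Confirm that the backup file exists and that you are in a test environment, then uncomment the cell and run it.\n"
+   ]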
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
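+    "# ok(\"Failure injection applied - the secret now holds an invalid PostgreSQL password\")\n",
+    "# print(\"\\n💡 Pods that read the secret through environment variables keep the old value until they restart.\")\n",
+    "# print(\"   If no symptoms appear, restart the workloads and watch the pods, for example:\")\n",
+    "# print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
+    "# print(f\"   kubectl get pods -n {namespace} -w\")\n",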
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Pod logs for connection errors\n",
+    "2. API endpoint responses\n",
+    "3. UI behavior\n",
+    "4. Events for pod restarts\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"postgres-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Report restart counts (RESTARTS is the fourth column; the first line is the header)\n",
+    "    pod_lines = [l for l in result.stdout.splitlines()[1:] if l.strip()]\n",
+    "    if pod_lines:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in pod_lines:\n",
+    "            parts = line.split()\n",
+    "            if len(parts) > 3:\n",
+    "                print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
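+    "    # List arguments are not shell-parsed (assuming run() does not use a shell), so the sort key needs no quotes\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",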
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check API pod logs for database errors\n",
+    "print(\"\\n3. Checking API pod logs for database errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if api_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"api-pod-{api_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for database-related errors\n",
+    "        error_keywords = [\"database\", \"postgres\", \"connection\", \"timeout\", \"refused\", \"authentication\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found database-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious database errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find API pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):  # Path.suffix would only return \".gz\"\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for PostgreSQL issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check pod logs for connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <pod-name> | grep -i 'database\\\\|postgres\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl describe secret {postgres_secret_name} -n {namespace}\")\n",
+    "print(\"   (describe lists key names and byte sizes only; -o yaml would print the base64 values, which decode trivially)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for pod restarts (indicates startup failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace}\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test database connectivity from a pod (if possible):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \\\\\")\n",
+    "print(\"     psql -h <db-host> -U <user> -d <database>\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for authentication/connection errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{postgres_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{postgres_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with database connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    db_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            db_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
+    "                                           for kw in [\"DB\", \"POSTGRES\", \"DATABASE\"])]\n",
+    "            if db_env:\n",
+    "                db_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if db_related_pods:\n",
+    "        print(f\"\\n   Pods with database environment variables:\")\n",
+    "        for pod_name in set(db_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
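+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Triage often needs to confirm a secret's structure without exposing its contents. The sketch below shows one way to do that in Python: it reports key names and decoded value lengths only. `summarize_secret` is a hypothetical helper, and the sample secret is made up. Real input would be the parsed output of `kubectl get secret <name> -o json`.\n",
+    "\n",
+    "```python\n",
+    "import base64\n",
+    "\n",
+    "\n",
+    "def summarize_secret(secret_json):\n",
+    "    '''Return key names and decoded value lengths, never the values.'''\n",
+    "    summary = {}\n",
+    "    for key, encoded in (secret_json.get('data') or {}).items():\n",
+    "        try:\n",
+    "            summary[key] = len(base64.b64decode(encoded))\n",
+    "        except (ValueError, TypeError):\n",
+    "            summary[key] = None  # value is not valid base64\n",
+    "    return summary\n",
+    "\n",
+    "\n",
+    "# Made-up secret for illustration\n",
+    "fake = {'data': {'password': base64.b64encode(b'INVALID_PASSWORD_12345').decode()}}\n",
+    "print(summarize_secret(fake))\n",
+    "```\n",
+    "\n",
+    "A key whose length differs from the baseline is a hint that the value was rotated or overwritten, and the summary can be pasted into an escalation without redaction work.\n"
+   ]
+  },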
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
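+    "#     print(\"\\n💡 Pods pick up the restored secret only after they restart.\")\n",
+    "#     print(\"   If they do not recover on their own, restart the workloads and monitor, for example:\")\n",
+    "#     print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",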
+    "#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "print(f\"   Backup file: {backup_file.name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
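+    "    # A crash-looping pod still reports phase Running, so also require ready containers\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\"\n",
+    "                  and all(cs.get(\"ready\", False)\n",
+    "                          for cs in p.get(\"status\", {}).get(\"containerStatuses\", [])))\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running and ready\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running and ready\")\n",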
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in logs\n",
+    "print(\"\\nChecking for recent errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if api_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"database\", \"postgres\", \"connection\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in API logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
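+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The recovery check above runs once, after a fixed wait. An alternative is to poll until the pods are ready or a timeout expires. The sketch below is one possible shape for that. `all_ready` and `wait_until` are hypothetical helpers, and `fetch` stands for any callable that returns parsed `kubectl get pods -o json` output.\n",
+    "\n",
+    "```python\n",
+    "import time\n",
+    "\n",
+    "\n",
+    "def all_ready(pods_json):\n",
+    "    '''True when every non-completed pod is Running with all containers ready.'''\n",
+    "    items = [p for p in pods_json.get('items', [])\n",
+    "             if p.get('status', {}).get('phase') != 'Succeeded']\n",
+    "    if not items:\n",
+    "        return False\n",
+    "    for pod in items:\n",
+    "        status = pod.get('status', {})\n",
+    "        statuses = status.get('containerStatuses', [])\n",
+    "        if status.get('phase') != 'Running' or not statuses:\n",
+    "            return False\n",
+    "        if not all(cs.get('ready', False) for cs in statuses):\n",
+    "            return False\n",
+    "    return True\n",
+    "\n",
+    "\n",
+    "def wait_until(fetch, timeout=180, interval=5, sleep=time.sleep, clock=time.monotonic):\n",
+    "    '''Poll fetch() until all_ready() passes or the timeout expires.'''\n",
+    "    deadline = clock() + timeout\n",
+    "    while True:\n",
+    "        if all_ready(fetch()):\n",
+    "            return True\n",
+    "        if clock() >= deadline:\n",
+    "            return False\n",
+    "        sleep(interval)\n",
+    "\n",
+    "\n",
+    "# Made-up snapshot for illustration\n",
+    "sample = {'items': [{'status': {'phase': 'Running', 'containerStatuses': [{'ready': True}]}}]}\n",
+    "print(all_ready(sample))\n",
+    "```\n",
+    "\n",
+    "In this notebook, `fetch` could wrap the same `kubectl get pods` call used in the cell above. Injecting `sleep` and `clock` keeps the loop easy to test without real delays.\n"
+   ]
+  },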
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a PostgreSQL issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **PostgreSQL connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Database name\n",
+    "   - Username (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Database migrations\n",
+    "   - Network policy changes\n",
+    "5. **Connection pool status:**\n",
+    "   - Current connections vs. max connections\n",
+    "   - Connection pool exhaustion patterns\n",
+    "6. **Database health (if accessible):**\n",
+    "   - PostgreSQL version\n",
+    "   - Active connections\n",
+    "   - Lock contention\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Pod logs with database errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- Database endpoint connectivity test\n",
+    "- Connection pool metrics (if available)\n",
+    "- PostgreSQL logs (if accessible via cloud provider)\n"
+   ]
+  },
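+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Support asks for the timestamp of the first occurrence, which is easy to lose in a long log. The sketch below pulls it out of timestamped log lines. It assumes each line starts with an RFC 3339 timestamp followed by a space, which is the layout produced by `kubectl logs --timestamps`. `first_occurrence` and the sample lines are made up for illustration.\n",
+    "\n",
+    "```python\n",
+    "def first_occurrence(log_lines, keywords):\n",
+    "    '''Return (timestamp, message) of the first line containing any keyword.'''\n",
+    "    lowered = [kw.lower() for kw in keywords]\n",
+    "    for line in log_lines:\n",
+    "        # Split the leading timestamp from the rest of the line\n",
+    "        timestamp, _, message = line.partition(' ')\n",
+    "        if message and any(kw in message.lower() for kw in lowered):\n",
+    "            return timestamp, message\n",
+    "    return None\n",
+    "\n",
+    "\n",
+    "# Made-up log lines for illustration\n",
+    "sample = [\n",
+    "    '2024-01-01T10:00:00.000000000Z request completed status=200',\n",
+    "    '2024-01-01T10:00:07.120000000Z password authentication failed for user langsmith',\n",
+    "    '2024-01-01T10:00:09.450000000Z password authentication failed for user langsmith',\n",
+    "]\n",
+    "print(first_occurrence(sample, ['authentication failed', 'connection refused']))\n",
+    "```\n",
+    "\n",
+    "Pairing this timestamp with the list of recent changes usually narrows the escalation to a single candidate cause.\n"
+   ]
+  },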
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **PostgreSQL failures manifest quickly** - API calls fail within seconds\n",
+    "2. **Logs are your friend** - Connection errors appear in pod logs immediately\n",
+    "3. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "4. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "5. **Diagnostics bundle is essential** - Support needs it for root cause analysis\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Changing multiple things at once (hard to identify root cause)\n",
+    "- ❌ Not collecting diagnostics before remediation\n",
+    "- ❌ Ignoring connection pool limits\n",
+    "- ❌ Not testing database connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the Redis, ClickHouse, or Blob Storage failure labs\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,836 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - Redis\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug Redis connectivity failures in LangSmith.**\n",
+    "\n",
+    "Redis is LangSmith's **cache and job queue**. It handles:\n",
+    "- Job queue for asynchronous trace processing\n",
+    "- Caching for frequently accessed data\n",
+    "- Rate limiting and session management\n",
+    "- Worker coordination\n",
+    "\n",
+    "**When Redis fails, you'll see:**\n",
+    "- Intermittent ingestion issues\n",
+    "- Latency spikes and retries\n",
+    "- Worker backlog (jobs piling up)\n",
+    "- Traces may be delayed or missing\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how Redis failures manifest\n",
+    "2. Practice collecting diagnostics for cache/queue issues\n",
+    "3. Learn to identify connection vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "Redis is LangSmith's **cache and job queue**. It handles:\n",
+    "\n",
+    "- **Job queue for asynchronous processing:**\n",
+    "  - Workers pull trace processing jobs from Redis\n",
+    "  - Jobs are queued when traces arrive via API\n",
+    "  - Queue backlog indicates processing delays\n",
+    "\n",
+    "- **Caching:**\n",
+    "  - Frequently accessed data (project metadata, user info)\n",
+    "  - Reduces load on PostgreSQL\n",
+    "  - Improves response times\n",
+    "\n",
+    "- **Rate limiting and session management:**\n",
+    "  - API rate limiting\n",
+    "  - Session storage (if configured)\n",
+    "\n",
+    "- **Worker coordination:**\n",
+    "  - Distributed locking\n",
+    "  - Task distribution\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without Redis, workers can't process traces\n",
+    "- Job queue fills up, causing delays\n",
+    "- Cache misses increase load on PostgreSQL\n",
+    "- Ingestion becomes unreliable\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Connection string stored in Kubernetes Secrets\n",
+    "- Workers connect to Redis to pull jobs\n",
+    "- API servers use Redis for caching\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When Redis Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **Intermittent ingestion issues:**\n",
+    "   - Some traces process, others don't\n",
+    "   - Inconsistent behavior (works sometimes, fails other times)\n",
+    "   - Retries visible in logs\n",
+    "\n",
+    "2. **Latency spikes:**\n",
+    "   - API responses slow down\n",
+    "   - Worker processing delays\n",
+    "   - Timeout errors\n",
+    "\n",
+    "3. **Worker backlog:**\n",
+    "   - Jobs piling up in queue\n",
+    "   - Workers unable to pull new jobs\n",
+    "   - Queue length increasing\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Connection timeout errors\n",
+    "   - \"connection refused\" or \"connection reset\"\n",
+    "   - \"WRONGPASS\" (wrong password, Redis 6+) or \"NOAUTH Authentication required\" (no password sent)\n",
+    "   - Retry attempts in worker logs\n",
+    "   - Cache miss patterns\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms may be intermittent (connection pool retries)\n",
+    "- Worker backlog builds over time\n",
+    "- Cache misses cause cascading delays\n",
+    "- Full failure if connection pool exhausted\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong Redis Password**\n",
+    "- Modify the Redis password in the Kubernetes Secret\n",
+    "- Symptoms: Authentication errors (`WRONGPASS`/`NOAUTH`) in logs, client retries, intermittent failures\n",
+    "\n",
+    "**Option B: Block Egress to Redis Endpoint**\n",
+    "- Apply NetworkPolicy blocking egress to Redis (if NetworkPolicy supported)\n",
+    "- Symptoms: Connection timeout, no route to host, intermittent failures\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option C: Wrong Redis Host/Endpoint**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    phases = [p.get(\"status\", {}).get(\"phase\") for p in pods.get(\"items\", [])]\n",
+    "    healthy = phases.count(\"Running\")\n",
+    "    total = sum(1 for ph in phases if ph != \"Succeeded\")  # completed Job pods are not failures\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for Redis secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "redis_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"redis\" in name.lower() or \"cache\" in name.lower():\n",
+    "            redis_secrets.append(name)\n",
+    "\n",
+    "if redis_secrets:\n",
+    "    ok(f\"Found {len(redis_secrets)} Redis-related secret(s)\")\n",
+    "    for secret_name in redis_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No Redis secrets found\")\n",
+    "    print(\"   💡 Redis connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong Redis Password\n",
+    "# This cell modifies the Redis password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find Redis secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "redis_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Common patterns: redis, cache\n",
+    "        if any(keyword in name.lower() for keyword in [\"redis\", \"cache\"]):\n",
+    "            # Check if it has password-related keys\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]):\n",
+    "                redis_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not redis_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find Redis secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found Redis secret: {redis_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "# Parse the secret and drop server-managed metadata so a stale resourceVersion cannot block the restore\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
+    "    secret_data.get(\"metadata\", {}).pop(field, None)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "# Save original secret for restoration\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"redis-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    yaml.safe_dump(secret_data, f)\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
+    "\n",
+    "# Find password key (could be password, REDIS_PASSWORD, CACHE_PASSWORD, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"redis-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {redis_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 7. Do the Drill - Step 2 (continued): Apply the Modified Secret\n"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - Redis password is now invalid\")\n",
+    "# print(\"\\n💡 Pods that read the secret through environment variables keep the old value until restarted:\")\n",
+    "# print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
+    "# print(f\"   kubectl get pods -n {namespace} -w   # restarts may take 1-2 minutes\")\n",
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Worker pod logs for Redis connection errors\n",
+    "2. Queue backlog (if visible)\n",
+    "3. Worker retry patterns\n",
+    "4. Latency in API responses\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"redis-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Check for restarts\n",
+    "    lines = result.stdout.split(\"\\n\")\n",
+    "    rows = [l for l in lines if l.strip() and not l.startswith(\"NAME\")]\n",
+    "    if rows:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in rows:\n",
+    "            parts = line.split()\n",
+    "            # Columns: NAME READY STATUS RESTARTS AGE ...\n",
+    "            if len(parts) > 3:\n",
+    "                print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.strip().split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check worker pod logs for Redis errors\n",
+    "print(\"\\n3. Checking worker pod logs for Redis errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for Redis-related errors\n",
+    "        error_keywords = [\"redis\", \"cache\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"noauth\", \"wrongpass\", \"retry\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found Redis-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious Redis errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find worker pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for Redis issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check worker pod logs for Redis connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'redis\\\\|cache\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl describe secret {redis_secret_name} -n {namespace}\")\n",
+    "print(\"   (describe shows key names and sizes only; -o yaml would print the base64-encoded values)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test Redis connectivity from a pod (if possible):\")\n",
+    "print(f\"   kubectl run -it --rm debug -n {namespace} --image=redis:7 --restart=Never -- \\\\\")\n",
+    "print(\"     redis-cli -h <redis-host> -p <port> -a <password> ping\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for connection/authentication errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{redis_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{redis_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with Redis connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    redis_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            redis_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
+    "                                              for kw in [\"REDIS\", \"CACHE\"])]\n",
+    "            if redis_env:\n",
+    "                redis_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if redis_related_pods:\n",
+    "        print(f\"\\n   Pods with Redis environment variables:\")\n",
+    "        for pod_name in set(redis_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
+    "#     print(\"\\n💡 Pods only read the restored secret after a restart. If they have not restarted:\")\n",
+    "#     print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
+    "#     print(f\"   kubectl get pods -n {namespace} -w   # recovery may take 1-2 minutes\")\n",
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "print(f\"   Backup file: {backup_file.name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    phases = [p.get(\"status\", {}).get(\"phase\") for p in pods.get(\"items\", [])]\n",
+    "    running = phases.count(\"Running\")\n",
+    "    total = sum(1 for ph in phases if ph != \"Succeeded\")  # completed Job pods are not failures\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in worker logs\n",
+    "print(\"\\nChecking for recent errors in worker logs...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"timeout\", \"refused\", \"noauth\", \"wrongpass\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in worker logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a Redis issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **Redis connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Port\n",
+    "   - Whether authentication is enabled (never send the password itself)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "   - Retry patterns\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Network policy changes\n",
+    "   - Redis configuration changes\n",
+    "5. **Queue status (if accessible):**\n",
+    "   - Queue length\n",
+    "   - Worker processing rate\n",
+    "   - Backlog growth rate\n",
+    "6. **Redis health (if accessible):**\n",
+    "   - Redis version\n",
+    "   - Memory usage\n",
+    "   - Connection count\n",
+    "   - Slow commands (e.g. from `SLOWLOG GET`)\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Worker pod logs with Redis errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- Redis endpoint connectivity test\n",
+    "- Queue metrics (if available)\n",
+    "- Redis logs (if accessible via cloud provider)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **Redis failures can be intermittent** - Connection pool retries may mask issues\n",
+    "2. **Worker logs are critical** - Redis errors appear in worker pod logs\n",
+    "3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
+    "4. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
+    "- ❌ Not checking worker logs (API logs may not show Redis errors)\n",
+    "- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
+    "- ❌ Not testing Redis connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the ClickHouse or Blob Storage failure labs\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,837 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - ClickHouse\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug ClickHouse connectivity failures in LangSmith.**\n",
+    "\n",
+    "ClickHouse is LangSmith's **trace storage**. It handles:\n",
+    "- Storing trace data (spans, events, metadata)\n",
+    "- Time-series queries for trace search and filtering\n",
+    "- High-volume writes from workers\n",
+    "- Efficient querying for UI display\n",
+    "\n",
+    "**When ClickHouse fails, you'll see:**\n",
+    "- Traces delayed or missing\n",
+    "- Insert errors and merge/backlog hints\n",
+    "- UI loads but traces don't appear\n",
+    "- Query timeouts\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how ClickHouse failures manifest\n",
+    "2. Practice collecting diagnostics for trace storage issues\n",
+    "3. Learn to identify connection vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "ClickHouse is LangSmith's **trace storage**. It handles:\n",
+    "\n",
+    "- **Trace data storage:**\n",
+    "  - Spans, events, and metadata for every trace\n",
+    "  - High-volume writes from workers as traces are ingested\n",
+    "\n",
+    "- **Trace queries:**\n",
+    "  - Time-series queries for trace search and filtering\n",
+    "  - Efficient reads for UI display\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without ClickHouse, workers can't persist processed traces\n",
+    "- Failed inserts leave traces delayed or missing\n",
+    "- The UI may still load, but traces don't appear\n",
+    "- Trace search and filtering time out\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Connection details (host, port, password) are stored in Kubernetes Secrets\n",
+    "- Workers connect to ClickHouse to write trace data\n",
+    "- Trace search and UI display read from ClickHouse\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When ClickHouse Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **Traces delayed or missing:**\n",
+    "   - Some traces appear, others don't\n",
+    "   - Inconsistent behavior (works sometimes, fails other times)\n",
+    "   - Retries visible in logs\n",
+    "\n",
+    "2. **Query problems in the UI:**\n",
+    "   - UI loads but traces don't appear\n",
+    "   - Trace search and filtering slow down\n",
+    "   - Query timeouts\n",
+    "\n",
+    "3. **Insert failures and backlog:**\n",
+    "   - Insert errors in worker logs\n",
+    "   - Merge/backlog hints as writes fall behind\n",
+    "   - Workers retrying the same writes\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Connection timeout errors\n",
+    "   - \"connection refused\" or \"connection reset\"\n",
+    "   - Authentication failures (if the password is wrong)\n",
+    "   - Retry attempts in worker logs\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms may be intermittent (connection pool retries)\n",
+    "- Unwritten traces accumulate over time\n",
+    "- Gaps in the UI widen as more traces fail to land\n",
+    "- Full failure if the connection pool is exhausted\n"
+   ]
+  },
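+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Sketch: grouping log lines by failure class**\n",
+    "\n",
+    "The log patterns above can be grouped into rough failure classes, which helps separate credential problems from network problems. This is a minimal sketch, not part of the lab helpers: the name `classify_log_line` and the keyword lists are assumptions, and the exact wording of errors in your deployment may differ.\n",
+    "\n",
+    "```python\n",
+    "# Hypothetical keyword lists; adjust them to the messages you actually see.\n",
+    "FAILURE_PATTERNS = {\n",
+    "    'credential': ['authentication failed', 'password is incorrect'],\n",
+    "    'dns': ['name or service not known', 'no such host'],\n",
+    "    'network': ['connection refused', 'connection reset', 'timed out', 'timeout'],\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def classify_log_line(line):\n",
+    "    '''Return the first failure class whose keyword appears in the line, else None.'''\n",
+    "    lowered = line.lower()\n",
+    "    for failure_class, keywords in FAILURE_PATTERNS.items():\n",
+    "        if any(keyword in lowered for keyword in keywords):\n",
+    "            return failure_class\n",
+    "    return None\n",
+    "\n",
+    "\n",
+    "# Made-up sample lines for illustration only.\n",
+    "sample_lines = [\n",
+    "    'ERROR clickhouse insert failed: connection refused',\n",
+    "    'ERROR Authentication failed: password is incorrect',\n",
+    "    'INFO processed batch of 200 runs',\n",
+    "]\n",
+    "for line in sample_lines:\n",
+    "    print(classify_log_line(line), '<-', line)\n",
+    "```\n",
+    "\n",
+    "Classifying first narrows where to look next: a credential class points at the Secret, while a network or DNS class points at the endpoint, NetworkPolicy, or connection string.\n"
+   ]
+  },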
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong ClickHouse Password**\n",
+    "- Modify the ClickHouse password in the Kubernetes Secret\n",
+    "- Symptoms: Authentication failures, connection refused, intermittent failures\n",
+    "\n",
+    "**Option B: Block Egress to ClickHouse Endpoint**\n",
+    "- Apply NetworkPolicy blocking egress to ClickHouse (if NetworkPolicy supported)\n",
+    "- Symptoms: Connection timeout, no route to host, intermittent failures\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option C: Wrong ClickHouse Host/Endpoint**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
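+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Sketch: building the Option B NetworkPolicy**\n",
+    "\n",
+    "Option B has no cell in this lab, so the snippet below shows one possible shape for it. It is a sketch under assumptions: the helper name `build_egress_block_policy`, the policy name, the namespace, the `app=langsmith-worker` label, and the ClickHouse address are all placeholders. It only builds and prints a manifest; it does not apply anything.\n",
+    "\n",
+    "```python\n",
+    "import json\n",
+    "\n",
+    "\n",
+    "def build_egress_block_policy(namespace, blocked_cidr, pod_labels):\n",
+    "    '''Build a NetworkPolicy that allows all egress except to blocked_cidr.'''\n",
+    "    return {\n",
+    "        'apiVersion': 'networking.k8s.io/v1',\n",
+    "        'kind': 'NetworkPolicy',\n",
+    "        'metadata': {'name': 'lab-block-clickhouse-egress', 'namespace': namespace},\n",
+    "        'spec': {\n",
+    "            'podSelector': {'matchLabels': pod_labels},\n",
+    "            'policyTypes': ['Egress'],\n",
+    "            'egress': [\n",
+    "                {'to': [{'ipBlock': {'cidr': '0.0.0.0/0', 'except': [blocked_cidr]}}]}\n",
+    "            ],\n",
+    "        },\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "policy = build_egress_block_policy(\n",
+    "    namespace='langsmith',                   # placeholder namespace\n",
+    "    blocked_cidr='10.0.0.5/32',              # placeholder ClickHouse address\n",
+    "    pod_labels={'app': 'langsmith-worker'},  # assumed worker label\n",
+    ")\n",
+    "print(json.dumps(policy, indent=2))\n",
+    "```\n",
+    "\n",
+    "`kubectl apply -f` accepts JSON manifests, so the printed output could be saved to a file, applied, and later removed with `kubectl delete -f` to restore connectivity. NetworkPolicies are additive allow-lists: this blocks traffic only if the cluster's network plugin enforces policies and no other policy selecting the same pods allows that address.\n"
+   ]
+  },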
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for ClickHouse secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "clickhouse_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"clickhouse\" in name.lower():\n",
+    "            clickhouse_secrets.append(name)\n",
+    "\n",
+    "if clickhouse_secrets:\n",
+    "    ok(f\"Found {len(clickhouse_secrets)} ClickHouse-related secret(s)\")\n",
+    "    for secret_name in clickhouse_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No ClickHouse secrets found\")\n",
+    "    print(\"   💡 ClickHouse connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong ClickHouse Password\n",
+    "# This cell modifies the ClickHouse password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find ClickHouse secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "clickhouse_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Common pattern: secret name contains \"clickhouse\"\n",
+    "        if \"clickhouse\" in name.lower():\n",
+    "            # Check if it has password-related keys\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\"]):\n",
+    "                clickhouse_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not clickhouse_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find ClickHouse secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found ClickHouse secret: {clickhouse_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "# Save original secret for restoration\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"clickhouse-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    f.write(result.stdout)\n",
+    "\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
+    "\n",
+    "# Parse YAML and modify password\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "\n",
+    "# Find password key (could be password, CLICKHOUSE_PASSWORD, clickhouse-password, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"clickhouse-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {clickhouse_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
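+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Sketch: confirming what the injection changes without printing values**\n",
+    "\n",
+    "The cell above backs up the Secret and writes a modified copy. Before applying, it can be reassuring to confirm that only the password key differs. The sketch below compares two `data` mappings by key name and never prints a value. The helper names `changed_keys` and `encode` and the sample data are made up for illustration.\n",
+    "\n",
+    "```python\n",
+    "import base64\n",
+    "\n",
+    "\n",
+    "def changed_keys(original_data, modified_data):\n",
+    "    '''Return sorted names of keys whose encoded values differ; never returns values.'''\n",
+    "    all_keys = set(original_data) | set(modified_data)\n",
+    "    return sorted(\n",
+    "        key for key in all_keys\n",
+    "        if original_data.get(key) != modified_data.get(key)\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "def encode(value):\n",
+    "    '''Base64-encode a string the way Secret data values are stored.'''\n",
+    "    return base64.b64encode(value.encode()).decode()\n",
+    "\n",
+    "\n",
+    "# Made-up data standing in for the 'data' section of two Secret manifests.\n",
+    "original = {\n",
+    "    'password': encode('example-original'),\n",
+    "    'host': encode('clickhouse.example.internal'),\n",
+    "}\n",
+    "modified = {\n",
+    "    'password': encode('INVALID_PASSWORD_12345'),\n",
+    "    'host': encode('clickhouse.example.internal'),\n",
+    "}\n",
+    "\n",
+    "print('Keys changed by the injection:', changed_keys(original, modified))\n",
+    "```\n",
+    "\n",
+    "In the lab, the two mappings would come from the `data` sections of the backup file and the modified file. Comparing key names keeps the check secrets-safe.\n"
+   ]
+  },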
+  {
+   "cell_type": "markdown",
+   "metadata": {},
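+   "source": [
+    "## 7. Do the Drill - Step 2 (continued): Apply the Injection\n",
+    "\n",
+    "**The next cell is commented out on purpose. Uncomment it only when you are ready to break ClickHouse connectivity.**\n"
+   ]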
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - ClickHouse password is now invalid\")\n",
+    "# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
+    "# print(\"   This may take 1-2 minutes. Watch for pod restarts:\")\n",
+    "# print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Worker pod logs for ClickHouse connection errors\n",
+    "2. Queue backlog (if visible)\n",
+    "3. Worker retry patterns\n",
+    "4. Latency in API responses\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"clickhouse-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Report restart counts (RESTARTS is the 4th column of the table output)\n",
+    "    pod_lines = [l for l in result.stdout.splitlines()[1:] if l.strip()]  # Skip header\n",
+    "    if pod_lines:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in pod_lines:\n",
+    "            parts = line.split()\n",
+    "            if len(parts) > 3:\n",
+    "                print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check worker pod logs for ClickHouse errors\n",
+    "print(\"\\n3. Checking worker pod logs for ClickHouse errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for ClickHouse-related errors\n",
+    "        # Keywords must be lowercase: each line is lowercased before matching\n",
+    "        error_keywords = [\"clickhouse\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"retry\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found ClickHouse-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious ClickHouse errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find worker pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
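+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Sketch: reading restart counts from JSON instead of table columns**\n",
+    "\n",
+    "The restart check above reads the fourth column of the table output, which depends on the column layout. A less fragile alternative is to read `restartCount` from `kubectl get pods -o json`. This is a minimal sketch: the helper name `restart_counts` and the sample data are made up, and the function works on a JSON string so it can be tried without a cluster.\n",
+    "\n",
+    "```python\n",
+    "import json\n",
+    "\n",
+    "\n",
+    "def restart_counts(pods_json):\n",
+    "    '''Map pod name to total container restarts from kubectl get pods -o json output.'''\n",
+    "    counts = {}\n",
+    "    for pod in json.loads(pods_json).get('items', []):\n",
+    "        name = pod.get('metadata', {}).get('name', '')\n",
+    "        statuses = pod.get('status', {}).get('containerStatuses', [])\n",
+    "        counts[name] = sum(status.get('restartCount', 0) for status in statuses)\n",
+    "    return counts\n",
+    "\n",
+    "\n",
+    "# Made-up sample shaped like the kubectl JSON output.\n",
+    "sample = json.dumps({\n",
+    "    'items': [\n",
+    "        {'metadata': {'name': 'worker-abc'},\n",
+    "         'status': {'containerStatuses': [{'restartCount': 3}]}},\n",
+    "        {'metadata': {'name': 'backend-xyz'},\n",
+    "         'status': {'containerStatuses': [{'restartCount': 0}]}},\n",
+    "    ]\n",
+    "})\n",
+    "print(restart_counts(sample))\n",
+    "```\n",
+    "\n",
+    "In the lab, `pods_json` would be the `stdout` of the same `kubectl get pods -n <namespace> -o json` call used in the baseline check. Comparing these counts against the baseline shows which pods started restarting after the injection.\n"
+   ]
+  },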
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for ClickHouse issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check worker pod logs for ClickHouse connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'clickhouse\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl describe secret {clickhouse_secret_name} -n {namespace}\")\n",
+    "print(\"   (describe lists key names and sizes only; '-o yaml' would print the base64-encoded values)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test ClickHouse connectivity from a pod (if possible):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=clickhouse/clickhouse-server --restart=Never -- \\\\\")\n",
+    "print(\"     clickhouse-client --host <clickhouse-host> --port <port> --password <password> --query 'SELECT 1'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for connection/authentication errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{clickhouse_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{clickhouse_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with ClickHouse connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    clickhouse_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            clickhouse_env = [e for e in env if \"CLICKHOUSE\" in e.get(\"name\", \"\").upper()]\n",
+    "            if clickhouse_env:\n",
+    "                clickhouse_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if clickhouse_related_pods:\n",
+    "        print(f\"\\n   Pods with ClickHouse environment variables:\")\n",
+    "        for pod_name in set(clickhouse_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
+    "#     print(\"\\n💡 Pods need to restart to pick up the restored secret.\")\n",
+    "#     print(\"   This may take 1-2 minutes. Monitor pod status:\")\n",
+    "#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "if 'backup_file' in locals():\n",
+    "    print(f\"   Backup file: {backup_file.name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in worker logs\n",
+    "print(\"\\nChecking for recent errors in worker logs...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"clickhouse\", \"connection\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in worker logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a ClickHouse issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **ClickHouse connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Port\n",
+    "   - Password (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "   - Retry patterns\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Network policy changes\n",
+    "   - ClickHouse configuration changes\n",
+    "5. **Queue status (if accessible):**\n",
+    "   - Queue length\n",
+    "   - Worker processing rate\n",
+    "   - Backlog growth rate\n",
+    "6. **ClickHouse health (if accessible):**\n",
+    "   - ClickHouse version\n",
+    "   - Memory usage\n",
+    "   - Connection count\n",
+    "   - Slow queries\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Worker pod logs with ClickHouse errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- ClickHouse endpoint connectivity test\n",
+    "- Queue metrics (if available)\n",
+    "- ClickHouse logs (if accessible via cloud provider)\n"
+   ]
+  },
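+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Sketch: redacting log excerpts before sharing them**\n",
+    "\n",
+    "Support needs full error text, but connection details should go out redacted. The sketch below masks the value of any key that looks sensitive in a `key=value` or `key: value` pair. It is an illustration under assumptions: the name `redact` and the key pattern are made up, the pattern may over-redact or miss values that contain spaces, and it is no substitute for reviewing the text before sending it.\n",
+    "\n",
+    "```python\n",
+    "import re\n",
+    "\n",
+    "# Hypothetical pattern: a key containing a sensitive word, a separator, then one token.\n",
+    "SENSITIVE = re.compile(\n",
+    "    r'([\\w-]*(?:password|secret|token|api[_-]?key)[\\w-]*)(\\s*[=:]\\s*)(\\S+)',\n",
+    "    re.IGNORECASE,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def redact(text):\n",
+    "    '''Replace the value of any sensitive-looking key with [REDACTED].'''\n",
+    "    return SENSITIVE.sub(r'\\1\\2[REDACTED]', text)\n",
+    "\n",
+    "\n",
+    "# Made-up log line for illustration only.\n",
+    "print(redact('ERROR connect failed CLICKHOUSE_PASSWORD=abc123 host=ch.example.internal'))\n",
+    "```\n",
+    "\n",
+    "Running log excerpts through a filter like this before pasting them into an escalation keeps the error text intact while dropping credential values.\n"
+   ]
+  },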
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **ClickHouse failures can be intermittent** - Connection pool retries may mask issues\n",
+    "2. **Worker logs are critical** - ClickHouse errors appear in worker pod logs\n",
+    "3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
+    "4. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
+    "- ❌ Not checking worker logs (API logs may not show ClickHouse errors)\n",
+    "- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
+    "- ❌ Not testing ClickHouse connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the Blob Storage failure lab\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,846 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - Blob Storage\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug Blob Storage configuration failures in LangSmith.**\n",
+    "\n",
+    "Blob Storage is LangSmith's **large payload storage**. It handles:\n",
+    "- Large trace payloads (such as big inputs and outputs) that would otherwise land in ClickHouse\n",
+    "- Artifacts that workers upload while processing traces\n",
+    "- Reads of those payloads when a trace is viewed\n",
+    "- Keeping ClickHouse insert volume and storage usage manageable under load\n",
+    "\n",
+    "**When Blob Storage fails, you'll see:**\n",
+    "- Large payload traces degrade ClickHouse performance\n",
+    "- Warnings/errors in logs about artifact storage\n",
+    "- Increased ClickHouse pressure and latency under load\n",
+    "- Traces with large payloads fail to store properly\n",
+    "- Bucket access errors such as \"No such bucket\" or \"Access Denied\"\n",
+    "- Credential errors when workers upload artifacts\n",
+    "- Symptoms that build gradually as load increases\n",
+    "- Large traces may fail or be rejected\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how Blob Storage failures manifest\n",
+    "2. Practice collecting diagnostics for Blob Storage issues\n",
+    "3. Learn to distinguish bucket configuration, credential, and network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "Blob Storage is LangSmith's **large payload store**. It handles:\n",
+    "\n",
+    "- **Large trace payloads:**\n",
+    "  - Large inputs and outputs are written to a bucket instead of ClickHouse\n",
+    "  - Workers upload these payloads as artifacts while processing traces\n",
+    "  - Upload failures surface as artifact storage errors in worker logs\n",
+    "\n",
+    "- **ClickHouse offload:**\n",
+    "  - Fewer large inserts into ClickHouse\n",
+    "  - Less merge and storage pressure\n",
+    "  - More stable query latency under load\n",
+    "\n",
+    "- **Bucket access:**\n",
+    "  - An S3-compatible or cloud-provider bucket\n",
+    "  - Credentials used to reach that bucket\n",
+    "\n",
+    "- **Artifact reads:**\n",
+    "  - Stored payloads are fetched from the bucket when a trace is viewed\n",
+    "  - A misconfigured bucket affects both uploads and reads\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without working Blob Storage, traces with large payloads fail to store properly\n",
+    "- Large payload traces degrade ClickHouse performance\n",
+    "- ClickHouse latency and storage usage climb under load\n",
+    "- Large traces may fail or be rejected\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Bucket and credential settings are stored in Kubernetes Secrets\n",
+    "- Workers connect to Blob Storage to upload artifacts\n",
+    "- A wrong bucket, endpoint, or credential shows up as errors in worker logs\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When Blob Storage Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **Large payload traces degrade ClickHouse:**\n",
+    "   - ClickHouse performance degrades under load\n",
+    "   - Insert operations slow down\n",
+    "   - Query performance suffers\n",
+    "   - Storage pressure increases\n",
+    "\n",
+    "2. **Warnings/errors in logs about artifact storage:**\n",
+    "   - Worker logs show artifact upload failures\n",
+    "   - Bucket access errors\n",
+    "   - Credential errors\n",
+    "   - \"No such bucket\" or \"Access Denied\" errors\n",
+    "\n",
+    "3. **Increased ClickHouse pressure:**\n",
+    "   - ClickHouse latency increases\n",
+    "   - Merge operations backlog\n",
+    "   - Storage usage spikes\n",
+    "   - Query timeouts\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Artifact storage errors in worker logs\n",
+    "   - S3/blob storage connection errors\n",
+    "   - Bucket access denied errors\n",
+    "   - Credential errors\n",
+    "   - Configuration errors\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms appear gradually (under load)\n",
+    "- ClickHouse performance degrades over time\n",
+    "- Large traces fail or are rejected\n",
+    "- Full failure if blob storage completely unavailable\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong Blob Storage Password**\n",
+    "- Modify the Blob Storage password in the Kubernetes Secret\n",
+    "- Symptoms: Credential and \"Access Denied\" errors in worker logs, failed artifact uploads\n",
+    "\n",
+    "**Option B: Block Egress to Blob Storage Endpoint**\n",
+    "- Apply NetworkPolicy blocking egress to Blob Storage (if NetworkPolicy supported)\n",
+    "- Symptoms: Connection timeout, no route to host, intermittent failures\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option C: Wrong Blob Storage Host/Endpoint**\n",
+    "- Point the Blob Storage endpoint to a non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for Blob Storage secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "blob_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"blob\" in name.lower():\n",
+    "            blob_secrets.append(name)\n",
+    "\n",
+    "if blob_secrets:\n",
+    "    ok(f\"Found {len(blob_secrets)} Blob Storage-related secret(s)\")\n",
+    "    for secret_name in blob_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No Blob Storage secrets found\")\n",
+    "    print(\"   💡 Blob Storage connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong Blob Storage Password\n",
+    "# This cell modifies the Blob Storage password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find Blob Storage secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "blob_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Match secrets whose name mentions blob storage\n",
+    "        if \"blob\" in name.lower():\n",
+    "            # Check if it has password-related keys (key names vary by deployment)\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"BLOB_PASSWORD\", \"blob-password\"]):\n",
+    "                blob_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not blob_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find Blob Storage secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found Blob Storage secret: {blob_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "# Parse the secret; drop server-managed metadata so the backup re-applies cleanly\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
+    "    secret_data.get(\"metadata\", {}).pop(field, None)\n",
+    "# Save original secret for restoration (written before the password is changed)\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"blob-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    yaml.safe_dump(secret_data, f)\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
+    "\n",
+    "# Find password key (key names vary by deployment; adjust this list to match yours)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"BLOB_PASSWORD\", \"blob-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"blob-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {blob_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - Blob Storage password is now invalid\")\n",
+    "# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
+    "# print(\"   If they do not restart on their own, restart the affected deployments, then watch:\")\n",
+    "# print(f\"   kubectl rollout restart deployment -n {namespace} && kubectl get pods -n {namespace} -w\")\n",
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Worker pod logs for artifact upload and bucket access errors\n",
+    "2. ClickHouse latency and storage pressure (if visible)\n",
+    "3. Worker retry patterns\n",
+    "4. Latency in API responses\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"blob-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Check for restarts\n",
+    "    lines = result.stdout.split(\"\\n\")\n",
+    "    restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
+    "    if restarts:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in restarts[1:]:  # Skip header\n",
+    "            if line.strip():\n",
+    "                parts = line.split()\n",
+    "                if len(parts) > 3:\n",
+    "                    print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check worker pod logs for Blob Storage errors\n",
+    "print(\"\\n3. Checking worker pod logs for Blob Storage errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for Blob Storage-related errors\n",
+    "        error_keywords = [\"blob\", \"bucket\", \"s3\", \"artifact\", \"connection\", \"timeout\", \"refused\", \"denied\", \"credential\", \"retry\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found Blob Storage-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious Blob Storage errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find worker pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for Blob Storage issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check worker pod logs for artifact upload and bucket access errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'blob\\\\|bucket\\\\|artifact'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl describe secret {blob_secret_name} -n {namespace}\")\n",
+    "print(\"   (describe lists key names and sizes only; avoid -o yaml, which prints base64-encoded values)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test network reachability of the Blob Storage endpoint from a pod (any HTTP status, even 403, means it is reachable):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=curlimages/curl --restart=Never --command -- \\\\\")\n",
+    "print(\"     curl -sS -o /dev/null -w '%{http_code}' https://<blob-endpoint>\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for connection/authentication errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{blob_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{blob_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with Blob Storage connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    blob_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            blob_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
+    "                                              for kw in [\"BLOB\", \"S3\", \"BUCKET\"])]\n",
+    "            if blob_env:\n",
+    "                blob_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if blob_related_pods:\n",
+    "        print(f\"\\n   Pods with Blob Storage environment variables:\")\n",
+    "        for pod_name in set(blob_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
+    "#     print(\"\\n💡 Pods must restart to pick up the correct secret.\")\n",
+    "#     print(\"   If they do not restart on their own, restart the affected deployments, then monitor:\")\n",
+    "#     print(f\"   kubectl rollout restart deployment -n {namespace} && kubectl get pods -n {namespace} -w\")\n",
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "if 'backup_file' in locals() and backup_file:\n",
+    "    print(f\"   Backup file: {backup_file.name}\")\n",
+    "else:\n",
+    "    print(\"   💡 If you modified Helm values, restore them manually\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in worker logs\n",
+    "print(\"\\nChecking for recent errors in worker logs...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"denied\", \"refused\", \"timeout\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in worker logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a Blob Storage issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **Blob Storage connection details:**\n",
+    "   - Endpoint (redacted)\n",
+    "   - Bucket name\n",
+    "   - Credentials (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "   - Retry patterns\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Network policy changes\n",
+    "   - Blob Storage configuration changes\n",
+    "5. **ClickHouse impact (if accessible):**\n",
+    "   - Insert latency\n",
+    "   - Storage usage growth\n",
+    "   - Query timeouts\n",
+    "6. **Blob Storage health (if accessible):**\n",
+    "   - Whether the bucket exists and is reachable\n",
+    "   - Bucket access policy\n",
+    "   - Request error rates (\"Access Denied\", \"No such bucket\")\n",
+    "   - Storage provider status\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Worker pod logs with Blob Storage errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- Blob Storage endpoint connectivity test\n",
+    "- ClickHouse metrics (if available)\n",
+    "- Blob Storage logs (if accessible via cloud provider)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **Blob Storage failures can be gradual** - Symptoms build under load and retries may mask them\n",
+    "2. **Worker logs are critical** - Blob Storage errors appear in worker pod logs\n",
+    "3. **ClickHouse pressure is a symptom** - Large payloads strain ClickHouse when Blob Storage is misconfigured\n",
+    "4. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Ignoring gradual degradation (it often shows up only under load)\n",
+    "- ❌ Not checking worker logs (API logs may not show Blob Storage errors)\n",
+    "- ❌ Not monitoring ClickHouse latency and storage usage (pressure suggests payloads are not being offloaded)\n",
+    "- ❌ Not testing Blob Storage connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the PostgreSQL, Redis, or ClickHouse failure labs\n",
+    "- Review the [First 10 Minutes Checklist](../../docs/shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
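The Blob Storage lab above counts pods as healthy when `status.phase` is `Running` and reads restart counts by splitting `kubectl get pods -o wide` columns. A pod in a crash loop can still report `Running`, so a sketch that reads `containerStatuses` from the JSON output may be more reliable. This assumes input shaped like `kubectl get pods -o json`; the function name `pod_health` and the example pods are hypothetical.

```python
def pod_health(pod_list: dict) -> dict:
    """Map pod name -> (total container restarts, all containers ready)."""
    health = {}
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        # A pod with no reported container statuses is treated as not ready.
        ready = bool(statuses) and all(s.get("ready", False) for s in statuses)
        health[name] = (restarts, ready)
    return health


# Hypothetical pod list, shaped like `kubectl get pods -o json` output.
example = {
    "items": [
        {
            "metadata": {"name": "example-worker-abc"},
            "status": {
                "phase": "Running",
                "containerStatuses": [{"restartCount": 4, "ready": False}],
            },
        },
        {
            "metadata": {"name": "example-api-def"},
            "status": {
                "phase": "Running",
                "containerStatuses": [{"restartCount": 0, "ready": True}],
            },
        },
    ]
}

health = pod_health(example)
for name, (restarts, ready) in sorted(health.items()):
    state = "ready" if ready else "NOT ready"
    print(f"{name}: {restarts} restarts, {state}")
```

Both example pods report phase `Running`, yet only one is ready, which is the case the phase-only check would miss.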
@@ -0,0 +1,37 @@
+# Module 4: Troubleshooting & Incident Response
+
+This directory contains notebooks for Module 4 of the LangSmith Self-Hosted Operator workshop.
+
+## Notebooks
+
+### Setup & Baseline
+- **`00_setup_or_resume_environment.ipynb`** - Validates environment is ready for Module 4
+- **`01_diagnostics_baseline.ipynb`** - Captures baseline diagnostics (run this first!)
+
+### Failure Labs
+- **`10_failure_lab_postgres.ipynb`** - PostgreSQL connectivity failure debugging
+- **`20_failure_lab_redis.ipynb`** - Redis connectivity failure debugging
+- **`30_failure_lab_clickhouse.ipynb`** - ClickHouse connectivity failure debugging
+- **`40_failure_lab_blob_storage.ipynb`** - Blob storage configuration failure debugging
+
+### Advanced
+- **`90_full_incident_drill.ipynb`** - Complete incident simulation (optional)
+
+## Workflow
+
+1. Run `00_setup_or_resume_environment.ipynb` to verify your environment
+2. Run `01_diagnostics_baseline.ipynb` to capture baseline
+3. Run failure labs in order (10, 20, 30, 40) or pick specific ones
+4. Optionally run `90_full_incident_drill.ipynb` for complete practice
+
+## Important Notes
+
+- **Always run baseline first** - You need "before" to compare to "after"
+- **Failure injections are reversible** - All labs include remediation steps
+- **Don't skip diagnostics collection** - Support will ask for the canonical bundle
+- **Practice in test environments only** - These labs modify your deployment
+
+## Documentation
+
+See `docs/modules/module-4.md` for complete module documentation.
+
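The README's "Always run baseline first" note rests on comparing a "before" snapshot with an "after" snapshot. A minimal sketch of that comparison is below. It assumes each snapshot has already been reduced to a dict of pod name to phase and restart count; the function name `compare_snapshots` and the sample data are hypothetical, not part of the workshop helpers.

```python
def compare_snapshots(before: dict, after: dict) -> list:
    """Report differences between two {pod_name: {"phase", "restarts"}} snapshots."""
    findings = []
    for name, now in sorted(after.items()):
        then = before.get(name)
        if then is None:
            findings.append(f"{name}: new since baseline")
            continue
        if now["phase"] != then["phase"]:
            findings.append(f"{name}: phase {then['phase']} -> {now['phase']}")
        if now["restarts"] > then["restarts"]:
            findings.append(f"{name}: restarts {then['restarts']} -> {now['restarts']}")
    # Pods present in the baseline but gone now.
    for name in sorted(set(before) - set(after)):
        findings.append(f"{name}: missing since baseline")
    return findings


# Hypothetical snapshots taken before and after a failure injection.
baseline = {
    "worker-a": {"phase": "Running", "restarts": 0},
    "api-a": {"phase": "Running", "restarts": 1},
    "queue-a": {"phase": "Running", "restarts": 0},
}
incident = {
    "worker-a": {"phase": "Running", "restarts": 3},
    "api-a": {"phase": "Pending", "restarts": 1},
    "worker-b": {"phase": "Running", "restarts": 0},
}

findings = compare_snapshots(baseline, incident)
for line in findings:
    print(line)
```

Pods recreated by a Deployment get new names, so "new" and "missing" entries often appear in pairs; that pairing is itself a sign that pods were replaced after the baseline was taken.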