Add Module 4: Troubleshooting & Incident Response

This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop,
focused on teaching operators how to diagnose issues under pressure, collect
evidence, and resolve incidents efficiently.

Documentation:
- docs/modules/module-4.md: Complete module documentation covering incident
  reality, common failure modes, diagnostics collection, debugging methodology,
  and working with Support
- docs/shared/incident_first_10_minutes.md: Quick reference checklist for
  critical initial incident response steps
- docs/shared/support_escalation_template.md: Copy-paste template for
  escalating issues to LangChain Support with all necessary information

Notebooks:
- notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment
  validation and setup for Module 4 labs
- notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first"
  discipline with comprehensive cluster state capture and canonical diagnostics
  script execution
- notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL
  connectivity failure lab with failure injection, diagnostics, and remediation
- notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue
  failure lab focusing on intermittent ingestion and worker backlog issues
- notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace
  storage failure lab covering missing traces and insert errors
- notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage
  configuration failure lab demonstrating ClickHouse pressure from large payloads
- notebooks/module-4/README.md: Module overview and notebook descriptions

Key Features:
- All failure labs follow consistent structure: baseline → inject → observe →
  collect → triage → remediate → recover
- Cloud-agnostic implementation using shared cloud helpers
- Safe-by-default failure injections with backup/restore mechanisms
- Integration with canonical LangChain diagnostics script
- Secrets-safe: no credentials printed, all values redacted
- Comprehensive Support escalation guidance for each service

Each failure lab includes:
- Service role and importance explanation
- Expected symptoms documentation
- Multiple failure injection options (subtle vs. obvious)
- Guided triage steps with automatic checks
- Support escalation requirements
- Lessons learned and common mistakes

This module completes the core operator workshop curriculum, enabling
operators to confidently troubleshoot and respond to incidents in production
LangSmith deployments.
Cory Waddingham
2026-01-02 10:47:33 -08:00
parent af1e7a840c
commit f9c22ad3ea
10 changed files with 5170 additions and 0 deletions
+426
@@ -0,0 +1,426 @@
# Module 4: Troubleshooting & Incident Response
**Goal:** Teach operators how to diagnose LangSmith self-hosted issues under pressure, collect the right evidence, and resolve incidents efficiently—either independently or with LangChain Support.
**Duration:** ~3-4 hours (with optional full incident drill)
**Audience:** On-call engineers, platform owners, SREs, and anyone responsible for keeping LangSmith healthy
**Prerequisites:**
- Module 1 complete: LangSmith deployed and reachable
- Module 2 complete: Authentication configured
- Module 3 complete: Production operations concepts understood
- Participants own day-2 operations
---
## Overview
Module 4 is hands-on: learners will introduce subtle but noticeable failures and debug them using standard tools and the canonical diagnostics bundle. This module builds the muscle memory needed for real incidents.
**What you'll accomplish:**
- Understand common failure modes and their symptoms
- Master the "first 10 minutes" incident response checklist
- Learn to collect canonical diagnostics bundles
- Practice debugging with guided failure labs
- Know when and how to escalate to Support
**What this module avoids:**
- Deep dives into specific monitoring tools (assumes basic kubectl/helm)
- Performance optimization (covered in Module 3)
- Infrastructure provisioning (covered in Module 1)
- Authentication configuration (covered in Module 2)
---
## Section 1: Incident Reality Check
### The Mindset
**Incidents happen.** Even with perfect configuration, production systems fail. The difference between a 30-minute incident and a 4-hour outage is often preparation and process.
**Key principles:**
1. **Collect evidence first.** Don't redeploy, restart, or reconfigure until you understand what's wrong.
2. **Time is evidence.** Every minute that passes without collecting diagnostics is lost information.
3. **Symptoms are clues.** The same root cause can manifest differently depending on load, timing, and configuration.
4. **Support needs context.** A good diagnostics bundle is worth more than a perfect description.
### What Makes Incidents Hard
**Pressure:**
- Users are impacted
- Management is asking for updates
- You're on-call and tired
- Multiple systems are involved
**Complexity:**
- Distributed systems have many moving parts
- Failures cascade (one service fails, others follow)
- Symptoms don't always point to root cause
- Configuration drift accumulates over time
**Tooling:**
- Too many tools (which one shows the truth?)
- Too few tools (missing critical information)
- Tools that hide the problem (aggregation, sampling)
**This module prepares you for all of these.**
---
## Section 2: Common Failure Modes
### Ingestion & Tracing Failures
**Symptoms:**
- Traces appear delayed or missing
- Worker pods show errors in logs
- ClickHouse insert errors
- Queue backlogs
**Common causes:**
- ClickHouse connectivity issues (network, credentials, resource limits)
- Blob storage misconfiguration (large payloads fail)
- Worker resource exhaustion (CPU/memory limits)
- Redis connectivity (job queue backing up)
**What to check first:**
- Worker pod logs
- ClickHouse pod status and logs
- Redis connectivity and latency
- Blob storage configuration
### UI & API Failures
**Symptoms:**
- UI returns 5xx errors
- API endpoints time out
- Login fails or redirects loop
- Specific features don't work
**Common causes:**
- Database connectivity (PostgreSQL unreachable)
- Authentication misconfiguration (OIDC/SAML)
- Ingress/load balancer issues
- API pod crashes or resource limits
**What to check first:**
- API pod logs
- Database connectivity
- Ingress status and configuration
- Authentication configuration (Module 2 validation)
### Authentication Failures
**Symptoms:**
- Users can't log in
- Redirect loops
- 403 errors after successful login
- Session timeouts
**Common causes:**
- IdP connectivity issues
- OIDC/SAML configuration drift
- Secret rotation without updating LangSmith
- Network policies blocking egress
**What to check first:**
- Auth pod logs
- IdP connectivity (curl to issuer URL)
- OIDC/SAML configuration (Module 2 validation)
- Network policies
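The "IdP connectivity" check can be scripted for OIDC providers. The sketch below builds the standard OIDC discovery URL from an issuer URL and fetches it. The issuer values are placeholders, the function names are hypothetical, and SAML is not covered. Run from a workstation, it only proves reachability from that workstation; to test cluster egress, it would need to run from a pod in the LangSmith namespace.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def discovery_url(issuer: str) -> str:
    """Return the OIDC discovery document URL for an issuer URL."""
    parts = urlsplit(issuer.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"issuer must be an absolute http(s) URL: {issuer!r}")
    # OIDC Discovery: drop any trailing slash, then append the well-known path.
    path = parts.path.rstrip("/") + "/.well-known/openid-configuration"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def issuer_reachable(issuer: str, timeout: float = 5.0) -> bool:
    """Fetch the discovery document and confirm it names the same issuer."""
    try:
        with urllib.request.urlopen(discovery_url(issuer), timeout=timeout) as resp:
            document = json.load(resp)
    except (OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, HTTP errors, bad JSON.
        return False
    if not isinstance(document, dict):
        return False
    return str(document.get("issuer", "")).rstrip("/") == issuer.strip().rstrip("/")
```

A `False` result narrows the problem to connectivity or issuer configuration, but it does not say which; the auth pod logs remain the primary evidence.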
---
## Section 3: First 10 Minutes Checklist
**The first 10 minutes of an incident are critical.** This is when you collect the most valuable evidence and make decisions that determine how long the incident lasts.
### What NOT to Do
**Resist the urge to:**
- Run `helm upgrade` or `kubectl rollout restart`
- Delete pods "to see if they come back"
- Scale resources up/down
- Change configuration
**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
### The Checklist
See [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md) for the complete reference.
**Quick summary:**
1. **Minute 0-2:** Triage & scope (what's broken, who's impacted)
2. **Minute 2-5:** Quick health check (pods, events, ingress)
3. **Minute 5-8:** Collect diagnostics bundle (canonical script + snapshots)
4. **Minute 8-10:** Identify likely root cause (symptoms → checks)
**Key insight:** This checklist is not about fixing the issue—it's about collecting evidence and making informed decisions.
---
## Section 4: Standard Diagnostics Collection
### The Canonical Script
LangChain provides an official diagnostics script that captures the cluster state Support typically needs:
**Location:**
```
https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
```
**What it captures:**
- Pod logs (all containers)
- Events (sorted by timestamp)
- Resource usage (CPU, memory)
- Configuration (deployments, services, ingress)
- Storage (PVCs, storage classes)
- Network (services, endpoints)
**How to use it:**
```bash
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
chmod +x get_k8s_debugging_info.sh
./get_k8s_debugging_info.sh <namespace>
```
**Important:** Always run this script before making changes. The bundle it creates is your evidence.
### What Good Debugging Looks Like
**Good debugging:**
- Starts with a baseline (what was working before)
- Collects evidence systematically (checklist-driven)
- Documents hypotheses and tests them
- Preserves evidence (saves diagnostics bundles)
- Escalates with context (diagnostics + timeline)
**Bad debugging:**
- Changes things without understanding
- Doesn't collect evidence
- Jumps to conclusions
- Destroys evidence (redeploys, deletes)
- Escalates without context ("it's broken, fix it")
**The difference:** Good debugging produces a clear root cause and fix. Bad debugging produces more incidents.
---
## Section 5: Working with Support
### What Speeds Up Support
**Good escalation includes:**
- Diagnostics bundle (canonical script output)
- Timeline (when did it start, what changed)
- Symptoms (what's broken, who's impacted)
- What you've tried (investigation steps, results)
- Environment details (versions, configuration)
**Use the [Support Escalation Template](../shared/support_escalation_template.md).**
### What Slows Down Support
**Poor escalation includes:**
- No diagnostics bundle ("just look at it")
- Vague symptoms ("it's slow")
- No timeline ("it broke")
- No environment details ("it's on Kubernetes")
- Secrets in logs (security risk)
**Result:** Support has to ask for information you could have provided, delaying resolution.
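To reduce the "secrets in logs" risk, a redaction pass can be run over log excerpts before they are shared. The following is a minimal sketch, assuming secrets appear as bearer tokens, `key=value` or `key: value` pairs with a sensitive-looking key, or credentials embedded in connection URLs. The patterns are illustrative assumptions, not an exhaustive list, so redacted output should still be reviewed by hand.

```python
import re

# (pattern, replacement) pairs, applied in order. All patterns are best-effort.
_PATTERNS = [
    # HTTP bearer tokens: "Bearer <token>"
    (re.compile(r"(?i)\b(bearer\s+)([A-Za-z0-9._~+/=-]+)"), r"\1[REDACTED]"),
    # key=value or key: value where the key name looks sensitive
    (
        re.compile(
            r"(?i)([\w.-]*(?:password|passwd|secret|token|api[_-]?key)[\w.-]*)"
            r"([\"']?\s*[=:]\s*)(\S+)"
        ),
        r"\1\2[REDACTED]",
    ),
    # Credentials in connection URLs: scheme://user:password@host
    (re.compile(r"(?i)(\b[a-z][a-z0-9+.-]*://[^\s:/@]+:)([^\s@]+)(@)"), r"\1[REDACTED]\3"),
]


def redact(text: str) -> str:
    """Replace likely secret values in text with the [REDACTED] marker."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

The sketch prefers over-redaction to leakage: a value that only looks like a secret is still replaced, which costs a little context but is safe to share.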
### Required Metadata
**Support will always ask for:**
1. Diagnostics bundle (canonical script)
2. Helm chart version
3. Image tags (if known)
4. Recent changes (deployments, config, infrastructure)
5. Cloud provider and region
6. Kubernetes version
7. What you've tried and results
**Provide this upfront to speed resolution.**
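One way to enforce this list is to check an escalation draft for gaps before sending it. The sketch below assumes the escalation is held as a dictionary; the field names are hypothetical keys chosen to mirror the seven items above, with image tags treated as optional because the list marks them "if known".

```python
from typing import Any, Dict, List

# Hypothetical keys mirroring the required metadata list.
REQUIRED_FIELDS = (
    "diagnostics_bundle",
    "helm_chart_version",
    "recent_changes",
    "cloud_provider",
    "region",
    "kubernetes_version",
    "steps_tried",
)
OPTIONAL_FIELDS = ("image_tags",)


def missing_fields(escalation: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = escalation.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

An empty result means the draft carries every required item; it says nothing about whether the content of each item is useful.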
---
## Section 6: Preventing Repeat Incidents
### Post-Incident Review
**After an incident is resolved:**
1. **Document the root cause** (what actually broke)
2. **Identify contributing factors** (what made it worse)
3. **List what worked** (what helped you debug)
4. **List what didn't work** (what slowed you down)
5. **Create action items** (what to change to prevent recurrence)
**Key questions:**
- Could we have detected this earlier? (monitoring, alerts)
- Could we have prevented this? (configuration, testing)
- Could we have fixed it faster? (runbooks, tooling)
- What did we learn? (new failure mode, new tool)
### Common Patterns
**Configuration drift:**
- Secrets rotate, but LangSmith config isn't updated
- Infrastructure changes, but Helm values aren't updated
- IdP settings change, but OIDC/SAML config isn't updated
**Prevention:** Automated validation (Module 2, Module 3 notebooks), configuration as code, regular audits.
**Resource exhaustion:**
- ClickHouse runs out of disk
- PostgreSQL hits connection limits
- Workers hit CPU/memory limits
**Prevention:** Monitoring (Module 3), autoscaling (Module 3), capacity planning.
**Network issues:**
- Egress blocked by NetworkPolicy
- Load balancer misconfiguration
- DNS resolution failures
**Prevention:** Network policy testing, ingress validation (Module 1), DNS checks.
---
## Section 7: Hands-on Failure Labs
**This is where you practice.** Each lab follows the same pattern:
1. **Baseline snapshot:** Capture what "good" looks like
2. **Introduce failure:** Apply a subtle but noticeable fault
3. **Observe symptoms:** See how the failure manifests
4. **Collect diagnostics:** Run the canonical script and gather evidence
5. **Hypothesize root cause:** Based on symptoms, identify likely cause
6. **Verify with targeted checks:** Confirm your hypothesis
7. **Remediate:** Revert the failure
8. **Confirm recovery:** Verify everything is working again
9. **Capture lessons learned:** Document what you discovered
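The value of the baseline in step 1 shows up in steps 5 and 6, when the current state is compared against it. A minimal sketch of that comparison follows, assuming each snapshot has been reduced to a mapping from resource name to an observed state string. The deployment names and the ready-count format in the usage example are hypothetical.

```python
from typing import Dict, List


def diff_snapshots(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {resource name: observed state} snapshots."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "changed": sorted(name for name in shared if baseline[name] != current[name]),
    }


# Usage with hypothetical deployment names and ready counts.
baseline = {"backend": "3/3", "queue": "2/2", "frontend": "1/1"}
current = {"backend": "0/3", "queue": "2/2", "debug-shell": "1/1"}
print(diff_snapshots(baseline, current))
```

Keying the snapshots by deployment rather than by pod is a deliberate choice here: pod names change on every restart, which would make each pod look both removed and added.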
### Lab Structure
**Each failure lab includes:**
- **What this service does for LangSmith:** Context on the service's role
- **Expected symptoms when it fails:** What you'll see when it breaks
- **Failure injection options:** Two levels (subtle vs. obvious)
- **Do the drill:** Step-by-step debugging process
- **What Support will ask for:** Service-specific evidence
### Available Labs
1. **PostgreSQL Failure Lab** (`10_failure_lab_postgres.ipynb`)
   - Connection failures, wrong credentials, network isolation
   - Symptoms: API 5xx, login failures, connection exhaustion
2. **Redis Failure Lab** (`20_failure_lab_redis.ipynb`)
   - Connectivity issues, wrong credentials
   - Symptoms: Intermittent ingestion, latency spikes, worker backlog
3. **ClickHouse Failure Lab** (`30_failure_lab_clickhouse.ipynb`)
   - Endpoint misconfiguration, network isolation, resource limits
   - Symptoms: Traces delayed/missing, insert errors, UI loads but traces don't appear
4. **Blob Storage Failure Lab** (`40_failure_lab_blob_storage.ipynb`)
   - Credential misconfiguration, bucket name errors
   - Symptoms: Large payload traces degrade ClickHouse, warnings in logs
5. **Full Incident Drill** (`90_full_incident_drill.ipynb`) (optional)
   - Combined failure + timeline pressure
   - Practice "first 10 minutes" checklist
   - Produce incident summary using escalation template
---
## Section 8: Workshop Wrap-up
### What You've Learned
- How to respond to incidents systematically
- How to collect canonical diagnostics bundles
- How to debug common failure modes
- How to escalate effectively to Support
- How to prevent repeat incidents
### Next Steps
**Immediate:**
- Run through failure labs to build muscle memory
- Customize the "first 10 minutes" checklist for your environment
- Set up monitoring and alerts (Module 3)
**Ongoing:**
- Practice incident response regularly (drills)
- Keep diagnostics script updated
- Document your own failure modes and fixes
- Share learnings with your team
### Resources
- [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)
- [Support Escalation Template](../shared/support_escalation_template.md)
- [Canonical Diagnostics Script](https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh)
- Module 1: Deployment & Baseline Validation
- Module 2: Identity & Authentication
- Module 3: Production Operations & Scaling
---
## Artifacts
**Participants leave with:**
- A working incident response process
- Experience debugging real failure modes
- A diagnostics bundle collection workflow
- An escalation template customized for their environment
- Confidence to handle incidents independently
---
## Common Pitfalls
**Don't:**
- Skip the baseline snapshot (you need "before" to compare to "after")
- Redeploy before collecting evidence (destroys diagnostics)
- Ignore error messages (they're clues)
- Escalate without diagnostics bundle (slows Support)
- Delete evidence (you'll need it for post-incident review)
**Do:**
- Follow the checklist (it's battle-tested)
- Collect diagnostics early (time is evidence)
- Document your investigation (helps you and Support)
- Test your process (run drills)
- Learn from each incident (prevent repeats)
---
## Troubleshooting
**"The diagnostics script fails":**
- Check kubectl access and namespace
- Verify script is up-to-date (check GitHub)
- Run with shell tracing (for example, `bash -x ./get_k8s_debugging_info.sh <namespace>`) to see which command is failing
**"I can't reproduce the failure":**
- Check that failure injection was applied correctly
- Verify symptoms match expected behavior
- Try a different failure injection method (Level 2 if Level 1 didn't work)
**"The remediation doesn't work":**
- Verify you reverted the exact change you made
- Check for cascading failures (one failure caused another)
- Collect post-remediation diagnostics to compare
**"I don't understand the symptoms":**
- Review the service's role in LangSmith (lab introduction)
- Check logs for error patterns
- Compare to baseline snapshot (what changed?)
---
**Remember:** Incident response is a skill. Practice makes perfect. The more you drill, the better you'll be when real incidents happen.
+163
@@ -0,0 +1,163 @@
# First 10 Minutes: Incident Response Checklist
**When:** You detect or are alerted to a LangSmith self-hosted issue.
**Goal:** Collect evidence, stabilize if possible, and prepare for escalation—without making things worse.
---
## ⚠️ Critical: Do NOT Redeploy
**Resist the urge to:**
- Run `helm upgrade` or `kubectl rollout restart`
- Delete pods "to see if they come back"
- Scale resources up/down
- Change configuration
**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
---
## Minute 0-2: Triage & Scope
- [ ] **Confirm the issue:** What's broken? (UI down, API 5xx, traces missing, auth failing)
- [ ] **Check who's impacted:** All users, specific endpoints, specific features?
- [ ] **Note the time:** Record detection time and any recent changes (deployments, config changes, infrastructure changes)
- [ ] **Check basic connectivity:**
```bash
kubectl cluster-info
kubectl get nodes
kubectl get pods -n <namespace>
```
---
## Minute 2-5: Quick Health Check
- [ ] **Pod status:**
```bash
kubectl get pods -n <namespace> -o wide
```
Look for: CrashLoopBackOff, Pending, Error states
- [ ] **Recent events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
Look for: Failed scheduling, image pull errors, resource limits
- [ ] **Ingress/Load Balancer:**
```bash
kubectl get ingress -n <namespace>
```
Check if endpoint is reachable (curl or browser)
- [ ] **Key deployments:**
```bash
kubectl get deployments -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>
```
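The "look for" step on pod status can be automated against the JSON form of the same command (`kubectl get pods -n <namespace> -o json`). The sketch below is one possible filter, assuming the standard Kubernetes pod status fields `phase` and `containerStatuses[].state.waiting.reason`. The set of waiting reasons is a starting assumption, not a complete list, and a pod that is Running but not Ready is not flagged.

```python
from typing import Any, Dict, List, Tuple

# Waiting reasons treated as unhealthy (an assumed, non-exhaustive set).
WAITING_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerConfigError",
}


def unhealthy_pods(pod_list: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pod name, reason) pairs from `kubectl get pods -o json` output."""
    findings = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})
        reason = None
        # A container stuck in a known waiting state is the most specific signal.
        for container in status.get("containerStatuses") or []:
            waiting = (container.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") in WAITING_REASONS:
                reason = waiting["reason"]
                break
        # Otherwise fall back to the pod phase (Pending, Failed, Unknown).
        phase = status.get("phase", "Unknown")
        if reason is None and phase not in ("Running", "Succeeded"):
            reason = phase
        if reason is not None:
            findings.append((name, reason))
    return findings
```

The function takes already-parsed JSON rather than calling `kubectl` itself, so the same filter can be applied to a saved baseline snapshot and to live output.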
---
## Minute 5-8: Collect Diagnostics Bundle
- [ ] **Run canonical diagnostics script:**
```bash
# Download and run the official script
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
chmod +x get_k8s_debugging_info.sh
./get_k8s_debugging_info.sh <namespace>
```
This captures: pod logs, events, resource usage, configuration
- [ ] **Save timestamped snapshot:**
```bash
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p artifacts/incident-$TIMESTAMP
kubectl get all -n <namespace> -o yaml > artifacts/incident-$TIMESTAMP/all-resources.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > artifacts/incident-$TIMESTAMP/events.txt
```
- [ ] **Check logs for obvious errors:**
```bash
# Check LangSmith API pod logs (adjust the label selector to match your deployment)
kubectl logs -n <namespace> -l app=langsmith-api --tail=100
# Check worker logs
kubectl logs -n <namespace> -l app=langsmith-worker --tail=100
```
Look for: connection errors, timeouts, authentication failures, resource exhaustion
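The collection steps in this section can be wrapped so that every capture lands in one timestamped directory. This is a sketch under stated assumptions: the function names and output file names are hypothetical, the script is assumed to be downloaded into the working directory already, and only its console output is captured (where the script writes its own bundle is not handled here). UTC is used for the directory name so it lines up with the UTC timestamps in the escalation template.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional, Tuple


def incident_dir(root: Path, now: Optional[datetime] = None) -> Path:
    """Return a timestamped directory path such as root/incident-20260102-184733."""
    now = now or datetime.now(timezone.utc)
    return root / f"incident-{now:%Y%m%d-%H%M%S}"


def collection_commands(namespace: str, out_dir: Path) -> List[Tuple[List[str], Path]]:
    """Pair each read-only collection command with the file that receives its output."""
    return [
        (["./get_k8s_debugging_info.sh", namespace], out_dir / "diagnostics-script.log"),
        (["kubectl", "get", "all", "-n", namespace, "-o", "yaml"], out_dir / "all-resources.yaml"),
        (
            ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
            out_dir / "events.txt",
        ),
    ]


def collect(namespace: str, root: Path = Path("artifacts")) -> Path:
    """Run every collection command, continuing past individual failures."""
    out_dir = incident_dir(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    for command, destination in collection_commands(namespace, out_dir):
        with destination.open("w") as handle:
            # check=False: a failing command is itself evidence, so keep going.
            subprocess.run(command, stdout=handle, stderr=subprocess.STDOUT, check=False)
    return out_dir
```

Separating the command list from the runner keeps the plan inspectable: `collection_commands` can be printed and reviewed before anything touches the cluster.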
---
## Minute 8-10: Identify Likely Root Cause
Based on symptoms, check the most likely culprits:
### If UI/API is down:
- [ ] Check ingress/load balancer status (via cloud helper or kubectl)
- [ ] Check API pod logs for startup errors
- [ ] Verify external services (PostgreSQL, Redis) are reachable
### If traces are missing/delayed:
- [ ] Check ClickHouse connectivity and logs
- [ ] Check worker pod logs for insert errors
- [ ] Verify blob storage configuration (if large payloads)
### If authentication fails:
- [ ] Check OIDC/SAML configuration (Module 2 validation)
- [ ] Check IdP connectivity
- [ ] Review auth-related pod logs
### If ingestion is slow:
- [ ] Check Redis connectivity and latency
- [ ] Check worker pod resource usage
- [ ] Look for queue backlogs
---
## After 10 Minutes: Decision Point
**If you've identified and can safely fix the issue:**
- Document what you changed
- Verify recovery
- Collect post-recovery diagnostics
**If you need help:**
- Use the [Support Escalation Template](../shared/support_escalation_template.md)
- Include the diagnostics bundle
- Note what you've tried and the results
**If the issue is critical and escalating:**
- Continue collecting evidence every 5-10 minutes
- Document timeline of symptoms
- Prepare escalation with all evidence
---
## What NOT to Do
- ❌ Don't delete namespaces or persistent volumes
- ❌ Don't change database passwords or connection strings
- ❌ Don't scale resources without understanding the bottleneck
- ❌ Don't ignore error messages—they're evidence
- ❌ Don't skip the diagnostics bundle—Support will ask for it
---
## Quick Reference: Common Failure Patterns
| Symptom | Likely Cause | First Check |
|---------|--------------|-------------|
| All pods CrashLoopBackOff | Config error, missing secret | `kubectl describe pod` |
| API 5xx errors | Database/Redis connection | Pod logs, service endpoints |
| Traces not appearing | ClickHouse connectivity | ClickHouse pod logs |
| Slow ingestion | Redis latency, worker backlog | Worker logs, Redis metrics |
| Auth redirect loop | OIDC/SAML misconfiguration | Auth pod logs, IdP connectivity |
---
**Remember:** The goal is evidence collection and safe triage, not immediate resolution. A good diagnostics bundle is worth more than a hasty fix.
+185
@@ -0,0 +1,185 @@
# Support Escalation Template
**Use this template when escalating an incident to LangChain Support.**
Copy and fill in each section. Include the diagnostics bundle and any relevant evidence.
---
## Incident Summary
**Start Time:** `YYYY-MM-DD HH:MM:SS UTC`
**Detection Time:** `YYYY-MM-DD HH:MM:SS UTC`
**Current Status:** `[Investigating / Escalating / Resolved]`
**Brief Description:**
```
[One-sentence summary of the issue]
```
---
## Symptoms
**Who is impacted:**
- [ ] All users
- [ ] Specific user(s) or workspace(s)
- [ ] Specific endpoints or features
- [ ] Internal operations only
**What's broken:**
- [ ] UI is unreachable or returns errors
- [ ] API endpoints return 5xx errors
- [ ] Traces are missing or delayed
- [ ] Authentication/authorization failures
- [ ] Ingestion is slow or failing
- [ ] Other: `[describe]`
**Error messages observed:**
```
[Paste relevant error messages, redacting any secrets]
```
**User-facing impact:**
```
[Describe what users experience]
```
---
## Recent Changes
**Deployments/Releases:**
- [ ] Helm upgrade/chart change: `[version/date]`
- [ ] Configuration change: `[what changed]`
- [ ] Infrastructure change: `[what changed]`
- [ ] No recent changes
**Timeline:**
```
[Chronological list of changes leading up to the incident]
```
---
## Environment Details
**Cloud Provider:** `[AWS / Azure / GCP / Other]`
**Region/Location:** `[region]`
**Kubernetes Service:** `[EKS / AKS / GKE / Other]`
**Cluster Name:** `[cluster-name]`
**Namespace:** `[namespace]`
**LangSmith Version:**
- Helm Chart Version: `[version]`
- Image Tags: `[if known]`
- Deployment Method: `[Helm / kubectl / Other]`
**Infrastructure:**
- PostgreSQL: `[RDS / Azure Database / In-cluster / Other]`
- Redis: `[ElastiCache / Azure Cache / In-cluster / Other]`
- ClickHouse: `[Managed / In-cluster]`
- Blob Storage: `[S3 / Azure Blob / GCS / Other]`
---
## Diagnostics Bundle
**Bundle Location:** `[path or URL to diagnostics bundle]`
**Bundle Contents:**
- [ ] Canonical diagnostics script output (`get_k8s_debugging_info.sh`)
- [ ] `kubectl get all -o yaml` snapshot
- [ ] Recent events (`kubectl get events`)
- [ ] Pod logs (API, workers, ClickHouse)
- [ ] Resource usage snapshot (`kubectl top pods/nodes`)
- [ ] Ingress/load balancer configuration
- [ ] Helm values (redacted)
**Bundle Timestamp:** `YYYY-MM-DD HH:MM:SS UTC`
---
## What We've Tried
**Investigation Steps:**
1. `[What you checked and what you found]`
2. `[Next step and result]`
3. `[Continue as needed]`
**Remediation Attempts:**
- [ ] Restarted pods: `[which pods, result]`
- [ ] Checked external service connectivity: `[result]`
- [ ] Verified configuration: `[result]`
- [ ] Other: `[describe]`
**Current Hypothesis:**
```
[Your best guess at the root cause, with evidence]
```
---
## Evidence & Logs
**Key Log Excerpts (redact secrets):**
```
[Paste relevant log lines with timestamps]
```
**Error Patterns:**
```
[Describe patterns you've observed]
```
**Metrics/Signals:**
```
[Any metrics or signals that indicate the issue]
```
---
## Questions for Support
1. `[Your question]`
2. `[Another question]`
3. `[Continue as needed]`
---
## Additional Context
**Related Issues:**
- Previous similar incidents: `[reference]`
- Known limitations: `[describe]`
- Custom configurations: `[describe, redact secrets]`
**Priority:**
- [ ] Critical (service down, all users impacted)
- [ ] High (major feature broken, many users impacted)
- [ ] Medium (degraded performance, some users impacted)
- [ ] Low (minor issue, workaround available)
---
## Next Steps
**What we need from Support:**
- [ ] Root cause analysis
- [ ] Remediation steps
- [ ] Configuration guidance
- [ ] Performance optimization
- [ ] Other: `[describe]`
**Our availability:**
- Timezone: `[timezone]`
- Best time to contact: `[time range]`
- Escalation contact: `[name/email]`
---
**Template Version:** 1.0
**Last Updated:** `[date]`
**Note:** Always redact secrets, API keys, passwords, and connection strings before sharing. Use `[REDACTED]` or similar markers.
@@ -0,0 +1,420 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Setup or Resume Environment\n",
"\n",
"## Overview\n",
"\n",
"This notebook helps you prepare for Module 4 (Troubleshooting & Incident Response). It validates that your LangSmith environment is running and accessible, or directs you to deploy it using Module 1.\n",
"\n",
"**Prerequisites:**\n",
"- Module 1 notebooks available (for deployment if needed)\n",
"- kubectl configured (if environment exists)\n",
"- Cloud provider credentials (if deploying)\n",
"\n",
"**What This Notebook Does:**\n",
"1. Checks if LangSmith is already deployed\n",
"2. If not, provides links to Module 1 deployment notebooks\n",
"3. If yes, validates the environment is healthy and reachable\n",
"4. Confirms prerequisites for Module 4 failure labs\n",
"\n",
"**Estimated time:** 10-15 minutes\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"possible_paths = [\n",
"    Path.cwd().parent,  # If cwd is module-4, go up one level to notebooks\n",
"    Path.cwd(),  # If cwd is already notebooks\n",
"    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
"    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        notebooks_path = path\n",
"        break\n",
"\n",
"if not notebooks_path:\n",
"    notebooks_path = Path.cwd() / \"notebooks\"\n",
"    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
"    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration\n",
"\n",
"Load and validate configuration from environment variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env, ok, warn\n",
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
"\n",
"# Required configuration\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"\n",
"print(\"### Loading Configuration\\n\")\n",
"\n",
"config = {}\n",
"missing = []\n",
"\n",
"for var in required_vars:\n",
"    value = os.environ.get(var, \"\").strip()\n",
"    if not value:\n",
"        missing.append(var)\n",
"    config[var] = value\n",
"\n",
"if missing:\n",
"    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
"                       f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
"\n",
"# Optional but recommended\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
"\n",
"# Show cloud provider info\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"\n",
"print(f\"Cloud Provider: {provider.upper()}\")\n",
"print(f\"Region: {region}\")\n",
"print(f\"Namespace: {config['NAMESPACE']}\")\n",
"print(f\"Cluster: {config['CLUSTER_NAME']}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"if config[\"LANGSMITH_DOMAIN\"]:\n",
"    print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Check if Environment Exists\n",
"\n",
"We'll check if LangSmith is already deployed. If not, we'll provide instructions to deploy using Module 1.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"from shared._cloud_helpers import cluster_exists, configure_kubectl, get_kubernetes_service_name\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"cluster_name = config[\"CLUSTER_NAME\"]\n",
"k8s_service = get_kubernetes_service_name()\n",
"\n",
"print(f\"### Checking {k8s_service} Cluster\\n\")\n",
"\n",
"# Check if cluster exists\n",
"if cluster_exists(cluster_name):\n",
"    ok(f\"Cluster '{cluster_name}' exists\")\n",
"    \n",
"    # Configure kubectl\n",
"    print(f\"\\n### Configuring kubectl\\n\")\n",
"    try:\n",
"        configure_kubectl(cluster_name, region)\n",
"        ok(\"kubectl configured\")\n",
"    except Exception as e:\n",
"        warn(f\"Could not configure kubectl: {e}\")\n",
"        print(\"💡 Make sure you have proper cloud provider credentials\")\n",
"        raise\n",
"else:\n",
"    warn(f\"Cluster '{cluster_name}' not found\")\n",
"    print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
"    print(\" See the 'Deploy Environment' section below.\")\n",
"    raise RuntimeError(\"Cluster not found. Deploy using Module 1 first.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Verify Namespace and Helm Release\n",
"\n",
"Check that the LangSmith namespace exists and Helm release is installed.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"helm_release = config[\"HELM_RELEASE\"]\n",
"\n",
"print(\"### Checking Namespace\\n\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    ok(f\"Namespace '{namespace}' exists\")\n",
"else:\n",
"    warn(f\"Namespace '{namespace}' not found\")\n",
"    print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
"    print(\" See the 'Deploy Environment' section below.\")\n",
"    raise RuntimeError(\"Namespace not found. Deploy using Module 1 first.\")\n",
"\n",
"print(\"\\n### Checking Helm Release\\n\")\n",
"result = run(\n",
"    [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    releases = json.loads(result.stdout)\n",
"    release_found = any(r.get(\"name\") == helm_release for r in releases)\n",
"    \n",
"    if release_found:\n",
"        ok(f\"Helm release '{helm_release}' found\")\n",
"        # Get release info\n",
"        result = run(\n",
"            [\"helm\", \"status\", helm_release, \"-n\", namespace, \"-o\", \"json\"],\n",
"            check=False,\n",
"            stream=False\n",
"        )\n",
"        if result.returncode == 0:\n",
"            release_info = json.loads(result.stdout)\n",
"            print(f\" Status: {release_info.get('info', {}).get('status', 'unknown')}\")\n",
"            print(f\" Chart: {release_info.get('chart', {}).get('metadata', {}).get('name', 'unknown')}\")\n",
"            print(f\" Version: {release_info.get('chart', {}).get('metadata', {}).get('version', 'unknown')}\")\n",
"    else:\n",
"        warn(f\"Helm release '{helm_release}' not found in namespace '{namespace}'\")\n",
"        print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
"        print(\" See the 'Deploy Environment' section below.\")\n",
"        raise RuntimeError(\"Helm release not found. Deploy using Module 1 first.\")\n",
"else:\n",
"    warn(\"Could not list Helm releases\")\n",
"    print(\"💡 Make sure Helm is installed and kubectl is configured correctly\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Verify Ingress Endpoint\n",
"\n",
"Check that the LangSmith ingress is configured and reachable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from urllib.parse import urlparse\n",
"\n",
"print(\"### Checking Ingress\\n\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"ingress_found = False\n",
"ingress_host = None\n",
"\n",
"if result.returncode == 0:\n",
" ingresses = json.loads(result.stdout)\n",
" for ingress in ingresses.get(\"items\", []):\n",
" rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
" for rule in rules:\n",
" host = rule.get(\"host\", \"\")\n",
" if host:\n",
" ingress_found = True\n",
" ingress_host = host\n",
" print(f\" Found ingress with host: {host}\")\n",
" break\n",
"\n",
"if not ingress_found:\n",
" warn(\"No ingress found\")\n",
" print(\"💡 Ingress may still be provisioning. Check Module 1 validation notebook.\")\n",
"else:\n",
" ok(f\"Ingress configured with host: {ingress_host}\")\n",
" \n",
" # Try to reach the endpoint\n",
" if config[\"LANGSMITH_DOMAIN\"]:\n",
" test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
" elif ingress_host:\n",
" test_url = f\"https://{ingress_host}\"\n",
" else:\n",
" test_url = None\n",
" \n",
" if test_url:\n",
" print(f\"\\n### Testing Endpoint Reachability\\n\")\n",
" print(f\"Testing: {test_url}\")\n",
" try:\n",
" # Allow redirects, don't verify SSL (may be self-signed)\n",
" response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
" if response.status_code in [200, 302, 401, 403]:\n",
" ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
" else:\n",
" warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
" except requests.exceptions.SSLError:\n",
" # SSL error is OK if using self-signed certs\n",
" warn(\"SSL verification failed (may be self-signed certificate)\")\n",
" print(\"💡 This is OK for testing. In production, use proper TLS certificates.\")\n",
" except requests.exceptions.RequestException as e:\n",
" warn(f\"Could not reach endpoint: {e}\")\n",
" print(\"💡 Ingress may still be provisioning. Wait a few minutes and try again.\")\n",
" else:\n",
" warn(\"No domain configured for testing\")\n",
" print(\"💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Quick Health Check\n",
"\n",
"Verify that key deployments are running.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Checking Key Deployments\\n\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" deployments = json.loads(result.stdout)\n",
" deployment_items = deployments.get(\"items\", [])\n",
" \n",
" if deployment_items:\n",
" ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
" print(\"\\nDeployment Status:\")\n",
" for deployment in deployment_items:\n",
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
" spec_replicas = deployment.get(\"spec\", {}).get(\"replicas\", 0)\n",
" status_replicas = deployment.get(\"status\", {}).get(\"replicas\", 0)\n",
" ready_replicas = deployment.get(\"status\", {}).get(\"readyReplicas\", 0)\n",
" available_replicas = deployment.get(\"status\", {}).get(\"availableReplicas\", 0)\n",
" \n",
" status_icon = \"✅\" if ready_replicas == spec_replicas and available_replicas == spec_replicas else \"⚠️\"\n",
" print(f\" {status_icon} {name}: {ready_replicas}/{spec_replicas} ready, {available_replicas}/{spec_replicas} available\")\n",
" else:\n",
" warn(\"No deployments found\")\n",
" print(\"💡 LangSmith may not be fully deployed. Check Module 1 validation notebook.\")\n",
"else:\n",
" warn(\"Could not list deployments\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ✅ Environment Ready\n",
"\n",
"Your LangSmith environment is running and accessible. You're ready to proceed with Module 4 failure labs.\n",
"\n",
"**Next Steps:**\n",
"1. Run `01_diagnostics_baseline.ipynb` to capture a baseline snapshot\n",
"2. Proceed with failure labs (10, 20, 30, 40)\n",
"3. Optionally run `90_full_incident_drill.ipynb` for a complete incident simulation\n",
"\n",
"---\n",
"\n",
"## 📝 Important Reminder\n",
"\n",
"**When finished with Module 4, run Module 1's `99_teardown.ipynb` to delete the environment and avoid ongoing cloud costs.**\n",
"\n",
"The teardown notebook will:\n",
"- Remove Helm release\n",
"- Destroy Terraform-managed infrastructure (Kubernetes cluster, database, cache, blob storage, etc.)\n",
"- Clean up any remaining resources\n",
"\n",
"**Location:** `../module-1/99_teardown.ipynb`\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🚀 Deploy Environment (If Not Already Deployed)\n",
"\n",
"If your environment is not running, follow these steps to deploy LangSmith using Module 1:\n",
"\n",
"### Step 1: Preflight Checks\n",
"Run `../module-1/01_preflight.ipynb` to validate your environment.\n",
"\n",
"### Step 2: Provision Infrastructure\n",
"Run `../module-1/02_terraform_apply.ipynb` to deploy cloud infrastructure (Kubernetes cluster, database, cache, blob storage).\n",
"\n",
"### Step 3: Install LangSmith\n",
"Run `../module-1/03_helm_install_langsmith.ipynb` to install LangSmith using Helm.\n",
"\n",
"### Step 4: Validate Deployment\n",
"Run `../module-1/04_validate_ingress_and_ui.ipynb` to verify everything is working.\n",
"\n",
"### Step 5: Return Here\n",
"Once deployment is complete, return to this notebook and re-run the cells above to verify your environment is ready for Module 4.\n",
"\n",
"---\n",
"\n",
"**Note:** If you encounter errors during deployment, refer to Module 1 documentation and troubleshooting guides.\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,600 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Diagnostics Baseline\n",
"\n",
"## Overview\n",
"\n",
"**This notebook teaches \"baseline first\" discipline.** Before introducing failures or debugging issues, you must capture what \"good\" looks like. This baseline becomes your reference point for all troubleshooting.\n",
"\n",
"**What This Notebook Does:**\n",
"1. Captures cluster state snapshot (pods, services, deployments)\n",
"2. Collects recent events and resource usage\n",
"3. Runs the canonical diagnostics script\n",
"4. Performs basic health checks\n",
"5. Saves everything to a timestamped directory\n",
"\n",
"**Why This Matters:**\n",
"- You need \"before\" to compare to \"after\"\n",
"- Support will ask for baseline diagnostics\n",
"- Good debugging starts with understanding normal state\n",
"- Evidence collection is time-sensitive\n",
"\n",
"**Estimated time:** 15-20 minutes\n",
"\n",
"**Important:** Run this notebook BEFORE starting any failure labs. It's your evidence baseline.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"from datetime import datetime\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"possible_paths = [\n",
" Path.cwd().parent, # If cwd is module-4, go up one level to notebooks\n",
" Path.cwd(), # If cwd is already notebooks\n",
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"\n",
"# Create timestamped directory for this baseline\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"baseline_dir = artifacts_dir / \"module-4\" / f\"baseline-{timestamp}\"\n",
"baseline_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"\\nBaseline directory: {baseline_dir}\")\n",
"print(f\"All diagnostics will be saved here.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration\n",
"\n",
"Load and validate configuration from environment variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import ok, warn\n",
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
"\n",
"# Required configuration\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"\n",
"print(\"### Loading Configuration\\n\")\n",
"\n",
"config = {}\n",
"missing = []\n",
"\n",
"for var in required_vars:\n",
" value = os.environ.get(var, \"\").strip()\n",
" if not value:\n",
" missing.append(var)\n",
" config[var] = value\n",
"\n",
"if missing:\n",
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
" f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
"\n",
"# Optional but recommended\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"# Show cloud provider info\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"\n",
"print(f\"Cloud Provider: {provider.upper()}\")\n",
"print(f\"Region: {region}\")\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"if config[\"LANGSMITH_DOMAIN\"]:\n",
" print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Cluster State Snapshot\n",
"\n",
"Capture a complete snapshot of all resources in the namespace. This is your \"before\" picture.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Capturing Cluster State Snapshot\\n\")\n",
"\n",
"# Get all resources\n",
"print(\"1. Collecting all resources...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" snapshot_file = baseline_dir / \"all-resources.txt\"\n",
" with open(snapshot_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(f\"Saved resource snapshot to {snapshot_file.name}\")\n",
" print(f\" Resources captured: {len(result.stdout.splitlines())} lines\")\n",
"else:\n",
" warn(\"Could not capture resource snapshot\")\n",
"\n",
"# Get all resources as YAML (more detailed)\n",
"print(\"\\n2. Collecting detailed YAML...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"yaml\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" yaml_file = baseline_dir / \"all-resources.yaml\"\n",
" with open(yaml_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(f\"Saved detailed YAML to {yaml_file.name}\")\n",
"else:\n",
" warn(\"Could not capture detailed YAML\")\n"
]
},
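{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snapshot files saved above are most useful when diffed against a capture taken during an incident. The following is a minimal sketch, not part of the workshop helpers: `diff_snapshots` is a hypothetical function name, and the sample strings stand in for the contents of two `all-resources.txt` files.\n",
"\n",
"```python\n",
"import difflib\n",
"\n",
"\n",
"def diff_snapshots(before, after):\n",
"    \"\"\"Return unified-diff lines between two text snapshots.\"\"\"\n",
"    return list(\n",
"        difflib.unified_diff(\n",
"            before.splitlines(),\n",
"            after.splitlines(),\n",
"            fromfile=\"baseline\",\n",
"            tofile=\"current\",\n",
"            lineterm=\"\",\n",
"        )\n",
"    )\n",
"\n",
"\n",
"# Illustrative data only; real input would be read from the saved files.\n",
"baseline = \"pod-a Running\\npod-b Running\"\n",
"current = \"pod-a CrashLoopBackOff\\npod-b Running\"\n",
"\n",
"for line in diff_snapshots(baseline, current):\n",
"    print(line)\n",
"```\n",
"\n",
"An empty result means the two snapshots are textually identical. Columns such as pod age or restart counts change over time, so expect some noise when comparing real `kubectl get all` output.\n"
]
},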
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Key Deployments Description\n",
"\n",
"Get detailed information about key deployments.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Describing Key Deployments\\n\")\n",
"\n",
"# Get list of deployments\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" deployments = json.loads(result.stdout)\n",
" deployment_items = deployments.get(\"items\", [])\n",
" \n",
" if deployment_items:\n",
" ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
" \n",
" # Describe each deployment\n",
" for deployment in deployment_items:\n",
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
" print(f\"\\n3. Describing deployment: {name}\")\n",
" \n",
" result = run(\n",
" [\"kubectl\", \"describe\", \"deployment\", name, \"-n\", namespace],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" desc_file = baseline_dir / f\"deployment-{name}.txt\"\n",
" with open(desc_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" print(f\" ✅ Saved description to {desc_file.name}\")\n",
" else:\n",
" warn(f\"Could not describe deployment {name}\")\n",
" else:\n",
" warn(\"No deployments found\")\n",
"else:\n",
" warn(\"Could not list deployments\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Recent Events\n",
"\n",
"Capture recent events sorted by timestamp. Events often contain the first clues about what's happening.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Collecting Recent Events\\n\")\n",
"\n",
"# Get events sorted by timestamp\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by='.lastTimestamp'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" events_file = baseline_dir / \"events.txt\"\n",
" with open(events_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(f\"Saved events to {events_file.name}\")\n",
" \n",
" # Count events by type\n",
" lines = result.stdout.strip().split(\"\\n\")\n",
" if len(lines) > 1: # Header + events\n",
" event_count = len(lines) - 1\n",
" print(f\" Captured {event_count} event(s)\")\n",
" \n",
" # Show last few events\n",
" if event_count > 0:\n",
" print(\"\\n Last 5 events:\")\n",
" for line in lines[-5:]:\n",
" if line.strip():\n",
" print(f\" {line}\")\n",
" else:\n",
" print(\" No events found (this is normal for a healthy cluster)\")\n",
"else:\n",
" warn(\"Could not collect events\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Resource Usage\n",
"\n",
"Capture resource usage (CPU, memory) if metrics are available.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Collecting Resource Usage\\n\")\n",
"\n",
"# Top pods\n",
"print(\"1. Checking pod resource usage...\")\n",
"result = run(\n",
" [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" top_pods_file = baseline_dir / \"top-pods.txt\"\n",
" with open(top_pods_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(f\"Saved pod resource usage to {top_pods_file.name}\")\n",
" print(result.stdout)\n",
"else:\n",
" warn(\"Could not get pod resource usage (metrics server may not be available)\")\n",
" print(\" 💡 This is OK - metrics are optional for baseline collection\")\n",
"\n",
"# Top nodes (if available)\n",
"print(\"\\n2. Checking node resource usage...\")\n",
"result = run(\n",
" [\"kubectl\", \"top\", \"nodes\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" top_nodes_file = baseline_dir / \"top-nodes.txt\"\n",
" with open(top_nodes_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(f\"Saved node resource usage to {top_nodes_file.name}\")\n",
" print(result.stdout)\n",
"else:\n",
" warn(\"Could not get node resource usage (metrics server may not be available)\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Canonical Diagnostics Script\n",
"\n",
"**This is the most important step.** Run the official LangChain diagnostics script that Support expects.\n",
"\n",
"The script captures:\n",
"- Pod logs (all containers)\n",
"- Events (sorted by timestamp)\n",
"- Resource usage (CPU, memory)\n",
"- Configuration (deployments, services, ingress)\n",
"- Storage (PVCs, storage classes)\n",
"- Network (services, endpoints)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"import subprocess\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"# URL to the canonical script\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = baseline_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"print(f\"1. Downloading script from: {script_url}\")\n",
"try:\n",
" urllib.request.urlretrieve(script_url, script_path)\n",
" ok(f\"Downloaded script to {script_path.name}\")\n",
" \n",
" # Make executable\n",
" script_path.chmod(0o755)\n",
" \n",
" # Run the script\n",
" print(f\"\\n2. Running diagnostics script for namespace: {namespace}\")\n",
" print(\" (This may take a few minutes...)\")\n",
" \n",
" result = run(\n",
" [str(script_path), namespace],\n",
" check=False,\n",
" stream=True # Stream output so user can see progress\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" ok(\"Diagnostics script completed successfully\")\n",
" \n",
" # The script creates a tarball - find it\n",
" diagnostics_tarball = None\n",
" for file in baseline_dir.parent.iterdir():\n",
" if file.name.startswith(\"langsmith-debug-\") and file.suffix == \".tar.gz\":\n",
" diagnostics_tarball = file\n",
" break\n",
" \n",
" if diagnostics_tarball:\n",
" # Move it to our baseline directory\n",
" target_path = baseline_dir / diagnostics_tarball.name\n",
" diagnostics_tarball.rename(target_path)\n",
" ok(f\"Diagnostics bundle saved to: {target_path.name}\")\n",
" print(f\" Size: {target_path.stat().st_size / 1024 / 1024:.2f} MB\")\n",
" else:\n",
" warn(\"Could not find diagnostics tarball (check script output above)\")\n",
" else:\n",
" warn(f\"Diagnostics script returned non-zero exit code: {result.returncode}\")\n",
" print(\" Check the output above for errors\")\n",
" print(\" 💡 The script may still have collected useful information\")\n",
" \n",
"except urllib.request.URLError as e:\n",
" warn(f\"Could not download diagnostics script: {e}\")\n",
" print(\" 💡 You can download it manually and run it:\")\n",
" print(f\" curl -O {script_url}\")\n",
" print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
" print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n",
"except Exception as e:\n",
" warn(f\"Error running diagnostics script: {e}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Basic Health Check\n",
"\n",
"Perform a basic HTTP check to verify the LangSmith endpoint is reachable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import urllib3\n",
"\n",
"# Disable SSL warnings for self-signed certs\n",
"urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
"\n",
"print(\"### Testing Endpoint Reachability\\n\")\n",
"\n",
"# Determine endpoint URL\n",
"if config[\"LANGSMITH_DOMAIN\"]:\n",
" test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
"else:\n",
" # Try to get from ingress\n",
" result = run(\n",
" [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" ingresses = json.loads(result.stdout)\n",
" for ingress in ingresses.get(\"items\", []):\n",
" rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
" for rule in rules:\n",
" host = rule.get(\"host\", \"\")\n",
" if host:\n",
" test_url = f\"https://{host}\"\n",
" break\n",
" else:\n",
" test_url = None\n",
"\n",
"if test_url:\n",
" print(f\"Testing: {test_url}\")\n",
" try:\n",
" response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
" \n",
" health_file = baseline_dir / \"endpoint-health.txt\"\n",
" with open(health_file, \"w\") as f:\n",
" f.write(f\"URL: {test_url}\\n\")\n",
" f.write(f\"Status Code: {response.status_code}\\n\")\n",
" f.write(f\"Response Headers:\\n{json.dumps(dict(response.headers), indent=2)}\\n\")\n",
" \n",
" if response.status_code in [200, 302, 401, 403]:\n",
" ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
" print(f\" Response saved to {health_file.name}\")\n",
" else:\n",
" warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
" except requests.exceptions.SSLError:\n",
" warn(\"SSL verification failed (may be self-signed certificate)\")\n",
" print(\" 💡 This is OK for testing. In production, use proper TLS certificates.\")\n",
" except requests.exceptions.RequestException as e:\n",
" warn(f\"Could not reach endpoint: {e}\")\n",
" print(\" 💡 Endpoint may still be provisioning or DNS not configured\")\n",
"else:\n",
" warn(\"No endpoint URL available for testing\")\n",
" print(\" 💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. What Good Looks Like\n",
"\n",
"Quick validation checks to confirm the baseline is healthy.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._validation import ok, warn\n",
"\n",
"print(\"### Quick Health Validation\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"healthy_pods = 0\n",
"unhealthy_pods = []\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" for pod in pods.get(\"items\", []):\n",
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
" phase = pod.get(\"status\", {}).get(\"phase\", \"\")\n",
" container_statuses = pod.get(\"status\", {}).get(\"containerStatuses\", [])\n",
" \n",
" is_ready = True\n",
" for cs in container_statuses:\n",
" if not cs.get(\"ready\", False):\n",
" is_ready = False\n",
" break\n",
" \n",
" if phase == \"Running\" and is_ready:\n",
" healthy_pods += 1\n",
" else:\n",
" unhealthy_pods.append((name, phase, is_ready))\n",
" \n",
" if unhealthy_pods:\n",
" warn(f\"Found {len(unhealthy_pods)} pod(s) that are not healthy:\")\n",
" for name, phase, ready in unhealthy_pods:\n",
" print(f\" - {name}: phase={phase}, ready={ready}\")\n",
" else:\n",
" ok(f\"All {healthy_pods} pod(s) are healthy and ready\")\n",
"else:\n",
" warn(\"Could not check pod status\")\n",
"\n",
"# Check for CrashLoopBackOff\n",
"if unhealthy_pods:\n",
" crash_loops = [name for name, phase, _ in unhealthy_pods if phase == \"CrashLoopBackOff\"]\n",
" if crash_loops:\n",
" warn(f\"Found {len(crash_loops)} pod(s) in CrashLoopBackOff:\")\n",
" for name in crash_loops:\n",
" print(f\" - {name}\")\n",
" print(\" 💡 Check pod logs to understand why they're crashing\")\n",
"\n",
"# Check for Pending pods\n",
"pending = [name for name, phase, _ in unhealthy_pods if phase == \"Pending\"]\n",
"if pending:\n",
" warn(f\"Found {len(pending)} pod(s) in Pending state:\")\n",
" for name in pending:\n",
" print(f\" - {name}\")\n",
" print(\" 💡 Check events and resource availability\")\n",
"\n",
"print(\"\\n### Baseline Summary\\n\")\n",
"print(f\"✅ Baseline captured at: {timestamp}\")\n",
"print(f\"📁 Baseline directory: {baseline_dir}\")\n",
"print(f\"📊 Resources captured:\")\n",
"print(f\" - Cluster state snapshot\")\n",
"print(f\" - Deployment descriptions\")\n",
"print(f\" - Recent events\")\n",
"print(f\" - Resource usage (if available)\")\n",
"print(f\" - Canonical diagnostics bundle\")\n",
"print(f\" - Endpoint health check\")\n",
"\n",
"ok(\"Baseline collection complete!\")\n",
"print(\"\\n💡 Use this baseline as your reference point for all failure labs.\")\n",
"print(\" Compare future diagnostics to this baseline to identify what changed.\")\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,820 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - PostgreSQL\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug PostgreSQL connectivity failures in LangSmith.**\n",
"\n",
"PostgreSQL is LangSmith's primary metadata store. It holds:\n",
"- User accounts and workspaces\n",
"- Project definitions\n",
"- API keys and permissions\n",
"- Trace metadata (not the traces themselves, which go to ClickHouse)\n",
"\n",
"**When PostgreSQL fails, you'll see:**\n",
"- API endpoints return 5xx errors\n",
"- Login/authentication may fail\n",
"- UI may load but actions fail\n",
"- Connection exhaustion patterns in logs\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how PostgreSQL failures manifest\n",
"2. Practice collecting diagnostics for database issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
" Path.cwd().parent,\n",
" Path.cwd(),\n",
" Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What This Service Does for LangSmith\n",
"\n",
"PostgreSQL is LangSmith's **primary metadata store**. It holds:\n",
"\n",
"- **User accounts and authentication data**\n",
"- **Workspaces and projects** (organizational structure)\n",
"- **API keys and permissions** (access control)\n",
"- **Trace metadata** (not the trace data itself, which goes to ClickHouse)\n",
"- **Evaluation results and feedback**\n",
"\n",
"**Why it matters:**\n",
"- Without PostgreSQL, users can't log in\n",
"- API calls fail (no authentication, no project lookups)\n",
"- UI loads but can't perform actions\n",
"- All LangSmith functionality depends on it\n",
"\n",
"**How LangSmith connects:**\n",
"- Connection string stored in Kubernetes Secrets\n",
"- Connection pool managed by application\n",
"- Connection limits are critical (PostgreSQL has max connections)\n"
]
},
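{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the connection settings live in a Kubernetes Secret, it helps to know that values under a Secret's `data` field are base64-encoded. The following is a minimal sketch, assuming the Secret has already been fetched as JSON (for example with `kubectl get secret <name> -n <namespace> -o json`). The helper `decode_secret_data` and the key names `postgres_host` and `postgres_password` are hypothetical and are not taken from the LangSmith chart.\n",
"\n",
"```python\n",
"import base64\n",
"\n",
"\n",
"def decode_secret_data(secret, mask_keys=(\"password\",)):\n",
"    \"\"\"Decode a Secret's base64 `data` values, masking keys that look sensitive.\"\"\"\n",
"    decoded = {}\n",
"    for key, value in secret.get(\"data\", {}).items():\n",
"        text = base64.b64decode(value).decode(\"utf-8\", errors=\"replace\")\n",
"        if any(marker in key.lower() for marker in mask_keys):\n",
"            text = \"***\"\n",
"        decoded[key] = text\n",
"    return decoded\n",
"\n",
"\n",
"# Hypothetical Secret, shaped like `kubectl get secret ... -o json` output.\n",
"example = {\n",
"    \"data\": {\n",
"        \"postgres_host\": base64.b64encode(b\"db.example.internal\").decode(),\n",
"        \"postgres_password\": base64.b64encode(b\"s3cret\").decode(),\n",
"    }\n",
"}\n",
"\n",
"print(decode_secret_data(example))\n",
"```\n",
"\n",
"Masking is keyed on the field name only. A full connection URL can embed a password under any key, so treat all decoded output as sensitive and keep it out of shared diagnostics bundles.\n"
]
},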
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Expected Symptoms When PostgreSQL Fails\n",
"\n",
"**What you'll see:**\n",
"\n",
"1. **API 5xx errors:**\n",
" - `/api/v1/...` endpoints return 500 or 503\n",
" - Error messages mention \"database\" or \"connection\"\n",
"\n",
"2. **Login failures:**\n",
" - Users can't authenticate\n",
" - OIDC/SAML may work (redirects) but session creation fails\n",
"\n",
"3. **UI loads but actions fail:**\n",
" - Pages render (static content)\n",
" - API calls fail (can't load projects, traces, etc.)\n",
"\n",
"4. **Log patterns:**\n",
" - Connection timeout errors\n",
" - \"connection refused\" or \"connection reset\"\n",
" - \"too many connections\" (if connection pool exhausted)\n",
" - \"authentication failed\" (if credentials wrong)\n",
"\n",
"**Timeline:**\n",
"- Symptoms appear within seconds of failure\n",
"- API calls start failing immediately\n",
"- Existing connections may work briefly, then fail\n"
]
},
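{
"cell_type": "markdown",
"metadata": {},
"source": [
"The log patterns above map onto the three causes this lab asks you to tell apart: credentials, connection limits, and network reachability. The following is a minimal sketch of that triage, not an official tool. `classify_log_line` and the `PATTERNS` table are hypothetical, and the substrings are common PostgreSQL and client-driver messages rather than confirmed LangSmith log output.\n",
"\n",
"```python\n",
"# Substrings are matched case-insensitively; the first matching category wins.\n",
"PATTERNS = {\n",
"    \"credentials\": [\"password authentication failed\"],\n",
"    \"connection_limit\": [\"too many clients\", \"too many connections\"],\n",
"    \"network\": [\n",
"        \"connection refused\",\n",
"        \"connection reset\",\n",
"        \"timed out\",\n",
"        \"timeout\",\n",
"        \"could not translate host name\",\n",
"        \"no route to host\",\n",
"    ],\n",
"}\n",
"\n",
"\n",
"def classify_log_line(line):\n",
"    \"\"\"Return the failure category suggested by a single log line.\"\"\"\n",
"    lowered = line.lower()\n",
"    for category, needles in PATTERNS.items():\n",
"        if any(needle in lowered for needle in needles):\n",
"            return category\n",
"    return \"unknown\"\n",
"\n",
"\n",
"print(classify_log_line('FATAL: password authentication failed for user \"langsmith\"'))\n",
"```\n",
"\n",
"A substring table like this is only a first pass. Confirm the category against events and the Secret contents before remediating.\n"
]
},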
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong Database Password**\n",
"- Modify the PostgreSQL password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, connection refused\n",
"\n",
"**Option B: Wrong Database Host**\n",
"- Point connection string to non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures\n",
"\n",
"**Option C: Network Isolation (if NetworkPolicy supported)**\n",
"- Apply NetworkPolicy blocking egress to PostgreSQL\n",
"- Symptoms: Connection timeout, no route to host\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option D: Remove Secret Entirely**\n",
"- Delete the PostgreSQL connection secret\n",
"- Symptoms: Pods crash on startup, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" healthy = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" print(f\"Pods: {healthy}/{total} running\")\n",
" \n",
" if healthy == total and total > 0:\n",
" ok(\"Baseline looks healthy\")\n",
" else:\n",
" warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
" warn(\"Could not check pod status\")\n",
"\n",
"# Check for PostgreSQL secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"postgres_secrets = []\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" if \"postgres\" in name.lower() or \"database\" in name.lower() or \"db\" in name.lower():\n",
" postgres_secrets.append(name)\n",
"\n",
"if postgres_secrets:\n",
" ok(f\"Found {len(postgres_secrets)} PostgreSQL-related secret(s)\")\n",
" for secret_name in postgres_secrets:\n",
" print(f\" - {secret_name}\")\n",
"else:\n",
" warn(\"No PostgreSQL secrets found\")\n",
" print(\" 💡 PostgreSQL connection may be configured differently\")\n"
]
},
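{
"cell_type": "markdown",
"metadata": {},
"source": [
"A pod's `phase` can remain `Running` while its containers crash-loop, so the count above may look healthy during an incident. The cell below is a stricter sketch that assumes the standard Kubernetes pod JSON, where `status.conditions` contains a `Ready` entry. `pod_is_ready` is a hypothetical helper and `example_pod` is made up for illustration. Completed Job pods also report not-ready, so filter them out if your deployment runs Jobs.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pod_is_ready(pod):\n",
"    \"\"\"True when the pod reports a Ready condition with status 'True'.\"\"\"\n",
"    conditions = pod.get(\"status\", {}).get(\"conditions\", [])\n",
"    return any(c.get(\"type\") == \"Ready\" and c.get(\"status\") == \"True\"\n",
"               for c in conditions)\n",
"\n",
"# Made-up example: phase is Running, but the pod is not Ready\n",
"example_pod = {\n",
"    \"metadata\": {\"name\": \"example-backend\"},\n",
"    \"status\": {\n",
"        \"phase\": \"Running\",\n",
"        \"conditions\": [{\"type\": \"Ready\", \"status\": \"False\"}],\n",
"    },\n",
"}\n",
"print(pod_is_ready(example_pod))  # False\n",
"\n",
"# With the parsed `kubectl get pods -o json` output from the cell above:\n",
"# not_ready = [p[\"metadata\"][\"name\"] for p in pods.get(\"items\", []) if not pod_is_ready(p)]\n"
]
},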
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong Database Password\n",
"# This cell modifies the PostgreSQL password secret to an invalid value\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Find PostgreSQL secret (look for common names)\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"postgres_secret_name = None\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" # Common patterns: postgres, database, db, langsmith-db\n",
" if any(keyword in name.lower() for keyword in [\"postgres\", \"database\", \"db\"]):\n",
" # Check if it has password-related keys\n",
" data = secret.get(\"data\", {})\n",
" if any(key in data for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\"]):\n",
" postgres_secret_name = name\n",
" break\n",
"\n",
"if not postgres_secret_name:\n",
" raise RuntimeError(\"❌ Could not find PostgreSQL secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found PostgreSQL secret: {postgres_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
" check=True,\n",
" stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"postgres-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
" raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find password key (could be password, POSTGRES_PASSWORD, DB_PASSWORD, etc.)\n",
"password_key = None\n",
"for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]:\n",
" if key in secret_data[\"data\"]:\n",
" password_key = key\n",
" break\n",
"\n",
"if not password_key:\n",
" raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"postgres-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
" yaml.dump(secret_data, f)\n",
"\n",
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
"print(f\" This will set an invalid password in secret: {postgres_secret_name}\")\n",
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"# \n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Failure injection applied - PostgreSQL password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Pod logs for connection errors\n",
"2. API endpoint responses\n",
"3. UI behavior\n",
"4. Events for pod restarts\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"postgres-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" print(result.stdout)\n",
" \n",
" # Check for restarts\n",
" lines = result.stdout.split(\"\\n\")\n",
" restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
" if restarts:\n",
" print(\"\\n Pod restart counts:\")\n",
" for line in restarts[1:]: # Skip header\n",
" if line.strip():\n",
" parts = line.split()\n",
" if len(parts) > 3:\n",
" print(f\" {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by='.lastTimestamp'\", \"--field-selector=type!=Normal\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(incident_dir / \"events.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" if result.stdout.strip():\n",
" print(\" Recent warning/error events:\")\n",
" for line in result.stdout.split(\"\\n\")[-5:]:\n",
" if line.strip():\n",
" print(f\" {line}\")\n",
"\n",
"# 3. Check API pod logs for database errors\n",
"print(\"\\n3. Checking API pod logs for database errors...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if api_pod:\n",
" result = run(\n",
" [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=50\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" logs_file = incident_dir / f\"api-pod-{api_pod}-logs.txt\"\n",
" with open(logs_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" \n",
" # Look for database-related errors\n",
" error_keywords = [\"database\", \"postgres\", \"connection\", \"timeout\", \"refused\", \"authentication\"]\n",
" error_lines = [l for l in result.stdout.split(\"\\n\") \n",
" if any(kw in l.lower() for kw in error_keywords)]\n",
" \n",
" if error_lines:\n",
" print(\" Found database-related errors:\")\n",
" for line in error_lines[-5:]:\n",
" print(f\" {line}\")\n",
" else:\n",
" print(\" No obvious database errors in recent logs\")\n",
"else:\n",
" warn(\"Could not find API pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
" urllib.request.urlretrieve(script_url, script_path)\n",
" script_path.chmod(0o755)\n",
" \n",
" print(f\"Running diagnostics script for namespace: {namespace}\")\n",
" result = run(\n",
" [str(script_path), namespace],\n",
" check=False,\n",
" stream=True\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" ok(\"Diagnostics script completed\")\n",
" \n",
" # Find and move tarball\n",
" for file in incident_dir.parent.iterdir():\n",
" if file.name.startswith(\"langsmith-debug-\") and file.suffix == \".tar.gz\":\n",
" target_path = incident_dir / file.name\n",
" file.rename(target_path)\n",
" ok(f\"Diagnostics bundle: {target_path.name}\")\n",
" break\n",
" else:\n",
" warn(\"Diagnostics script had errors (check output above)\")\n",
" \n",
"except Exception as e:\n",
" warn(f\"Could not run diagnostics script: {e}\")\n",
" print(\" 💡 You can run it manually:\")\n",
" print(f\" curl -O {script_url}\")\n",
" print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
" print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for PostgreSQL issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check pod logs for connection errors:\")\n",
"print(f\" kubectl logs -n {namespace} <pod-name> | grep -i 'database\\\\|postgres\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify secret exists and has correct keys:\")\n",
"print(f\" kubectl get secret {postgres_secret_name} -n {namespace} -o yaml\")\n",
"print(\" (Don't print the actual values - they're base64 encoded)\")\n",
"print()\n",
"\n",
"print(\"3. Check for pod restarts (indicates startup failures):\")\n",
"print(f\" kubectl get pods -n {namespace}\")\n",
"print()\n",
"\n",
"print(\"4. Test database connectivity from a pod (if possible):\")\n",
"print(\" kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \\\\\")\n",
"print(\" psql -h <db-host> -U <user> -d <database>\")\n",
"print()\n",
"\n",
"print(\"5. Check events for authentication/connection errors:\")\n",
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" ok(f\"Secret '{postgres_secret_name}' still exists\")\n",
" secret_data = json.loads(result.stdout)\n",
" keys = list(secret_data.get(\"data\", {}).keys())\n",
" print(f\" Secret keys: {', '.join(keys)}\")\n",
"else:\n",
" warn(f\"Secret '{postgres_secret_name}' not found!\")\n",
"\n",
"# Check for pods with database connection env vars\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" db_related_pods = []\n",
" for pod in pods.get(\"items\", []):\n",
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
" for container in containers:\n",
" env = container.get(\"env\", [])\n",
" db_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
" for kw in [\"DB\", \"POSTGRES\", \"DATABASE\"])]\n",
" if db_env:\n",
" db_related_pods.append(name)\n",
" break\n",
" \n",
" if db_related_pods:\n",
" print(f\"\\n Pods with database environment variables:\")\n",
" for pod_name in set(db_related_pods):\n",
" print(f\" - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# if backup_file.exists():\n",
"# print(f\"Restoring original secret from: {backup_file.name}\")\n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Original secret restored\")\n",
"# print(\"\\n💡 Pods will restart to pick up the correct secret.\")\n",
"# print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# import time\n",
"# print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"# time.sleep(60)\n",
"# else:\n",
"# warn(f\"Backup file not found: {backup_file}\")\n",
"# print(\" 💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
"print(f\" Backup file: {backup_file.name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" running = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" \n",
" if running == total and total > 0:\n",
" ok(f\"All {total} pod(s) are running\")\n",
" else:\n",
" warn(f\"Only {running}/{total} pod(s) running\")\n",
" print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in logs\n",
"print(\"\\nChecking for recent errors...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if api_pod:\n",
" result = run(\n",
" [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=20\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" error_keywords = [\"error\", \"fail\", \"database\", \"postgres\", \"connection\"]\n",
" recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
" if any(kw in l.lower() for kw in error_keywords)]\n",
" \n",
" if recent_errors:\n",
" warn(\"Still seeing some errors in logs:\")\n",
" for line in recent_errors[-3:]:\n",
" print(f\" {line}\")\n",
" else:\n",
" ok(\"No recent errors in API logs\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a PostgreSQL issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **PostgreSQL connection details:**\n",
" - Host/endpoint (redacted)\n",
" - Database name\n",
" - Username (redacted)\n",
" - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
" - Full error text (not just \"connection failed\")\n",
" - Timestamps of first occurrence\n",
"4. **Recent changes:**\n",
" - Secret rotations\n",
" - Database migrations\n",
" - Network policy changes\n",
"5. **Connection pool status:**\n",
" - Current connections vs. max connections\n",
" - Connection pool exhaustion patterns\n",
"6. **Database health (if accessible):**\n",
" - PostgreSQL version\n",
" - Active connections\n",
" - Lock contention\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Pod logs with database errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- Database endpoint connectivity test\n",
"- Connection pool metrics (if available)\n",
"- PostgreSQL logs (if accessible via cloud provider)\n"
]
},
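{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several of the items above are marked as redacted. One way to scrub log excerpts before attaching them to a ticket is sketched below. The `redact` helper and its two rules are hypothetical and cover only credentials embedded in connection URLs and `password=` fields; hostnames and any other sensitive values still need a manual check, and the sample string is made up.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Hypothetical redaction rules; extend them for your own environment\n",
"REDACTIONS = [\n",
"    # user:password@ inside connection URLs\n",
"    (re.compile(r\"(?<=://)[^/\\s:@]+:[^/\\s@]+@\"), \"<user>:<password>@\"),\n",
"    # key=value style password fields\n",
"    (re.compile(r\"(?i)(password\\s*=\\s*)\\S+\"), r\"\\1<redacted>\"),\n",
"]\n",
"\n",
"def redact(text):\n",
"    \"\"\"Apply each redaction rule in order and return the scrubbed text.\"\"\"\n",
"    for pattern, replacement in REDACTIONS:\n",
"        text = pattern.sub(replacement, text)\n",
"    return text\n",
"\n",
"# Made-up sample for illustration\n",
"print(redact(\"dsn=postgresql://admin:s3cret@db.internal:5432/langsmith password=s3cret\"))\n"
]
},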
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **PostgreSQL failures manifest quickly** - API calls fail within seconds\n",
"2. **Logs are your friend** - Connection errors appear in pod logs immediately\n",
"3. **Secrets matter** - Wrong credentials cause authentication failures\n",
"4. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"5. **Diagnostics bundle is essential** - Support needs it for root cause analysis\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Changing multiple things at once (hard to identify root cause)\n",
"- ❌ Not collecting diagnostics before remediation\n",
"- ❌ Ignoring connection pool limits\n",
"- ❌ Not testing database connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the Redis, ClickHouse, or Blob Storage failure labs\n",
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,836 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - Redis\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug Redis connectivity failures in LangSmith.**\n",
"\n",
"Redis is LangSmith's **cache and job queue**. It handles:\n",
"- Job queue for asynchronous trace processing\n",
"- Caching for frequently accessed data\n",
"- Rate limiting and session management\n",
"- Worker coordination\n",
"\n",
"**When Redis fails, you'll see:**\n",
"- Intermittent ingestion issues\n",
"- Latency spikes and retries\n",
"- Worker backlog (jobs piling up)\n",
"- Traces may be delayed or missing\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how Redis failures manifest\n",
"2. Practice collecting diagnostics for cache/queue issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
" Path.cwd().parent,\n",
" Path.cwd(),\n",
" Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What This Service Does for LangSmith\n",
"\n",
"Redis is LangSmith's **cache and job queue**. It handles:\n",
"\n",
"- **Job queue for asynchronous processing:**\n",
" - Workers pull trace processing jobs from Redis\n",
" - Jobs are queued when traces arrive via API\n",
" - Queue backlog indicates processing delays\n",
"\n",
"- **Caching:**\n",
" - Frequently accessed data (project metadata, user info)\n",
" - Reduces load on PostgreSQL\n",
" - Improves response times\n",
"\n",
"- **Rate limiting and session management:**\n",
" - API rate limiting\n",
" - Session storage (if configured)\n",
"\n",
"- **Worker coordination:**\n",
" - Distributed locking\n",
" - Task distribution\n",
"\n",
"**Why it matters:**\n",
"- Without Redis, workers can't process traces\n",
"- Job queue fills up, causing delays\n",
"- Cache misses increase load on PostgreSQL\n",
"- Ingestion becomes unreliable\n",
"\n",
"**How LangSmith connects:**\n",
"- Connection string stored in Kubernetes Secrets\n",
"- Workers connect to Redis to pull jobs\n",
"- API servers use Redis for caching\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Expected Symptoms When Redis Fails\n",
"\n",
"**What you'll see:**\n",
"\n",
"1. **Intermittent ingestion issues:**\n",
" - Some traces process, others don't\n",
" - Inconsistent behavior (works sometimes, fails other times)\n",
" - Retries visible in logs\n",
"\n",
"2. **Latency spikes:**\n",
" - API responses slow down\n",
" - Worker processing delays\n",
" - Timeout errors\n",
"\n",
"3. **Worker backlog:**\n",
" - Jobs piling up in queue\n",
" - Workers unable to pull new jobs\n",
" - Queue length increasing\n",
"\n",
"4. **Log patterns:**\n",
" - Connection timeout errors\n",
" - \"connection refused\" or \"connection reset\"\n",
" - \"NOAUTH Authentication required\" (if password wrong)\n",
" - Retry attempts in worker logs\n",
" - Cache miss patterns\n",
"\n",
"**Timeline:**\n",
"- Symptoms may be intermittent (connection pool retries)\n",
"- Worker backlog builds over time\n",
"- Cache misses cause cascading delays\n",
"- Full failure if connection pool exhausted\n"
]
},
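{
"cell_type": "markdown",
"metadata": {},
"source": [
"To practice telling credential, connection, and network problems apart, the log patterns above can be tallied by likely cause. This is a rough sketch under stated assumptions: `summarize_redis_errors` and the cause labels are hypothetical, the `wrongpass` entry assumes a Redis 6+ server, and the sample lines are made up. Substring matching is a triage aid only and does not replace reading the full error.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Hypothetical substring-to-cause mapping based on the patterns listed above\n",
"REDIS_CAUSES = {\n",
"    \"noauth\": \"credentials\",\n",
"    \"wrongpass\": \"credentials\",\n",
"    \"connection refused\": \"connection\",\n",
"    \"connection reset\": \"connection\",\n",
"    \"timeout\": \"network\",\n",
"    \"timed out\": \"network\",\n",
"}\n",
"\n",
"def summarize_redis_errors(lines):\n",
"    \"\"\"Count likely causes across log lines; each line counts at most once.\"\"\"\n",
"    counts = Counter()\n",
"    for line in lines:\n",
"        lowered = line.lower()\n",
"        for needle, cause in REDIS_CAUSES.items():\n",
"            if needle in lowered:\n",
"                counts[cause] += 1\n",
"                break\n",
"    return counts\n",
"\n",
"# Made-up sample lines for illustration only\n",
"sample_lines = [\n",
"    \"redis error: NOAUTH Authentication required\",\n",
"    \"redis error: connection refused\",\n",
"    \"worker: job fetch timed out, retrying\",\n",
"    \"worker: job completed\",\n",
"]\n",
"print(summarize_redis_errors(sample_lines).most_common())\n"
]
},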
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong Redis Password**\n",
"- Modify the Redis password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, connection refused, intermittent failures\n",
"\n",
"**Option B: Block Egress to Redis Endpoint**\n",
"- Apply NetworkPolicy blocking egress to Redis (if NetworkPolicy supported)\n",
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option C: Wrong Redis Host/Endpoint**\n",
"- Point connection string to non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" healthy = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" print(f\"Pods: {healthy}/{total} running\")\n",
" \n",
" if healthy == total and total > 0:\n",
" ok(\"Baseline looks healthy\")\n",
" else:\n",
" warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
" warn(\"Could not check pod status\")\n",
"\n",
"# Check for Redis secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"redis_secrets = []\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" if \"redis\" in name.lower() or \"cache\" in name.lower():\n",
" redis_secrets.append(name)\n",
"\n",
"if redis_secrets:\n",
" ok(f\"Found {len(redis_secrets)} Redis-related secret(s)\")\n",
" for secret_name in redis_secrets:\n",
" print(f\" - {secret_name}\")\n",
"else:\n",
" warn(\"No Redis secrets found\")\n",
" print(\" 💡 Redis connection may be configured differently\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong Redis Password\n",
"# This cell modifies the Redis password secret to an invalid value\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Find Redis secret (look for common names)\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"redis_secret_name = None\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" # Common patterns: redis, cache\n",
" if any(keyword in name.lower() for keyword in [\"redis\", \"cache\"]):\n",
" # Check if it has password-related keys\n",
" data = secret.get(\"data\", {})\n",
" if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\"]):\n",
" redis_secret_name = name\n",
" break\n",
"\n",
"if not redis_secret_name:\n",
" raise RuntimeError(\"❌ Could not find Redis secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found Redis secret: {redis_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
" check=True,\n",
" stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"redis-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
" raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find password key (could be password, REDIS_PASSWORD, CACHE_PASSWORD, etc.)\n",
"password_key = None\n",
"for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]:\n",
" if key in secret_data[\"data\"]:\n",
" password_key = key\n",
" break\n",
"\n",
"if not password_key:\n",
" raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"redis-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
" yaml.dump(secret_data, f)\n",
"\n",
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
"print(f\" This will set an invalid password in secret: {redis_secret_name}\")\n",
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"# \n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Failure injection applied - Redis password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Worker pod logs for Redis connection errors\n",
"2. Queue backlog (if visible)\n",
"3. Worker retry patterns\n",
"4. Latency in API responses\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"redis-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    print(result.stdout)\n",
"\n",
"    # Report restart counts: the first line is the header, RESTARTS is the 4th column\n",
"    rows = [l for l in result.stdout.split(\"\\n\")[1:] if l.strip()]\n",
"    if rows:\n",
"        print(\"\\n   Pod restart counts:\")\n",
"        for line in rows:\n",
"            parts = line.split()\n",
"            if len(parts) > 3:\n",
"                print(f\"     {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"# Arguments are passed as a list, so the sort key needs no shell quoting\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    if result.stdout.strip():\n",
"        print(\"   Recent warning/error events:\")\n",
"        for line in result.stdout.split(\"\\n\")[-5:]:\n",
"            if line.strip():\n",
"                print(f\"     {line}\")\n",
"\n",
"# 3. Check worker pod logs for Redis errors\n",
"print(\"\\n3. Checking worker pod logs for Redis errors...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
"        with open(logs_file, \"w\") as f:\n",
"            f.write(result.stdout)\n",
"\n",
"        # Look for Redis-related errors (keywords are lowercase because lines are lowercased before matching)\n",
"        error_keywords = [\"redis\", \"cache\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"noauth\", \"retry\"]\n",
"        error_lines = [l for l in result.stdout.split(\"\\n\")\n",
"                       if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if error_lines:\n",
"            print(\"   Found Redis-related errors:\")\n",
"            for line in error_lines[-5:]:\n",
"                print(f\"     {line}\")\n",
"        else:\n",
"            print(\"   No obvious Redis errors in recent logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    script_path.chmod(0o755)\n",
"\n",
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed\")\n",
"\n",
"        # Find and move tarball (Path.suffix would only be \".gz\", so match on the full name)\n",
"        for file in incident_dir.parent.iterdir():\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                target_path = incident_dir / file.name\n",
"                file.rename(target_path)\n",
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
"                break\n",
"    else:\n",
"        warn(\"Diagnostics script had errors (check output above)\")\n",
"\n",
"except Exception as e:\n",
"    warn(f\"Could not run diagnostics script: {e}\")\n",
"    print(\"   💡 You can run it manually:\")\n",
"    print(f\"   curl -O {script_url}\")\n",
"    print(\"   chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\"   ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for Redis issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check worker pod logs for Redis connection errors:\")\n",
"print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'redis\\\\|cache\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify secret exists and has correct keys:\")\n",
"print(f\"   kubectl get secret {redis_secret_name} -n {namespace} -o yaml\")\n",
"print(\"   (Don't share this output - the values are only base64 encoded, not encrypted)\")\n",
"print()\n",
"\n",
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
"print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
"print()\n",
"\n",
"print(\"4. Test Redis connectivity from a pod (if possible):\")\n",
"print(\"   kubectl run -it --rm debug --image=redis:7 --restart=Never -- \\\\\")\n",
"print(\"     redis-cli -h <redis-host> -p <port> -a <password> ping\")\n",
"print()\n",
"\n",
"print(\"5. Check events for connection/authentication errors:\")\n",
"print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    ok(f\"Secret '{redis_secret_name}' still exists\")\n",
"    secret_data = json.loads(result.stdout)\n",
"    keys = list(secret_data.get(\"data\", {}).keys())\n",
"    print(f\"   Secret keys: {', '.join(keys)}\")\n",
"else:\n",
"    warn(f\"Secret '{redis_secret_name}' not found!\")\n",
"\n",
"# Check for pods with Redis connection env vars\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    redis_related_pods = []\n",
"    for pod in pods.get(\"items\", []):\n",
"        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
"        for container in containers:\n",
"            env = container.get(\"env\", [])\n",
"            redis_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper()\n",
"                                               for kw in [\"REDIS\", \"CACHE\"])]\n",
"            if redis_env:\n",
"                redis_related_pods.append(name)\n",
"                break\n",
"\n",
"    if redis_related_pods:\n",
"        print(\"\\n   Pods with Redis environment variables:\")\n",
"        for pod_name in sorted(set(redis_related_pods)):\n",
"            print(f\"     - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
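{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# import time\n",
"# import yaml\n",
"#\n",
"# if backup_file.exists():\n",
"#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
"#     # Drop server-managed metadata: the backup's resourceVersion is stale after the\n",
"#     # injection, and applying it as-is can be rejected with a conflict error\n",
"#     restore_data = yaml.safe_load(backup_file.read_text())\n",
"#     for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\"):\n",
"#         restore_data.get(\"metadata\", {}).pop(field, None)\n",
"#     restore_file = backup_file.with_name(f\"restore-{backup_file.name}\")\n",
"#     restore_file.write_text(yaml.dump(restore_data))\n",
"#\n",
"#     result = run(\n",
"#         [\"kubectl\", \"apply\", \"-f\", str(restore_file)],\n",
"#         check=True,\n",
"#         stream=True\n",
"#     )\n",
"#\n",
"#     ok(\"Original secret restored\")\n",
"#     print(\"\\n💡 Pods need to restart to pick up the restored secret.\")\n",
"#     print(\"   This may take 1-2 minutes. Monitor pod status:\")\n",
"#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
"#\n",
"#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"#     time.sleep(60)\n",
"# else:\n",
"#     warn(f\"Backup file not found: {backup_file}\")\n",
"#     print(\"   💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
"print(f\"   Backup file: {backup_file.name}\")\n"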
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    running = sum(1 for p in pods.get(\"items\", [])\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(pods.get(\"items\", []))\n",
"\n",
"    if running == total and total > 0:\n",
"        ok(f\"All {total} pod(s) are running\")\n",
"    else:\n",
"        warn(f\"Only {running}/{total} pod(s) running\")\n",
"        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in worker logs\n",
"print(\"\\nChecking for recent errors in worker logs...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"# The jsonpath template carries literal quotes, so strip them from the output\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        error_keywords = [\"error\", \"fail\", \"redis\", \"cache\", \"connection\"]\n",
"        recent_errors = [l for l in result.stdout.split(\"\\n\")\n",
"                         if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if recent_errors:\n",
"            warn(\"Still seeing some errors in logs:\")\n",
"            for line in recent_errors[-3:]:\n",
"                print(f\"   {line}\")\n",
"        else:\n",
"            ok(\"No recent errors in worker logs\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a Redis issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **Redis connection details:**\n",
"   - Host/endpoint (redacted)\n",
"   - Port\n",
"   - Password (redacted)\n",
"   - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
"   - Full error text (not just \"connection failed\")\n",
"   - Timestamps of first occurrence\n",
"   - Retry patterns\n",
"4. **Recent changes:**\n",
"   - Secret rotations\n",
"   - Network policy changes\n",
"   - Redis configuration changes\n",
"5. **Queue status (if accessible):**\n",
"   - Queue length\n",
"   - Worker processing rate\n",
"   - Backlog growth rate\n",
"6. **Redis health (if accessible):**\n",
"   - Redis version\n",
"   - Memory usage\n",
"   - Connection count\n",
"   - Slow queries\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Worker pod logs with Redis errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- Redis endpoint connectivity test\n",
"- Queue metrics (if available)\n",
"- Redis logs (if accessible via cloud provider)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **Redis failures can be intermittent** - Connection pool retries may mask issues\n",
"2. **Worker logs are critical** - Redis errors appear in worker pod logs\n",
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
"- ❌ Not checking worker logs (API logs may not show Redis errors)\n",
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
"- ❌ Not testing Redis connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the ClickHouse or Blob Storage failure labs\n",
"- Review the [First 10 Minutes Checklist](../../docs/shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,837 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - ClickHouse\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug ClickHouse connectivity failures in LangSmith.**\n",
"\n",
"ClickHouse is LangSmith's **trace storage**. It handles:\n",
"- Storing trace data (spans, events, metadata)\n",
"- Time-series queries for trace search and filtering\n",
"- High-volume writes from workers\n",
"- Efficient querying for UI display\n",
"\n",
"**When ClickHouse fails, you'll see:**\n",
"- Traces delayed or missing\n",
"- Insert errors and merge/backlog hints\n",
"- UI loads but traces don't appear\n",
"- Query timeouts\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how ClickHouse failures manifest\n",
"2. Practice collecting diagnostics for trace storage issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
"    Path.cwd().parent,\n",
"    Path.cwd(),\n",
"    Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
"    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        notebooks_path = path\n",
"        break\n",
"\n",
"if not notebooks_path:\n",
"    notebooks_path = Path.cwd() / \"notebooks\"\n",
"    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
"    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What This Service Does for LangSmith\n",
"\n",
"ClickHouse is LangSmith's **trace storage**. It handles:\n",
"\n",
"- **Trace data:**\n",
"  - Spans, events, and metadata for every trace\n",
"  - High-volume writes from workers\n",
"\n",
"- **Trace queries:**\n",
"  - Time-series queries for trace search and filtering\n",
"  - Efficient querying for UI display\n",
"\n",
"**Why it matters:**\n",
"- Without ClickHouse, workers can't store traces\n",
"- Inserts fail or back up, so traces are delayed or missing\n",
"- The UI still loads, but traces don't appear and queries time out\n",
"\n",
"**How LangSmith connects:**\n",
"- Connection details and credentials are stored in Kubernetes Secrets\n",
"- Workers connect to ClickHouse to write trace data\n",
"- Trace search and UI display read from ClickHouse\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Expected Symptoms When ClickHouse Fails\n",
"\n",
"**What you'll see:**\n",
"\n",
"1. **Traces delayed or missing:**\n",
"   - The UI loads, but new traces don't appear\n",
"   - Behavior may be inconsistent while retries are in flight\n",
"   - Retries visible in logs\n",
"\n",
"2. **Latency spikes:**\n",
"   - Trace search and filtering slow down\n",
"   - Worker processing delays\n",
"   - Query timeouts\n",
"\n",
"3. **Insert backlog:**\n",
"   - Insert errors in worker logs\n",
"   - Merge/backlog hints\n",
"   - Unwritten traces accumulating over time\n",
"\n",
"4. **Log patterns:**\n",
"   - Connection timeout errors\n",
"   - \"connection refused\" or \"connection reset\"\n",
"   - Authentication failures if the password is wrong (ClickHouse typically reports `Code: 516`, `AUTHENTICATION_FAILED`)\n",
"   - Retry attempts in worker logs\n",
"\n",
"**Timeline:**\n",
"- Symptoms may be intermittent at first (connection retries)\n",
"- Insert backlog builds over time\n",
"- Missing traces become visible in the UI as the backlog grows\n",
"- Full failure if connections are exhausted\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong ClickHouse Password**\n",
"- Modify the ClickHouse password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, failed inserts, intermittent failures\n",
"\n",
"**Option B: Block Egress to ClickHouse Endpoint**\n",
"- Apply NetworkPolicy blocking egress to ClickHouse (if NetworkPolicy supported)\n",
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option C: Wrong ClickHouse Host/Endpoint**\n",
"- Point connection string to non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    healthy = sum(1 for p in pods.get(\"items\", [])\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(pods.get(\"items\", []))\n",
"    print(f\"Pods: {healthy}/{total} running\")\n",
"\n",
"    if healthy == total and total > 0:\n",
"        ok(\"Baseline looks healthy\")\n",
"    else:\n",
"        warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
"    warn(\"Could not check pod status\")\n",
"\n",
"# Check for ClickHouse secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"clickhouse_secrets = []\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        if \"clickhouse\" in name.lower():\n",
"            clickhouse_secrets.append(name)\n",
"\n",
"if clickhouse_secrets:\n",
"    ok(f\"Found {len(clickhouse_secrets)} ClickHouse-related secret(s)\")\n",
"    for secret_name in clickhouse_secrets:\n",
"        print(f\"   - {secret_name}\")\n",
"else:\n",
"    warn(\"No ClickHouse secrets found\")\n",
"    print(\"   💡 ClickHouse connection may be configured differently\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong ClickHouse Password\n",
"# This cell modifies the ClickHouse password secret to an invalid value\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Candidate key names for the password; adjust to match your deployment\n",
"password_keys = [\"password\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\"]\n",
"\n",
"# Find ClickHouse secret (name contains \"clickhouse\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"clickhouse_secret_name = None\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        if \"clickhouse\" in name.lower():\n",
"            # Check if it has password-related keys\n",
"            data = secret.get(\"data\", {})\n",
"            if any(key in data for key in password_keys):\n",
"                clickhouse_secret_name = name\n",
"                break\n",
"\n",
"if not clickhouse_secret_name:\n",
"    raise RuntimeError(\"❌ Could not find ClickHouse secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found ClickHouse secret: {clickhouse_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
"    check=True,\n",
"    stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"clickhouse-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
"    f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
"    raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find the password key actually present in this secret\n",
"password_key = None\n",
"for key in password_keys:\n",
"    if key in secret_data[\"data\"]:\n",
"        password_key = key\n",
"        break\n",
"\n",
"if not password_key:\n",
"    raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"clickhouse-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
"    yaml.dump(secret_data, f)\n",
"\n",
"print(\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
"print(f\"   This will set an invalid password in secret: {clickhouse_secret_name}\")\n",
"print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
"print(\"\\n   To apply, uncomment and run the next cell.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"# \n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Failure injection applied - ClickHouse password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Worker pod logs for ClickHouse connection and insert errors\n",
"2. Traces delayed or missing in the UI\n",
"3. Worker retry patterns\n",
"4. Query latency and timeouts\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"clickhouse-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    print(result.stdout)\n",
"\n",
"    # Report restart counts: the first line is the header, RESTARTS is the 4th column\n",
"    rows = [l for l in result.stdout.split(\"\\n\")[1:] if l.strip()]\n",
"    if rows:\n",
"        print(\"\\n   Pod restart counts:\")\n",
"        for line in rows:\n",
"            parts = line.split()\n",
"            if len(parts) > 3:\n",
"                print(f\"     {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"# Arguments are passed as a list, so the sort key needs no shell quoting\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    if result.stdout.strip():\n",
"        print(\"   Recent warning/error events:\")\n",
"        for line in result.stdout.split(\"\\n\")[-5:]:\n",
"            if line.strip():\n",
"                print(f\"     {line}\")\n",
"\n",
"# 3. Check worker pod logs for ClickHouse errors\n",
"print(\"\\n3. Checking worker pod logs for ClickHouse errors...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
"        with open(logs_file, \"w\") as f:\n",
"            f.write(result.stdout)\n",
"\n",
"        # Look for ClickHouse-related errors (keywords are lowercase because lines are lowercased before matching)\n",
"        error_keywords = [\"clickhouse\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"insert\", \"retry\"]\n",
"        error_lines = [l for l in result.stdout.split(\"\\n\")\n",
"                       if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if error_lines:\n",
"            print(\"   Found ClickHouse-related errors:\")\n",
"            for line in error_lines[-5:]:\n",
"                print(f\"     {line}\")\n",
"        else:\n",
"            print(\"   No obvious ClickHouse errors in recent logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    script_path.chmod(0o755)\n",
"\n",
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed\")\n",
"\n",
"        # Find and move tarball (Path.suffix would only be \".gz\", so match on the full name)\n",
"        for file in incident_dir.parent.iterdir():\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                target_path = incident_dir / file.name\n",
"                file.rename(target_path)\n",
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
"                break\n",
"    else:\n",
"        warn(\"Diagnostics script had errors (check output above)\")\n",
"\n",
"except Exception as e:\n",
"    warn(f\"Could not run diagnostics script: {e}\")\n",
"    print(\"   💡 You can run it manually:\")\n",
"    print(f\"   curl -O {script_url}\")\n",
"    print(\"   chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\"   ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for ClickHouse issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check worker pod logs for ClickHouse connection errors:\")\n",
"print(f\" kubectl logs -n {namespace} <worker-pod-name> | grep -i 'clickhouse\\\\|clickhouse\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify secret exists and has correct keys:\")\n",
"print(f\" kubectl get secret {clickhouse_secret_name} -n {namespace} -o yaml\")\n",
"print(\" (Don't print the actual values - they're base64 encoded)\")\n",
"print()\n",
"\n",
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
"print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
"print()\n",
"\n",
"print(\"4. Test ClickHouse connectivity from a pod (if possible):\")\n",
"print(f\"   kubectl run -it --rm debug -n {namespace} --image=clickhouse/clickhouse-server --restart=Never --command -- \\\\\")\n",
"print(\"     clickhouse-client --host <clickhouse-host> --port <port> --user <user> --password <password> --query 'SELECT 1'\")\n",
"print()\n",
"\n",
"print(\"5. Check events for connection/authentication errors:\")\n",
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    ok(f\"Secret '{clickhouse_secret_name}' still exists\")\n",
"    secret_data = json.loads(result.stdout)\n",
"    keys = list(secret_data.get(\"data\", {}).keys())\n",
"    print(f\"   Secret keys: {', '.join(keys)}\")\n",
"else:\n",
"    warn(f\"Secret '{clickhouse_secret_name}' not found!\")\n",
"\n",
"# Check for pods with ClickHouse connection env vars\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    clickhouse_related_pods = []\n",
"    for pod in pods.get(\"items\", []):\n",
"        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
"        for container in containers:\n",
"            env = container.get(\"env\", [])\n",
"            # Match on env var names only; values are never read or printed\n",
"            if any(\"CLICKHOUSE\" in e.get(\"name\", \"\").upper() for e in env):\n",
"                clickhouse_related_pods.append(name)\n",
"                break\n",
"\n",
"    if clickhouse_related_pods:\n",
"        print(\"\\n   Pods with ClickHouse environment variables:\")\n",
"        for pod_name in sorted(set(clickhouse_related_pods)):\n",
"            print(f\"      - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# if backup_file.exists():\n",
"#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
"#     result = run(\n",
"#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
"#         check=True,\n",
"#         stream=True\n",
"#     )\n",
"#\n",
"#     ok(\"Original secret restored\")\n",
"#     print(\"\\n💡 Pods must restart to pick up the restored secret.\")\n",
"#     print(\"   If they do not restart on their own, trigger a rollout:\")\n",
"#     print(f\"   kubectl rollout restart deployment/<deployment-name> -n {namespace}\")\n",
"#     print(\"   Then monitor pod status (this may take 1-2 minutes):\")\n",
"#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
"#\n",
"#     import time\n",
"#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"#     time.sleep(60)\n",
"# else:\n",
"#     warn(f\"Backup file not found: {backup_file}\")\n",
"#     print(\"   💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
"if 'backup_file' in locals():\n",
"    print(f\"   Backup file: {backup_file.name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    # Ignore completed Job pods (phase Succeeded); they never return to Running\n",
"    items = [p for p in pods.get(\"items\", [])\n",
"             if p.get(\"status\", {}).get(\"phase\") != \"Succeeded\"]\n",
"    running = sum(1 for p in items\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(items)\n",
"\n",
"    if running == total and total > 0:\n",
"        ok(f\"All {total} pod(s) are running\")\n",
"    else:\n",
"        warn(f\"Only {running}/{total} pod(s) running\")\n",
"        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in worker logs\n",
"print(\"\\nChecking for recent errors in worker logs...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        error_keywords = [\"error\", \"fail\", \"clickhouse\", \"connection\"]\n",
"        recent_errors = [l for l in result.stdout.split(\"\\n\")\n",
"                         if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if recent_errors:\n",
"            warn(\"Still seeing some errors in logs:\")\n",
"            for line in recent_errors[-3:]:\n",
"                print(f\"      {line}\")\n",
"        else:\n",
"            ok(\"No recent errors in worker logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a ClickHouse issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **ClickHouse connection details:**\n",
" - Host/endpoint (redacted)\n",
" - Port\n",
" - Password (redacted)\n",
" - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
" - Full error text (not just \"connection failed\")\n",
" - Timestamps of first occurrence\n",
" - Retry patterns\n",
"4. **Recent changes:**\n",
" - Secret rotations\n",
" - Network policy changes\n",
" - ClickHouse configuration changes\n",
"5. **Queue status (if accessible):**\n",
" - Queue length\n",
" - Worker processing rate\n",
" - Backlog growth rate\n",
"6. **ClickHouse health (if accessible):**\n",
" - ClickHouse version\n",
" - Memory usage\n",
" - Connection count\n",
" - Slow queries\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Worker pod logs with ClickHouse errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
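"The structure-only evidence above can be produced without exposing values. A minimal sketch, assuming the secret was fetched with `kubectl get secret <name> -o json` and parsed with `json.loads`; `summarize_secret` is a hypothetical helper, not part of the diagnostics script:\n",
"\n",
"```python\n",
"import base64\n",
"\n",
"def summarize_secret(secret):\n",
"    # Map each key to its decoded size in bytes; values are never returned\n",
"    data = secret.get('data') or {}\n",
"    return {key: len(base64.b64decode(value)) for key, value in data.items()}\n",
"\n",
"# Illustrative input only, not a real credential\n",
"example = {'data': {'password': base64.b64encode(b'not-a-real-password').decode()}}\n",
"print(summarize_secret(example))\n",
"```\n",
"\n",
"Key names and sizes are usually enough to spot a missing or empty key.\n",
"\n",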
"**Additional evidence to gather (if escalating):**\n",
"- ClickHouse endpoint connectivity test\n",
"- Queue metrics (if available)\n",
"- ClickHouse logs (if accessible via cloud provider)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **ClickHouse failures can be intermittent** - Connection pool retries may mask issues\n",
"2. **Worker logs are critical** - ClickHouse errors appear in worker pod logs\n",
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
"- ❌ Not checking worker logs (API logs may not show ClickHouse errors)\n",
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
"- ❌ Not testing ClickHouse connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the Blob Storage failure lab (`40_failure_lab_blob_storage.ipynb`)\n",
"- Review the [First 10 Minutes Checklist](../../docs/shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,846 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - Blob Storage\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug Blob Storage configuration failures in LangSmith.**\n",
"\n",
"Blob Storage is LangSmith's **large payload storage**. It handles:\n",
"- Large trace payloads that would otherwise be written to ClickHouse\n",
"- Artifacts uploaded by workers during trace processing\n",
"- Keeping ClickHouse lean so inserts and queries stay fast under load\n",
"\n",
"**When Blob Storage fails, you'll see:**\n",
"- Large payload traces degrade ClickHouse performance\n",
"- Warnings/errors in logs about artifact storage\n",
"- Increased ClickHouse pressure and latency under load\n",
"- Traces with large payloads fail to store properly\n",
"- Intermittent ingestion issues, latency spikes, and retries\n",
"- Traces that are delayed or missing\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how Blob Storage failures manifest\n",
"2. Practice collecting diagnostics for blob storage issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
" Path.cwd().parent,\n",
" Path.cwd(),\n",
" Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
"    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        notebooks_path = path\n",
"        break\n",
"\n",
"if not notebooks_path:\n",
"    notebooks_path = Path.cwd() / \"notebooks\"\n",
"    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
"    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What This Service Does for LangSmith\n",
"\n",
"Blob Storage is LangSmith's **large payload store**. It handles:\n",
"\n",
"- **Large trace payloads:**\n",
"  - Workers write large payloads to the configured bucket\n",
"  - ClickHouse is spared from carrying the full payloads\n",
"  - Inserts and merges stay fast under load\n",
"\n",
"- **Artifact storage:**\n",
"  - Workers upload artifacts during trace processing\n",
"  - Upload failures surface as warnings or errors in worker logs\n",
"\n",
"**Why it matters:**\n",
"- Without working Blob Storage, large payloads put pressure on ClickHouse\n",
"- Insert latency and storage usage climb under load\n",
"- Traces with large payloads may fail to store properly\n",
"- Ingestion becomes unreliable\n",
"\n",
"**How LangSmith connects:**\n",
"- Connection details and credentials are stored in Kubernetes Secrets\n",
"- Workers connect to Blob Storage to upload payloads and artifacts\n",
"- A bad credential, bucket, or endpoint shows up as \"Access Denied\" or \"No such bucket\" errors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Expected Symptoms When Blob Storage Fails\n",
"\n",
"**What you'll see:**\n",
"\n",
"1. **Large payload traces degrade ClickHouse:**\n",
" - ClickHouse performance degrades under load\n",
" - Insert operations slow down\n",
" - Query performance suffers\n",
" - Storage pressure increases\n",
"\n",
"2. **Warnings/errors in logs about artifact storage:**\n",
" - Worker logs show artifact upload failures\n",
" - Bucket access errors\n",
" - Credential errors\n",
" - \"No such bucket\" or \"Access Denied\" errors\n",
"\n",
"3. **Increased ClickHouse pressure:**\n",
" - ClickHouse latency increases\n",
" - Merge operations backlog\n",
" - Storage usage spikes\n",
" - Query timeouts\n",
"\n",
"4. **Log patterns:**\n",
" - Artifact storage errors in worker logs\n",
" - S3/blob storage connection errors\n",
" - Bucket access denied errors\n",
" - Credential errors\n",
" - Configuration errors\n",
"\n",
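"The log patterns above can be sorted into the failure classes this lab asks you to tell apart. The sketch below is illustrative only: `classify_blob_error` and its keyword lists are hypothetical, not part of LangSmith or the diagnostics script, and the keywords should be tuned to the errors your storage provider actually emits.\n",
"\n",
"```python\n",
"# Hypothetical helper: the keyword lists are assumptions, tune them to your logs\n",
"CATEGORIES = {\n",
"    'credential': ['access denied', 'invalid access key', 'signature', 'forbidden'],\n",
"    'configuration': ['no such bucket', 'bucket not found'],\n",
"    'network': ['timeout', 'timed out', 'connection refused', 'no route to host'],\n",
"}\n",
"\n",
"def classify_blob_error(line):\n",
"    # Return the first category whose keyword appears in the log line\n",
"    lowered = line.lower()\n",
"    for category, keywords in CATEGORIES.items():\n",
"        if any(keyword in lowered for keyword in keywords):\n",
"            return category\n",
"    return 'unknown'\n",
"\n",
"print(classify_blob_error('ERROR upload failed: Access Denied'))  # prints: credential\n",
"```\n",
"\n",
"Classifying the first error you see narrows the next check: credentials point to the Secret, configuration to the bucket or endpoint settings, and network to NetworkPolicy or DNS.\n",
"\n",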
"**Timeline:**\n",
"- Symptoms appear gradually (under load)\n",
"- ClickHouse performance degrades over time\n",
"- Large traces fail or are rejected\n",
"- Full failure if blob storage completely unavailable\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong Blob Storage Password**\n",
"- Modify the Blob Storage password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, \"Access Denied\" errors, artifact upload failures\n",
"\n",
"**Option B: Block Egress to Blob Storage Endpoint**\n",
"- Apply NetworkPolicy blocking egress to Blob Storage (if NetworkPolicy supported)\n",
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option C: Wrong Blob Storage Host/Endpoint**\n",
"- Point the Blob Storage endpoint to a non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    # Ignore completed Job pods (phase Succeeded); they are not expected to be Running\n",
"    items = [p for p in pods.get(\"items\", [])\n",
"             if p.get(\"status\", {}).get(\"phase\") != \"Succeeded\"]\n",
"    healthy = sum(1 for p in items\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(items)\n",
"    print(f\"Pods: {healthy}/{total} running\")\n",
"\n",
"    if healthy == total and total > 0:\n",
"        ok(\"Baseline looks healthy\")\n",
"    else:\n",
"        warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
"    warn(\"Could not check pod status\")\n",
"\n",
"# Check for Blob Storage secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"blob_secrets = []\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        if \"blob\" in name.lower():\n",
"            blob_secrets.append(name)\n",
"\n",
"if blob_secrets:\n",
"    ok(f\"Found {len(blob_secrets)} Blob Storage-related secret(s)\")\n",
"    for secret_name in blob_secrets:\n",
"        print(f\"   - {secret_name}\")\n",
"else:\n",
"    warn(\"No Blob Storage secrets found\")\n",
"    print(\"   💡 Blob Storage connection may be configured differently\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong Blob Storage Password\n",
"# This cell modifies the Blob Storage password secret to an invalid value\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Find Blob Storage secret (look for common names)\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"# Candidate password keys; key names vary by deployment, so adjust as needed\n",
"password_keys = [\"password\", \"BLOB_PASSWORD\", \"blob-password\"]\n",
"\n",
"blob_secret_name = None\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        # Common pattern: the secret name contains \"blob\"\n",
"        if \"blob\" in name.lower():\n",
"            # Check if it has password-related keys\n",
"            data = secret.get(\"data\", {})\n",
"            if any(key in data for key in password_keys):\n",
"                blob_secret_name = name\n",
"                break\n",
"\n",
"if not blob_secret_name:\n",
"    raise RuntimeError(\"❌ Could not find Blob Storage secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found Blob Storage secret: {blob_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
"    check=True,\n",
"    stream=False\n",
")\n",
"\n",
"# Parse YAML and drop server-managed metadata; a stale resourceVersion can\n",
"# cause a conflict when the backup or the modified copy is applied later\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
"    raise RuntimeError(\"Secret has no data section\")\n",
"for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
"    secret_data.get(\"metadata\", {}).pop(field, None)\n",
"\n",
"# Save original secret for restoration (the file contains credentials; do not share it)\n",
"backup_file = artifacts_dir / \"module-4\" / f\"blob-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
"    yaml.safe_dump(secret_data, f)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Find the password key\n",
"password_key = next((key for key in password_keys if key in secret_data[\"data\"]), None)\n",
"if not password_key:\n",
"    raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"blob-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
"    yaml.safe_dump(secret_data, f)\n",
"\n",
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
"print(f\" This will set an invalid password in secret: {blob_secret_name}\")\n",
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
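"source": [
"## 7. Do the Drill - Step 2 (continued): Apply the Injection\n",
"\n",
"**Review the modified secret saved above, then uncomment and run the next cell to apply it.**\n"
]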
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"# \n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Failure injection applied - Blob Storage password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Worker pod logs for Blob Storage connection and credential errors\n",
"2. ClickHouse pressure (insert latency, storage usage) if visible\n",
"3. Worker retry patterns\n",
"4. Latency in API responses\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"blob-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    print(result.stdout)\n",
"\n",
"    # Report restart counts (RESTARTS is the 4th column; skip the header row)\n",
"    rows = [l for l in result.stdout.split(\"\\n\")[1:] if l.strip()]\n",
"    if rows:\n",
"        print(\"\\n   Pod restart counts:\")\n",
"        for row in rows:\n",
"            parts = row.split()\n",
"            if len(parts) > 3:\n",
"                print(f\"      {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"print(\"\\n2. Checking recent events...\")\n",
"# Arguments are passed without a shell, so the sort key must not carry shell quotes\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    if result.stdout.strip():\n",
"        print(\"   Recent warning/error events:\")\n",
"        for line in result.stdout.split(\"\\n\")[-5:]:\n",
"            if line.strip():\n",
"                print(f\"      {line}\")\n",
"\n",
"# 3. Check worker pod logs for Blob Storage errors\n",
"print(\"\\n3. Checking worker pod logs for Blob Storage errors...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
"        with open(logs_file, \"w\") as f:\n",
"            f.write(result.stdout)\n",
"\n",
"        # Look for Blob Storage-related errors (keywords are lowercase because each line is lowercased before matching)\n",
"        error_keywords = [\"blob\", \"bucket\", \"s3\", \"access denied\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"retry\"]\n",
"        error_lines = [l for l in result.stdout.split(\"\\n\")\n",
"                       if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if error_lines:\n",
"            print(\"   Found Blob Storage-related errors:\")\n",
"            for line in error_lines[-5:]:\n",
"                print(f\"      {line}\")\n",
"        else:\n",
"            print(\"   No obvious Blob Storage errors in recent logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    script_path.chmod(0o755)\n",
"\n",
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed\")\n",
"\n",
"        # Find and move tarball (Path.suffix would only return \".gz\", so match the full name)\n",
"        for file in incident_dir.parent.iterdir():\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                target_path = incident_dir / file.name\n",
"                file.rename(target_path)\n",
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
"                break\n",
"    else:\n",
"        warn(\"Diagnostics script had errors (check output above)\")\n",
"\n",
"except Exception as e:\n",
"    warn(f\"Could not run diagnostics script: {e}\")\n",
"    print(\"   💡 You can run it manually:\")\n",
"    print(f\"      curl -O {script_url}\")\n",
"    print(\"      chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for Blob Storage issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check worker pod logs for Blob Storage connection errors:\")\n",
"print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'blob\\\\|bucket\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify the secret exists and has the expected keys:\")\n",
"print(f\"   kubectl describe secret {blob_secret_name} -n {namespace}\")\n",
"print(\"   (describe lists keys and sizes only; '-o yaml' prints the values, and base64 is an encoding, not encryption)\")\n",
"print()\n",
"\n",
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
"print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
"print()\n",
"\n",
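"print(\"4. Test Blob Storage connectivity from a pod (if possible):\")\n",
"print(f\"   kubectl run -it --rm debug -n {namespace} --image=curlimages/curl --restart=Never --command -- \\\\\")\n",
"print(\"     curl -sSI https://<blob-endpoint>\")\n",
"print(\"   (Any HTTP response, even 403, shows the endpoint is reachable; a timeout points to network or DNS)\")\n",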
"print()\n",
"\n",
"print(\"5. Check events for connection/authentication errors:\")\n",
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    ok(f\"Secret '{blob_secret_name}' still exists\")\n",
"    secret_data = json.loads(result.stdout)\n",
"    keys = list(secret_data.get(\"data\", {}).keys())\n",
"    print(f\"   Secret keys: {', '.join(keys)}\")\n",
"else:\n",
"    warn(f\"Secret '{blob_secret_name}' not found!\")\n",
"\n",
"# Check for pods with Blob Storage connection env vars\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    blob_related_pods = []\n",
"    for pod in pods.get(\"items\", []):\n",
"        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
"        for container in containers:\n",
"            env = container.get(\"env\", [])\n",
"            # Match on env var names only; values are never read or printed\n",
"            blob_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper()\n",
"                                              for kw in [\"BLOB\", \"S3\"])]\n",
"            if blob_env:\n",
"                blob_related_pods.append(name)\n",
"                break\n",
"\n",
"    if blob_related_pods:\n",
"        print(\"\\n   Pods with Blob Storage environment variables:\")\n",
"        for pod_name in sorted(set(blob_related_pods)):\n",
"            print(f\"      - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# if backup_file.exists():\n",
"#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
"#     result = run(\n",
"#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
"#         check=True,\n",
"#         stream=True\n",
"#     )\n",
"#\n",
"#     ok(\"Original secret restored\")\n",
"#     print(\"\\n💡 Pods must restart to pick up the restored secret.\")\n",
"#     print(\"   If they do not restart on their own, trigger a rollout:\")\n",
"#     print(f\"   kubectl rollout restart deployment/<deployment-name> -n {namespace}\")\n",
"#     print(\"   Then monitor pod status (this may take 1-2 minutes):\")\n",
"#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
"#\n",
"#     import time\n",
"#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"#     time.sleep(60)\n",
"# else:\n",
"#     warn(f\"Backup file not found: {backup_file}\")\n",
"#     print(\"   💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
"if 'backup_file' in locals() and backup_file:\n",
"    print(f\"   Backup file: {backup_file.name}\")\n",
"else:\n",
"    print(\"   💡 If you modified Helm values, restore them manually\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    # Ignore completed Job pods (phase Succeeded); they never return to Running\n",
"    items = [p for p in pods.get(\"items\", [])\n",
"             if p.get(\"status\", {}).get(\"phase\") != \"Succeeded\"]\n",
"    running = sum(1 for p in items\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(items)\n",
"\n",
"    if running == total and total > 0:\n",
"        ok(f\"All {total} pod(s) are running\")\n",
"    else:\n",
"        warn(f\"Only {running}/{total} pod(s) running\")\n",
"        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in worker logs\n",
"print(\"\\nChecking for recent errors in worker logs...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        error_keywords = [\"error\", \"fail\", \"blob\", \"bucket\", \"connection\"]\n",
"        recent_errors = [l for l in result.stdout.split(\"\\n\")\n",
"                         if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if recent_errors:\n",
"            warn(\"Still seeing some errors in logs:\")\n",
"            for line in recent_errors[-3:]:\n",
"                print(f\"      {line}\")\n",
"        else:\n",
"            ok(\"No recent errors in worker logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a Blob Storage issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **Blob Storage connection details:**\n",
" - Host/endpoint (redacted)\n",
" - Port\n",
" - Password (redacted)\n",
" - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
" - Full error text (not just \"connection failed\")\n",
" - Timestamps of first occurrence\n",
" - Retry patterns\n",
"4. **Recent changes:**\n",
" - Secret rotations\n",
" - Network policy changes\n",
" - Blob Storage configuration changes\n",
"5. **Queue status (if accessible):**\n",
" - Queue length\n",
" - Worker processing rate\n",
" - Backlog growth rate\n",
"6. **Blob Storage health (if accessible):**\n",
" - Blob Storage version\n",
" - Memory usage\n",
" - Connection count\n",
" - Slow queries\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Worker pod logs with Blob Storage errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- Blob Storage endpoint connectivity test\n",
"- Queue metrics (if available)\n",
"- Blob Storage logs (if accessible via cloud provider)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **Blob Storage failures can be intermittent** - Connection pool retries may mask issues\n",
"2. **Worker logs are critical** - Blob Storage errors appear in worker pod logs\n",
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
"- ❌ Not checking worker logs (API logs may not show Blob Storage errors)\n",
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
"- ❌ Not testing Blob Storage connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the ClickHouse or Blob Storage failure labs\n",
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,37 @@
# Module 4: Troubleshooting & Incident Response
This directory contains notebooks for Module 4 of the LangSmith Self-Hosted Operator workshop.
## Notebooks
### Setup & Baseline
- **`00_setup_or_resume_environment.ipynb`** - Validates environment is ready for Module 4
- **`01_diagnostics_baseline.ipynb`** - Captures baseline diagnostics (run this before any failure lab)
### Failure Labs
- **`10_failure_lab_postgres.ipynb`** - PostgreSQL connectivity failure debugging
- **`20_failure_lab_redis.ipynb`** - Redis cache/queue failure debugging (intermittent ingestion, worker backlog)
- **`30_failure_lab_clickhouse.ipynb`** - ClickHouse trace storage failure debugging (missing traces, insert errors)
- **`40_failure_lab_blob_storage.ipynb`** - Blob storage configuration failure debugging
### Advanced
- **`90_full_incident_drill.ipynb`** - Complete incident simulation (optional)
## Workflow
1. Run `00_setup_or_resume_environment.ipynb` to verify your environment
2. Run `01_diagnostics_baseline.ipynb` to capture baseline
3. Run failure labs in order (10, 20, 30, 40) or pick specific ones
4. Optionally run `90_full_incident_drill.ipynb` for complete practice
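
The workflow order above follows the numeric prefixes of the notebook filenames. As a minimal sketch of that convention, the snippet below sorts a list of filenames into run order. The helper name `ordered_notebooks` and the rule that prefixes of 90 and above are optional are assumptions made for illustration; they are not part of the workshop tooling.

```python
import re
from pathlib import Path


def ordered_notebooks(names, include_optional=False):
    """Return notebook filenames sorted by numeric prefix (hypothetical helper)."""
    picked = []
    for name in names:
        match = re.match(r"(\d+)_.*\.ipynb$", Path(name).name)
        if not match:
            continue  # skip README.md and anything without a numeric prefix
        prefix = int(match.group(1))
        if prefix >= 90 and not include_optional:
            continue  # assumption: the 90-series drill is optional
        picked.append((prefix, name))
    return [name for _, name in sorted(picked)]


if __name__ == "__main__":
    listing = [
        "40_failure_lab_blob_storage.ipynb",
        "README.md",
        "01_diagnostics_baseline.ipynb",
        "90_full_incident_drill.ipynb",
        "00_setup_or_resume_environment.ipynb",
    ]
    for notebook in ordered_notebooks(listing):
        print(notebook)
```

Passing `include_optional=True` appends the full incident drill at the end of the list.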
## Important Notes
- **Always run baseline first** - You need "before" to compare to "after"
- **Failure injections are reversible** - All labs include remediation steps
- **Don't skip diagnostics collection** - Support will ask for the canonical bundle
- **Practice in test environments only** - These labs modify your deployment
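
To illustrate why the baseline matters, here is a minimal sketch that compares two parsed `kubectl get pods -o json` documents, one captured before the failure injection and one after. The function names `phase_counts` and `phase_drift` are hypothetical and the inline pod lists are made-up sample data; the canonical diagnostics script may compare state differently.

```python
from collections import Counter


def phase_counts(pod_list):
    """Count pods per status.phase in a parsed `kubectl get pods -o json` document."""
    return Counter(
        item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    )


def phase_drift(baseline, current):
    """Return {phase: (before, after)} for every phase whose count changed."""
    before, after = phase_counts(baseline), phase_counts(current)
    return {
        phase: (before[phase], after[phase])
        for phase in sorted(set(before) | set(after))
        if before[phase] != after[phase]
    }


if __name__ == "__main__":
    # Made-up sample data standing in for saved kubectl output
    baseline = {"items": [{"status": {"phase": "Running"}}] * 3}
    current = {
        "items": [
            {"status": {"phase": "Running"}},
            {"status": {"phase": "Running"}},
            {"status": {"phase": "Pending"}},
        ]
    }
    print(phase_drift(baseline, current))
```

An empty result means the pod phases match the baseline. It does not prove the pods are healthy, because a pod can be in the `Running` phase while its containers are not yet ready.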
## Documentation
See `docs/modules/module-4.md` for complete module documentation.