Merge pull request #1 from langchain-ai/cwaddingham/refinements

Refinements to the initial creation
2026-07-01 20:44:14 -04:00 · 2026-01-06 16:46:31 -08:00
parent dd93069314 62b8abef6c
commit 800ce7eaa6
40 changed files with 14317 additions and 679 deletions
@@ -0,0 +1,3 @@
+# Automatically strip output cells from Jupyter notebooks before committing
+*.ipynb filter=nbstripout
+
@@ -0,0 +1,120 @@
+# GitHub Actions Workflows
+
+This directory contains CI/CD workflows for the LangSmith Self-Hosted Workshops repository.
+
+## Workflows
+
+### `test-notebooks.yml`
+
+Main workflow for testing notebook syntax and execution.
+
+**Triggers:**
+- Pull requests to `main`/`master`
+- Pushes to `main`/`master`
+- Manual workflow dispatch
+
+**Jobs:**
+1. **test-notebook-syntax**: Validates notebook JSON structure and syntax via the pytest suite (Modules 1-4)
+2. **test-module-1-preflight**: Tests Module 1 preflight notebook
+3. **test-module-2-syntax**: Tests Module 2 auth validation notebooks
+4. **lint-python**: Lints Python code in shared modules
+
+## Environment Variables
+
+### Required for Syntax Tests (Always Available)
+
+These are set in the workflow and don't require secrets:
+
+```yaml
+NAMESPACE: "langsmith-test"
+CLUSTER_NAME: "test-cluster"
+HELM_RELEASE: "langsmith"
+CLOUD_PROVIDER: "aws"
+AWS_REGION: "us-west-2"
+LANGSMITH_DOMAIN: "test.langsmith.example.com"
+```
+
+### Required for Full Execution Tests (Optional)
+
+For full notebook execution, set these in GitHub Secrets:
+
+**AWS:**
+- `AWS_ACCESS_KEY_ID`
+- `AWS_SECRET_ACCESS_KEY`
+- `AWS_REGION`
+- `AWS_ACCOUNT_ID` (optional, for validation)
+
+**Azure:**
+- `AZURE_CLIENT_ID`
+- `AZURE_CLIENT_SECRET`
+- `AZURE_TENANT_ID`
+- `AZURE_SUBSCRIPTION_ID`
+- `AZURE_LOCATION`
+
+**Infrastructure:**
+- `CLUSTER_NAME`
+- `NAMESPACE`
+- `TERRAFORM_REPO_DIR`
+- `HELM_REPO_DIR`
+
+**OIDC/SAML (Module 2):**
+- `OIDC_ISSUER`
+- `OIDC_CLIENT_ID`
+- `OIDC_CLIENT_SECRET`
+- `OIDC_REDIRECT_URI`
+- `SAML_METADATA_URL` (if using SAML)
+
+## Customizing Workflows
+
+### Adding New Test Jobs
+
+1. Add job to `test-notebooks.yml`
+2. Set appropriate `needs:` dependencies
+3. Configure environment variables
+4. Add artifact uploads if needed
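+
+As a hedged sketch of the four steps above, a new job could look like the following. The job name `test-module-3-syntax`, its timeout, and the artifact name are assumptions for illustration; the pytest target is the Module 3 syntax test that the existing `test-notebook-syntax` job already invokes.
+
+```yaml
+  # Hypothetical job; add it under the existing `jobs:` key in test-notebooks.yml
+  test-module-3-syntax:
+    name: Test Module 3 Notebooks (Syntax)
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    needs: test-notebook-syntax  # step 2: run only after the shared syntax job passes
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+          cache: 'pip'
+      - name: Install test dependencies
+        run: pip install -r tests/requirements.txt
+      - name: Run Module 3 syntax tests
+        env:  # step 3: placeholder values, no secrets required
+          CI_SKIP_EXECUTION: "true"
+          NAMESPACE: "langsmith-test"
+        run: |
+          pytest tests/test_notebook_execution.py::TestModule3Notebooks::test_module3_notebook_syntax -v
+      - name: Upload test artifacts  # step 4: optional
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: module-3-artifacts  # artifact names must be unique within a run
+          path: tests/artifacts/
+          retention-days: 1
+```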
+
+### Enabling Full Execution Tests
+
+To enable full notebook execution in CI:
+
+1. Set required secrets in GitHub repository settings
+2. Remove or modify `CI_SKIP_EXECUTION` environment variable
+3. Update test conditions in `test_notebook_execution.py`
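+
+A sketch of steps 1 and 2 for the AWS case, assuming the secrets listed above exist in the repository settings. The step name is hypothetical, and setting `CI_SKIP_EXECUTION` to `"false"` assumes the tests skip execution only when the value is `"true"`; check how `test_notebook_execution.py` reads the variable before relying on it.
+
+```yaml
+      - name: Run full notebook execution tests  # hypothetical step name
+        env:
+          CI_SKIP_EXECUTION: "false"  # assumption: tests skip only when this is "true"
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          AWS_REGION: ${{ secrets.AWS_REGION }}
+          CLUSTER_NAME: ${{ secrets.CLUSTER_NAME }}
+          NAMESPACE: ${{ secrets.NAMESPACE }}
+        run: |
+          pytest tests/test_notebook_execution.py -v
+```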
+
+### Adding New Modules
+
+When adding Module 3, 4, etc.:
+
+1. Create new test class in `test_notebook_execution.py`
+2. Add parametrized tests for new notebooks
+3. Add new job in GitHub Actions workflow
+4. Update this README
+
+## Workflow Status
+
+Workflow status badges can be added to README:
+
+```markdown
+![Test Notebooks](https://github.com/langchain-ai/langsmith-self-hosted-workshops/actions/workflows/test-notebooks.yml/badge.svg)
+```
+
+## Troubleshooting
+
+### Workflow Fails on Syntax Tests
+
+- Check notebook JSON is valid
+- Verify all imports are available
+- Check Python version compatibility
+
+### Workflow Times Out
+
+- Increase `timeout-minutes` in job definition
+- Check for long-running operations
+- Consider splitting into smaller jobs
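+
+For example, a job-level timeout is raised by editing `timeout-minutes` on the job itself. The 45-minute value below is illustrative only:
+
+```yaml
+  test-notebook-syntax:
+    name: Test Notebook Syntax
+    runs-on: ubuntu-latest
+    timeout-minutes: 45  # raised from 30; the job is cancelled when this limit is reached
+```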
+
+### Environment Variable Issues
+
+- Verify secrets are set in repository settings
+- Check variable names match exactly
+- Ensure secrets are accessible to workflow
+
@@ -0,0 +1,256 @@
+name: Test Notebooks
+
+on:
+  pull_request:
+    branches:
+      - main
+      - master
+  push:
+    branches:
+      - main
+      - master
+  workflow_dispatch:  # Allow manual triggering
+
+jobs:
+  test-notebook-syntax:
+    name: Test Notebook Syntax
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+      
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+          cache: 'pip'
+      
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y jq
+      
+      - name: Install test dependencies
+        run: |
+          pip install -r tests/requirements.txt
+      
+      - name: Run notebook syntax tests
+        env:
+          CI_SKIP_EXECUTION: "true"  # Skip full execution, only test syntax
+          NAMESPACE: "langsmith-test"
+          CLUSTER_NAME: "test-cluster"
+          HELM_RELEASE: "langsmith"
+          CLOUD_PROVIDER: "aws"
+          AWS_REGION: "us-west-2"
+          AZURE_LOCATION: "eastus"
+          LANGSMITH_DOMAIN: "test.langsmith.example.com"
+          OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
+          OIDC_CLIENT_ID: "test-client-id"
+          OIDC_CLIENT_SECRET: "test-client-secret"
+          OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"
+        run: |
+          pytest tests/test_notebook_execution.py::TestModule1Notebooks::test_module1_notebook_syntax -v
+          pytest tests/test_notebook_execution.py::TestModule2Notebooks::test_module2_notebook_syntax -v
+          pytest tests/test_notebook_execution.py::TestModule3Notebooks::test_module3_notebook_syntax -v
+          pytest tests/test_notebook_execution.py::TestModule4Notebooks::test_module4_notebook_syntax -v
+      
+      - name: Upload test artifacts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: test-artifacts
+          path: tests/artifacts/
+          retention-days: 1
+
+  test-module-1-preflight:
+    name: Test Module 1 Preflight (Dry Run)
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    needs: test-notebook-syntax
+    
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+      
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+          cache: 'pip'
+      
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y jq
+          # Install mock tools (these won't actually work, but allow import checks)
+          sudo ln -sf /bin/true /usr/local/bin/aws || true
+          sudo ln -sf /bin/true /usr/local/bin/terraform || true
+          sudo ln -sf /bin/true /usr/local/bin/helm || true
+          sudo ln -sf /bin/true /usr/local/bin/kubectl || true
+      
+      - name: Install test dependencies
+        run: |
+          pip install -r tests/requirements.txt
+      
+      - name: Create test environment file
+        run: |
+          mkdir -p notebooks
+          cat > notebooks/workshop.env <<EOF
+          WORKSHOP_NAME="langsmith-test"
+          NAMESPACE="langsmith-test"
+          CLUSTER_NAME="test-cluster"
+          AWS_REGION="us-west-2"
+          HELM_RELEASE="langsmith"
+          ARTIFACTS_DIR="./tests/artifacts"
+          DRY_RUN="true"
+          EOF
+      
+      - name: Run Module 1 preflight notebook (syntax only)
+        env:
+          CI_SKIP_EXECUTION: "true"
+          NAMESPACE: "langsmith-test"
+          CLUSTER_NAME: "test-cluster"
+          HELM_RELEASE: "langsmith"
+          CLOUD_PROVIDER: "aws"
+          AWS_REGION: "us-west-2"
+        run: |
+          # Test that notebook can be loaded and parsed
+          python -c "
+          import json
+          import sys
+          from pathlib import Path
+          
+          nb_path = Path('notebooks/module-1/01_preflight.ipynb')
+          with open(nb_path) as f:
+              nb = json.load(f)
+          
+          # Validate structure
+          assert 'cells' in nb
+          assert len(nb['cells']) > 0
+          print(f'✅ Notebook structure valid: {len(nb[\"cells\"])} cells')
+          sys.exit(0)
+          "
+      
+      - name: Upload test artifacts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: module-1-artifacts
+          path: tests/artifacts/
+          retention-days: 1
+
+  test-module-2-syntax:
+    name: Test Module 2 Notebooks (Syntax)
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    needs: test-notebook-syntax
+    
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+      
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+          cache: 'pip'
+      
+      - name: Install test dependencies
+        run: |
+          pip install -r tests/requirements.txt
+      
+      - name: Create test environment file
+        run: |
+          mkdir -p notebooks
+          cat > notebooks/workshop.env <<EOF
+          WORKSHOP_NAME="langsmith-test"
+          NAMESPACE="langsmith-test"
+          CLUSTER_NAME="test-cluster"
+          AWS_REGION="us-west-2"
+          HELM_RELEASE="langsmith"
+          ARTIFACTS_DIR="./tests/artifacts"
+          LANGSMITH_DOMAIN="test.langsmith.example.com"
+          OIDC_ISSUER="https://test-idp.example.com/oauth2/default"
+          OIDC_CLIENT_ID="test-client-id"
+          OIDC_CLIENT_SECRET="test-client-secret"
+          OIDC_REDIRECT_URI="https://test.langsmith.example.com/auth/callback"
+          EOF
+      
+      - name: Validate Module 2 notebooks
+        env:
+          CI_SKIP_EXECUTION: "true"
+          NAMESPACE: "langsmith-test"
+          CLUSTER_NAME: "test-cluster"
+          HELM_RELEASE: "langsmith"
+          LANGSMITH_DOMAIN: "test.langsmith.example.com"
+          OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
+          OIDC_CLIENT_ID: "test-client-id"
+          OIDC_CLIENT_SECRET: "test-client-secret"
+          OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"
+        run: |
+          python -c "
+          import json
+          import sys
+          from pathlib import Path
+          
+          notebooks = [
+              'notebooks/module-2/01_sso_oidc_validation.ipynb',
+              'notebooks/module-2/02_sso_saml_validation.ipynb',
+          ]
+          
+          for nb_path_str in notebooks:
+              nb_path = Path(nb_path_str)
+              if not nb_path.exists():
+                  print(f'❌ Notebook not found: {nb_path}')
+                  sys.exit(1)
+              
+              with open(nb_path) as f:
+                  nb = json.load(f)
+              
+              assert 'cells' in nb, f'Missing cells in {nb_path}'
+              assert len(nb['cells']) > 0, f'No cells in {nb_path}'
+              
+              code_cells = [c for c in nb['cells'] if c.get('cell_type') == 'code']
+              print(f'✅ {nb_path.name}: {len(code_cells)} code cells, {len(nb[\"cells\"])} total cells')
+          
+          print('✅ All Module 2 notebooks validated')
+          sys.exit(0)
+          "
+      
+      - name: Upload test artifacts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: module-2-artifacts
+          path: tests/artifacts/
+          retention-days: 1
+
+  lint-python:
+    name: Lint Python Code
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+      
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+          cache: 'pip'
+      
+      - name: Install linting tools
+        run: |
+          pip install flake8 black isort
+      
+      - name: Run flake8
+        run: |
+          flake8 notebooks/shared/ tests/ --max-line-length=120 --ignore=E501,W503 || true
+      
+      - name: Check code formatting with black
+        run: |
+          black --check notebooks/shared/ tests/ || true
+
@@ -6,6 +6,8 @@ The workshop is designed for **platform, infrastructure, and MLOps engineers** r

 > **Note:** This workshop assumes deployment using *NIX-based servers, preferably Linux. If you must use Windows please raise an issue in the [Github](https://github.com/langchain-ai/langsmith-self-hosted-workshops) repo and LangChain engineers will address it. 

+> **Note:** This workshop uses Jupyter notebooks for its demonstrations. You can run them locally on your own [Jupyter server](https://jupyter.org/) or open them in Google Colab through the [GitHub-to-Colab tool](https://githubtocolab.com) with your existing Google account.
+
 This repo complements (but does not replace) the high-level deployment instructions [the LangSmith documentation](https://docs.langchain.com). Where the docs explain *what* to do, this workshop focuses on *how to do it safely and repeatedly*.

 ---
@@ -178,7 +180,7 @@ git clone https://github.com/langchain-ai/helm.git <your-helm-path>
 ### 3. Start the Workshop

 1. Read `docs/modules/module-1.md` for module overview and context
-2. Open `notebooks/module-1/01_aws_preflight.ipynb` in Jupyter
+2. Open `notebooks/module-1/01_preflight.ipynb` in Jupyter
 3. Run the bootstrap cell (first cell) to validate your environment
 4. Follow the notebook cells sequentially

@@ -0,0 +1,606 @@
+# Module 1: Deployment & Baseline Validation
+
+**Goal:** Deploy LangSmith self-hosted using the official Terraform and Helm repositories, establishing a supported baseline configuration.
+
+**Duration:** ~2 hours  
+**Audience:** Platform engineers, infrastructure teams, and operators deploying LangSmith for the first time  
+**Prerequisites:**
+- Cloud provider account with appropriate permissions
+- Local tooling installed (`aws`/`az`, `terraform`, `kubectl`, `helm`, `jq`)
+- LangSmith self-hosted license key
+- Basic familiarity with Kubernetes (pods, services, ingress)
+
+---
+
+## Motivation
+
+Most self-hosted LangSmith failures occur **before** users ever touch the product:
+
+- Mis-sized clusters that "work" until users arrive
+- Unsupported ingress setups causing connectivity issues
+- In-cluster databases used past their limits
+- Missing storage primitives (blob storage, persistent volumes)
+- Incorrect infrastructure configuration leading to data loss
+
+Module 1 exists to ensure every deployment starts from a **supported baseline** using the **official Terraform and Helm repositories**. This baseline becomes the foundation for production operations (Module 3) and authentication (Module 2).
+
+---
+
+## Outcomes
+
+By the end of this module, participants will:
+
+- Deploy cloud infrastructure using the official `langchain-ai/terraform` repository
+- Install LangSmith using the official `langchain-ai/helm` chart
+- Validate cluster readiness, storage, and ingress
+- Understand *why* specific architectural choices are required
+- Establish a baseline configuration for future modules
+- Be ready to layer in authentication (Module 2) and production operations (Module 3)
+
+---
+
+## What This Module Avoids
+
+- **SSO / OIDC / SAML:** Covered in Module 2
+- **HA tuning beyond defaults:** Covered in Module 3
+- **Advanced autoscaling (KEDA):** Covered in Module 3
+- **Performance benchmarking:** Out of scope
+- **Custom infrastructure:** We use official Terraform modules only
+- **Forked repositories:** We reference official repos directly
+
+This keeps the baseline clean, repeatable, and supportable.
+
+---
+
+## Architecture Baseline (What We Support)
+
+This workshop uses a **single, opinionated baseline**:
+
+### Compute
+- **AWS:** Amazon EKS (Elastic Kubernetes Service)
+- **Azure:** Azure Kubernetes Service (AKS)
+- **GCP:** Google Kubernetes Engine (GKE) - coming soon
+
+### Ingress
+- **AWS:** AWS Application Load Balancer (ALB) - cloud-native load balancer only
+- **Azure:** Azure Application Gateway - cloud-native load balancer only
+- **Why:** Cloud-native load balancers provide automatic scaling, health checks, and integration with cloud provider services
+
+### Datastores
+- **PostgreSQL:** Managed service (RDS for AWS, Azure Database for PostgreSQL)
+- **Redis:** Managed service (ElastiCache for AWS, Azure Cache for Redis)
+- **ClickHouse:** Managed service (ClickHouse Cloud) OR in-cluster with EBS CSI/Azure Disk CSI
+- **Why:** Managed services reduce operational overhead and provide automated backups
+
+### Blob Storage
+- **AWS:** S3 (Simple Storage Service) - **required for production**
+- **Azure:** Azure Blob Storage - **required for production**
+- **Why:** Without blob storage, ClickHouse table size explodes under load, making the system unusable
+
+### Provisioning
+- **Infrastructure:** Terraform (official `langchain-ai/terraform` repository)
+- **Application:** Helm (official `langchain-ai/helm` chart)
+
+### Deviations
+
+Deviations from this baseline are discussed in advanced modules but not used here. This ensures:
+- Support can help troubleshoot standard configurations
+- Updates and security patches are straightforward
+- Documentation and runbooks apply directly
+
+---
+
+## Workshop Flow
+
+### 1️⃣ Environment Readiness & Preflight (20–30 min)
+
+**Notebook:** `01_preflight.ipynb`
+
+**What we validate:**
+- Tooling validation (cloud CLI, terraform, kubectl, helm, jq)
+- Cloud provider credentials & region sanity check
+- Cluster capacity expectations
+- Storage prerequisites (CSI drivers, StorageClasses)
+- Blob storage requirement (cloud object storage)
+
+**Key emphasis:**
+- Verify you're using the correct cloud account/subscription
+- Ensure all required tools are available and in PATH
+- Validate storage CSI drivers are installed
+- Confirm blob storage is accessible
+
+**Output:**
+- Environment validated and ready
+- Artifacts directory created
+- Cloud provider identity confirmed
+
+---
+
+### 2️⃣ Terraform: Provisioning the Platform Substrate (45–60 min)
+
+**Notebook:** `02_terraform_apply.ipynb`
+
+**What we deploy:**
+- Managed Kubernetes cluster (EKS/AKS)
+- Managed PostgreSQL database (RDS/Azure Database)
+- Managed Redis cache (ElastiCache/Azure Cache)
+- Object storage for blob storage (S3/Azure Blob Storage)
+- IAM/RBAC roles and policies
+- Storage CSI driver addon
+
+**Key principles:**
+- Use the **official** Terraform repo (do not fork)
+- Pin module versions for reproducibility
+- Use remote state & locking
+- Plan before applying
+- Capture outputs needed for Helm
+
+**Workflow:**
+1. Clone and navigate to official Terraform repository
+2. Identify correct module path for your cloud provider
+3. Pin module versions in `versions.tf`
+4. Configure Terraform variables (region, cluster name, database credentials)
+5. Initialize Terraform (`terraform init`)
+6. Create Terraform plan (`terraform plan`)
+7. Review plan carefully
+8. Apply infrastructure (`terraform apply`)
+9. Capture outputs for Helm configuration
+
+**Key emphasis:**
+- Why we do *not* fork upstream
+- Why remote state & locking matter
+- What support will expect to see later
+- How to interpret Terraform outputs
+
+**Output:**
+- Infrastructure deployed and healthy
+- Terraform outputs captured
+- Cluster accessible via kubectl
+
+---
+
+### 3️⃣ Helm: Installing LangSmith (45–60 min)
+
+**Notebook:** `03_helm_install_langsmith.ipynb`
+
+**What we install:**
+- LangSmith application components
+- External service connections (PostgreSQL, Redis, blob storage)
+- Resource requests & limits
+- Ingress configuration
+
+**Key principles:**
+- Use the **official** Helm chart (do not fork)
+- Pin chart versions for reproducibility
+- Create minimal, sane values file
+- Inject required secrets properly
+- Render templates before install
+- Understand that "helm install succeeded" ≠ "system is healthy"
+
+**Workflow:**
+1. Clone and navigate to official Helm repository
+2. Identify correct chart path
+3. Pin chart version
+4. Create minimal values file:
+   - External service connections (database, cache, blob storage)
+   - Resource requests & limits
+   - Ingress configuration
+   - Required secrets
+5. Create Kubernetes secrets for sensitive values
+6. Render templates (`helm template`) to validate
+7. Install chart (`helm install`)
+8. Verify installation (`helm status`)
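+
+Step 5 can be sketched as a plain Kubernetes `Secret` manifest. The secret name and key names below are placeholders invented for illustration; the names the chart actually expects are defined by the official Helm chart and are not shown here.
+
+```yaml
+# Illustrative only: names and keys are hypothetical, values are placeholders
+apiVersion: v1
+kind: Secret
+metadata:
+  name: langsmith-secrets        # hypothetical name
+  namespace: langsmith           # use your workshop namespace
+type: Opaque
+stringData:                      # accepts plain text; the API server stores it base64-encoded
+  license_key: "<your-license-key>"
+  postgres_connection_url: "<from-terraform-outputs>"
+  redis_connection_url: "<from-terraform-outputs>"
+```
+
+Apply a manifest like this with `kubectl apply -f <file>` before running `helm install`, and keep the file out of version control.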
+
+**Key emphasis:**
+- External services wiring (why managed services matter)
+- Resource requests & limits (why they're required)
+- Why "helm install succeeded" ≠ "system is healthy"
+- Start with minimal values file and only configure what you need
+
+**Output:**
+- LangSmith application deployed
+- Pods starting (may not be ready yet)
+- Helm release created
+
+---
+
+### 4️⃣ Validation & Go/No-Go Checklist (20–30 min)
+
+**Notebook:** `04_validate_ingress_and_ui.ipynb`
+
+**What we validate:**
+1. Pod readiness (all pods running)
+2. License key validation (properly configured)
+3. PVC binding (storage provisioned)
+4. External services connectivity (PostgreSQL, Redis, blob storage)
+5. Ingress provisioning (load balancer created)
+6. Endpoint reachability (services accessible)
+7. Basic UI availability (web interface works)
+8. Basic functional test (optional trace submission)
+
+**Key emphasis:**
+- This checklist becomes your **baseline reference** for future troubleshooting
+- Most issues are caught here, before real users onboard
+- Validation ensures you're on a **supported path**
+
+**Workflow:**
+1. Verify all pods are running and ready
+2. Validate license key is configured correctly
+3. Check PVCs are bound (storage provisioned)
+4. Test connectivity to external services
+5. Verify ingress is provisioned and accessible
+6. Test endpoint reachability (HTTPS)
+7. Verify UI is accessible
+8. Optional: Submit test trace to validate functionality
+
+**Output:**
+- Deployment validated and healthy
+- Baseline reference established
+- Ready for Module 2 (authentication)
+
+---
+
+### 5️⃣ Teardown & Cleanup (Optional, 30–45 min)
+
+**Notebook:** `99_teardown.ipynb`
+
+**What we clean up:**
+- Helm release (LangSmith application)
+- Kubernetes resources (secrets, PVCs)
+- Terraform-managed infrastructure (cluster, database, cache, blob storage)
+
+**Key emphasis:**
+- Avoid ongoing cloud costs
+- Practice proper resource lifecycle management
+- Verify cleanup completed successfully
+
+**Workflow:**
+1. Uninstall Helm release
+2. Clean up remaining Kubernetes resources
+3. Destroy Terraform infrastructure
+4. Verify all resources removed
+
+**Output:**
+- All resources destroyed
+- No ongoing costs
+- Clean slate for re-deployment
+
+---
+
+## Common Pitfalls Addressed in Module 1
+
+### ClickHouse PVCs Stuck in `Pending`
+
+**Symptom:** ClickHouse pods cannot start, PVCs remain in `Pending` state.
+
+**Cause:** Missing EBS CSI driver (AWS) or Azure Disk CSI driver (Azure).
+
+**Fix:** Install CSI driver addon before deploying LangSmith.
+
+**Prevention:** Preflight checks validate CSI driver installation.
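+
+For the AWS case, a hedged sketch of a StorageClass backed by the EBS CSI driver is shown below. The class name `gp3` is an assumption; `ebs.csi.aws.com` is the provisioner name the driver registers, so PVCs that reference a class like this cannot be provisioned until the driver addon is installed.
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: gp3                                # hypothetical class name
+provisioner: ebs.csi.aws.com               # requires the EBS CSI driver addon
+parameters:
+  type: gp3
+volumeBindingMode: WaitForFirstConsumer    # provision once a pod is scheduled, so the volume lands in the pod's AZ
+allowVolumeExpansion: true
+```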
+
+### Load Balancer Never Appears
+
+**Symptom:** Ingress created but no load balancer provisioned.
+
+**Cause:** Wrong ingress class or missing ingress controller.
+
+**Fix:** Use cloud-native ingress class (AWS: `alb`, Azure: `azure/application-gateway`).
+
+**Prevention:** Preflight checks validate ingress controller installation.
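+
+A minimal sketch of an Ingress that selects the AWS Load Balancer Controller, assuming the controller is installed and serves the `alb` class. The Ingress name, host, backend service name, and port are placeholders; in this workshop the Helm chart renders the Ingress, so treat this as a picture of the expected result rather than a file to apply.
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: langsmith                          # placeholder
+  annotations:
+    alb.ingress.kubernetes.io/scheme: internet-facing
+    alb.ingress.kubernetes.io/target-type: ip
+spec:
+  ingressClassName: alb                    # must match a class served by an installed controller
+  rules:
+    - host: langsmith.example.com          # placeholder
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: langsmith-frontend   # placeholder service name
+                port:
+                  number: 80
+```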
+
+### Inline Trace Payloads Exploding ClickHouse
+
+**Symptom:** ClickHouse table size grows rapidly, queries slow down.
+
+**Cause:** Blob storage not configured, large payloads stored inline in ClickHouse.
+
+**Fix:** Configure S3 (AWS) or Azure Blob Storage (Azure) before deployment.
+
+**Prevention:** Preflight checks validate blob storage accessibility.
+
+### Under-Sized Clusters That "Work" Until Users Arrive
+
+**Symptom:** Deployment works initially but fails under load.
+
+**Cause:** Cluster nodes too small, insufficient resources.
+
+**Fix:** Use recommended node sizes (see Service Sizing Baselines below and in Module 3).
+
+**Prevention:** Preflight checks validate cluster capacity expectations.
+
+### Terraform State Lock Issues
+
+**Symptom:** `terraform apply` fails with state lock error.
+
+**Cause:** Another process holds the lock, or previous operation didn't release lock.
+
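+**Fix:** Confirm no other Terraform run is in progress, then release a stale lock with `terraform force-unlock <LOCK_ID>`. Keep state in a remote backend with locking (S3 + DynamoDB for AWS, Azure Storage for Azure).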
+
+**Prevention:** Terraform configuration uses remote state by default.
+
+---
+
+## Service Sizing Baselines
+
+### Kubernetes Cluster
+
+**Production baseline:**
+- **Node instance type:** m5.xlarge (4 vCPU, 16 GB RAM) minimum
+- **Node count:** 3+ nodes (for HA)
+- **Storage:** EBS gp3 (AWS) or Premium SSD (Azure) with 100+ GB per node
+
+**Non-production guidance:**
+- m5.large (2 vCPU, 8 GB RAM) acceptable for development
+- 2 nodes sufficient for non-production
+
+### PostgreSQL
+
+**Production baseline:**
+- **Instance size:** db.r5.xlarge (4 vCPU, 32 GB RAM) minimum
+- **Storage:** 500 GB+ with autoscaling enabled
+- **High availability:** Multi-AZ deployment (RDS) or read replicas (Azure)
+
+**Non-production guidance:**
+- db.t3.medium (2 vCPU, 4 GB RAM) acceptable for development
+- Single-AZ acceptable for non-production
+
+### Redis
+
+**Production baseline:**
+- **Instance type:** cache.r6g.xlarge (4 vCPU, 26.32 GiB RAM) minimum
+- **High availability:** Redis Cluster mode enabled (3+ nodes)
+
+**Non-production guidance:**
+- cache.t3.micro acceptable for development
+- Single node acceptable for non-production
+
+### ClickHouse
+
+**Production baseline:**
+- **Deployment:** Managed ClickHouse (ClickHouse Cloud) OR in-cluster with EBS CSI
+- **In-cluster sizing:** 3-node cluster minimum (for HA)
+- **Resources per node:** 8 CPU, 32 GB RAM, 1 TB storage
+
+**Non-production guidance:**
+- Single node acceptable for development
+- 4 CPU, 16 GB RAM per node sufficient
+
+---
+
+## Blob Storage Requirement
+
+### Why Blob Storage is Required
+
+**Problem without blob storage:**
+- Large trace payloads stored inline in ClickHouse
+- ClickHouse table size explodes
+- Query performance degrades
+- Storage costs increase dramatically
+- System becomes unusable under load
+
+**Solution with blob storage:**
+- Large payloads stored in S3/Azure Blob Storage
+- ClickHouse stores only references (small strings)
+- Query performance remains stable
+- Storage costs scale linearly
+- System handles production load
+
+### Requirements
+
+**Production:**
+- **Service:** S3 (AWS) or Azure Blob Storage (Azure)
+- **Bucket/Container:** Dedicated bucket for LangSmith
+- **Access:** IAM roles (AWS) or Managed Identity (Azure) - no access keys
+- **Versioning:** Enabled for data protection
+- **Encryption:** Server-side encryption enabled
+
+**Non-production:**
+- Local MinIO or in-cluster object storage acceptable
+- Access keys acceptable (not for production)
+- No versioning required
+
+---
+
+## Terraform Best Practices
+
+### Use Official Repository
+
+**Why:**
+- Support expects standard configurations
+- Updates and security patches are provided
+- Documentation and examples are maintained
+- Compatibility with Helm chart is guaranteed
+
+**How:**
+- Clone `langchain-ai/terraform` repository
+- Reference modules directly (do not fork)
+- Pin module versions in `versions.tf`
+
+### Remote State & Locking
+
+**Why:**
+- Prevents concurrent modifications
+- Enables team collaboration
+- Provides state history
+- Prevents state corruption
+
+**Configuration:**
+- **AWS:** S3 backend with DynamoDB table for locking
+- **Azure:** Azure Storage backend with a blob container (locking via blob leases)
+
+### Plan Before Apply
+
+**Why:**
+- Review changes before applying
+- Catch configuration errors early
+- Understand resource impact
+- Validate variable values
+
+**Workflow:**
+1. `terraform init` - Initialize backend and modules
+2. `terraform plan` - Generate execution plan
+3. Review plan carefully
+4. `terraform apply` - Apply changes
+
+---
+
+## Helm Best Practices
+
+### Use Official Chart
+
+**Why:**
+- Support expects standard configurations
+- Updates and security patches are provided
+- Documentation and examples are maintained
+- Compatibility with Terraform outputs is guaranteed
+
+**How:**
+- Clone `langchain-ai/helm` repository
+- Reference chart directly (do not fork)
+- Pin chart version
+
+### Minimal Values File
+
+**Principle:** Start with minimal configuration and only add what you need.
+
+**Why:**
+- Reduces complexity
+- Fewer points of failure
+- Easier to troubleshoot
+- Clearer configuration intent
+
+**What to include:**
+- External service connections (database, cache, blob storage)
+- Resource requests & limits
+- Ingress configuration
+- Required secrets
+
+**What to avoid:**
+- Configuration for services you're not using
+- Over-optimization before baseline works
+- Custom modifications without justification
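+
+The resource requests and limits entry can be pictured as the standard Kubernetes container `resources` stanza below. The numbers are illustrative, not sizing guidance, and where the stanza sits in the values file depends on the chart's schema, which is not reproduced here.
+
+```yaml
+# Generic Kubernetes resources stanza; placement within the values file is chart-specific
+resources:
+  requests:            # what the scheduler reserves for the container
+    cpu: "500m"
+    memory: 1Gi
+  limits:              # hard ceiling; exceeding the memory limit gets the container OOM-killed
+    cpu: "2"
+    memory: 4Gi
+```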
+
+### Render Before Install
+
+**Why:**
+- Validate template syntax
+- Review generated manifests
+- Catch configuration errors early
+- Understand what will be deployed
+
+**Command:**
+```bash
+helm template <release-name> <chart-path> -f <values-file> -n <namespace>
+```
+
+---
+
+## Validation Checklist
+
+See `notebooks/module-1/04_validate_ingress_and_ui.ipynb` for complete validation.
+
+**Quick checklist:**
+- [ ] All pods running and ready
+- [ ] License key configured correctly
+- [ ] PVCs bound (storage provisioned)
+- [ ] External services accessible (PostgreSQL, Redis, blob storage)
+- [ ] Ingress provisioned and accessible
+- [ ] Endpoint reachable via HTTPS
+- [ ] UI accessible in browser
+- [ ] Basic functional test passes (optional)
+
+---
+
+## Artifacts Participants Leave With
+
+1. **Working baseline deployment**
+   - LangSmith accessible via HTTPS
+   - All services healthy and connected
+   - Ingress configured correctly
+
+2. **Pinned Terraform + Helm configuration**
+   - Terraform module versions documented
+   - Helm chart version documented
+   - Values file saved and version controlled
+
+3. **Validated ingress endpoint**
+   - HTTPS URL accessible
+   - TLS certificate valid
+   - DNS configured correctly
+
+4. **Readiness checklist**
+   - Validation results documented
+   - Baseline reference established
+   - Troubleshooting evidence collected
+
+5. **Confidence they're on a supported path**
+   - Official repositories used
+   - Standard configuration applied
+   - Support can help troubleshoot
+
+---
+
+## Next Steps
+
+1. **Run the validation notebook:**
+   - `notebooks/module-1/04_validate_ingress_and_ui.ipynb`
+   - Address any failures before proceeding
+
+2. **Proceed to Module 2:**
+   - Configure authentication (OIDC/SAML)
+   - Set up role mapping
+   - Validate SSO flows
+
+3. **Proceed to Module 3:**
+   - Configure production operations
+   - Set up autoscaling
+   - Establish observability
+
+---
+
+## References
+
+- [Official Terraform Repository](https://github.com/langchain-ai/terraform)
+- [Official Helm Repository](https://github.com/langchain-ai/helm)
+- LangSmith Self-Hosted Documentation
+- Cloud Provider Documentation (AWS EKS, Azure AKS)
+
+---
+
+## Troubleshooting
+
+### Common Issues
+
+**Terraform apply fails:**
+- Check cloud provider credentials
+- Verify IAM permissions
+- Review Terraform plan for errors
+- Check remote state backend configuration
+
+**Helm install fails:**
+- Verify chart path is correct
+- Check values file syntax
+- Validate secrets exist
+- Review Helm template output
+
+**Pods not starting:**
+- Check pod logs: `kubectl logs <pod> -n <namespace>`
+- Check events: `kubectl get events -n <namespace>`
+- Verify resource requests/limits
+- Check PVC binding status
+
+**Ingress not accessible:**
+- Verify ingress controller installed
+- Check ingress class matches controller
+- Verify DNS configuration
+- Check TLS certificate validity
+
+**External services not accessible:**
+- Verify network connectivity (VPC/VNet)
+- Check security group/NSG rules
+- Validate connection strings
+- Test connectivity from pod
+
+For detailed troubleshooting, see the validation notebook and Module 3 operations guide.
+
@@ -0,0 +1,541 @@
+# Module 2: Identity & Authentication
+
+**Duration:** ~2 hours  
+**Audience:** Operators deploying and managing LangSmith self-hosted  
+**Prerequisite:** Module 1 complete (working deployment with DNS/TLS/Ingress configured)
+
+---
+
+## Motivation
+
+Most production LangSmith deployments require centralized identity management. Configuring SSO **before** onboarding users prevents:
+
+- Manual user provisioning overhead
+- Security gaps from shared credentials
+- Compliance violations from unmanaged access
+- Operational toil from authentication failures
+
+This module ensures your authentication setup is **correct from day one**, not retrofitted after users are already in the system.
+
+---
+
+## Outcomes
+
+By the end of this module, participants will:
+
+- Understand LangSmith's authentication and authorization model
+- Configure OIDC or SAML SSO with their identity provider
+- Validate authentication flows end-to-end
+- Map identity provider groups to LangSmith roles
+- Troubleshoot common authentication failures
+- Maintain authentication configuration as code
+
+---
+
+## What This Module Avoids
+
+- **IdP admin tutorials:** We assume your IdP team provides required configuration values
+- **SCIM deep-dive:** User provisioning via SCIM is out of scope
+- **Multi-IdP scenarios:** We focus on single IdP configuration
+- **Local auth production use:** Local authentication is discouraged for production deployments
+
+---
+
+## Supported Identity Models
+
+### OIDC (Preferred)
+- **When to use:** Modern IdPs (Okta, Azure AD, Google Workspace, Auth0)
+- **Advantages:** Standard protocol, easier debugging, better error messages
+- **Requirements:** OIDC-compliant IdP with client credentials
+
+### SAML (Fallback)
+- **When to use:** Legacy IdPs or enterprise requirements
+- **Advantages:** Widely supported, enterprise-standard
+- **Requirements:** SAML 2.0 IdP with metadata endpoint or XML file
+
+### Local Authentication (Discouraged)
+- **When to use:** Development/testing only
+- **Limitations:** No centralized management, manual user creation, security risk
+- **Note:** This module does not cover local auth configuration
+
+---
+
+## Authentication Request Flow
+
+```
+┌─────────┐         ┌──────────────┐         ┌─────────────┐
+│ Browser │         │  LangSmith   │         │  Identity   │
+│         │         │   (Ingress)  │         │  Provider   │
+└────┬────┘         └──────┬───────┘         └──────┬──────┘
+     │                     │                         │
+     │ 1. GET /login       │                         │
+     ├────────────────────>│                         │
+     │                     │                         │
+     │ 2. Redirect to IdP  │                         │
+     │    (with state)     │                         │
+     │<────────────────────┤                         │
+     │                     │                         │
+     │ 3. GET /authorize   │                         │
+     ├───────────────────────────────────────────────>│
+     │                     │                         │
+     │ 4. User authenticates                          │
+     │    (IdP UI)                                    │
+     │                     │                         │
+     │ 5. Callback with code/token                   │
+     │<───────────────────────────────────────────────┤
+     │                     │                         │
+     │ 6. POST /callback   │                         │
+     ├────────────────────>│                         │
+     │                     │                         │
+     │ 7. Exchange code for token                     │
+     │                     ├─────────────────────────>│
+     │                     │<─────────────────────────┤
+     │                     │                         │
+     │ 8. Validate token & extract claims             │
+     │                     │                         │
+     │ 9. Create/update user session                  │
+     │                     │                         │
+     │ 10. Redirect to dashboard                      │
+     │<────────────────────┤                         │
+     │                     │                         │
+```
+
+**Key Points:**
+- Redirect URI must match **exactly** (protocol, domain, path, trailing slashes)
+- State parameter prevents CSRF attacks
+- Token validation includes signature, expiration, issuer, and audience verification
+- Claims mapping determines user roles and workspace access
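+
+The role of the state parameter in steps 2 and 6 can be sketched in a few lines of Python. This is a minimal illustration of the standard OAuth2 pattern, not LangSmith's implementation; the function names and the endpoint passed in are hypothetical.
+
+```python
+import secrets
+from urllib.parse import urlencode
+
+
+def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scopes):
+    """Return (url, state) for the redirect to the IdP (step 2)."""
+    # Unguessable value that the server stores alongside the browser session.
+    state = secrets.token_urlsafe(32)
+    query = urlencode({
+        "response_type": "code",
+        "client_id": client_id,
+        "redirect_uri": redirect_uri,
+        "scope": " ".join(scopes),
+        "state": state,
+    })
+    return f"{authorize_endpoint}?{query}", state
+
+
+def state_matches(expected_state, returned_state):
+    """Compare the stored state with the one returned on the callback (step 6)."""
+    # Constant-time comparison; a mismatch means the callback must be rejected.
+    return secrets.compare_digest(expected_state.encode(), returned_state.encode())
+```
+
+A callback whose `state` does not match the stored value did not originate from this login attempt, which is why the parameter blocks CSRF.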
+
+---
+
+## Workshop Flow
+
+### 1. LangSmith Authentication Model
+
+**Authentication vs Authorization:**
+- **Authentication (AuthN):** "Who are you?" - Verified by IdP
+- **Authorization (AuthZ):** "What can you do?" - Determined by role mapping
+
+**Roles:**
+- **Admin:** Full system access, workspace management, user management
+- **Member:** Workspace access, project creation, trace viewing
+- **Viewer:** Read-only access to assigned workspaces
+
+**Workspaces & Organizations:**
+- Users belong to **organizations** (top-level container)
+- Users access **workspaces** within organizations
+- Role mapping determines which workspaces a user can access
+- **No shared admin accounts** - each user authenticates individually
+
+**Key Principle:** Authentication is centralized (IdP), authorization is application-level (LangSmith role mapping).
+
+---
+
+### 2. Choosing OIDC vs SAML
+
+**Decision Rule:**
+
+```
+IF IdP supports OIDC AND you can configure OIDC client
+  → Use OIDC (preferred)
+ELSE IF IdP only supports SAML OR enterprise requires SAML
+  → Use SAML (fallback)
+ELSE
+  → Re-evaluate IdP choice
+```
+
+**OIDC Advantages:**
+- Better error messages
+- Easier debugging (standard endpoints)
+- Modern protocol with better security defaults
+- Simpler configuration
+
+**SAML Advantages:**
+- Enterprise-standard
+- Widely supported
+- Mature protocol
+
+**Recommendation:** Start with OIDC unless blocked by IdP limitations or policy.
+
+---
+
+### 3. Configuring OIDC
+
+#### Required IdP Inputs
+
+Your IdP team must provide:
+
+1. **Issuer URL** (e.g., `https://your-org.okta.com/oauth2/default`)
+   - Must be HTTPS
+   - Must be reachable from LangSmith pods
+   - Used for discovery and token validation
+
+2. **Client ID**
+   - OAuth2 client identifier
+   - Public value (safe to log)
+
+3. **Client Secret**
+   - OAuth2 client secret
+   - **Never log or print**
+   - Store in Kubernetes secret
+
+4. **Redirect URI**
+   - **Exact format:** `https://your-langsmith-domain.com/auth/callback`
+   - Must match **exactly** (case-sensitive, no trailing slash unless specified)
+   - IdP team must whitelist this URI
+
+5. **Required Claims**
+   - `email` (required): User email address
+   - `name` (optional): Display name
+   - `groups` (optional): Group membership for role mapping
+
+6. **Scopes**
+   - `openid` (required)
+   - `email` (required)
+   - `profile` (optional)
+   - `groups` (optional, if using group-based role mapping)
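+
+Before handing these inputs to Helm, a quick pre-check can catch the most common input errors. The sketch below assumes only what the OIDC Discovery specification defines (the `/.well-known/openid-configuration` path under the issuer) and the required scopes listed above; the helper names are hypothetical.
+
+```python
+from urllib.parse import urlparse
+
+REQUIRED_SCOPES = {"openid", "email"}
+
+
+def discovery_url(issuer):
+    """Build the OIDC discovery URL for an issuer, rejecting non-HTTPS issuers."""
+    if urlparse(issuer).scheme != "https":
+        raise ValueError(f"issuer must use HTTPS: {issuer}")
+    # Drop a trailing slash so the path is not doubled.
+    return issuer.rstrip("/") + "/.well-known/openid-configuration"
+
+
+def missing_scopes(configured_scopes):
+    """Return the required scopes that are absent from the configured list."""
+    return sorted(REQUIRED_SCOPES - set(configured_scopes))
+```
+
+Fetching the discovery URL from inside a LangSmith pod is a simple way to confirm the issuer is reachable from the cluster.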
+
+#### Helm/Environment Configuration
+
+**Helm Values (recommended):**
+
+```yaml
+auth:
+  provider: oidc
+  oidc:
+    issuer: "https://your-org.okta.com/oauth2/default"
+    clientId: "your-client-id"
+    clientSecret:
+      secretName: langsmith-oidc-secret
+      secretKey: client-secret
+    redirectURI: "https://your-langsmith-domain.com/auth/callback"
+    scopes:
+      - openid
+      - email
+      - profile
+      - groups
+    claimMapping:
+      email: email
+      name: name
+      groups: groups
+```
+
+**Environment Variables (alternative):**
+
+```bash
+AUTH_PROVIDER=oidc
+OIDC_ISSUER=https://your-org.okta.com/oauth2/default
+OIDC_CLIENT_ID=your-client-id
+OIDC_CLIENT_SECRET=<from-secret>
+OIDC_REDIRECT_URI=https://your-langsmith-domain.com/auth/callback
+OIDC_SCOPES=openid,email,profile,groups
+```
+
+#### Redirect URI Exactness
+
+**Critical:** The redirect URI must match **exactly** between:
+- LangSmith configuration
+- IdP whitelist
+- Actual callback URL
+
+**Common Mistakes:**
+- Trailing slash mismatch: `/auth/callback` vs `/auth/callback/`
+- Protocol mismatch: `http://` vs `https://`
+- Domain mismatch: `langsmith.example.com` vs `www.langsmith.example.com`
+- Port mismatch: `:443` vs no port
+
+**Validation:** Use the validation notebook to verify exact match.
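+
+The mismatch classes above can be detected mechanically. The following is a minimal sketch (not the code used in the validation notebook) that compares the configured URI with the one registered at the IdP and names the component that differs; the function name is hypothetical.
+
+```python
+from urllib.parse import urlsplit
+
+
+def redirect_uri_mismatches(configured, registered):
+    """List the components that differ between two redirect URIs."""
+    if configured == registered:
+        return []
+    a, b = urlsplit(configured), urlsplit(registered)
+    problems = []
+    if a.scheme != b.scheme:
+        problems.append(f"protocol: {a.scheme} vs {b.scheme}")
+    if a.hostname != b.hostname:
+        problems.append(f"domain: {a.hostname} vs {b.hostname}")
+    if a.port != b.port:
+        problems.append(f"port: {a.port} vs {b.port}")
+    if a.path != b.path:
+        # Paths that differ only by trailing slashes get their own label.
+        kind = "trailing slash" if a.path.rstrip("/") == b.path.rstrip("/") else "path"
+        problems.append(f"{kind}: {a.path} vs {b.path}")
+    # Strings differ but no component above does: report the remaining cases.
+    return problems or ["other difference (letter case, query string, or fragment)"]
+```
+
+An empty list means the two strings are byte-for-byte identical, which is the only result that should be treated as a pass.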
+
+#### TLS Requirements
+
+- IdP issuer URL must be HTTPS
+- LangSmith domain must have valid TLS certificate
+- Certificate must be trusted by browser (not self-signed for production)
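+- Certificate must cover the LangSmith hostname exactly (a wildcard such as `*.example.com` matches only one subdomain level, so it does not cover `langsmith.prod.example.com`)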
+
+#### Clock Skew
+
+- LangSmith and IdP clocks must be synchronized
+- Maximum allowed skew: typically 5 minutes
+- Use NTP on Kubernetes nodes
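+- Validate by comparing `kubectl exec <pod> -n <namespace> -- date -u` against the IdP server time (for example, the `Date` header of an HTTPS response from the issuer)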
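+
+The skew check itself is simple arithmetic. As a rough sketch, assuming the IdP time is taken from an HTTP `Date` header and the 5-minute limit above applies (both the constant and the function name are illustrative):
+
+```python
+from datetime import datetime, timezone
+from email.utils import parsedate_to_datetime
+
+MAX_SKEW_SECONDS = 300  # the "typically 5 minutes" limit from the text
+
+
+def clock_skew_seconds(local_now, idp_date_header):
+    """Absolute difference between a timezone-aware local time and an HTTP Date header."""
+    idp_time = parsedate_to_datetime(idp_date_header)
+    return abs((local_now - idp_time).total_seconds())
+
+
+if __name__ == "__main__":
+    # Example header value; in practice read it from a response served by the issuer.
+    skew = clock_skew_seconds(datetime.now(timezone.utc), "Wed, 21 Oct 2015 07:28:00 GMT")
+    print("within limit" if skew <= MAX_SKEW_SECONDS else f"skew too large: {skew:.0f}s")
+```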
+
+---
+
+### 4. Role Mapping
+
+**Principle:** Map IdP groups to LangSmith roles, not individual users.
+
+#### Group-Based Mapping (Recommended)
+
+```yaml
+auth:
+  roleMapping:
+    groups:
+      - group: "langsmith-admins"
+        role: "admin"
+      - group: "langsmith-members"
+        role: "member"
+      - group: "langsmith-viewers"
+        role: "viewer"
+```
+
+**Benefits:**
+- Centralized management in IdP
+- Easier audit trail
+- Scales to large organizations
+
+#### User-Based Mapping (Fallback)
+
+```yaml
+auth:
+  roleMapping:
+    users:
+      - email: "admin@example.com"
+        role: "admin"
+```
+
+**Use only when:**
+- Group claims unavailable
+- Temporary workaround
+- Small team (< 10 users)
+
+#### Minimal Admins Principle
+
+- **Start with 1-2 admins**
+- Add admins only when necessary
+- Use group-based mapping for admins
+- Document admin assignments
+
+#### Mapping Claims to Roles
+
+**Claim Structure:**
+
+```json
+{
+  "email": "user@example.com",
+  "name": "John Doe",
+  "groups": ["langsmith-members", "engineering"]
+}
+```
+
+**Mapping Logic:**
+1. Extract `groups` claim
+2. Match against role mapping configuration
+3. Assign highest privilege role found
+4. Default to "member" if no match
+
+**Validation:** Test with users in different groups to verify mapping.
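+
+The four mapping steps can be expressed as a short function. This is an illustrative sketch of the logic described above, not LangSmith's actual code; the group names mirror the example Helm values and the privilege ordering is an assumption.
+
+```python
+# Higher number means more privilege (assumed ordering: viewer < member < admin).
+ROLE_PRIORITY = {"viewer": 0, "member": 1, "admin": 2}
+
+GROUP_TO_ROLE = {
+    "langsmith-admins": "admin",
+    "langsmith-members": "member",
+    "langsmith-viewers": "viewer",
+}
+
+DEFAULT_ROLE = "member"
+
+
+def resolve_role(claims, group_to_role=GROUP_TO_ROLE):
+    """Map the `groups` claim to the highest-privilege matching role."""
+    groups = claims.get("groups") or []  # step 1: extract the claim
+    matched = [group_to_role[g] for g in groups if g in group_to_role]  # step 2
+    if not matched:
+        return DEFAULT_ROLE  # step 4: no mapped group
+    return max(matched, key=ROLE_PRIORITY.__getitem__)  # step 3
+```
+
+For the claim structure shown above, `resolve_role` returns `"member"`: `engineering` is not a mapped group and is ignored.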
+
+---
+
+### 5. SAML Configuration
+
+#### Required Metadata
+
+Your IdP team must provide:
+
+1. **SAML Metadata URL** (preferred)
+   - HTTPS endpoint serving XML metadata
+   - Must be reachable from LangSmith pods
+   - Auto-refreshes configuration
+
+2. **SAML Metadata XML** (fallback)
+   - Static XML file
+   - Must be updated manually when IdP changes
+   - Store in Kubernetes secret or ConfigMap
+
+#### Expected Attributes
+
+**Required:**
+- `email` or `http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress`
+- `name` or `http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name`
+
+**Optional (for role mapping):**
+- `groups` or `http://schemas.microsoft.com/ws/2008/06/identity/claims/groups`
+- Custom attribute names (must match exactly)
+
+#### Common Failures
+
+1. **Missing Attributes**
+   - Symptom: User authenticates but has no email/name
+   - Cause: IdP not sending required attributes
+   - Fix: Configure IdP to send required attributes
+
+2. **Attribute Name Mismatch**
+   - Symptom: Claims not mapped correctly
+   - Cause: LangSmith expects different attribute name
+   - Fix: Update attribute mapping in Helm values
+
+3. **Signature Validation Failure**
+   - Symptom: Authentication fails with "invalid signature"
+   - Cause: Certificate mismatch or expired certificate
+   - Fix: Update IdP certificate in metadata
+
+4. **Assertion Expired**
+   - Symptom: Authentication fails with an expired or not-yet-valid assertion error
+   - Cause: Clock skew or assertion validity window too short
+   - Fix: Synchronize clocks, adjust validity window
+
+---
+
+### 6. Validation & Failure Drills
+
+#### Validation Checklist
+
+See `docs/shared/auth_validation_checklist.md` for complete checklist.
+
+**Quick Validation:**
+1. ✅ Ingress/TLS configured correctly
+2. ✅ Redirect URI matches exactly
+3. ✅ IdP issuer reachable
+4. ✅ Client credentials valid
+5. ✅ Role mapping configured
+6. ✅ Login flow works end-to-end
+7. ✅ Logout works
+8. ✅ Session invalidation works
+
+#### Failure Drills
+
+**Purpose:** Understand failure modes and recovery procedures.
+
+**Drill 1: Redirect URI Mismatch**
+- **Change:** Modify redirect URI in Helm values (add trailing slash)
+- **Observe:** Login redirect fails
+- **Recover:** Revert change, restart pods
+- **Validate:** Login works again
+
+**Drill 2: Missing Claim**
+- **Change:** Remove `groups` claim from IdP configuration
+- **Observe:** Users authenticate but group-based roles are not applied (they fall back to the default role)
+- **Recover:** Restore `groups` claim
+- **Validate:** Role mapping works again
+
+**Drill 3: Secret Rotation Wrong**
+- **Change:** Update client secret in IdP but not in LangSmith
+- **Observe:** Authentication fails with "invalid client"
+- **Recover:** Update Kubernetes secret, restart pods
+- **Validate:** Authentication works again
+
+**Note:** These drills are **optional** and should only be run in non-production environments.
+
+---
+
+## Common Pitfalls
+
+### Login Loop
+**Symptom:** User redirected to IdP, then back to LangSmith, then to IdP again (infinite loop)
+
+**Causes:**
+- Redirect URI mismatch
+- Session cookie not set (TLS/cookie issues)
+- Token validation failure
+
+**Fix:** Check redirect URI exactness, verify TLS certificate, check token validation logs
+
+### No Data After Login
+**Symptom:** User authenticates successfully but sees empty workspace
+
+**Causes:**
+- Role mapping not configured
+- User not in any mapped groups
+- Workspace not assigned to user's organization
+
+**Fix:** Verify role mapping configuration, check user's group membership, verify workspace assignment
+
+### TLS Callback Issues
+**Symptom:** IdP callback fails with TLS errors
+
+**Causes:**
+- Self-signed certificate on LangSmith domain
+- Certificate chain incomplete
+- Certificate expired
+
+**Fix:** Use valid TLS certificate from trusted CA, ensure full chain is present
+
+### Multiple IdPs
+**Symptom:** Users are redirected to an unexpected IdP, or it is unclear which IdP configuration is active
+
+**Causes:**
+- Multiple IdP configurations present
+- Configuration precedence unclear
+
+**Fix:** Use single IdP configuration, remove unused configurations
+
+---
+
+## Security & Compliance Callouts
+
+### Least Privilege
+- Start with minimal admin access
+- Use group-based role mapping
+- Regular access reviews
+- Document all admin assignments
+
+### Auditability
+- All authentication events logged
+- Role changes tracked
+- Session creation/destruction logged
+- Export logs to SIEM for compliance
+
+### Centralized Identity Governance
+- Manage users in IdP, not LangSmith
+- Use IdP groups for access control
+- Regular access reviews in IdP
+- Deprovision users in IdP when they leave
+
+---
+
+## Artifacts Participants Leave With
+
+1. **SSO Configuration**
+   - Helm values file with auth configuration
+   - Kubernetes secrets for client credentials
+   - Documentation of IdP settings
+
+2. **IdP Settings Document**
+   - Redirect URI whitelisted
+   - Required claims configured
+   - Scopes configured
+   - Group structure documented
+
+3. **Mapping Reference**
+   - Group-to-role mapping table
+   - Admin assignments documented
+   - Workspace access rules
+
+4. **Validation Checklist**
+   - Completed validation checklist
+   - Test results for admin and standard user
+   - Logout/session invalidation verified
+
+5. **Debugging Playbook**
+   - Troubleshooting guide reference
+   - Log locations documented
+   - Support bundle procedure
+
+---
+
+## Next Steps
+
+1. **Run the validation notebook:**
+   - `notebooks/module-2/01_sso_oidc_validation.ipynb` (OIDC)
+   - `notebooks/module-2/02_sso_saml_validation.ipynb` (SAML)
+
+2. **Complete the validation checklist:**
+   - `docs/shared/auth_validation_checklist.md`
+
+3. **Review troubleshooting guide:**
+   - `docs/shared/auth_troubleshooting.md`
+
+4. **Proceed to Module 3** (if applicable)
+
+---
+
+## References
+
+- [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)
+- [SAML 2.0 Specification](http://docs.oasis-open.org/security/saml/v2.0/)
+- LangSmith Helm Chart Documentation
+- Your IdP's OIDC/SAML documentation
+
@@ -0,0 +1,679 @@
+# Module 3: Production Operations & Scaling
+
+**Goal:** Enable operators to run LangSmith reliably under real production load, understand scaling domains, and respond effectively when things go wrong (day-2 operations).
+
+**Duration:** ~2 hours  
+**Audience:** Platform engineers, infrastructure teams, SREs, and on-call operators  
+**Prerequisites:**
+- Module 1 complete: LangSmith deployed and reachable (AWS/EKS or Azure/AKS baseline)
+- Module 2 complete: Authentication and authorization configured (OIDC/SAML)
+
+---
+
+## Overview
+
+Module 3 transitions from "it works" to "it works reliably under load." This module covers production operations, scaling strategies, observability, and the mental models needed for day-2 operations.
+
+**What you'll accomplish:**
+- Understand LangSmith's distributed architecture and scaling domains
+- Configure production-grade service sizing and HA
+- Implement autoscaling strategies (HPA and KEDA)
+- Set up observability and early warning signals
+- Validate production readiness
+- Prepare for incident response
+
+**What this module avoids:**
+- Deep dives into specific monitoring tools (Prometheus/Grafana setup)
+- Custom alerting rule creation (covered in incident response)
+- Performance tuning and optimization (out of scope)
+- Multi-region deployments (advanced topic)
+
+---
+
+## Section 1: Production Mental Model
+
+### Distributed System Reality
+
+LangSmith is a **distributed system** with multiple services that must coordinate:
+
+- **API Server:** Handles HTTP requests, authentication, routing
+- **Workers:** Process traces, spans, and evaluations asynchronously
+- **ClickHouse:** Time-series data storage and queries
+- **PostgreSQL:** Metadata, users, workspaces, projects
+- **Redis:** Caching, rate limiting, job queues
+- **Blob Storage:** Large payload storage (traces, artifacts)
+
+**Key insight:** These services have different scaling characteristics and failure modes. Understanding these differences is critical for production operations.
+
+### Scaling Domains
+
+**Scaling domains** are groups of resources that scale together or have shared bottlenecks:
+
+1. **Ingestion Domain:**
+   - API server pods (stateless, horizontal scaling)
+   - Ingress/Load Balancer (cloud-managed, scales automatically)
+   - **Bottleneck:** API server CPU/memory under high request volume
+
+2. **Processing Domain:**
+   - Worker pods (stateless, horizontal scaling)
+   - Redis (single instance or cluster, vertical scaling)
+   - **Bottleneck:** Worker capacity and Redis throughput
+
+3. **Storage Domain:**
+   - ClickHouse (stateful, complex scaling)
+   - PostgreSQL (stateful, vertical scaling + read replicas)
+   - Blob Storage (cloud-managed, effectively unlimited)
+   - **Bottleneck:** ClickHouse query performance, PostgreSQL connection limits
+
+4. **Control Plane:**
+   - Kubernetes cluster (managed service)
+   - Helm releases, ConfigMaps, Secrets
+   - **Bottleneck:** Cluster capacity and node resources
+
+**Critical understanding:** Scaling one domain without addressing downstream bottlenecks creates cascading failures.
+
+---
+
+## Section 2: Scaling Model
+
+### What Scales Well
+
+**Horizontal scaling (add more pods):**
+- API server pods (stateless HTTP handlers)
+- Worker pods (stateless job processors)
+- Ingress controllers (cloud-managed load balancers)
+
+**Why:** These services are stateless and can be scaled independently based on load.
+
+### What Does NOT Autoscale
+
+**Vertical scaling only (increase resources per instance):**
+- PostgreSQL (managed RDS/Azure Database)
+- Redis (managed ElastiCache/Azure Cache)
+- ClickHouse (in-cluster or managed, complex scaling)
+
+**Why:** These are stateful services with data locality requirements. Scaling requires careful planning and may involve downtime.
+
+**Manual capacity planning required:**
+- Kubernetes node capacity (cluster autoscaling helps, but has limits)
+- Blob storage buckets (unlimited capacity, but requires configuration)
+- Network bandwidth (cloud-managed, but has limits)
+
+### Failure Pattern: HPA Increases Ingestion → Downstream Saturation
+
+**Common anti-pattern:**
+
+1. High request volume triggers HPA to scale API server pods
+2. API servers successfully handle more requests
+3. Workers cannot keep up with increased trace volume
+4. Redis queue fills up
+5. ClickHouse ingestion rate saturates
+6. PostgreSQL connection pool exhausts
+7. System degrades despite "scaled" API servers
+
+**Solution:** Scale all domains together, or implement backpressure and rate limiting.
+
+**Key principle:** Monitor downstream services, not just upstream services.
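+
+A back-of-the-envelope model makes the failure pattern concrete. The sketch below uses made-up rates purely for illustration: if the API tier accepts more traces per second than the workers can drain, the Redis queue grows linearly no matter how many API pods exist.
+
+```python
+def queue_backlog(arrival_per_s, workers, per_worker_per_s, seconds):
+    """Jobs left in the queue after `seconds`, starting from an empty queue."""
+    drain_per_s = workers * per_worker_per_s
+    # The backlog only grows when arrivals exceed the total drain rate.
+    return max(0.0, (arrival_per_s - drain_per_s) * seconds)
+
+
+if __name__ == "__main__":
+    # Hypothetical numbers: HPA scaled the API to accept 500 traces/s,
+    # while 4 workers still process 100 traces/s each.
+    print(queue_backlog(500, workers=4, per_worker_per_s=100, seconds=60))
+```
+
+With these example numbers the queue gains 100 jobs every second, so the worker queue thresholds used later in this module would be crossed within seconds.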
+
+---
+
+## Section 3: Service Sizing Baselines
+
+### PostgreSQL (Database)
+
+**Production baseline:**
+- **Instance size:** db.r5.xlarge (4 vCPU, 32 GB RAM) minimum
+- **Storage:** 500 GB+ with autoscaling enabled
+- **High availability:** Multi-AZ deployment (RDS) or zone-redundant high availability (Azure Database for PostgreSQL flexible server)
+- **Connection pool:** 100+ connections configured in LangSmith
+- **Backups:** Automated daily backups with 7-day retention minimum
+
+**Non-production guidance:**
+- db.t3.medium (2 vCPU, 4 GB RAM) acceptable for development
+- Single-AZ acceptable for non-production
+- A shorter backup retention period than production (for example, 1-3 days) is sufficient
+
+**Verification:**
+```bash
+# AWS RDS
+aws rds describe-db-instances --db-instance-identifier <instance-id>
+
+# Azure Database
+az postgres flexible-server show --name <server-name> --resource-group <rg>
+```
+
+**What to check:**
+- Instance class/size
+- Multi-AZ status
+- Storage autoscaling enabled
+- Backup retention period
+- Private networking (VPC/subnet configuration)
+
+### Redis (Cache)
+
+**Production baseline:**
+- **Instance type:** cache.r6g.xlarge (4 vCPU, 26.32 GiB RAM) minimum
+- **High availability:** Redis Cluster mode enabled (3+ nodes)
+- **Memory:** 50% headroom for growth
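+- **Persistence:** AOF (Append Only File) enabled for durability where the service supports it (ElastiCache does not support AOF on current Redis engine versions; rely on Multi-AZ replication and snapshots there)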
+
+**Non-production guidance:**
+- cache.t3.micro acceptable for development
+- Single node acceptable for non-production
+- RDB snapshots sufficient (no AOF required)
+
+**Verification:**
+```bash
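+# AWS ElastiCache (node type and engine version)
+aws elasticache describe-cache-clusters --cache-cluster-id <cluster-id>
+# AWS ElastiCache (cluster mode and Multi-AZ status of the replication group)
+aws elasticache describe-replication-groups --replication-group-id <replication-group-id>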
+
+# Azure Cache
+az redis show --name <cache-name> --resource-group <rg>
+```
+
+**What to check:**
+- Node type and memory size
+- Cluster mode enabled (production)
+- AOF persistence enabled
+- Private networking
+
+### ClickHouse
+
+**Production baseline:**
+- **Deployment:** Managed ClickHouse (ClickHouse Cloud) OR in-cluster with EBS CSI
+- **In-cluster sizing:** 3-node cluster minimum (for HA)
+- **Resources per node:** 8 CPU, 32 GB RAM, 1 TB storage
+- **Storage:** EBS gp3 volumes with 3000 IOPS
+- **Replication:** 2x replication factor (for example, 3 shards with 2 replicas each gives 6 ClickHouse pods)
+
+**Non-production guidance:**
+- Single node acceptable for development
+- 4 CPU, 16 GB RAM per node sufficient
+- 100 GB storage per node
+
+**Verification:**
+```bash
+# In-cluster ClickHouse
+kubectl get statefulset -n <namespace> | grep clickhouse
+kubectl get pvc -n <namespace> | grep clickhouse
+
+# Check ClickHouse cluster status
+kubectl exec -it <clickhouse-pod> -n <namespace> -- clickhouse-client --query "SELECT * FROM system.clusters"
+```
+
+**What to check:**
+- StatefulSet replica count
+- PVC size and storage class
+- Resource requests/limits
+- Replication factor
+
+### Managed vs In-Cluster
+
+**Managed services (recommended for production):**
+- PostgreSQL: RDS (AWS) or Azure Database for PostgreSQL
+- Redis: ElastiCache (AWS) or Azure Cache for Redis
+- ClickHouse: ClickHouse Cloud (managed service)
+
+**Benefits:**
+- Automated backups and maintenance
+- High availability built-in
+- Security patches applied automatically
+- Monitoring and alerting included
+
+**In-cluster services (acceptable for non-production):**
+- PostgreSQL: Postgres operator (Crunchy Data, Zalando)
+- Redis: Redis operator or Helm chart
+- ClickHouse: ClickHouse operator
+
+**Trade-offs:**
+- More operational overhead
+- Requires backup strategy
+- Manual HA configuration
+- Lower cost for development
+
+### Private Networking
+
+**Production requirement:** All data stores must be in private subnets with no public internet access.
+
+**Why:**
+- Security: Reduces attack surface
+- Compliance: Required for many compliance frameworks
+- Performance: Lower latency within VPC/VNet
+
+**Verification:**
+- RDS/Azure Database: Check subnet group (private subnets only)
+- ElastiCache/Azure Cache: Check subnet group (private subnets only)
+- ClickHouse: Check pod network policies and service mesh egress rules
+
+---
+
+## Section 4: Blob Storage REQUIRED for Production
+
+### Why Blob Storage is Required
+
+**Problem without blob storage:**
+- Large trace payloads stored inline in ClickHouse
+- ClickHouse table size explodes
+- Query performance degrades
+- Storage costs increase dramatically
+- System becomes unusable under load
+
+**Solution with blob storage:**
+- Large payloads stored in S3/Azure Blob Storage
+- ClickHouse stores only references (small strings)
+- Query performance remains stable
+- Storage costs scale linearly
+- System handles production load
+
+### Requirements
+
+**Production:**
+- **Service:** S3 (AWS) or Azure Blob Storage (Azure)
+- **Bucket/Container:** Dedicated bucket for LangSmith
+- **Access:** IAM roles (AWS) or Managed Identity (Azure) - no access keys
+- **Lifecycle policies:** Configured for cost optimization (move to Glacier/Cool tier after 90 days)
+- **Versioning:** Enabled for data protection
+- **Encryption:** Server-side encryption enabled
+
+**Non-production:**
+- Local MinIO or in-cluster object storage acceptable
+- Access keys acceptable (not for production)
+- No lifecycle policies required
+
+### Verification
+
+**Check Helm values:**
+```yaml
+blobStorage:
+  provider: s3  # or azure
+  bucket: langsmith-traces
+  region: us-west-2
+  # IAM role ARN (not access keys)
+  iamRoleArn: arn:aws:iam::<account>:role/langsmith-blob-storage
+```
+
+**Check environment variables:**
+```bash
+kubectl exec <api-pod> -n <namespace> -- env | grep -i blob
+kubectl exec <api-pod> -n <namespace> -- env | grep -i s3
+```
+
+**What to verify:**
+- Blob storage provider configured (not "local" or "filesystem")
+- Bucket/container name present
+- IAM role or managed identity configured (no access keys)
+- Blob storage health check passes (see ops sanity checks notebook)
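+
+The first three checks can be automated against parsed Helm values. The sketch below assumes the `blobStorage` layout from the example above; the credential key names it looks for are hypothetical, so adjust them to whatever your chart actually uses.
+
+```python
+def blob_storage_problems(values):
+    """Return a list of problems found in the blobStorage section of parsed Helm values."""
+    blob = values.get("blobStorage") or {}
+    problems = []
+    if blob.get("provider") in (None, "local", "filesystem"):
+        problems.append("provider is missing or not a cloud object store")
+    if not blob.get("bucket"):
+        problems.append("bucket/container name is missing")
+    # Hypothetical key names for static credentials; adjust to your chart.
+    if any(key in blob for key in ("accessKey", "accessKeyId", "secretAccessKey")):
+        problems.append("static access keys are configured")
+    if blob.get("provider") == "s3" and not blob.get("iamRoleArn"):
+        problems.append("no IAM role ARN configured")
+    return problems
+```
+
+An empty result only means the configuration looks plausible; the health check in the ops sanity checks notebook is still needed to confirm that the pods can reach the bucket.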
+
+---
+
+## Section 5: Autoscaling Strategy
+
+### HPA (Horizontal Pod Autoscaler) for API Servers
+
+**Use case:** Scale API server pods based on CPU/memory utilization.
+
+**Configuration:**
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: langsmith-api
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: langsmith-api
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+  - type: Resource
+    resource:
+      name: memory
+      target:
+        type: Utilization
+        averageUtilization: 80
+```
+
+**Baseline:**
+- **Min replicas:** 2 (for HA)
+- **Max replicas:** 10 (adjust based on cluster capacity)
+- **CPU target:** 70% average utilization
+- **Memory target:** 80% average utilization
+
+**Verification:**
+```bash
+kubectl get hpa -n <namespace>
+kubectl describe hpa langsmith-api -n <namespace>
+```
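+
+To reason about what the HPA will do at a given load, it helps to apply the replica formula from the Kubernetes documentation by hand: `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (10% by default) and clamped to the min/max replicas. The sketch below reproduces that formula for a single metric using the baseline values above; when several metrics are configured, the HPA uses the largest result.
+
+```python
+import math
+
+
+def hpa_desired_replicas(current_replicas, current_utilization, target_utilization,
+                         min_replicas=2, max_replicas=10, tolerance=0.1):
+    """Single-metric HPA replica calculation (stabilization windows are ignored)."""
+    ratio = current_utilization / target_utilization
+    if abs(ratio - 1.0) <= tolerance:
+        return current_replicas  # close enough to target: no scaling action
+    desired = math.ceil(current_replicas * ratio)
+    return max(min_replicas, min(max_replicas, desired))
+```
+
+For example, 4 pods averaging 90% CPU against the 70% target scale to 6 pods, while 4 pods at 72% stay at 4 because the ratio is inside the tolerance.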
+
+### KEDA for Bursty Worker Scaling
+
+**Why KEDA instead of HPA:**
+- Workers process jobs from Redis queues
+- Queue depth is a better scaling signal than CPU/memory
+- Bursty workloads need rapid scaling (seconds, not minutes)
+- KEDA supports Redis queue depth metrics
+
+**Configuration:**
+```yaml
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: langsmith-workers
+spec:
+  scaleTargetRef:
+    name: langsmith-worker
+  minReplicaCount: 1
+  maxReplicaCount: 20
+  triggers:
+  - type: redis
+    metadata:
+      address: <redis-host>:6379
+      listName: langsmith:jobs:traces
+      listLength: "10"  # Target average queue length per replica
+```
+
+**Baseline:**
+- **Min replicas:** 1
+- **Max replicas:** 20 (adjust based on workload)
+- **Queue depth threshold:** 10 jobs per replica (adjust based on processing time)
+- **Cooldown period:** 30 seconds (KEDA's `cooldownPeriod` defaults to 300 seconds; the example above does not set it)
+
+**Verification:**
+```bash
+kubectl get scaledobject -n <namespace>
+kubectl describe scaledobject langsmith-workers -n <namespace>
+```
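+
+Because KEDA exposes the Redis list length to the HPA as an average-value metric, the replica count follows from dividing the queue depth by `listLength`. The sketch below is a simplified model of that calculation using the baseline values above; it ignores the HPA tolerance, polling interval, and scale-down stabilization.
+
+```python
+import math
+
+
+def keda_desired_replicas(queue_length, list_length=10, min_replicas=1, max_replicas=20):
+    """Approximate worker count for a given Redis queue depth."""
+    # One replica per `list_length` queued jobs, rounded up.
+    desired = math.ceil(queue_length / list_length)
+    return max(min_replicas, min(max_replicas, desired))
+```
+
+With these settings a backlog of 45 jobs asks for 5 workers, and any backlog of 200 jobs or more pins the deployment at the 20-replica ceiling, which is a signal to revisit `maxReplicaCount` or downstream capacity.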
+
+### What Does NOT Autoscale
+
+**Manual scaling required:**
+- PostgreSQL instance size (vertical scaling only)
+- Redis cluster size (add nodes manually)
+- ClickHouse nodes (StatefulSet scaling requires data rebalancing)
+- Kubernetes nodes (cluster autoscaler helps, but has limits)
+
+**Key principle:** Monitor these services and scale proactively based on capacity planning, not reactively based on alerts.
+
+---
+
+## Section 6: Observability & Early Warning Signals
+
+### Three Layers of Observability
+
+**1. Kubernetes Layer:**
+- Pod status, restarts, resource usage
+- Node capacity and utilization
+- Events and warnings
+- **Tools:** `kubectl`, `kubectl top`, cluster monitoring
+
+**2. LangSmith Application Layer:**
+- Request rates, latencies, error rates
+- Trace ingestion rates
+- Worker queue depths
+- **Tools:** Application metrics, logs, dashboards
+
+**3. Data Store Layer:**
+- PostgreSQL connection counts, query performance
+- Redis memory usage, hit rates
+- ClickHouse query performance, table sizes
+- **Tools:** Cloud provider monitoring, database metrics
+
+### Early Warning Signals
+
+See `docs/shared/ops_signals_and_thresholds.md` for complete signal catalog.
+
+**Critical signals (red flags):**
+- Pod restart count > 5 in 1 hour
+- Pending pods > 0 for > 5 minutes
+- API server CPU > 80% for > 10 minutes
+- Worker queue depth > 100
+- PostgreSQL connections > 80% of max
+- Redis memory > 90%
+- ClickHouse query latency > 5 seconds (p95)
+
+**Warning signals (yellow flags):**
+- Pod restart count > 2 in 1 hour
+- API server CPU > 70% for > 10 minutes
+- Worker queue depth > 50
+- PostgreSQL connections > 60% of max
+- Redis memory > 75%
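+
+The two threshold lists above lend themselves to a simple lookup table. The following sketch classifies a single reading as `ok`, `warning`, or `critical`; the signal names are hypothetical, and the duration conditions (for example "for > 10 minutes") are deliberately left out, so sustained-load checks still need to be handled by the monitoring system.
+
+```python
+# signal name: (warning threshold, critical threshold), taken from the lists above
+THRESHOLDS = {
+    "pod_restarts_per_hour": (2, 5),
+    "api_cpu_percent": (70, 80),
+    "worker_queue_depth": (50, 100),
+    "postgres_connections_percent": (60, 80),
+    "redis_memory_percent": (75, 90),
+}
+
+
+def classify(signal, value):
+    """Classify one reading against the warning and critical thresholds."""
+    warning, critical = THRESHOLDS[signal]
+    if value > critical:
+        return "critical"
+    if value > warning:
+        return "warning"
+    return "ok"
+```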
+
+### Red Flag Thresholds
+
+**Immediate action required:**
+- Any pod in `CrashLoopBackOff` state
+- Any pod `Pending` for > 10 minutes
+- API server error rate > 5%
+- Worker queue depth > 200
+- PostgreSQL connection pool exhausted
+- Redis out of memory
+- ClickHouse queries timing out (> 10 seconds)
+
+**Escalation evidence:**
+- Pod logs (last 100 lines)
+- Recent events (`kubectl get events --sort-by=.lastTimestamp`)
+- Resource usage (`kubectl top pods`)
+- Application metrics snapshot
+- Database connection counts
+
+---
+
+## Section 7: Backups, DR, and Failure Domains
+
+### What Backups Cover
+
+**PostgreSQL backups (managed services):**
+- Automated daily backups (RDS/Azure Database)
+- Point-in-time recovery (PITR) for last 7 days
+- Cross-region backup replication (if configured)
+- **Covers:** Database schema, user data, workspace/project metadata
+
+**ClickHouse backups:**
+- Manual backups via `clickhouse-backup` tool
+- Cloud storage snapshots (if using managed ClickHouse)
+- **Covers:** Trace data, span data, evaluation results
+
+**Blob storage:**
+- Versioning enabled (S3/Azure Blob)
+- Lifecycle policies for cost optimization
+- Cross-region replication (if configured)
+- **Covers:** Large trace payloads, artifacts, files
+
+### What Backups Do NOT Cover
+
+**Not backed up automatically:**
+- Kubernetes secrets (stored in cluster, not in backups)
+- Helm values (stored in Git, not in backups)
+- In-cluster ClickHouse data (unless backup job configured)
+- Redis data (ephemeral cache, not backed up)
+- Application configuration (ConfigMaps, stored in cluster)
+
+**Manual backup required:**
+- Kubernetes secrets (export to encrypted storage)
+- Helm values files (store in Git)
+- In-cluster ClickHouse (configure backup job)
+- Application logs (export to log aggregation service)
+
+### Failure Domains
+
+**Availability Zone (AZ) failures:**
+- **Impact:** Pods in one AZ unavailable
+- **Mitigation:** Multi-AZ deployment (pods spread across AZs)
+- **Recovery:** Kubernetes reschedules pods to healthy AZs
+
+**Node failures:**
+- **Impact:** All pods on failed node unavailable
+- **Mitigation:** Multiple nodes, pod anti-affinity rules
+- **Recovery:** Kubernetes reschedules pods to healthy nodes
+
+**Database failures:**
+- **Impact:** Application cannot read/write data
+- **Mitigation:** Multi-AZ RDS, automated failover
+- **Recovery:** RDS promotes standby to primary (typically 1-2 minutes for Multi-AZ; budget extra time for application reconnects)
+
+**Region failures:**
+- **Impact:** Entire deployment unavailable
+- **Mitigation:** Multi-region deployment (advanced, out of scope)
+- **Recovery:** Manual failover to secondary region
+
+**Reality check:** Most failures are AZ or node-level. Region failures are rare but catastrophic. Plan accordingly.
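+
+To see whether pods are actually spread across zones, it helps to count pods per zone. The sketch below is illustrative only: `pods_per_zone` is a hypothetical helper, and it assumes nodes carry the standard `topology.kubernetes.io/zone` label.
+
+```bash
+# Hypothetical helper: counts pods per availability zone.
+# $1: file of "POD NODE" lines, $2: file of "NODE ZONE" lines.
+pods_per_zone() {
+  awk '
+    NR == FNR { zone[$1] = $2; next }          # first file read: node to zone
+    { z = ($2 in zone) ? zone[$2] : "unknown"  # second file read: pod to node
+      count[z]++ }
+    END { for (z in count) print z, count[z] }
+  ' "$2" "$1" | sort
+}
+
+# Usage against a live cluster (not run here):
+#   kubectl get nodes --no-headers \
+#     -o custom-columns='NODE:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' > nodes.txt
+#   kubectl get pods -n <namespace> --no-headers \
+#     -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' > pods.txt
+#   pods_per_zone pods.txt nodes.txt
+```
+
+A zone holding every replica of a service means an AZ failure takes that service down, regardless of how many nodes the cluster has.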
+
+---
+
+## Section 8: Production Readiness Checklist
+
+See `docs/shared/production_readiness_checklist.md` for the complete checklist.
+
+**Each checklist item maps to real incidents:**
+
+1. **Blob storage configured** → Keeps large trace payloads out of ClickHouse, preventing table bloat
+2. **PostgreSQL HA enabled** → Prevents database downtime
+3. **Redis cluster mode** → Prevents cache failures
+4. **ClickHouse replication** → Prevents data loss
+5. **HPA configured** → Prevents API server overload
+6. **KEDA configured** → Prevents worker queue saturation
+7. **Monitoring enabled** → Enables early detection
+8. **Backups configured** → Enables data recovery
+9. **Private networking** → Meets security requirements
+10. **Resource limits set** → Prevents resource exhaustion
+
+**Validation:**
+- Run `notebooks/module-3/01_ops_sanity_checks.ipynb` to validate each item
+- Review cloud provider console for managed service configuration
+- Check Helm values for application configuration
+- Verify monitoring dashboards show expected metrics
+
+---
+
+## Section 9: Sidecars & Service Mesh (Istio)
+
+### When Sidecars Are Needed
+
+**Use cases:**
+- **Egress control:** Restrict outbound traffic to approved destinations
+- **mTLS:** Encrypt traffic between services
+- **Policy enforcement:** Rate limiting, circuit breakers
+- **Observability:** Distributed tracing, metrics collection
+
+**When NOT needed:**
+- Simple deployments without egress requirements
+- Development environments
+- Proof-of-concept deployments
+
+### How to Enable Injection Safely
+
+**Namespace-level injection (recommended for LangSmith):**
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: langsmith
+  labels:
+    istio-injection: enabled
+    istio-discovery: enabled
+```
+
+**Per-workload annotation (for selective injection):**
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: langsmith-api
+spec:
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/inject: "true"
+```
+
+**Revision-based injection (for canary/blue-green):**
+```yaml
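+labels:
+  # Use istio.io/rev on its own: the istio-injection label takes precedence
+  # over istio.io/rev, so remove it when selecting a revision.
+  istio.io/rev: default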
+```
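+
+Injection happens when a pod is created, so pods that already exist must be restarted before they pick up a sidecar. The following sketch checks whether a pod was injected; `has_istio_sidecar` is a hypothetical helper, and it assumes the proxy runs as a regular container named `istio-proxy` (with Kubernetes native sidecars it may appear under `initContainers` instead).
+
+```bash
+# Hypothetical helper: succeeds if a space-separated list of container names
+# includes the Istio sidecar.
+has_istio_sidecar() {
+  case " $1 " in
+    *" istio-proxy "*) return 0 ;;
+    *) return 1 ;;
+  esac
+}
+
+# Usage against a live cluster (not run here):
+#   kubectl get namespace <namespace> --show-labels
+#   names=$(kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].name}')
+#   has_istio_sidecar "$names" && echo "sidecar injected" || echo "no sidecar"
+```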
+
+### Operational Implications
+
+**Logging and kubectl logs:**
+- Multi-container pods require container selection
+- **App logs:** `kubectl logs <pod> -c <container-name> -n <namespace>`
+- **Proxy logs:** `kubectl logs <pod> -c istio-proxy -n <namespace>`
+- **All logs:** `kubectl logs <pod> --all-containers=true -n <namespace>`
+
+**Common issue:** If logs appear missing after injection, you are probably looking at the wrong container.
+
+**Health probes and timeouts:**
+- Sidecar adds latency to health checks
+- Increase probe timeouts if sidecars are enabled
+- Verify readiness probes account for sidecar startup
+
+**Egress to external databases:**
+- Configure `ServiceEntry` for external PostgreSQL/Redis endpoints
+- Configure `DestinationRule` for traffic policies
+- Verify egress rules allow database connections
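+
+As an illustration of the `ServiceEntry` step, the sketch below renders a manifest for one external TCP endpoint. The helper name `render_service_entry`, the resource name, and the example host are all assumptions; adjust the API version to whatever your Istio release serves.
+
+```bash
+# Hypothetical helper: prints a ServiceEntry manifest for an external TCP endpoint.
+render_service_entry() {
+  local name="$1" host="$2" port="$3"
+  cat <<EOF
+apiVersion: networking.istio.io/v1beta1
+kind: ServiceEntry
+metadata:
+  name: ${name}
+spec:
+  hosts:
+    - ${host}
+  location: MESH_EXTERNAL
+  resolution: DNS
+  ports:
+    - number: ${port}
+      name: tcp-${name}
+      protocol: TCP
+EOF
+}
+
+# Usage (not run here):
+#   render_service_entry external-postgres mydb.example.com 5432 | kubectl apply -n <namespace> -f -
+```
+
+Rendering to stdout first makes it easy to review the manifest, or commit it to Git, before applying it.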
+
+See `docs/shared/sidecars_and_service_mesh.md` for detailed guidance.
+
+---
+
+## Section 10: Transition to Incident Response
+
+Module 3 establishes the baseline for production operations. The next step is **incident response**:
+
+**What you'll learn:**
+- How to diagnose common failure modes
+- How to gather evidence for support
+- How to implement runbooks
+- How to perform post-incident reviews
+
+**Prerequisites:**
+- Module 3 complete (production readiness validated)
+- Monitoring and alerting configured
+- On-call rotation established
+
+---
+
+## Artifacts Participants Leave With
+
+1. **Production readiness checklist** (completed)
+2. **Service sizing documentation** (baselines documented)
+3. **Autoscaling configuration** (HPA and KEDA configured)
+4. **Observability setup** (signals and thresholds documented)
+5. **Backup strategy** (backups configured and tested)
+6. **Ops sanity checks notebook** (validation results)
+
+---
+
+## Next Steps
+
+1. **Run the ops sanity checks notebook:**
+   - `notebooks/module-3/01_ops_sanity_checks.ipynb`
+
+2. **Review production readiness checklist:**
+   - `docs/shared/production_readiness_checklist.md`
+
+3. **Document your thresholds:**
+   - `docs/shared/ops_signals_and_thresholds.md`
+
+4. **Configure monitoring and alerting** for the signals and thresholds you documented
+
+5. **Proceed to incident response training** (Module 4)
+
+---
+
+## References
+
+- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
+- [KEDA Documentation](https://keda.sh/docs/)
+- [Istio Service Mesh](https://istio.io/latest/docs/)
+- LangSmith Helm Chart Documentation
+- Cloud Provider Documentation (AWS RDS, Azure Database, etc.)
+
@@ -0,0 +1,426 @@
+# Module 4: Troubleshooting & Incident Response
+
+**Goal:** Teach operators how to diagnose LangSmith self-hosted issues under pressure, collect the right evidence, and resolve incidents efficiently—either independently or with LangChain Support.
+
+**Duration:** ~3-4 hours (with optional full incident drill)  
+**Audience:** On-call engineers, platform owners, SREs, and anyone responsible for keeping LangSmith healthy  
+**Prerequisites:**
+- Module 1 complete: LangSmith deployed and reachable
+- Module 2 complete: Authentication configured
+- Module 3 complete: Production operations concepts understood
+- Participants own day-2 operations
+
+---
+
+## Overview
+
+Module 4 is hands-on: learners will introduce subtle but noticeable failures and debug them using standard tools and the canonical diagnostics bundle. This module builds the muscle memory needed for real incidents.
+
+**What you'll accomplish:**
+- Understand common failure modes and their symptoms
+- Master the "first 10 minutes" incident response checklist
+- Learn to collect canonical diagnostics bundles
+- Practice debugging with guided failure labs
+- Know when and how to escalate to Support
+
+**What this module avoids:**
+- Deep dives into specific monitoring tools (assumes basic kubectl/helm)
+- Performance optimization (covered in Module 3)
+- Infrastructure provisioning (covered in Module 1)
+- Authentication configuration (covered in Module 2)
+
+---
+
+## Section 1: Incident Reality Check
+
+### The Mindset
+
+**Incidents happen.** Even with perfect configuration, production systems fail. The difference between a 30-minute incident and a 4-hour outage is often preparation and process.
+
+**Key principles:**
+1. **Collect evidence first.** Don't redeploy, restart, or reconfigure until you understand what's wrong.
+2. **Time is evidence.** Every minute that passes without collecting diagnostics is lost information.
+3. **Symptoms are clues.** The same root cause can manifest differently depending on load, timing, and configuration.
+4. **Support needs context.** A good diagnostics bundle is worth more than a perfect description.
+
+### What Makes Incidents Hard
+
+**Pressure:**
+- Users are impacted
+- Management is asking for updates
+- You're on-call and tired
+- Multiple systems are involved
+
+**Complexity:**
+- Distributed systems have many moving parts
+- Failures cascade (one service fails, others follow)
+- Symptoms don't always point to root cause
+- Configuration drift accumulates over time
+
+**Tooling:**
+- Too many tools (which one shows the truth?)
+- Too few tools (missing critical information)
+- Tools that hide the problem (aggregation, sampling)
+
+**This module prepares you for all of these.**
+
+---
+
+## Section 2: Common Failure Modes
+
+### Ingestion & Tracing Failures
+
+**Symptoms:**
+- Traces appear delayed or missing
+- Worker pods show errors in logs
+- ClickHouse insert errors
+- Queue backlogs
+
+**Common causes:**
+- ClickHouse connectivity issues (network, credentials, resource limits)
+- Blob storage misconfiguration (large payloads fail)
+- Worker resource exhaustion (CPU/memory limits)
+- Redis connectivity (job queue backing up)
+
+**What to check first:**
+- Worker pod logs
+- ClickHouse pod status and logs
+- Redis connectivity and latency
+- Blob storage configuration
+
+### UI & API Failures
+
+**Symptoms:**
+- UI returns 5xx errors
+- API endpoints timeout
+- Login fails or redirects loop
+- Specific features don't work
+
+**Common causes:**
+- Database connectivity (PostgreSQL unreachable)
+- Authentication misconfiguration (OIDC/SAML)
+- Ingress/load balancer issues
+- API pod crashes or resource limits
+
+**What to check first:**
+- API pod logs
+- Database connectivity
+- Ingress status and configuration
+- Authentication configuration (Module 2 validation)
+
+### Authentication Failures
+
+**Symptoms:**
+- Users can't log in
+- Redirect loops
+- 403 errors after successful login
+- Session timeouts
+
+**Common causes:**
+- IdP connectivity issues
+- OIDC/SAML configuration drift
+- Secret rotation without updating LangSmith
+- Network policies blocking egress
+
+**What to check first:**
+- Auth pod logs
+- IdP connectivity (curl to issuer URL)
+- OIDC/SAML configuration (Module 2 validation)
+- Network policies
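+
+For the IdP connectivity check, the OIDC discovery document is a convenient target because it needs no credentials. A minimal sketch, assuming the issuer follows standard OIDC discovery; `oidc_discovery_url` is a hypothetical helper and `OIDC_ISSUER` is whatever issuer URL you configured.
+
+```bash
+# Hypothetical helper: builds the OIDC discovery URL from an issuer URL,
+# tolerating a trailing slash on the issuer.
+oidc_discovery_url() {
+  printf '%s/.well-known/openid-configuration\n' "${1%/}"
+}
+
+# Usage (not run here): -f fails on HTTP errors, --max-time bounds a hung connection.
+#   curl -fsS --max-time 10 "$(oidc_discovery_url "$OIDC_ISSUER")"
+```
+
+Running the same `curl` from inside a LangSmith pod (if the image includes `curl`) tests the path that matters, since NetworkPolicy and egress rules apply to the pod, not to your workstation.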
+
+---
+
+## Section 3: First 10 Minutes Checklist
+
+**The first 10 minutes of an incident are critical.** This is when you collect the most valuable evidence and make decisions that determine how long the incident lasts.
+
+### What NOT to Do
+
+**Resist the urge to:**
+- Run `helm upgrade` or `kubectl rollout restart`
+- Delete pods "to see if they come back"
+- Scale resources up/down
+- Change configuration
+
+**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
+
+### The Checklist
+
+See [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md) for the complete reference.
+
+**Quick summary:**
+1. **Minute 0-2:** Triage & scope (what's broken, who's impacted)
+2. **Minute 2-5:** Quick health check (pods, events, ingress)
+3. **Minute 5-8:** Collect diagnostics bundle (canonical script + snapshots)
+4. **Minute 8-10:** Identify likely root cause (symptoms → checks)
+
+**Key insight:** This checklist is not about fixing the issue—it's about collecting evidence and making informed decisions.
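+
+Because the timeline is part of the evidence, it helps to record it as you go. The following is a small sketch, not part of any official tooling: `incident_note` and the `INCIDENT_LOG` variable are hypothetical names.
+
+```bash
+# Hypothetical helper: appends a UTC-timestamped note to a timeline file.
+incident_note() {
+  local file="${INCIDENT_LOG:-incident-timeline.txt}"
+  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
+}
+
+# Usage:
+#   incident_note "Detected: API returning 5xx for all users"
+#   incident_note "Diagnostics bundle collected, no changes made yet"
+```
+
+UTC timestamps make it easier to line the notes up with pod logs and cloud provider events later.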
+
+---
+
+## Section 4: Standard Diagnostics Collection
+
+### The Canonical Script
+
+LangChain provides an official diagnostics script that captures everything Support needs:
+
+**Location:**
+```
+https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
+```
+
+**What it captures:**
+- Pod logs (all containers)
+- Events (sorted by timestamp)
+- Resource usage (CPU, memory)
+- Configuration (deployments, services, ingress)
+- Storage (PVCs, storage classes)
+- Network (services, endpoints)
+
+**How to use it:**
+```bash
+curl -fsSLO https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
+chmod +x get_k8s_debugging_info.sh
+./get_k8s_debugging_info.sh <namespace>
+```
+
+**Important:** Always run this script before making changes. The bundle it creates is your evidence.
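+
+Once the bundle exists, archiving it with a checksum keeps the evidence intact while you continue working. This sketch assumes you know which directory holds the script's output (check the script itself for the actual location); `archive_bundle` is a hypothetical helper.
+
+```bash
+# Hypothetical helper: archives a diagnostics directory and records its checksum.
+archive_bundle() {
+  local dir="${1%/}"
+  local archive="${dir}.tar.gz"
+  tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")"
+  # sha256sum is GNU coreutils; shasum ships with macOS
+  if command -v sha256sum >/dev/null 2>&1; then
+    sha256sum "$archive" > "${archive}.sha256"
+  else
+    shasum -a 256 "$archive" > "${archive}.sha256"
+  fi
+  echo "$archive"
+}
+
+# Usage:
+#   archive_bundle ./<bundle-directory>
+```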
+
+### What Good Debugging Looks Like
+
+**Good debugging:**
+- Starts with a baseline (what was working before)
+- Collects evidence systematically (checklist-driven)
+- Documents hypotheses and tests them
+- Preserves evidence (saves diagnostics bundles)
+- Escalates with context (diagnostics + timeline)
+
+**Bad debugging:**
+- Changes things without understanding
+- Doesn't collect evidence
+- Jumps to conclusions
+- Destroys evidence (redeploys, deletes)
+- Escalates without context ("it's broken, fix it")
+
+**The difference:** Good debugging produces a clear root cause and fix. Bad debugging produces more incidents.
+
+---
+
+## Section 5: Working with Support
+
+### What Speeds Up Support
+
+**Good escalation includes:**
+- Diagnostics bundle (canonical script output)
+- Timeline (when did it start, what changed)
+- Symptoms (what's broken, who's impacted)
+- What you've tried (investigation steps, results)
+- Environment details (versions, configuration)
+
+**Use the [Support Escalation Template](../shared/support_escalation_template.md).**
+
+### What Slows Down Support
+
+**Poor escalation includes:**
+- No diagnostics bundle ("just look at it")
+- Vague symptoms ("it's slow")
+- No timeline ("it broke")
+- No environment details ("it's on Kubernetes")
+- Secrets in logs (security risk)
+
+**Result:** Support has to ask for information you could have provided, delaying resolution.
+
+### Required Metadata
+
+**Support will always ask for:**
+1. Diagnostics bundle (canonical script)
+2. Helm chart version
+3. Image tags (if known)
+4. Recent changes (deployments, config, infrastructure)
+5. Cloud provider and region
+6. Kubernetes version
+7. What you've tried and results
+
+**Provide this upfront to speed resolution.**
+
+---
+
+## Section 6: Preventing Repeat Incidents
+
+### Post-Incident Review
+
+**After an incident is resolved:**
+1. **Document the root cause** (what actually broke)
+2. **Identify contributing factors** (what made it worse)
+3. **List what worked** (what helped you debug)
+4. **List what didn't work** (what slowed you down)
+5. **Create action items** (what to change to prevent recurrence)
+
+**Key questions:**
+- Could we have detected this earlier? (monitoring, alerts)
+- Could we have prevented this? (configuration, testing)
+- Could we have fixed it faster? (runbooks, tooling)
+- What did we learn? (new failure mode, new tool)
+
+### Common Patterns
+
+**Configuration drift:**
+- Secrets rotate, but LangSmith config isn't updated
+- Infrastructure changes, but Helm values aren't updated
+- IdP settings change, but OIDC/SAML config isn't updated
+
+**Prevention:** Automated validation (Module 2, Module 3 notebooks), configuration as code, regular audits.
+
+**Resource exhaustion:**
+- ClickHouse runs out of disk
+- PostgreSQL hits connection limits
+- Workers hit CPU/memory limits
+
+**Prevention:** Monitoring (Module 3), autoscaling (Module 3), capacity planning.
+
+**Network issues:**
+- Egress blocked by NetworkPolicy
+- Load balancer misconfiguration
+- DNS resolution failures
+
+**Prevention:** Network policy testing, ingress validation (Module 1), DNS checks.
+
+---
+
+## Section 7: Hands-on Failure Labs
+
+**This is where you practice.** Each lab follows the same pattern:
+
+1. **Baseline snapshot:** Capture what "good" looks like
+2. **Introduce failure:** Apply a subtle but noticeable fault
+3. **Observe symptoms:** See how the failure manifests
+4. **Collect diagnostics:** Run the canonical script and gather evidence
+5. **Hypothesize root cause:** Based on symptoms, identify likely cause
+6. **Verify with targeted checks:** Confirm your hypothesis
+7. **Remediate:** Revert the failure
+8. **Confirm recovery:** Verify everything is working again
+9. **Capture lessons learned:** Document what you discovered
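+
+Steps 1 and 8 of the pattern above both depend on having comparable snapshots. One way to sketch that, assuming plain `kubectl` output is enough for the comparison (`snapshot_state` is a hypothetical helper, not something the labs ship):
+
+```bash
+# Hypothetical helper: saves a point-in-time view of a namespace into a directory.
+snapshot_state() {
+  local ns="$1" dir="$2"
+  mkdir -p "$dir"
+  kubectl get pods -n "$ns" -o wide > "$dir/pods.txt"
+  kubectl get deployments -n "$ns" -o wide > "$dir/deployments.txt"
+  kubectl get services -n "$ns" > "$dir/services.txt"
+  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$dir/events.txt"
+}
+
+# Usage (not run here):
+#   snapshot_state <namespace> snapshots/baseline     # before injecting the failure
+#   snapshot_state <namespace> snapshots/after        # after remediation
+#   diff -ru snapshots/baseline snapshots/after
+```
+
+Expect some noise in the diff (pod ages, restart counts, new events); the useful signal is unexpected differences in status, readiness, or image.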
+
+### Lab Structure
+
+**Each failure lab includes:**
+- **What this service does for LangSmith:** Context on the service's role
+- **Expected symptoms when it fails:** What you'll see when it breaks
+- **Failure injection options:** Two levels (subtle vs. obvious)
+- **Do the drill:** Step-by-step debugging process
+- **What Support will ask for:** Service-specific evidence
+
+### Available Labs
+
+1. **PostgreSQL Failure Lab** (`10_failure_lab_postgres.ipynb`)
+   - Connection failures, wrong credentials, network isolation
+   - Symptoms: API 5xx, login failures, connection exhaustion
+
+2. **Redis Failure Lab** (`20_failure_lab_redis.ipynb`)
+   - Connectivity issues, wrong credentials
+   - Symptoms: Intermittent ingestion, latency spikes, worker backlog
+
+3. **ClickHouse Failure Lab** (`30_failure_lab_clickhouse.ipynb`)
+   - Endpoint misconfiguration, network isolation, resource limits
+   - Symptoms: Traces delayed/missing, insert errors, UI loads but traces don't appear
+
+4. **Blob Storage Failure Lab** (`40_failure_lab_blob_storage.ipynb`)
+   - Credential misconfiguration, bucket name errors
+   - Symptoms: Large payload traces degrade ClickHouse, warnings in logs
+
+5. **Full Incident Drill** (`90_full_incident_drill.ipynb`) (optional)
+   - Combined failure + timeline pressure
+   - Practice "first 10 minutes" checklist
+   - Produce incident summary using escalation template
+
+---
+
+## Section 8: Workshop Wrap-up
+
+### What You've Learned
+
+- How to respond to incidents systematically
+- How to collect canonical diagnostics bundles
+- How to debug common failure modes
+- How to escalate effectively to Support
+- How to prevent repeat incidents
+
+### Next Steps
+
+**Immediate:**
+- Run through failure labs to build muscle memory
+- Customize the "first 10 minutes" checklist for your environment
+- Set up monitoring and alerts (Module 3)
+
+**Ongoing:**
+- Practice incident response regularly (drills)
+- Keep diagnostics script updated
+- Document your own failure modes and fixes
+- Share learnings with your team
+
+### Resources
+
+- [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)
+- [Support Escalation Template](../shared/support_escalation_template.md)
+- [Canonical Diagnostics Script](https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh)
+- Module 1: Deployment & Baseline Validation
+- Module 2: Identity & Authentication
+- Module 3: Production Operations & Scaling
+
+---
+
+## Artifacts
+
+**Participants leave with:**
+- A working incident response process
+- Experience debugging real failure modes
+- A diagnostics bundle collection workflow
+- An escalation template customized for their environment
+- Confidence to handle incidents independently
+
+---
+
+## Common Pitfalls
+
+**Don't:**
+- Skip the baseline snapshot (you need "before" to compare to "after")
+- Redeploy before collecting evidence (destroys diagnostics)
+- Ignore error messages (they're clues)
+- Escalate without diagnostics bundle (slows Support)
+- Delete evidence (you'll need it for post-incident review)
+
+**Do:**
+- Follow the checklist (it's battle-tested)
+- Collect diagnostics early (time is evidence)
+- Document your investigation (helps you and Support)
+- Test your process (run drills)
+- Learn from each incident (prevent repeats)
+
+---
+
+## Troubleshooting
+
+**"The diagnostics script fails":**
+- Check kubectl access and namespace
+- Verify script is up-to-date (check GitHub)
+- Run with `bash -x` to trace which command is failing
+
+**"I can't reproduce the failure":**
+- Check that failure injection was applied correctly
+- Verify symptoms match expected behavior
+- Try a different failure injection method (Level 2 if Level 1 didn't work)
+
+**"The remediation doesn't work":**
+- Verify you reverted the exact change you made
+- Check for cascading failures (one failure caused another)
+- Collect post-remediation diagnostics to compare
+
+**"I don't understand the symptoms":**
+- Review the service's role in LangSmith (lab introduction)
+- Check logs for error patterns
+- Compare to baseline snapshot (what changed?)
+
+---
+
+**Remember:** Incident response is a skill. Practice makes perfect. The more you drill, the better you'll be when real incidents happen.
+
@@ -0,0 +1,366 @@
+# Authentication Troubleshooting Playbook
+
+**Purpose:** Triage tree for common authentication failures  
+**Audience:** Operators troubleshooting SSO issues
+
+---
+
+## Triage Tree
+
+### 1. Login Loop
+
+**Symptoms:**
+- User redirected to IdP
+- User authenticates successfully
+- Redirected back to LangSmith
+- Immediately redirected to IdP again (infinite loop)
+
+**Likely Causes:**
+1. Redirect URI mismatch (most common)
+2. Session cookie not being set (TLS/cookie issues)
+3. Token validation failure
+4. State parameter mismatch
+
+**Evidence Gathering:**
+```bash
+# Check pod logs for redirect errors
+kubectl logs <api-pod> -n <namespace> --tail=100 | grep -iE "redirect|callback|auth"
+
+# Check ingress configuration
+kubectl get ingress -n <namespace> -o yaml
+
+# Confirm the callback endpoint is reachable (adjust the path to your configured redirect URI)
+curl -I https://<domain>/auth/callback
+
+# Check browser console for cookie errors
+# (Manual check in browser developer tools)
+```
+
+**Commands:**
+```bash
+# Verify redirect URI in Helm values
+helm get values <release> -n <namespace> | grep -i redirect
+
+# Check environment variables
+kubectl exec <pod> -n <namespace> -- env | grep -i redirect
+
+# Verify IdP whitelist (manual check in IdP admin console)
+```
+
+**Fix:**
+1. Verify redirect URI matches **exactly** (case, trailing slashes, protocol)
+2. Check IdP whitelist includes exact redirect URI
+3. Verify TLS certificate is valid (browser must accept cookies)
+4. Check session cookie settings (SameSite, Secure flags)
+
+---
+
+### 2. 403/Unauthorized After Login
+
+**Symptoms:**
+- User authenticates successfully at IdP
+- Redirected back to LangSmith
+- Receives 403 Forbidden or "Unauthorized" error
+- Cannot access any resources
+
+**Likely Causes:**
+1. Role mapping not configured
+2. User not in any mapped groups
+3. Workspace not assigned to user's organization
+4. Claims/attributes not being sent by IdP
+
+**Evidence Gathering:**
+```bash
+# Check pod logs for authorization errors
+kubectl logs <api-pod> -n <namespace> --tail=100 | grep -iE "403|unauthorized|forbidden|role"
+
+# Check role mapping configuration
+helm get values <release> -n <namespace> | grep -iE "role|mapping|group"
+
+# Check user's group membership (from IdP)
+# (Manual check - verify user is in expected groups)
+```
+
+**Commands:**
+```bash
+# Verify role mapping in Helm values
+helm get values <release> -n <namespace> | grep -A 10 "roleMapping"
+
+# Check environment variables for claim mappings
+kubectl exec <pod> -n <namespace> -- env | grep -iE "claim|attribute|group"
+
+# Test with different user (in mapped group)
+```
+
+**Fix:**
+1. Verify user is in a group that's mapped to a role
+2. Check role mapping configuration in Helm values
+3. Verify IdP is sending group claims/attributes
+4. Assign user to appropriate group in IdP
+5. Verify workspace assignment in LangSmith
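+
+For OIDC, one way to confirm that the IdP is sending group claims is to decode the payload of an ID token locally. The sketch below is illustrative: `jwt_payload` is a hypothetical helper, it does not verify the signature, and it assumes a `base64` that accepts `-d`. Tokens are credentials, so decode them on your own machine and never paste them into a ticket.
+
+```bash
+# Hypothetical helper: prints the decoded (unverified) payload of a JWT.
+jwt_payload() {
+  local payload
+  # Take the middle segment and map base64url characters back to base64
+  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
+  # Restore the padding that JWTs strip
+  case $(( ${#payload} % 4 )) in
+    2) payload="${payload}==" ;;
+    3) payload="${payload}=" ;;
+  esac
+  printf '%s' "$payload" | base64 -d
+}
+
+# Usage:
+#   jwt_payload "<id-token>"
+```
+
+If the decoded payload has no groups claim, the fix belongs in the IdP or in the requested scopes, not in the LangSmith role mapping.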
+
+---
+
+### 3. SAML Assertion Missing Attributes
+
+**Symptoms:**
+- User authenticates successfully
+- Login completes but user has no email/name
+- Role mapping doesn't work
+- User cannot access resources
+
+**Likely Causes:**
+1. IdP not configured to send required attributes
+2. Attribute names don't match configuration
+3. Attribute mapping incorrect in LangSmith
+
+**Evidence Gathering:**
+```bash
+# Check logs for missing attribute errors
+kubectl logs <api-pod> -n <namespace> --tail=100 | grep -iE "attribute|missing|email|name"
+
+# Check SAML attribute mapping
+helm get values <release> -n <namespace> | grep -i "saml.*attribute"
+
+# Verify SAML metadata includes attribute definitions
+# (Check IdP metadata XML)
+```
+
+**Commands:**
+```bash
+# Verify attribute mapping configuration
+kubectl exec <pod> -n <namespace> -- env | grep -i "SAML.*ATTRIBUTE"
+
+# Check SAML metadata for attribute definitions
+curl -s "<SAML_METADATA_URL>" | grep -i "Attribute"
+
+# Test with SAML tracer (browser extension) to see actual assertion
+```
+
+**Fix:**
+1. Configure IdP to send required attributes (email, name, groups)
+2. Verify attribute names match LangSmith configuration exactly
+3. Update attribute mapping in Helm values if names differ
+4. Test with SAML tracer to verify attributes in assertion
+
+---
+
+### 4. Redirect Mismatch
+
+**Symptoms:**
+- Login attempt fails immediately
+- Error: "redirect_uri_mismatch" or similar
+- User never reaches IdP login page
+
+**Likely Causes:**
+1. Redirect URI in LangSmith doesn't match IdP whitelist
+2. Trailing slash mismatch
+3. Protocol mismatch (http vs https)
+4. Domain mismatch
+
+**Evidence Gathering:**
+```bash
+# Check configured redirect URI
+helm get values <release> -n <namespace> | grep -i redirect
+
+# Verify exact redirect URI format
+kubectl exec <pod> -n <namespace> -- env | grep -i REDIRECT
+
+# Test redirect URI endpoint
+curl -I https://<domain>/auth/callback
+```
+
+**Commands:**
+```bash
+# Compare redirect URIs
+echo "LangSmith config:"
+kubectl exec <pod> -n <namespace> -- env | grep OIDC_REDIRECT_URI
+
+echo "IdP whitelist:"
+# (Manual check in IdP admin console)
+
+# Verify exact match (including trailing slashes, case)
+```
+
+**Fix:**
+1. Get exact redirect URI from LangSmith configuration
+2. Verify it matches IdP whitelist **exactly** (character-by-character)
+3. Update IdP whitelist if needed
+4. Restart LangSmith pods if you changed the LangSmith configuration (an IdP-only change needs no restart)
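+
+Comparing two redirect URIs by eye is error-prone, so a small script can do the character-by-character comparison and name the most common kinds of difference. This is a sketch; `compare_redirect_uris` is a hypothetical helper, and the second argument is the value copied by hand from the IdP admin console.
+
+```bash
+# Hypothetical helper: compares the configured redirect URI with the one
+# registered at the IdP and reports the kind of mismatch.
+compare_redirect_uris() {
+  local configured="$1" registered="$2"
+  local lc_configured lc_registered
+  if [ "$configured" = "$registered" ]; then
+    echo "match"; return 0
+  fi
+  if [ "${configured%/}" = "${registered%/}" ]; then
+    echo "mismatch: trailing slash"; return 1
+  fi
+  if [ "${configured#*://}" = "${registered#*://}" ]; then
+    echo "mismatch: scheme"; return 1
+  fi
+  lc_configured=$(printf '%s' "$configured" | tr '[:upper:]' '[:lower:]')
+  lc_registered=$(printf '%s' "$registered" | tr '[:upper:]' '[:lower:]')
+  if [ "$lc_configured" = "$lc_registered" ]; then
+    echo "mismatch: letter case"; return 1
+  fi
+  echo "mismatch: other"; return 1
+}
+
+# Usage:
+#   compare_redirect_uris "<uri-from-langsmith-config>" "<uri-from-idp-console>"
+```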
+
+---
+
+### 5. TLS/Callback Issues
+
+**Symptoms:**
+- IdP callback fails with TLS errors
+- Browser shows "Not Secure" warning
+- Certificate errors in browser console
+- Callback never completes
+
+**Likely Causes:**
+1. Self-signed certificate (browser rejects)
+2. Certificate chain incomplete
+3. Certificate expired
+4. Certificate doesn't match domain
+
+**Evidence Gathering:**
+```bash
+# Check TLS certificate
+openssl s_client -connect <domain>:443 -servername <domain> < /dev/null
+
+# Check certificate expiration
+echo | openssl s_client -connect <domain>:443 -servername <domain> 2>/dev/null | \
+  openssl x509 -noout -dates
+
+# Check ingress TLS configuration
+kubectl get ingress -n <namespace> -o yaml | grep -A 5 tls
+```
+
+**Commands:**
+```bash
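+# Find the TLS secret referenced by the ingress
+kubectl get ingress -n <namespace> -o jsonpath='{.items[0].spec.tls[0].secretName}'
+# Inspect only the public certificate; dumping the whole secret with -o yaml exposes the private key
+kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | \
+  base64 -d | openssl x509 -noout -subject -issuer -dates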
+
+# Test certificate from pod
+kubectl exec <pod> -n <namespace> -- openssl s_client -connect <domain>:443 -servername <domain>
+```
+
+**Fix:**
+1. Use valid TLS certificate from trusted CA (not self-signed)
+2. Ensure full certificate chain is present
+3. Renew certificate if expired
+4. Verify certificate matches domain exactly
+5. Update ingress TLS secret if needed
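+
+To catch an expiring certificate before it causes callback failures, `openssl x509 -checkend` can be wrapped in a small check. A minimal sketch, assuming the certificate has already been saved as a PEM file; `pem_expires_within` is a hypothetical helper.
+
+```bash
+# Hypothetical helper: succeeds if the PEM certificate in $1 expires within $2 days.
+pem_expires_within() {
+  local pem="$1" days="$2"
+  [ -r "$pem" ] || return 2
+  # -checkend exits 0 when the certificate is still valid N seconds from now,
+  # so the result is negated to mean "expiring"
+  ! openssl x509 -in "$pem" -noout -checkend "$(( days * 86400 ))" >/dev/null
+}
+
+# Usage against a live endpoint (not run here):
+#   echo | openssl s_client -connect <domain>:443 -servername <domain> 2>/dev/null \
+#     | openssl x509 > served-cert.pem
+#   pem_expires_within served-cert.pem 30 && echo "renew soon"
+```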
+
+---
+
+## What Support Will Ask For
+
+When contacting LangSmith support for authentication issues, provide:
+
+### Minimal Evidence Bundle
+
+1. **Configuration Summary (redacted)**
+   - Auth provider type (OIDC/SAML)
+   - Issuer/metadata URL (no secrets)
+   - Domain
+   - Claim/attribute mappings
+   - Role mapping configuration
+
+2. **Pod Logs**
+   - Last 200 lines from API/server pods
+   - Filtered for auth-related errors
+   - Timestamp of failure
+
+3. **Recent Events**
+   ```bash
+   kubectl get events -n <namespace> --sort-by=.lastTimestamp > events.txt
+   ```
+
+4. **Ingress Configuration**
+   ```bash
+   kubectl get ingress -n <namespace> -o yaml > ingress.yaml
+   ```
+
+5. **Helm Values (redacted)**
+   ```bash
+   helm get values <release> -n <namespace> > helm-values.txt
+   # Manually redact secrets before sending
+   ```
+
+### Do NOT Include
+
+- Client secrets
+- Tokens
+- Passwords
+- Private keys
+- TLS secret manifests or other key material (public certificates are OK)
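+
+A first-pass filter can mask the most obvious secrets in a values file before the manual review. The sketch below is an illustration, not a guarantee: `redact_values` is a hypothetical helper that only handles single-line `key: value` pairs whose key name looks sensitive, so credentials embedded in URLs or multi-line values still need a manual check.
+
+```bash
+# Hypothetical helper: masks values of keys whose names look sensitive.
+# Reads YAML-like text on stdin and writes the filtered text to stdout.
+redact_values() {
+  awk '
+    {
+      idx = index($0, ":")
+      if (idx > 0) {
+        key = tolower(substr($0, 1, idx - 1))
+        val = substr($0, idx + 1)
+        # Only mask lines that carry a value on the same line
+        if (key ~ /(secret|password|passwd|token|apikey|api_key|privatekey|private_key)/ && val ~ /[^ \t]/) {
+          print substr($0, 1, idx) " <REDACTED>"
+          next
+        }
+      }
+      print
+    }'
+}
+
+# Usage (not run here):
+#   helm get values <release> -n <namespace> | redact_values > helm-values.redacted.txt
+```
+
+The filter errs on the side of over-redaction: a key such as `existingSecretName` is masked too, which is usually acceptable in a support bundle.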
+
+### Support Bundle Script
+
+```bash
+#!/bin/bash
+# Collect minimal auth troubleshooting bundle
+
+NAMESPACE="${NAMESPACE:-langsmith}"
+RELEASE="${HELM_RELEASE:-langsmith}"
+OUTPUT_DIR="auth-support-$(date +%Y%m%d-%H%M%S)"
+
+mkdir -p "$OUTPUT_DIR"
+
+# Pod logs (last 200 lines, auth-related)
+kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' | \
+  tr ' ' '\n' | grep -E "(api|server|backend)" | head -3 | while read -r pod; do
+    kubectl logs "$pod" -n "$NAMESPACE" --all-containers=true --tail=200 | \
+      grep -i -E "(auth|oidc|saml|sso|login|redirect)" > "$OUTPUT_DIR/${pod}-auth-logs.txt" || true
+  done
+
+# Events
+kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp > "$OUTPUT_DIR/events.txt"
+
+# Ingress
+kubectl get ingress -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/ingress.yaml"
+
+# Helm values (redact secrets manually)
+helm get values "$RELEASE" -n "$NAMESPACE" > "$OUTPUT_DIR/helm-values.txt"
+echo "⚠️  REDACT SECRETS FROM helm-values.txt BEFORE SENDING"
+
+# Configuration summary
+cat > "$OUTPUT_DIR/config-summary.txt" <<EOF
+Auth Configuration Summary
+Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)
+
+Namespace: $NAMESPACE
+Release: $RELEASE
+Domain: ${LANGSMITH_DOMAIN:-N/A}
+Provider: ${AUTH_PROVIDER:-N/A}
+
+Note: Secrets not included for security.
+EOF
+
+echo "Support bundle saved to: $OUTPUT_DIR"
+echo "⚠️  Review and redact secrets before sending to support"
+```
+
+---
+
+## Quick Reference
+
+### OIDC Issues
+- **Redirect mismatch:** Check exact URI match
+- **Token validation:** Check issuer URL, clock skew
+- **Missing claims:** Verify scopes and IdP configuration
+
+### SAML Issues
+- **Missing attributes:** Check IdP attribute configuration
+- **Signature failure:** Verify certificate in metadata
+- **Entity ID mismatch:** Check entity ID configuration
+
+### Common Commands
+```bash
+# Check auth configuration
+kubectl exec <pod> -n <namespace> -- env | grep -i -E "(auth|oidc|saml)"
+
+# Check logs
+kubectl logs <pod> -n <namespace> --tail=100 | grep -i auth
+
+# Check Helm values
+helm get values <release> -n <namespace>
+
+# Restart all deployments in the namespace (only after a config change)
+kubectl rollout restart deployment -n <namespace>
+```
+
+---
+
+## Escalation
+
+If issues persist after following this playbook:
+
+1. Collect minimal evidence bundle (see above)
+2. Document exact steps to reproduce
+3. Note any recent configuration changes
+4. Contact LangSmith support with evidence bundle
+
@@ -0,0 +1,110 @@
+# Authentication Validation Checklist
+
+**Purpose:** Operator-friendly checklist for validating SSO configuration  
+**Use:** Complete this checklist after running the validation notebook(s)
+
+---
+
+## Preconditions
+
+- [ ] DNS configured and resolving correctly
+- [ ] TLS certificate valid and trusted (not self-signed in production)
+- [ ] Ingress configured and accessible
+- [ ] LangSmith deployment healthy (all pods running, PVCs bound)
+
+---
+
+## Configuration Inputs
+
+### OIDC Configuration
+- [ ] `OIDC_ISSUER` set and accessible
+- [ ] `OIDC_CLIENT_ID` set
+- [ ] `OIDC_CLIENT_SECRET` set (stored in Kubernetes secret)
+- [ ] `OIDC_REDIRECT_URI` matches exactly between LangSmith and IdP
+- [ ] `OIDC_SCOPES` includes `openid` and `email`
+- [ ] `OIDC_SCOPES` includes `groups` (if using group-based role mapping)
+
+### SAML Configuration
+- [ ] `SAML_METADATA_URL` accessible OR `SAML_METADATA_FILE` exists
+- [ ] SAML metadata XML is valid
+- [ ] Entity ID matches between LangSmith and IdP
+- [ ] Signing certificate present in metadata
+- [ ] SSO endpoints found in metadata
+
+### Common to Both
+- [ ] `LANGSMITH_DOMAIN` matches actual domain
+- [ ] Claim/attribute mappings configured
+- [ ] Role mapping configured (groups or users)
+
+---
+
+## Role Mapping
+
+- [ ] Group-to-role mapping configured (preferred)
+- [ ] Admin groups identified and mapped
+- [ ] Member groups identified and mapped
+- [ ] Viewer groups identified and mapped (if applicable)
+- [ ] Minimal admin principle followed (1-2 admins to start)
+
+---
+
+## Login Validation
+
+### Admin User
+- [ ] Admin user can log in via SSO
+- [ ] Admin user sees correct role (admin)
+- [ ] Admin user can access organization settings
+- [ ] Admin user can manage workspaces
+- [ ] Admin user can manage users (if applicable)
+
+### Standard User
+- [ ] Standard user can log in via SSO
+- [ ] Standard user sees correct role (member/viewer)
+- [ ] Standard user can access assigned workspaces
+- [ ] Standard user cannot access organization settings
+- [ ] Standard user cannot manage users
+
+---
+
+## Session Management
+
+- [ ] Logout works correctly
+- [ ] Session invalidation works (logout from IdP invalidates LangSmith session)
+- [ ] Session timeout configured appropriately
+- [ ] Multiple browser sessions work independently
+
+---
+
+## Audit Evidence
+
+- [ ] Authentication events logged
+- [ ] Role assignments logged
+- [ ] Session creation/destruction logged
+- [ ] Failed authentication attempts logged
+- [ ] Logs exportable to SIEM (if required)
+
+---
+
+## Documentation
+
+- [ ] Helm values file saved (with secrets redacted)
+- [ ] IdP settings documented
+- [ ] Group-to-role mapping table created
+- [ ] Admin assignments documented
+- [ ] Troubleshooting playbook bookmarked
+
+---
+
+## Sign-Off
+
+**Validated by:** _________________  
+**Date:** _________________  
+**Notes:** _________________
+
+---
+
+**Next Steps:**
+- Proceed to Module 3 (if applicable)
+- Schedule regular access reviews
+- Document any deviations from standard configuration
+
@@ -0,0 +1,163 @@
+# First 10 Minutes: Incident Response Checklist
+
+**When:** You detect or are alerted to a LangSmith self-hosted issue.
+
+**Goal:** Collect evidence, stabilize if possible, and prepare for escalation—without making things worse.
+
+---
+
+## ⚠️ Critical: Do NOT Redeploy
+
+**Resist the urge to:**
+- Run `helm upgrade` or `kubectl rollout restart`
+- Delete pods "to see if they come back"
+- Scale resources up/down
+- Change configuration
+
+**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
+
+---
+
+## Minute 0-2: Triage & Scope
+
+- [ ] **Confirm the issue:** What's broken? (UI down, API 5xx, traces missing, auth failing)
+- [ ] **Check who's impacted:** All users, specific endpoints, specific features?
+- [ ] **Note the time:** Record detection time and any recent changes (deployments, config changes, infrastructure changes)
+- [ ] **Check basic connectivity:**
+  ```bash
+  kubectl cluster-info
+  kubectl get nodes
+  kubectl get pods -n <namespace>
+  ```
+
+---
+
+## Minute 2-5: Quick Health Check
+
+- [ ] **Pod status:**
+  ```bash
+  kubectl get pods -n <namespace> -o wide
+  ```
+  Look for: CrashLoopBackOff, Pending, Error states
+
+- [ ] **Recent events:**
+  ```bash
+  kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
+  ```
+  Look for: Failed scheduling, image pull errors, resource limits
+
+- [ ] **Ingress/Load Balancer:**
+  ```bash
+  kubectl get ingress -n <namespace>
+  ```
+  Check if endpoint is reachable (curl or browser)
+
+- [ ] **Key deployments:**
+  ```bash
+  kubectl get deployments -n <namespace>
+  kubectl describe deployment <deployment-name> -n <namespace>
+  ```
+
+---
+
+## Minute 5-8: Collect Diagnostics Bundle
+
+- [ ] **Run canonical diagnostics script:**
+  ```bash
+  # Download and run the official script
+  curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
+  chmod +x get_k8s_debugging_info.sh
+  ./get_k8s_debugging_info.sh <namespace>
+  ```
+  This captures: pod logs, events, resource usage, configuration
+
+- [ ] **Save timestamped snapshot:**
+  ```bash
+  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
+  mkdir -p artifacts/incident-$TIMESTAMP
+  
+  kubectl get all -n <namespace> -o yaml > artifacts/incident-$TIMESTAMP/all-resources.yaml
+  kubectl get events -n <namespace> --sort-by='.lastTimestamp' > artifacts/incident-$TIMESTAMP/events.txt
+  ```
+
+- [ ] **Check logs for obvious errors:**
+  ```bash
+  # Check LangSmith API logs (confirm the label with `kubectl get pods -n <namespace> --show-labels`)
+  kubectl logs -n <namespace> -l app=langsmith-api --tail=100
+  
+  # Check worker logs
+  kubectl logs -n <namespace> -l app=langsmith-worker --tail=100
+  ```
+  Look for: connection errors, timeouts, authentication failures, resource exhaustion
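+
+The timestamped snapshot can also be scripted from a notebook. A minimal sketch, assuming `kubectl` is on the PATH and already points at the affected cluster; `snapshot_plan` and `run_snapshot` are hypothetical helper names, not part of the LangSmith tooling, and the commands are the same read-only ones shown above:
+
+```python
+from datetime import datetime
+from pathlib import Path
+import subprocess
+
+
+def snapshot_plan(namespace, when=None):
+    """Return (output_path, kubectl_argv) pairs for a read-only snapshot."""
+    stamp = (when or datetime.now()).strftime("%Y%m%d-%H%M%S")
+    out_dir = Path("artifacts") / f"incident-{stamp}"
+    return [
+        (out_dir / "all-resources.yaml",
+         ["kubectl", "get", "all", "-n", namespace, "-o", "yaml"]),
+        (out_dir / "events.txt",
+         ["kubectl", "get", "events", "-n", namespace,
+          "--sort-by=.lastTimestamp"]),
+    ]
+
+
+def run_snapshot(namespace):
+    """Run each command and save its output; nothing here mutates the cluster."""
+    for path, argv in snapshot_plan(namespace):
+        path.parent.mkdir(parents=True, exist_ok=True)
+        result = subprocess.run(argv, capture_output=True, text=True)
+        # Keep stderr as well: a failing command is itself evidence.
+        path.write_text(result.stdout + result.stderr)
+```
+
+Separating the plan from its execution makes it easy to review exactly which commands will run before touching a degraded cluster.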
+
+---
+
+## Minute 8-10: Identify Likely Root Cause
+
+Based on symptoms, check the most likely culprits:
+
+### If UI/API is down:
+- [ ] Check ingress/load balancer status (via cloud helper or kubectl)
+- [ ] Check API pod logs for startup errors
+- [ ] Verify external services (PostgreSQL, Redis) are reachable
+
+### If traces are missing/delayed:
+- [ ] Check ClickHouse connectivity and logs
+- [ ] Check worker pod logs for insert errors
+- [ ] Verify blob storage configuration (if large payloads)
+
+### If authentication fails:
+- [ ] Check OIDC/SAML configuration (Module 2 validation)
+- [ ] Check IdP connectivity
+- [ ] Review auth-related pod logs
+
+### If ingestion is slow:
+- [ ] Check Redis connectivity and latency
+- [ ] Check worker pod resource usage
+- [ ] Look for queue backlogs
+
+---
+
+## After 10 Minutes: Decision Point
+
+**If you've identified and can safely fix the issue:**
+- Document what you changed
+- Verify recovery
+- Collect post-recovery diagnostics
+
+**If you need help:**
+- Use the [Support Escalation Template](../shared/support_escalation_template.md)
+- Include the diagnostics bundle
+- Note what you've tried and the results
+
+**If the issue is critical and escalating:**
+- Continue collecting evidence every 5-10 minutes
+- Document timeline of symptoms
+- Prepare escalation with all evidence
+
+---
+
+## What NOT to Do
+
+- ❌ Don't delete namespaces or persistent volumes
+- ❌ Don't change database passwords or connection strings
+- ❌ Don't scale resources without understanding the bottleneck
+- ❌ Don't ignore error messages—they're evidence
+- ❌ Don't skip the diagnostics bundle—Support will ask for it
+
+---
+
+## Quick Reference: Common Failure Patterns
+
+| Symptom | Likely Cause | First Check |
+|---------|--------------|-------------|
+| All pods CrashLoopBackOff | Config error, missing secret | `kubectl describe pod` |
+| API 5xx errors | Database/Redis connection | Pod logs, service endpoints |
+| Traces not appearing | ClickHouse connectivity | ClickHouse pod logs |
+| Slow ingestion | Redis latency, worker backlog | Worker logs, Redis metrics |
+| Auth redirect loop | OIDC/SAML misconfiguration | Auth pod logs, IdP connectivity |
+
+---
+
+**Remember:** The goal is evidence collection and safe triage, not immediate resolution. A good diagnostics bundle is worth more than a hasty fix.
+
@@ -0,0 +1,312 @@
+# Operations Signals and Thresholds
+
+**Purpose:** Define early warning signals and red flag thresholds for LangSmith operations  
+**Use:** Configure monitoring and alerting based on these thresholds  
+**Frequency:** Review quarterly and adjust based on historical data
+
+---
+
+## Signal Categories
+
+### Critical Signals (Red Flags - Immediate Action)
+
+**Pod Health:**
+- Pod in `CrashLoopBackOff` state → **IMMEDIATE**
+- Pod `Pending` for > 10 minutes → **IMMEDIATE**
+- Pod restart count > 5 in 1 hour → **IMMEDIATE**
+- Pod `ImagePullBackOff` → **IMMEDIATE**
+
+**Resource Saturation:**
+- Node CPU > 90% for > 5 minutes → **IMMEDIATE**
+- Node memory > 95% for > 5 minutes → **IMMEDIATE**
+- Pod CPU > 90% for > 10 minutes → **IMMEDIATE**
+- Pod memory > 95% for > 10 minutes → **IMMEDIATE**
+
+**Application Health:**
+- API server error rate > 5% → **IMMEDIATE**
+- API server latency p95 > 5 seconds → **IMMEDIATE**
+- Worker queue depth > 200 → **IMMEDIATE**
+- Worker processing rate < 10 jobs/minute → **IMMEDIATE**
+
+**Data Store Health:**
+- PostgreSQL connection pool exhausted → **IMMEDIATE**
+- PostgreSQL query timeout > 10 seconds → **IMMEDIATE**
+- Redis out of memory → **IMMEDIATE**
+- Redis connection refused → **IMMEDIATE**
+- ClickHouse query timeout > 10 seconds → **IMMEDIATE**
+- ClickHouse table size > 1 TB (single table) → **IMMEDIATE**
+
+### Warning Signals (Yellow Flags - Monitor Closely)
+
+**Pod Health:**
+- Pod restart count > 2 in 1 hour → **WARNING**
+- Pod `Pending` for > 5 minutes → **WARNING**
+- Pod CPU > 70% for > 10 minutes → **WARNING**
+- Pod memory > 80% for > 10 minutes → **WARNING**
+
+**Application Health:**
+- API server error rate > 1% → **WARNING**
+- API server latency p95 > 2 seconds → **WARNING**
+- Worker queue depth > 50 → **WARNING**
+- Worker processing rate < 50 jobs/minute → **WARNING**
+
+**Data Store Health:**
+- PostgreSQL connections > 80% of max → **WARNING**
+- PostgreSQL query latency p95 > 2 seconds → **WARNING**
+- Redis memory > 90% → **WARNING**
+- Redis hit rate < 80% → **WARNING**
+- ClickHouse query latency p95 > 3 seconds → **WARNING**
+- ClickHouse disk usage > 80% → **WARNING**
+
+---
+
+## Threshold Definitions
+
+### Pod Restart Count
+
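+**Measurement:** The `RESTARTS` column of `kubectl get pods -n <namespace>`. The counter is cumulative since pod creation, so compare two samples taken an hour apart to get restarts per hour.
+
+**Calculation:**
+```bash
+# Prints each pod name followed by the restart count of every container in it
+kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'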
+```
+
+**Thresholds:**
+- **Critical:** > 5 restarts in 1 hour
+- **Warning:** > 2 restarts in 1 hour
+
+**Action:**
+- Check pod logs: `kubectl logs <pod> -n <namespace> --tail=100`
+- Check events: `kubectl get events -n <namespace> --sort-by=.lastTimestamp`
+- Check resource limits: `kubectl describe pod <pod> -n <namespace>`
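+
+Because `restartCount` only ever grows, the hourly thresholds need a difference between two samples. A minimal sketch, assuming each sample is the tab-separated text printed by the jsonpath command in this section; `parse_restart_counts` and `restarts_since` are hypothetical helper names:
+
+```python
+def parse_restart_counts(output):
+    """Parse 'pod<TAB>count [count ...]' lines into {pod: total_restarts}."""
+    totals = {}
+    for line in output.splitlines():
+        if not line.strip():
+            continue
+        name, _, counts = line.partition("\t")
+        # Pods that have not started yet print no counts; sum() of nothing is 0.
+        totals[name] = sum(int(c) for c in counts.split())
+    return totals
+
+
+def restarts_since(previous, current):
+    """Restarts added between two samples, keyed by pod name."""
+    deltas = {}
+    for pod, count in current.items():
+        delta = count - previous.get(pod, 0)
+        # A recreated pod starts its counter over; fall back to the new count.
+        deltas[pod] = delta if delta >= 0 else count
+    return deltas
+```
+
+Take one sample, wait an hour, take another, and compare each delta against the warning (> 2) and critical (> 5) thresholds.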
+
+### Pending Pods
+
+**Measurement:** Pods in `Pending` state
+
+**Calculation:**
+```bash
+kubectl get pods -n <namespace> --field-selector=status.phase=Pending
+```
+
+**Thresholds:**
+- **Critical:** Pending for > 10 minutes
+- **Warning:** Pending for > 5 minutes
+
+**Action:**
+- Check events: `kubectl describe pod <pod> -n <namespace>`
+- Check node capacity: `kubectl describe nodes` (allocatable vs. requested resources); `kubectl top nodes` shows current usage only
+- Check PVC binding: `kubectl get pvc -n <namespace>`
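+
+The field selector lists Pending pods but not how long they have been waiting. A minimal sketch, assuming the input is the parsed output of `kubectl get pods -n <namespace> -o json` and that time since creation is an acceptable stand-in for time spent Pending; `pending_minutes` is a hypothetical helper name:
+
+```python
+from datetime import datetime, timezone
+
+
+def pending_minutes(pod_list, now=None):
+    """Map each Pending pod name to the minutes elapsed since it was created."""
+    now = now or datetime.now(timezone.utc)
+    ages = {}
+    for pod in pod_list.get("items", []):
+        if pod.get("status", {}).get("phase") != "Pending":
+            continue
+        # Kubernetes emits RFC 3339 timestamps in UTC, e.g. 2026-01-06T11:48:00Z.
+        created = datetime.strptime(
+            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
+        ).replace(tzinfo=timezone.utc)
+        ages[pod["metadata"]["name"]] = (now - created).total_seconds() / 60
+    return ages
+```
+
+Values above 5 map to the warning threshold and values above 10 to the critical one.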
+
+### API Server Error Rate
+
+**Measurement:** HTTP 5xx responses / total requests
+
+**Calculation:**
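+- Application metrics: `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` (metric and label names depend on your instrumentation; without `sum()`, the division matches series label-by-label and does not yield an overall ratio)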
+- Or: Check application logs for error patterns
+
+**Thresholds:**
+- **Critical:** > 5% error rate
+- **Warning:** > 1% error rate
+
+**Action:**
+- Check pod logs: `kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i error`
+- Check downstream services (PostgreSQL, Redis, ClickHouse)
+- Check resource usage: `kubectl top pod <api-pod> -n <namespace>`
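+
+When no metrics backend is available, the same ratio can be worked out from two readings of the request counters. A minimal sketch, assuming each reading is an `(errors_5xx, total_requests)` pair of cumulative counters; `error_rate_percent` is a hypothetical helper name:
+
+```python
+def error_rate_percent(start, end):
+    """Share of 5xx responses between two (errors, total) counter readings."""
+    errors = end[0] - start[0]
+    total = end[1] - start[1]
+    if total <= 0:
+        # No traffic in the window, or the counters were reset by a restart.
+        return 0.0
+    return 100.0 * errors / total
+```
+
+For example, going from `(10, 1000)` to `(40, 2000)` means 30 errors in 1000 requests, a 3% error rate: above the warning threshold but below critical.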
+
+### Worker Queue Depth
+
+**Measurement:** Number of jobs in Redis queue
+
+**Calculation:**
+```bash
+# Redis CLI (confirm the queue key name used by your deployment)
+redis-cli LLEN langsmith:jobs:traces
+```
+
+**Or via application metrics:**
+- KEDA metrics: `redis_queue_length`
+
+**Thresholds:**
+- **Critical:** > 200 jobs
+- **Warning:** > 50 jobs
+
+**Action:**
+- Scale workers: Check KEDA ScaledObject
+- Check worker processing rate
+- Check for stuck jobs
+
+### PostgreSQL Connection Count
+
+**Measurement:** Active connections / max connections
+
+**Calculation:**
+```sql
+SELECT count(*) FROM pg_stat_activity;
+SELECT setting FROM pg_settings WHERE name = 'max_connections';
+```
+
+**Or via cloud provider metrics:**
+- AWS RDS: `DatabaseConnections` metric
+- Azure Database: `active_connections` metric
+
+**Thresholds:**
+- **Critical:** > 90% of max connections
+- **Warning:** > 80% of max connections
+
+**Action:**
+- Check for connection leaks
+- Review connection pool configuration
+- Consider increasing max connections (if justified)
+
+### Redis Memory Usage
+
+**Measurement:** Used memory / max memory
+
+**Calculation:**
+```bash
+redis-cli INFO memory
+# used_memory / maxmemory
+```
+
+**Or via cloud provider metrics:**
+- AWS ElastiCache: `DatabaseMemoryUsagePercentage`
+- Azure Cache: `usedmemorypercentage`
+
+**Thresholds:**
+- **Critical:** > 95% memory usage
+- **Warning:** > 90% memory usage
+
+**Action:**
+- Check for memory leaks
+- Review key expiration policies
+- Consider scaling up instance size
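+
+The `used_memory / maxmemory` ratio can be computed directly from the `INFO memory` text. A minimal sketch, assuming the output of `redis-cli INFO memory` has been captured as a string; `parse_redis_info` and `redis_memory_percent` are hypothetical helper names:
+
+```python
+def parse_redis_info(text):
+    """Parse INFO output: 'key:value' lines, with '#' lines as section headers."""
+    info = {}
+    for line in text.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue
+        key, _, value = line.partition(":")
+        info[key] = value
+    return info
+
+
+def redis_memory_percent(info):
+    """Return used_memory / maxmemory as a percentage, or None without a limit."""
+    max_memory = int(info.get("maxmemory", 0))
+    if max_memory == 0:
+        # maxmemory 0 means no limit is configured, so a ratio is undefined.
+        return None
+    return 100.0 * int(info["used_memory"]) / max_memory
+```
+
+A `None` result is worth flagging on its own: with no `maxmemory` set, the percentage thresholds in this section cannot fire, and the cloud provider metric is the better source.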
+
+### ClickHouse Query Latency
+
+**Measurement:** p95 query latency
+
+**Calculation:**
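+- ClickHouse system tables: `SELECT quantile(0.95)(query_duration_ms) FROM system.query_log WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR` (the `type` filter excludes `QueryStart` rows, which record a duration of 0 and would pull the percentile down)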
+
+**Thresholds:**
+- **Critical:** p95 > 10 seconds
+- **Warning:** p95 > 3 seconds
+
+**Action:**
+- Check table sizes (may need partitioning)
+- Check for slow queries: `SELECT * FROM system.query_log WHERE query_duration_ms > 5000 ORDER BY query_duration_ms DESC LIMIT 10`
+- Check CPU and memory: `kubectl top pod <clickhouse-pod> -n <namespace>` (this does not report disk I/O; use node or storage-volume metrics for that)
+
+---
+
+## Log Patterns to Monitor
+
+### Common Failure Patterns
+
+**Connection Refused:**
+```bash
+kubectl logs <pod> -n <namespace> --tail=100 | grep -i "connection refused"
+```
+
+**Timeouts:**
+```bash
+kubectl logs <pod> -n <namespace> --tail=100 | grep -i "timeout"
+```
+
+**Out of Memory:**
+```bash
+kubectl logs <pod> -n <namespace> --tail=100 | grep -i "out of memory\|OOM"
+```
+
+**Database Errors:**
+```bash
+kubectl logs <pod> -n <namespace> --tail=100 | grep -i "database\|postgres\|redis\|clickhouse" | grep -i "error\|fail"
+```
+
+**Authentication Errors:**
+```bash
+kubectl logs <pod> -n <namespace> --tail=100 | grep -i "unauthorized\|forbidden\|auth"
+```
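+
+To scan one log capture for all of these patterns at once, the grep commands can be mirrored in a few lines of Python. A minimal sketch, assuming the log text was saved from `kubectl logs`; the category names and `count_patterns` are hypothetical:
+
+```python
+import re
+
+# Case-insensitive patterns mirroring the grep commands in this section.
+PATTERNS = {
+    "connection_refused": re.compile(r"connection refused", re.I),
+    "timeout": re.compile(r"timeout", re.I),
+    "out_of_memory": re.compile(r"out of memory|OOM", re.I),
+    "auth": re.compile(r"unauthorized|forbidden|auth", re.I),
+}
+_DATA_STORE = re.compile(r"database|postgres|redis|clickhouse", re.I)
+_FAILURE = re.compile(r"error|fail", re.I)
+
+
+def count_patterns(log_text):
+    """Count the log lines that match each failure category."""
+    counts = {name: 0 for name in PATTERNS}
+    counts["database_error"] = 0
+    for line in log_text.splitlines():
+        for name, pattern in PATTERNS.items():
+            if pattern.search(line):
+                counts[name] += 1
+        # Like the piped greps: a data store name AND an error word on one line.
+        if _DATA_STORE.search(line) and _FAILURE.search(line):
+            counts["database_error"] += 1
+    return counts
+```
+
+The patterns are as broad as the originals (for example, `auth` also matches `author`), so treat the counts as a pointer to where to read, not as a diagnosis.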
+
+---
+
+## Escalation Evidence
+
+When escalating to support, gather:
+
+1. **Pod Status:**
+   ```bash
+   kubectl get pods -n <namespace> -o wide
+   kubectl describe pod <problem-pod> -n <namespace>
+   ```
+
+2. **Recent Events:**
+   ```bash
+   kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -50
+   ```
+
+3. **Resource Usage:**
+   ```bash
+   kubectl top pods -n <namespace>
+   kubectl top nodes
+   ```
+
+4. **Pod Logs:**
+   ```bash
+   kubectl logs <pod> -n <namespace> --tail=200
+   ```
+
+5. **Application Metrics:**
+   - Error rates, latencies, queue depths
+   - Database connection counts
+   - Cache hit rates
+
+6. **Configuration:**
+   - Helm values (redacted)
+   - Environment variables (redacted)
+   - Resource requests/limits
+
+---
+
+## Threshold Tuning
+
+**Initial thresholds:** Use the values above as starting points.
+
+**Tuning process:**
+1. Monitor for 1-2 weeks
+2. Identify false positives (alerts that don't require action)
+3. Identify missed incidents (issues that should have alerted)
+4. Adjust thresholds based on historical data
+5. Document threshold changes and rationale
+
+**Factors to consider:**
+- Workload patterns (peak hours, batch jobs)
+- Growth trajectory (user growth, data growth)
+- Resource capacity (cluster size, database size)
+- Business requirements (SLA, RTO, RPO)
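+
+Steps 2 and 3 of the tuning process can be made concrete by replaying history against a candidate threshold. A minimal sketch, assuming you have recorded `(metric_value, was_real_incident)` pairs during the monitoring period; `evaluate_threshold` and the sample data are hypothetical:
+
+```python
+def evaluate_threshold(samples, threshold):
+    """Count false positives and missed incidents for one candidate threshold.
+
+    `samples` is a list of (metric_value, was_real_incident) pairs.
+    """
+    false_positives = sum(
+        1 for value, incident in samples if value > threshold and not incident
+    )
+    missed = sum(
+        1 for value, incident in samples if value <= threshold and incident
+    )
+    return {"threshold": threshold, "false_positives": false_positives, "missed": missed}
+
+
+if __name__ == "__main__":
+    # Illustrative queue-depth history, not real measurements.
+    history = [(60, False), (80, False), (120, True), (250, True), (40, False)]
+    for candidate in (50, 100, 200):
+        print(evaluate_threshold(history, candidate))
+```
+
+A threshold with zero misses and few false positives on recent history is a reasonable candidate; record the comparison alongside the change, as step 5 asks.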
+
+---
+
+## Quick Reference
+
+| Signal | Critical | Warning | Measurement |
+|--------|----------|---------|-------------|
+| Pod restarts | > 5/hour | > 2/hour | `kubectl get pods` |
+| Pending pods | > 10 min | > 5 min | `kubectl get pods` |
+| API error rate | > 5% | > 1% | Application metrics |
+| Worker queue | > 200 | > 50 | Redis queue length |
+| PostgreSQL connections | > 90% max | > 80% max | Database metrics |
+| Redis memory | > 95% | > 90% | Redis INFO memory |
+| ClickHouse latency | > 10s p95 | > 3s p95 | Query log |
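+
+The table can be kept as data so that notebooks and alert definitions share one source. A minimal sketch, with the values copied from the table above; the signal keys and `classify` are hypothetical names:
+
+```python
+# (warning, critical) bounds; a value strictly above a bound triggers that level.
+THRESHOLDS = {
+    "pod_restarts_per_hour": (2, 5),
+    "pending_minutes": (5, 10),
+    "api_error_rate_pct": (1, 5),
+    "worker_queue_depth": (50, 200),
+    "postgres_connections_pct": (80, 90),
+    "redis_memory_pct": (90, 95),
+    "clickhouse_p95_seconds": (3, 10),
+}
+
+
+def classify(signal, value):
+    """Return 'critical', 'warning', or 'ok' for a measured value."""
+    warning, critical = THRESHOLDS[signal]
+    if value > critical:
+        return "critical"
+    if value > warning:
+        return "warning"
+    return "ok"
+```
+
+Every signal in this table alerts on values that are too high, which is why one comparison direction is enough; signals that alert on low values (such as worker processing rate) would need the inverse check.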
+
+---
+
+## Next Steps
+
+1. **Configure alerts** based on these thresholds
+2. **Test alerts** to ensure they fire correctly
+3. **Document runbooks** for each alert type
+4. **Review quarterly** and adjust based on experience
+
@@ -0,0 +1,238 @@
+# Production Readiness Checklist
+
+**Purpose:** Validate that LangSmith deployment meets production requirements  
+**Use:** Complete this checklist before declaring production-ready  
+**Frequency:** Review quarterly or after significant changes
+
+---
+
+## Infrastructure & Networking
+
+### Cloud Provider Configuration
+- [ ] Correct cloud account/subscription (verified)
+- [ ] Correct region selected (verified)
+- [ ] Private networking configured (all data stores in private subnets)
+- [ ] VPC/VNet peering configured (if multi-VPC deployment)
+- [ ] Security groups/NSGs configured correctly
+- [ ] IAM roles/Managed Identities configured (no access keys)
+
+### Kubernetes Cluster
+- [ ] Cluster version supported (check compatibility matrix)
+- [ ] Node capacity sufficient (headroom for scaling)
+- [ ] Cluster autoscaling enabled (if applicable)
+- [ ] CSI storage drivers installed (EBS CSI for AWS, Azure Disk CSI for Azure)
+- [ ] Network policies configured (if required)
+- [ ] Resource quotas set (if multi-tenant)
+
+---
+
+## Data Stores
+
+### PostgreSQL
+- [ ] Instance size meets baseline (db.r5.xlarge minimum for production)
+- [ ] Multi-AZ enabled (RDS) or zone-redundant high availability enabled (Azure); read replicas alone do not provide automatic failover
+- [ ] Storage autoscaling enabled
+- [ ] Automated backups configured (7-day retention minimum)
+- [ ] Connection pool configured (100+ connections)
+- [ ] Private networking (no public access)
+- [ ] Encryption at rest enabled
+- [ ] Performance insights/monitoring enabled
+
+### Redis
+- [ ] Instance type meets baseline (cache.r6g.xlarge minimum for production)
+- [ ] Cluster mode enabled (3+ nodes for production)
+- [ ] AOF persistence enabled (production)
+- [ ] Memory headroom sufficient (50% free)
+- [ ] Private networking (no public access)
+- [ ] Encryption at rest enabled
+
+### ClickHouse
+- [ ] Deployment type: Managed (ClickHouse Cloud) OR in-cluster with proper sizing
+- [ ] In-cluster: 3-node cluster minimum (for HA)
+- [ ] Resources per node: 8 CPU, 32 GB RAM, 1 TB storage (production)
+- [ ] Replication factor: 2x (6 total pods for 3-node cluster)
+- [ ] Storage class: EBS gp3 with 3000 IOPS (AWS) or Premium SSD (Azure)
+- [ ] Backups configured (if in-cluster)
+- [ ] Private networking (no public access)
+
+### Blob Storage (REQUIRED)
+- [ ] Blob storage provider configured (S3 or Azure Blob Storage)
+- [ ] NOT using local filesystem or in-cluster storage
+- [ ] Bucket/container created and accessible
+- [ ] IAM role/Managed Identity configured (no access keys)
+- [ ] Versioning enabled
+- [ ] Encryption at rest enabled
+- [ ] Lifecycle policies configured (cost optimization)
+- [ ] Health check passes (see ops sanity checks notebook)
+
+**Critical:** Blob storage is REQUIRED for production. Without it, ClickHouse will become unusable under load.
+
+---
+
+## Application Configuration
+
+### Helm Configuration
+- [ ] Helm values file reviewed and documented
+- [ ] Resource requests/limits set for all containers
+- [ ] Replica counts set appropriately (min 2 for HA)
+- [ ] Environment variables documented
+- [ ] Secrets stored in Kubernetes (not in values file)
+- [ ] Values file version controlled (Git)
+
+### High Availability
+- [ ] API server replicas: 2+ (for HA)
+- [ ] Worker replicas: 1+ (scaled via KEDA)
+- [ ] Pod anti-affinity rules configured (spread across nodes/AZs)
+- [ ] Readiness probes configured correctly
+- [ ] Liveness probes configured correctly
+
+### Autoscaling
+- [ ] HPA configured for API servers (CPU/memory targets)
+- [ ] HPA min replicas: 2
+- [ ] HPA max replicas: 10+ (adjust based on capacity)
+- [ ] KEDA ScaledObject configured for workers (queue depth)
+- [ ] KEDA min replicas: 1
+- [ ] KEDA max replicas: 20+ (adjust based on workload)
+
+---
+
+## Observability
+
+### Monitoring
+- [ ] Kubernetes metrics available (pod CPU/memory)
+- [ ] Application metrics exposed (request rates, latencies)
+- [ ] Database metrics available (connection counts, query performance)
+- [ ] Redis metrics available (memory usage, hit rates)
+- [ ] ClickHouse metrics available (query latency, table sizes)
+- [ ] Log aggregation configured (CloudWatch, Azure Monitor, etc.)
+
+### Alerting
+- [ ] Critical alerts configured (pod crashes, high error rates)
+- [ ] Warning alerts configured (resource saturation, queue depth)
+- [ ] Alert thresholds documented (see ops_signals_and_thresholds.md)
+- [ ] On-call rotation configured
+- [ ] Escalation paths defined
+
+### Dashboards
+- [ ] Kubernetes dashboard (pod status, resource usage)
+- [ ] Application dashboard (request rates, error rates)
+- [ ] Database dashboard (connection counts, query performance)
+- [ ] Queue depth dashboard (worker queue metrics)
+
+---
+
+## Security
+
+### Authentication & Authorization
+- [ ] SSO configured (OIDC or SAML)
+- [ ] Local auth disabled (production)
+- [ ] Role mapping configured correctly
+- [ ] Admin access restricted (minimal admins)
+
+### Network Security
+- [ ] Ingress TLS configured (valid certificate)
+- [ ] mTLS enabled (if service mesh used)
+- [ ] Egress policies configured (if service mesh used)
+- [ ] Network policies configured (if required)
+
+### Secrets Management
+- [ ] Secrets stored in Kubernetes (not in code)
+- [ ] Secrets rotation process documented
+- [ ] Access to secrets restricted (RBAC)
+
+---
+
+## Backup & Disaster Recovery
+
+### Backups
+- [ ] PostgreSQL backups automated (daily, 7-day retention)
+- [ ] ClickHouse backups configured (if in-cluster)
+- [ ] Blob storage versioning enabled
+- [ ] Backup restoration tested (within the last 6 months)
+
+### Disaster Recovery
+- [ ] DR plan documented
+- [ ] RTO/RPO defined
+- [ ] Failover procedure tested
+- [ ] Cross-region replication configured (if required)
+
+---
+
+## Operational Readiness
+
+### Documentation
+- [ ] Runbooks documented (common operations)
+- [ ] Incident response procedures documented
+- [ ] Escalation paths documented
+- [ ] Service sizing baselines documented
+
+### Testing
+- [ ] Load testing performed (validates scaling)
+- [ ] Failover testing performed (validates HA)
+- [ ] Backup restoration tested
+- [ ] Ops sanity checks notebook run (all checks pass)
+
+### Team Readiness
+- [ ] On-call rotation established
+- [ ] Team trained on operations
+- [ ] Access to cloud console (for managed services)
+- [ ] Access to monitoring/alerting tools
+
+---
+
+## Service Mesh (If Applicable)
+
+### Istio Configuration
+- [ ] Istio installed and configured
+- [ ] Sidecar injection enabled (namespace or per-workload)
+- [ ] ServiceEntry configured (for external databases)
+- [ ] DestinationRule configured (traffic policies)
+- [ ] Egress policies configured (if required)
+- [ ] mTLS enabled (if required)
+
+### Operational Considerations
+- [ ] Log selection documented (app vs proxy logs)
+- [ ] Health probe timeouts adjusted (account for sidecar)
+- [ ] Multi-container pod logging understood
+
+---
+
+## Sign-Off
+
+**Validated by:** _________________  
+**Date:** _________________  
+**Next Review Date:** _________________  
+**Notes:** _________________
+
+---
+
+## Post-Checklist Actions
+
+1. **Run ops sanity checks notebook:**
+   - `notebooks/module-3/01_ops_sanity_checks.ipynb`
+   - Address any failures before production
+
+2. **Document thresholds:**
+   - Update `docs/shared/ops_signals_and_thresholds.md`
+   - Configure alerts based on thresholds
+
+3. **Schedule quarterly reviews:**
+   - Review checklist quarterly
+   - Update baselines as workload grows
+   - Adjust thresholds based on historical data
+
+---
+
+## Common Gaps
+
+**Most common production readiness gaps:**
+1. Blob storage not configured (CRITICAL)
+2. PostgreSQL single-AZ (no HA)
+3. Redis single node (no cluster mode)
+4. No autoscaling configured
+5. No monitoring/alerting
+6. Backups not tested
+7. Resource limits not set
+
+**Address these before declaring production-ready.**
+
@@ -0,0 +1,468 @@
+# Sidecars & Service Mesh (Istio)
+
+**Purpose:** Guide for enabling and operating Istio sidecars in LangSmith deployments  
+**Audience:** Platform engineers and operators managing service mesh configurations  
+**Prerequisites:** Istio installed in cluster (out of scope for this guide)
+
+---
+
+## When Sidecars Are Needed
+
+### Use Cases
+
+**Egress Control:**
+- Restrict outbound traffic to approved destinations only
+- Prevent pods from accessing unauthorized external services
+- Enforce network policies at the service mesh level
+
+**mTLS (Mutual TLS):**
+- Encrypt traffic between services within the cluster
+- Provide service-to-service authentication
+- Meet compliance requirements for encrypted communication
+
+**Policy Enforcement:**
+- Rate limiting between services
+- Circuit breakers for fault tolerance
+- Traffic splitting for canary deployments
+
+**Observability:**
+- Distributed tracing across services
+- Service-level metrics collection
+- Request/response logging
+
+### When NOT Needed
+
+**Simple deployments:**
+- Development environments
+- Proof-of-concept deployments
+- Single-service deployments
+
+**No egress requirements:**
+- All traffic stays within cluster
+- No external database connections
+- No outbound API calls
+
+**Alternative solutions:**
+- Network policies (Kubernetes native)
+- Ingress controllers (for north-south traffic)
+- Application-level rate limiting
+
+---
+
+## How to Enable Injection Safely
+
+### Namespace-Level Injection (Recommended)
+
+**Best for:** LangSmith namespace (all workloads need sidecars)
+
+**Configuration:**
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: langsmith
+  labels:
+    istio-injection: enabled
+    istio-discovery: enabled  # only needed if the mesh's discoverySelectors match this label
+```
+
+**Apply:**
+```bash
+kubectl label namespace langsmith istio-injection=enabled istio-discovery=enabled
+```
+
+**Verification:**
+```bash
+kubectl get namespace langsmith --show-labels
+```
+
+**Behavior:**
+- All new pods in namespace get sidecars automatically
+- Existing pods require restart to get sidecars
+- Pods can opt out with annotation: `sidecar.istio.io/inject: "false"`
+
+### Per-Workload Annotation (Selective Injection)
+
+**Best for:** Specific workloads that need sidecars
+
+**Configuration:**
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: langsmith-api
+spec:
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/inject: "true"
+    spec:
+      containers:
+      - name: api
+        # ... container spec
+```
+
+**Apply:**
+```bash
+kubectl patch deployment langsmith-api -n langsmith -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"true"}}}}}'
+```
+
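+**Behavior:**
+- Only annotated workloads get sidecars
+- Works when the namespace carries no injection label; a namespace labeled `istio-injection=disabled` still blocks injection
+- Recent Istio releases read `sidecar.istio.io/inject` as a pod label, so check the injection documentation for your version
+- More granular control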
+
+### Revision-Based Injection (Canary/Blue-Green)
+
+**Best for:** Gradual rollout or canary deployments
+
+**Configuration:**
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: langsmith
+  labels:
+    # Do not also set istio-injection: it takes precedence over istio.io/rev
+    istio.io/rev: default  # or specific revision
+```
+
+**Behavior:**
+- Allows multiple Istio control planes
+- Enables gradual migration
+- Supports canary deployments
+
+---
+
+## Operational Implications
+
+### Logging and kubectl logs Behavior
+
+**Multi-container pods:** After sidecar injection, pods have multiple containers:
+- Application container (e.g., `langsmith-api`)
+- Sidecar container (`istio-proxy`)
+
+**Default behavior:**
+```bash
+# Without -c, kubectl picks the pod's default container (usually the application)
+kubectl logs <pod> -n <namespace>
+
+# Show logs from the proxy (sidecar) container
+kubectl logs <pod> -n <namespace> -c istio-proxy
+
+# Show logs from specific container
+kubectl logs <pod> -n <namespace> -c <container-name>
+
+# Show logs from all containers
+kubectl logs <pod> -n <namespace> --all-containers=true
+```
+
+**Common issue:** "If logs appear missing after injection, you're likely looking at the wrong container."
+
+**Solution:**
+```bash
+# List containers in pod
+kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
+
+# Get logs from application container (use a name from the list above)
+kubectl logs <pod> -n <namespace> -c <app-container-name>
+
+# Get logs from proxy container
+kubectl logs <pod> -n <namespace> -c istio-proxy
+```
+
+### Health Probes and Timeouts
+
+**Sidecar adds latency:**
+- Sidecar intercepts health check requests
+- Adds ~10-50ms latency per request
+- May cause probe timeouts if thresholds are too low
+
+**Adjust probe timeouts:**
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: langsmith-api
+spec:
+  template:
+    spec:
+      containers:
+      - name: api
+        readinessProbe:
+          httpGet:
+            path: /health
+            port: 8080
+          initialDelaySeconds: 5
+          periodSeconds: 10
+          timeoutSeconds: 5  # Increase if sidecars enabled
+          failureThreshold: 3
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: 8080
+          initialDelaySeconds: 30
+          periodSeconds: 30
+          timeoutSeconds: 5  # Increase if sidecars enabled
+          failureThreshold: 3
+```
+
+**Verification:**
+```bash
+# Check probe success rate
+kubectl get pods -n <namespace> -o wide
+# Look for pods in Ready state
+
+# Check probe failures
+kubectl describe pod <pod> -n <namespace> | grep -A 5 "Liveness\|Readiness"
+```
+
+### Egress to External Databases
+
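+**Problem:** When the mesh sets `outboundTrafficPolicy` to `REGISTRY_ONLY`, sidecars block outbound traffic to hosts that are not in the service registry. Istio's default, `ALLOW_ANY`, lets this traffic through, but the mesh then has no explicit record of the external dependency.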
+
+**Solution:** Configure `ServiceEntry` for external endpoints.
+
+**Example ServiceEntry for PostgreSQL:**
+```yaml
+apiVersion: networking.istio.io/v1beta1
+kind: ServiceEntry
+metadata:
+  name: postgres-external
+  namespace: langsmith
+spec:
+  hosts:
+  - <postgres-hostname>.rds.amazonaws.com
+  ports:
+  - number: 5432
+    name: postgres
+    protocol: TCP
+  location: MESH_EXTERNAL
+  resolution: DNS
+```
+
+**Example ServiceEntry for Redis:**
+```yaml
+apiVersion: networking.istio.io/v1beta1
+kind: ServiceEntry
+metadata:
+  name: redis-external
+  namespace: langsmith
+spec:
+  hosts:
+  - <redis-endpoint>.cache.amazonaws.com
+  ports:
+  - number: 6379
+    name: redis
+    protocol: TCP
+  location: MESH_EXTERNAL
+  resolution: DNS
+```
+
+**Apply:**
+```bash
+kubectl apply -f serviceentry-postgres.yaml -n langsmith
+kubectl apply -f serviceentry-redis.yaml -n langsmith
+```
+
+**Verification:**
+```bash
+# Check ServiceEntry
+kubectl get serviceentry -n langsmith
+
+# Test connectivity from pod
+kubectl exec -it <pod> -n langsmith -c <app-container> -- nc -zv <db-host> <port>
+```
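+
+With several external dependencies, the ServiceEntry manifests can be generated instead of hand-edited. A minimal sketch, assuming the same fields as the two examples above; `service_entry` is a hypothetical helper and the hostnames are placeholders. It emits JSON, which `kubectl apply -f -` accepts just like YAML:
+
+```python
+import json
+
+
+def service_entry(name, host, port, port_name, namespace="langsmith"):
+    """Build a TCP ServiceEntry manifest for one external endpoint."""
+    return {
+        "apiVersion": "networking.istio.io/v1beta1",
+        "kind": "ServiceEntry",
+        "metadata": {"name": name, "namespace": namespace},
+        "spec": {
+            "hosts": [host],
+            "ports": [{"number": port, "name": port_name, "protocol": "TCP"}],
+            "location": "MESH_EXTERNAL",
+            "resolution": "DNS",
+        },
+    }
+
+
+if __name__ == "__main__":
+    # Placeholder hostnames; substitute your real database endpoints.
+    entries = [
+        service_entry("postgres-external", "db.example.internal", 5432, "postgres"),
+        service_entry("redis-external", "cache.example.internal", 6379, "redis"),
+    ]
+    # A v1 List lets a single `kubectl apply -f -` create every entry.
+    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": entries}, indent=2))
+```
+
+Generating the manifests from one list of endpoints keeps the hostnames in a single place, which helps with the "hostname matches" check during troubleshooting.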
+
+### DestinationRule for Traffic Policies
+
+**Example DestinationRule:**
+```yaml
+apiVersion: networking.istio.io/v1beta1
+kind: DestinationRule
+metadata:
+  name: postgres-dr
+  namespace: langsmith
+spec:
+  host: <postgres-hostname>.rds.amazonaws.com
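+  trafficPolicy:
+    connectionPool:
+      tcp:
+        maxConnections: 100
+    # No http pool settings: PostgreSQL traffic is plain TCP.
+    # No tls block: PostgreSQL negotiates TLS inside its own protocol,
+    # so leave encryption to the client driver (for example, sslmode=require).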
+```
+
+---
+
+## Sample Labels and Annotations
+
+### Namespace Labels
+
+```yaml
+labels:
+  istio-injection: enabled
+  istio-discovery: enabled
+```
+
+### Pod Annotations
+
+```yaml
+annotations:
+  sidecar.istio.io/inject: "true"  # Enable injection
+  # or
+  sidecar.istio.io/inject: "false" # Disable injection
+```
+
+### Complete Example
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: langsmith-api
+  namespace: langsmith
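+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: langsmith-api  # must match the template labels below
+  template:
+    metadata:
+      labels:
+        app: langsmith-api
+      annotations:
+        sidecar.istio.io/inject: "true"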
+    spec:
+      containers:
+      - name: api
+        image: langsmith/api:latest
+        ports:
+        - containerPort: 8080
+        readinessProbe:
+          httpGet:
+            path: /health
+            port: 8080
+          timeoutSeconds: 5
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: 8080
+          timeoutSeconds: 5
+```
+
+---
+
+## Verification Commands
+
+### Check Sidecar Injection
+
+```bash
+# List pods and containers
+kubectl get pods -n langsmith -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'
+
+# Check for istio-proxy container
+kubectl get pod <pod> -n langsmith -o jsonpath='{.spec.containers[*].name}' | grep istio-proxy
+
+# Describe pod to see all containers
+kubectl describe pod <pod> -n langsmith | grep -A 10 "Containers:"
+```
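+
+To check a whole namespace in one pass, the same test can run over the JSON pod list. A minimal sketch, assuming the input is the parsed output of `kubectl get pods -n langsmith -o json`; `pods_missing_sidecar` is a hypothetical helper name:
+
+```python
+def pods_missing_sidecar(pod_list, sidecar="istio-proxy"):
+    """Return the names of pods that have no container named `sidecar`."""
+    missing = []
+    for pod in pod_list.get("items", []):
+        spec = pod.get("spec", {})
+        names = [c["name"] for c in spec.get("containers", [])]
+        # Some Istio setups run the proxy as an init container (native sidecars),
+        # so look there too before reporting a pod as uninjected.
+        names += [c["name"] for c in spec.get("initContainers", [])]
+        if sidecar not in names:
+            missing.append(pod["metadata"]["name"])
+    return missing
+```
+
+Pods created before the namespace was labeled show up in this list until they are restarted.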
+
+### Check ServiceEntry
+
+```bash
+# List ServiceEntries
+kubectl get serviceentry -n langsmith
+
+# Describe ServiceEntry
+kubectl describe serviceentry <name> -n langsmith
+```
+
+### Check DestinationRule
+
+```bash
+# List DestinationRules
+kubectl get destinationrule -n langsmith
+
+# Describe DestinationRule
+kubectl describe destinationrule <name> -n langsmith
+```
+
+### Test Connectivity
+
+```bash
+# Test from application container
+kubectl exec -it <pod> -n langsmith -c <app-container> -- curl -v <external-url>
+
+# Test from proxy container (if needed)
+kubectl exec -it <pod> -n langsmith -c istio-proxy -- curl -v <external-url>
+```
+
+---
+
+## Troubleshooting
+
+### Logs Appear Missing
+
+**Symptom:** `kubectl logs <pod>` shows no output or the wrong logs.
+
+**Cause:** The logs are being read from the wrong container (the `istio-proxy` sidecar instead of the application container).
+
+**Solution:**
+```bash
+# List containers
+kubectl get pod <pod> -n langsmith -o jsonpath='{.spec.containers[*].name}'
+
+# Get logs from correct container
+kubectl logs <pod> -n langsmith -c <app-container-name>
+```
+
+### Health Probes Failing
+
+**Symptom:** Pods not becoming Ready after sidecar injection.
+
+**Cause:** Probe timeouts are too low for the latency the sidecar adds.
+
+**Solution:** Increase `timeoutSeconds` in the probe configuration. Kubernetes applies a 1-second timeout when the field is omitted.
+
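+To find probes that may need adjusting, a small script can scan a pod spec for short timeouts. This is a sketch only: the function name `low_timeout_probes` and the 5-second threshold are assumptions (the threshold mirrors the value used in the Complete Example), and the input is assumed to be the `spec` object from `kubectl get pod <pod> -o json`.
+
+```python
+def low_timeout_probes(pod_spec: dict, min_timeout: int = 5) -> list[tuple[str, str, int]]:
+    """List (container, probe, timeoutSeconds) entries below `min_timeout`."""
+    findings = []
+    for container in pod_spec.get("containers", []):
+        for probe in ("readinessProbe", "livenessProbe", "startupProbe"):
+            cfg = container.get(probe)
+            if cfg is None:
+                continue
+            # Kubernetes uses 1 second when timeoutSeconds is not set.
+            timeout = cfg.get("timeoutSeconds", 1)
+            if timeout < min_timeout:
+                findings.append((container["name"], probe, timeout))
+    return findings
+
+
+# Illustrative spec: the readiness probe omits timeoutSeconds, the liveness probe sets 5.
+spec = {
+    "containers": [
+        {
+            "name": "api",
+            "readinessProbe": {"httpGet": {"path": "/health", "port": 8080}},
+            "livenessProbe": {"httpGet": {"path": "/health", "port": 8080},
+                              "timeoutSeconds": 5},
+        }
+    ]
+}
+print(low_timeout_probes(spec))  # [('api', 'readinessProbe', 1)]
+```
+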
+### External Database Connection Refused
+
+**Symptom:** Cannot connect to external PostgreSQL/Redis.
+
+**Cause:** ServiceEntry not configured or incorrect.
+
+**Solution:**
+1. Check ServiceEntry exists: `kubectl get serviceentry -n langsmith`
+2. Verify hostname matches: `kubectl describe serviceentry <name> -n langsmith`
+3. Check egress policies: `kubectl get authorizationpolicy -n langsmith`
+
+### High Latency After Injection
+
+**Symptom:** Request latency increased after sidecar injection.
+
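+**Cause:** Normal sidecar overhead. Every request now passes through an Envoy proxy on each side of the connection, which typically adds a few milliseconds; the exact cost depends on load, payload size, and enabled features.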
+
+**Solution:** This is expected. If latency is excessive (>100ms), check:
+- Proxy resource limits
+- Network policies
+- mTLS overhead
+
+---
+
+## Best Practices
+
+1. **Start with namespace-level injection** for simplicity
+2. **Adjust health probe timeouts** after injection
+3. **Configure ServiceEntry** for all external dependencies
+4. **Monitor proxy resource usage** (CPU/memory)
+5. **Document container names** for log access
+6. **Test connectivity** after configuration changes
+7. **Use per-workload annotation** for selective injection
+
+---
+
+## References
+
+- [Istio Documentation](https://istio.io/latest/docs/)
+- [Istio Sidecar Injection](https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/)
+- [Istio ServiceEntry](https://istio.io/latest/docs/reference/config/networking/service-entry/)
+- [Istio DestinationRule](https://istio.io/latest/docs/reference/config/networking/destination-rule/)
+
@@ -0,0 +1,185 @@
+# Support Escalation Template
+
+**Use this template when escalating an incident to LangChain Support.**
+
+Copy and fill in each section. Include the diagnostics bundle and any relevant evidence.
+
+---
+
+## Incident Summary
+
+**Start Time:** `YYYY-MM-DD HH:MM:SS UTC`  
+**Detection Time:** `YYYY-MM-DD HH:MM:SS UTC`  
+**Current Status:** `[Investigating / Escalating / Resolved]`
+
+**Brief Description:**
+```
+[One-sentence summary of the issue]
+```
+
+---
+
+## Symptoms
+
+**Who is impacted:**
+- [ ] All users
+- [ ] Specific user(s) or workspace(s)
+- [ ] Specific endpoints or features
+- [ ] Internal operations only
+
+**What's broken:**
+- [ ] UI is unreachable or returns errors
+- [ ] API endpoints return 5xx errors
+- [ ] Traces are missing or delayed
+- [ ] Authentication/authorization failures
+- [ ] Ingestion is slow or failing
+- [ ] Other: `[describe]`
+
+**Error messages observed:**
+```
+[Paste relevant error messages, redacting any secrets]
+```
+
+**User-facing impact:**
+```
+[Describe what users experience]
+```
+
+---
+
+## Recent Changes
+
+**Deployments/Releases:**
+- [ ] Helm upgrade/chart change: `[version/date]`
+- [ ] Configuration change: `[what changed]`
+- [ ] Infrastructure change: `[what changed]`
+- [ ] No recent changes
+
+**Timeline:**
+```
+[Chronological list of changes leading up to the incident]
+```
+
+---
+
+## Environment Details
+
+**Cloud Provider:** `[AWS / Azure / GCP / Other]`  
+**Region/Location:** `[region]`  
+**Kubernetes Service:** `[EKS / AKS / GKE / Other]`  
+**Cluster Name:** `[cluster-name]`  
+**Namespace:** `[namespace]`
+
+**LangSmith Version:**
+- Helm Chart Version: `[version]`
+- Image Tags: `[if known]`
+- Deployment Method: `[Helm / kubectl / Other]`
+
+**Infrastructure:**
+- PostgreSQL: `[RDS / Azure Database / In-cluster / Other]`
+- Redis: `[ElastiCache / Azure Cache / In-cluster / Other]`
+- ClickHouse: `[Managed / In-cluster]`
+- Blob Storage: `[S3 / Azure Blob / GCS / Other]`
+
+---
+
+## Diagnostics Bundle
+
+**Bundle Location:** `[path or URL to diagnostics bundle]`
+
+**Bundle Contents:**
+- [ ] Canonical diagnostics script output (`get_k8s_debugging_info.sh`)
+- [ ] `kubectl get all -o yaml` snapshot
+- [ ] Recent events (`kubectl get events`)
+- [ ] Pod logs (API, workers, ClickHouse)
+- [ ] Resource usage snapshot (`kubectl top pods/nodes`)
+- [ ] Ingress/load balancer configuration
+- [ ] Helm values (redacted)
+
+**Bundle Timestamp:** `YYYY-MM-DD HH:MM:SS UTC`
+
+---
+
+## What We've Tried
+
+**Investigation Steps:**
+1. `[What you checked and what you found]`
+2. `[Next step and result]`
+3. `[Continue as needed]`
+
+**Remediation Attempts:**
+- [ ] Restarted pods: `[which pods, result]`
+- [ ] Checked external service connectivity: `[result]`
+- [ ] Verified configuration: `[result]`
+- [ ] Other: `[describe]`
+
+**Current Hypothesis:**
+```
+[Your best guess at the root cause, with evidence]
+```
+
+---
+
+## Evidence & Logs
+
+**Key Log Excerpts (redact secrets):**
+```
+[Paste relevant log lines with timestamps]
+```
+
+**Error Patterns:**
+```
+[Describe patterns you've observed]
+```
+
+**Metrics/Signals:**
+```
+[Any metrics or signals that indicate the issue]
+```
+
+---
+
+## Questions for Support
+
+1. `[Your question]`
+2. `[Another question]`
+3. `[Continue as needed]`
+
+---
+
+## Additional Context
+
+**Related Issues:**
+- Previous similar incidents: `[reference]`
+- Known limitations: `[describe]`
+- Custom configurations: `[describe, redact secrets]`
+
+**Priority:**
+- [ ] Critical (service down, all users impacted)
+- [ ] High (major feature broken, many users impacted)
+- [ ] Medium (degraded performance, some users impacted)
+- [ ] Low (minor issue, workaround available)
+
+---
+
+## Next Steps
+
+**What we need from Support:**
+- [ ] Root cause analysis
+- [ ] Remediation steps
+- [ ] Configuration guidance
+- [ ] Performance optimization
+- [ ] Other: `[describe]`
+
+**Our availability:**
+- Timezone: `[timezone]`
+- Best time to contact: `[time range]`
+- Escalation contact: `[name/email]`
+
+---
+
+**Template Version:** 1.0  
+**Last Updated:** `[date]`
+
+**Note:** Always redact secrets, API keys, passwords, and connection strings before sharing. Use `[REDACTED]` or similar markers.
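+
+**Redaction sketch:** As a starting point for the note above, the following Python helper masks common secret shapes in log text before it is pasted into this template. The patterns and the `redact` function name are illustrative assumptions, not part of any LangSmith tooling, and they will not catch every secret format; always review the result manually before sharing.
+
+```python
+import re
+
+# Hypothetical patterns; extend them for the secret formats in your environment.
+_PATTERNS = [
+    # key=value or key: value pairs whose key name looks sensitive
+    (re.compile(r"(?i)([\w-]*(?:password|passwd|secret|token|api[_-]?key)[\w-]*)(\s*[=:]\s*)\S+"),
+     r"\1\2[REDACTED]"),
+    # passwords embedded in connection strings: scheme://user:password@host
+    (re.compile(r"(\w+://[^:/\s]+:)[^@\s]+(@)"), r"\1[REDACTED]\2"),
+]
+
+
+def redact(text: str) -> str:
+    """Replace values that match the patterns above with [REDACTED]."""
+    for pattern, replacement in _PATTERNS:
+        text = pattern.sub(replacement, text)
+    return text
+
+
+print(redact("OIDC_CLIENT_SECRET=abc123"))
+print(redact("postgres://langsmith:s3cr3t@db.example.com:5432/langsmith"))
+```
+
+The key-name pattern deliberately errs on the side of over-redaction: any key containing `token` or `secret` is masked, even if its value is harmless.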
+
@@ -35,3 +35,40 @@ VALUES_FILE="./helm/langsmith-values/values.aws-demo.yaml"
 ARTIFACTS_DIR="./artifacts"
 LOG_LEVEL="info"   # info|debug
 DRY_RUN="true"     # true by default; notebooks should flip this explicitly when applying
+
+# ===== OIDC SSO Configuration (Module 2) =====
+# Required: Get these values from your IdP team
+
+# LangSmith domain (must match your ingress domain)
+LANGSMITH_DOMAIN="langsmith.example.com"
+
+# OIDC Configuration (required)
+OIDC_ISSUER="https://your-org.okta.com/oauth2/default"  # IdP issuer URL
+OIDC_CLIENT_ID="your-client-id"                         # OAuth2 client ID (public)
+OIDC_CLIENT_SECRET="your-client-secret"                 # OAuth2 client secret (store in K8s secret, never commit)
+OIDC_REDIRECT_URI="https://langsmith.example.com/auth/callback"  # Must match EXACTLY in IdP whitelist
+
+# OIDC Scopes (optional, defaults shown)
+OIDC_SCOPES="openid,email,profile,groups"  # Include 'groups' for group-based role mapping
+
+# Claim Mappings (optional, defaults shown)
+OIDC_EMAIL_CLAIM="email"    # Claim name for user email (required)
+OIDC_NAME_CLAIM="name"      # Claim name for user display name (optional)
+OIDC_GROUPS_CLAIM="groups"  # Claim name for group membership (optional, for role mapping)
+
+# ===== SAML SSO Configuration (Module 2 - Alternative) =====
+# Use SAML if your IdP doesn't support OIDC or enterprise policy requires SAML
+
+# SAML_METADATA_URL="https://your-idp.com/saml/metadata"  # Preferred: metadata URL
+# SAML_METADATA_FILE="/path/to/metadata.xml"              # Alternative: metadata file path
+# SAML_ENTITY_ID="https://langsmith.example.com"          # Optional: entity ID
+# SAML_EMAIL_ATTRIBUTE="email"                            # Optional: email attribute name
+# SAML_NAME_ATTRIBUTE="name"                              # Optional: name attribute name
+# SAML_GROUPS_ATTRIBUTE="groups"                           # Optional: groups attribute name
+
+# ===== Notes =====
+# 1. OIDC_CLIENT_SECRET should be stored in a Kubernetes secret, not in this file
+# 2. Redirect URI must match EXACTLY (case, trailing slashes, protocol)
+# 3. IdP team must whitelist the redirect URI
+# 4. For production, use HTTPS for all URLs
+# 5. See docs/modules/module-2.md for complete configuration guide
@@ -15,7 +15,7 @@ AWS_REGION="us-east-1"
 AWS_ACCOUNT_ID=""

 # Naming (used by notebooks for display + validation)
-CLUSTER_NAME="langsmith-workshop"
+#CLUSTER_NAME=""

 # Local repo paths (absolute is safest)
 TERRAFORM_REPO_DIR="$HOME/src/langchain-ai/terraform"
@@ -1,589 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Module 1: Preflight Checks\n",
-    "\n",
-    "## Overview\n",
-    "\n",
-    "This notebook validates your environment before deploying LangSmith. Most self-hosted failures occur **before** users ever touch the product due to:\n",
-    "\n",
-    "- Mis-sized clusters\n",
-    "- Unsupported ingress setups\n",
-    "- In-cluster databases used past their limits\n",
-    "- Missing storage primitives (blob, PVs)\n",
-    "\n",
-    "This preflight ensures you start from a **supported baseline**.\n",
-    "\n",
-    "## What We'll Check\n",
-    "\n",
-    "1. ✅ Tooling validation (cloud CLI, terraform, kubectl, helm, jq)\n",
-    "2. ✅ Cloud provider credentials & region sanity check\n",
-    "3. ✅ Cluster capacity expectations\n",
-    "4. ✅ Storage prerequisites (CSI drivers, StorageClasses)\n",
-    "5. ✅ Blob storage requirement (cloud object storage)\n",
-    "\n",
-    "**Estimated time:** 20-30 minutes\n",
-    "\n",
-    "**Supported Cloud Providers:** AWS, Azure (GCP coming soon)\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Bootstrap environment\n",
-    "import sys\n",
-    "from pathlib import Path\n",
-    "\n",
-    "# Add notebooks directory to path so we can import shared as a package\n",
-    "# Find the notebooks directory by looking for the shared folder\n",
-    "possible_paths = [\n",
-    "    Path.cwd().parent,  # If cwd is module-1, go up one level to notebooks\n",
-    "    Path.cwd(),  # If cwd is already notebooks\n",
-    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
-    "]\n",
-    "\n",
-    "notebooks_path = None\n",
-    "for path in possible_paths:\n",
-    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
-    "        notebooks_path = path\n",
-    "        break\n",
-    "\n",
-    "if not notebooks_path:\n",
-    "    # Fallback: try workspace root\n",
-    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
-    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
-    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
-    "\n",
-    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
-    "if str(notebooks_path) not in sys.path:\n",
-    "    sys.path.insert(0, str(notebooks_path))\n",
-    "\n",
-    "from shared._bootstrap import bootstrap\n",
-    "\n",
-    "# Run bootstrap: loads env, checks tools, validates AWS, creates artifacts dir\n",
-    "bootstrap_info = bootstrap()\n",
-    "print(f\"\\nBootstrap complete! Artifacts directory: {bootstrap_info['artifacts_dir']}\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Cloud Provider Account & Region Validation\n",
-    "\n",
-    "Verify you're using the correct cloud provider account/subscription and region. This is critical for avoiding accidental deployments to production or wrong regions.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import os\n",
-    "import json\n",
-    "from shared._cloud_helpers import (\n",
-    "    get_cloud_provider,\n",
-    "    get_region,\n",
-    "    get_identity,\n",
-    "    assert_account,\n",
-    ")\n",
-    "from shared._validation import require_env, print_config, ok, warn\n",
-    "\n",
-    "# Get cloud configuration\n",
-    "provider = get_cloud_provider()\n",
-    "region = get_region()\n",
-    "identity = get_identity()\n",
-    "\n",
-    "provider_display = provider.upper()\n",
-    "print(f\"### Current {provider_display} Session\")\n",
-    "print(f\"Cloud Provider: {provider_display}\")\n",
-    "print(f\"Region: {region}\")\n",
-    "\n",
-    "if provider == \"aws\":\n",
-    "    print(f\"Account ID: {identity['Account']}\")\n",
-    "    print(f\"User ARN: {identity['Arn']}\")\n",
-    "    account_var = \"AWS_ACCOUNT_ID\"\n",
-    "elif provider == \"azure\":\n",
-    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"\")\n",
-    "    subscription_name = identity.get(\"SubscriptionName\", \"\")\n",
-    "    print(f\"Subscription ID: {subscription_id}\")\n",
-    "    print(f\"Subscription Name: {subscription_name}\")\n",
-    "    account_var = \"AZURE_SUBSCRIPTION_ID\"\n",
-    "else:\n",
-    "    account_var = None\n",
-    "\n",
-    "# Optional: Validate against expected account/subscription\n",
-    "if account_var:\n",
-    "    expected_account = os.environ.get(account_var, \"\").strip()\n",
-    "    if expected_account:\n",
-    "        assert_account(expected_account)\n",
-    "    else:\n",
-    "        warn(f\"{account_var} not set in environment - skipping account validation\")\n",
-    "        print(f\"💡 Tip: Set {account_var} in your .env file to add a guardrail against wrong account deployments\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Required Environment Variables\n",
-    "\n",
-    "Verify that all required configuration is present. These values will be used throughout the deployment.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Check required environment variables\n",
-    "from shared._cloud_helpers import get_cloud_provider\n",
-    "\n",
-    "provider = get_cloud_provider()\n",
-    "\n",
-    "# Base required vars (cloud-agnostic)\n",
-    "required_vars = [\n",
-    "    \"WORKSHOP_NAME\",\n",
-    "    \"NAMESPACE\",\n",
-    "    \"CLUSTER_NAME\",\n",
-    "    \"TERRAFORM_DIR\",\n",
-    "    \"HELM_RELEASE\",\n",
-    "    \"HELM_NAMESPACE\",\n",
-    "    \"HELM_CHART_REF\",\n",
-    "]\n",
-    "\n",
-    "# Add cloud-specific required vars\n",
-    "if provider == \"aws\":\n",
-    "    required_vars.append(\"AWS_REGION\")\n",
-    "elif provider == \"azure\":\n",
-    "    required_vars.append(\"AZURE_LOCATION\")\n",
-    "\n",
-    "config = require_env(*required_vars)\n",
-    "\n",
-    "# Optional but recommended (cloud-specific)\n",
-    "optional_vars = {}\n",
-    "if provider == \"aws\":\n",
-    "    optional_vars = {\n",
-    "        \"AWS_PROFILE\": os.environ.get(\"AWS_PROFILE\", \"\"),\n",
-    "        \"AWS_ACCOUNT_ID\": os.environ.get(\"AWS_ACCOUNT_ID\", \"\"),\n",
-    "        \"VALUES_FILE\": os.environ.get(\"VALUES_FILE\", \"\"),\n",
-    "    }\n",
-    "elif provider == \"azure\":\n",
-    "    optional_vars = {\n",
-    "        \"AZURE_SUBSCRIPTION_ID\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\", \"\"),\n",
-    "        \"AZURE_RESOURCE_GROUP\": os.environ.get(\"AZURE_RESOURCE_GROUP\", \"\"),\n",
-    "        \"VALUES_FILE\": os.environ.get(\"VALUES_FILE\", \"\"),\n",
-    "    }\n",
-    "\n",
-    "print(\"\\n### Configuration Summary\")\n",
-    "print(f\"Cloud Provider: {provider.upper()}\")\n",
-    "print_config(config, redact_keys={\"AWS_PROFILE\"})\n",
-    "print(\"\\n### Optional Configuration\")\n",
-    "for k, v in optional_vars.items():\n",
-    "    if v:\n",
-    "        print(f\"- {k}: {v}\")\n",
-    "    else:\n",
-    "        print(f\"- {k}: (not set)\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Cluster Capacity Expectations\n",
-    "\n",
-    "LangSmith requires adequate cluster resources. Before deploying, understand what you'll need:\n",
-    "\n",
-    "- **Minimum:** 3 nodes, 4 vCPU, 16GB RAM each (for development/testing)\n",
-    "- **Recommended:** 3 nodes, 8 vCPU, 32GB RAM each (for production workloads)\n",
-    "- **Storage:** EBS CSI driver required for ClickHouse PVCs\n",
-    "\n",
-    "Let's check if a cluster already exists and validate its configuration.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from shared._aws_helpers import eks_cluster_exists\n",
-    "from shared._shell import run\n",
-    "\n",
-    "cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
-    "region = aws_region()\n",
-    "\n",
-    "print(f\"### Checking EKS Cluster: {cluster_name}\")\n",
-    "print(f\"Region: {region}\\n\")\n",
-    "\n",
-    "if eks_cluster_exists(cluster_name):\n",
-    "    ok(f\"Cluster '{cluster_name}' exists\")\n",
-    "    \n",
-    "    # Get cluster details\n",
-    "    result = run(\n",
-    "        [\"aws\", \"eks\", \"describe-cluster\", \"--name\", cluster_name, \"--region\", region, \"--output\", \"json\"],\n",
-    "        check=True,\n",
-    "        stream=False\n",
-    "    )\n",
-    "    cluster_info = json.loads(result.stdout)[\"cluster\"]\n",
-    "    \n",
-    "    print(f\"\\nCluster Status: {cluster_info['status']}\")\n",
-    "    print(f\"Kubernetes Version: {cluster_info['version']}\")\n",
-    "    print(f\"Platform Version: {cluster_info.get('platformVersion', 'N/A')}\")\n",
-    "    \n",
-    "    # Check node groups\n",
-    "    print(\"\\n### Node Groups\")\n",
-    "    ng_result = run(\n",
-    "        [\"aws\", \"eks\", \"list-nodegroups\", \"--cluster-name\", cluster_name, \"--region\", region, \"--output\", \"json\"],\n",
-    "        check=True,\n",
-    "        stream=False\n",
-    "    )\n",
-    "    nodegroups = json.loads(ng_result.stdout).get(\"nodegroups\", [])\n",
-    "    \n",
-    "    if nodegroups:\n",
-    "        for ng in nodegroups:\n",
-    "            ng_detail = run(\n",
-    "                [\"aws\", \"eks\", \"describe-nodegroup\", \"--cluster-name\", cluster_name, \n",
-    "                 \"--nodegroup-name\", ng, \"--region\", region, \"--output\", \"json\"],\n",
-    "                check=True,\n",
-    "                stream=False\n",
-    "            )\n",
-    "            ng_info = json.loads(ng_detail.stdout)[\"nodegroup\"]\n",
-    "            scaling = ng_info.get(\"scalingConfig\", {})\n",
-    "            print(f\"\\n  Node Group: {ng}\")\n",
-    "            print(f\"    Status: {ng_info['status']}\")\n",
-    "            print(f\"    Desired: {scaling.get('desiredSize', 'N/A')}\")\n",
-    "            print(f\"    Min: {scaling.get('minSize', 'N/A')}\")\n",
-    "            print(f\"    Max: {scaling.get('maxSize', 'N/A')}\")\n",
-    "            print(f\"    Instance Types: {', '.join(ng_info.get('instanceTypes', []))}\")\n",
-    "    else:\n",
-    "        warn(\"No node groups found\")\n",
-    "        print(\"💡 You'll need to create node groups when deploying with Terraform\")\n",
-    "else:\n",
-    "    warn(f\"Cluster '{cluster_name}' does not exist yet\")\n",
-    "    print(\"💡 This is expected if you haven't run Terraform yet. Proceed to notebook 02_terraform_apply.ipynb\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Storage Prerequisites\n",
-    "\n",
-    "LangSmith requires persistent storage for ClickHouse. The cloud storage CSI driver must be installed and StorageClasses must be configured.\n",
-    "\n",
-    "**Why this matters:** Without the appropriate CSI driver, ClickHouse PVCs will remain in `Pending` state forever.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Check if kubectl is configured for the cluster\n",
-    "from shared._cloud_helpers import (\n",
-    "    get_cloud_provider,\n",
-    "    get_region,\n",
-    "    configure_kubectl,\n",
-    "    get_storage_driver_name,\n",
-    ")\n",
-    "\n",
-    "provider = get_cloud_provider()\n",
-    "cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
-    "region = get_region()\n",
-    "storage_driver = get_storage_driver_name()\n",
-    "\n",
-    "k8s_service = \"EKS\" if provider == \"aws\" else \"AKS\" if provider == \"azure\" else \"Kubernetes\"\n",
-    "print(f\"### Configuring kubectl for {k8s_service} cluster\")\n",
-    "try:\n",
-    "    # Configure kubectl (cloud-agnostic)\n",
-    "    configure_kubectl(cluster_name, region)\n",
-    "    ok(\"kubectl configured for cluster\")\n",
-    "    \n",
-    "    # Check CSI driver (cloud-specific labels)\n",
-    "    print(f\"\\n### Checking {storage_driver} Driver\")\n",
-    "    \n",
-    "    if provider == \"aws\":\n",
-    "        driver_label = \"app=ebs-csi-controller\"\n",
-    "        driver_name = \"EBS CSI\"\n",
-    "    elif provider == \"azure\":\n",
-    "        driver_label = \"app=csi-azuredisk-controller\"\n",
-    "        driver_name = \"Azure Disk CSI\"\n",
-    "    else:\n",
-    "        driver_label = None\n",
-    "        driver_name = \"Storage CSI\"\n",
-    "    \n",
-    "    if driver_label:\n",
-    "        result = run(\n",
-    "            [\"kubectl\", \"get\", \"daemonset\", \"-n\", \"kube-system\", \"-l\", driver_label, \"-o\", \"json\"],\n",
-    "            check=False,\n",
-    "            stream=False\n",
-    "        )\n",
-    "        \n",
-    "        if result.returncode == 0 and result.stdout.strip():\n",
-    "            ds_info = json.loads(result.stdout)\n",
-    "            if ds_info.get(\"items\"):\n",
-    "                ok(f\"{driver_name} driver is installed\")\n",
-    "                print(f\"  DaemonSet: {ds_info['items'][0]['metadata']['name']}\")\n",
-    "            else:\n",
-    "                warn(f\"{driver_name} driver not found\")\n",
-    "                print(f\"💡 {driver_name} driver must be installed before deploying LangSmith\")\n",
-    "                print(\"   The Terraform module should handle this, but verify after deployment\")\n",
-    "        else:\n",
-    "            warn(f\"{driver_name} driver not found\")\n",
-    "            print(f\"💡 {driver_name} driver must be installed before deploying LangSmith\")\n",
-    "    \n",
-    "    # Check StorageClasses\n",
-    "    print(\"\\n### Checking StorageClasses\")\n",
-    "    result = run(\n",
-    "        [\"kubectl\", \"get\", \"storageclass\", \"-o\", \"json\"],\n",
-    "        check=True,\n",
-    "        stream=False\n",
-    "    )\n",
-    "    sc_list = json.loads(result.stdout)\n",
-    "    \n",
-    "    # Find cloud-specific storage classes\n",
-    "    if provider == \"aws\":\n",
-    "        storage_scs = [sc for sc in sc_list.get(\"items\", []) if \"ebs\" in sc[\"metadata\"][\"name\"].lower() or \n",
-    "                       sc.get(\"provisioner\", \"\").endswith(\"ebs.csi.aws.com\")]\n",
-    "    elif provider == \"azure\":\n",
-    "        storage_scs = [sc for sc in sc_list.get(\"items\", []) if \"disk\" in sc[\"metadata\"][\"name\"].lower() or \n",
-    "                       sc.get(\"provisioner\", \"\").endswith(\"disk.csi.azure.com\")]\n",
-    "    else:\n",
-    "        storage_scs = []\n",
-    "    \n",
-    "    if storage_scs:\n",
-    "        ok(f\"Found {len(storage_scs)} {storage_driver} StorageClass(es):\")\n",
-    "        for sc in storage_scs:\n",
-    "            name = sc[\"metadata\"][\"name\"]\n",
-    "            default = sc.get(\"metadata\", {}).get(\"annotations\", {}).get(\"storageclass.kubernetes.io/is-default-class\", \"false\")\n",
-    "            print(f\"  - {name} (default: {default})\")\n",
-    "    else:\n",
-    "        warn(f\"No {storage_driver} StorageClasses found\")\n",
-    "        print(f\"💡 At least one {storage_driver} StorageClass is required for ClickHouse PVCs\")\n",
-    "        \n",
-    "except Exception as e:\n",
-    "    warn(f\"Could not check storage prerequisites: {e}\")\n",
-    "    print(\"💡 This is expected if the cluster doesn't exist yet\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Blob Storage Requirement\n",
-    "\n",
-    "**Critical:** LangSmith requires cloud object storage (S3, Blob Storage, etc.) for blob storage in production. Inline trace payloads will explode ClickHouse if blob storage is not configured.\n",
-    "\n",
-    "Let's verify access to your cloud provider's object storage service and check if a storage account/bucket exists or needs to be created.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from shared._cloud_helpers import (\n",
-    "    get_cloud_provider,\n",
-    "    get_region,\n",
-    "    get_blob_storage_service_name,\n",
-    "    verify_blob_storage_access,\n",
-    ")\n",
-    "from shared._shell import run\n",
-    "import json\n",
-    "\n",
-    "provider = get_cloud_provider()\n",
-    "region = get_region()\n",
-    "blob_service = get_blob_storage_service_name()\n",
-    "\n",
-    "print(f\"### {blob_service} Access Check\")\n",
-    "print(f\"Cloud Provider: {provider.upper()}\")\n",
-    "print(f\"Region: {region}\\n\")\n",
-    "\n",
-    "# Test blob storage access\n",
-    "try:\n",
-    "    if provider == \"aws\":\n",
-    "        result = run(\n",
-    "            [\"aws\", \"s3\", \"ls\", \"--region\", region],\n",
-    "            check=True,\n",
-    "            stream=False\n",
-    "        )\n",
-    "        ok(f\"{blob_service} access verified\")\n",
-    "        \n",
-    "        # List buckets\n",
-    "        buckets_result = run(\n",
-    "            [\"aws\", \"s3api\", \"list-buckets\", \"--output\", \"json\"],\n",
-    "            check=True,\n",
-    "            stream=False\n",
-    "        )\n",
-    "        buckets = json.loads(buckets_result.stdout).get(\"Buckets\", [])\n",
-    "        \n",
-    "        print(f\"\\nFound {len(buckets)} S3 bucket(s):\")\n",
-    "        for bucket in buckets[:10]:  # Show first 10\n",
-    "            print(f\"  - {bucket['Name']} (created: {bucket['CreationDate']})\")\n",
-    "        \n",
-    "        if len(buckets) > 10:\n",
-    "            print(f\"  ... and {len(buckets) - 10} more\")\n",
-    "        \n",
-    "    elif provider == \"azure\":\n",
-    "        result = run(\n",
-    "            [\"az\", \"storage\", \"account\", \"list\", \"--output\", \"json\"],\n",
-    "            check=True,\n",
-    "            stream=False\n",
-    "        )\n",
-    "        ok(f\"{blob_service} access verified\")\n",
-    "        \n",
-    "        # List storage accounts\n",
-    "        accounts = json.loads(result.stdout)\n",
-    "        \n",
-    "        print(f\"\\nFound {len(accounts)} Storage Account(s):\")\n",
-    "        for account in accounts[:10]:  # Show first 10\n",
-    "            name = account.get(\"name\", \"N/A\")\n",
-    "            location = account.get(\"location\", \"N/A\")\n",
-    "            print(f\"  - {name} (location: {location})\")\n",
-    "        \n",
-    "        if len(accounts) > 10:\n",
-    "            print(f\"  ... and {len(accounts) - 10} more\")\n",
-    "    \n",
-    "    print(f\"\\n💡 Note: The Terraform module should create a {blob_service} resource for LangSmith blob storage\")\n",
-    "    print(\"   Verify the resource exists after Terraform deployment\")\n",
-    "    \n",
-    "except Exception as e:\n",
-    "    warn(f\"{blob_service} access check failed: {e}\")\n",
-    "    if provider == \"aws\":\n",
-    "        print(\"💡 Ensure your AWS credentials have S3 permissions\")\n",
-    "    elif provider == \"azure\":\n",
-    "        print(\"💡 Ensure your Azure credentials have Storage Account permissions\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Terraform & Helm Repository Paths\n",
-    "\n",
-    "Verify that the Terraform and Helm repository paths are correctly configured and accessible.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import re\n",
-    "from pathlib import Path\n",
-    "from shared._validation import ok, warn\n",
-    "\n",
-    "def expand_env_vars(path_str: str) -> str:\n",
-    "    \"\"\"Expand environment variable references in a path string.\"\"\"\n",
-    "    # Expand $VAR and ${VAR} references\n",
-    "    def replace_var(match):\n",
-    "        var_name = match.group(1) or match.group(2)\n",
-    "        return os.environ.get(var_name, match.group(0))\n",
-    "    \n",
-    "    # Replace $VAR and ${VAR} patterns\n",
-    "    path_str = re.sub(r'\\$\\{([^}]+)\\}|\\$([a-zA-Z_][a-zA-Z0-9_]*)', replace_var, path_str)\n",
-    "    return path_str\n",
-    "\n",
-    "# Expand environment variables in paths (e.g., $TERRAFORM_REPO_DIR, $HELM_REPO_DIR, $HOME)\n",
-    "terraform_dir_str = expand_env_vars(os.environ[\"TERRAFORM_DIR\"])\n",
-    "terraform_dir = Path(terraform_dir_str).expanduser().resolve()\n",
-    "\n",
-    "helm_chart_ref_str = expand_env_vars(os.environ[\"HELM_CHART_REF\"])\n",
-    "helm_chart_ref = Path(helm_chart_ref_str).expanduser().resolve()\n",
-    "\n",
-    "print(\"### Repository Paths Check\\n\")\n",
-    "\n",
-    "# Check Terraform directory\n",
-    "print(f\"Terraform Directory: {terraform_dir}\")\n",
-    "if terraform_dir.exists():\n",
-    "    ok(f\"Terraform directory exists\")\n",
-    "    \n",
-    "    # Check for main.tf or similar\n",
-    "    tf_files = list(terraform_dir.glob(\"*.tf\"))\n",
-    "    if tf_files:\n",
-    "        print(f\"  Found {len(tf_files)} Terraform file(s)\")\n",
-    "    else:\n",
-    "        warn(\"No .tf files found in Terraform directory\")\n",
-    "        print(\"💡 Ensure you're pointing to the correct Terraform module path\")\n",
-    "else:\n",
-    "    warn(f\"Terraform directory does not exist: {terraform_dir}\")\n",
-    "    print(\"💡 Update TERRAFORM_DIR in your .env file to point to the langchain-ai/terraform repo\")\n",
-    "\n",
-    "# Check Helm chart\n",
-    "print(f\"\\nHelm Chart Reference: {helm_chart_ref}\")\n",
-    "if helm_chart_ref.exists():\n",
-    "    ok(f\"Helm chart path exists\")\n",
-    "    \n",
-    "    # Check for Chart.yaml\n",
-    "    chart_yaml = helm_chart_ref / \"Chart.yaml\"\n",
-    "    if chart_yaml.exists():\n",
-    "        print(f\"  Found Chart.yaml\")\n",
-    "    else:\n",
-    "        warn(\"Chart.yaml not found\")\n",
-    "        print(\"💡 Ensure you're pointing to the correct Helm chart path\")\n",
-    "else:\n",
-    "    warn(f\"Helm chart path does not exist: {helm_chart_ref}\")\n",
-    "    print(\"💡 Update HELM_CHART_REF in your .env file to point to the langchain-ai/helm chart\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Preflight Summary\n",
-    "\n",
-    "Review the checklist below. All items should be ✅ before proceeding to Terraform deployment.\n",
-    "\n",
-    "### ✅ Checklist\n",
-    "\n",
-    "- [ ] All required tools installed (cloud CLI, terraform, kubectl, helm, jq)\n",
-    "- [ ] Cloud provider credentials valid and correct account/subscription/region\n",
-    "- [ ] Required environment variables set\n",
-    "- [ ] Terraform directory path correct\n",
-    "- [ ] Helm chart path correct\n",
-    "- [ ] Blob storage access verified (S3/Blob Storage)\n",
-    "- [ ] (If cluster exists) Storage CSI driver installed\n",
-    "- [ ] (If cluster exists) StorageClasses configured\n",
-    "\n",
-    "### Next Steps\n",
-    "\n",
-    "If all checks pass, proceed to **02_terraform_apply.ipynb** to deploy the infrastructure.\n",
-    "\n",
-    "If any checks failed, review the warnings above and fix the issues before continuing.\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.14.2"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
@@ -129,6 +129,77 @@
    "        print(f\"💡 Tip: Set {account_var} in your .env file to add a guardrail against wrong account deployments\")\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Workshop Identifier Setup\n",
+    "\n",
+    "To ensure unique resource names and enable idempotent deployments, we need a unique identifier for your workshop deployment. This identifier will be used for all Terraform resources.\n",
+    "\n",
+    "**We'll use your email address** (hashed for privacy) to create a deterministic identifier that:\n",
+    "- ✅ Stays the same across notebook runs on the same day with the same email (idempotent)\n",
+    "- ✅ Is unique per student\n",
+    "- ✅ Combines with a date component (`YYYYMMDD`) for resource naming\n",
+    "\n",
+    "Enter your email address below. It will be hashed and used to generate your unique workshop identifier.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generate deterministic workshop identifier from email\n",
+    "import hashlib\n",
+    "import json\n",
+    "from datetime import date\n",
+    "from pathlib import Path\n",
+    "\n",
+    "print(\"### Workshop Identifier Setup\\n\")\n",
+    "print(\"Enter your email address to generate a unique, deterministic identifier for your deployment.\\n\")\n",
+    "print(\"This identifier will be used for all Terraform resources and ensures:\")\n",
+    "print(\"  - Same email on the same date = same identifier (idempotent)\")\n",
+    "print(\"  - Different emails = different identifiers (unique)\")\n",
+    "print(\"  - No additional environment variables needed\\n\")\n",
+    "\n",
+    "# Prompt for email (using input() - works in Jupyter)\n",
+    "email = input(\"Enter your email address: \").strip().lower()\n",
+    "\n",
+    "if not email or \"@\" not in email:\n",
+    "    raise ValueError(\"Invalid email address. Please enter a valid email.\")\n",
+    "\n",
+    "# Hash email for privacy and determinism\n",
+    "email_hash = hashlib.md5(email.encode()).hexdigest()[:6]\n",
+    "\n",
+    "# Build identifier: -workshop-YYYYMMDD-<hash>\n",
+    "today = date.today()\n",
+    "date_str = today.strftime('%Y%m%d')\n",
+    "workshop_identifier = f\"-workshop-{date_str}-{email_hash}\"\n",
+    "\n",
+    "# Save to artifacts directory for use in Terraform notebook\n",
+    "identifier_file = Path(bootstrap_info['artifacts_dir']) / \"workshop_identifier.json\"\n",
+    "identifier_data = {\n",
+    "    \"email_hash\": email_hash,\n",
+    "    \"identifier\": workshop_identifier,\n",
+    "    \"date\": date_str,\n",
+    "    \"created_at\": today.isoformat()\n",
+    "}\n",
+    "\n",
+    "with open(identifier_file, 'w') as f:\n",
+    "    json.dump(identifier_data, f, indent=2)\n",
+    "\n",
+    "print(f\"\\n✅ Workshop identifier generated:\")\n",
+    "print(f\"   Identifier: {workshop_identifier}\")\n",
+    "print(f\"   Date component: {date_str}\")\n",
+    "print(f\"   Hash (from email): {email_hash}\")\n",
+    "print(f\"\\n💡 This identifier will be used for all Terraform resources\")\n",
+    "print(f\"   Saved to: {identifier_file}\")\n",
+    "print(\"\\n⚠️  IMPORTANT: The identifier combines today's date with your email hash.\")\n",
+    "print(\"   Re-run with the same email on the same day so Terraform can manage existing resources.\")\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -153,7 +224,6 @@
    "required_vars = [\n",
    "    \"WORKSHOP_NAME\",\n",
    "    \"NAMESPACE\",\n",
-    "    \"CLUSTER_NAME\",\n",
    "    \"TERRAFORM_DIR\",\n",
    "    \"HELM_RELEASE\",\n",
    "    \"HELM_NAMESPACE\",\n",
@@ -222,9 +292,27 @@
    "    get_kubernetes_service_name,\n",
    ")\n",
    "from shared._shell import run\n",
+    "from shared._validation import ok, warn, require_env\n",
+    "from pathlib import Path\n",
+    "import json\n",
+    "\n",
+    "# Load workshop identifier if it exists (from identifier setup cell)\n",
+    "identifier_file = Path(bootstrap_info['artifacts_dir']) / \"workshop_identifier.json\"\n",
+    "if identifier_file.exists():\n",
+    "    with open(identifier_file) as f:\n",
+    "        identifier_data = json.load(f)\n",
+    "    workshop_identifier = identifier_data[\"identifier\"]\n",
+    "    # Compute expected cluster name: langsmith-eks${identifier}\n",
+    "    cluster_name = f\"langsmith-eks{workshop_identifier}\"\n",
+    "    print(f\"💡 Using cluster name from workshop identifier: {cluster_name}\\n\")\n",
+    "else:\n",
+    "    # Fallback to CLUSTER_NAME env var if identifier not set yet\n",
+    "    config = require_env(\"CLUSTER_NAME\")\n",
+    "    cluster_name = config[\"CLUSTER_NAME\"]\n",
+    "    warn(\"Workshop identifier not found - using CLUSTER_NAME from environment\")\n",
+    "    print(\"💡 Run the 'Workshop Identifier Setup' cell above to generate a unique identifier\\n\")\n",
    "\n",
    "provider = get_cloud_provider()\n",
-    "cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
    "region = get_region()\n",
    "k8s_service = get_kubernetes_service_name()\n",
    "\n",
@@ -97,6 +97,7 @@
   "source": [
    "import os\n",
    "import re\n",
+    "import json\n",
    "from pathlib import Path\n",
    "from shared._validation import require_env, ok, warn, fail\n",
    "from shared._shell import run\n",
@@ -131,15 +132,41 @@
    "terraform_dir_str = expand_env_vars(config[\"TERRAFORM_DIR\"])\n",
    "terraform_dir = Path(terraform_dir_str).expanduser().resolve()\n",
    "\n",
-    "cluster_name = config[\"CLUSTER_NAME\"]\n",
+    "# Load workshop identifier from preflight notebook\n",
+    "identifier_file = artifacts_dir / \"workshop_identifier.json\"\n",
+    "if not identifier_file.exists():\n",
+    "    fail(f\"Workshop identifier not found: {identifier_file}\")\n",
+    "    print(\"\\n💡 To fix this:\")\n",
+    "    print(\"   1. Run the preflight notebook (01_preflight.ipynb) first\")\n",
+    "    print(\"   2. Complete the 'Workshop Identifier Setup' cell\")\n",
+    "    print(\"   3. Then return to this notebook\")\n",
+    "    raise RuntimeError(\"Workshop identifier not found. Please run 01_preflight.ipynb first.\")\n",
+    "\n",
+    "with open(identifier_file) as f:\n",
+    "    identifier_data = json.load(f)\n",
+    "\n",
+    "workshop_identifier = identifier_data[\"identifier\"]\n",
+    "print(f\"✅ Loaded workshop identifier: {workshop_identifier}\")\n",
+    "\n",
+    "# Compute expected cluster name for validation/display\n",
+    "# Terraform computes: cluster_name = \"langsmith-eks${local.identifier}\"\n",
+    "cluster_name = f\"langsmith-eks{workshop_identifier}\"\n",
+    "\n",
    "region = config[region_var]\n",
    "workshop_name = config[\"WORKSHOP_NAME\"]\n",
    "\n",
-    "print(\"### Terraform Configuration\")\n",
+    "print(\"\\n### Terraform Configuration\")\n",
    "print(f\"Terraform Directory: {terraform_dir}\")\n",
-    "print(f\"Cluster Name: {cluster_name}\")\n",
+    "print(f\"Workshop Identifier: {workshop_identifier}\")\n",
+    "print(f\"Expected Cluster Name: {cluster_name}\")\n",
    "print(f\"Region: {region}\")\n",
-    "print(f\"Workshop Name: {workshop_name}\\n\")\n",
+    "print(f\"Workshop Name: {workshop_name}\")\n",
+    "print(f\"\\n💡 Terraform will use this identifier for all resource names:\")\n",
+    "print(f\"   Cluster: langsmith-eks{workshop_identifier}\")\n",
+    "print(f\"   Redis: langsmith-redis{workshop_identifier}\")\n",
+    "print(f\"   S3: langsmith-s3{workshop_identifier}\")\n",
+    "print(f\"   Postgres: langsmith-postgres{workshop_identifier}\")\n",
+    "print(f\"   VPC: langsmith-vpc{workshop_identifier}\\n\")\n",
    "\n",
    "if not terraform_dir.exists():\n",
    "    fail(f\"Terraform directory does not exist: {terraform_dir}\")\n",
@@ -371,6 +398,8 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "import getpass\n",
+    "\n",
    "# Create terraform plan\n",
    "plan_file = artifacts_dir / \"terraform-plan.txt\"\n",
    "\n",
@@ -383,7 +412,22 @@
    "postgres_username = os.environ.get(\"POSTGRES_USERNAME\", \"\").strip()\n",
    "postgres_password = os.environ.get(\"POSTGRES_PASSWORD\", \"\").strip()\n",
    "\n",
+    "if not postgres_username:\n",
+    "    print(\"Please provide a PostgreSQL username: \")\n",
+    "    postgres_username = input().strip()\n",
+    "\n",
+    "if not postgres_password:\n",
+    "    print(\"Please provide a PostgreSQL password: \")\n",
+    "    postgres_password = getpass.getpass().strip()\n",
+    "\n",
    "print(\"### Terraform Variables\\n\")\n",
+    "\n",
+    "# Pass workshop identifier to Terraform\n",
+    "# This is the key variable that controls all resource naming\n",
+    "terraform_vars.extend([\"-var\", f\"identifier={workshop_identifier}\"])\n",
+    "print(f\"✅ IDENTIFIER: {workshop_identifier}\")\n",
+    "print(f\"   This will be used for all resource names (cluster, redis, s3, postgres, vpc)\\n\")\n",
+    "\n",
    "missing_vars = []\n",
    "\n",
    "if postgres_username:\n",
@@ -439,6 +483,232 @@
    "    print(\"💡 Review the errors above before proceeding\")\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Pre-Apply Safety Check\n",
+    "\n",
+    "**⚠️ CRITICAL:** Before applying Terraform, verify that resources don't already exist. This prevents accidentally modifying or overwriting existing infrastructure.\n",
+    "\n",
+    "This check will:\n",
+    "- Verify the cluster doesn't already exist (or warn if it does)\n",
+    "- Check for existing RDS/Redis/S3 resources that might conflict\n",
+    "- Summarize any resources found and recommend next steps before you apply\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pre-apply safety check: Verify resources don't already exist\n",
+    "from shared._cloud_helpers import (\n",
+    "    get_cloud_provider,\n",
+    "    get_region,\n",
+    "    cluster_exists,\n",
+    "    get_kubernetes_service_name,\n",
+    "    get_database_service_name,\n",
+    "    get_cache_service_name,\n",
+    "    get_blob_storage_service_name,\n",
+    ")\n",
+    "from shared._validation import ok, warn, fail\n",
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "k8s_service = get_kubernetes_service_name()\n",
+    "\n",
+    "print(\"### Pre-Apply Resource Existence Check\\n\")\n",
+    "print(\"Checking for existing resources that might conflict...\\n\")\n",
+    "\n",
+    "existing_resources = []\n",
+    "warnings = []\n",
+    "\n",
+    "# Check if cluster already exists\n",
+    "if cluster_exists(cluster_name):\n",
+    "    existing_resources.append(f\"{k8s_service} cluster: {cluster_name}\")\n",
+    "    warn(f\"⚠️  Cluster '{cluster_name}' already exists!\")\n",
+    "    print(f\"   If you proceed with Terraform apply, Terraform may attempt to:\")\n",
+    "    print(f\"   - Import the existing cluster into state, OR\")\n",
+    "    print(f\"   - Modify the existing cluster configuration\")\n",
+    "    print(f\"   This could cause unexpected changes to your existing infrastructure.\\n\")\n",
+    "    print(f\"   💡 If this is intentional, ensure your Terraform configuration matches the existing cluster\")\n",
+    "    print(\"   💡 If this is NOT intentional, STOP and re-run the 'Workshop Identifier Setup' cell in 01_preflight.ipynb with a different email address\")\n",
+    "else:\n",
+    "    ok(f\"Cluster '{cluster_name}' does not exist (safe to create)\")\n",
+    "\n",
+    "# Check for existing RDS instances (AWS) or PostgreSQL servers (Azure)\n",
+    "db_service = get_database_service_name()\n",
+    "print(f\"\\n### Checking for Existing {db_service} Resources\\n\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    # Check for RDS instances that might match our naming pattern\n",
+    "    # We'll check for instances in the same region\n",
+    "    try:\n",
+    "        result = run(\n",
+    "            [\"aws\", \"rds\", \"describe-db-instances\", \"--region\", region, \"--output\", \"json\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            rds_instances = json.loads(result.stdout).get(\"DBInstances\", [])\n",
+    "            # Check if any instance name might conflict with the names Terraform will create\n",
+    "            # Terraform appends the workshop identifier to every resource name\n",
+    "            for instance in rds_instances:\n",
+    "                db_id = instance.get(\"DBInstanceIdentifier\", \"\")\n",
+    "                # Match on the workshop identifier (shared by all resources) or workshop_name\n",
+    "                if workshop_identifier.lower() in db_id.lower() or workshop_name.lower() in db_id.lower():\n",
+    "                    existing_resources.append(f\"RDS instance: {db_id}\")\n",
+    "                    warnings.append(f\"Found RDS instance '{db_id}' that may conflict with Terraform resources\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not check for RDS instances: {e}\")\n",
+    "        print(\"   💡 This is OK - proceeding with caution\")\n",
+    "\n",
+    "elif provider == \"azure\":\n",
+    "    # Check for PostgreSQL servers\n",
+    "    try:\n",
+    "        result = run(\n",
+    "            [\"az\", \"postgres\", \"server\", \"list\", \"--output\", \"json\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            postgres_servers = json.loads(result.stdout)\n",
+    "            for server in postgres_servers:\n",
+    "                server_name = server.get(\"name\", \"\")\n",
+    "                server_location = server.get(\"location\", \"\")\n",
+    "                # Check if server is in same location and name might conflict\n",
+    "                if server_location.lower() == region.lower():\n",
+    "                    if workshop_identifier.lower() in server_name.lower() or workshop_name.lower() in server_name.lower():\n",
+    "                        existing_resources.append(f\"PostgreSQL server: {server_name}\")\n",
+    "                        warnings.append(f\"Found PostgreSQL server '{server_name}' that may conflict with Terraform resources\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not check for PostgreSQL servers: {e}\")\n",
+    "        print(\"   💡 This is OK - proceeding with caution\")\n",
+    "\n",
+    "# Check for existing Redis/ElastiCache clusters\n",
+    "cache_service = get_cache_service_name()\n",
+    "print(f\"\\n### Checking for Existing {cache_service} Resources\\n\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    # Check for ElastiCache clusters\n",
+    "    try:\n",
+    "        result = run(\n",
+    "            [\"aws\", \"elasticache\", \"describe-cache-clusters\", \"--region\", region, \"--output\", \"json\", \"--show-cache-node-info\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            cache_clusters = json.loads(result.stdout).get(\"CacheClusters\", [])\n",
+    "            for cluster in cache_clusters:\n",
+    "                cluster_id = cluster.get(\"CacheClusterId\", \"\")\n",
+    "                if workshop_identifier.lower() in cluster_id.lower() or workshop_name.lower() in cluster_id.lower():\n",
+    "                    existing_resources.append(f\"ElastiCache cluster: {cluster_id}\")\n",
+    "                    warnings.append(f\"Found ElastiCache cluster '{cluster_id}' that may conflict with Terraform resources\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not check for ElastiCache clusters: {e}\")\n",
+    "        print(\"   💡 This is OK - proceeding with caution\")\n",
+    "\n",
+    "elif provider == \"azure\":\n",
+    "    # Check for Redis caches\n",
+    "    try:\n",
+    "        result = run(\n",
+    "            [\"az\", \"redis\", \"list\", \"--output\", \"json\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            redis_caches = json.loads(result.stdout)\n",
+    "            for cache in redis_caches:\n",
+    "                cache_name = cache.get(\"name\", \"\")\n",
+    "                cache_location = cache.get(\"location\", \"\")\n",
+    "                if cache_location.lower() == region.lower():\n",
+    "                    if workshop_identifier.lower() in cache_name.lower() or workshop_name.lower() in cache_name.lower():\n",
+    "                        existing_resources.append(f\"Redis cache: {cache_name}\")\n",
+    "                        warnings.append(f\"Found Redis cache '{cache_name}' that may conflict with Terraform resources\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not check for Redis caches: {e}\")\n",
+    "        print(\"   💡 This is OK - proceeding with caution\")\n",
+    "\n",
+    "# Check for existing S3 buckets (AWS) or Storage Accounts (Azure)\n",
+    "blob_service = get_blob_storage_service_name()\n",
+    "print(f\"\\n### Checking for Existing {blob_service} Resources\\n\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    # Check for S3 buckets\n",
+    "    try:\n",
+    "        result = run(\n",
+    "            [\"aws\", \"s3api\", \"list-buckets\", \"--output\", \"json\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            buckets = json.loads(result.stdout).get(\"Buckets\", [])\n",
+    "            # Check if any bucket name might conflict\n",
+    "            for bucket in buckets:\n",
+    "                bucket_name = bucket.get(\"Name\", \"\")\n",
+    "                if workshop_identifier.lower() in bucket_name.lower() or workshop_name.lower() in bucket_name.lower():\n",
+    "                    existing_resources.append(f\"S3 bucket: {bucket_name}\")\n",
+    "                    warnings.append(f\"Found S3 bucket '{bucket_name}' that may conflict with Terraform resources\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not check for S3 buckets: {e}\")\n",
+    "        print(\"   💡 This is OK - proceeding with caution\")\n",
+    "\n",
+    "elif provider == \"azure\":\n",
+    "    # Check for Storage Accounts\n",
+    "    try:\n",
+    "        result = run(\n",
+    "            [\"az\", \"storage\", \"account\", \"list\", \"--output\", \"json\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            storage_accounts = json.loads(result.stdout)\n",
+    "            for account in storage_accounts:\n",
+    "                account_name = account.get(\"name\", \"\")\n",
+    "                account_location = account.get(\"location\", \"\")\n",
+    "                if account_location.lower() == region.lower():\n",
+    "                    if workshop_identifier.lower() in account_name.lower() or workshop_name.lower() in account_name.lower():\n",
+    "                        existing_resources.append(f\"Storage account: {account_name}\")\n",
+    "                        warnings.append(f\"Found Storage account '{account_name}' that may conflict with Terraform resources\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not check for Storage accounts: {e}\")\n",
+    "        print(\"   💡 This is OK - proceeding with caution\")\n",
+    "\n",
+    "# Summary and decision\n",
+    "print(\"\\n\" + \"=\" * 60)\n",
+    "print(\"### Safety Check Summary\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "if existing_resources:\n",
+    "    fail(f\"Found {len(existing_resources)} existing resource(s) that may conflict:\")\n",
+    "    for resource in existing_resources:\n",
+    "        print(f\"  ⚠️  {resource}\")\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\" * 60)\n",
+    "    print(\"⚠️  WARNING: Proceeding with Terraform apply may:\")\n",
+    "    print(\"   - Modify existing infrastructure\")\n",
+    "    print(\"   - Import existing resources into Terraform state\")\n",
+    "    print(\"   - Cause unexpected changes or conflicts\")\n",
+    "    print(\"=\" * 60)\n",
+    "    print(\"\\n💡 Recommendations:\")\n",
+    "    print(\"   1. If these resources are from a previous deployment, that's OK\")\n",
+    "    print(\"   2. If these resources are UNRELATED to this deployment:\")\n",
+    "    print(\"      - Generate a different workshop identifier in 01_preflight.ipynb, or update WORKSHOP_NAME in your .env file\")\n",
+    "    print(f\"      - Use different resource names to avoid conflicts\")\n",
+    "    print(\"   3. Review the Terraform plan carefully before applying\")\n",
+    "    print(\"   4. Consider using Terraform import if you want to manage existing resources\")\n",
+    "    print(\"\\n⚠️  Do not apply until you have reviewed the plan output and understand the risks.\")\n",
+    "    print(\"    Proceed only if you're comfortable with the changes it describes.\")\n",
+    "else:\n",
+    "    ok(\"No conflicting resources found\")\n",
+    "    print(\"✅ Safe to proceed with Terraform apply\")\n",
+    "    print(\"💡 Still review the Terraform plan before applying to ensure it's correct\")\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -449,60 +449,72 @@
    "if not license_key:\n",
    "    raise RuntimeError(\"❌ LANGSMITH_LICENSE_KEY is required\")\n",
    "\n",
+    "# Helper function to create or update secret\n",
+    "def create_or_update_secret(secret_name: str, literals: dict, namespace: str):\n",
+    "    \"\"\"Create a secret if it doesn't exist, or update it if it does.\"\"\"\n",
+    "    # Check if secret exists\n",
+    "    check_result = run(\n",
+    "        [\"kubectl\", \"get\", \"secret\", secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    # Build kubectl command\n",
+    "    cmd = [\"kubectl\", \"create\", \"secret\", \"generic\", secret_name, \"-n\", namespace]\n",
+    "    for key, value in literals.items():\n",
+    "        cmd.extend([\"--from-literal\", f\"{key}={value}\"])\n",
+    "    \n",
+    "    if check_result.returncode == 0:\n",
+    "        # Secret exists - update it using apply\n",
+    "        print(f\"  Secret '{secret_name}' exists, updating...\")\n",
+    "        # Generate YAML using dry-run, then apply it\n",
+    "        create_cmd = cmd + [\"--dry-run=client\", \"-o\", \"yaml\"]\n",
+    "        result = run(create_cmd, check=True, stream=False)\n",
+    "        \n",
+    "        # Apply the YAML\n",
+    "        run(\n",
+    "            [\"kubectl\", \"apply\", \"-f\", \"-\"],\n",
+    "            input=result.stdout,\n",
+    "            check=True,\n",
+    "            stream=True\n",
+    "        )\n",
+    "        return \"updated\"\n",
+    "    else:\n",
+    "        # Secret doesn't exist - create it\n",
+    "        print(f\"  Secret '{secret_name}' does not exist, creating...\")\n",
+    "        run(cmd, check=True, stream=True)\n",
+    "        return \"created\"\n",
+    "\n",
    "# Create license key secret\n",
    "print(\"Creating license key secret...\")\n",
-    "run(\n",
-    "    [\n",
-    "        \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-license\",\n",
-    "        f\"--from-literal=license-key={license_key}\",\n",
-    "        \"-n\", namespace,\n",
-    "        \"--dry-run=client\", \"-o\", \"yaml\"\n",
-    "    ],\n",
-    "    check=True,\n",
-    "    stream=False\n",
+    "action = create_or_update_secret(\n",
+    "    \"langsmith-license\",\n",
+    "    {\"license-key\": license_key},\n",
+    "    namespace\n",
    ")\n",
-    "# Actually create it (remove dry-run)\n",
-    "run(\n",
-    "    [\n",
-    "        \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-license\",\n",
-    "        f\"--from-literal=license-key={license_key}\",\n",
-    "        \"-n\", namespace\n",
-    "    ],\n",
-    "    check=False,  # May already exist\n",
-    "    stream=True\n",
-    ")\n",
-    "ok(\"License key secret created/updated\")\n",
+    "ok(f\"License key secret {action}\")\n",
    "\n",
    "# Create database secret if credentials provided\n",
    "if db_user and db_password:\n",
    "    print(\"\\nCreating database secret...\")\n",
-    "    run(\n",
-    "        [\n",
-    "            \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-db\",\n",
-    "            f\"--from-literal=username={db_user}\",\n",
-    "            f\"--from-literal=password={db_password}\",\n",
-    "            \"-n\", namespace\n",
-    "        ],\n",
-    "        check=False,  # May already exist\n",
-    "        stream=True\n",
+    "    action = create_or_update_secret(\n",
+    "        \"langsmith-db\",\n",
+    "        {\"username\": db_user, \"password\": db_password},\n",
+    "        namespace\n",
    "    )\n",
-    "    ok(\"Database secret created/updated\")\n",
+    "    ok(f\"Database secret {action}\")\n",
    "else:\n",
    "    print(\"💡 Skipping database secret (using IAM auth or not needed)\")\n",
    "\n",
    "# Create Redis secret if password provided\n",
    "if redis_password:\n",
    "    print(\"\\nCreating Redis secret...\")\n",
-    "    run(\n",
-    "        [\n",
-    "            \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-redis\",\n",
-    "            f\"--from-literal=password={redis_password}\",\n",
-    "            \"-n\", namespace\n",
-    "        ],\n",
-    "        check=False,  # May already exist\n",
-    "        stream=True\n",
+    "    action = create_or_update_secret(\n",
+    "        \"langsmith-redis\",\n",
+    "        {\"password\": redis_password},\n",
+    "        namespace\n",
    "    )\n",
-    "    ok(\"Redis secret created/updated\")\n",
+    "    ok(f\"Redis secret {action}\")\n",
    "else:\n",
    "    print(\"💡 Skipping Redis secret (using IAM auth or not needed)\")\n",
    "\n",
@@ -568,6 +580,79 @@
    "    print(\"💡 Review the errors above before proceeding\")\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Pre-Install Safety Check\n",
+    "\n",
+    "**⚠️ CRITICAL:** Before installing with Helm, verify that a release doesn't already exist. This prevents accidentally overwriting or conflicting with existing deployments.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pre-install safety check: Verify Helm release doesn't already exist\n",
+    "from shared._validation import ok, warn, fail\n",
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Pre-Install Helm Release Check\\n\")\n",
+    "print(\"Checking if Helm release already exists...\\n\")\n",
+    "\n",
+    "# Check if Helm release exists\n",
+    "result = run(\n",
+    "    [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,  # Don't fail if namespace doesn't exist yet\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "releases = []\n",
+    "if result.returncode == 0:\n",
+    "    try:\n",
+    "        releases = json.loads(result.stdout)\n",
+    "    except json.JSONDecodeError:\n",
+    "        # Empty output or invalid JSON\n",
+    "        releases = []\n",
+    "elif \"not found\" in result.stderr.lower() or \"does not exist\" in result.stderr.lower():\n",
+    "    # Namespace doesn't exist, which is fine\n",
+    "    ok(f\"Namespace '{namespace}' does not exist (will be created)\")\n",
+    "    releases = []\n",
+    "else:\n",
+    "    # Some other error\n",
+    "    warn(f\"Could not check for Helm releases: {result.stderr}\")\n",
+    "    print(\"💡 Proceeding with caution\")\n",
+    "\n",
+    "# Check if our release name already exists\n",
+    "langsmith_releases = [r for r in releases if r.get(\"name\") == helm_release]\n",
+    "\n",
+    "if langsmith_releases:\n",
+    "    release = langsmith_releases[0]\n",
+    "    fail(f\"⚠️  Helm release '{helm_release}' already exists in namespace '{namespace}'!\")\n",
+    "    print(f\"\\nRelease Details:\")\n",
+    "    print(f\"  Name: {release.get('name', 'N/A')}\")\n",
+    "    print(f\"  Status: {release.get('status', 'N/A')}\")\n",
+    "    print(f\"  Chart: {release.get('chart', 'N/A')}\")\n",
+    "    print(f\"  Revision: {release.get('revision', 'N/A')}\")\n",
+    "    print(f\"  Namespace: {release.get('namespace', 'N/A')}\")\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\" * 60)\n",
+    "    print(\"⚠️  WARNING: Cannot install - release already exists!\")\n",
+    "    print(\"=\" * 60)\n",
+    "    print(\"\\n💡 Options:\")\n",
+    "    print(f\"   1. To upgrade the existing release, use: helm upgrade {helm_release} ...\")\n",
+    "    print(f\"   2. To reinstall, first uninstall: helm uninstall {helm_release} -n {namespace}\")\n",
+    "    print(f\"   3. To use a different release name, update HELM_RELEASE in your .env file\")\n",
+    "    print(\"\\n❌ Do NOT proceed with 'helm install' - it will fail.\")\n",
+    "    raise RuntimeError(f\"Helm release '{helm_release}' already exists. Use 'helm upgrade' or uninstall first.\")\n",
+    "else:\n",
+    "    ok(f\"Helm release '{helm_release}' does not exist (safe to install)\")\n",
+    "    print(\"✅ Safe to proceed with Helm install\")\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -13,10 +13,13 @@
    "### What We'll Validate\n",
    "\n",
    "1. ✅ Pod readiness (all pods running)\n",
-    "2. ✅ PVC binding (storage provisioned)\n",
-    "3. ✅ Ingress provisioning (ALB created)\n",
-    "4. ✅ Endpoint reachability (services accessible)\n",
-    "5. ✅ Basic UI availability (web interface works)\n",
+    "2. ✅ License key validation (properly configured)\n",
+    "3. ✅ PVC binding (storage provisioned)\n",
+    "4. ✅ External services connectivity (PostgreSQL, Redis, blob storage)\n",
+    "5. ✅ Ingress provisioning (load balancer created)\n",
+    "6. ✅ Endpoint reachability (services accessible)\n",
+    "7. ✅ Basic UI availability (web interface works)\n",
+    "8. ✅ Basic functional test (optional trace submission)\n",
    "\n",
    "### Why This Matters\n",
    "\n",
@@ -72,7 +75,7 @@
   "source": [
    "## Setting Up Cluster Access\n",
    "\n",
-    "Ensure kubectl is configured for the EKS cluster.\n"
+    "Ensure kubectl is configured for the Kubernetes cluster.\n"
   ]
  },
  {
@@ -82,12 +85,13 @@
   "outputs": [],
   "source": [
    "import os\n",
-    "from shared._validation import require_env, ok\n",
+    "from shared._validation import require_env, ok, warn\n",
    "from shared._cloud_helpers import (\n",
    "    get_cloud_provider,\n",
    "    get_region,\n",
    "    configure_kubectl,\n",
    ")\n",
+    "from shared._shell import run\n",
    "\n",
    "provider = get_cloud_provider()\n",
    "\n",
@@ -131,6 +135,8 @@
   "outputs": [],
   "source": [
    "from shared._k8s_helpers import get_pods, wait_for_deployments_ready, require_namespace\n",
+    "from shared._validation import warn\n",
+    "from shared._shell import run\n",
    "import json\n",
    "\n",
    "# Ensure namespace exists\n",
@@ -201,11 +207,279 @@
    "        run([\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"], check=False, stream=True)\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1.5. License Key Validation\n",
+    "\n",
+    "**Critical:** Verify that the LangSmith license key is properly configured and valid. License issues will prevent the system from functioning correctly.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Check license key secret\n",
+    "print(\"### License Key Validation\\n\")\n",
+    "\n",
+    "# Check if license secret exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", \"langsmith-license\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(\"License key secret exists\")\n",
+    "    \n",
+    "    # Try to check if license key is set (without revealing it)\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    if \"data\" in secret_data and \"license-key\" in secret_data[\"data\"]:\n",
+    "        ok(\"License key is present in secret\")\n",
+    "    else:\n",
+    "        warn(\"License key secret exists but 'license-key' field not found\")\n",
+    "        print(\"💡 Secret may use a different key name\")\n",
+    "else:\n",
+    "    warn(\"License key secret not found\")\n",
+    "    print(\"💡 License secret 'langsmith-license' should exist in namespace\")\n",
+    "    print(\"   Check that you created the secret during Helm installation\")\n",
+    "\n",
+    "# Check pod logs for license-related errors\n",
+    "print(\"\\n### Checking Pod Logs for License Errors\\n\")\n",
+    "\n",
+    "# Get all pods in namespace\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    pod_names = result.stdout.strip().split()\n",
+    "    license_errors_found = False\n",
+    "    \n",
+    "    # Check logs from a few key pods (limit to first 3 to avoid too much output)\n",
+    "    key_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])][:3]\n",
+    "    if not key_pods:\n",
+    "        key_pods = pod_names[:3]  # Fallback to first 3 pods\n",
+    "    \n",
+    "    for pod_name in key_pods:\n",
+    "        try:\n",
+    "            # Get recent logs (last 50 lines)\n",
+    "            log_result = run(\n",
+    "                [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=50\"],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            \n",
+    "            if log_result.returncode == 0:\n",
+    "                logs = log_result.stdout.lower()\n",
+    "                # Look for common license-related error patterns\n",
+    "                license_error_patterns = [\n",
+    "                    \"license\",\n",
+    "                    \"unauthorized\",\n",
+    "                    \"invalid license\",\n",
+    "                    \"license expired\",\n",
+    "                    \"license key\",\n",
+    "                    \"beacon.langchain.com\",\n",
+    "                ]\n",
+    "                \n",
+    "                for pattern in license_error_patterns:\n",
+    "                    if pattern in logs:\n",
+    "                        # Check if it's actually an error (not just a log message)\n",
+    "                        lines = log_result.stdout.split(\"\\n\")\n",
+    "                        error_lines = [line for line in lines if pattern in line.lower() and any(err in line.lower() for err in [\"error\", \"fail\", \"invalid\", \"unauthorized\"])]\n",
+    "                        if error_lines:\n",
+    "                            license_errors_found = True\n",
+    "                            warn(f\"Potential license issue found in {pod_name} logs\")\n",
+    "                            print(f\"   Pattern: '{pattern}'\")\n",
+    "                            print(f\"   Sample: {error_lines[0][:100]}...\")\n",
+    "                            break\n",
+    "        except Exception:\n",
+    "            # Skip pods whose logs cannot be read (e.g. the pod is not ready yet)\n",
+    "            pass\n",
+    "    \n",
+    "    if not license_errors_found:\n",
+    "        ok(\"No obvious license-related errors found in pod logs\")\n",
+    "    else:\n",
+    "        print(\"\\n💡 If license errors are present, verify:\")\n",
+    "        print(\"   - License key is valid and not expired\")\n",
+    "        print(\"   - Egress to https://beacon.langchain.com is allowed (if not air-gapped)\")\n",
+    "        print(\"   - License secret is correctly mounted in pods\")\n",
+    "else:\n",
+    "    warn(\"Could not retrieve pod names to check logs\")\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2.5. External Services Connectivity\n",
+    "\n",
+    "**Important:** Verify that external services (PostgreSQL, Redis, blob storage) are accessible from the cluster. These are critical dependencies for LangSmith.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._cloud_helpers import (\n",
+    "    get_database_service_name,\n",
+    "    get_cache_service_name,\n",
+    "    get_blob_storage_service_name,\n",
+    ")\n",
+    "\n",
+    "# Check external services connectivity\n",
+    "print(\"### External Services Connectivity Check\\n\")\n",
+    "\n",
+    "# Try to load Terraform outputs to get service endpoints\n",
+    "terraform_outputs_file = artifacts_dir / \"terraform-outputs.json\"\n",
+    "terraform_outputs = {}\n",
+    "\n",
+    "if terraform_outputs_file.exists():\n",
+    "    try:\n",
+    "        with open(terraform_outputs_file) as f:\n",
+    "            terraform_outputs_raw = json.load(f)\n",
+    "        \n",
+    "        # Unwrap Terraform output format\n",
+    "        for key, value in terraform_outputs_raw.items():\n",
+    "            if isinstance(value, dict) and \"value\" in value:\n",
+    "                terraform_outputs[key] = value[\"value\"]\n",
+    "            else:\n",
+    "                terraform_outputs[key] = value\n",
+    "        \n",
+    "        print(\"💡 Loaded Terraform outputs for service endpoints\\n\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not parse Terraform outputs: {e}\")\n",
+    "        print(\"💡 Will attempt basic connectivity checks without endpoint details\")\n",
+    "else:\n",
+    "    print(\"💡 Terraform outputs file not found - will check service connectivity from cluster\\n\")\n",
+    "\n",
+    "# Check PostgreSQL connectivity\n",
+    "print(\"### PostgreSQL/Database Connectivity\\n\")\n",
+    "db_service = get_database_service_name()\n",
+    "\n",
+    "# Try to find a pod we can exec into for connectivity tests\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    test_pod = result.stdout.strip()\n",
+    "    \n",
+    "    # Check if we can reach database (basic connectivity test)\n",
+    "    # This is a simple test - actual connection requires credentials\n",
+    "    db_endpoint = None\n",
+    "    if \"rds_endpoint\" in terraform_outputs:\n",
+    "        db_endpoint = terraform_outputs[\"rds_endpoint\"]\n",
+    "    elif \"postgres_endpoint\" in terraform_outputs:\n",
+    "        db_endpoint = terraform_outputs[\"postgres_endpoint\"]\n",
+    "    elif \"database_endpoint\" in terraform_outputs:\n",
+    "        db_endpoint = terraform_outputs[\"database_endpoint\"]\n",
+    "    \n",
+    "    if db_endpoint:\n",
+    "        # Extract hostname from endpoint (remove port if present)\n",
+    "        db_host = db_endpoint.split(\":\")[0] if \":\" in db_endpoint else db_endpoint\n",
+    "        print(f\"Testing connectivity to {db_service} at {db_host}...\")\n",
+    "        \n",
+    "        # DNS lookup from inside the pod (checks name resolution only, not port reachability)\n",
+    "        dns_result = run(\n",
+    "            [\"kubectl\", \"exec\", \"-n\", namespace, test_pod, \"--\", \"nslookup\", db_host],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        \n",
+    "        if dns_result.returncode == 0:\n",
+    "            ok(f\"{db_service} hostname resolves: {db_host}\")\n",
+    "        else:\n",
+    "            warn(f\"Could not resolve {db_service} hostname\")\n",
+    "            print(\"💡 DNS may not be fully configured yet, or nslookup may be missing from the pod image\")\n",
+    "    else:\n",
+    "        print(f\"💡 {db_service} endpoint not found in Terraform outputs\")\n",
+    "        print(\"   Verify database is accessible from cluster in cloud console\")\n",
+    "else:\n",
+    "    print(\"💡 Could not find pod for connectivity testing\")\n",
+    "    print(f\"   Manually verify {db_service} is accessible from cluster\")\n",
+    "\n",
+    "# Check Redis connectivity\n",
+    "print(\"\\n### Redis/Cache Connectivity\\n\")\n",
+    "cache_service = get_cache_service_name()\n",
+    "\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    redis_endpoint = None\n",
+    "    if \"redis_endpoint\" in terraform_outputs:\n",
+    "        redis_endpoint = terraform_outputs[\"redis_endpoint\"]\n",
+    "    elif \"cache_endpoint\" in terraform_outputs:\n",
+    "        redis_endpoint = terraform_outputs[\"cache_endpoint\"]\n",
+    "    elif \"elasticache_endpoint\" in terraform_outputs:\n",
+    "        redis_endpoint = terraform_outputs[\"elasticache_endpoint\"]\n",
+    "    \n",
+    "    if redis_endpoint:\n",
+    "        # Extract hostname from endpoint\n",
+    "        redis_host = redis_endpoint.split(\":\")[0] if \":\" in redis_endpoint else redis_endpoint\n",
+    "        print(f\"Testing connectivity to {cache_service} at {redis_host}...\")\n",
+    "        \n",
+    "        dns_result = run(\n",
+    "            [\"kubectl\", \"exec\", \"-n\", namespace, test_pod, \"--\", \"nslookup\", redis_host],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        \n",
+    "        if dns_result.returncode == 0:\n",
+    "            ok(f\"{cache_service} hostname resolves: {redis_host}\")\n",
+    "        else:\n",
+    "            warn(f\"Could not resolve {cache_service} hostname\")\n",
+    "            print(\"💡 DNS may not be fully configured yet, or nslookup may be missing from the pod image\")\n",
+    "    else:\n",
+    "        print(f\"💡 {cache_service} endpoint not found in Terraform outputs\")\n",
+    "        print(\"   Verify cache is accessible from cluster in cloud console\")\n",
+    "\n",
+    "# Check blob storage (S3/Azure Blob)\n",
+    "print(\"\\n### Blob Storage Connectivity\\n\")\n",
+    "blob_service = get_blob_storage_service_name()\n",
+    "\n",
+    "# Check if blob storage secret exists (indicates it's configured)\n",
+    "blob_secret_result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if blob_secret_result.returncode == 0:\n",
+    "    secrets = blob_secret_result.stdout.split()\n",
+    "    blob_secrets = [s for s in secrets if any(keyword in s.lower() for keyword in [\"s3\", \"storage\", \"blob\", \"aws\"])]\n",
+    "    if blob_secrets:\n",
+    "        ok(f\"Blob storage secrets found: {', '.join(blob_secrets)}\")\n",
+    "    else:\n",
+    "        print(\"💡 Blob storage secrets not found (may use IAM roles instead)\")\n",
+    "\n",
+    "# Check for S3 bucket or blob storage account in Terraform outputs\n",
+    "if \"s3_bucket\" in terraform_outputs or \"bucket_name\" in terraform_outputs:\n",
+    "    bucket_name = terraform_outputs.get(\"s3_bucket\") or terraform_outputs.get(\"bucket_name\")\n",
+    "    ok(f\"Blob storage bucket/container configured: {bucket_name}\")\n",
+    "elif \"storage_account\" in terraform_outputs:\n",
+    "    storage_account = terraform_outputs[\"storage_account\"]\n",
+    "    ok(f\"Azure storage account configured: {storage_account}\")\n",
+    "else:\n",
+    "    print(f\"💡 Verify {blob_service} is configured and accessible\")\n",
+    "    print(\"   Check Terraform outputs or cloud console for storage resource\")\n",
+    "\n",
+    "print(\"\\n💡 For comprehensive functional testing of external services,\")\n",
+    "print(\"   see the validation guide for trace submission and attachment tests\")\n"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -269,9 +543,9 @@
   "source": [
    "## 3. Ingress Provisioning Check\n",
    "\n",
-    "**Critical:** The AWS ALB (Application Load Balancer) must be provisioned. This is how external traffic reaches LangSmith.\n",
+    "**Critical:** The load balancer (ALB for AWS, Application Gateway for Azure) must be provisioned. This is how external traffic reaches LangSmith.\n",
    "\n",
-    "Common issue: ALB never appears due to wrong ingress assumptions.\n"
+    "Common issue: Load balancer never appears due to wrong ingress assumptions.\n"
   ]
  },
  {
@@ -317,16 +591,25 @@
    "            if ingress_hosts:\n",
    "                print(f\"  Hosts: {', '.join(ingress_hosts)}\")\n",
    "            \n",
-    "            # Check for ALB address\n",
+    "            # Check for load balancer address (cloud-agnostic)\n",
    "            if load_balancer.get(\"ingress\"):\n",
-    "                alb_addresses = [ing.get(\"hostname\", ing.get(\"ip\", \"\")) for ing in load_balancer[\"ingress\"]]\n",
-    "                if alb_addresses:\n",
-    "                    ok(f\"ALB provisioned: {', '.join(alb_addresses)}\")\n",
-    "                    print(f\"  💡 Access LangSmith at: https://{alb_addresses[0]}\")\n",
+    "                lb_addresses = [ing.get(\"hostname\", ing.get(\"ip\", \"\")) for ing in load_balancer[\"ingress\"]]\n",
+    "                if lb_addresses:\n",
+    "                    # Determine load balancer type based on address format\n",
+    "                    lb_type = \"Load Balancer\"\n",
+    "                    if provider == \"aws\":\n",
+    "                        if \".elb.\" in lb_addresses[0] or \".amazonaws.com\" in lb_addresses[0]:\n",
+    "                            lb_type = \"ALB (Application Load Balancer)\"\n",
+    "                    elif provider == \"azure\":\n",
+    "                        if \".azure.com\" in lb_addresses[0] or \"appgw\" in lb_addresses[0]:\n",
+    "                            lb_type = \"Application Gateway\"\n",
+    "                    \n",
+    "                    ok(f\"{lb_type} provisioned: {', '.join(lb_addresses)}\")\n",
+    "                    print(f\"  💡 Access LangSmith at: https://{lb_addresses[0]}\")\n",
    "                else:\n",
-    "                    warn(\"ALB ingress entry exists but no address found\")\n",
+    "                    warn(\"Load balancer ingress entry exists but no address found\")\n",
    "            else:\n",
-    "                warn(\"ALB not yet provisioned (may take a few minutes)\")\n",
+    "                warn(\"Load balancer not yet provisioned (may take a few minutes)\")\n",
    "                print(\"  💡 Wait a few minutes and check again\")\n",
    "    else:\n",
    "        warn(\"No ingress resources found\")\n",
@@ -335,24 +618,62 @@
    "    warn(\"Could not retrieve ingress resources\")\n",
    "    print(\"💡 Ingress may not exist yet or namespace is incorrect\")\n",
    "\n",
-    "# Also check for ALB Ingress Controller\n",
-    "print(\"\\n### ALB Ingress Controller\\n\")\n",
-    "result = run(\n",
-    "    [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app.kubernetes.io/name=aws-load-balancer-controller\", \"-o\", \"json\"],\n",
-    "    check=False,\n",
-    "    stream=False\n",
-    ")\n",
+    "# Check for ingress controller (cloud-agnostic)\n",
+    "print(\"\\n### Ingress Controller\\n\")\n",
    "\n",
-    "if result.returncode == 0:\n",
-    "    controller_data = json.loads(result.stdout)\n",
-    "    controllers = controller_data.get(\"items\", [])\n",
-    "    if controllers:\n",
-    "        ok(f\"ALB Ingress Controller found ({len(controllers)} pod(s))\")\n",
+    "if provider == \"aws\":\n",
+    "    # Check for ALB Ingress Controller\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app.kubernetes.io/name=aws-load-balancer-controller\", \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        controller_data = json.loads(result.stdout)\n",
+    "        controllers = controller_data.get(\"items\", [])\n",
+    "        if controllers:\n",
+    "            ok(f\"ALB Ingress Controller found ({len(controllers)} pod(s))\")\n",
+    "        else:\n",
+    "            warn(\"ALB Ingress Controller not found\")\n",
+    "            print(\"💡 ALB Ingress Controller must be installed for ingress to work\")\n",
    "    else:\n",
-    "        warn(\"ALB Ingress Controller not found\")\n",
-    "        print(\"💡 ALB Ingress Controller must be installed for ingress to work\")\n",
+    "        warn(\"Could not check ALB Ingress Controller status\")\n",
+    "\n",
+    "elif provider == \"azure\":\n",
+    "    # Check for Azure Application Gateway Ingress Controller\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app=ingress-azure\", \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        controller_data = json.loads(result.stdout)\n",
+    "        controllers = controller_data.get(\"items\", [])\n",
+    "        if controllers:\n",
+    "            ok(f\"Azure Application Gateway Ingress Controller found ({len(controllers)} pod(s))\")\n",
+    "        else:\n",
+    "            # Fall back to the alternate AGIC pod label (app=ingress-appgw)\n",
+    "            result2 = run(\n",
+    "                [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app=ingress-appgw\", \"-o\", \"json\"],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            if result2.returncode == 0:\n",
+    "                controller_data2 = json.loads(result2.stdout)\n",
+    "                controllers2 = controller_data2.get(\"items\", [])\n",
+    "                if controllers2:\n",
+    "                    ok(f\"Application Gateway Ingress Controller found ({len(controllers2)} pod(s))\")\n",
+    "                else:\n",
+    "                    warn(\"Application Gateway Ingress Controller not found\")\n",
+    "                    print(\"💡 Application Gateway Ingress Controller must be installed for ingress to work\")\n",
+    "            else:\n",
+    "                warn(\"Could not check Application Gateway Ingress Controller status\")\n",
+    "    else:\n",
+    "        warn(\"Could not check Application Gateway Ingress Controller status\")\n",
    "else:\n",
-    "    warn(\"Could not check ALB Ingress Controller status\")\n"
+    "    print(\"💡 Verify ingress controller is installed for your cloud provider\")\n"
   ]
  },
  {
@@ -424,6 +745,129 @@
    "    warn(\"No services found\")\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Basic Functional Test (Optional)\n",
+    "\n",
+    "**Optional:** Submit a simple test trace to verify the end-to-end pipeline is working. This validates that traces can be ingested, stored, and retrieved.\n",
+    "\n",
+    "> **Note:** For comprehensive functional testing (traces, attachments, feedback, datasets), see the full validation guide.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: Basic functional test\n",
+    "print(\"### Basic Functional Test (Optional)\\n\")\n",
+    "\n",
+    "# Check if we have the necessary information to run a test\n",
+    "ingress_result = run(\n",
+    "    [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"jsonpath={.items[0].status.loadBalancer.ingress[0].hostname}{.items[0].status.loadBalancer.ingress[0].ip}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if ingress_result.returncode == 0 and ingress_result.stdout.strip():\n",
+    "    ingress_host = ingress_result.stdout.strip()\n",
+    "    langsmith_endpoint = f\"https://{ingress_host}/api\"\n",
+    "    \n",
+    "    print(f\"LangSmith endpoint: {langsmith_endpoint}\")\n",
+    "    print(\"\\n💡 To run a basic functional test:\")\n",
+    "    print(\"   1. Generate an API key from the LangSmith UI\")\n",
+    "    print(\"   2. Set LANGSMITH_API_KEY environment variable\")\n",
+    "    print(\"   3. Run the test script below (or see validation guide for comprehensive tests)\\n\")\n",
+    "    \n",
+    "    # Check if API key is available\n",
+    "    api_key = os.environ.get(\"LANGSMITH_API_KEY\", \"\").strip()\n",
+    "    \n",
+    "    if api_key:\n",
+    "        print(\"✅ LANGSMITH_API_KEY found - attempting basic trace submission...\\n\")\n",
+    "        \n",
+    "        try:\n",
+    "            # Simple test: submit a basic trace\n",
+    "            test_code = f'''\n",
+    "import os\n",
+    "import requests\n",
+    "from langsmith import traceable\n",
+    "\n",
+    "# Configure LangSmith\n",
+    "os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
+    "# LANGSMITH_API_KEY is read from the environment; it is not embedded here so the key is never printed\n",
+    "os.environ[\"LANGSMITH_ENDPOINT\"] = \"{langsmith_endpoint}\"\n",
+    "os.environ[\"LANGSMITH_PROJECT\"] = \"validation-test\"\n",
+    "\n",
+    "# Simple traced function\n",
+    "@traceable(name=\"test_basic_function\")\n",
+    "def test_function():\n",
+    "    return \"Hello from LangSmith validation test!\"\n",
+    "\n",
+    "# Run test\n",
+    "try:\n",
+    "    result = test_function()\n",
+    "    print(f\"✅ Test trace submitted successfully: {{result}}\")\n",
+    "    print(f\"💡 Check the LangSmith UI at https://{ingress_host} to see the trace\")\n",
+    "    print(\"   Navigate to the 'validation-test' project\")\n",
+    "except Exception as e:\n",
+    "    print(f\"⚠️  Error submitting trace: {{e}}\")\n",
+    "    print(\"💡 This may be normal if LangSmith is still initializing\")\n",
+    "'''\n",
+    "            \n",
+    "            # Try to import langsmith to see if it's available\n",
+    "            try:\n",
+    "                import langsmith\n",
+    "                print(\"Running basic trace test...\")\n",
+    "                exec(test_code)\n",
+    "                ok(\"Basic functional test completed\")\n",
+    "            except ImportError:\n",
+    "                print(\"⚠️  langsmith package not installed\")\n",
+    "                print(\"💡 Install with: pip install langsmith\")\n",
+    "                print(\"\\nTest script (save and run separately):\")\n",
+    "                print(\"=\" * 60)\n",
+    "                print(test_code)\n",
+    "                print(\"=\" * 60)\n",
+    "        except Exception as e:\n",
+    "            warn(f\"Could not run functional test: {e}\")\n",
+    "            print(\"💡 This is optional - you can test functionality manually in the UI\")\n",
+    "    else:\n",
+    "        print(\"💡 To enable automated testing, set LANGSMITH_API_KEY in your environment\")\n",
+    "        print(f\"   Get an API key from: https://{ingress_host}/settings/api-keys\")\n",
+    "        print(\"\\nExample test script (run after getting API key):\")\n",
+    "        print(\"=\" * 60)\n",
+    "        print(f'''\n",
+    "import os\n",
+    "from langsmith import traceable\n",
+    "\n",
+    "os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
+    "os.environ[\"LANGSMITH_API_KEY\"] = \"<your-api-key>\"\n",
+    "os.environ[\"LANGSMITH_ENDPOINT\"] = \"{langsmith_endpoint}\"\n",
+    "os.environ[\"LANGSMITH_PROJECT\"] = \"validation-test\"\n",
+    "\n",
+    "@traceable(name=\"test_basic_function\")\n",
+    "def test_function():\n",
+    "    return \"Hello from LangSmith!\"\n",
+    "\n",
+    "test_function()\n",
+    "print(\"Check the UI for the trace!\")\n",
+    "''')\n",
+    "        print(\"=\" * 60)\n",
+    "else:\n",
+    "    print(\"💡 Ingress not available yet - functional test requires accessible endpoint\")\n",
+    "    print(\"   Complete ingress validation first, then return to this section\")\n",
+    "\n",
+    "print(\"\\n💡 For comprehensive functional testing including:\")\n",
+    "print(\"   - Trace submission & ClickHouse analytics\")\n",
+    "print(\"   - Attachments & blob storage\")\n",
+    "print(\"   - Feedback system\")\n",
+    "print(\"   - Dataset management\")\n",
+    "print(\"   - Agent deployments\")\n",
+    "print(\"   See the full validation guide for detailed test scripts\")\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -460,7 +904,14 @@
    "    # Try to access the UI (HTTPS)\n",
    "    ui_url = f\"https://{ingress_host}\"\n",
    "    print(f\"\\nTesting UI availability at: {ui_url}\")\n",
-    "    print(\"(This may take a moment if ALB is still provisioning...)\\n\")\n",
+    "    \n",
+    "    # Cloud-specific messaging\n",
+    "    if provider == \"aws\":\n",
+    "        print(\"(This may take a moment if ALB is still provisioning...)\\n\")\n",
+    "    elif provider == \"azure\":\n",
+    "        print(\"(This may take a moment if Application Gateway is still provisioning...)\\n\")\n",
+    "    else:\n",
+    "        print(\"(This may take a moment if load balancer is still provisioning...)\\n\")\n",
    "    \n",
    "    try:\n",
    "        # Use a short timeout and allow redirects\n",
@@ -482,19 +933,36 @@
    "        print(\"   Browser may show security warning - this is normal for self-signed certs\")\n",
    "    except requests.exceptions.Timeout:\n",
    "        warn(\"UI request timed out\")\n",
-    "        print(\"💡 ALB may still be provisioning, or ingress is not fully configured\")\n",
-    "        print(f\"   Try again in a few minutes: {ui_url}\")\n",
+    "        if provider == \"aws\":\n",
+    "            print(\"💡 ALB may still be provisioning, or ingress is not fully configured\")\n",
+    "            print(f\"   Check AWS console for ALB status, then try: {ui_url}\")\n",
+    "        elif provider == \"azure\":\n",
+    "            print(\"💡 Application Gateway may still be provisioning, or ingress is not fully configured\")\n",
+    "            print(f\"   Check Azure portal for Application Gateway status, then try: {ui_url}\")\n",
+    "        else:\n",
+    "            print(f\"   Try again in a few minutes: {ui_url}\")\n",
    "    except requests.exceptions.ConnectionError as e:\n",
    "        warn(f\"Could not connect to UI: {e}\")\n",
-    "        print(\"💡 ALB may still be provisioning\")\n",
-    "        print(f\"   Check AWS console for ALB status, then try: {ui_url}\")\n",
+    "        if provider == \"aws\":\n",
+    "            print(\"💡 ALB may still be provisioning\")\n",
+    "            print(f\"   Check AWS console for ALB status, then try: {ui_url}\")\n",
+    "        elif provider == \"azure\":\n",
+    "            print(\"💡 Application Gateway may still be provisioning\")\n",
+    "            print(f\"   Check Azure portal for Application Gateway status, then try: {ui_url}\")\n",
+    "        else:\n",
+    "            print(f\"   Try again in a few minutes: {ui_url}\")\n",
    "    except Exception as e:\n",
    "        warn(f\"Error accessing UI: {e}\")\n",
    "        print(f\"💡 Manual check: Open {ui_url} in a browser\")\n",
    "else:\n",
    "    warn(\"Could not determine ingress hostname\")\n",
    "    print(\"💡 Ingress may not be provisioned yet\")\n",
-    "    print(\"   Run the ingress check above and wait for ALB to be created\")\n"
+    "    if provider == \"aws\":\n",
+    "        print(\"   Run the ingress check above and wait for ALB to be created\")\n",
+    "    elif provider == \"azure\":\n",
+    "        print(\"   Run the ingress check above and wait for Application Gateway to be created\")\n",
+    "    else:\n",
+    "        print(\"   Run the ingress check above and wait for load balancer to be created\")\n"
   ]
  },
  {
@@ -561,10 +1029,13 @@
    "### ✅ Validation Checklist\n",
    "\n",
    "- [ ] All pods are running and ready\n",
+    "- [ ] License key is properly configured (no errors in logs)\n",
    "- [ ] All PVCs are bound\n",
-    "- [ ] Ingress/ALB is provisioned\n",
+    "- [ ] External services are accessible (PostgreSQL, Redis, blob storage)\n",
+    "- [ ] Ingress/load balancer is provisioned\n",
    "- [ ] Services are accessible\n",
-    "- [ ] UI is reachable (or ALB is provisioning)\n",
+    "- [ ] UI is reachable (or load balancer is provisioning)\n",
+    "- [ ] Basic functional test passed (optional)\n",
    "- [ ] Diagnostic artifacts collected\n",
    "\n",
    "### 🎯 Next Steps\n",
@@ -573,15 +1044,18 @@
    "- ✅ You have a working baseline deployment\n",
    "- ✅ You're on a supported path\n",
    "- ✅ Ready to proceed to Module 2 (SSO/OIDC configuration)\n",
+    "- 💡 For comprehensive functional testing, see the full validation guide\n",
    "\n",
    "**If checks fail:**\n",
    "- Review the warnings above\n",
    "- Check diagnostic artifacts\n",
    "- Common issues:\n",
-    "  - **PVCs pending:** EBS CSI driver not installed\n",
-    "  - **ALB not appearing:** Wrong ingress configuration\n",
-    "  - **Pods not ready:** Check events and logs\n",
-    "  - **UI not accessible:** Wait for ALB provisioning (can take 5-10 minutes)\n",
+    "  - **PVCs pending:** Storage CSI driver not installed\n",
+    "  - **Load balancer not appearing:** Wrong ingress configuration\n",
+    "  - **Pods not ready:** Check events and logs\n",
+    "  - **UI not accessible:** Wait for load balancer provisioning (can take 5-10 minutes)\n",
+    "  - **License errors:** Verify license key is valid and secret is correctly mounted\n",
+    "  - **External services unreachable:** Check network connectivity and security groups\n",
    "\n",
    "### 📋 Baseline Reference\n",
    "\n",
@@ -425,7 +425,7 @@
    "\n",
    "If you want to start over:\n",
    "1. Review and update your `.env` file\n",
-    "2. Run `01_aws_preflight.ipynb` again\n",
+    "2. Run `01_preflight.ipynb` again\n",
    "3. Proceed through the module notebooks\n",
    "\n",
    "**Thank you for completing Module 1!**\n"
@@ -0,0 +1,863 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 2: SAML SSO Validation (Optional)\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This notebook validates SAML SSO configuration for your LangSmith deployment. Use this if your IdP only supports SAML or if enterprise policy requires SAML.\n",
+    "\n",
+    "**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. It performs validation checks only and does NOT modify any infrastructure, Helm values, secrets, or deployments. All operations are safe to run against production environments.\n",
+    "\n",
+    "**Prerequisites:**\n",
+    "- Module 1 deployment is healthy and accessible\n",
+    "- DNS configured and resolving correctly\n",
+    "- TLS certificate valid and trusted\n",
+    "- Ingress configured and working\n",
+    "- IdP team has provided SAML metadata or metadata URL\n",
+    "\n",
+    "## What We'll Validate\n",
+    "\n",
+    "1. ✅ Environment configuration (SAML settings, redacted)\n",
+    "2. ✅ Preflight checks (tools, kubectl, namespace, Helm release)\n",
+    "3. ✅ Current auth configuration (without leaking secrets)\n",
+    "4. ✅ Ingress/TLS preconditions (domain, HTTPS)\n",
+    "5. ✅ SAML metadata validation (URL reachability, XML parsing, required attributes)\n",
+    "6. ✅ Deployment verification (pods, logs, endpoints)\n",
+    "7. ✅ Common failure signatures\n",
+    "8. ✅ Support bundle pointers\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**Important:** \n",
+    "- This notebook never prints secrets. All sensitive values are redacted.\n",
+    "- This notebook does NOT modify any resources. It is safe for production use.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path so we can import shared as a package\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,  # If cwd is module-2, go up one level to notebooks\n",
+    "    Path.cwd(),  # If cwd is already notebooks\n",
+    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration\n",
+    "\n",
+    "Load and validate SAML configuration from environment variables. All secrets are redacted in output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import json\n",
+    "from pathlib import Path\n",
+    "from shared._validation import require_env, print_config, redact, ok, warn\n",
+    "from shared._shell import run\n",
+    "\n",
+    "# Required SAML configuration variables\n",
+    "required_vars = [\n",
+    "    \"NAMESPACE\",\n",
+    "    \"SAML_METADATA_URL\",  # OR SAML_METADATA_FILE (one must be provided)\n",
+    "    \"LANGSMITH_DOMAIN\",\n",
+    "]\n",
+    "\n",
+    "# Optional but recommended\n",
+    "optional_vars = [\n",
+    "    \"SAML_ENTITY_ID\",\n",
+    "    \"SAML_EMAIL_ATTRIBUTE\",\n",
+    "    \"SAML_NAME_ATTRIBUTE\",\n",
+    "    \"SAML_GROUPS_ATTRIBUTE\",\n",
+    "]\n",
+    "\n",
+    "print(\"### Loading SAML Configuration\\n\")\n",
+    "\n",
+    "# Load required variables\n",
+    "config = {}\n",
+    "missing = []\n",
+    "\n",
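+    "for var in required_vars:\n",
+    "    value = os.environ.get(var, \"\").strip()\n",
+    "    # SAML_METADATA_URL may be replaced by SAML_METADATA_FILE; that pair is checked below\n",
+    "    if not value and var != \"SAML_METADATA_URL\":\n",
+    "        missing.append(var)\n",
+    "    config[var] = value\n",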
+    "\n",
+    "# Check if SAML_METADATA_FILE is provided as alternative\n",
+    "saml_metadata_file = os.environ.get(\"SAML_METADATA_FILE\", \"\").strip()\n",
+    "if not config.get(\"SAML_METADATA_URL\") and not saml_metadata_file:\n",
+    "    missing.append(\"SAML_METADATA_URL or SAML_METADATA_FILE\")\n",
+    "\n",
+    "if missing:\n",
+    "    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
+    "                      f\"💡 Copy env-samples/oidc.env.example to your .env file and fill in SAML values\")\n",
+    "\n",
+    "# Load optional variables\n",
+    "for var in optional_vars:\n",
+    "    config[var] = os.environ.get(var, \"\").strip()\n",
+    "\n",
+    "# Set defaults for optional variables\n",
+    "if not config.get(\"SAML_EMAIL_ATTRIBUTE\"):\n",
+    "    config[\"SAML_EMAIL_ATTRIBUTE\"] = \"email\"\n",
+    "if not config.get(\"SAML_NAME_ATTRIBUTE\"):\n",
+    "    config[\"SAML_NAME_ATTRIBUTE\"] = \"name\"\n",
+    "if not config.get(\"SAML_GROUPS_ATTRIBUTE\"):\n",
+    "    config[\"SAML_GROUPS_ATTRIBUTE\"] = \"groups\"\n",
+    "\n",
+    "# Print configuration (these settings contain no secrets, so no keys are redacted)\n",
+    "print_config(config, redact_keys=set())\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n",
+    "\n",
+    "# Validate metadata source\n",
+    "if config.get(\"SAML_METADATA_URL\"):\n",
+    "    metadata_url = config[\"SAML_METADATA_URL\"]\n",
+    "    if not metadata_url.startswith(\"https://\"):\n",
+    "        warn(\"SAML metadata URL should use HTTPS\")\n",
+    "    print(f\"\\n💡 Using metadata URL: {metadata_url}\")\n",
+    "elif saml_metadata_file:\n",
+    "    metadata_path = Path(saml_metadata_file)\n",
+    "    if not metadata_path.exists():\n",
+    "        raise RuntimeError(f\"❌ SAML metadata file not found: {saml_metadata_file}\")\n",
+    "    print(f\"\\n💡 Using metadata file: {saml_metadata_file}\")\n",
+    "else:\n",
+    "    raise RuntimeError(\"❌ Either SAML_METADATA_URL or SAML_METADATA_FILE must be provided\")\n",
+    "\n",
+    "domain = config[\"LANGSMITH_DOMAIN\"]\n",
+    "print(f\"\\n💡 Verify these values match your IdP configuration:\")\n",
+    "print(f\"   - Entity ID: {config.get('SAML_ENTITY_ID') or 'N/A'}\")\n",
+    "print(f\"   - Metadata URL/File: {config.get('SAML_METADATA_URL') or saml_metadata_file}\")\n",
+    "print(f\"   - Domain: {domain}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Safety Check: Verify Environment\n",
+    "\n",
+    "Before proceeding with validation, confirm you're working with the correct environment and that auth configuration is appropriate to validate.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Safety check: Verify environment and auth configuration state\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn\n",
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"### Environment Safety Check\\n\")\n",
+    "\n",
+    "# Show current environment\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
+    "    print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
+    "elif provider == \"azure\":\n",
+    "    print(f\"Subscription ID: {identity.get('SubscriptionId', identity.get('Account', 'N/A'))}\")\n",
+    "    print(f\"Subscription Name: {identity.get('SubscriptionName', 'N/A')}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 60)\n",
+    "print(\"⚠️  IMPORTANT: This notebook is READ-ONLY\")\n",
+    "print(\"=\" * 60)\n",
+    "print(\"\\nThis notebook will:\")\n",
+    "print(\"  ✅ Validate SAML configuration\")\n",
+    "print(\"  ✅ Check deployment status\")\n",
+    "print(\"  ✅ Inspect current auth settings (secrets redacted)\")\n",
+    "print(\"  ✅ Collect support bundles\")\n",
+    "print(\"\\nThis notebook will NOT:\")\n",
+    "print(\"  ❌ Modify Helm values or releases\")\n",
+    "print(\"  ❌ Create or update secrets\")\n",
+    "print(\"  ❌ Restart pods or deployments\")\n",
+    "print(\"  ❌ Change any infrastructure\")\n",
+    "print(\"\\n\" + \"=\" * 60)\n",
+    "\n",
+    "# Check if auth is already configured\n",
+    "print(\"\\n### Checking Current Auth Configuration State\\n\")\n",
+    "namespace = config.get(\"NAMESPACE\", \"\")\n",
+    "helm_release = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "# Check for auth-related secrets\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "auth_configured = False\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    auth_secrets = [s for s in secrets.get(\"items\", [])\n",
+    "                   if any(keyword in s.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
+    "                         for keyword in [\"auth\", \"saml\", \"sso\"])]\n",
+    "    \n",
+    "    if auth_secrets:\n",
+    "        auth_configured = True\n",
+    "        ok(f\"Found {len(auth_secrets)} auth-related secret(s) - auth appears configured\")\n",
+    "        print(\"   💡 This validation will check if your SAML configuration matches existing setup\")\n",
+    "    else:\n",
+    "        warn(\"No auth-related secrets found - auth may not be configured yet\")\n",
+    "        print(\"   💡 This validation will verify your SAML configuration is ready to apply\")\n",
+    "\n",
+    "# Check Helm values for auth config\n",
+    "result = run(\n",
+    "    [\"helm\", \"get\", \"values\", helm_release, \"-n\", namespace, \"--output\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    try:\n",
+    "        values = json.loads(result.stdout)\n",
+    "        if \"auth\" in str(values).lower() or \"saml\" in str(values).lower():\n",
+    "            if not auth_configured:\n",
+    "                auth_configured = True\n",
+    "            ok(\"Helm values contain auth configuration\")\n",
+    "        else:\n",
+    "            warn(\"No auth configuration found in Helm values\")\n",
+    "    except json.JSONDecodeError:\n",
+    "        pass\n",
+    "\n",
+    "if auth_configured:\n",
+    "    print(\"\\n\" + \"=\" * 60)\n",
+    "    print(\"⚠️  Auth is already configured in this environment\")\n",
+    "    print(\"=\" * 60)\n",
+    "    print(\"\\nThis validation will:\")\n",
+    "    print(\"  - Verify your SAML settings match the existing configuration\")\n",
+    "    print(\"  - Check if authentication is working correctly\")\n",
+    "    print(\"  - Identify any configuration mismatches\")\n",
+    "    print(\"\\n💡 If you need to CHANGE auth configuration, use Helm upgrade separately\")\n",
+    "    print(\"   This notebook only validates, it does not modify configuration\")\n",
+    "else:\n",
+    "    print(\"\\n\" + \"=\" * 60)\n",
+    "    print(\"ℹ️  Auth not yet configured\")\n",
+    "    print(\"=\" * 60)\n",
+    "    print(\"\\nThis validation will:\")\n",
+    "    print(\"  - Verify your SAML settings are correct\")\n",
+    "    print(\"  - Check prerequisites (DNS, TLS, ingress)\")\n",
+    "    print(\"  - Validate IdP metadata\")\n",
+    "    print(\"\\n💡 After validation passes, apply configuration using Helm upgrade\")\n",
+    "    print(\"   This notebook only validates, it does not apply configuration\")\n",
+    "\n",
+    "ok(\"Environment safety check complete\")\n",
+    "print(\"\\n✅ Safe to proceed with validation\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Preflight Checks\n",
+    "\n",
+    "Same as OIDC notebook - verify tools, kubectl context, namespace, and Helm release.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "from shared._k8s_helpers import require_namespace, namespace_exists\n",
+    "from shared._shell import run\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(\"### Preflight Checks\\n\")\n",
+    "\n",
+    "# Check kubectl is available\n",
+    "print(\"1. Checking kubectl...\")\n",
+    "# Avoid --short: the flag was removed in recent kubectl releases\n",
+    "result = run([\"kubectl\", \"version\", \"--client\"], check=False, stream=False)\n",
+    "if result.returncode == 0:\n",
+    "    ok(\"kubectl is available\")\n",
+    "    print(f\"   {result.stdout.strip()}\")\n",
+    "else:\n",
+    "    raise RuntimeError(\"❌ kubectl is not available or not working\")\n",
+    "\n",
+    "# Check kubectl context\n",
+    "print(\"\\n2. Checking kubectl context...\")\n",
+    "result = run([\"kubectl\", \"config\", \"current-context\"], check=False, stream=False)\n",
+    "if result.returncode == 0:\n",
+    "    context = result.stdout.strip()\n",
+    "    ok(f\"Current context: {context}\")\n",
+    "else:\n",
+    "    warn(\"Could not determine kubectl context\")\n",
+    "\n",
+    "# Check namespace exists\n",
+    "print(f\"\\n3. Checking namespace '{namespace}'...\")\n",
+    "if namespace_exists(namespace):\n",
+    "    ok(f\"Namespace '{namespace}' exists\")\n",
+    "else:\n",
+    "    raise RuntimeError(f\"❌ Namespace '{namespace}' does not exist. Complete Module 1 first.\")\n",
+    "\n",
+    "# Check Helm release\n",
+    "print(f\"\\n4. Checking Helm release...\")\n",
+    "helm_release = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "result = run(\n",
+    "    [\"helm\", \"list\", \"-n\", namespace, \"--output\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    try:\n",
+    "        releases = json.loads(result.stdout)\n",
+    "        release_names = [r.get(\"name\") for r in releases]\n",
+    "        if helm_release in release_names:\n",
+    "            ok(f\"Helm release '{helm_release}' exists\")\n",
+    "        else:\n",
+    "            raise RuntimeError(f\"❌ Helm release '{helm_release}' not found\")\n",
+    "    except json.JSONDecodeError:\n",
+    "        warn(\"Could not parse Helm release list\")\n",
+    "else:\n",
+    "    warn(\"Could not list Helm releases; check Helm access to the cluster\")\n",
+    "\n",
+    "ok(\"Preflight checks complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Inspect Current Auth Configuration\n",
+    "\n",
+    "Examine the current authentication configuration without leaking secrets.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Inspecting Current Auth Configuration\\n\")\n",
+    "\n",
+    "# Check for auth-related environment variables in deployments\n",
+    "print(\"1. Checking deployment environment variables...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    deployments = json.loads(result.stdout)\n",
+    "    auth_vars_found = False\n",
+    "    \n",
+    "    for deployment in deployments.get(\"items\", []):\n",
+    "        name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = deployment.get(\"spec\", {}).get(\"template\", {}).get(\"spec\", {}).get(\"containers\", [])\n",
+    "        \n",
+    "        for container in containers:\n",
+    "            env_vars = container.get(\"env\", [])\n",
+    "            auth_env = [e for e in env_vars if any(keyword in e.get(\"name\", \"\").upper() for keyword in [\"AUTH\", \"SAML\", \"SSO\"])]\n",
+    "            \n",
+    "            if auth_env:\n",
+    "                auth_vars_found = True\n",
+    "                print(f\"\\n   Deployment: {name}\")\n",
+    "                for env in auth_env:\n",
+    "                    env_name = env.get(\"name\", \"\")\n",
+    "                    if \"SECRET\" in env_name.upper() or \"PASSWORD\" in env_name.upper():\n",
+    "                        print(f\"     - {env_name}: <redacted>\")\n",
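+    "                    elif env.get(\"valueFrom\"):\n",
+    "                        print(f\"     - {env_name}: <from secret/configmap>\")\n",
+    "                    else:\n",
+    "                        # Show only the name; inline values may still be sensitive\n",
+    "                        print(f\"     - {env_name}: <set inline, value hidden>\")\n",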
+    "    \n",
+    "    if not auth_vars_found:\n",
+    "        warn(\"No auth-related environment variables found\")\n",
+    "\n",
+    "# Check for auth-related secrets (names only)\n",
+    "print(\"\\n2. Checking for auth-related secrets...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    auth_secrets = [s for s in secrets.get(\"items\", [])\n",
+    "                   if any(keyword in s.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
+    "                         for keyword in [\"auth\", \"saml\", \"sso\"])]\n",
+    "    \n",
+    "    if auth_secrets:\n",
+    "        ok(f\"Found {len(auth_secrets)} auth-related secret(s)\")\n",
+    "        for secret in auth_secrets:\n",
+    "            name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            print(f\"   - {name} (values not displayed)\")\n",
+    "\n",
+    "ok(\"Auth configuration inspection complete (no secrets displayed)\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Validate Ingress/TLS Preconditions\n",
+    "\n",
+    "Verify domain resolution, HTTPS accessibility, and TLS certificate validity.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import socket\n",
+    "import ssl\n",
+    "import requests\n",
+    "\n",
+    "domain = config[\"LANGSMITH_DOMAIN\"]\n",
+    "print(f\"### Validating Ingress/TLS for {domain}\\n\")\n",
+    "\n",
+    "# 1. DNS Resolution\n",
+    "print(\"1. Checking DNS resolution...\")\n",
+    "try:\n",
+    "    ip_address = socket.gethostbyname(domain)\n",
+    "    ok(f\"Domain resolves to: {ip_address}\")\n",
+    "except socket.gaierror as e:\n",
+    "    raise RuntimeError(f\"❌ DNS resolution failed for {domain}: {e}\")\n",
+    "\n",
+    "# 2. HTTPS Reachability\n",
+    "print(f\"\\n2. Checking HTTPS reachability...\")\n",
+    "https_url = f\"https://{domain}\"\n",
+    "\n",
+    "try:\n",
+    "    response = requests.get(https_url, timeout=10, verify=True, allow_redirects=True)\n",
+    "    ok(f\"HTTPS accessible: {response.status_code}\")\n",
+    "except requests.exceptions.SSLError as e:\n",
+    "    warn(f\"SSL verification failed: {e}\")\n",
+    "    print(\"   💡 Certificate may be self-signed or invalid\")\n",
+    "except requests.exceptions.RequestException as e:\n",
+    "    raise RuntimeError(f\"❌ Could not connect to {domain}: {e}\")\n",
+    "\n",
+    "# 3. TLS Certificate Check\n",
+    "print(f\"\\n3. Checking TLS certificate...\")\n",
+    "try:\n",
+    "    context = ssl.create_default_context()\n",
+    "    with socket.create_connection((domain, 443), timeout=10) as sock:\n",
+    "        with context.wrap_socket(sock, server_hostname=domain) as ssock:\n",
+    "            cert = ssock.getpeercert()\n",
+    "            subject = dict(x[0] for x in cert['subject'])\n",
+    "            \n",
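+    "            import datetime\n",
+    "            # notAfter is expressed in GMT, so compare against an aware UTC timestamp\n",
+    "            not_after = datetime.datetime.fromtimestamp(\n",
+    "                ssl.cert_time_to_seconds(cert['notAfter']), tz=datetime.timezone.utc\n",
+    "            )\n",
+    "            days_until_expiry = (not_after - datetime.datetime.now(datetime.timezone.utc)).days\n",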
+    "            \n",
+    "            if days_until_expiry > 30:\n",
+    "                ok(f\"Certificate valid for {days_until_expiry} more days\")\n",
+    "            elif days_until_expiry > 0:\n",
+    "                warn(f\"Certificate expires in {days_until_expiry} days\")\n",
+    "            else:\n",
+    "                warn(\"Certificate has expired or expires within 24 hours\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not verify TLS certificate: {e}\")\n",
+    "\n",
+    "ok(\"Ingress/TLS preconditions validated\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. SAML Metadata Validation\n",
+    "\n",
+    "Validate SAML metadata URL reachability, XML parsing, and required attributes.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import xml.etree.ElementTree as ET\n",
+    "import requests\n",
+    "\n",
+    "print(\"### Validating SAML Metadata\\n\")\n",
+    "\n",
+    "metadata_url = config.get(\"SAML_METADATA_URL\", \"\")\n",
+    "metadata_file = os.environ.get(\"SAML_METADATA_FILE\", \"\").strip()\n",
+    "\n",
+    "# 1. Fetch or Load Metadata\n",
+    "print(\"1. Loading SAML metadata...\")\n",
+    "metadata_xml = None\n",
+    "\n",
+    "if metadata_url:\n",
+    "    print(f\"   Fetching from URL: {metadata_url}\")\n",
+    "    try:\n",
+    "        response = requests.get(metadata_url, timeout=10, verify=True)\n",
+    "        if response.status_code == 200:\n",
+    "            ok(\"Metadata URL accessible\")\n",
+    "            metadata_xml = response.text\n",
+    "        else:\n",
+    "            raise RuntimeError(f\"❌ Metadata URL returned {response.status_code}\")\n",
+    "    except requests.exceptions.RequestException as e:\n",
+    "        raise RuntimeError(f\"❌ Could not fetch metadata URL: {e}\")\n",
+    "elif metadata_file:\n",
+    "    print(f\"   Loading from file: {metadata_file}\")\n",
+    "    try:\n",
+    "        with open(metadata_file, \"r\") as f:\n",
+    "            metadata_xml = f.read()\n",
+    "        ok(\"Metadata file loaded\")\n",
+    "    except Exception as e:\n",
+    "        raise RuntimeError(f\"❌ Could not load metadata file: {e}\")\n",
+    "\n",
+    "if not metadata_xml:\n",
+    "    raise RuntimeError(\"❌ No metadata XML available\")\n",
+    "\n",
+    "# 2. Parse XML\n",
+    "print(\"\\n2. Parsing SAML metadata XML...\")\n",
+    "try:\n",
+    "    # Register namespaces\n",
+    "    namespaces = {\n",
+    "        'md': 'urn:oasis:names:tc:SAML:2.0:metadata',\n",
+    "        'ds': 'http://www.w3.org/2000/09/xmldsig#',\n",
+    "    }\n",
+    "    \n",
+    "    root = ET.fromstring(metadata_xml)\n",
+    "    ok(\"Metadata XML is valid\")\n",
+    "except ET.ParseError as e:\n",
+    "    raise RuntimeError(f\"❌ Invalid XML: {e}\")\n",
+    "\n",
+    "# 3. Extract Entity Descriptor\n",
+    "print(\"\\n3. Extracting entity information...\")\n",
+    "entity_id = None\n",
+    "try:\n",
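+    "    # find('.//') only searches descendants, so handle the common case where\n",
+    "    # EntityDescriptor is the root element itself\n",
+    "    if root.tag == \"{urn:oasis:names:tc:SAML:2.0:metadata}EntityDescriptor\":\n",
+    "        entity_descriptor = root\n",
+    "    else:\n",
+    "        entity_descriptor = root.find('.//md:EntityDescriptor', namespaces)\n",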
+    "    if entity_descriptor is not None:\n",
+    "        entity_id = entity_descriptor.get('entityID')\n",
+    "        if entity_id:\n",
+    "            ok(f\"Entity ID found: {entity_id}\")\n",
+    "            if config.get(\"SAML_ENTITY_ID\") and entity_id != config.get(\"SAML_ENTITY_ID\"):\n",
+    "                warn(f\"Entity ID mismatch: config={config.get('SAML_ENTITY_ID')}, metadata={entity_id}\")\n",
+    "        else:\n",
+    "            warn(\"Entity ID not found in metadata\")\n",
+    "    else:\n",
+    "        warn(\"EntityDescriptor not found in metadata\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not extract entity information: {e}\")\n",
+    "\n",
+    "# 4. Extract IDP SSO Descriptor\n",
+    "print(\"\\n4. Extracting IdP SSO descriptor...\")\n",
+    "try:\n",
+    "    idp_sso = root.find('.//md:IDPSSODescriptor', namespaces)\n",
+    "    if idp_sso is not None:\n",
+    "        ok(\"IdP SSO descriptor found\")\n",
+    "        \n",
+    "        # Extract SSO endpoints\n",
+    "        sso_endpoints = idp_sso.findall('.//md:SingleSignOnService', namespaces)\n",
+    "        if sso_endpoints:\n",
+    "            print(f\"   Found {len(sso_endpoints)} SSO endpoint(s):\")\n",
+    "            for endpoint in sso_endpoints:\n",
+    "                location = endpoint.get('Location', '')\n",
+    "                binding = endpoint.get('Binding', '')\n",
+    "                print(f\"     - {binding}: {location}\")\n",
+    "        else:\n",
+    "            warn(\"No SSO endpoints found\")\n",
+    "    else:\n",
+    "        warn(\"IDPSSODescriptor not found - may not be IdP metadata\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not extract IdP SSO descriptor: {e}\")\n",
+    "\n",
+    "# 5. Extract Certificates\n",
+    "print(\"\\n5. Checking for signing certificates...\")\n",
+    "try:\n",
+    "    certificates = root.findall('.//ds:X509Certificate', namespaces)\n",
+    "    if certificates:\n",
+    "        ok(f\"Found {len(certificates)} certificate(s)\")\n",
+    "        for i, cert in enumerate(certificates):\n",
+    "            cert_text = cert.text.strip() if cert.text else \"\"\n",
+    "            if cert_text:\n",
+    "                print(f\"   Certificate {i+1}: {len(cert_text)} characters\")\n",
+    "            else:\n",
+    "                warn(f\"Certificate {i+1} is empty\")\n",
+    "    else:\n",
+    "        warn(\"No signing certificates found\")\n",
+    "        print(\"   💡 IdP must provide signing certificate for assertion validation\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not extract certificates: {e}\")\n",
+    "\n",
+    "# 6. Validate Required Attributes\n",
+    "print(\"\\n6. Validating attribute configuration...\")\n",
+    "print(f\"   Expected email attribute: {config['SAML_EMAIL_ATTRIBUTE']}\")\n",
+    "print(f\"   Expected name attribute: {config['SAML_NAME_ATTRIBUTE']}\")\n",
+    "print(f\"   Expected groups attribute: {config['SAML_GROUPS_ATTRIBUTE']}\")\n",
+    "\n",
+    "ok(\"SAML metadata validation complete\")\n",
+    "print(\"\\n💡 Verify your IdP sends these attributes in SAML assertions:\")\n",
+    "print(f\"   - {config['SAML_EMAIL_ATTRIBUTE']} (required)\")\n",
+    "print(f\"   - {config['SAML_NAME_ATTRIBUTE']} (optional)\")\n",
+    "print(f\"   - {config['SAML_GROUPS_ATTRIBUTE']} (optional, for role mapping)\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
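+   "source": [
+    "## Common SAML Failure Signatures\n",
+    "\n",
+    "Scan recent pod logs for error patterns that typically indicate SAML misconfiguration. Matching log lines are not printed because they may contain sensitive data.\n"
+   ]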
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "\n",
+    "print(\"### Checking for Common SAML Failure Signatures\\n\")\n",
+    "\n",
+    "# Get pod names\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    pod_names = result.stdout.strip().split()\n",
+    "    api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
+    "    \n",
+    "    if not api_pods:\n",
+    "        api_pods = pod_names[:2]\n",
+    "    \n",
+    "    failure_patterns = {\n",
+    "        \"Missing Attributes\": [\n",
+    "            \"missing attribute\",\n",
+    "            \"attribute not found\",\n",
+    "            \"email attribute\",\n",
+    "            \"required attribute\",\n",
+    "        ],\n",
+    "        \"Signature Validation\": [\n",
+    "            \"signature validation failed\",\n",
+    "            \"invalid signature\",\n",
+    "            \"certificate\",\n",
+    "            \"signing key\",\n",
+    "        ],\n",
+    "        \"Assertion Expired\": [\n",
+    "            \"assertion expired\",\n",
+    "            \"notonorafter\",\n",
+    "            \"clock skew\",\n",
+    "            \"timeout\",\n",
+    "        ],\n",
+    "        \"Entity ID Mismatch\": [\n",
+    "            \"entity id\",\n",
+    "            \"issuer mismatch\",\n",
+    "            \"audience\",\n",
+    "        ],\n",
+    "        \"Metadata Issues\": [\n",
+    "            \"metadata\",\n",
+    "            \"xml parse\",\n",
+    "            \"invalid metadata\",\n",
+    "        ],\n",
+    "    }\n",
+    "    \n",
+    "    found_issues = []\n",
+    "    \n",
+    "    for pod_name in api_pods[:2]:\n",
+    "        try:\n",
+    "            log_result = run(\n",
+    "                [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=200\"],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            \n",
+    "            if log_result.returncode == 0:\n",
+    "                logs_lower = log_result.stdout.lower()\n",
+    "                \n",
+    "                for category, patterns in failure_patterns.items():\n",
+    "                    for pattern in patterns:\n",
+    "                        if pattern in logs_lower:\n",
+    "                            # Check if it's actually an error (not just a log message)\n",
+    "                            lines = log_result.stdout.split(\"\\n\")\n",
+    "                            error_lines = [line for line in lines \n",
+    "                                         if pattern in line.lower() \n",
+    "                                         and any(err in line.lower() for err in [\"error\", \"fail\", \"invalid\", \"missing\"])]\n",
+    "                            \n",
+    "                            if error_lines and category not in found_issues:\n",
+    "                                found_issues.append(category)\n",
+    "                                warn(f\"Potential {category} issue found in {pod_name} logs\")\n",
+    "                                print(f\"   Pattern: '{pattern}'\")\n",
+    "                                # Don't print full log line as it may contain sensitive data\n",
+    "                                break\n",
+    "        except Exception as e:\n",
+    "            warn(f\"Could not read logs for {pod_name}: {e}\")\n",
+    "    \n",
+    "    if not found_issues:\n",
+    "        ok(\"No common SAML failure signatures found in logs\")\n",
+    "    else:\n",
+    "        print(f\"\\n💡 Found potential issues: {', '.join(found_issues)}\")\n",
+    "        print(\"   Review logs manually for details:\")\n",
+    "        print(f\"   kubectl logs <pod-name> -n {namespace} --tail=100 | grep -i saml\")\n",
+    "else:\n",
+    "    warn(\"Could not retrieve pod names\")\n",
+    "\n",
+    "print(\"\\n💡 Common SAML failure causes:\")\n",
+    "print(\"   1. Missing required attributes in assertion\")\n",
+    "print(\"   2. Certificate mismatch or expired certificate\")\n",
+    "print(\"   3. Clock skew between LangSmith and IdP\")\n",
+    "print(\"   4. Entity ID mismatch\")\n",
+    "print(\"   5. Attribute name mismatch\")\n",
+    "print(\"\\n   See docs/shared/auth_troubleshooting.md for detailed troubleshooting\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Deployment Verification & Support Bundle\n",
+    "\n",
+    "This section mirrors the OIDC notebook: verify pod readiness, test endpoint authentication behavior, and collect a support bundle.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._k8s_helpers import get_pods, wait_for_deployments_ready\n",
+    "from datetime import datetime\n",
+    "import requests\n",
+    "\n",
+    "print(\"### Deployment Verification\\n\")\n",
+    "\n",
+    "# 1. Pod Readiness\n",
+    "print(\"1. Checking pod readiness...\")\n",
+    "require_namespace(namespace)\n",
+    "\n",
+    "try:\n",
+    "    wait_for_deployments_ready(namespace, timeout=\"5m\")\n",
+    "    ok(\"All deployments ready\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Some deployments may not be ready: {e}\")\n",
+    "\n",
+    "pods_output = get_pods(namespace)\n",
+    "print(\"\\nPod Status:\")\n",
+    "print(pods_output)\n",
+    "\n",
+    "# 2. Test Endpoint Auth Behavior\n",
+    "print(\"\\n2. Testing endpoint auth behavior...\")\n",
+    "domain = config[\"LANGSMITH_DOMAIN\"]\n",
+    "test_url = f\"https://{domain}/api/v1/me\"\n",
+    "\n",
+    "try:\n",
+    "    response = requests.get(test_url, timeout=10, verify=True, allow_redirects=False)\n",
+    "    if response.status_code in [401, 403]:\n",
+    "        ok(f\"Endpoint requires authentication ({response.status_code})\")\n",
+    "    elif response.status_code in [301, 302, 307, 308]:\n",
+    "        redirect_location = response.headers.get(\"Location\", \"\")\n",
+    "        if \"login\" in redirect_location.lower() or \"saml\" in redirect_location.lower():\n",
+    "            ok(\"Endpoint redirects to authentication\")\n",
+    "        else:\n",
+    "            warn(f\"Endpoint redirects but not to auth: {redirect_location}\")\n",
+    "    else:\n",
+    "        warn(f\"Unexpected status code: {response.status_code}\")\n",
+    "except requests.exceptions.RequestException as e:\n",
+    "    warn(f\"Could not test endpoint: {e}\")\n",
+    "\n",
+    "# 3. Support Bundle\n",
+    "print(\"\\n3. Collecting support bundle...\")\n",
+    "timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
+    "support_dir = artifacts_dir / f\"saml-support-{timestamp}\"\n",
+    "support_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    pod_names = result.stdout.strip().split()\n",
+    "    api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
+    "    \n",
+    "    for pod_name in (api_pods[:3] if api_pods else pod_names[:3]):\n",
+    "        try:\n",
+    "            log_result = run(\n",
+    "                [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=200\"],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            if log_result.returncode == 0:\n",
+    "                log_file = support_dir / f\"{pod_name}-logs.txt\"\n",
+    "                with open(log_file, \"w\") as f:\n",
+    "                    f.write(log_result.stdout)\n",
+    "                print(f\"   ✅ Saved logs for {pod_name}\")\n",
+    "        except Exception as e:\n",
+    "            warn(f\"Could not save logs for {pod_name}: {e}\")\n",
+    "\n",
+    "ok(f\"Support bundle saved to: {support_dir}\")\n",
+    "print(\"\\n💡 Include pod logs and configuration when contacting support\")\n",
+    "print(\"   See docs/shared/auth_troubleshooting.md for complete bundle procedure\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,933 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 3: Operations Sanity Checks\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This notebook performs read-only validation and signal checks for your LangSmith production deployment. It assumes Module 1 and Module 2 are complete.\n",
+    "\n",
+    "**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. It performs validation checks only and does NOT modify any infrastructure, Helm values, secrets, deployments, or resources. All operations are safe to run against production environments.\n",
+    "\n",
+    "**Prerequisites:**\n",
+    "- Module 1 deployment is healthy and accessible\n",
+    "- Module 2 authentication is configured\n",
+    "- kubectl access to the cluster\n",
+    "- Read access to cloud provider APIs (for managed services)\n",
+    "\n",
+    "## What We'll Check\n",
+    "\n",
+    "1. ✅ Configuration (environment variables, redacted)\n",
+    "2. ✅ Preflight (kubectl context, namespace, deployments)\n",
+    "3. ✅ Current state snapshot (pods, services, events)\n",
+    "4. ✅ Early warning signals (restarts, pending pods, resource saturation)\n",
+    "5. ✅ Storage/durability checks (blob storage, backups)\n",
+    "6. ✅ Sidecar checks (Istio, if applicable)\n",
+    "\n",
+    "**Estimated time:** 15-20 minutes\n",
+    "\n",
+    "**Important:** \n",
+    "- This notebook is read-only and safe to run. It does not modify any resources.\n",
+    "- All operations are read-only: `kubectl get`, `kubectl logs`, `kubectl top`, `helm get values`\n",
+    "- Artifacts are saved locally only (no cluster modifications)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path so we can import shared as a package\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,  # If cwd is module-3, go up one level to notebooks\n",
+    "    Path.cwd(),  # If cwd is already notebooks\n",
+    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Safety Check: Verify Environment\n",
+    "\n",
+    "Before proceeding with validation, confirm you're working with the correct environment. This notebook is read-only and safe for production use.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Safety check: Verify environment and confirm read-only operations\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"### Environment Safety Check\\n\")\n",
+    "\n",
+    "# Show current environment\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
+    "    print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
+    "elif provider == \"azure\":\n",
+    "    print(f\"Subscription ID: {identity.get('SubscriptionId', identity.get('Account', 'N/A'))}\")\n",
+    "    print(f\"Subscription Name: {identity.get('SubscriptionName', 'N/A')}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 60)\n",
+    "print(\"⚠️  IMPORTANT: This notebook is READ-ONLY\")\n",
+    "print(\"=\" * 60)\n",
+    "print(\"\\nThis notebook will:\")\n",
+    "print(\"  ✅ Validate production readiness\")\n",
+    "print(\"  ✅ Check deployment status and health\")\n",
+    "print(\"  ✅ Inspect resource usage and signals\")\n",
+    "print(\"  ✅ Verify storage and backup configuration\")\n",
+    "print(\"  ✅ Collect state snapshots (saved locally)\")\n",
+    "print(\"\\nThis notebook will NOT:\")\n",
+    "print(\"  ❌ Modify Helm values or releases\")\n",
+    "print(\"  ❌ Create or update secrets\")\n",
+    "print(\"  ❌ Restart pods or deployments\")\n",
+    "print(\"  ❌ Change any infrastructure\")\n",
+    "print(\"  ❌ Modify any cluster resources\")\n",
+    "print(\"\\nAll operations are read-only:\")\n",
+    "print(\"  - kubectl get (read resources)\")\n",
+    "print(\"  - kubectl logs (read logs)\")\n",
+    "print(\"  - kubectl top (read metrics)\")\n",
+    "print(\"  - helm get values (read configuration)\")\n",
+    "print(\"  - Write artifacts to local directory only\")\n",
+    "print(\"\\n\" + \"=\" * 60)\n",
+    "\n",
+    "ok(\"Environment safety check complete\")\n",
+    "print(\"\\n✅ Safe to proceed with read-only validation\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration\n",
+    "\n",
+    "Load and validate configuration from environment variables. All secrets are redacted in output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import json\n",
+    "from shared._validation import require_env, print_config, redact, ok, warn\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "\n",
+    "# Required configuration variables\n",
+    "required_vars = [\n",
+    "    \"NAMESPACE\",\n",
+    "    \"CLUSTER_NAME\",\n",
+    "]\n",
+    "\n",
+    "# Optional but recommended\n",
+    "optional_vars = [\n",
+    "    \"HELM_RELEASE\",\n",
+    "    \"LANGSMITH_DOMAIN\",\n",
+    "]\n",
+    "\n",
+    "print(\"### Loading Configuration\\n\")\n",
+    "\n",
+    "# Load required variables\n",
+    "config = {}\n",
+    "missing = []\n",
+    "\n",
+    "for var in required_vars:\n",
+    "    value = os.environ.get(var, \"\").strip()\n",
+    "    if not value:\n",
+    "        missing.append(var)\n",
+    "    config[var] = value\n",
+    "\n",
+    "if missing:\n",
+    "    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
+    "                      f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
+    "\n",
+    "# Load optional variables\n",
+    "for var in optional_vars:\n",
+    "    config[var] = os.environ.get(var, \"\").strip()\n",
+    "\n",
+    "# Set defaults\n",
+    "if not config.get(\"HELM_RELEASE\"):\n",
+    "    config[\"HELM_RELEASE\"] = \"langsmith\"\n",
+    "\n",
+    "# Print configuration (redacted)\n",
+    "print_config(config, redact_keys=set())\n",
+    "\n",
+    "# Show cloud provider info\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"\\n### Current {provider_display} Session\")\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Preflight Checks\n",
+    "\n",
+    "Verify kubectl context, namespace exists, and deployments are ready.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "from shared._k8s_helpers import require_namespace, namespace_exists, wait_for_deployments_ready\n",
+    "from shared._shell import run\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "helm_release = config[\"HELM_RELEASE\"]\n",
+    "\n",
+    "print(\"### Preflight Checks\\n\")\n",
+    "\n",
+    "# Check kubectl is available\n",
+    "print(\"1. Checking kubectl...\")\n",
+    "# --short was removed in recent kubectl releases; plain --client works across versions\n",
+    "result = run([\"kubectl\", \"version\", \"--client\"], check=False, stream=False)\n",
+    "if result.returncode == 0:\n",
+    "    ok(\"kubectl is available\")\n",
+    "    print(f\"   {result.stdout.strip()}\")\n",
+    "else:\n",
+    "    raise RuntimeError(\"❌ kubectl is not available or not working\")\n",
+    "\n",
+    "# Check kubectl context\n",
+    "print(\"\\n2. Checking kubectl context...\")\n",
+    "result = run([\"kubectl\", \"config\", \"current-context\"], check=False, stream=False)\n",
+    "if result.returncode == 0:\n",
+    "    context = result.stdout.strip()\n",
+    "    ok(f\"Current context: {context}\")\n",
+    "else:\n",
+    "    warn(\"Could not determine kubectl context\")\n",
+    "    print(\"   💡 Run: kubectl config get-contexts\")\n",
+    "\n",
+    "# Check namespace exists\n",
+    "print(f\"\\n3. Checking namespace '{namespace}'...\")\n",
+    "if namespace_exists(namespace):\n",
+    "    ok(f\"Namespace '{namespace}' exists\")\n",
+    "else:\n",
+    "    raise RuntimeError(f\"❌ Namespace '{namespace}' does not exist. Complete Module 1 first.\")\n",
+    "\n",
+    "# Check deployments are ready\n",
+    "print(\"\\n4. Checking deployments...\")\n",
+    "require_namespace(namespace)\n",
+    "\n",
+    "try:\n",
+    "    wait_for_deployments_ready(namespace, timeout=\"2m\")\n",
+    "    ok(\"All deployments ready\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Some deployments may not be ready: {e}\")\n",
+    "    print(f\"   💡 Check pod status manually: kubectl get pods -n {namespace}\")\n",
+    "\n",
+    "ok(\"Preflight checks complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Snapshot Current State\n",
+    "\n",
+    "Capture current cluster state for baseline reference.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "from shared._k8s_helpers import get_pods\n",
+    "from shared._shell import run\n",
+    "\n",
+    "print(\"### Snapshotting Current State\\n\")\n",
+    "\n",
+    "timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
+    "snapshot_dir = artifacts_dir / f\"ops-snapshot-{timestamp}\"\n",
+    "snapshot_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"Saving snapshot to: {snapshot_dir}\\n\")\n",
+    "\n",
+    "# 1. Get all resources\n",
+    "print(\"1. Capturing all resources...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(snapshot_dir / \"all-resources.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(\"All resources captured\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Could not capture all resources\")\n",
+    "\n",
+    "# 2. Get events (sorted by timestamp)\n",
+    "print(\"\\n2. Capturing recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(snapshot_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(\"Events captured\")\n",
+    "    \n",
+    "    # Show recent events\n",
+    "    lines = result.stdout.strip().split(\"\\n\")\n",
+    "    if len(lines) > 1:\n",
+    "        print(f\"\\n   Last 10 events:\")\n",
+    "        for line in lines[-10:]:\n",
+    "            print(f\"   {line}\")\n",
+    "else:\n",
+    "    warn(\"Could not capture events\")\n",
+    "\n",
+    "# 3. Get node and pod resource usage (if metrics available)\n",
+    "print(\"\\n3. Checking resource usage...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"nodes\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(snapshot_dir / \"node-usage.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(\"Node usage captured\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Node metrics not available (metrics-server may not be installed)\")\n",
+    "\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(snapshot_dir / \"pod-usage.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(\"Pod usage captured\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Pod metrics not available\")\n",
+    "\n",
+    "# 4. Check for data store services\n",
+    "print(\"\\n4. Checking data store services...\")\n",
+    "data_stores = {\n",
+    "    \"postgres\": [\"postgres\", \"postgresql\", \"database\", \"db\"],\n",
+    "    \"redis\": [\"redis\", \"cache\"],\n",
+    "    \"clickhouse\": [\"clickhouse\"],  # a bare \"ch\" keyword would also match names like \"cache\"\n",
+    "}\n",
+    "\n",
+    "found_stores = []\n",
+    "# Fetch the service list once, then match each service name against every store's keywords\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"svc\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    services = json.loads(result.stdout)\n",
+    "    for svc in services.get(\"items\", []):\n",
+    "        name = svc.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
+    "        for store_type, keywords in data_stores.items():\n",
+    "            if any(keyword in name for keyword in keywords):\n",
+    "                found_stores.append((store_type, name))\n",
+    "                print(f\"   ✅ Found {store_type} service: {name}\")\n",
+    "\n",
+    "if found_stores:\n",
+    "    ok(f\"Found {len(found_stores)} data store service(s)\")\n",
+    "else:\n",
+    "    warn(\"No in-cluster data stores found (may be using managed services)\")\n",
+    "\n",
+    "ok(f\"State snapshot saved to: {snapshot_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
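+   "source": [
+    "## 4. Early Warning Signals\n",
+    "\n",
+    "Check for pod restarts, pending pods, resource saturation, and common failure patterns in recent logs.\n"
+   ]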
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "from shared._shell import run\n",
+    "\n",
+    "print(\"### Early Warning Signal Checks\\n\")\n",
+    "\n",
+    "# 1. Check pod restarts\n",
+    "print(\"1. Checking pod restart counts...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "critical_restarts = []\n",
+    "warning_restarts = []\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        status = pod.get(\"status\", {})\n",
+    "        phase = status.get(\"phase\", \"\")\n",
+    "        \n",
+    "        # Check restart count and CrashLoopBackOff, which is a container\n",
+    "        # waiting reason (status.phase only reports Pending/Running/Succeeded/Failed/Unknown)\n",
+    "        container_statuses = status.get(\"containerStatuses\", [])\n",
+    "        for cs in container_statuses:\n",
+    "            restart_count = cs.get(\"restartCount\", 0)\n",
+    "            waiting_reason = cs.get(\"state\", {}).get(\"waiting\", {}).get(\"reason\", \"\")\n",
+    "            if waiting_reason == \"CrashLoopBackOff\":\n",
+    "                critical_restarts.append((name, \"CrashLoopBackOff\"))\n",
+    "            elif restart_count > 5:\n",
+    "                critical_restarts.append((name, restart_count))\n",
+    "            elif restart_count > 2:\n",
+    "                warning_restarts.append((name, restart_count))\n",
+    "        \n",
+    "        # Check pod phase\n",
+    "        if phase == \"Pending\":\n",
+    "            # Check how long it's been pending\n",
+    "            conditions = status.get(\"conditions\", [])\n",
+    "            for cond in conditions:\n",
+    "                if cond.get(\"type\") == \"PodScheduled\" and cond.get(\"status\") != \"True\":\n",
+    "                    # Pod is pending\n",
+    "                    warning_restarts.append((name, \"Pending\"))\n",
+    "\n",
+    "if critical_restarts:\n",
+    "    warn(f\"❌ Critical: Found {len(critical_restarts)} pod(s) with critical issues\")\n",
+    "    for pod_name, issue in critical_restarts:\n",
+    "        print(f\"   - {pod_name}: {issue}\")\n",
+    "    print(\"\\n   💡 Action required: Check pod logs and events\")\n",
+    "elif warning_restarts:\n",
+    "    warn(f\"Found {len(warning_restarts)} pod(s) with warnings\")\n",
+    "    for pod_name, issue in warning_restarts:\n",
+    "        print(f\"   - {pod_name}: {issue}\")\n",
+    "    print(\"\\n   💡 Monitor these pods closely\")\n",
+    "else:\n",
+    "    ok(\"No critical pod restart issues found\")\n",
+    "\n",
+    "# 2. Check for pending pods\n",
+    "print(\"\\n2. Checking for pending pods...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"--field-selector=status.phase=Pending\", \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    pending = pods.get(\"items\", [])\n",
+    "    if pending:\n",
+    "        warn(f\"Found {len(pending)} pending pod(s)\")\n",
+    "        for pod in pending:\n",
+    "            name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            print(f\"   - {name}\")\n",
+    "        print(f\"\\n   💡 Check events: kubectl describe pod <name> -n {namespace}\")\n",
+    "    else:\n",
+    "        ok(\"No pending pods\")\n",
+    "\n",
+    "# 3. Check resource saturation (if metrics available)\n",
+    "print(\"\\n3. Checking resource saturation...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    lines = result.stdout.strip().split(\"\\n\")[1:]  # Skip header\n",
+    "    saturated_pods = []\n",
+    "    \n",
+    "    for line in lines:\n",
+    "        parts = line.split()\n",
+    "        if len(parts) >= 3:\n",
+    "            pod_name = parts[0]\n",
+    "            cpu = parts[1]\n",
+    "            memory = parts[2]\n",
+    "            \n",
+    "            # Parse CPU (handle \"m\" suffix for millicores)\n",
+    "            try:\n",
+    "                if cpu.endswith(\"m\"):\n",
+    "                    cpu_val = int(cpu[:-1])\n",
+    "                else:\n",
+    "                    # A value without the \"m\" suffix is in whole cores; convert to millicores\n",
+    "                    cpu_val = int(float(cpu) * 1000)\n",
+    "                \n",
+    "                # Parse memory (handle \"Mi\" or \"Gi\" suffix)\n",
+    "                if \"Gi\" in memory:\n",
+    "                    mem_val = float(memory.replace(\"Gi\", \"\")) * 1024\n",
+    "                elif \"Mi\" in memory:\n",
+    "                    mem_val = float(memory.replace(\"Mi\", \"\"))\n",
+    "                else:\n",
+    "                    mem_val = 0\n",
+    "                \n",
+    "                # Check thresholds (simplified - would need requests/limits for accurate %)\n",
+    "                # For now, just flag very high absolute values\n",
+    "                if cpu_val > 2000:  # > 2 cores\n",
+    "                    saturated_pods.append((pod_name, f\"High CPU: {cpu}\"))\n",
+    "                if mem_val > 4096:  # > 4 Gi\n",
+    "                    saturated_pods.append((pod_name, f\"High Memory: {memory}\"))\n",
+    "            except (ValueError, IndexError):\n",
+    "                pass\n",
+    "    \n",
+    "    if saturated_pods:\n",
+    "        warn(f\"Found {len(saturated_pods)} pod(s) with high resource usage\")\n",
+    "        for pod_name, issue in saturated_pods:\n",
+    "            print(f\"   - {pod_name}: {issue}\")\n",
+    "        print(\"\\n   💡 Review resource requests/limits and consider scaling\")\n",
+    "    else:\n",
+    "        ok(\"No obvious resource saturation detected\")\n",
+    "else:\n",
+    "    warn(\"Resource metrics not available (cannot check saturation)\")\n",
+    "\n",
+    "# 4. Check logs for common failure patterns\n",
+    "print(\"\\n4. Checking logs for common failure patterns...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "failure_patterns = {\n",
+    "    \"connection refused\": [],\n",
+    "    \"timeout\": [],\n",
+    "    \"out of memory\": [],\n",
+    "    \"database error\": [],\n",
+    "}\n",
+    "\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    pod_names = result.stdout.strip().split()\n",
+    "    # Check API and worker pods (most likely to have issues)\n",
+    "    api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
+    "    worker_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"worker\", \"processor\"])]\n",
+    "    \n",
+    "    pods_to_check = (api_pods[:2] if api_pods else []) + (worker_pods[:2] if worker_pods else [])\n",
+    "    \n",
+    "    for pod_name in pods_to_check[:4]:  # Check up to 4 pods\n",
+    "        try:\n",
+    "            log_result = run(\n",
+    "                [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=50\"],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            \n",
+    "            if log_result.returncode == 0:\n",
+    "                logs_lower = log_result.stdout.lower()\n",
+    "                \n",
+    "                for pattern, matches in failure_patterns.items():\n",
+    "                    if pattern in logs_lower:\n",
+    "                        # Check if it's actually an error (not just a log message)\n",
+    "                        lines = log_result.stdout.split(\"\\n\")\n",
+    "                        error_lines = [line for line in lines \n",
+    "                                     if pattern in line.lower() \n",
+    "                                     and any(err in line.lower() for err in [\"error\", \"fail\", \"refused\", \"timeout\"])]\n",
+    "                        \n",
+    "                        if error_lines:\n",
+    "                            matches.append((pod_name, len(error_lines)))\n",
+    "        except Exception as e:\n",
+    "            warn(f\"Could not read logs for {pod_name}: {e}\")\n",
+    "    \n",
+    "    found_issues = False\n",
+    "    for pattern, matches in failure_patterns.items():\n",
+    "        if matches:\n",
+    "            found_issues = True\n",
+    "            warn(f\"Found '{pattern}' pattern in {len(matches)} pod(s)\")\n",
+    "            for pod_name, count in matches:\n",
+    "                print(f\"   - {pod_name}: {count} occurrence(s)\")\n",
+    "    \n",
+    "    if not found_issues:\n",
+    "        ok(\"No common failure patterns found in recent logs\")\n",
+    "    else:\n",
+    "        print(f\"\\n   💡 Review pod logs for details: kubectl logs <pod> -n {namespace} --tail=100\")\n",
+    "else:\n",
+    "    warn(\"Could not retrieve pod names for log checking\")\n",
+    "\n",
+    "ok(\"Early warning signal checks complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
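+   "source": [
+    "## 5. Storage / Durability Checks\n",
+    "\n",
+    "Verify that blob storage is configured and look for backup configuration indicators.\n"
+   ]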
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "from shared._shell import run\n",
+    "\n",
+    "print(\"### Storage / Durability Checks\\n\")\n",
+    "\n",
+    "# 1. Check blob storage configuration\n",
+    "print(\"1. Checking blob storage configuration...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "blob_storage_configured = False\n",
+    "blob_storage_provider = None\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    deployments = json.loads(result.stdout)\n",
+    "    for deployment in deployments.get(\"items\", []):\n",
+    "        containers = deployment.get(\"spec\", {}).get(\"template\", {}).get(\"spec\", {}).get(\"containers\", [])\n",
+    "        \n",
+    "        for container in containers:\n",
+    "            env_vars = container.get(\"env\", [])\n",
+    "            for env in env_vars:\n",
+    "                env_name = env.get(\"name\", \"\").upper()\n",
+    "                env_value = env.get(\"value\", \"\")\n",
+    "                \n",
+    "                # Check for blob storage configuration\n",
+    "                if \"BLOB\" in env_name or \"S3\" in env_name or \"STORAGE\" in env_name:\n",
+    "                    if \"PROVIDER\" in env_name:\n",
+    "                        blob_storage_provider = env_value\n",
+    "                        blob_storage_configured = True\n",
+    "                    elif env_value and env_value not in [\"local\", \"filesystem\", \"\"]:\n",
+    "                        blob_storage_configured = True\n",
+    "\n",
+    "# Also check Helm values if accessible\n",
+    "helm_release = config.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "result = run(\n",
+    "    [\"helm\", \"get\", \"values\", helm_release, \"-n\", namespace, \"--output\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    try:\n",
+    "        values = json.loads(result.stdout)\n",
+    "        values_str = str(values).lower()\n",
+    "        \n",
+    "        # Look for blob storage configuration\n",
+    "        if \"blob\" in values_str or \"s3\" in values_str:\n",
+    "            if \"local\" not in values_str and \"filesystem\" not in values_str:\n",
+    "                blob_storage_configured = True\n",
+    "                if \"s3\" in values_str:\n",
+    "                    blob_storage_provider = \"s3\"\n",
+    "                elif \"azure\" in values_str:\n",
+    "                    blob_storage_provider = \"azure\"\n",
+    "    except json.JSONDecodeError:\n",
+    "        pass\n",
+    "\n",
+    "if blob_storage_configured:\n",
+    "    if blob_storage_provider:\n",
+    "        ok(f\"Blob storage configured: {blob_storage_provider}\")\n",
+    "    else:\n",
+    "        ok(\"Blob storage appears configured (provider not detected)\")\n",
+    "    print(\"   💡 Verify blob storage is NOT using local filesystem in production\")\n",
+    "else:\n",
+    "    warn(\"❌ CRITICAL: Blob storage may not be configured\")\n",
+    "    print(\"   💡 Blob storage is REQUIRED for production\")\n",
+    "    print(\"   💡 Without it, ClickHouse will become unusable under load\")\n",
+    "    print(\"   💡 Configure S3 (AWS) or Azure Blob Storage (Azure)\")\n",
+    "    print(\"   💡 Check Helm values: helm get values <release> -n <namespace>\")\n",
+    "\n",
+    "# 2. Check for backup configuration indicators\n",
+    "print(\"\\n2. Checking backup configuration...\")\n",
+    "print(\"   Note: Backup configuration verification depends on deployment type\")\n",
+    "\n",
+    "# For managed services, we can't verify from cluster\n",
+    "# For in-cluster services, we can check for backup jobs\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"cronjobs,jobs\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "backup_jobs_found = False\n",
+    "if result.returncode == 0:\n",
+    "    resources = json.loads(result.stdout)\n",
+    "    for item in resources.get(\"items\", []):\n",
+    "        name = item.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
+    "        if \"backup\" in name:\n",
+    "            backup_jobs_found = True\n",
+    "            print(f\"   ✅ Found backup job: {name}\")\n",
+    "\n",
+    "if backup_jobs_found:\n",
+    "    ok(\"Backup jobs found in cluster\")\n",
+    "else:\n",
+    "    warn(\"No backup jobs found in cluster\")\n",
+    "    print(\"   💡 For managed services (RDS, Azure Database), backups are automated\")\n",
+    "    print(\"   💡 Verify backups in cloud provider console:\")\n",
+    "    if provider == \"aws\":\n",
+    "        print(\"      - AWS RDS: Check automated backups in RDS console\")\n",
+    "        print(\"      - AWS ElastiCache: Check snapshot configuration\")\n",
+    "    elif provider == \"azure\":\n",
+    "        print(\"      - Azure Database: Check backup configuration in Azure portal\")\n",
+    "        print(\"      - Azure Cache: Check backup configuration\")\n",
+    "    print(\"   💡 For in-cluster ClickHouse, configure backup CronJob\")\n",
+    "\n",
+    "# 3. Check PVCs (for in-cluster storage)\n",
+    "print(\"\\n3. Checking persistent volume claims...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pvc\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pvcs = json.loads(result.stdout)\n",
+    "    pvc_items = pvcs.get(\"items\", [])\n",
+    "    \n",
+    "    if pvc_items:\n",
+    "        ok(f\"Found {len(pvc_items)} PVC(s)\")\n",
+    "        for pvc in pvc_items:\n",
+    "            name = pvc.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            status = pvc.get(\"status\", {}).get(\"phase\", \"\")\n",
+    "            size = pvc.get(\"spec\", {}).get(\"resources\", {}).get(\"requests\", {}).get(\"storage\", \"N/A\")\n",
+    "            print(f\"   - {name}: {status}, {size}\")\n",
+    "        \n",
+    "        # Check for unbound PVCs\n",
+    "        unbound = [pvc for pvc in pvc_items if pvc.get(\"status\", {}).get(\"phase\") != \"Bound\"]\n",
+    "        if unbound:\n",
+    "            warn(f\"Found {len(unbound)} unbound PVC(s)\")\n",
+    "            print(\"   💡 Check storage class and node capacity\")\n",
+    "    else:\n",
+    "        print(\"   No PVCs found (may be using managed services or ephemeral storage)\")\n",
+    "\n",
+    "ok(\"Storage / durability checks complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Sidecar Checks (Istio)\n",
+    "\n",
+    "Detect if Istio sidecars are present and provide guidance on log access.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "from shared._shell import run\n",
+    "\n",
+    "print(\"### Sidecar Checks (Istio)\\n\")\n",
+    "\n",
+    "# Check if Istio is installed (check for istiod)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployment\", \"-A\", \"-o\", \"jsonpath={.items[?(@.metadata.name==\\\"istiod\\\")].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "istio_installed = False\n",
+    "if result.returncode == 0 and result.stdout.strip():\n",
+    "    istio_installed = True\n",
+    "    ok(\"Istio appears to be installed\")\n",
+    "else:\n",
+    "    print(\"   Istio not detected (no istiod deployment found in any namespace)\")\n",
+    "    print(\"   💡 Sidecar checks will be skipped\")\n",
+    "\n",
+    "if istio_installed:\n",
+    "    print(\"\\n1. Checking namespace-level sidecar injection...\")\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    namespace_injection = False\n",
+    "    if result.returncode == 0:\n",
+    "        ns = json.loads(result.stdout)\n",
+    "        labels = ns.get(\"metadata\", {}).get(\"labels\", {})\n",
+    "        if labels.get(\"istio-injection\") == \"enabled\" or labels.get(\"istio.io/rev\") or labels.get(\"istio-discovery\") == \"enabled\":\n",
+    "            namespace_injection = True\n",
+    "            ok(\"Namespace-level sidecar injection enabled\")\n",
+    "            print(f\"   Labels: {labels}\")\n",
+    "        else:\n",
+    "            print(\"   Namespace-level injection not enabled\")\n",
+    "            print(\"   💡 Sidecars may be injected per-workload\")\n",
+    "    \n",
+    "    # Check for sidecars in pods\n",
+    "    print(\"\\n2. Checking for sidecars in pods...\")\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    pods_with_sidecars = []\n",
+    "    pods_without_sidecars = []\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        pods = json.loads(result.stdout)\n",
+    "        for pod in pods.get(\"items\", []):\n",
+    "            name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "            container_names = [c.get(\"name\", \"\") for c in containers]\n",
+    "            \n",
+    "            if \"istio-proxy\" in container_names:\n",
+    "                pods_with_sidecars.append((name, container_names))\n",
+    "            else:\n",
+    "                pods_without_sidecars.append((name, container_names))\n",
+    "    \n",
+    "    if pods_with_sidecars:\n",
+    "        ok(f\"Found {len(pods_with_sidecars)} pod(s) with sidecars\")\n",
+    "        print(\"\\n   Pods with sidecars:\")\n",
+    "        for pod_name, containers in pods_with_sidecars[:5]:  # Show first 5\n",
+    "            app_containers = [c for c in containers if c != \"istio-proxy\"]\n",
+    "            print(f\"   - {pod_name}: {', '.join(app_containers)} + istio-proxy\")\n",
+    "        \n",
+    "        print(\"\\n   💡 Important: When fetching logs, specify container name:\")\n",
+    "        print(\"      kubectl logs <pod> -n <namespace> -c <container-name>\")\n",
+    "        print(\"      kubectl logs <pod> -n <namespace> -c istio-proxy  # for proxy logs\")\n",
+    "        print(\"      kubectl logs <pod> -n <namespace> --all-containers=true  # for all logs\")\n",
+    "        print(\"\\n   ⚠️  If logs appear missing, you're likely looking at the wrong container!\")\n",
+    "        \n",
+    "        if pods_without_sidecars:\n",
+    "            warn(f\"Found {len(pods_without_sidecars)} pod(s) without sidecars\")\n",
+    "            print(\"   💡 These pods may need sidecar injection or are opted out\")\n",
+    "    else:\n",
+    "        if namespace_injection:\n",
+    "            warn(\"No pods with sidecars found (may need pod restart)\")\n",
+    "            print(\"   💡 Existing pods require restart to get sidecars\")\n",
+    "        else:\n",
+    "            print(\"   No sidecars detected (Istio may not be used or injection disabled)\")\n",
+    "\n",
+    "ok(\"Sidecar checks complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "### ✅ Sanity Checks Complete\n",
+    "\n",
+    "This notebook has validated:\n",
+    "- ✅ Configuration loaded\n",
+    "- ✅ Preflight checks passed\n",
+    "- ✅ Current state snapshotted\n",
+    "- ✅ Early warning signals checked\n",
+    "- ✅ Storage/durability verified\n",
+    "- ✅ Sidecar status checked (if applicable)\n",
+    "\n",
+    "### 🎯 Next Steps\n",
+    "\n",
+    "1. **Review production readiness checklist:**\n",
+    "   - See `docs/shared/production_readiness_checklist.md`\n",
+    "   - Address any gaps identified\n",
+    "\n",
+    "2. **Review signals and thresholds:**\n",
+    "   - See `docs/shared/ops_signals_and_thresholds.md`\n",
+    "   - Configure alerts based on thresholds\n",
+    "\n",
+    "3. **Review sidecar documentation (if using Istio):**\n",
+    "   - See `docs/shared/sidecars_and_service_mesh.md`\n",
+    "   - Verify ServiceEntry configuration for external databases\n",
+    "\n",
+    "4. **Document your baselines:**\n",
+    "   - Record current resource usage\n",
+    "   - Document scaling thresholds\n",
+    "   - Update runbooks with findings\n",
+    "\n",
+    "### 📋 Common Issues Found\n",
+    "\n",
+    "If checks failed, common issues include:\n",
+    "- Blob storage not configured (CRITICAL for production)\n",
+    "- Pods restarting (check logs and resource limits)\n",
+    "- Pending pods (check node capacity and PVC binding)\n",
+    "- High resource usage (review requests/limits)\n",
+    "- Missing backups (verify in cloud console)\n",
+    "\n",
+    "### 🔍 Evidence for Support\n",
+    "\n",
+    "When contacting support, include:\n",
+    "- State snapshot from this notebook\n",
+    "- Pod logs (from correct container if sidecars enabled)\n",
+    "- Recent events\n",
+    "- Resource usage metrics\n",
+    "- Configuration summary (redacted)\n",
+    "\n",
+    "See `docs/shared/ops_signals_and_thresholds.md` for escalation evidence requirements.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,666 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Diagnostics Baseline\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This notebook teaches \"baseline first\" discipline.** Before introducing failures or debugging issues, you must capture what \"good\" looks like. This baseline becomes your reference point for all troubleshooting.\n",
+    "\n",
+    "**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. However, Module 4 failure labs will modify your environment. Ensure you've completed the safety check in `../shared/00_setup_or_resume_environment.ipynb` before proceeding.\n",
+    "\n",
+    "**What This Notebook Does:**\n",
+    "1. Captures cluster state snapshot (pods, services, deployments)\n",
+    "2. Collects recent events and resource usage\n",
+    "3. Runs the canonical diagnostics script\n",
+    "4. Performs basic health checks\n",
+    "5. Saves everything to a timestamped directory\n",
+    "\n",
+    "**Why This Matters:**\n",
+    "- You need \"before\" to compare to \"after\"\n",
+    "- Support will ask for baseline diagnostics\n",
+    "- Good debugging starts with understanding normal state\n",
+    "- Evidence collection is time-sensitive\n",
+    "\n",
+    "**Estimated time:** 15-20 minutes\n",
+    "\n",
+    "**Important:**\n",
+    "- Run this notebook BEFORE starting any failure labs. It's your evidence baseline.\n",
+    "- This notebook is read-only and safe to run.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Add notebooks directory to path so we can import shared as a package\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,  # If cwd is module-4, go up one level to notebooks\n",
+    "    Path.cwd(),  # If cwd is already notebooks\n",
+    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "\n",
+    "# Create timestamped directory for this baseline\n",
+    "timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
+    "baseline_dir = artifacts_dir / \"module-4\" / f\"baseline-{timestamp}\"\n",
+    "baseline_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"\\nBaseline directory: {baseline_dir}\")\n",
+    "print(\"All diagnostics will be saved here.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Safety Check: Environment Verification\n",
+    "\n",
+    "Verify that you are in a safe environment before collecting the baseline. Module 4 failure labs will modify your environment.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Safety check: Verify environment is safe for Module 4\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn\n",
+    "import os\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"### Environment Safety Check\\n\")\n",
+    "\n",
+    "# Show environment details\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
+    "    print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "\n",
+    "# Show environment variables\n",
+    "print(\"\\n### Environment Variables\")\n",
+    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
+    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
+    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
+    "\n",
+    "# Check for Module 4 safety flag\n",
+    "module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
+    "if module4_safe in [\"true\", \"yes\", \"1\"]:\n",
+    "    ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
+    "    print(\"   ✅ Environment verified as safe for Module 4 failure labs\")\n",
+    "else:\n",
+    "    warn(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
+    "    print(\"   💡 This notebook is read-only, but failure labs require this flag\")\n",
+    "    print(\"   💡 Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
+    "    print(\"   💡 Complete safety check in ../shared/00_setup_or_resume_environment.ipynb first\")\n",
+    "\n",
+    "print(\"\\n⚠️  REMINDER: This notebook is read-only.\")\n",
+    "print(\"   Failure labs in Module 4 will modify secrets and cause disruptions.\")\n",
+    "print(\"   Only run failure labs in TEST/NON-PRODUCTION environments.\")\n",
+    "\n",
+    "ok(\"Environment check complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration\n",
+    "\n",
+    "Load and validate configuration from environment variables.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import ok, warn\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region\n",
+    "\n",
+    "# Required configuration\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "\n",
+    "print(\"### Loading Configuration\\n\")\n",
+    "\n",
+    "config = {}\n",
+    "missing = []\n",
+    "\n",
+    "for var in required_vars:\n",
+    "    value = os.environ.get(var, \"\").strip()\n",
+    "    if not value:\n",
+    "        missing.append(var)\n",
+    "    config[var] = value\n",
+    "\n",
+    "if missing:\n",
+    "    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
+    "                      f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
+    "\n",
+    "# Optional but recommended\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "# Show cloud provider info\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "\n",
+    "print(f\"Cloud Provider: {provider.upper()}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "    print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Cluster State Snapshot\n",
+    "\n",
+    "Capture a complete snapshot of all resources in the namespace. This is your \"before\" picture.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Capturing Cluster State Snapshot\\n\")\n",
+    "\n",
+    "# Get all resources\n",
+    "print(\"1. Collecting all resources...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    snapshot_file = baseline_dir / \"all-resources.txt\"\n",
+    "    with open(snapshot_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved resource snapshot to {snapshot_file.name}\")\n",
+    "    print(f\"   Resources captured: {len(result.stdout.splitlines())} lines\")\n",
+    "else:\n",
+    "    warn(\"Could not capture resource snapshot\")\n",
+    "\n",
+    "# Get all resources as YAML (more detailed)\n",
+    "print(\"\\n2. Collecting detailed YAML...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    yaml_file = baseline_dir / \"all-resources.yaml\"\n",
+    "    with open(yaml_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved detailed YAML to {yaml_file.name}\")\n",
+    "else:\n",
+    "    warn(\"Could not capture detailed YAML\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Key Deployments Description\n",
+    "\n",
+    "Get detailed information about key deployments.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Describing Key Deployments\\n\")\n",
+    "\n",
+    "# Get list of deployments\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    deployments = json.loads(result.stdout)\n",
+    "    deployment_items = deployments.get(\"items\", [])\n",
+    "    \n",
+    "    if deployment_items:\n",
+    "        ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
+    "        \n",
+    "        # Describe each deployment\n",
+    "        for deployment in deployment_items:\n",
+    "            name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            print(f\"\\nDescribing deployment: {name}\")\n",
+    "            \n",
+    "            result = run(\n",
+    "                [\"kubectl\", \"describe\", \"deployment\", name, \"-n\", namespace],\n",
+    "                check=False,\n",
+    "                stream=False\n",
+    "            )\n",
+    "            \n",
+    "            if result.returncode == 0:\n",
+    "                desc_file = baseline_dir / f\"deployment-{name}.txt\"\n",
+    "                with open(desc_file, \"w\") as f:\n",
+    "                    f.write(result.stdout)\n",
+    "                print(f\"   ✅ Saved description to {desc_file.name}\")\n",
+    "            else:\n",
+    "                warn(f\"Could not describe deployment {name}\")\n",
+    "    else:\n",
+    "        warn(\"No deployments found\")\n",
+    "else:\n",
+    "    warn(\"Could not list deployments\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Recent Events\n",
+    "\n",
+    "Capture recent events sorted by timestamp. Events often contain the first clues about what's happening.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Collecting Recent Events\\n\")\n",
+    "\n",
+    "# Get events sorted by timestamp\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    events_file = baseline_dir / \"events.txt\"\n",
+    "    with open(events_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved events to {events_file.name}\")\n",
+    "    \n",
+    "    # Count events (the first output line is the column header)\n",
+    "    lines = result.stdout.strip().split(\"\\n\")\n",
+    "    if len(lines) > 1:  # Header + events\n",
+    "        event_count = len(lines) - 1\n",
+    "        print(f\"   Captured {event_count} event(s)\")\n",
+    "        \n",
+    "        # Show last few events\n",
+    "        if event_count > 0:\n",
+    "            print(\"\\n   Last 5 events:\")\n",
+    "            for line in lines[1:][-5:]:\n",
+    "                if line.strip():\n",
+    "                    print(f\"   {line}\")\n",
+    "    else:\n",
+    "        print(\"   No events found (this is normal for a healthy cluster)\")\n",
+    "else:\n",
+    "    warn(\"Could not collect events\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Resource Usage\n",
+    "\n",
+    "Capture resource usage (CPU, memory) if metrics are available.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Collecting Resource Usage\\n\")\n",
+    "\n",
+    "# Top pods\n",
+    "print(\"1. Checking pod resource usage...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    top_pods_file = baseline_dir / \"top-pods.txt\"\n",
+    "    with open(top_pods_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved pod resource usage to {top_pods_file.name}\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Could not get pod resource usage (metrics server may not be available)\")\n",
+    "    print(\"   💡 This is OK - metrics are optional for baseline collection\")\n",
+    "\n",
+    "# Top nodes (if available)\n",
+    "print(\"\\n2. Checking node resource usage...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"top\", \"nodes\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    top_nodes_file = baseline_dir / \"top-nodes.txt\"\n",
+    "    with open(top_nodes_file, \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    ok(f\"Saved node resource usage to {top_nodes_file.name}\")\n",
+    "    print(result.stdout)\n",
+    "else:\n",
+    "    warn(\"Could not get node resource usage (metrics server may not be available)\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Canonical Diagnostics Script\n",
+    "\n",
+    "**This is the most important step.** Run the official LangChain diagnostics script that Support expects.\n",
+    "\n",
+    "The script captures:\n",
+    "- Pod logs (all containers)\n",
+    "- Events (sorted by timestamp)\n",
+    "- Resource usage (CPU, memory)\n",
+    "- Configuration (deployments, services, ingress)\n",
+    "- Storage (PVCs, storage classes)\n",
+    "- Network (services, endpoints)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "import subprocess\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "# URL to the canonical script\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = baseline_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "print(f\"1. Downloading script from: {script_url}\")\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    ok(f\"Downloaded script to {script_path.name}\")\n",
+    "    \n",
+    "    # Make executable\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    # Run the script\n",
+    "    print(f\"\\n2. Running diagnostics script for namespace: {namespace}\")\n",
+    "    print(\"   (This may take a few minutes...)\")\n",
+    "    \n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True  # Stream output so user can see progress\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed successfully\")\n",
+    "        \n",
+    "        # The script creates a tarball - find it\n",
+    "        diagnostics_tarball = None\n",
+    "        for file in baseline_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                diagnostics_tarball = file\n",
+    "                break\n",
+    "        \n",
+    "        if diagnostics_tarball:\n",
+    "            # Move it to our baseline directory\n",
+    "            target_path = baseline_dir / diagnostics_tarball.name\n",
+    "            diagnostics_tarball.rename(target_path)\n",
+    "            ok(f\"Diagnostics bundle saved to: {target_path.name}\")\n",
+    "            print(f\"   Size: {target_path.stat().st_size / 1024 / 1024:.2f} MB\")\n",
+    "        else:\n",
+    "            warn(\"Could not find diagnostics tarball (check script output above)\")\n",
+    "    else:\n",
+    "        warn(f\"Diagnostics script returned non-zero exit code: {result.returncode}\")\n",
+    "        print(\"   Check the output above for errors\")\n",
+    "        print(\"   💡 The script may still have collected useful information\")\n",
+    "        \n",
+    "except urllib.request.URLError as e:\n",
+    "    warn(f\"Could not download diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can download it manually and run it:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n",
+    "except Exception as e:\n",
+    "    warn(f\"Error running diagnostics script: {e}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Basic Health Check\n",
+    "\n",
+    "Perform a basic HTTP check to verify the LangSmith endpoint is reachable.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import urllib3\n",
+    "\n",
+    "# Disable SSL warnings for self-signed certs\n",
+    "urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
+    "\n",
+    "print(\"### Testing Endpoint Reachability\\n\")\n",
+    "\n",
+    "# Determine endpoint URL (stays None if no domain or ingress host is found)\n",
+    "test_url = None\n",
+    "if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "    test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
+    "else:\n",
+    "    # Fall back to the first host declared on any ingress in the namespace\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ingresses = json.loads(result.stdout)\n",
+    "        hosts = [\n",
+    "            rule.get(\"host\")\n",
+    "            for ingress in ingresses.get(\"items\", [])\n",
+    "            for rule in ingress.get(\"spec\", {}).get(\"rules\", [])\n",
+    "            if rule.get(\"host\")\n",
+    "        ]\n",
+    "        if hosts:\n",
+    "            test_url = f\"https://{hosts[0]}\"\n",
+    "\n",
+    "if test_url:\n",
+    "    print(f\"Testing: {test_url}\")\n",
+    "    try:\n",
+    "        response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
+    "        \n",
+    "        health_file = baseline_dir / \"endpoint-health.txt\"\n",
+    "        with open(health_file, \"w\") as f:\n",
+    "            f.write(f\"URL: {test_url}\\n\")\n",
+    "            f.write(f\"Status Code: {response.status_code}\\n\")\n",
+    "            f.write(f\"Response Headers:\\n{json.dumps(dict(response.headers), indent=2)}\\n\")\n",
+    "        \n",
+    "        if response.status_code in [200, 302, 401, 403]:\n",
+    "            ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
+    "            print(f\"   Response saved to {health_file.name}\")\n",
+    "        else:\n",
+    "            warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
+    "    except requests.exceptions.SSLError as e:\n",
+    "        warn(f\"TLS handshake failed: {e}\")\n",
+    "        print(\"   💡 Certificate verification is disabled above, so this is not a self-signed-certificate problem; check the TLS configuration on the ingress.\")\n",
+    "    except requests.exceptions.RequestException as e:\n",
+    "        warn(f\"Could not reach endpoint: {e}\")\n",
+    "        print(\"   💡 Endpoint may still be provisioning or DNS not configured\")\n",
+    "else:\n",
+    "    warn(\"No endpoint URL available for testing\")\n",
+    "    print(\"   💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. What Good Looks Like\n",
+    "\n",
+    "Quick validation checks to confirm the baseline is healthy.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "print(\"### Quick Health Validation\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "healthy_pods = 0\n",
+    "unhealthy_pods = []\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        phase = pod.get(\"status\", {}).get(\"phase\", \"\")\n",
+    "        container_statuses = pod.get(\"status\", {}).get(\"containerStatuses\", [])\n",
+    "        \n",
+    "        # A pod with no container statuses yet (e.g. still Pending) is not ready\n",
+    "        is_ready = bool(container_statuses) and all(\n",
+    "            cs.get(\"ready\", False) for cs in container_statuses\n",
+    "        )\n",
+    "        \n",
+    "        if phase == \"Running\" and is_ready:\n",
+    "            healthy_pods += 1\n",
+    "        else:\n",
+    "            unhealthy_pods.append((name, phase, is_ready))\n",
+    "    \n",
+    "    if unhealthy_pods:\n",
+    "        warn(f\"Found {len(unhealthy_pods)} pod(s) that are not healthy:\")\n",
+    "        for name, phase, ready in unhealthy_pods:\n",
+    "            print(f\"   - {name}: phase={phase}, ready={ready}\")\n",
+    "    else:\n",
+    "        ok(f\"All {healthy_pods} pod(s) are healthy and ready\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
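+    "# Check for CrashLoopBackOff (a container waiting reason, not a pod phase)\n",
+    "if unhealthy_pods:\n",
+    "    crash_loops = [\n",
+    "        pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        for pod in pods.get(\"items\", [])\n",
+    "        if any(\n",
+    "            cs.get(\"state\", {}).get(\"waiting\", {}).get(\"reason\") == \"CrashLoopBackOff\"\n",
+    "            for cs in pod.get(\"status\", {}).get(\"containerStatuses\", [])\n",
+    "        )\n",
+    "    ]\n",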
+    "    if crash_loops:\n",
+    "        warn(f\"Found {len(crash_loops)} pod(s) in CrashLoopBackOff:\")\n",
+    "        for name in crash_loops:\n",
+    "            print(f\"   - {name}\")\n",
+    "        print(\"   💡 Check pod logs to understand why they're crashing\")\n",
+    "\n",
+    "# Check for Pending pods\n",
+    "pending = [name for name, phase, _ in unhealthy_pods if phase == \"Pending\"]\n",
+    "if pending:\n",
+    "    warn(f\"Found {len(pending)} pod(s) in Pending state:\")\n",
+    "    for name in pending:\n",
+    "        print(f\"   - {name}\")\n",
+    "    print(\"   💡 Check events and resource availability\")\n",
+    "\n",
+    "print(\"\\n### Baseline Summary\\n\")\n",
+    "print(f\"✅ Baseline captured at: {timestamp}\")\n",
+    "print(f\"📁 Baseline directory: {baseline_dir}\")\n",
+    "print(f\"📊 Resources captured:\")\n",
+    "print(f\"   - Cluster state snapshot\")\n",
+    "print(f\"   - Deployment descriptions\")\n",
+    "print(f\"   - Recent events\")\n",
+    "print(f\"   - Resource usage (if available)\")\n",
+    "print(f\"   - Canonical diagnostics bundle\")\n",
+    "print(f\"   - Endpoint health check\")\n",
+    "\n",
+    "ok(\"Baseline collection complete!\")\n",
+    "print(\"\\n💡 Use this baseline as your reference point for all failure labs.\")\n",
+    "print(\"   Compare future diagnostics to this baseline to identify what changed.\")\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,944 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - PostgreSQL\n",
+    "\n",
+    "## ⚠️ CRITICAL SAFETY WARNING\n",
+    "\n",
+    "**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
+    "- **Modifies Kubernetes secrets** (breaks PostgreSQL password)\n",
+    "- **Causes service disruptions** (API failures, login failures)\n",
+    "- **Requires remediation** to restore functionality\n",
+    "\n",
+    "**REQUIREMENTS:**\n",
+    "- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
+    "- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
+    "- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
+    "- ✅ **Backup/restore plan** available\n",
+    "\n",
+    "**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug PostgreSQL connectivity failures in LangSmith.**\n",
+    "\n",
+    "PostgreSQL is LangSmith's primary metadata store. It holds:\n",
+    "- User accounts and workspaces\n",
+    "- Project definitions\n",
+    "- API keys and permissions\n",
+    "- Trace metadata (not the traces themselves, which go to ClickHouse)\n",
+    "\n",
+    "**When PostgreSQL fails, you'll see:**\n",
+    "- API endpoints return 5xx errors\n",
+    "- Login/authentication may fail\n",
+    "- UI may load but actions fail\n",
+    "- Connection exhaustion patterns in logs\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how PostgreSQL failures manifest\n",
+    "2. Practice collecting diagnostics for database issues\n",
+    "3. Learn to identify connection vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** \n",
+    "- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
+    "- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ⚠️ CRITICAL: Environment Safety Verification\n",
+    "\n",
+    "**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn, fail\n",
+    "import os\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"=\" * 70)\n",
+    "print(\"⚠️  CRITICAL SAFETY CHECK - POSTGRESQL FAILURE LAB\")\n",
+    "print(\"=\" * 70)\n",
+    "\n",
+    "# Show environment details prominently\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"\\n### Current Environment Configuration\")\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    account_id = identity.get('Account', 'N/A')\n",
+    "    user_arn = identity.get('Arn', 'N/A')\n",
+    "    print(f\"Account ID: {account_id}\")\n",
+    "    print(f\"User ARN: {user_arn}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
+    "    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "    print(f\"Subscription Name: {subscription_name}\")\n",
+    "\n",
+    "# Show all relevant environment variables\n",
+    "print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
+    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
+    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
+    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
+    "print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"⚠️  WHAT THIS LAB WILL DO:\")\n",
+    "print(\"=\" * 70)\n",
+    "print(\"\\nThis failure lab will:\")\n",
+    "print(\"  1. Find the PostgreSQL secret in your namespace\")\n",
+    "print(\"  2. BACKUP the original secret (saved to artifacts)\")\n",
+    "print(\"  3. MODIFY the secret to set an INVALID password\")\n",
+    "print(\"  4. Apply the modified secret (breaks database connectivity)\")\n",
+    "print(\"  5. Cause API failures and login failures\")\n",
+    "print(\"  6. Require remediation to restore (restore original secret)\")\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "\n",
+    "# Check for Module 4 safety flag\n",
+    "module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
+    "if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
+    "    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
+    "    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
+    "    print(\"\\nTo run this failure lab, you MUST:\")\n",
+    "    print(\"  1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
+    "    print(\"  2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
+    "    print(\"  3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
+    "    print(\"  4. Re-run this cell to confirm\")\n",
+    "    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
+    "    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
+    "\n",
+    "ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
+    "print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
+    "print(\"\\n⚠️  REMINDER: This lab will break PostgreSQL connectivity.\")\n",
+    "print(\"   Ensure you understand the remediation steps before proceeding.\")\n",
+    "print(\"   Original secret will be backed up automatically.\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"✅ Environment verified - ready for PostgreSQL failure lab\")\n",
+    "print(\"=\" * 70)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "PostgreSQL is LangSmith's **primary metadata store**. It holds:\n",
+    "\n",
+    "- **User accounts and authentication data**\n",
+    "- **Workspaces and projects** (organizational structure)\n",
+    "- **API keys and permissions** (access control)\n",
+    "- **Trace metadata** (not the trace data itself, which goes to ClickHouse)\n",
+    "- **Evaluation results and feedback**\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without PostgreSQL, users can't log in\n",
+    "- API calls fail (no authentication, no project lookups)\n",
+    "- UI loads but can't perform actions\n",
+    "- All LangSmith functionality depends on it\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Connection string stored in Kubernetes Secrets\n",
+    "- Connection pool managed by application\n",
+    "- Connection limits matter: PostgreSQL caps concurrent connections (the `max_connections` setting), so an exhausted pool can look like an outage\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When PostgreSQL Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **API 5xx errors:**\n",
+    "   - `/api/v1/...` endpoints return 500 or 503\n",
+    "   - Error messages mention \"database\" or \"connection\"\n",
+    "\n",
+    "2. **Login failures:**\n",
+    "   - Users can't authenticate\n",
+    "   - OIDC/SAML may work (redirects) but session creation fails\n",
+    "\n",
+    "3. **UI loads but actions fail:**\n",
+    "   - Pages render (static content)\n",
+    "   - API calls fail (can't load projects, traces, etc.)\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Connection timeout errors\n",
+    "   - \"connection refused\" or \"connection reset\"\n",
+    "   - \"too many connections\" (if connection pool exhausted)\n",
+    "   - \"authentication failed\" (if credentials wrong)\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms appear within seconds of failure\n",
+    "- API calls start failing immediately\n",
+    "- Existing connections may work briefly, then fail\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE option to practice with. Level 1 failures are subtle; Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong Database Password**\n",
+    "- Modify the PostgreSQL password in the Kubernetes Secret\n",
+    "- Symptoms: \"password authentication failed\" errors in logs (the TCP connection itself still succeeds)\n",
+    "\n",
+    "**Option B: Wrong Database Host**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures\n",
+    "\n",
+    "**Option C: Network Isolation (if NetworkPolicy supported)**\n",
+    "- Apply NetworkPolicy blocking egress to PostgreSQL\n",
+    "- Symptoms: Connection timeout, no route to host\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option D: Remove Secret Entirely**\n",
+    "- Delete the PostgreSQL connection secret\n",
+    "- Symptoms: Pods crash on startup, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for PostgreSQL secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "postgres_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"postgres\" in name.lower() or \"database\" in name.lower() or \"db\" in name.lower():\n",
+    "            postgres_secrets.append(name)\n",
+    "\n",
+    "if postgres_secrets:\n",
+    "    ok(f\"Found {len(postgres_secrets)} PostgreSQL-related secret(s)\")\n",
+    "    for secret_name in postgres_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No PostgreSQL secrets found\")\n",
+    "    print(\"   💡 PostgreSQL connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong Database Password\n",
+    "# This cell modifies the PostgreSQL password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find PostgreSQL secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "postgres_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Common patterns: postgres, database, db, langsmith-db\n",
+    "        if any(keyword in name.lower() for keyword in [\"postgres\", \"database\", \"db\"]):\n",
+    "            # Check if it has password-related keys\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]):\n",
+    "                postgres_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not postgres_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find PostgreSQL secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found PostgreSQL secret: {postgres_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "# Save original secret for restoration\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"postgres-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    f.write(result.stdout)\n",
+    "\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
+    "\n",
+    "# Parse YAML and modify password\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "\n",
+    "# Find password key (could be password, POSTGRES_PASSWORD, DB_PASSWORD, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
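+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Drop server-managed metadata so a stale resourceVersion cannot block `kubectl apply`\n",
+    "for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
+    "    secret_data.get(\"metadata\", {}).pop(field, None)\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"postgres-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",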
+    "\n",
+    "print(\"=\" * 70)\n",
+    "print(\"⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(\"=\" * 70)\n",
+    "print(f\"\\nThis will modify secret: {postgres_secret_name}\")\n",
+    "print(f\"Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"Backup saved to: {backup_file.name}\")\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"⚠️  FINAL WARNING BEFORE FAILURE INJECTION\")\n",
+    "print(\"=\" * 70)\n",
+    "print(\"\\nThis will:\")\n",
+    "print(\"  ❌ Break PostgreSQL connectivity\")\n",
+    "print(\"  ❌ Cause API 5xx errors\")\n",
+    "print(\"  ❌ Break login/authentication\")\n",
+    "print(\"  ❌ Disrupt LangSmith functionality\")\n",
+    "print(\"\\nTo apply the failure:\")\n",
+    "print(\"  1. Verify MODULE4_SAFE_ENVIRONMENT=true is set\")\n",
+    "print(\"  2. Verify you're in a TEST environment\")\n",
+    "print(\"  3. Uncomment the code in the next cell\")\n",
+    "print(\"  4. Run the next cell to apply\")\n",
+    "print(\"\\nTo restore after the lab:\")\n",
+    "print(f\"  - Use the backup file: {backup_file.name}\")\n",
+    "print(\"  - See the 'Remediation' section below\")\n",
+    "print(\"\\n\" + \"=\" * 70)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
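+   "source": [
+    "## 7. Do the Drill - Step 2 (continued): Apply the Modified Secret\n",
+    "\n",
+    "**The next cell is commented out on purpose.** Uncomment it only after the checks above pass.\n"
+   ]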
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - PostgreSQL password is now invalid\")\n",
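+    "# print(\"\\n💡 Running pods do not reload a changed Secret on their own.\")\n",
+    "# print(\"   Pods that read it from environment variables keep the old password until they restart:\")\n",
+    "# print(f\"   kubectl rollout restart deployment <deployment-name> -n {namespace}\")\n",
+    "# print(f\"   kubectl get pods -n {namespace} -w\")\n",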
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Pod logs for connection errors\n",
+    "2. API endpoint responses\n",
+    "3. UI behavior\n",
+    "4. Events for pod restarts\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"postgres-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Report restart counts (RESTARTS is the fourth column of `kubectl get pods`)\n",
+    "    pod_lines = [line for line in result.stdout.splitlines()[1:] if line.strip()]\n",
+    "    if pod_lines:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in pod_lines:\n",
+    "            parts = line.split()\n",
+    "            if len(parts) > 3:\n",
+    "                print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check API pod logs for database errors\n",
+    "print(\"\\n3. Checking API pod logs for database errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if api_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"api-pod-{api_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for database-related errors\n",
+    "        error_keywords = [\"database\", \"postgres\", \"connection\", \"timeout\", \"refused\", \"authentication\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found database-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious database errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find API pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for PostgreSQL issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check pod logs for connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <pod-name> | grep -i 'database\\\\|postgres\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl get secret {postgres_secret_name} -n {namespace} -o yaml\")\n",
+    "print(\"   (This prints the values too. Base64 is an encoding, not encryption, so don't share the raw output)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for pod restarts (indicates startup failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace}\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test database connectivity from a pod (if possible):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \\\\\")\n",
+    "print(\"     psql -h <db-host> -U <user> -d <database>\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for authentication/connection errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{postgres_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{postgres_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with database connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    db_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            db_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
+    "                                           for kw in [\"DB\", \"POSTGRES\", \"DATABASE\"])]\n",
+    "            if db_env:\n",
+    "                db_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if db_related_pods:\n",
+    "        print(f\"\\n   Pods with database environment variables:\")\n",
+    "        for pod_name in set(db_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
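+    "#     print(\"\\n💡 Pods that load the secret as environment variables only see it after a restart.\")\n",
+    "#     print(\"   Crash-looping pods recover on their next restart; otherwise restart them, e.g.:\")\n",
+    "#     print(f\"   kubectl rollout restart deployment/<deployment-name> -n {namespace}\")\n",
+    "#     print(f\"   Monitor pod status: kubectl get pods -n {namespace} -w\")\n",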
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "print(f\"   Backup file: {backup_file.name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in logs\n",
+    "print(\"\\nChecking for recent errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if api_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"database\", \"postgres\", \"connection\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in API logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a PostgreSQL issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **PostgreSQL connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Database name\n",
+    "   - Username (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Database migrations\n",
+    "   - Network policy changes\n",
+    "5. **Connection pool status:**\n",
+    "   - Current connections vs. max connections\n",
+    "   - Connection pool exhaustion patterns\n",
+    "6. **Database health (if accessible):**\n",
+    "   - PostgreSQL version\n",
+    "   - Active connections\n",
+    "   - Lock contention\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Pod logs with database errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- Database endpoint connectivity test\n",
+    "- Connection pool metrics (if available)\n",
+    "- PostgreSQL logs (if accessible via cloud provider)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **PostgreSQL failures manifest quickly** - API calls fail within seconds\n",
+    "2. **Logs are your friend** - Connection errors appear in pod logs immediately\n",
+    "3. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "4. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "5. **Diagnostics bundle is essential** - Support needs it for root cause analysis\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Changing multiple things at once (hard to identify root cause)\n",
+    "- ❌ Not collecting diagnostics before remediation\n",
+    "- ❌ Ignoring connection pool limits\n",
+    "- ❌ Not testing database connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the Redis, ClickHouse, or Blob Storage failure labs\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,941 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - Redis\n",
+    "\n",
+    "## ⚠️ CRITICAL SAFETY WARNING\n",
+    "\n",
+    "**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
+    "- **Modifies Kubernetes secrets** (breaks Redis password)\n",
+    "- **Causes service disruptions** (intermittent ingestion, worker backlog)\n",
+    "- **Requires remediation** to restore functionality\n",
+    "\n",
+    "**REQUIREMENTS:**\n",
+    "- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
+    "- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
+    "- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
+    "- ✅ **Backup/restore plan** available\n",
+    "\n",
+    "**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug Redis connectivity failures in LangSmith.**\n",
+    "\n",
+    "Redis is LangSmith's **cache and job queue**. It handles:\n",
+    "- Job queue for asynchronous trace processing\n",
+    "- Caching for frequently accessed data\n",
+    "- Rate limiting and session management\n",
+    "- Worker coordination\n",
+    "\n",
+    "**When Redis fails, you'll see:**\n",
+    "- Intermittent ingestion issues\n",
+    "- Latency spikes and retries\n",
+    "- Worker backlog (jobs piling up)\n",
+    "- Traces may be delayed or missing\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how Redis failures manifest\n",
+    "2. Practice collecting diagnostics for cache/queue issues\n",
+    "3. Learn to identify connection vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** \n",
+    "- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
+    "- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ⚠️ CRITICAL: Environment Safety Verification\n",
+    "\n",
+    "**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn, fail\n",
+    "import os\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"=\" * 70)\n",
+    "print(\"⚠️  CRITICAL SAFETY CHECK - REDIS FAILURE LAB\")\n",
+    "print(\"=\" * 70)\n",
+    "\n",
+    "# Show environment details prominently\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"\\n### Current Environment Configuration\")\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    account_id = identity.get('Account', 'N/A')\n",
+    "    user_arn = identity.get('Arn', 'N/A')\n",
+    "    print(f\"Account ID: {account_id}\")\n",
+    "    print(f\"User ARN: {user_arn}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
+    "    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "    print(f\"Subscription Name: {subscription_name}\")\n",
+    "\n",
+    "# Show all relevant environment variables\n",
+    "print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
+    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
+    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
+    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
+    "print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"⚠️  WHAT THIS LAB WILL DO:\")\n",
+    "print(\"=\" * 70)\n",
+    "print(\"\\nThis failure lab will:\")\n",
+    "print(\"  1. Find the Redis secret in your namespace\")\n",
+    "print(\"  2. BACKUP the original secret (saved to artifacts)\")\n",
+    "print(\"  3. MODIFY the secret to set an INVALID password\")\n",
+    "print(\"  4. Apply the modified secret (breaks Redis connectivity)\")\n",
+    "print(\"  5. Cause intermittent ingestion issues and worker backlog\")\n",
+    "print(\"  6. Require remediation to restore (restore original secret)\")\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "\n",
+    "# Check for Module 4 safety flag\n",
+    "module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
+    "if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
+    "    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
+    "    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
+    "    print(\"\\nTo run this failure lab, you MUST:\")\n",
+    "    print(\"  1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
+    "    print(\"  2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
+    "    print(\"  3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
+    "    print(\"  4. Re-run this cell to confirm\")\n",
+    "    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
+    "    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
+    "\n",
+    "ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
+    "print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
+    "print(\"\\n⚠️  REMINDER: This lab will break Redis connectivity.\")\n",
+    "print(\"   Ensure you understand the remediation steps before proceeding.\")\n",
+    "print(\"   Original secret will be backed up automatically.\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"✅ Environment verified - ready for Redis failure lab\")\n",
+    "print(\"=\" * 70)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "Redis is LangSmith's **cache and job queue**. It handles:\n",
+    "\n",
+    "- **Job queue for asynchronous processing:**\n",
+    "  - Workers pull trace processing jobs from Redis\n",
+    "  - Jobs are queued when traces arrive via API\n",
+    "  - Queue backlog indicates processing delays\n",
+    "\n",
+    "- **Caching:**\n",
+    "  - Frequently accessed data (project metadata, user info)\n",
+    "  - Reduces load on PostgreSQL\n",
+    "  - Improves response times\n",
+    "\n",
+    "- **Rate limiting and session management:**\n",
+    "  - API rate limiting\n",
+    "  - Session storage (if configured)\n",
+    "\n",
+    "- **Worker coordination:**\n",
+    "  - Distributed locking\n",
+    "  - Task distribution\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without Redis, workers can't process traces\n",
+    "- Job queue fills up, causing delays\n",
+    "- Cache misses increase load on PostgreSQL\n",
+    "- Ingestion becomes unreliable\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Connection string stored in Kubernetes Secrets\n",
+    "- Workers connect to Redis to pull jobs\n",
+    "- API servers use Redis for caching\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When Redis Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **Intermittent ingestion issues:**\n",
+    "   - Some traces process, others don't\n",
+    "   - Inconsistent behavior (works sometimes, fails other times)\n",
+    "   - Retries visible in logs\n",
+    "\n",
+    "2. **Latency spikes:**\n",
+    "   - API responses slow down\n",
+    "   - Worker processing delays\n",
+    "   - Timeout errors\n",
+    "\n",
+    "3. **Worker backlog:**\n",
+    "   - Jobs piling up in queue\n",
+    "   - Workers unable to pull new jobs\n",
+    "   - Queue length increasing\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Connection timeout errors\n",
+    "   - \"connection refused\" or \"connection reset\"\n",
+    "   - \"WRONGPASS invalid username-password pair\" or \"ERR invalid password\" (wrong password); \"NOAUTH Authentication required\" (no password sent)\n",
+    "   - Retry attempts in worker logs\n",
+    "   - Cache miss patterns\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms may be intermittent (connection pool retries)\n",
+    "- Worker backlog builds over time\n",
+    "- Cache misses cause cascading delays\n",
+    "- Full failure if connection pool exhausted\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle; Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong Redis Password**\n",
+    "- Modify the Redis password in the Kubernetes Secret\n",
+    "- Symptoms: Authentication failures on new connections, intermittent failures\n",
+    "\n",
+    "**Option B: Block Egress to Redis Endpoint**\n",
+    "- Apply NetworkPolicy blocking egress to Redis (if NetworkPolicy supported)\n",
+    "- Symptoms: Connection timeout, no route to host, intermittent failures\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option C: Wrong Redis Host/Endpoint**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for Redis secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "redis_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"redis\" in name.lower() or \"cache\" in name.lower():\n",
+    "            redis_secrets.append(name)\n",
+    "\n",
+    "if redis_secrets:\n",
+    "    ok(f\"Found {len(redis_secrets)} Redis-related secret(s)\")\n",
+    "    for secret_name in redis_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No Redis secrets found\")\n",
+    "    print(\"   💡 Redis connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong Redis Password\n",
+    "# This cell modifies the Redis password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find Redis secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "redis_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Common patterns: redis, cache\n",
+    "        if any(keyword in name.lower() for keyword in [\"redis\", \"cache\"]):\n",
+    "            # Check if it has password-related keys\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]):\n",
+    "                redis_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not redis_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find Redis secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found Redis secret: {redis_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
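+    "# Save original secret for restoration. Server-managed metadata is dropped so that\n",
+    "# a later `kubectl apply` of the backup is not rejected for a stale resourceVersion.\n",
+    "backup_data = yaml.safe_load(result.stdout)\n",
+    "for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
+    "    backup_data.get(\"metadata\", {}).pop(field, None)\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"redis-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    yaml.safe_dump(backup_data, f)\n",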
+    "\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
+    "\n",
+    "# Parse YAML and modify password\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "\n",
+    "# Find password key (could be password, REDIS_PASSWORD, CACHE_PASSWORD, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"redis-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {redis_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - Redis password is now invalid\")\n",
+    "# print(\"\\n💡 Pods usually pick up the new secret only after a restart; they do not restart on their own.\")\n",
+    "# print(f\"   To force it: kubectl rollout restart deployment/<deployment-name> -n {namespace}. Then watch:\")\n",
+    "# print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Worker pod logs for Redis connection errors\n",
+    "2. Queue backlog (if visible)\n",
+    "3. Worker retry patterns\n",
+    "4. Latency in API responses\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"redis-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Check for restarts\n",
+    "    lines = result.stdout.splitlines()\n",
+    "    restarts = [l for l in lines if l.strip()]  # header row first, then one row per pod\n",
+    "    if restarts:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in restarts[1:]:  # Skip header\n",
+    "            if line.strip():\n",
+    "                parts = line.split()\n",
+    "                if len(parts) > 3:\n",
+    "                    print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check worker pod logs for Redis errors\n",
+    "print(\"\\n3. Checking worker pod logs for Redis errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for Redis-related errors\n",
+    "        error_keywords = [\"redis\", \"cache\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"noauth\", \"retry\"]  # lowercase: matched against l.lower()\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found Redis-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious Redis errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find worker pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):  # Path.suffix would only be \".gz\"\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for Redis issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check worker pod logs for Redis connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'redis\\\\|cache\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl get secret {redis_secret_name} -n {namespace} -o yaml\")\n",
+    "print(\"   (Don't share this output - the values are only base64-encoded, not encrypted)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test Redis connectivity from a pod (if possible):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=redis:7 --restart=Never -- \\\\\")\n",
+    "print(\"     redis-cli -h <redis-host> -p <port> -a <password> ping\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for connection/authentication errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{redis_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{redis_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with Redis connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    redis_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            redis_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
+    "                                              for kw in [\"REDIS\", \"CACHE\"])]\n",
+    "            if redis_env:\n",
+    "                redis_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if redis_related_pods:\n",
+    "        print(f\"\\n   Pods with Redis environment variables:\")\n",
+    "        for pod_name in set(redis_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
+    "#     print(\"\\n💡 Pods will need to restart to pick up the correct secret.\")\n",
+    "#     print(\"   This may take 1-2 minutes. Monitor pod status:\")\n",
+    "#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "print(f\"   Backup file: {backup_file.name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in worker logs\n",
+    "print(\"\\nChecking for recent errors in worker logs...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"redis\", \"cache\", \"connection\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in worker logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a Redis issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **Redis connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Port\n",
+    "   - Password (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "   - Retry patterns\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Network policy changes\n",
+    "   - Redis configuration changes\n",
+    "5. **Queue status (if accessible):**\n",
+    "   - Queue length\n",
+    "   - Worker processing rate\n",
+    "   - Backlog growth rate\n",
+    "6. **Redis health (if accessible):**\n",
+    "   - Redis version\n",
+    "   - Memory usage\n",
+    "   - Connection count\n",
+    "   - Slow queries\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Worker pod logs with Redis errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- Redis endpoint connectivity test\n",
+    "- Queue metrics (if available)\n",
+    "- Redis logs (if accessible via cloud provider)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **Redis failures can be intermittent** - Connection pool retries may mask issues\n",
+    "2. **Worker logs are critical** - Redis errors appear in worker pod logs\n",
+    "3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
+    "4. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
+    "- ❌ Not checking worker logs (API logs may not show Redis errors)\n",
+    "- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
+    "- ❌ Not testing Redis connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the ClickHouse or Blob Storage failure labs\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,942 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - ClickHouse\n",
+    "\n",
+    "## ⚠️ CRITICAL SAFETY WARNING\n",
+    "\n",
+    "**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
+    "- **Modifies Kubernetes secrets** (breaks ClickHouse password)\n",
+    "- **Causes service disruptions** (traces delayed/missing, insert errors)\n",
+    "- **Requires remediation** to restore functionality\n",
+    "\n",
+    "**REQUIREMENTS:**\n",
+    "- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
+    "- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
+    "- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
+    "- ✅ **Backup/restore plan** available\n",
+    "\n",
+    "**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug ClickHouse connectivity failures in LangSmith.**\n",
+    "\n",
+    "ClickHouse is LangSmith's **trace storage**. It handles:\n",
+    "- Storing trace data (spans, events, metadata)\n",
+    "- Time-series queries for trace search and filtering\n",
+    "- High-volume writes from workers\n",
+    "- Efficient querying for UI display\n",
+    "\n",
+    "**When ClickHouse fails, you'll see:**\n",
+    "- Traces delayed or missing\n",
+    "- Insert errors and merge/backlog hints\n",
+    "- UI loads but traces don't appear\n",
+    "- Query timeouts\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how ClickHouse failures manifest\n",
+    "2. Practice collecting diagnostics for trace storage issues\n",
+    "3. Learn to identify connection vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** \n",
+    "- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
+    "- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ⚠️ CRITICAL: Environment Safety Verification\n",
+    "\n",
+    "**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn, fail\n",
+    "import os\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"=\" * 70)\n",
+    "print(\"⚠️  CRITICAL SAFETY CHECK - CLICKHOUSE FAILURE LAB\")\n",
+    "print(\"=\" * 70)\n",
+    "\n",
+    "# Show environment details prominently\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"\\n### Current Environment Configuration\")\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    account_id = identity.get('Account', 'N/A')\n",
+    "    user_arn = identity.get('Arn', 'N/A')\n",
+    "    print(f\"Account ID: {account_id}\")\n",
+    "    print(f\"User ARN: {user_arn}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
+    "    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "    print(f\"Subscription Name: {subscription_name}\")\n",
+    "\n",
+    "# Show all relevant environment variables\n",
+    "print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
+    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
+    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
+    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
+    "print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"⚠️  WHAT THIS LAB WILL DO:\")\n",
+    "print(\"=\" * 70)\n",
+    "print(\"\\nThis failure lab will:\")\n",
+    "print(\"  1. Find the ClickHouse secret in your namespace\")\n",
+    "print(\"  2. BACKUP the original secret (saved to artifacts)\")\n",
+    "print(\"  3. MODIFY the secret to set an INVALID password\")\n",
+    "print(\"  4. Apply the modified secret (breaks ClickHouse connectivity)\")\n",
+    "print(\"  5. Cause trace ingestion failures and query timeouts\")\n",
+    "print(\"  6. Require remediation to restore (restore original secret)\")\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "\n",
+    "# Check for Module 4 safety flag\n",
+    "module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
+    "if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
+    "    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
+    "    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
+    "    print(\"\\nTo run this failure lab, you MUST:\")\n",
+    "    print(\"  1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
+    "    print(\"  2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
+    "    print(\"  3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
+    "    print(\"  4. Re-run this cell to confirm\")\n",
+    "    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
+    "    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
+    "\n",
+    "ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
+    "print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
+    "print(\"\\n⚠️  REMINDER: This lab will break ClickHouse connectivity.\")\n",
+    "print(\"   Ensure you understand the remediation steps before proceeding.\")\n",
+    "print(\"   Original secret will be backed up automatically.\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"✅ Environment verified - ready for ClickHouse failure lab\")\n",
+    "print(\"=\" * 70)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "ClickHouse is LangSmith's **trace storage**. It handles:\n",
+    "\n",
+    "- **Trace data storage:**\n",
+    "  - Spans, events, and metadata for every trace\n",
+    "  - High-volume writes from workers\n",
+    "\n",
+    "- **Trace queries:**\n",
+    "  - Time-series queries for trace search and filtering\n",
+    "  - Efficient querying for UI display\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without ClickHouse, workers can't store traces\n",
+    "- Inserts fail or back up, so traces are delayed or missing\n",
+    "- The UI loads but traces don't appear\n",
+    "- Trace queries time out\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Connection credentials stored in Kubernetes Secrets\n",
+    "- Workers connect to ClickHouse to write trace data\n",
+    "- API servers use ClickHouse to serve trace queries for the UI\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When ClickHouse Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **Traces delayed or missing:**\n",
+    "   - Some traces process, others don't\n",
+    "   - UI loads but traces don't appear\n",
+    "   - Retries visible in logs\n",
+    "\n",
+    "2. **Latency spikes:**\n",
+    "   - Trace search and filtering slow down\n",
+    "   - Worker processing delays\n",
+    "   - Query timeouts\n",
+    "\n",
+    "3. **Insert backlog:**\n",
+    "   - Insert errors in worker logs\n",
+    "   - Merge/backlog hints\n",
+    "   - Workers unable to write new traces\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Connection timeout errors\n",
+    "   - \"connection refused\" or \"connection reset\"\n",
+    "   - Authentication failures (if password wrong)\n",
+    "   - Retry attempts in worker logs\n",
+    "   - Insert and query errors\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms may be intermittent (connection pool retries)\n",
+    "- Insert backlog builds over time\n",
+    "- Query timeouts cause cascading delays\n",
+    "- Full failure if connection pool exhausted\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle; Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong ClickHouse Password**\n",
+    "- Modify the ClickHouse password in the Kubernetes Secret\n",
+    "- Symptoms: Authentication failures, connection refused, intermittent failures\n",
+    "\n",
+    "**Option B: Block Egress to ClickHouse Endpoint**\n",
+    "- Apply NetworkPolicy blocking egress to ClickHouse (if NetworkPolicy supported)\n",
+    "- Symptoms: Connection timeout, no route to host, intermittent failures\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option C: Wrong ClickHouse Host/Endpoint**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for ClickHouse secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "clickhouse_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"clickhouse\" in name.lower():\n",
+    "            clickhouse_secrets.append(name)\n",
+    "\n",
+    "if clickhouse_secrets:\n",
+    "    ok(f\"Found {len(clickhouse_secrets)} ClickHouse-related secret(s)\")\n",
+    "    for secret_name in clickhouse_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No ClickHouse secrets found\")\n",
+    "    print(\"   💡 ClickHouse connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong ClickHouse Password\n",
+    "# This cell modifies the ClickHouse password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find ClickHouse secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "clickhouse_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Match secrets whose name mentions ClickHouse\n",
+    "        if \"clickhouse\" in name.lower():\n",
+    "            # Check if it has password-related keys\n",
+    "            data = secret.get(\"data\", {})\n",
+    "            if any(key in data for key in [\"password\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\"]):\n",
+    "                clickhouse_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not clickhouse_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find ClickHouse secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found ClickHouse secret: {clickhouse_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
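+    "# Parse the secret and drop server-managed metadata; a stale resourceVersion\n",
+    "# would make a later `kubectl apply` of the backup fail with a conflict\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "for field in [\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"]:\n",
+    "    secret_data.get(\"metadata\", {}).pop(field, None)\n",
+    "\n",
+    "# Save original secret for restoration (before the password is modified)\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"clickhouse-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    yaml.safe_dump(secret_data, f)\n",
+    "\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",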
+    "\n",
+    "# Find password key (could be password, CLICKHOUSE_PASSWORD, clickhouse-password, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"clickhouse-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {clickhouse_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
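+   "source": [
+    "## 7. Do the Drill - Step 2 (continued): Apply the Modified Secret\n",
+    "\n",
+    "The previous cell only prepared the modified secret. To inject the failure, uncomment the cell below and run it.\n"
+   ]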
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - ClickHouse password is now invalid\")\n",
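+    "# print(\"\\n💡 Pods that read the secret through environment variables only see the new value after a restart.\")\n",
+    "# print(f\"   If they do not restart on their own: kubectl rollout restart deployment/<deployment-name> -n {namespace}\")\n",
+    "# print(\"   Then watch the pods (this may take 1-2 minutes):\")\n",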
+    "# print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Worker pod logs for ClickHouse connection errors\n",
+    "2. Queue backlog (if visible)\n",
+    "3. Worker retry patterns\n",
+    "4. Latency in API responses\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"clickhouse-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Check for restarts\n",
+    "    lines = result.stdout.split(\"\\n\")\n",
+    "    restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
+    "    if restarts:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in restarts[1:]:  # Skip header\n",
+    "            if line.strip():\n",
+    "                parts = line.split()\n",
+    "                if len(parts) > 3:\n",
+    "                    print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check worker pod logs for ClickHouse errors\n",
+    "print(\"\\n3. Checking worker pod logs for ClickHouse errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for ClickHouse-related errors (keywords are lowercase because each line is lowercased before matching)\n",
+    "        error_keywords = [\"clickhouse\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"retry\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found ClickHouse-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious ClickHouse errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find worker pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for ClickHouse issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check worker pod logs for ClickHouse connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'clickhouse\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl get secret {clickhouse_secret_name} -n {namespace} -o yaml\")\n",
+    "print(\"   (Avoid sharing this output - the values are only base64-encoded, not encrypted)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test ClickHouse connectivity from a pod (if possible):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=clickhouse/clickhouse-server --restart=Never -- \\\\\")\n",
+    "print(\"     clickhouse-client --host <clickhouse-host> --port <port> --user <user> --password <password> --query 'SELECT 1'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for connection/authentication errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{clickhouse_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{clickhouse_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with ClickHouse connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    clickhouse_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            clickhouse_env = [e for e in env if \"CLICKHOUSE\" in e.get(\"name\", \"\").upper()]\n",
+    "            if clickhouse_env:\n",
+    "                clickhouse_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if clickhouse_related_pods:\n",
+    "        print(f\"\\n   Pods with ClickHouse environment variables:\")\n",
+    "        for pod_name in set(clickhouse_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
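+    "#     print(\"\\n💡 Pods only pick up the restored secret after a restart.\")\n",
+    "#     print(f\"   If they do not restart on their own: kubectl rollout restart deployment/<deployment-name> -n {namespace}\")\n",
+    "#     print(\"   Then monitor pod status (this may take 1-2 minutes):\")\n",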
+    "#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "if 'backup_file' in locals():\n",
+    "    print(f\"   Backup file: {backup_file.name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in worker logs\n",
+    "print(\"\\nChecking for recent errors in worker logs...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"clickhouse\", \"connection\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in worker logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a ClickHouse issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **ClickHouse connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Port\n",
+    "   - Password (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "   - Retry patterns\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Network policy changes\n",
+    "   - ClickHouse configuration changes\n",
+    "5. **Queue status (if accessible):**\n",
+    "   - Queue length\n",
+    "   - Worker processing rate\n",
+    "   - Backlog growth rate\n",
+    "6. **ClickHouse health (if accessible):**\n",
+    "   - ClickHouse version\n",
+    "   - Memory usage\n",
+    "   - Connection count\n",
+    "   - Slow queries\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Worker pod logs with ClickHouse errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- ClickHouse endpoint connectivity test\n",
+    "- Queue metrics (if available)\n",
+    "- ClickHouse logs (if accessible via cloud provider)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **ClickHouse failures can be intermittent** - Connection pool retries may mask issues\n",
+    "2. **Worker logs are critical** - ClickHouse errors appear in worker pod logs\n",
+    "3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
+    "4. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
+    "- ❌ Not checking worker logs (API logs may not show ClickHouse errors)\n",
+    "- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
+    "- ❌ Not testing ClickHouse connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the other failure labs, such as the Blob Storage lab\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,946 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Module 4: Failure Lab - Blob Storage\n",
+    "\n",
+    "## ⚠️ CRITICAL SAFETY WARNING\n",
+    "\n",
+    "**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
+    "- **Modifies Kubernetes secrets** (breaks blob storage credentials)\n",
+    "- **Causes service disruptions** (large payload failures, ClickHouse degradation)\n",
+    "- **Requires remediation** to restore functionality\n",
+    "\n",
+    "**REQUIREMENTS:**\n",
+    "- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
+    "- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
+    "- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
+    "- ✅ **Backup/restore plan** available\n",
+    "\n",
+    "**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "**This lab teaches you how to debug Blob Storage configuration failures in LangSmith.**\n",
+    "\n",
+    "Blob Storage is LangSmith's **large payload storage**. It handles:\n",
+    "- Storing large trace payloads and artifacts\n",
+    "- Offloading large data from ClickHouse\n",
+    "- Providing durable storage for trace data\n",
+    "\n",
+    "**When Blob Storage fails, you'll see:**\n",
+    "- Large payload traces degrade ClickHouse performance\n",
+    "- Warnings/errors in logs about artifact storage\n",
+    "- Increased ClickHouse pressure and latency under load\n",
+    "- Traces with large payloads fail to store properly\n",
+    "\n",
+    "**Learning Objectives:**\n",
+    "1. Understand how Blob Storage failures manifest\n",
+    "2. Practice collecting diagnostics for blob storage issues\n",
+    "3. Learn to identify configuration vs. credential vs. network issues\n",
+    "4. Practice safe remediation\n",
+    "\n",
+    "**Estimated time:** 30-45 minutes\n",
+    "\n",
+    "**⚠️ Important:** \n",
+    "- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
+    "- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,\n",
+    "    Path.cwd(),\n",
+    "    Path.cwd() / \"notebooks\",\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "from shared._validation import ok, warn\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ⚠️ CRITICAL: Environment Safety Verification\n",
+    "\n",
+    "**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn, fail\n",
+    "import os\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"=\" * 70)\n",
+    "print(\"⚠️  CRITICAL SAFETY CHECK - BLOB STORAGE FAILURE LAB\")\n",
+    "print(\"=\" * 70)\n",
+    "\n",
+    "# Show environment details prominently\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"\\n### Current Environment Configuration\")\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    account_id = identity.get('Account', 'N/A')\n",
+    "    user_arn = identity.get('Arn', 'N/A')\n",
+    "    print(f\"Account ID: {account_id}\")\n",
+    "    print(f\"User ARN: {user_arn}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
+    "    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "    print(f\"Subscription Name: {subscription_name}\")\n",
+    "\n",
+    "# Show all relevant environment variables\n",
+    "print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
+    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
+    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
+    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
+    "print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"⚠️  WHAT THIS LAB WILL DO:\")\n",
+    "print(\"=\" * 70)\n",
+    "print(\"\\nThis failure lab will:\")\n",
+    "print(\"  1. Find the Blob Storage secret in your namespace\")\n",
+    "print(\"  2. BACKUP the original secret (saved to artifacts)\")\n",
+    "print(\"  3. MODIFY the secret to set INVALID credentials\")\n",
+    "print(\"  4. Apply the modified secret (breaks blob storage connectivity)\")\n",
+    "print(\"  5. Cause large payload failures and ClickHouse degradation\")\n",
+    "print(\"  6. Require remediation to restore (restore original secret)\")\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "\n",
+    "# Check for Module 4 safety flag\n",
+    "module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
+    "if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
+    "    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
+    "    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
+    "    print(\"\\nTo run this failure lab, you MUST:\")\n",
+    "    print(\"  1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
+    "    print(\"  2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
+    "    print(\"  3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
+    "    print(\"  4. Re-run this cell to confirm\")\n",
+    "    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
+    "    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
+    "\n",
+    "ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
+    "print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
+    "print(\"\\n⚠️  REMINDER: This lab will break blob storage connectivity.\")\n",
+    "print(\"   Ensure you understand the remediation steps before proceeding.\")\n",
+    "print(\"   Original secret will be backed up automatically.\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"✅ Environment verified - ready for Blob Storage failure lab\")\n",
+    "print(\"=\" * 70)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration & Prerequisites\n",
+    "\n",
+    "Load configuration and verify prerequisites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env\n",
+    "\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "config = require_env(*required_vars)\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "\n",
+    "print(f\"Namespace: {namespace}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. What This Service Does for LangSmith\n",
+    "\n",
+    "Blob Storage is LangSmith's **large payload storage**. It handles:\n",
+    "\n",
+    "- **Large trace payloads and artifacts:**\n",
+    "  - Large trace payloads and artifacts are written to the blob storage bucket\n",
+    "  - Upload failures surface as artifact storage warnings/errors in worker logs\n",
+    "\n",
+    "- **Offloading large data from ClickHouse:**\n",
+    "  - Keeps large payloads out of ClickHouse\n",
+    "  - Reduces ClickHouse storage pressure and latency under load\n",
+    "\n",
+    "- **Durable storage for trace data**\n",
+    "\n",
+    "**Why it matters:**\n",
+    "- Without Blob Storage, large payload traces degrade ClickHouse performance\n",
+    "- ClickHouse pressure and latency increase under load\n",
+    "- Traces with large payloads fail to store properly\n",
+    "\n",
+    "**How LangSmith connects:**\n",
+    "- Blob storage credentials are stored in Kubernetes Secrets\n",
+    "- Workers use those credentials to upload artifacts to the bucket\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Expected Symptoms When Blob Storage Fails\n",
+    "\n",
+    "**What you'll see:**\n",
+    "\n",
+    "1. **Large payload traces degrade ClickHouse:**\n",
+    "   - ClickHouse performance degrades under load\n",
+    "   - Insert operations slow down\n",
+    "   - Query performance suffers\n",
+    "   - Storage pressure increases\n",
+    "\n",
+    "2. **Warnings/errors in logs about artifact storage:**\n",
+    "   - Worker logs show artifact upload failures\n",
+    "   - Bucket access errors\n",
+    "   - Credential errors\n",
+    "   - \"No such bucket\" or \"Access Denied\" errors\n",
+    "\n",
+    "3. **Increased ClickHouse pressure:**\n",
+    "   - ClickHouse latency increases\n",
+    "   - Merge operations backlog\n",
+    "   - Storage usage spikes\n",
+    "   - Query timeouts\n",
+    "\n",
+    "4. **Log patterns:**\n",
+    "   - Artifact storage errors in worker logs\n",
+    "   - S3/blob storage connection errors\n",
+    "   - Bucket access denied errors\n",
+    "   - Credential errors\n",
+    "   - Configuration errors\n",
+    "\n",
+    "**Timeline:**\n",
+    "- Symptoms appear gradually (under load)\n",
+    "- ClickHouse performance degrades over time\n",
+    "- Large traces fail or are rejected\n",
+    "- Full failure if blob storage completely unavailable\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Failure Injection Options\n",
+    "\n",
+    "**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
+    "\n",
+    "### Level 1: Subtle Failure (Recommended for first run)\n",
+    "\n",
+    "**Option A: Wrong Blob Storage Password**\n",
+    "- Modify the Blob Storage password in the Kubernetes Secret\n",
+    "- Symptoms: Authentication failures, \"Access Denied\" errors, intermittent failures\n",
+    "\n",
+    "**Option B: Block Egress to Blob Storage Endpoint**\n",
+    "- Apply NetworkPolicy blocking egress to Blob Storage (if NetworkPolicy supported)\n",
+    "- Symptoms: Connection timeout, no route to host, intermittent failures\n",
+    "\n",
+    "### Level 2: Obvious Failure\n",
+    "\n",
+    "**Option C: Wrong Blob Storage Host/Endpoint**\n",
+    "- Point connection string to non-existent host\n",
+    "- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
+    "\n",
+    "**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Do the Drill - Step 1: Confirm Baseline\n",
+    "\n",
+    "**Before injecting any failure, verify your baseline is healthy.**\n",
+    "\n",
+    "💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "import json\n",
+    "\n",
+    "print(\"### Quick Baseline Check\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    print(f\"Pods: {healthy}/{total} running\")\n",
+    "    \n",
+    "    if healthy == total and total > 0:\n",
+    "        ok(\"Baseline looks healthy\")\n",
+    "    else:\n",
+    "        warn(\"Some pods are not running - check baseline first\")\n",
+    "else:\n",
+    "    warn(\"Could not check pod status\")\n",
+    "\n",
+    "# Check for Blob Storage secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "blob_secrets = []\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        if \"blob\" in name.lower():\n",
+    "            blob_secrets.append(name)\n",
+    "\n",
+    "if blob_secrets:\n",
+    "    ok(f\"Found {len(blob_secrets)} Blob Storage-related secret(s)\")\n",
+    "    for secret_name in blob_secrets:\n",
+    "        print(f\"   - {secret_name}\")\n",
+    "else:\n",
+    "    warn(\"No Blob Storage secrets found\")\n",
+    "    print(\"   💡 Blob Storage connection may be configured differently\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Do the Drill - Step 2: Apply Failure Injection\n",
+    "\n",
+    "**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
+    "\n",
+    "Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# FAILURE INJECTION: Wrong Blob Storage Password\n",
+    "# This cell modifies the Blob Storage password secret to an invalid value\n",
+    "\n",
+    "import base64\n",
+    "import yaml\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Find Blob Storage secret (look for common names)\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "blob_secret_name = None\n",
+    "if result.returncode == 0:\n",
+    "    secrets = json.loads(result.stdout)\n",
+    "    for secret in secrets.get(\"items\", []):\n",
+    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        # Match secrets whose name mentions blob storage\n",
+    "        if \"blob\" in name.lower():\n",
+    "            # Check if it has password-related keys (same list as the lookup below)\n",
+    "            data = secret.get(\"data\") or {}\n",
+    "            if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"BLOB_PASSWORD\", \"blob-password\"]):\n",
+    "                blob_secret_name = name\n",
+    "                break\n",
+    "\n",
+    "if not blob_secret_name:\n",
+    "    raise RuntimeError(\"❌ Could not find Blob Storage secret. Check your deployment configuration.\")\n",
+    "\n",
+    "print(f\"Found Blob Storage secret: {blob_secret_name}\")\n",
+    "\n",
+    "# Get current secret\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
+    "    check=True,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "# Save original secret for restoration\n",
+    "backup_file = artifacts_dir / \"module-4\" / f\"blob-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
+    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "with open(backup_file, \"w\") as f:\n",
+    "    f.write(result.stdout)\n",
+    "\n",
+    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
+    "\n",
+    "# Parse YAML and modify password\n",
+    "secret_data = yaml.safe_load(result.stdout)\n",
+    "if \"data\" not in secret_data:\n",
+    "    raise RuntimeError(\"Secret has no data section\")\n",
+    "\n",
+    "# Find password key (could be password, REDIS_PASSWORD, BLOB_PASSWORD, etc.)\n",
+    "password_key = None\n",
+    "for key in [\"password\", \"REDIS_PASSWORD\", \"BLOB_PASSWORD\", \"blob-password\"]:\n",
+    "    if key in secret_data[\"data\"]:\n",
+    "        password_key = key\n",
+    "        break\n",
+    "\n",
+    "if not password_key:\n",
+    "    raise RuntimeError(\"Could not find password key in secret\")\n",
+    "\n",
+    "# Set invalid password\n",
+    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
+    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
+    "\n",
+    "# Modify secret\n",
+    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
+    "\n",
+    "# Save modified secret to temp file\n",
+    "temp_secret_file = artifacts_dir / \"module-4\" / \"blob-secret-modified.yaml\"\n",
+    "with open(temp_secret_file, \"w\") as f:\n",
+    "    yaml.dump(secret_data, f)\n",
+    "\n",
+    "print(f\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
+    "print(f\"   This will set an invalid password in secret: {blob_secret_name}\")\n",
+    "print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
+    "print(f\"\\n   To apply, uncomment and run the next cell.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
+    "# \n",
+    "# result = run(\n",
+    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
+    "#     check=True,\n",
+    "#     stream=True\n",
+    "# )\n",
+    "# \n",
+    "# ok(\"Failure injection applied - Blob Storage password is now invalid\")\n",
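+    "# print(\"\\n💡 Pods that load this secret as environment variables only see the change after a restart.\")\n",
+    "# print(\"   If they do not restart on their own, restart the deployments, then watch the pods:\")\n",
+    "# print(f\"   kubectl rollout restart deployment -n {namespace} && kubectl get pods -n {namespace} -w\")\n",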
+    "# \n",
+    "# # Wait a moment for changes to propagate\n",
+    "# import time\n",
+    "# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
+    "# time.sleep(30)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Do the Drill - Step 3: Observe Symptoms\n",
+    "\n",
+    "**Now that the failure is injected, observe how it manifests.**\n",
+    "\n",
+    "Check:\n",
+    "1. Worker pod logs for Blob Storage connection errors\n",
+    "2. Queue backlog (if visible)\n",
+    "3. Worker retry patterns\n",
+    "4. Latency in API responses\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "# Create incident directory for diagnostics\n",
+    "incident_dir = artifacts_dir / \"module-4\" / f\"blob-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
+    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
+    "print(f\"Saving to: {incident_dir}\\n\")\n",
+    "\n",
+    "# 1. Check pod status\n",
+    "print(\"1. Checking pod status...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    print(result.stdout)\n",
+    "    \n",
+    "    # Check for restarts\n",
+    "    lines = result.stdout.split(\"\\n\")\n",
+    "    restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
+    "    if restarts:\n",
+    "        print(\"\\n   Pod restart counts:\")\n",
+    "        for line in restarts[1:]:  # Skip header\n",
+    "            if line.strip():\n",
+    "                parts = line.split()\n",
+    "                if len(parts) > 3:\n",
+    "                    print(f\"   {parts[0]}: {parts[3]} restarts\")\n",
+    "\n",
+    "# 2. Check recent events\n",
+    "print(\"\\n2. Checking recent events...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
+    "        f.write(result.stdout)\n",
+    "    if result.stdout.strip():\n",
+    "        print(\"   Recent warning/error events:\")\n",
+    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
+    "            if line.strip():\n",
+    "                print(f\"   {line}\")\n",
+    "\n",
+    "# 3. Check worker pod logs for Blob Storage errors\n",
+    "print(\"\\n3. Checking worker pod logs for Blob Storage errors...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
+    "        with open(logs_file, \"w\") as f:\n",
+    "            f.write(result.stdout)\n",
+    "        \n",
+    "        # Look for Blob Storage-related errors (keywords must be lowercase: lines are lowercased before matching)\n",
+    "        error_keywords = [\"blob\", \"s3\", \"bucket\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"access denied\", \"retry\"]\n",
+    "        error_lines = [l for l in result.stdout.split(\"\\n\") \n",
+    "                      if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if error_lines:\n",
+    "            print(\"   Found Blob Storage-related errors:\")\n",
+    "            for line in error_lines[-5:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            print(\"   No obvious Blob Storage errors in recent logs\")\n",
+    "else:\n",
+    "    warn(\"Could not find worker pod\")\n",
+    "\n",
+    "ok(f\"Diagnostics saved to: {incident_dir}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
+    "\n",
+    "**This is critical - Support will ask for this bundle.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "\n",
+    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
+    "\n",
+    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
+    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
+    "\n",
+    "try:\n",
+    "    urllib.request.urlretrieve(script_url, script_path)\n",
+    "    script_path.chmod(0o755)\n",
+    "    \n",
+    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
+    "    result = run(\n",
+    "        [str(script_path), namespace],\n",
+    "        check=False,\n",
+    "        stream=True\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        ok(\"Diagnostics script completed\")\n",
+    "        \n",
+    "        # Find and move tarball\n",
+    "        for file in incident_dir.parent.iterdir():\n",
+    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
+    "                target_path = incident_dir / file.name\n",
+    "                file.rename(target_path)\n",
+    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
+    "                break\n",
+    "    else:\n",
+    "        warn(\"Diagnostics script had errors (check output above)\")\n",
+    "        \n",
+    "except Exception as e:\n",
+    "    warn(f\"Could not run diagnostics script: {e}\")\n",
+    "    print(\"   💡 You can run it manually:\")\n",
+    "    print(f\"      curl -O {script_url}\")\n",
+    "    print(f\"      chmod +x get_k8s_debugging_info.sh\")\n",
+    "    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Do the Drill - Step 5: Guided Triage\n",
+    "\n",
+    "**Where to look first for Blob Storage issues:**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Guided Triage Steps\\n\")\n",
+    "\n",
+    "print(\"1. Check worker pod logs for Blob Storage connection errors:\")\n",
+    "print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'blob\\\\|bucket\\\\|connection'\")\n",
+    "print()\n",
+    "\n",
+    "print(\"2. Verify secret exists and has correct keys:\")\n",
+    "print(f\"   kubectl get secret {blob_secret_name} -n {namespace} -o yaml\")\n",
+    "print(\"   (Values are only base64-encoded, not encrypted - don't share this output)\")\n",
+    "print()\n",
+    "\n",
+    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
+    "print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
+    "print()\n",
+    "\n",
+    "print(\"4. Test Blob Storage connectivity from a pod (if possible):\")\n",
+    "print(\"   kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \\\\\")\n",
+    "print(\"     curl -sv https://<blob-endpoint>\")\n",
+    "print()\n",
+    "\n",
+    "print(\"5. Check events for connection/authentication errors:\")\n",
+    "print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
+    "print()\n",
+    "\n",
+    "# Check what we can automatically\n",
+    "print(\"\\n### Automatic Checks\\n\")\n",
+    "\n",
+    "# Check secret still exists\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Secret '{blob_secret_name}' still exists\")\n",
+    "    secret_data = json.loads(result.stdout)\n",
+    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
+    "    print(f\"   Secret keys: {', '.join(keys)}\")\n",
+    "else:\n",
+    "    warn(f\"Secret '{blob_secret_name}' not found!\")\n",
+    "\n",
+    "# Check for pods with Blob Storage connection env vars\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    blob_related_pods = []\n",
+    "    for pod in pods.get(\"items\", []):\n",
+    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
+    "        for container in containers:\n",
+    "            env = container.get(\"env\", [])\n",
+    "            blob_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
+    "                                              for kw in [\"BLOB\", \"S3\"])]\n",
+    "            if blob_env:\n",
+    "                blob_related_pods.append(name)\n",
+    "                break\n",
+    "    \n",
+    "    if blob_related_pods:\n",
+    "        print(f\"\\n   Pods with Blob Storage environment variables:\")\n",
+    "        for pod_name in set(blob_related_pods):\n",
+    "            print(f\"   - {pod_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11. Do the Drill - Step 6: Remediation\n",
+    "\n",
+    "**Restore the original secret to fix the issue.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# REMEDIATION: Restore original secret\n",
+    "# UNCOMMENT TO RESTORE\n",
+    "\n",
+    "# if backup_file.exists():\n",
+    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
+    "#     result = run(\n",
+    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
+    "#         check=True,\n",
+    "#         stream=True\n",
+    "#     )\n",
+    "#     \n",
+    "#     ok(\"Original secret restored\")\n",
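+    "#     print(\"\\n💡 Pods that load this secret as environment variables only see the fix after a restart.\")\n",
+    "#     print(\"   If they do not restart on their own, restart the deployments, then watch the pods:\")\n",
+    "#     print(f\"   kubectl rollout restart deployment -n {namespace} && kubectl get pods -n {namespace} -w\")\n",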
+    "#     \n",
+    "#     import time\n",
+    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
+    "#     time.sleep(60)\n",
+    "# else:\n",
+    "#     warn(f\"Backup file not found: {backup_file}\")\n",
+    "#     print(\"   💡 You may need to manually restore the secret\")\n",
+    "\n",
+    "print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
+    "if 'backup_file' in locals() and backup_file:\n",
+    "    print(f\"   Backup file: {backup_file.name}\")\n",
+    "else:\n",
+    "    print(\"   💡 If you modified Helm values, restore them manually\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12. Do the Drill - Step 7: Confirm Recovery\n",
+    "\n",
+    "**Verify that everything is working again.**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Verifying Recovery\\n\")\n",
+    "\n",
+    "# Check pod status\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    pods = json.loads(result.stdout)\n",
+    "    running = sum(1 for p in pods.get(\"items\", [])\n",
+    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
+    "    total = len(pods.get(\"items\", []))\n",
+    "    \n",
+    "    if running == total and total > 0:\n",
+    "        ok(f\"All {total} pod(s) are running\")\n",
+    "    else:\n",
+    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
+    "        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
+    "\n",
+    "# Check for recent errors in worker logs\n",
+    "print(\"\\nChecking for recent errors in worker logs...\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
+    "if worker_pod:\n",
+    "    result = run(\n",
+    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
+    "        check=False,\n",
+    "        stream=False\n",
+    "    )\n",
+    "    \n",
+    "    if result.returncode == 0:\n",
+    "        error_keywords = [\"error\", \"fail\", \"blob\", \"connection\"]\n",
+    "        recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
+    "                        if any(kw in l.lower() for kw in error_keywords)]\n",
+    "        \n",
+    "        if recent_errors:\n",
+    "            warn(\"Still seeing some errors in logs:\")\n",
+    "            for line in recent_errors[-3:]:\n",
+    "                print(f\"   {line}\")\n",
+    "        else:\n",
+    "            ok(\"No recent errors in worker logs\")\n",
+    "\n",
+    "ok(\"Recovery verification complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13. What Support Will Ask For\n",
+    "\n",
+    "**When escalating a Blob Storage issue, Support will need:**\n",
+    "\n",
+    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
+    "2. **Blob Storage connection details:**\n",
+    "   - Host/endpoint (redacted)\n",
+    "   - Port\n",
+    "   - Password (redacted)\n",
+    "   - Whether using SSL/TLS\n",
+    "3. **Error messages from logs:**\n",
+    "   - Full error text (not just \"connection failed\")\n",
+    "   - Timestamps of first occurrence\n",
+    "   - Retry patterns\n",
+    "4. **Recent changes:**\n",
+    "   - Secret rotations\n",
+    "   - Network policy changes\n",
+    "   - Blob Storage configuration changes\n",
+    "5. **Queue status (if accessible):**\n",
+    "   - Queue length\n",
+    "   - Worker processing rate\n",
+    "   - Backlog growth rate\n",
+    "6. **Blob Storage health (if accessible):**\n",
+    "   - Storage provider and region\n",
+    "   - Whether the bucket exists and is reachable\n",
+    "   - Whether the credentials are valid\n",
+    "   - Provider-side access logs\n",
+    "\n",
+    "**Evidence collected in this lab:**\n",
+    "- ✅ Diagnostics bundle\n",
+    "- ✅ Worker pod logs with Blob Storage errors\n",
+    "- ✅ Events showing failures\n",
+    "- ✅ Secret configuration (structure, not values)\n",
+    "\n",
+    "**Additional evidence to gather (if escalating):**\n",
+    "- Blob Storage endpoint connectivity test\n",
+    "- Queue metrics (if available)\n",
+    "- Blob Storage logs (if accessible via cloud provider)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14. Lessons Learned\n",
+    "\n",
+    "**Key takeaways from this lab:**\n",
+    "\n",
+    "1. **Blob Storage failures can be intermittent** - Connection pool retries may mask issues\n",
+    "2. **Worker logs are critical** - Blob Storage errors appear in worker pod logs\n",
+    "3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
+    "4. **Secrets matter** - Wrong credentials cause authentication failures\n",
+    "5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
+    "\n",
+    "**Common mistakes to avoid:**\n",
+    "- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
+    "- ❌ Not checking worker logs (API logs may not show Blob Storage errors)\n",
+    "- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
+    "- ❌ Not testing Blob Storage connectivity independently\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Practice with other failure injection methods (Level 2)\n",
+    "- Try the PostgreSQL, Redis, or ClickHouse failure labs\n",
+    "- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,37 @@
+# Module 4: Troubleshooting & Incident Response
+
+This directory contains notebooks for Module 4 of the LangSmith Self-Hosted Operator workshop.
+
+## Notebooks
+
+### Setup & Baseline
+- **`../shared/00_setup_or_resume_environment.ipynb`** - Validates environment is ready (shared across modules 2, 3, 4)
+- **`01_diagnostics_baseline.ipynb`** - Captures baseline diagnostics (run this first!)
+
+### Failure Labs
+- **`10_failure_lab_postgres.ipynb`** - PostgreSQL connectivity failure debugging
+- **`20_failure_lab_redis.ipynb`** - Redis connectivity failure debugging
+- **`30_failure_lab_clickhouse.ipynb`** - ClickHouse connectivity failure debugging
+- **`40_failure_lab_blob_storage.ipynb`** - Blob storage configuration failure debugging
+
+### Advanced
+- **`90_full_incident_drill.ipynb`** - Complete incident simulation (optional)
+
+## Workflow
+
+1. Run `../shared/00_setup_or_resume_environment.ipynb` to verify your environment
+2. Run `01_diagnostics_baseline.ipynb` to capture baseline
+3. Run failure labs in order (10, 20, 30, 40) or pick specific ones
+4. Optionally run `90_full_incident_drill.ipynb` for complete practice
+
+## Important Notes
+
+- **Always run baseline first** - You need "before" to compare to "after"
+- **Failure injections are reversible** - All labs include remediation steps
+- **Don't skip diagnostics collection** - Support will ask for the canonical bundle
+- **Practice in test environments only** - These labs modify your deployment
+
+## Documentation
+
+See `docs/modules/module-4.md` for complete module documentation.
+
@@ -0,0 +1,614 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Setup or Resume Environment\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This notebook helps you prepare for workshop modules 2 through 4 (Identity & Auth, Production Operations, or Troubleshooting). It validates that your LangSmith environment is running and accessible, or guides you to deploy it using Module 1.\n",
+    "\n",
+    "### About This Notebook\n",
+    "This notebook is **READ-ONLY** and safe to run. It performs validation checks only and does not modify any resources. \n",
+    "\n",
+    "### Module-Specific Notes\n",
+    "- **Modules 2 and 3** are read-only validation notebooks, perfect for understanding your current configuration\n",
+    "- **Module 4** includes hands-on failure labs that intentionally modify secrets to teach troubleshooting—these require a test environment\n",
+    "- Module-specific guidance is provided below to help you understand what to expect\n",
+    "\n",
+    "### Prerequisites\n",
+    "- Module 1 notebooks available (for deployment if needed)\n",
+    "- `kubectl` configured (if environment exists)\n",
+    "\n",
+    "### What This Notebook Does\n",
+    "1. Checks if LangSmith is already deployed\n",
+    "2. If not, provides links to Module 1 deployment notebooks\n",
+    "3. If yes, validates the environment is healthy and reachable\n",
+    "4. **Verifies you're in the correct environment** (shows account/region)\n",
+    "5. Shows module-specific safety warnings\n",
+    "\n",
+    "**Estimated time:** 10-15 minutes\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bootstrap environment\n",
+    "import sys\n",
+    "import os\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Add notebooks directory to path so we can import shared as a package\n",
+    "possible_paths = [\n",
+    "    Path.cwd().parent,  # If cwd is a module directory, go up one level to notebooks\n",
+    "    Path.cwd(),  # If cwd is already notebooks\n",
+    "    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
+    "]\n",
+    "\n",
+    "notebooks_path = None\n",
+    "for path in possible_paths:\n",
+    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        notebooks_path = path\n",
+    "        break\n",
+    "\n",
+    "if not notebooks_path:\n",
+    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
+    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
+    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
+    "\n",
+    "# Add notebooks directory to path so 'shared' can be imported as a package\n",
+    "if str(notebooks_path) not in sys.path:\n",
+    "    sys.path.insert(0, str(notebooks_path))\n",
+    "\n",
+    "from shared._bootstrap import bootstrap\n",
+    "\n",
+    "# Run bootstrap\n",
+    "bootstrap_info = bootstrap()\n",
+    "artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
+    "print(f\"\\nArtifacts directory: {artifacts_dir}\")\n",
+    "\n",
+    "# Detect which module is using this notebook\n",
+    "# Check current working directory or environment variable\n",
+    "current_module = None\n",
+    "cwd_str = str(Path.cwd())\n",
+    "if \"module-2\" in cwd_str:\n",
+    "    current_module = \"2\"\n",
+    "elif \"module-3\" in cwd_str:\n",
+    "    current_module = \"3\"\n",
+    "elif \"module-4\" in cwd_str:\n",
+    "    current_module = \"4\"\n",
+    "else:\n",
+    "    # Try environment variable\n",
+    "    current_module = os.environ.get(\"CURRENT_MODULE\", \"\")\n",
+    "    if not current_module:\n",
+    "        # Default: assume generic use\n",
+    "        current_module = None\n",
+    "\n",
+    "print(f\"\\nDetected module context: {current_module if current_module else 'Generic'}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Environment Safety Verification\n",
+    "\n",
+    "**Before proceeding, verify you're working with the correct environment.**\n",
+    "\n",
+    "**Module-specific notes:**\n",
+    "- **Module 2 (Identity & Auth):** Read-only validation - safe for production\n",
+    "- **Module 3 (Production Operations):** Read-only validation - safe for production\n",
+    "- **Module 4 (Troubleshooting):** Includes failure labs that modify secrets - **TEST environment ONLY**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Environment Safety Check: Verify environment and show module-specific warnings\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
+    "from shared._validation import ok, warn, fail\n",
+    "\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "identity = get_identity()\n",
+    "\n",
+    "print(\"=\" * 70)\n",
+    "print(\"⚠️  ENVIRONMENT SAFETY CHECK\")\n",
+    "print(\"=\" * 70)\n",
+    "\n",
+    "# Show environment details prominently\n",
+    "provider_display = provider.upper()\n",
+    "print(f\"\\n### Current Environment Configuration\")\n",
+    "print(f\"Cloud Provider: {provider_display}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "\n",
+    "if provider == \"aws\":\n",
+    "    account_id = identity.get('Account', 'N/A')\n",
+    "    user_arn = identity.get('Arn', 'N/A')\n",
+    "    print(f\"Account ID: {account_id}\")\n",
+    "    print(f\"User ARN: {user_arn}\")\n",
+    "elif provider == \"azure\":\n",
+    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
+    "    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
+    "    print(f\"Subscription ID: {subscription_id}\")\n",
+    "    print(f\"Subscription Name: {subscription_name}\")\n",
+    "\n",
+    "# Show all relevant environment variables\n",
+    "print(f\"\\n### Environment Variables (for verification)\")\n",
+    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
+    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
+    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
+    "print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
+    "\n",
+    "# Module-specific safety checks\n",
+    "if current_module == \"4\":\n",
+    "    # Module 4: Failure labs require TEST environment\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    print(\"⚠️  CRITICAL: Module 4 Failure Labs Will Modify Your Environment\")\n",
+    "    print(\"=\" * 70)\n",
+    "    print(\"\\nThe failure labs in Module 4 will:\")\n",
+    "    print(\"  ❌ Modify Kubernetes secrets (break passwords/credentials)\")\n",
+    "    print(\"  ❌ Cause service disruptions (API failures, login failures)\")\n",
+    "    print(\"  ❌ Require remediation to restore functionality\")\n",
+    "    print(\"\\nThis is INTENTIONAL for learning troubleshooting, but:\")\n",
+    "    print(\"  ⚠️  ONLY run in TEST/NON-PRODUCTION environments\")\n",
+    "    print(\"  ⚠️  DO NOT run against production systems\")\n",
+    "    print(\"  ⚠️  Ensure you can restore the environment after labs\")\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    \n",
+    "    # Require explicit confirmation for Module 4\n",
+    "    print(\"\\n### Environment Verification Required for Module 4\")\n",
+    "    print(\"\\nPlease confirm:\")\n",
+    "    print(\"  1. ✅ This is a TEST/NON-PRODUCTION environment\")\n",
+    "    print(\"  2. ✅ You understand failure labs will modify secrets\")\n",
+    "    print(\"  3. ✅ You have a way to restore the environment (backup/teardown)\")\n",
+    "    print(\"  4. ✅ You will NOT run these labs against production\")\n",
+    "    \n",
+    "    # Check if user has explicitly acknowledged\n",
+    "    module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
+    "    if module4_safe in [\"true\", \"yes\", \"1\"]:\n",
+    "        ok(\"MODULE4_SAFE_ENVIRONMENT flag is set - proceeding\")\n",
+    "        print(\"\\n✅ Safety check passed - environment marked as safe for Module 4\")\n",
+    "    else:\n",
+    "        fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
+    "        print(\"\\n❌ SAFETY CHECK FAILED\")\n",
+    "        print(\"\\nTo proceed with Module 4 failure labs, you MUST:\")\n",
+    "        print(\"  1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
+    "        print(\"  2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
+    "        print(\"  3. Re-run this cell to confirm\")\n",
+    "        print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
+    "        raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. This is required for Module 4 failure labs.\")\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    print(\"✅ Environment verified as safe for Module 4 failure labs\")\n",
+    "    print(\"=\" * 70)\n",
+    "elif current_module in [\"2\", \"3\"]:\n",
+    "    # Modules 2 and 3: Read-only, safe for production\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    print(f\"✅ Module {current_module} is READ-ONLY\")\n",
+    "    print(\"=\" * 70)\n",
+    "    if current_module == \"2\":\n",
+    "        print(\"\\nModule 2 (Identity & Auth) notebooks:\")\n",
+    "        print(\"  ✅ Perform read-only validation checks\")\n",
+    "        print(\"  ✅ Do NOT modify any infrastructure or secrets\")\n",
+    "        print(\"  ✅ Safe to run against production environments\")\n",
+    "    elif current_module == \"3\":\n",
+    "        print(\"\\nModule 3 (Production Operations) notebooks:\")\n",
+    "        print(\"  ✅ Perform read-only validation and signal checks\")\n",
+    "        print(\"  ✅ Do NOT modify any infrastructure or resources\")\n",
+    "        print(\"  ✅ Safe to run against production environments\")\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    ok(\"Environment check complete - safe to proceed with read-only validation\")\n",
+    "else:\n",
+    "    # Generic use - show general warning\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    print(\"⚠️  MODULE CONTEXT NOT DETECTED\")\n",
+    "    print(\"=\" * 70)\n",
+    "    print(\"\\nThis notebook is used by multiple modules:\")\n",
+    "    print(\"  - Module 2: Read-only validation (safe for production)\")\n",
+    "    print(\"  - Module 3: Read-only validation (safe for production)\")\n",
+    "    print(\"  - Module 4: Failure labs (TEST environment ONLY)\")\n",
+    "    print(\"\\n💡 If using Module 4, ensure MODULE4_SAFE_ENVIRONMENT=true is set\")\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    ok(\"Environment check complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configuration\n",
+    "\n",
+    "Load and validate configuration from environment variables.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from shared._validation import require_env, ok, warn\n",
+    "from shared._cloud_helpers import get_cloud_provider, get_region\n",
+    "\n",
+    "# Required configuration\n",
+    "required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
+    "\n",
+    "print(\"### Loading Configuration\\n\")\n",
+    "\n",
+    "config = {}\n",
+    "missing = []\n",
+    "\n",
+    "for var in required_vars:\n",
+    "    value = os.environ.get(var, \"\").strip()\n",
+    "    if not value:\n",
+    "        missing.append(var)\n",
+    "    config[var] = value\n",
+    "\n",
+    "if missing:\n",
+    "    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
+    "                      f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
+    "\n",
+    "# Optional but recommended\n",
+    "config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
+    "config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
+    "\n",
+    "# Show cloud provider info\n",
+    "provider = get_cloud_provider()\n",
+    "region = get_region()\n",
+    "\n",
+    "print(f\"Cloud Provider: {provider.upper()}\")\n",
+    "print(f\"Region: {region}\")\n",
+    "print(f\"Namespace: {config['NAMESPACE']}\")\n",
+    "print(f\"Cluster: {config['CLUSTER_NAME']}\")\n",
+    "print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
+    "\n",
+    "if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "    print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
+    "\n",
+    "ok(\"Configuration loaded\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Check if Environment Exists\n",
+    "\n",
+    "We'll check if LangSmith is already deployed. If not, we'll provide instructions to deploy using Module 1.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shared._shell import run\n",
+    "from shared._cloud_helpers import cluster_exists, configure_kubectl, get_kubernetes_service_name\n",
+    "\n",
+    "namespace = config[\"NAMESPACE\"]\n",
+    "cluster_name = config[\"CLUSTER_NAME\"]\n",
+    "k8s_service = get_kubernetes_service_name()\n",
+    "\n",
+    "print(f\"### Checking {k8s_service} Cluster\\n\")\n",
+    "\n",
+    "# Check if cluster exists\n",
+    "if cluster_exists(cluster_name):\n",
+    "    ok(f\"Cluster '{cluster_name}' exists\")\n",
+    "    \n",
+    "    # Configure kubectl\n",
+    "    print(f\"\\n### Configuring kubectl\\n\")\n",
+    "    try:\n",
+    "        configure_kubectl(cluster_name, region)\n",
+    "        ok(\"kubectl configured\")\n",
+    "    except Exception as e:\n",
+    "        warn(f\"Could not configure kubectl: {e}\")\n",
+    "        print(\"💡 Make sure you have proper cloud provider credentials\")\n",
+    "        raise\n",
+    "else:\n",
+    "    warn(f\"Cluster '{cluster_name}' not found\")\n",
+    "    print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
+    "    print(\"   See the 'Deploy Environment' section below.\")\n",
+    "    raise RuntimeError(\"Cluster not found. Deploy using Module 1 first.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Verify Namespace and Helm Release\n",
+    "\n",
+    "Check that the LangSmith namespace exists and Helm release is installed.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "helm_release = config[\"HELM_RELEASE\"]\n",
+    "\n",
+    "print(\"### Checking Namespace\\n\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ok(f\"Namespace '{namespace}' exists\")\n",
+    "else:\n",
+    "    warn(f\"Namespace '{namespace}' not found\")\n",
+    "    print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
+    "    print(\"   See the 'Deploy Environment' section below.\")\n",
+    "    raise RuntimeError(\"Namespace not found. Deploy using Module 1 first.\")\n",
+    "\n",
+    "print(\"\\n### Checking Helm Release\\n\")\n",
+    "result = run(\n",
+    "    [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    releases = json.loads(result.stdout)\n",
+    "    release_found = any(r.get(\"name\") == helm_release for r in releases)\n",
+    "    \n",
+    "    if release_found:\n",
+    "        ok(f\"Helm release '{helm_release}' found\")\n",
+    "        # Get release info\n",
+    "        result = run(\n",
+    "            [\"helm\", \"status\", helm_release, \"-n\", namespace, \"-o\", \"json\"],\n",
+    "            check=False,\n",
+    "            stream=False\n",
+    "        )\n",
+    "        if result.returncode == 0:\n",
+    "            release_info = json.loads(result.stdout)\n",
+    "            print(f\"   Status: {release_info.get('info', {}).get('status', 'unknown')}\")\n",
+    "            print(f\"   Chart: {release_info.get('chart', {}).get('metadata', {}).get('name', 'unknown')}\")\n",
+    "            print(f\"   Version: {release_info.get('chart', {}).get('metadata', {}).get('version', 'unknown')}\")\n",
+    "    else:\n",
+    "        warn(f\"Helm release '{helm_release}' not found in namespace '{namespace}'\")\n",
+    "        print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
+    "        print(\"   See the 'Deploy Environment' section below.\")\n",
+    "        raise RuntimeError(\"Helm release not found. Deploy using Module 1 first.\")\n",
+    "else:\n",
+    "    warn(\"Could not list Helm releases\")\n",
+    "    print(\"💡 Make sure Helm is installed and kubectl is configured correctly\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Verify Ingress Endpoint\n",
+    "\n",
+    "Check that the LangSmith ingress is configured and reachable.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "from urllib.parse import urlparse\n",
+    "\n",
+    "print(\"### Checking Ingress\\n\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "ingress_found = False\n",
+    "ingress_host = None\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    ingresses = json.loads(result.stdout)\n",
+    "    for ingress in ingresses.get(\"items\", []):\n",
+    "        rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
+    "        for rule in rules:\n",
+    "            host = rule.get(\"host\", \"\")\n",
+    "            if host:\n",
+    "                ingress_found = True\n",
+    "                ingress_host = host\n",
+    "                print(f\"   Found ingress with host: {host}\")\n",
+    "                break\n",
+    "        if ingress_found:\n",
+    "            # Stop at the first ingress that declares a host\n",
+    "            break\n",
+    "\n",
+    "if not ingress_found:\n",
+    "    warn(\"No ingress found\")\n",
+    "    print(\"💡 Ingress may still be provisioning. Check Module 1 validation notebook.\")\n",
+    "else:\n",
+    "    ok(f\"Ingress configured with host: {ingress_host}\")\n",
+    "    \n",
+    "    # Try to reach the endpoint\n",
+    "    if config[\"LANGSMITH_DOMAIN\"]:\n",
+    "        test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
+    "    elif ingress_host:\n",
+    "        test_url = f\"https://{ingress_host}\"\n",
+    "    else:\n",
+    "        test_url = None\n",
+    "    \n",
+    "    if test_url:\n",
+    "        print(f\"\\n### Testing Endpoint Reachability\\n\")\n",
+    "        print(f\"Testing: {test_url}\")\n",
+    "        try:\n",
+    "            # Allow redirects, don't verify SSL (may be self-signed)\n",
+    "            response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
+    "            if response.status_code in [200, 302, 401, 403]:\n",
+    "                ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
+    "            else:\n",
+    "                warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
+    "        except requests.exceptions.SSLError as e:\n",
+    "            # verify=False already skips certificate validation, so this is a handshake or protocol failure\n",
+    "            warn(f\"TLS handshake failed: {e}\")\n",
+    "            print(\"💡 Check the ingress TLS configuration. In production, use proper TLS certificates.\")\n",
+    "        except requests.exceptions.RequestException as e:\n",
+    "            warn(f\"Could not reach endpoint: {e}\")\n",
+    "            print(\"💡 Ingress may still be provisioning. Wait a few minutes and try again.\")\n",
+    "    else:\n",
+    "        warn(\"No domain configured for testing\")\n",
+    "        print(\"💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Quick Health Check\n",
+    "\n",
+    "Verify that key deployments are running.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"### Checking Key Deployments\\n\")\n",
+    "result = run(\n",
+    "    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
+    "    check=False,\n",
+    "    stream=False\n",
+    ")\n",
+    "\n",
+    "if result.returncode == 0:\n",
+    "    deployments = json.loads(result.stdout)\n",
+    "    deployment_items = deployments.get(\"items\", [])\n",
+    "    \n",
+    "    if deployment_items:\n",
+    "        ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
+    "        print(\"\\nDeployment Status:\")\n",
+    "        for deployment in deployment_items:\n",
+    "            name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
+    "            spec_replicas = deployment.get(\"spec\", {}).get(\"replicas\", 0)\n",
+    "            status_replicas = deployment.get(\"status\", {}).get(\"replicas\", 0)\n",
+    "            ready_replicas = deployment.get(\"status\", {}).get(\"readyReplicas\", 0)\n",
+    "            available_replicas = deployment.get(\"status\", {}).get(\"availableReplicas\", 0)\n",
+    "            \n",
+    "            status_icon = \"✅\" if ready_replicas == spec_replicas and available_replicas == spec_replicas else \"⚠️\"\n",
+    "            print(f\"   {status_icon} {name}: {ready_replicas}/{spec_replicas} ready, {available_replicas}/{spec_replicas} available\")\n",
+    "    else:\n",
+    "        warn(\"No deployments found\")\n",
+    "        print(\"💡 LangSmith may not be fully deployed. Check Module 1 validation notebook.\")\n",
+    "else:\n",
+    "    warn(\"Could not list deployments\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ✅ Environment Ready\n",
+    "\n",
+    "Your LangSmith environment is running and accessible. You're ready to proceed with your module.\n",
+    "\n",
+    "**Next Steps (Module-Specific):**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show module-specific next steps\n",
+    "print(\"### Module-Specific Next Steps\\n\")\n",
+    "\n",
+    "if current_module == \"2\":\n",
+    "    print(\"**Module 2: Identity & Auth**\\n\")\n",
+    "    print(\"1. Run `01_sso_oidc_validation.ipynb` to validate OIDC SSO configuration\")\n",
+    "    print(\"2. (Optional) Run `02_sso_saml_validation.ipynb` if using SAML\")\n",
+    "    print(\"\\n💡 These notebooks are read-only and safe for production use.\")\n",
+    "elif current_module == \"3\":\n",
+    "    print(\"**Module 3: Production Operations & Scaling**\\n\")\n",
+    "    print(\"1. Run `01_ops_sanity_checks.ipynb` to validate production readiness\")\n",
+    "    print(\"2. Review production readiness checklist: `docs/shared/production_readiness_checklist.md`\")\n",
+    "    print(\"3. Review signals and thresholds: `docs/shared/ops_signals_and_thresholds.md`\")\n",
+    "    print(\"\\n💡 This notebook is read-only and safe for production use.\")\n",
+    "elif current_module == \"4\":\n",
+    "    print(\"**Module 4: Troubleshooting & Incident Response**\\n\")\n",
+    "    print(\"1. Run `01_diagnostics_baseline.ipynb` to capture a baseline snapshot\")\n",
+    "    print(\"2. Proceed with failure labs (10, 20, 30, 40)\")\n",
+    "    print(\"3. Optionally run `90_full_incident_drill.ipynb` for a complete incident simulation\")\n",
+    "    print(\"\\n⚠️  REMINDER: Module 4 failure labs modify secrets and cause disruptions.\")\n",
+    "    print(\"   Only run in TEST/NON-PRODUCTION environments.\")\n",
+    "else:\n",
+    "    print(\"**Generic Use**\\n\")\n",
+    "    print(\"This notebook can be used by:\")\n",
+    "    print(\"  - Module 2: Identity & Auth validation (read-only)\")\n",
+    "    print(\"  - Module 3: Production Operations checks (read-only)\")\n",
+    "    print(\"  - Module 4: Troubleshooting failure labs (modifies environment)\")\n",
+    "    print(\"\\n💡 Navigate to the appropriate module directory and run this notebook from there.\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 70)\n",
+    "print(\"📝 Important Reminder\")\n",
+    "print(\"=\" * 70)\n",
+    "print(\"\\n**When finished with workshop modules, run Module 1's `99_teardown.ipynb`\")\n",
+    "print(\"to delete the environment and avoid ongoing cloud costs.**\")\n",
+    "print(\"\\nThe teardown notebook will:\")\n",
+    "print(\"  - Remove Helm release\")\n",
+    "print(\"  - Destroy Terraform-managed infrastructure (Kubernetes cluster, database, cache, blob storage, etc.)\")\n",
+    "print(\"  - Clean up any remaining resources\")\n",
+    "print(\"\\n**Location:** `../module-1/99_teardown.ipynb`\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🚀 Deploy Environment (If Not Already Deployed)\n",
+    "\n",
+    "If your environment is not running, follow these steps to deploy LangSmith using Module 1:\n",
+    "\n",
+    "### Step 1: Preflight Checks\n",
+    "Run `../module-1/01_preflight.ipynb` to validate your environment.\n",
+    "\n",
+    "### Step 2: Provision Infrastructure\n",
+    "Run `../module-1/02_terraform_apply.ipynb` to deploy cloud infrastructure (Kubernetes cluster, database, cache, blob storage).\n",
+    "\n",
+    "### Step 3: Install LangSmith\n",
+    "Run `../module-1/03_helm_install_langsmith.ipynb` to install LangSmith using Helm.\n",
+    "\n",
+    "### Step 4: Validate Deployment\n",
+    "Run `../module-1/04_validate_ingress_and_ui.ipynb` to verify everything is working.\n",
+    "\n",
+    "### Step 5: Return Here\n",
+    "Once deployment is complete, return to this notebook and re-run the cells above to verify your environment is ready.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "**Note:** If you encounter errors during deployment, refer to Module 1 documentation and troubleshooting guides.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -1,5 +1,6 @@
 from __future__ import annotations
 import os
+from datetime import date

 def ok(msg: str) -> None:
    print(f"✅ {msg}")
@@ -18,6 +19,9 @@ def require_env(*keys: str) -> dict:
        if not v:
            missing.append(k)
        cfg[k] = v
+        if k == 'CLUSTER_NAME':
+            # Prefix the cluster name with "langsmith-workshop-<YYYYMMDD>-" (today's date)
+            cfg[k] = f"langsmith-workshop-{date.today().strftime('%Y%m%d')}-{v}"
    if missing:
        fail(f"Missing required environment variables: {', '.join(missing)}")
    return cfg
@@ -0,0 +1,11 @@
+# Test artifacts
+artifacts/
+*.pyc
+__pycache__/
+.pytest_cache/
+.coverage
+htmlcov/
+
+# Notebook execution outputs
+*.ipynb_checkpoints
+
@@ -0,0 +1,127 @@
+# Tests for LangSmith Self-Hosted Workshops
+
+This directory contains tests for validating notebook execution and syntax.
+
+## Test Structure
+
+- `conftest.py`: Pytest configuration and fixtures
+- `test_notebook_execution.py`: Notebook execution tests
+- `requirements.txt`: Test dependencies
+- `artifacts/`: Directory for test artifacts (created automatically)
+
+## Running Tests Locally
+
+### Prerequisites
+
+```bash
+# Install test dependencies
+pip install -r tests/requirements.txt
+
+# Install system dependencies (if needed)
+# macOS: brew install jq
+# Ubuntu: sudo apt-get install jq
+```
+
+### Run All Tests
+
+```bash
+# Run syntax tests only (fast, no infrastructure required)
+CI_SKIP_EXECUTION=true pytest tests/ -v
+
+# Run full execution tests (requires infrastructure)
+pytest tests/ -v
+```
+
+### Run Specific Test Suites
+
+```bash
+# Test Module 1 notebooks
+pytest tests/test_notebook_execution.py::TestModule1Notebooks -v
+
+# Test Module 2 notebooks
+pytest tests/test_notebook_execution.py::TestModule2Notebooks -v
+```
+
+### Run Individual Notebook Tests
+
+```bash
+# Test specific notebook syntax
+pytest tests/test_notebook_execution.py::TestModule1Notebooks::test_module1_notebook_syntax -v
+```
+
+## CI/CD Integration
+
+Tests run automatically on:
+- Pull requests to `main`/`master`
+- Pushes to `main`/`master`
+- Manual workflow dispatch
+
+### GitHub Actions Workflow
+
+The `.github/workflows/test-notebooks.yml` workflow:
+
+1. **Test Notebook Syntax**: Validates JSON structure and code cells
+2. **Test Module 1 Preflight**: Validates preflight notebook structure
+3. **Test Module 2 Syntax**: Validates auth validation notebooks
+4. **Lint Python Code**: Runs flake8 and black checks
+
+### Environment Variables
+
+The workflow uses test environment variables. For full execution tests, set:
+
+```yaml
+# In GitHub Actions secrets/variables
+AWS_ACCESS_KEY_ID
+AWS_SECRET_ACCESS_KEY
+AWS_REGION
+CLUSTER_NAME
+NAMESPACE
+# ... etc
+```
+
+## Test Strategy
+
+### Syntax Tests (Always Run)
+
+- Validate notebook JSON structure
+- Check for code cells
+- Verify imports can be resolved
+- No infrastructure required
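
The syntax checks listed above can be approximated with the standard library alone. The sketch below is illustrative and is not the repository's `_validate_notebook_syntax` helper: the name `check_notebook_syntax` is hypothetical, and replacing lines that start with `%` or `!` (IPython magics and shell escapes) with `pass` is an assumption made so that plain `compile()` can be used.

```python
def check_notebook_syntax(nb: dict) -> list[str]:
    """Return a list of problems found in a parsed notebook (empty means OK).

    Hypothetical helper for illustration; not part of the repository.
    """
    cells = nb.get("cells", [])
    if not cells:
        return ["notebook has no cells"]

    problems = []
    code_cells = [c for c in cells if c.get("cell_type") == "code"]
    if not code_cells:
        problems.append("notebook has no code cells")

    for index, cell in enumerate(code_cells):
        source = cell.get("source", [])
        # nbformat stores source either as a list of lines or as a single string
        text = "".join(source) if isinstance(source, list) else source

        cleaned = []
        for line in text.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")):
                # Assumption: treat this as a magic or shell escape and keep the indentation valid
                indent = line[: len(line) - len(stripped)]
                cleaned.append(indent + "pass")
            else:
                cleaned.append(line)

        try:
            compile("\n".join(cleaned) + "\n", f"<code cell {index}>", "exec")
        except SyntaxError as exc:
            problems.append(f"code cell {index}: {exc.msg} (line {exc.lineno})")
    return problems


# Usage: an in-memory notebook with one valid and one broken code cell
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n"]},
        {"cell_type": "code", "source": ["import os\n", "!ls\n", "print(os.getcwd())\n"]},
        {"cell_type": "code", "source": ["def broken(:\n", "    pass\n"]},
    ]
}
problems = check_notebook_syntax(example)
print(problems)
```

Compiling without executing keeps the check free of infrastructure requirements: nothing in the cell runs, so missing clusters or credentials cannot cause a failure.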
+
+### Execution Tests (Conditional)
+
+- Full notebook execution
+- Requires actual infrastructure (cluster, IdP, etc.)
+- Skipped in CI by default (`CI_SKIP_EXECUTION=true`)
+- Can be enabled for integration testing environments
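
The execution tests are gated by comparing `CI_SKIP_EXECUTION` to the exact string `"true"`, so values such as `True` or `1` do not skip them. A minimal sketch of that comparison follows; the helper name `execution_skipped` is hypothetical, since the tests inline the comparison in `pytest.mark.skipif` rather than calling a function.

```python
import os
from typing import Mapping, Optional


def execution_skipped(env: Optional[Mapping[str, str]] = None) -> bool:
    """Mirror the skipif condition used by the execution tests.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = os.environ if env is None else env
    # Only the exact lowercase string "true" skips execution
    return env.get("CI_SKIP_EXECUTION") == "true"


print(execution_skipped({"CI_SKIP_EXECUTION": "true"}))  # skipped
print(execution_skipped({"CI_SKIP_EXECUTION": "True"}))  # not skipped: the comparison is case-sensitive
print(execution_skipped({}))                             # not skipped: the variable is unset
```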
+
+## Adding New Tests
+
+1. Add notebook to appropriate test class in `test_notebook_execution.py`
+2. Update the `pytest.mark.parametrize` decorator with the notebook name
+3. Add any required environment variables to `conftest.py`
+4. Update GitHub Actions workflow if needed
+
+## Troubleshooting
+
+### Import Errors
+
+If tests fail with import errors:
+- Ensure `notebooks/shared/` is in Python path
+- Check that `conftest.py` is setting up paths correctly
+- Verify all required packages are in `requirements.txt`
+
+### Timeout Errors
+
+If notebook execution times out:
+- Increase timeout in `execute_notebook()` function
+- Check for infinite loops or long-running operations
+- Consider mocking external API calls
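
When raising the timeout, keep in mind that two limits are involved: nbconvert's `ExecutePreprocessor.timeout` applies to each cell, while the `timeout` passed to `subprocess.run` bounds the whole run. The sketch below builds the nbconvert command so that the per-cell limit comes from a parameter; `build_nbconvert_command` and `cell_timeout` are hypothetical names used for illustration, not the repository's code.

```python
import sys
from pathlib import Path


def build_nbconvert_command(notebook_path: Path, cell_timeout: int = 600) -> list[str]:
    """Build the argument list for executing a notebook in place with nbconvert.

    Hypothetical helper for illustration; it only builds the command and does not run it.
    """
    return [
        sys.executable, "-m", "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        # Per-cell limit, taken from the parameter instead of a fixed value
        f"--ExecutePreprocessor.timeout={cell_timeout}",
        "--ExecutePreprocessor.kernel_name=python3",
        str(notebook_path),
    ]


notebook = Path("notebooks") / "module-1" / "01_preflight.ipynb"
command = build_nbconvert_command(notebook, cell_timeout=900)
print(" ".join(command))

# To run it with an overall limit as well (assumes jupyter and nbconvert are installed):
# subprocess.run(command, capture_output=True, text=True, timeout=1800)
```

Keeping the overall limit larger than the per-cell limit lets nbconvert report which cell timed out before the subprocess itself is killed.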
+
+### Environment Variable Issues
+
+If tests fail due to missing env vars:
+- Check `conftest.py` for default values
+- Verify GitHub Actions workflow sets required variables
+- Add variables to test fixtures if needed
+
@@ -0,0 +1,2 @@
+# Tests for LangSmith Self-Hosted Workshops notebooks
+
@@ -0,0 +1,28 @@
+"""
+Pytest configuration and fixtures for notebook testing.
+"""
+import os
+import sys
+from pathlib import Path
+
+# Add notebooks directory to path
+repo_root = Path(__file__).parent.parent
+notebooks_dir = repo_root / "notebooks"
+if str(notebooks_dir) not in sys.path:
+    sys.path.insert(0, str(notebooks_dir))
+
+# Set test environment variables
+os.environ.setdefault("NAMESPACE", "langsmith-test")
+os.environ.setdefault("CLUSTER_NAME", "test-cluster")
+os.environ.setdefault("HELM_RELEASE", "langsmith")
+os.environ.setdefault("ARTIFACTS_DIR", str(repo_root / "tests" / "artifacts"))
+
+# Cloud provider defaults (can be overridden by GitHub Actions)
+os.environ.setdefault("CLOUD_PROVIDER", "aws")
+os.environ.setdefault("AWS_REGION", "us-west-2")
+os.environ.setdefault("AZURE_LOCATION", "eastus")
+
+# Create artifacts directory
+artifacts_dir = Path(os.environ["ARTIFACTS_DIR"])
+artifacts_dir.mkdir(parents=True, exist_ok=True)
+
@@ -0,0 +1,11 @@
+# Test dependencies for notebook execution
+pytest>=7.0.0
+jupyter>=1.0.0
+nbconvert>=6.0.0
+ipykernel>=6.0.0
+
+# Notebook dependencies (should match what notebooks need)
+python-dotenv>=1.0.0
+pyyaml>=6.0
+requests>=2.28.0
+
@@ -0,0 +1,283 @@
+"""
+Test notebook execution using nbconvert.
+
+This module executes notebooks and validates they complete without errors.
+"""
+import json
+import os
+import subprocess
+import sys
+from pathlib import Path
+import pytest
+
+# Repository root
+REPO_ROOT = Path(__file__).parent.parent
+NOTEBOOKS_DIR = REPO_ROOT / "notebooks"
+
+
+def execute_notebook(notebook_path: Path, timeout: int = 600) -> tuple[bool, str]:
+    """
+    Execute a Jupyter notebook using nbconvert.
+    
+    Args:
+        notebook_path: Path to the notebook file
+        timeout: Maximum execution time in seconds
+        
+    Returns:
+        Tuple of (success: bool, output: str)
+    """
+    try:
+        # Use nbconvert to execute the notebook
+        result = subprocess.run(
+            [
+                sys.executable,
+                "-m",
+                "jupyter",
+                "nbconvert",
+                "--to",
+                "notebook",
+                "--execute",
+                "--inplace",
+                f"--ExecutePreprocessor.timeout={timeout}",
+                "--ExecutePreprocessor.kernel_name=python3",
+                str(notebook_path),
+            ],
+            capture_output=True,
+            text=True,
+            timeout=timeout,
+            cwd=str(notebook_path.parent),
+        )
+        
+        if result.returncode == 0:
+            return True, result.stdout
+        else:
+            error_msg = f"STDOUT:\n{result.stdout}\n\nSTDERR:\n{result.stderr}"
+            return False, error_msg
+            
+    except subprocess.TimeoutExpired:
+        return False, f"Notebook execution timed out after {timeout} seconds"
+    except Exception as e:
+        return False, f"Error executing notebook: {str(e)}"
+
+
+def get_notebook_cells(notebook_path: Path) -> list:
+    """Get all code cells from a notebook."""
+    with open(notebook_path, "r", encoding="utf-8") as f:
+        nb = json.load(f)
+    return [cell for cell in nb.get("cells", []) if cell.get("cell_type") == "code"]
+
+
+class TestNotebookExecution:
+    """Base class for notebook execution tests."""
+    
+    @pytest.fixture(autouse=True)
+    def setup_test_env(self, monkeypatch):
+        """Set up test environment variables."""
+        # Set minimal required env vars for testing
+        test_env = {
+            "NAMESPACE": "langsmith-test",
+            "CLUSTER_NAME": "test-cluster",
+            "HELM_RELEASE": "langsmith",
+            "ARTIFACTS_DIR": str(REPO_ROOT / "tests" / "artifacts"),
+            "CLOUD_PROVIDER": os.environ.get("CLOUD_PROVIDER", "aws"),
+            "AWS_REGION": os.environ.get("AWS_REGION", "us-west-2"),
+            "AZURE_LOCATION": os.environ.get("AZURE_LOCATION", "eastus"),
+            # Mock values for testing (will fail actual operations but allow syntax checks)
+            "LANGSMITH_DOMAIN": "test.langsmith.example.com",
+            "OIDC_ISSUER": "https://test-idp.example.com/oauth2/default",
+            "OIDC_CLIENT_ID": "test-client-id",
+            "OIDC_CLIENT_SECRET": "test-client-secret",
+            "OIDC_REDIRECT_URI": "https://test.langsmith.example.com/auth/callback",
+        }
+        
+        for key, value in test_env.items():
+            monkeypatch.setenv(key, value)
+    
+    def _validate_notebook_syntax(self, notebook_path: Path):
+        """Helper method to validate notebook has valid JSON structure and code cells."""
+        assert notebook_path.exists(), f"Notebook not found: {notebook_path}"
+        
+        with open(notebook_path, "r", encoding="utf-8") as f:
+            nb = json.load(f)
+        
+        assert "cells" in nb, "Notebook missing cells"
+        assert len(nb["cells"]) > 0, "Notebook has no cells"
+        
+        code_cells = [c for c in nb["cells"] if c.get("cell_type") == "code"]
+        assert len(code_cells) > 0, "Notebook has no code cells"
+
+
+# Module 1 tests
+class TestModule1Notebooks(TestNotebookExecution):
+    """Test Module 1 notebooks."""
+    
+    @pytest.mark.parametrize("notebook", [
+        "01_preflight.ipynb",
+        "99_teardown.ipynb",  # Always test syntax, even if execution is skipped
+        # Note: Skip terraform/helm/validation notebooks in CI as they require actual infrastructure
+        # "02_terraform_apply.ipynb",
+        # "03_helm_install_langsmith.ipynb",
+        # "04_validate_ingress_and_ui.ipynb",
+    ])
+    def test_module1_notebook_syntax(self, notebook):
+        """Test Module 1 notebook syntax."""
+        notebook_path = NOTEBOOKS_DIR / "module-1" / notebook
+        self._validate_notebook_syntax(notebook_path)
+    
+    @pytest.mark.skipif(
+        os.environ.get("CI_SKIP_EXECUTION") == "true",
+        reason="Skipping execution in CI (requires infrastructure)"
+    )
+    @pytest.mark.parametrize("notebook", [
+        "01_preflight.ipynb",
+    ])
+    def test_module1_notebook_execution(self, notebook):
+        """Test Module 1 notebook execution (only if infrastructure available)."""
+        notebook_path = NOTEBOOKS_DIR / "module-1" / notebook
+        success, output = execute_notebook(notebook_path, timeout=300)
+        assert success, f"Notebook execution failed:\n{output}"
+    
+    @pytest.mark.skipif(
+        os.environ.get("CI_SKIP_EXECUTION") == "true",
+        reason="Skipping execution in CI (requires infrastructure)"
+    )
+    def test_module1_teardown_execution(self):
+        """
+        Test Module 1 teardown notebook execution.
+        
+        This test runs unless CI_SKIP_EXECUTION is "true".
+        
+        Note: As shipped, the teardown notebook's destructive sections are commented
+        out, so this test validates the notebook's structure and execution flow only.
+        Destroying the resources created during execution tests requires manually
+        uncommenting those sections in the notebook itself.
+        
+        IMPORTANT: With those sections uncommented, run this test AFTER all other
+        execution tests. In pytest's default (definition) order it runs before the
+        Module 2-4 tests in this file.
+        """
+        notebook_path = NOTEBOOKS_DIR / "module-1" / "99_teardown.ipynb"
+        # Teardown may take longer, especially for Terraform destroy,
+        # so allow 30 minutes for a full infrastructure teardown
+        success, output = execute_notebook(notebook_path, timeout=1800)
+        assert success, f"Teardown notebook execution failed:\n{output}"
+
+
+# Module 2 tests
+class TestModule2Notebooks(TestNotebookExecution):
+    """Test Module 2 notebooks."""
+    
+    @pytest.mark.parametrize("notebook", [
+        "01_sso_oidc_validation.ipynb",
+        "02_sso_saml_validation.ipynb",
+    ])
+    def test_module2_notebook_syntax(self, notebook):
+        """Test Module 2 notebook syntax."""
+        notebook_path = NOTEBOOKS_DIR / "module-2" / notebook
+        self._validate_notebook_syntax(notebook_path)
+    
+    @pytest.mark.skipif(
+        os.environ.get("CI_SKIP_EXECUTION") == "true",
+        reason="Skipping execution in CI (requires infrastructure)"
+    )
+    @pytest.mark.parametrize("notebook", [
+        "01_sso_oidc_validation.ipynb",
+        "02_sso_saml_validation.ipynb",
+    ])
+    def test_module2_notebook_execution(self, notebook):
+        """Test Module 2 notebook execution (only if infrastructure available)."""
+        notebook_path = NOTEBOOKS_DIR / "module-2" / notebook
+        success, output = execute_notebook(notebook_path, timeout=300)
+        assert success, f"Notebook execution failed:\n{output}"
+
+
+# Module 3 tests
+class TestModule3Notebooks(TestNotebookExecution):
+    """Test Module 3 notebooks."""
+    
+    @pytest.mark.parametrize("notebook", [
+        "01_ops_sanity_checks.ipynb",
+    ])
+    def test_module3_notebook_syntax(self, notebook):
+        """Test Module 3 notebook syntax."""
+        notebook_path = NOTEBOOKS_DIR / "module-3" / notebook
+        self._validate_notebook_syntax(notebook_path)
+    
+    @pytest.mark.skipif(
+        os.environ.get("CI_SKIP_EXECUTION") == "true",
+        reason="Skipping execution in CI (requires infrastructure)"
+    )
+    @pytest.mark.parametrize("notebook", [
+        "01_ops_sanity_checks.ipynb",
+    ])
+    def test_module3_notebook_execution(self, notebook):
+        """Test Module 3 notebook execution (only if infrastructure available)."""
+        notebook_path = NOTEBOOKS_DIR / "module-3" / notebook
+        # Ops sanity checks may take longer due to resource usage checks
+        success, output = execute_notebook(notebook_path, timeout=600)
+        assert success, f"Notebook execution failed:\n{output}"
+
+
+# Module 4 tests
+class TestModule4Notebooks(TestNotebookExecution):
+    """Test Module 4 notebooks."""
+    
+    @pytest.mark.parametrize("notebook", [
+        "00_setup_or_resume_environment.ipynb",
+        "01_diagnostics_baseline.ipynb",
+        "10_failure_lab_postgres.ipynb",
+        "20_failure_lab_redis.ipynb",
+        "30_failure_lab_clickhouse.ipynb",
+        "40_failure_lab_blob_storage.ipynb",
+    ])
+    def test_module4_notebook_syntax(self, notebook):
+        """Test Module 4 notebook syntax."""
+        notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
+        self._validate_notebook_syntax(notebook_path)
+    
+    @pytest.mark.skipif(
+        os.environ.get("CI_SKIP_EXECUTION") == "true",
+        reason="Skipping execution in CI (requires infrastructure)"
+    )
+    @pytest.mark.parametrize("notebook", [
+        "00_setup_or_resume_environment.ipynb",
+        "01_diagnostics_baseline.ipynb",
+    ])
+    def test_module4_notebook_execution(self, notebook):
+        """
+        Test Module 4 notebook execution (only if infrastructure available).
+        
+        Covers the setup and baseline notebooks, which perform read-only validation.
+        Failure labs are executed separately in test_module4_failure_lab_execution.
+        """
+        notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
+        # Setup and baseline checks may take longer due to diagnostics collection
+        success, output = execute_notebook(notebook_path, timeout=600)
+        assert success, f"Notebook execution failed:\n{output}"
+    
+    @pytest.mark.skipif(
+        os.environ.get("CI_SKIP_EXECUTION") == "true",
+        reason="Skipping execution in CI (requires infrastructure and failure injection)"
+    )
+    @pytest.mark.parametrize("notebook", [
+        "10_failure_lab_postgres.ipynb",
+        "20_failure_lab_redis.ipynb",
+        "30_failure_lab_clickhouse.ipynb",
+        "40_failure_lab_blob_storage.ipynb",
+    ])
+    def test_module4_failure_lab_execution(self, notebook):
+        """
+        Test Module 4 failure lab notebook execution (only if infrastructure available).
+        
+        WARNING: These notebooks inject failures by modifying secrets and configurations.
+        They should only be run in test environments, not production.
+        
+        These tests validate that failure injection and remediation workflows function
+        correctly. The notebooks include safety mechanisms (commented-out injection code)
+        but should still be used with caution.
+        """
+        notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
+        # Failure labs may take longer due to failure injection, observation, and remediation
+        success, output = execute_notebook(notebook_path, timeout=900)  # 15 minutes
+        assert success, f"Notebook execution failed:\n{output}"
+
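The syntax tests above only confirm that each notebook is valid JSON with at least one code cell; they do not parse the Python inside the cells. A minimal sketch of a stricter check follows. The helper name `find_cell_syntax_errors` is hypothetical and not part of this change. It assumes cells starting with `%%` are non-Python cell magics and that any line starting with `%` or `!` is an IPython magic or shell escape. That assumption can misfire on continuation lines that begin with `%`, and it does not handle forms such as `x = !ls`.

```python
import ast
import json
import tempfile
from pathlib import Path
from typing import List


def find_cell_syntax_errors(notebook_path: Path) -> List[str]:
    """Return one message per code cell that fails to parse as Python."""
    with open(notebook_path, "r", encoding="utf-8") as f:
        nb = json.load(f)

    errors = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores source as a string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        # Assumption: cell magics such as %%bash mark non-Python cells
        if source.lstrip().startswith("%%"):
            continue
        cleaned = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")):
                # Swap line magics and shell escapes for `pass`,
                # keeping the indentation so enclosing blocks still parse
                indent = line[: len(line) - len(stripped)]
                cleaned.append(indent + "pass")
            else:
                cleaned.append(line)
        try:
            ast.parse("\n".join(cleaned))
        except SyntaxError as exc:
            errors.append(f"cell {index}: {exc.msg} (line {exc.lineno})")
    return errors


if __name__ == "__main__":
    # Build a throwaway notebook with one valid and one broken code cell
    demo = {
        "cells": [
            {"cell_type": "code", "source": ["%matplotlib inline\n", "x = 1\n"]},
            {"cell_type": "code", "source": ["def broken(:\n", "    pass\n"]},
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo.ipynb"
        path.write_text(json.dumps(demo), encoding="utf-8")
        for message in find_cell_syntax_errors(path):
            print(message)
```

If adopted, `_validate_notebook_syntax` could assert that the returned list is empty, so the syntax tests would catch Python parse errors without executing any cell.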
@@ -0,0 +1,2 @@
+# Tests for LangSmith Self-Hosted Workshops notebooks