Merge pull request #1 from langchain-ai/cwaddingham/refinements

Refinements to the initial creation
Cory Waddingham
2026-01-06 16:46:31 -08:00
committed by GitHub
40 changed files with 14317 additions and 679 deletions
+3
@@ -0,0 +1,3 @@
# Automatically strip output cells from Jupyter notebooks before committing
*.ipynb filter=nbstripout
+120
@@ -0,0 +1,120 @@
# GitHub Actions Workflows
This directory contains CI/CD workflows for the LangSmith Self-Hosted Workshops repository.
## Workflows
### `test-notebooks.yml`
Main workflow for testing notebook syntax and execution.
**Triggers:**
- Pull requests to `main`/`master`
- Pushes to `main`/`master`
- Manual workflow dispatch
**Jobs:**
1. **test-notebook-syntax**: Validates notebook JSON structure
2. **test-module-1-preflight**: Tests Module 1 preflight notebook
3. **test-module-2-syntax**: Tests Module 2 auth validation notebooks
4. **lint-python**: Lints Python code in shared modules
## Environment Variables
### Required for Syntax Tests (Always Available)
These are set in the workflow and don't require secrets:
```yaml
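NAMESPACE: "langsmith-test"
CLUSTER_NAME: "test-cluster"
HELM_RELEASE: "langsmith"
CLOUD_PROVIDER: "aws"
AWS_REGION: "us-west-2"
AZURE_LOCATION: "eastus"
LANGSMITH_DOMAIN: "test.langsmith.example.com"
OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
OIDC_CLIENT_ID: "test-client-id"
OIDC_CLIENT_SECRET: "test-client-secret"
OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"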
```
### Required for Full Execution Tests (Optional)
For full notebook execution, set these in GitHub Secrets:
**AWS:**
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_REGION`
- `AWS_ACCOUNT_ID` (optional, for validation)
**Azure:**
- `AZURE_CLIENT_ID`
- `AZURE_CLIENT_SECRET`
- `AZURE_TENANT_ID`
- `AZURE_SUBSCRIPTION_ID`
- `AZURE_LOCATION`
**Infrastructure:**
- `CLUSTER_NAME`
- `NAMESPACE`
- `TERRAFORM_REPO_DIR`
- `HELM_REPO_DIR`
**OIDC/SAML (Module 2):**
- `OIDC_ISSUER`
- `OIDC_CLIENT_ID`
- `OIDC_CLIENT_SECRET`
- `OIDC_REDIRECT_URI`
- `SAML_METADATA_URL` (if using SAML)
## Customizing Workflows
### Adding New Test Jobs
1. Add job to `test-notebooks.yml`
2. Set appropriate `needs:` dependencies
3. Configure environment variables
4. Add artifact uploads if needed
### Enabling Full Execution Tests
To enable full notebook execution in CI:
1. Set required secrets in GitHub repository settings
2. Remove or modify `CI_SKIP_EXECUTION` environment variable
3. Update test conditions in `test_notebook_execution.py`
### Adding New Modules
When adding Module 3, 4, etc.:
1. Create new test class in `test_notebook_execution.py`
2. Add parametrized tests for new notebooks
3. Add new job in GitHub Actions workflow
4. Update this README
## Workflow Status
A workflow status badge can be added to the repository README:
```markdown
![Test Notebooks](https://github.com/your-org/langsmith-self-hosted-workshops/workflows/Test%20Notebooks/badge.svg)
```
## Troubleshooting
### Workflow Fails on Syntax Tests
- Check notebook JSON is valid
- Verify all imports are available
- Check Python version compatibility
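
As a quick local check before pushing, the structural validation can be reproduced with the standard library alone. This is a minimal sketch that mirrors the inline check in the workflow; the function name `notebook_problems` is illustrative and not part of the repository's test suite.

```python
import json


def notebook_problems(text):
    """Return a list of structural problems in notebook JSON text (empty if none)."""
    try:
        nb = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(nb, dict):
        return ["top-level JSON value is not an object"]
    cells = nb.get("cells")
    if not isinstance(cells, list) or not cells:
        return ["missing or empty 'cells' list"]
    # Every cell should at least declare its type (code, markdown, raw).
    return [
        f"cell {i} has no 'cell_type'"
        for i, cell in enumerate(cells)
        if not isinstance(cell, dict) or "cell_type" not in cell
    ]
```

To check a file, pass its contents, for example `notebook_problems(open("notebooks/module-1/01_preflight.ipynb").read())`.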
### Workflow Times Out
- Increase `timeout-minutes` in job definition
- Check for long-running operations
- Consider splitting into smaller jobs
### Environment Variable Issues
- Verify secrets are set in repository settings
- Check variable names match exactly
- Ensure secrets are accessible to workflow
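
When a job fails because of configuration, it can help to report exactly which variables are missing. The following is a minimal sketch; the variable names come from the AWS and OIDC lists above, while the helper `missing_variables` is hypothetical. Empty values are treated as missing because an unset GitHub secret expands to an empty string.

```python
import os

# Names copied from the AWS and OIDC lists above; extend per provider as needed.
AWS_SECRETS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]
OIDC_SECRETS = [
    "OIDC_ISSUER",
    "OIDC_CLIENT_ID",
    "OIDC_CLIENT_SECRET",
    "OIDC_REDIRECT_URI",
]


def missing_variables(names, env=None):
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_variables(AWS_SECRETS + OIDC_SECRETS)` at the start of a test run gives a single, readable failure message instead of a stack trace deep inside a notebook.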
+256
@@ -0,0 +1,256 @@
name: Test Notebooks

on:
  pull_request:
    branches:
      - main
      - master
  push:
    branches:
      - main
      - master
  workflow_dispatch: # Allow manual triggering

jobs:
  test-notebook-syntax:
    name: Test Notebook Syntax
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y jq

      - name: Install test dependencies
        run: |
          pip install -r tests/requirements.txt

      - name: Run notebook syntax tests
        env:
          CI_SKIP_EXECUTION: "true" # Skip full execution, only test syntax
          NAMESPACE: "langsmith-test"
          CLUSTER_NAME: "test-cluster"
          HELM_RELEASE: "langsmith"
          CLOUD_PROVIDER: "aws"
          AWS_REGION: "us-west-2"
          AZURE_LOCATION: "eastus"
          LANGSMITH_DOMAIN: "test.langsmith.example.com"
          OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
          OIDC_CLIENT_ID: "test-client-id"
          OIDC_CLIENT_SECRET: "test-client-secret"
          OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"
        run: |
          pytest tests/test_notebook_execution.py::TestModule1Notebooks::test_module1_notebook_syntax -v
          pytest tests/test_notebook_execution.py::TestModule2Notebooks::test_module2_notebook_syntax -v
          pytest tests/test_notebook_execution.py::TestModule3Notebooks::test_module3_notebook_syntax -v
          pytest tests/test_notebook_execution.py::TestModule4Notebooks::test_module4_notebook_syntax -v

      - name: Upload test artifacts
        if: always()
        # upload-artifact@v3 has been retired by GitHub; v4 is the supported major version
        uses: actions/upload-artifact@v4
        with:
          name: test-artifacts
          path: tests/artifacts/
          retention-days: 1

  test-module-1-preflight:
    name: Test Module 1 Preflight (Dry Run)
    runs-on: ubuntu-latest
    timeout-minutes: 15
    needs: test-notebook-syntax
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y jq
          # Install mock tools (these won't actually work, but allow import checks)
          sudo ln -sf /bin/true /usr/local/bin/aws || true
          sudo ln -sf /bin/true /usr/local/bin/terraform || true
          sudo ln -sf /bin/true /usr/local/bin/helm || true
          sudo ln -sf /bin/true /usr/local/bin/kubectl || true

      - name: Install test dependencies
        run: |
          pip install -r tests/requirements.txt

      - name: Create test environment file
        run: |
          mkdir -p notebooks
          cat > notebooks/workshop.env <<EOF
          WORKSHOP_NAME="langsmith-test"
          NAMESPACE="langsmith-test"
          CLUSTER_NAME="test-cluster"
          AWS_REGION="us-west-2"
          HELM_RELEASE="langsmith"
          ARTIFACTS_DIR="./tests/artifacts"
          DRY_RUN="true"
          EOF

      - name: Run Module 1 preflight notebook (syntax only)
        env:
          CI_SKIP_EXECUTION: "true"
          NAMESPACE: "langsmith-test"
          CLUSTER_NAME: "test-cluster"
          HELM_RELEASE: "langsmith"
          CLOUD_PROVIDER: "aws"
          AWS_REGION: "us-west-2"
        run: |
          # Test that notebook can be loaded and parsed
          python -c "
          import json
          import sys
          from pathlib import Path

          nb_path = Path('notebooks/module-1/01_preflight.ipynb')
          with open(nb_path) as f:
              nb = json.load(f)

          # Validate structure
          assert 'cells' in nb
          assert len(nb['cells']) > 0
          print(f'✅ Notebook structure valid: {len(nb[\"cells\"])} cells')
          sys.exit(0)
          "

      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: module-1-artifacts
          path: tests/artifacts/
          retention-days: 1

  test-module-2-syntax:
    name: Test Module 2 Notebooks (Syntax)
    runs-on: ubuntu-latest
    timeout-minutes: 15
    needs: test-notebook-syntax
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install test dependencies
        run: |
          pip install -r tests/requirements.txt

      - name: Create test environment file
        run: |
          mkdir -p notebooks
          cat > notebooks/workshop.env <<EOF
          WORKSHOP_NAME="langsmith-test"
          NAMESPACE="langsmith-test"
          CLUSTER_NAME="test-cluster"
          AWS_REGION="us-west-2"
          HELM_RELEASE="langsmith"
          ARTIFACTS_DIR="./tests/artifacts"
          LANGSMITH_DOMAIN="test.langsmith.example.com"
          OIDC_ISSUER="https://test-idp.example.com/oauth2/default"
          OIDC_CLIENT_ID="test-client-id"
          OIDC_CLIENT_SECRET="test-client-secret"
          OIDC_REDIRECT_URI="https://test.langsmith.example.com/auth/callback"
          EOF

      - name: Validate Module 2 notebooks
        env:
          CI_SKIP_EXECUTION: "true"
          NAMESPACE: "langsmith-test"
          CLUSTER_NAME: "test-cluster"
          HELM_RELEASE: "langsmith"
          LANGSMITH_DOMAIN: "test.langsmith.example.com"
          OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
          OIDC_CLIENT_ID: "test-client-id"
          OIDC_CLIENT_SECRET: "test-client-secret"
          OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"
        run: |
          python -c "
          import json
          import sys
          from pathlib import Path

          notebooks = [
              'notebooks/module-2/01_sso_oidc_validation.ipynb',
              'notebooks/module-2/02_sso_saml_validation.ipynb',
          ]

          for nb_path_str in notebooks:
              nb_path = Path(nb_path_str)
              if not nb_path.exists():
                  print(f'❌ Notebook not found: {nb_path}')
                  sys.exit(1)

              with open(nb_path) as f:
                  nb = json.load(f)

              assert 'cells' in nb, f'Missing cells in {nb_path}'
              assert len(nb['cells']) > 0, f'No cells in {nb_path}'

              code_cells = [c for c in nb['cells'] if c.get('cell_type') == 'code']
              print(f'✅ {nb_path.name}: {len(code_cells)} code cells, {len(nb[\"cells\"])} total cells')

          print('✅ All Module 2 notebooks validated')
          sys.exit(0)
          "

      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: module-2-artifacts
          path: tests/artifacts/
          retention-days: 1

  lint-python:
    name: Lint Python Code
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install linting tools
        run: |
          pip install flake8 black isort

      - name: Run flake8
        run: |
          flake8 notebooks/shared/ tests/ --max-line-length=120 --ignore=E501,W503 || true

      - name: Check code formatting with black
        run: |
          black --check notebooks/shared/ tests/ || true
+3 -1
@@ -6,6 +6,8 @@ The workshop is designed for **platform, infrastructure, and MLOps engineers** r
> **Note:** This workshop assumes deployment on *NIX-based servers, preferably Linux. If you must use Windows, please raise an issue in the [GitHub](https://github.com/langchain-ai/langsmith-self-hosted-workshops) repo and LangChain engineers will address it.
> **Note:** This workshop uses Jupyter notebooks for its demonstrations. You can run them locally on your own [Jupyter server](https://jupyter.org/) or use Google's [GitHub-to-Colab tool](https://githubtocolab.com) with your existing Google Workspace account.
This repo complements (but does not replace) the high-level deployment instructions in [the LangSmith documentation](https://docs.langchain.com). Where the docs explain *what* to do, this workshop focuses on *how to do it safely and repeatedly*.
---
@@ -178,7 +180,7 @@ git clone https://github.com/langchain-ai/helm.git <your-helm-path>
### 3. Start the Workshop
1. Read `docs/modules/module-1.md` for module overview and context
-2. Open `notebooks/module-1/01_aws_preflight.ipynb` in Jupyter
+2. Open `notebooks/module-1/01_preflight.ipynb` in Jupyter
3. Run the bootstrap cell (first cell) to validate your environment
4. Follow the notebook cells sequentially
+606
@@ -0,0 +1,606 @@
# Module 1: Deployment & Baseline Validation
**Goal:** Deploy LangSmith self-hosted using the official Terraform and Helm repositories, establishing a supported baseline configuration.
**Duration:** ~2 hours
**Audience:** Platform engineers, infrastructure teams, and operators deploying LangSmith for the first time
**Prerequisites:**
- Cloud provider account with appropriate permissions
- Local tooling installed (`aws`/`az`, `terraform`, `kubectl`, `helm`, `jq`)
- LangSmith self-hosted license key
- Basic familiarity with Kubernetes (pods, services, ingress)
---
## Motivation
Most self-hosted LangSmith failures occur **before** users ever touch the product:
- Mis-sized clusters that "work" until users arrive
- Unsupported ingress setups causing connectivity issues
- In-cluster databases used past their limits
- Missing storage primitives (blob storage, persistent volumes)
- Incorrect infrastructure configuration leading to data loss
Module 1 exists to ensure every deployment starts from a **supported baseline** using the **official Terraform and Helm repositories**. This baseline becomes the foundation for production operations (Module 3) and authentication (Module 2).
---
## Outcomes
By the end of this module, participants will:
- Deploy cloud infrastructure using the official `langchain-ai/terraform` repository
- Install LangSmith using the official `langchain-ai/helm` chart
- Validate cluster readiness, storage, and ingress
- Understand *why* specific architectural choices are required
- Establish a baseline configuration for future modules
- Be ready to layer in authentication (Module 2) and production operations (Module 3)
---
## What This Module Avoids
- **SSO / OIDC / SAML:** Covered in Module 2
- **HA tuning beyond defaults:** Covered in Module 3
- **Advanced autoscaling (KEDA):** Covered in Module 3
- **Performance benchmarking:** Out of scope
- **Custom infrastructure:** We use official Terraform modules only
- **Forked repositories:** We reference official repos directly
This keeps the baseline clean, repeatable, and supportable.
---
## Architecture Baseline (What We Support)
This workshop uses a **single, opinionated baseline**:
### Compute
- **AWS:** Amazon EKS (Elastic Kubernetes Service)
- **Azure:** Azure Kubernetes Service (AKS)
- **GCP:** Google Kubernetes Engine (GKE) - coming soon
### Ingress
- **AWS:** AWS Application Load Balancer (ALB) - cloud-native load balancer only
- **Azure:** Azure Application Gateway - cloud-native load balancer only
- **Why:** Cloud-native load balancers provide automatic scaling, health checks, and integration with cloud provider services
### Datastores
- **PostgreSQL:** Managed service (RDS for AWS, Azure Database for PostgreSQL)
- **Redis:** Managed service (ElastiCache for AWS, Azure Cache for Redis)
- **ClickHouse:** Managed service (ClickHouse Cloud) OR in-cluster with EBS CSI/Azure Disk CSI
- **Why:** Managed services reduce operational overhead and provide automated backups
### Blob Storage
- **AWS:** S3 (Simple Storage Service) - **required for production**
- **Azure:** Azure Blob Storage - **required for production**
- **Why:** Without blob storage, ClickHouse table size explodes under load, making the system unusable
### Provisioning
- **Infrastructure:** Terraform (official `langchain-ai/terraform` repository)
- **Application:** Helm (official `langchain-ai/helm` chart)
### Deviations
Deviations from this baseline are discussed in advanced modules but not used here. This ensures:
- Support can help troubleshoot standard configurations
- Updates and security patches are straightforward
- Documentation and runbooks apply directly
---
## Workshop Flow
### 1️⃣ Environment Readiness & Preflight (20-30 min)
**Notebook:** `01_preflight.ipynb`
**What we validate:**
- Tooling validation (cloud CLI, terraform, kubectl, helm, jq)
- Cloud provider credentials & region sanity check
- Cluster capacity expectations
- Storage prerequisites (CSI drivers, StorageClasses)
- Blob storage requirement (cloud object storage)
**Key emphasis:**
- Verify you're using the correct cloud account/subscription
- Ensure all required tools are available and in PATH
- Validate storage CSI drivers are installed
- Confirm blob storage is accessible
**Output:**
- Environment validated and ready
- Artifacts directory created
- Cloud provider identity confirmed
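
The tooling check can be sketched with `shutil.which`, which looks a command up on `PATH` without running it. This is an illustrative sketch, not the notebook's actual implementation; the tool names come from the prerequisites, and the `"azure"` provider key is an assumption (the CI workflow only shows `"aws"`).

```python
import shutil

# Tools named in the prerequisites; the cloud CLI depends on the provider.
COMMON_TOOLS = ["terraform", "kubectl", "helm", "jq"]
CLOUD_CLI = {"aws": "aws", "azure": "az"}  # "azure" key is an assumed value


def missing_tools(cloud_provider, which=shutil.which):
    """Return required command-line tools that cannot be found on PATH."""
    tools = [CLOUD_CLI[cloud_provider]] + COMMON_TOOLS
    return [tool for tool in tools if which(tool) is None]
```

The `which` parameter exists only so the check can be exercised without the real tools installed. Finding a binary on `PATH` says nothing about its version or credentials, so identity and version checks still need to follow.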
---
### 2️⃣ Terraform: Provisioning the Platform Substrate (45-60 min)
**Notebook:** `02_terraform_apply.ipynb`
**What we deploy:**
- Managed Kubernetes cluster (EKS/AKS)
- Managed PostgreSQL database (RDS/Azure Database)
- Managed Redis cache (ElastiCache/Azure Cache)
- Object storage for blob storage (S3/Azure Blob Storage)
- IAM/RBAC roles and policies
- Storage CSI driver addon
**Key principles:**
- Use the **official** Terraform repo (do not fork)
- Pin module versions for reproducibility
- Use remote state & locking
- Plan before applying
- Capture outputs needed for Helm
**Workflow:**
1. Clone and navigate to official Terraform repository
2. Identify correct module path for your cloud provider
3. Pin module versions in `versions.tf`
4. Configure Terraform variables (region, cluster name, database credentials)
5. Initialize Terraform (`terraform init`)
6. Create Terraform plan (`terraform plan`)
7. Review plan carefully
8. Apply infrastructure (`terraform apply`)
9. Capture outputs for Helm configuration
**Key emphasis:**
- Why we do *not* fork upstream
- Why remote state & locking matter
- What support will expect to see later
- How to interpret Terraform outputs
**Output:**
- Infrastructure deployed and healthy
- Terraform outputs captured
- Cluster accessible via kubectl
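
One way to capture outputs for the Helm step is to parse the result of `terraform output -json`, which maps each output name to an object with `value`, `type`, and `sensitive` fields. The sketch below separates plain values from sensitive ones so that only the former are written to the artifacts directory. The function name and the output names used in the example are hypothetical; the real names depend on the Terraform module in use.

```python
import json


def split_terraform_outputs(raw_json):
    """Split `terraform output -json` text into (plain values, sensitive names)."""
    outputs = json.loads(raw_json)
    plain = {}
    sensitive = []
    for name, entry in outputs.items():
        if entry.get("sensitive"):
            # The JSON form prints sensitive values in clear text,
            # so keep them out of saved artifacts and logs.
            sensitive.append(name)
        else:
            plain[name] = entry["value"]
    return plain, sorted(sensitive)
```

Sensitive outputs such as database passwords are better passed to the cluster as Kubernetes secrets than stored alongside the other captured values.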
---
### 3️⃣ Helm: Installing LangSmith (45-60 min)
**Notebook:** `03_helm_install_langsmith.ipynb`
**What we install:**
- LangSmith application components
- External service connections (PostgreSQL, Redis, blob storage)
- Resource requests & limits
- Ingress configuration
**Key principles:**
- Use the **official** Helm chart (do not fork)
- Pin chart versions for reproducibility
- Create minimal, sane values file
- Inject required secrets properly
- Render templates before install
- Understand that "helm install succeeded" ≠ "system is healthy"
**Workflow:**
1. Clone and navigate to official Helm repository
2. Identify correct chart path
3. Pin chart version
4. Create minimal values file:
- External service connections (database, cache, blob storage)
- Resource requests & limits
- Ingress configuration
- Required secrets
5. Create Kubernetes secrets for sensitive values
6. Render templates (`helm template`) to validate
7. Install chart (`helm install`)
8. Verify installation (`helm status`)
**Key emphasis:**
- External services wiring (why managed services matter)
- Resource requests & limits (why they're required)
- Why "helm install succeeded" ≠ "system is healthy"
- Start with minimal values file and only configure what you need
**Output:**
- LangSmith application deployed
- Pods starting (may not be ready yet)
- Helm release created
---
### 4️⃣ Validation & Go/No-Go Checklist (20-30 min)
**Notebook:** `04_validate_ingress_and_ui.ipynb`
**What we validate:**
1. Pod readiness (all pods running)
2. License key validation (properly configured)
3. PVC binding (storage provisioned)
4. External services connectivity (PostgreSQL, Redis, blob storage)
5. Ingress provisioning (load balancer created)
6. Endpoint reachability (services accessible)
7. Basic UI availability (web interface works)
8. Basic functional test (optional trace submission)
**Key emphasis:**
- This checklist becomes your **baseline reference** for future troubleshooting
- Most issues are caught here, before real users onboard
- Validation ensures you're on a **supported path**
**Workflow:**
1. Verify all pods are running and ready
2. Validate license key is configured correctly
3. Check PVCs are bound (storage provisioned)
4. Test connectivity to external services
5. Verify ingress is provisioned and accessible
6. Test endpoint reachability (HTTPS)
7. Verify UI is accessible
8. Optional: Submit test trace to validate functionality
**Output:**
- Deployment validated and healthy
- Baseline reference established
- Ready for Module 2 (authentication)
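
The first checklist item can be made concrete by inspecting the JSON from `kubectl get pods -n <namespace> -o json`. A minimal sketch, assuming the standard Kubernetes pod status layout; the function name is illustrative and not taken from the notebook.

```python
def pods_not_ready(pod_list):
    """Return names of pods that are neither Ready nor completed."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Job pods never report Ready
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names
```

Checking the `Ready` condition rather than the `Running` phase matters here: a pod can be running while its readiness probe still fails, which is one way "helm install succeeded" differs from "system is healthy".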
---
### 5️⃣ Teardown & Cleanup (Optional, 30-45 min)
**Notebook:** `99_teardown.ipynb`
**What we clean up:**
- Helm release (LangSmith application)
- Kubernetes resources (secrets, PVCs)
- Terraform-managed infrastructure (cluster, database, cache, blob storage)
**Key emphasis:**
- Avoid ongoing cloud costs
- Practice proper resource lifecycle management
- Verify cleanup completed successfully
**Workflow:**
1. Uninstall Helm release
2. Clean up remaining Kubernetes resources
3. Destroy Terraform infrastructure
4. Verify all resources removed
**Output:**
- All resources destroyed
- No ongoing costs
- Clean slate for re-deployment
---
## Common Pitfalls Addressed in Module 1
### ClickHouse PVCs Stuck in `Pending`
**Symptom:** ClickHouse pods cannot start, PVCs remain in `Pending` state.
**Cause:** Missing EBS CSI driver (AWS) or Azure Disk CSI driver (Azure).
**Fix:** Install CSI driver addon before deploying LangSmith.
**Prevention:** Preflight checks validate CSI driver installation.
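
To spot this condition quickly, the output of `kubectl get pvc -n <namespace> -o json` can be filtered for claims that are not `Bound`. This is a sketch under the assumption of the standard PersistentVolumeClaim layout; the helper name is hypothetical.

```python
def unbound_pvcs(pvc_list):
    """Return name, phase, and storage class for every PVC that is not Bound."""
    report = []
    for pvc in pvc_list.get("items", []):
        phase = pvc.get("status", {}).get("phase")
        if phase != "Bound":
            report.append(
                {
                    "name": pvc["metadata"]["name"],
                    "phase": phase,
                    # A missing or unexpected class often points at the CSI driver.
                    "storage_class": pvc.get("spec", {}).get("storageClassName"),
                }
            )
    return report
```

A `Pending` claim whose storage class has no installed provisioner is the typical signature of the missing CSI driver described above.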
### Load Balancer Never Appears
**Symptom:** Ingress created but no load balancer provisioned.
**Cause:** Wrong ingress class or missing ingress controller.
**Fix:** Use cloud-native ingress class (AWS: `alb`, Azure: `azure/application-gateway`).
**Prevention:** Preflight checks validate ingress controller installation.
### Inline Trace Payloads Exploding ClickHouse
**Symptom:** ClickHouse table size grows rapidly, queries slow down.
**Cause:** Blob storage not configured, large payloads stored inline in ClickHouse.
**Fix:** Configure S3 (AWS) or Azure Blob Storage (Azure) before deployment.
**Prevention:** Preflight checks validate blob storage accessibility.
### Under-Sized Clusters That "Work" Until Users Arrive
**Symptom:** Deployment works initially but fails under load.
**Cause:** Cluster nodes too small, insufficient resources.
**Fix:** Use recommended node sizes (see Service Sizing Baselines below and in Module 3).
**Prevention:** Preflight checks validate cluster capacity expectations.
### Terraform State Lock Issues
**Symptom:** `terraform apply` fails with state lock error.
**Cause:** Another process holds the lock, or previous operation didn't release lock.
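**Fix:** Confirm that no other Terraform run is active, then release a stale lock with `terraform force-unlock <LOCK_ID>`. Use a remote state backend with locking (S3 + DynamoDB for AWS, Azure Storage for Azure) so that locks are enforced consistently.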
**Prevention:** Terraform configuration uses remote state by default.
---
## Service Sizing Baselines
### Kubernetes Cluster
**Production baseline:**
- **Node instance type:** m5.xlarge (4 vCPU, 16 GB RAM) minimum
- **Node count:** 3+ nodes (for HA)
- **Storage:** EBS gp3 (AWS) or Premium SSD (Azure) with 100+ GB per node
**Non-production guidance:**
- m5.large (2 vCPU, 8 GB RAM) acceptable for development
- 2 nodes sufficient for non-production
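
The arithmetic behind these baselines is worth making explicit, because a cluster must still hold the workload when one node is drained or fails. The sketch below uses the instance figures quoted above; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    vcpu: int
    memory_gb: int


M5_XLARGE = NodeType("m5.xlarge", vcpu=4, memory_gb=16)
M5_LARGE = NodeType("m5.large", vcpu=2, memory_gb=8)


def raw_capacity(node, count):
    """Total (vCPU, GB RAM) across `count` nodes, before system reservations."""
    return node.vcpu * count, node.memory_gb * count
```

Three m5.xlarge nodes give 12 vCPU and 48 GB in total, and 8 vCPU and 32 GB with one node out of service. Allocatable capacity is lower than these raw figures because the kubelet and system daemons reserve a share of each node.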
### PostgreSQL
**Production baseline:**
- **Instance size:** db.r5.xlarge (4 vCPU, 32 GB RAM) minimum
- **Storage:** 500 GB+ with autoscaling enabled
- **High availability:** Multi-AZ deployment (RDS) or read replicas (Azure)
**Non-production guidance:**
- db.t3.medium (2 vCPU, 4 GB RAM) acceptable for development
- Single-AZ acceptable for non-production
### Redis
**Production baseline:**
- **Instance type:** cache.r6g.xlarge (4 vCPU, 26.32 GiB RAM) minimum
- **High availability:** Redis Cluster mode enabled (3+ nodes)
**Non-production guidance:**
- cache.t3.micro acceptable for development
- Single node acceptable for non-production
### ClickHouse
**Production baseline:**
- **Deployment:** Managed ClickHouse (ClickHouse Cloud) OR in-cluster with EBS CSI
- **In-cluster sizing:** 3-node cluster minimum (for HA)
- **Resources per node:** 8 CPU, 32 GB RAM, 1 TB storage
**Non-production guidance:**
- Single node acceptable for development
- 4 CPU, 16 GB RAM per node sufficient
---
## Blob Storage Requirement
### Why Blob Storage is Required
**Problem without blob storage:**
- Large trace payloads stored inline in ClickHouse
- ClickHouse table size explodes
- Query performance degrades
- Storage costs increase dramatically
- System becomes unusable under load
**Solution with blob storage:**
- Large payloads stored in S3/Azure Blob Storage
- ClickHouse stores only references (small strings)
- Query performance remains stable
- Storage costs scale linearly
- System handles production load
### Requirements
**Production:**
- **Service:** S3 (AWS) or Azure Blob Storage (Azure)
- **Bucket/Container:** Dedicated bucket for LangSmith
- **Access:** IAM roles (AWS) or Managed Identity (Azure) - no access keys
- **Versioning:** Enabled for data protection
- **Encryption:** Server-side encryption enabled
**Non-production:**
- Local MinIO or in-cluster object storage acceptable
- Access keys acceptable (not for production)
- No versioning required
---
## Terraform Best Practices
### Use Official Repository
**Why:**
- Support expects standard configurations
- Updates and security patches are provided
- Documentation and examples are maintained
- Compatibility with Helm chart is guaranteed
**How:**
- Clone `langchain-ai/terraform` repository
- Reference modules directly (do not fork)
- Pin module versions in `versions.tf`
### Remote State & Locking
**Why:**
- Prevents concurrent modifications
- Enables team collaboration
- Provides state history
- Prevents state corruption
**Configuration:**
- **AWS:** S3 backend with DynamoDB table for locking
- **Azure:** Azure Storage backend with blob container
### Plan Before Apply
**Why:**
- Review changes before applying
- Catch configuration errors early
- Understand resource impact
- Validate variable values
**Workflow:**
1. `terraform init` - Initialize backend and modules
2. `terraform plan` - Generate execution plan
3. Review plan carefully
4. `terraform apply` - Apply changes
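
Reviewing a plan is easier with a short summary of what it will do. If the plan is saved with `terraform plan -out=tfplan`, `terraform show -json tfplan` emits a machine-readable form whose `resource_changes` entries each carry a list of `actions`. The sketch below counts actions and lists anything that would be deleted; the function name is hypothetical.

```python
import json
from collections import Counter


def summarize_plan(plan_json):
    """Return (action counts, addresses of resources the plan would delete)."""
    plan = json.loads(plan_json)
    counts = Counter()
    destructive = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as two actions, e.g. "delete+create".
        counts["+".join(actions)] += 1
        if "delete" in actions:
            destructive.append(change["address"])
    return dict(counts), destructive
```

A non-empty second element is a prompt to stop and read the full plan, particularly for databases and storage buckets.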
---
## Helm Best Practices
### Use Official Chart
**Why:**
- Support expects standard configurations
- Updates and security patches are provided
- Documentation and examples are maintained
- Compatibility with Terraform outputs is guaranteed
**How:**
- Clone `langchain-ai/helm` repository
- Reference chart directly (do not fork)
- Pin chart version
### Minimal Values File
**Principle:** Start with minimal configuration and only add what you need.
**Why:**
- Reduces complexity
- Fewer points of failure
- Easier to troubleshoot
- Clearer configuration intent
**What to include:**
- External service connections (database, cache, blob storage)
- Resource requests & limits
- Ingress configuration
- Required secrets
**What to avoid:**
- Configuration for services you're not using
- Over-optimization before baseline works
- Custom modifications without justification
### Render Before Install
**Why:**
- Validate template syntax
- Review generated manifests
- Catch configuration errors early
- Understand what will be deployed
**Command:**
```bash
helm template <release-name> <chart-path> -f <values-file> -n <namespace>
```
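
When the render step is scripted, it is useful to build the command as an argument list and to summarize the rendered manifests by kind. This is a rough sketch: `helm_template_command` mirrors the command above, and `count_kinds` relies on the assumption that top-level `kind:` fields start at column 0 in rendered output. Both names are illustrative.

```python
from collections import Counter


def helm_template_command(release, chart_path, values_file, namespace):
    """Build the argument list for the `helm template` command shown above."""
    return ["helm", "template", release, chart_path, "-f", values_file, "-n", namespace]


def count_kinds(rendered):
    """Count top-level `kind:` entries in rendered manifest text."""
    kinds = Counter()
    for line in rendered.splitlines():
        # Nested `kind:` fields (for example in roleRef) are indented and skipped.
        if line.startswith("kind:"):
            kinds[line.split(":", 1)[1].strip()] += 1
    return dict(kinds)
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True, check=True)` and its `stdout` fed to `count_kinds`. An unexpected kind, or a missing Ingress, is then visible before anything is installed.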
---
## Validation Checklist
See `notebooks/module-1/04_validate_ingress_and_ui.ipynb` for complete validation.
**Quick checklist:**
- [ ] All pods running and ready
- [ ] License key configured correctly
- [ ] PVCs bound (storage provisioned)
- [ ] External services accessible (PostgreSQL, Redis, blob storage)
- [ ] Ingress provisioned and accessible
- [ ] Endpoint reachable via HTTPS
- [ ] UI accessible in browser
- [ ] Basic functional test passes (optional)
---
## Artifacts Participants Leave With
1. **Working baseline deployment**
- LangSmith accessible via HTTPS
- All services healthy and connected
- Ingress configured correctly
2. **Pinned Terraform + Helm configuration**
- Terraform module versions documented
- Helm chart version documented
- Values file saved and version controlled
3. **Validated ingress endpoint**
- HTTPS URL accessible
- TLS certificate valid
- DNS configured correctly
4. **Readiness checklist**
- Validation results documented
- Baseline reference established
- Troubleshooting evidence collected
5. **Confidence they're on a supported path**
- Official repositories used
- Standard configuration applied
- Support can help troubleshoot
---
## Next Steps
1. **Run the validation notebook:**
- `notebooks/module-1/04_validate_ingress_and_ui.ipynb`
- Address any failures before proceeding
2. **Proceed to Module 2:**
- Configure authentication (OIDC/SAML)
- Set up role mapping
- Validate SSO flows
3. **Proceed to Module 3:**
- Configure production operations
- Set up autoscaling
- Establish observability
---
## References
- [Official Terraform Repository](https://github.com/langchain-ai/terraform)
- [Official Helm Repository](https://github.com/langchain-ai/helm)
- LangSmith Self-Hosted Documentation
- Cloud Provider Documentation (AWS EKS, Azure AKS)
---
## Troubleshooting
### Common Issues
**Terraform apply fails:**
- Check cloud provider credentials
- Verify IAM permissions
- Review Terraform plan for errors
- Check remote state backend configuration
**Helm install fails:**
- Verify chart path is correct
- Check values file syntax
- Validate secrets exist
- Review Helm template output
**Pods not starting:**
- Check pod logs: `kubectl logs <pod> -n <namespace>`
- Check events: `kubectl get events -n <namespace>`
- Verify resource requests/limits
- Check PVC binding status
**Ingress not accessible:**
- Verify ingress controller installed
- Check ingress class matches controller
- Verify DNS configuration
- Check TLS certificate validity
**External services not accessible:**
- Verify network connectivity (VPC/VNet)
- Check security group/NSG rules
- Validate connection strings
- Test connectivity from pod
For detailed troubleshooting, see the validation notebook and Module 3 operations guide.
+541
@@ -0,0 +1,541 @@
# Module 2: Identity & Authentication
**Duration:** ~2 hours
**Audience:** Operators deploying and managing LangSmith self-hosted
**Prerequisite:** Module 1 complete (working deployment with DNS/TLS/Ingress configured)
---
## Motivation
Most production LangSmith deployments require centralized identity management. Configuring SSO **before** onboarding users prevents:
- Manual user provisioning overhead
- Security gaps from shared credentials
- Compliance violations from unmanaged access
- Operational toil from authentication failures
This module ensures your authentication setup is **correct from day one**, not retrofitted after users are already in the system.
---
## Outcomes
By the end of this module, participants will:
- Understand LangSmith's authentication and authorization model
- Configure OIDC or SAML SSO with their identity provider
- Validate authentication flows end-to-end
- Map identity provider groups to LangSmith roles
- Troubleshoot common authentication failures
- Maintain authentication configuration as code
---
## What This Module Avoids
- **IdP admin tutorials:** We assume your IdP team provides required configuration values
- **SCIM deep-dive:** User provisioning via SCIM is out of scope
- **Multi-IdP scenarios:** We focus on single IdP configuration
- **Local auth production use:** Local authentication is discouraged for production deployments
---
## Supported Identity Models
### OIDC (Preferred)
- **When to use:** Modern IdPs (Okta, Azure AD (now Microsoft Entra ID), Google Workspace, Auth0)
- **Advantages:** Standard protocol, easier debugging, better error messages
- **Requirements:** OIDC-compliant IdP with client credentials
### SAML (Fallback)
- **When to use:** Legacy IdPs or enterprise requirements
- **Advantages:** Widely supported, enterprise-standard
- **Requirements:** SAML 2.0 IdP with metadata endpoint or XML file
### Local Authentication (Discouraged)
- **When to use:** Development/testing only
- **Limitations:** No centralized management, manual user creation, security risk
- **Note:** This module does not cover local auth configuration
---
## Authentication Request Flow
```
┌─────────┐          ┌──────────────┐          ┌─────────────┐
│ Browser │          │  LangSmith   │          │  Identity   │
│         │          │  (Ingress)   │          │  Provider   │
└────┬────┘          └──────┬───────┘          └──────┬──────┘
     │                      │                         │
     │ 1. GET /login        │                         │
     ├─────────────────────>│                         │
     │                      │                         │
     │ 2. Redirect to IdP   │                         │
     │    (with state)      │                         │
     │<─────────────────────┤                         │
     │                      │                         │
     │ 3. GET /authorize    │                         │
     ├───────────────────────────────────────────────>│
     │                      │                         │
     │ 4. User authenticates                          │
     │    (IdP UI)                                    │
     │                      │                         │
     │ 5. Callback with code/token                    │
     │<───────────────────────────────────────────────┤
     │                      │                         │
     │ 6. POST /callback    │                         │
     ├─────────────────────>│                         │
     │                      │                         │
     │ 7. Exchange code for token                     │
     │                      ├────────────────────────>│
     │                      │<────────────────────────┤
     │                      │                         │
     │ 8. Validate token & extract claims             │
     │                      │                         │
     │ 9. Create/update user session                  │
     │                      │                         │
     │ 10. Redirect to dashboard                      │
     │<─────────────────────┤                         │
     │                      │                         │
```
**Key Points:**
- Redirect URI must match **exactly** (protocol, domain, path, trailing slashes)
- State parameter prevents CSRF attacks
- Token validation includes signature, expiration, and issuer verification
- Claims mapping determines user roles and workspace access
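
Steps 2 and 6 of the flow can be sketched with the standard library: generate an unguessable `state`, build the authorization URL, and compare both `state` and the redirect URI exactly on the way back. This is a generic OAuth 2.0 authorization-code sketch, not LangSmith's implementation; the endpoint and function names are assumptions.

```python
import hmac
import secrets
from urllib.parse import urlencode


def new_state():
    """Generate an unguessable state value to bind the callback to this login."""
    return secrets.token_urlsafe(32)


def build_authorize_url(authorize_endpoint, client_id, redirect_uri, state,
                        scopes=("openid", "email")):
    """Build the redirect sent to the browser in step 2."""
    query = urlencode(
        {
            "response_type": "code",
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            "scope": " ".join(scopes),
            "state": state,
        }
    )
    return f"{authorize_endpoint}?{query}"


def state_matches(expected, received):
    """Constant-time comparison of the stored and returned state values."""
    return hmac.compare_digest(expected.encode(), received.encode())


def redirect_uri_matches(registered, presented):
    """Exact string match: a trailing slash or different case is a mismatch."""
    return registered == presented
```

The exact-match comparison is deliberately strict, which is why `https://host/auth/callback` and `https://host/auth/callback/` are treated as different URIs.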
---
## Workshop Flow
### 1. LangSmith Authentication Model
**Authentication vs Authorization:**
- **Authentication (AuthN):** "Who are you?" - Verified by IdP
- **Authorization (AuthZ):** "What can you do?" - Determined by role mapping
**Roles:**
- **Admin:** Full system access, workspace management, user management
- **Member:** Workspace access, project creation, trace viewing
- **Viewer:** Read-only access to assigned workspaces
**Workspaces & Organizations:**
- Users belong to **organizations** (top-level container)
- Users access **workspaces** within organizations
- Role mapping determines which workspaces a user can access
- **No shared admin accounts** - each user authenticates individually
**Key Principle:** Authentication is centralized (IdP), authorization is application-level (LangSmith role mapping).
---
### 2. Choosing OIDC vs SAML
**Decision Rule:**
```
IF IdP supports OIDC AND you can configure OIDC client
    → Use OIDC (preferred)
ELSE IF IdP only supports SAML OR enterprise requires SAML
    → Use SAML (fallback)
ELSE
    → Re-evaluate IdP choice
```
**OIDC Advantages:**
- Better error messages
- Easier debugging (standard endpoints)
- Modern protocol with better security defaults
- Simpler configuration
**SAML Advantages:**
- Enterprise-standard
- Widely supported
- Mature protocol
**Recommendation:** Start with OIDC unless blocked by IdP limitations or policy.
---
### 3. Configuring OIDC
#### Required IdP Inputs
Your IdP team must provide:
1. **Issuer URL** (e.g., `https://your-org.okta.com/oauth2/default`)
   - Must be HTTPS
   - Must be reachable from LangSmith pods
   - Used for discovery and token validation
2. **Client ID**
   - OAuth2 client identifier
   - Public value (safe to log)
3. **Client Secret**
   - OAuth2 client secret
   - **Never log or print**
   - Store in Kubernetes secret
4. **Redirect URI**
   - **Exact format:** `https://your-langsmith-domain.com/auth/callback`
   - Must match **exactly** (case-sensitive, no trailing slash unless specified)
   - IdP team must whitelist this URI
5. **Required Claims**
   - `email` (required): User email address
   - `name` (optional): Display name
   - `groups` (optional): Group membership for role mapping
6. **Scopes**
   - `openid` (required)
   - `email` (required)
   - `profile` (optional)
   - `groups` (optional, if using group-based role mapping)
#### Helm/Environment Configuration
**Helm Values (recommended):**
```yaml
auth:
  provider: oidc
  oidc:
    issuer: "https://your-org.okta.com/oauth2/default"
    clientId: "your-client-id"
    clientSecret:
      secretName: langsmith-oidc-secret
      secretKey: client-secret
    redirectURI: "https://your-langsmith-domain.com/auth/callback"
    scopes:
      - openid
      - email
      - profile
      - groups
    claimMapping:
      email: email
      name: name
      groups: groups
```
**Environment Variables (alternative):**
```bash
AUTH_PROVIDER=oidc
OIDC_ISSUER=https://your-org.okta.com/oauth2/default
OIDC_CLIENT_ID=your-client-id
OIDC_CLIENT_SECRET=<from-secret>
OIDC_REDIRECT_URI=https://your-langsmith-domain.com/auth/callback
OIDC_SCOPES=openid,email,profile,groups
```
#### Redirect URI Exactness
**Critical:** The redirect URI must match **exactly** between:
- LangSmith configuration
- IdP whitelist
- Actual callback URL
**Common Mistakes:**
- Trailing slash mismatch: `/auth/callback` vs `/auth/callback/`
- Protocol mismatch: `http://` vs `https://`
- Domain mismatch: `langsmith.example.com` vs `www.langsmith.example.com`
- Port mismatch: `:443` vs no port
**Validation:** Use the validation notebook to verify exact match.
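
As a rough illustration of what an exact-match check involves, the sketch below compares two redirect URIs component by component and names what differs. It uses only the Python standard library; `redirect_uri_mismatches` is a hypothetical helper and is not taken from the validation notebook.

```python
from urllib.parse import urlsplit


def redirect_uri_mismatches(configured: str, registered: str) -> list[str]:
    """Return the URI components that differ between two redirect URIs."""
    a, b = urlsplit(configured), urlsplit(registered)
    mismatches = []
    if a.scheme != b.scheme:
        mismatches.append("protocol")
    if a.hostname != b.hostname:
        mismatches.append("domain")
    if a.port != b.port:
        mismatches.append("port")
    if a.path != b.path:
        mismatches.append("path")
    # Catch differences the component checks miss (query string, letter case).
    if not mismatches and configured != registered:
        mismatches.append("other")
    return mismatches
```

For example, comparing `https://langsmith.example.com/auth/callback` with the same URI plus a trailing slash reports a `path` mismatch, and adding an explicit `:443` reports a `port` mismatch.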
#### TLS Requirements
- IdP issuer URL must be HTTPS
- LangSmith domain must have valid TLS certificate
- Certificate must be trusted by browser (not self-signed for production)
- Certificate must cover the LangSmith domain exactly; a wildcard certificate such as `*.example.com` matches only one subdomain level
#### Clock Skew
- LangSmith and IdP clocks must be synchronized
- Maximum allowed skew: typically 5 minutes
- Use NTP on Kubernetes nodes
- Validate with: `kubectl exec <pod> -- date` vs IdP server time
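
Once both timestamps are in hand, the comparison itself is simple. The sketch below assumes a 5-minute tolerance (the typical value mentioned above; confirm the actual limit with your IdP) and timezone-aware UTC datetimes. `skew_within_tolerance` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # assumed tolerance; confirm with your IdP


def skew_within_tolerance(pod_time: datetime, idp_time: datetime,
                          max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(pod_time - idp_time) <= max_skew


# Example: a pod clock four minutes behind the IdP is still within tolerance.
pod = datetime(2026, 1, 6, 12, 0, tzinfo=timezone.utc)
print(skew_within_tolerance(pod, pod + timedelta(minutes=4)))
```

Both arguments must carry timezone information; mixing naive and aware datetimes raises a `TypeError`.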
---
### 4. Role Mapping
**Principle:** Map IdP groups to LangSmith roles, not individual users.
#### Group-Based Mapping (Recommended)
```yaml
auth:
  roleMapping:
    groups:
      - group: "langsmith-admins"
        role: "admin"
      - group: "langsmith-members"
        role: "member"
      - group: "langsmith-viewers"
        role: "viewer"
```
**Benefits:**
- Centralized management in IdP
- Easier audit trail
- Scales to large organizations
#### User-Based Mapping (Fallback)
```yaml
auth:
  roleMapping:
    users:
      - email: "admin@example.com"
        role: "admin"
```
**Use only when:**
- Group claims unavailable
- Temporary workaround
- Small team (< 10 users)
#### Minimal Admins Principle
- **Start with 1-2 admins**
- Add admins only when necessary
- Use group-based mapping for admins
- Document admin assignments
#### Mapping Claims to Roles
**Claim Structure:**
```json
{
  "email": "user@example.com",
  "name": "John Doe",
  "groups": ["langsmith-members", "engineering"]
}
```
**Mapping Logic:**
1. Extract `groups` claim
2. Match against role mapping configuration
3. Assign highest privilege role found
4. Default to "member" if no match
**Validation:** Test with users in different groups to verify mapping.
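
The four mapping steps can be sketched in Python as shown below. This illustrates the logic described in this section only; the privilege ranking in `ROLE_RANK` and the function name `resolve_role` are assumptions, and the group names mirror the example Helm values.

```python
# Assumed privilege order, lowest to highest, using the roles from this module.
ROLE_RANK = {"viewer": 0, "member": 1, "admin": 2}

# Example group-to-role mapping mirroring the Helm values in this section.
GROUP_ROLES = {
    "langsmith-admins": "admin",
    "langsmith-members": "member",
    "langsmith-viewers": "viewer",
}


def resolve_role(claims: dict, group_roles: dict = GROUP_ROLES,
                 default: str = "member") -> str:
    """Apply the mapping steps to a decoded claims dictionary."""
    groups = claims.get("groups") or []  # 1. extract the groups claim
    matched = [group_roles[g] for g in groups if g in group_roles]  # 2. match
    if not matched:
        return default  # 4. fall back when nothing matches
    return max(matched, key=ROLE_RANK.__getitem__)  # 3. highest privilege wins
```

With the claim structure shown above, `resolve_role` returns `"member"`, because `engineering` is not a mapped group and `langsmith-members` is.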
---
### 5. SAML Configuration
#### Required Metadata
Your IdP team must provide:
1. **SAML Metadata URL** (preferred)
   - HTTPS endpoint serving XML metadata
   - Must be reachable from LangSmith pods
   - Auto-refreshes configuration
2. **SAML Metadata XML** (fallback)
   - Static XML file
   - Must be updated manually when IdP changes
   - Store in Kubernetes secret or ConfigMap
#### Expected Attributes
**Required:**
- `email` or `http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress`
- `name` or `http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name`
**Optional (for role mapping):**
- `groups` or `http://schemas.microsoft.com/ws/2008/06/identity/claims/groups`
- Custom attribute names (must match exactly)
#### Common Failures
1. **Missing Attributes**
   - Symptom: User authenticates but has no email/name
   - Cause: IdP not sending required attributes
   - Fix: Configure IdP to send required attributes
2. **Attribute Name Mismatch**
   - Symptom: Claims not mapped correctly
   - Cause: LangSmith expects different attribute name
   - Fix: Update attribute mapping in Helm values
3. **Signature Validation Failure**
   - Symptom: Authentication fails with "invalid signature"
   - Cause: Certificate mismatch or expired certificate
   - Fix: Update IdP certificate in metadata
4. **Assertion Expired**
   - Symptom: Authentication times out
   - Cause: Clock skew or assertion validity window too short
   - Fix: Synchronize clocks, adjust validity window
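
The first two failures (missing attributes and attribute name mismatch) can be caught early by normalizing the attribute names an IdP sends and checking for the required ones. The sketch below assumes the attributes have already been parsed from the assertion into a dictionary; `normalize_attributes` and `missing_required` are hypothetical helpers.

```python
# URI-style attribute names from "Expected Attributes", mapped to short names.
ATTRIBUTE_ALIASES = {
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress": "email",
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name": "name",
    "http://schemas.microsoft.com/ws/2008/06/identity/claims/groups": "groups",
}
REQUIRED = ("email", "name")


def normalize_attributes(raw: dict) -> dict:
    """Rename URI-style attribute names to their short equivalents."""
    return {ATTRIBUTE_ALIASES.get(key, key): value for key, value in raw.items()}


def missing_required(raw: dict) -> list[str]:
    """Return required attributes that are absent or empty after normalization."""
    attrs = normalize_attributes(raw)
    return [name for name in REQUIRED if not attrs.get(name)]
```

An empty result from `missing_required` means both required attributes arrived under a recognized name.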
---
### 6. Validation & Failure Drills
#### Validation Checklist
See `docs/shared/auth_validation_checklist.md` for complete checklist.
**Quick Validation:**
1. ✅ Ingress/TLS configured correctly
2. ✅ Redirect URI matches exactly
3. ✅ IdP issuer reachable
4. ✅ Client credentials valid
5. ✅ Role mapping configured
6. ✅ Login flow works end-to-end
7. ✅ Logout works
8. ✅ Session invalidation works
#### Failure Drills
**Purpose:** Understand failure modes and recovery procedures.
**Drill 1: Redirect URI Mismatch**
- **Change:** Modify redirect URI in Helm values (add trailing slash)
- **Observe:** Login redirect fails
- **Recover:** Revert change, restart pods
- **Validate:** Login works again
**Drill 2: Missing Claim**
- **Change:** Remove `groups` claim from IdP configuration
- **Observe:** Users authenticate but have no role
- **Recover:** Restore `groups` claim
- **Validate:** Role mapping works again
**Drill 3: Incorrect Secret Rotation**
- **Change:** Update client secret in IdP but not in LangSmith
- **Observe:** Authentication fails with "invalid client"
- **Recover:** Update Kubernetes secret, restart pods
- **Validate:** Authentication works again
**Note:** These drills are **optional** and should only be run in non-production environments.
---
## Common Pitfalls
### Login Loop
**Symptom:** User redirected to IdP, then back to LangSmith, then to IdP again (infinite loop)
**Causes:**
- Redirect URI mismatch
- Session cookie not set (TLS/cookie issues)
- Token validation failure
**Fix:** Check redirect URI exactness, verify TLS certificate, check token validation logs
### No Data After Login
**Symptom:** User authenticates successfully but sees empty workspace
**Causes:**
- Role mapping not configured
- User not in any mapped groups
- Workspace not assigned to user's organization
**Fix:** Verify role mapping configuration, check user's group membership, verify workspace assignment
### TLS Callback Issues
**Symptom:** IdP callback fails with TLS errors
**Causes:**
- Self-signed certificate on LangSmith domain
- Certificate chain incomplete
- Certificate expired
**Fix:** Use valid TLS certificate from trusted CA, ensure full chain is present
### Multiple IdPs
**Symptom:** Confusion about which IdP to use
**Causes:**
- Multiple IdP configurations present
- Configuration precedence unclear
**Fix:** Use single IdP configuration, remove unused configurations
---
## Security & Compliance Callouts
### Least Privilege
- Start with minimal admin access
- Use group-based role mapping
- Regular access reviews
- Document all admin assignments
### Auditability
- All authentication events logged
- Role changes tracked
- Session creation/destruction logged
- Export logs to SIEM for compliance
### Centralized Identity Governance
- Manage users in IdP, not LangSmith
- Use IdP groups for access control
- Regular access reviews in IdP
- Deprovision users in IdP when they leave
---
## Artifacts Participants Leave With
1. **SSO Configuration**
- Helm values file with auth configuration
- Kubernetes secrets for client credentials
- Documentation of IdP settings
2. **IdP Settings Document**
- Redirect URI whitelisted
- Required claims configured
- Scopes configured
- Group structure documented
3. **Mapping Reference**
- Group-to-role mapping table
- Admin assignments documented
- Workspace access rules
4. **Validation Checklist**
- Completed validation checklist
- Test results for admin and standard user
- Logout/session invalidation verified
5. **Debugging Playbook**
- Troubleshooting guide reference
- Log locations documented
- Support bundle procedure
---
## Next Steps
1. **Run the validation notebook:**
- `notebooks/module-2/01_sso_oidc_validation.ipynb` (OIDC)
- `notebooks/module-2/02_sso_saml_validation.ipynb` (SAML)
2. **Complete the validation checklist:**
- `docs/shared/auth_validation_checklist.md`
3. **Review troubleshooting guide:**
- `docs/shared/auth_troubleshooting.md`
4. **Proceed to Module 3** (if applicable)
---
## References
- [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)
- [SAML 2.0 Specification](http://docs.oasis-open.org/security/saml/v2.0/)
- LangSmith Helm Chart Documentation
- Your IdP's OIDC/SAML documentation
+679
@@ -0,0 +1,679 @@
# Module 3: Production Operations & Scaling
**Goal:** Enable operators to run LangSmith reliably under real production load, understand scaling domains, and respond effectively when things go wrong (day-2 operations).
**Duration:** ~2 hours
**Audience:** Platform engineers, infrastructure teams, SREs, and on-call operators
**Prerequisites:**
- Module 1 complete: LangSmith deployed and reachable (AWS/EKS or Azure/AKS baseline)
- Module 2 complete: Authentication and authorization configured (OIDC/SAML)
---
## Overview
Module 3 transitions from "it works" to "it works reliably under load." This module covers production operations, scaling strategies, observability, and the mental models needed for day-2 operations.
**What you'll accomplish:**
- Understand LangSmith's distributed architecture and scaling domains
- Configure production-grade service sizing and HA
- Implement autoscaling strategies (HPA and KEDA)
- Set up observability and early warning signals
- Validate production readiness
- Prepare for incident response
**What this module avoids:**
- Deep dives into specific monitoring tools (Prometheus/Grafana setup)
- Custom alerting rule creation (covered in incident response)
- Performance tuning and optimization (out of scope)
- Multi-region deployments (advanced topic)
---
## Section 1: Production Mental Model
### Distributed System Reality
LangSmith is a **distributed system** with multiple services that must coordinate:
- **API Server:** Handles HTTP requests, authentication, routing
- **Workers:** Process traces, spans, and evaluations asynchronously
- **ClickHouse:** Time-series data storage and queries
- **PostgreSQL:** Metadata, users, workspaces, projects
- **Redis:** Caching, rate limiting, job queues
- **Blob Storage:** Large payload storage (traces, artifacts)
**Key insight:** These services have different scaling characteristics and failure modes. Understanding these differences is critical for production operations.
### Scaling Domains
**Scaling domains** are groups of resources that scale together or have shared bottlenecks:
1. **Ingestion Domain:**
   - API server pods (stateless, horizontal scaling)
   - Ingress/Load Balancer (cloud-managed, scales automatically)
   - **Bottleneck:** API server CPU/memory under high request volume
2. **Processing Domain:**
   - Worker pods (stateless, horizontal scaling)
   - Redis (single instance or cluster, vertical scaling)
   - **Bottleneck:** Worker capacity and Redis throughput
3. **Storage Domain:**
   - ClickHouse (stateful, complex scaling)
   - PostgreSQL (stateful, vertical scaling + read replicas)
   - Blob Storage (cloud-managed, effectively unlimited)
   - **Bottleneck:** ClickHouse query performance, PostgreSQL connection limits
4. **Control Plane:**
   - Kubernetes cluster (managed service)
   - Helm releases, ConfigMaps, Secrets
   - **Bottleneck:** Cluster capacity and node resources
**Critical understanding:** Scaling one domain without addressing downstream bottlenecks creates cascading failures.
---
## Section 2: Scaling Model
### What Scales Well
**Horizontal scaling (add more pods):**
- API server pods (stateless HTTP handlers)
- Worker pods (stateless job processors)
- Ingress controllers (cloud-managed load balancers)
**Why:** These services are stateless and can be scaled independently based on load.
### What Does NOT Autoscale
**Vertical scaling only (increase resources per instance):**
- PostgreSQL (managed RDS/Azure Database)
- Redis (managed ElastiCache/Azure Cache)
- ClickHouse (in-cluster or managed, complex scaling)
**Why:** These are stateful services with data locality requirements. Scaling requires careful planning and may involve downtime.
**Manual scaling required:**
- Kubernetes node capacity (cluster autoscaling helps, but has limits)
- Blob storage buckets (unlimited capacity, but requires configuration)
- Network bandwidth (cloud-managed, but has limits)
### Failure Pattern: HPA Increases Ingestion → Downstream Saturation
**Common anti-pattern:**
1. High request volume triggers HPA to scale API server pods
2. API servers successfully handle more requests
3. Workers cannot keep up with increased trace volume
4. Redis queue fills up
5. ClickHouse ingestion rate saturates
6. PostgreSQL connection pool exhausts
7. System degrades despite "scaled" API servers
**Solution:** Scale all domains together, or implement backpressure and rate limiting.
**Key principle:** Monitor downstream services, not just upstream services.
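
A simplified way to reason about this pattern is to treat the pipeline as a chain whose throughput is capped by its slowest stage. The sketch below is a toy model under that assumption; the stage names and rates are made-up illustrative numbers, not measured LangSmith capacities.

```python
def bottleneck(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the stage with the lowest sustainable throughput (items/second)."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]


def backlog_growth(ingest_rate: float, capacities: dict[str, float]) -> float:
    """Items/second accumulating in queues when ingestion exceeds the bottleneck."""
    _, limit = bottleneck(capacities)
    return max(0.0, ingest_rate - limit)


# Illustrative numbers only: API servers scaled up, workers did not.
capacities = {"api": 5000.0, "workers": 1200.0, "clickhouse": 2000.0}
print(bottleneck(capacities))
print(backlog_growth(3000.0, capacities))
```

In this toy example, scaling the API tier further changes nothing: the backlog keeps growing until worker capacity rises or ingestion is throttled.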
---
## Section 3: Service Sizing Baselines
### PostgreSQL (Database)
**Production baseline:**
- **Instance size:** db.r5.xlarge (4 vCPU, 32 GB RAM) minimum
- **Storage:** 500 GB+ with autoscaling enabled
- **High availability:** Multi-AZ deployment (RDS) or zone-redundant high availability (Azure)
- **Connection pool:** 100+ connections configured in LangSmith
- **Backups:** Automated daily backups with 7-day retention minimum
**Non-production guidance:**
- db.t3.medium (2 vCPU, 4 GB RAM) acceptable for development
- Single-AZ acceptable for non-production
- Backup retention shorter than the 7-day production minimum is sufficient
**Verification:**
```bash
# AWS RDS
aws rds describe-db-instances --db-instance-identifier <instance-id>
# Azure Database
az postgres flexible-server show --name <server-name> --resource-group <rg>
```
**What to check:**
- Instance class/size
- Multi-AZ status
- Storage autoscaling enabled
- Backup retention period
- Private networking (VPC/subnet configuration)
### Redis (Cache)
**Production baseline:**
- **Instance type:** cache.r6g.xlarge (4 vCPU, 26.32 GiB RAM) minimum
- **High availability:** Redis Cluster mode enabled (3+ nodes)
- **Memory:** 50% headroom for growth
- **Persistence:** AOF (Append Only File) enabled for durability
**Non-production guidance:**
- cache.t3.micro acceptable for development
- Single node acceptable for non-production
- RDB snapshots sufficient (no AOF required)
**Verification:**
```bash
# AWS ElastiCache
aws elasticache describe-cache-clusters --cache-cluster-id <cluster-id>
# Azure Cache
az redis show --name <cache-name> --resource-group <rg>
```
**What to check:**
- Node type and memory size
- Cluster mode enabled (production)
- AOF persistence enabled
- Private networking
### ClickHouse
**Production baseline:**
- **Deployment:** Managed ClickHouse (ClickHouse Cloud) OR in-cluster with EBS CSI
- **In-cluster sizing:** 3-node cluster minimum (for HA)
- **Resources per node:** 8 CPU, 32 GB RAM, 1 TB storage
- **Storage:** EBS gp3 volumes with 3000 IOPS
- **Replication:** 2x replication factor (6 total pods for 3-node cluster)
**Non-production guidance:**
- Single node acceptable for development
- 4 CPU, 16 GB RAM per node sufficient
- 100 GB storage per node
**Verification:**
```bash
# In-cluster ClickHouse
kubectl get statefulset -n <namespace> | grep clickhouse
kubectl get pvc -n <namespace> | grep clickhouse
# Check ClickHouse cluster status
kubectl exec -it <clickhouse-pod> -n <namespace> -- clickhouse-client --query "SELECT * FROM system.clusters"
```
**What to check:**
- StatefulSet replica count
- PVC size and storage class
- Resource requests/limits
- Replication factor
### Managed vs In-Cluster
**Managed services (recommended for production):**
- PostgreSQL: RDS (AWS) or Azure Database for PostgreSQL
- Redis: ElastiCache (AWS) or Azure Cache for Redis
- ClickHouse: ClickHouse Cloud (managed service)
**Benefits:**
- Automated backups and maintenance
- High availability built-in
- Security patches applied automatically
- Monitoring and alerting included
**In-cluster services (acceptable for non-production):**
- PostgreSQL: Postgres operator (Crunchy Data, Zalando)
- Redis: Redis operator or Helm chart
- ClickHouse: ClickHouse operator
**Trade-offs:**
- More operational overhead
- Requires backup strategy
- Manual HA configuration
- Lower cost for development
### Private Networking
**Production requirement:** All data stores must be in private subnets with no public internet access.
**Why:**
- Security: Reduces attack surface
- Compliance: Required for many compliance frameworks
- Performance: Lower latency within VPC/VNet
**Verification:**
- RDS/Azure Database: Check subnet group (private subnets only)
- ElastiCache/Azure Cache: Check subnet group (private subnets only)
- ClickHouse: Check pod network policies and service mesh egress rules
---
## Section 4: Blob Storage REQUIRED for Production
### Why Blob Storage is Required
**Problem without blob storage:**
- Large trace payloads stored inline in ClickHouse
- ClickHouse table size explodes
- Query performance degrades
- Storage costs increase dramatically
- System becomes unusable under load
**Solution with blob storage:**
- Large payloads stored in S3/Azure Blob Storage
- ClickHouse stores only references (small strings)
- Query performance remains stable
- Storage costs scale linearly
- System handles production load
### Requirements
**Production:**
- **Service:** S3 (AWS) or Azure Blob Storage (Azure)
- **Bucket/Container:** Dedicated bucket for LangSmith
- **Access:** IAM roles (AWS) or Managed Identity (Azure) - no access keys
- **Lifecycle policies:** Configured for cost optimization (move to Glacier/Cool tier after 90 days)
- **Versioning:** Enabled for data protection
- **Encryption:** Server-side encryption enabled
**Non-production:**
- Local MinIO or in-cluster object storage acceptable
- Access keys acceptable (not for production)
- No lifecycle policies required
### Verification
**Check Helm values:**
```yaml
blobStorage:
  provider: s3 # or azure
  bucket: langsmith-traces
  region: us-west-2
  # IAM role ARN (not access keys)
  iamRoleArn: arn:aws:iam::<account>:role/langsmith-blob-storage
```
**Check environment variables:**
```bash
kubectl exec <api-pod> -n <namespace> -- env | grep -i blob
kubectl exec <api-pod> -n <namespace> -- env | grep -i s3
```
**What to verify:**
- Blob storage provider configured (not "local" or "filesystem")
- Bucket/container name present
- IAM role or managed identity configured (no access keys)
- Blob storage health check passes (see ops sanity checks notebook)
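
The first three checks can be automated against a parsed Helm values dictionary. The sketch below assumes the `blobStorage` layout shown in the example above; the `accessKey` and `secretKey` key names are hypothetical stand-ins for however your chart expresses static credentials.

```python
def blob_storage_problems(values: dict) -> list[str]:
    """Return human-readable problems found in the blobStorage section."""
    blob = values.get("blobStorage") or {}
    problems = []
    if blob.get("provider") in (None, "", "local", "filesystem"):
        problems.append("provider is not set to a cloud blob store")
    if not blob.get("bucket"):
        problems.append("bucket/container name is missing")
    # Hypothetical key names for static credentials; adjust to your chart.
    if blob.get("accessKey") or blob.get("secretKey"):
        problems.append("static access keys are configured")
    return problems
```

An empty list means the values pass these three checks; it does not replace the health check in the ops sanity checks notebook.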
---
## Section 5: Autoscaling Strategy
### HPA (Horizontal Pod Autoscaler) for API Servers
**Use case:** Scale API server pods based on CPU/memory utilization.
**Configuration:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langsmith-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langsmith-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
**Baseline:**
- **Min replicas:** 2 (for HA)
- **Max replicas:** 10 (adjust based on cluster capacity)
- **CPU target:** 70% average utilization
- **Memory target:** 80% average utilization
**Verification:**
```bash
kubectl get hpa -n <namespace>
kubectl describe hpa langsmith-api -n <namespace>
```
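
To interpret the replica counts reported by `kubectl describe hpa`, it helps to know the scaling rule documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule to a single metric using the baseline values from this section; it omits stabilization windows and scaling policies.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int = 2,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for one utilization metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Four pods at 90% CPU against a 70% target.
print(desired_replicas(4, 90, 70))
```

When several metrics are configured, as with CPU and memory above, the HPA evaluates each one and uses the largest result.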
### KEDA for Bursty Worker Scaling
**Why KEDA instead of HPA:**
- Workers process jobs from Redis queues
- Queue depth is a better scaling signal than CPU/memory
- Bursty workloads need rapid scaling (seconds, not minutes)
- KEDA supports Redis queue depth metrics
**Configuration:**
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: langsmith-workers
spec:
  scaleTargetRef:
    name: langsmith-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: <redis-host>:6379
        listName: langsmith:jobs:traces
        listLength: "10" # Scale up when queue depth > 10
```
**Baseline:**
- **Min replicas:** 1
- **Max replicas:** 20 (adjust based on workload)
- **Queue depth threshold:** 10 jobs (adjust based on processing time)
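- **Cooldown period:** 30 seconds (the configuration above does not set `cooldownPeriod`, so the KEDA default of 300 seconds applies unless you add it)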
**Verification:**
```bash
kubectl get scaledobject -n <namespace>
kubectl describe scaledobject langsmith-workers -n <namespace>
```
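
For capacity planning, the effect of `listLength` can be approximated as a per-replica target: the queue length divided by `listLength`, rounded up and clamped to the replica bounds. The sketch below uses the baseline values from this section and ignores tolerance and stabilization behavior, so treat it as an estimate rather than KEDA's exact output.

```python
import math


def worker_replicas(queue_length: int, list_length: int = 10,
                    min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Estimate worker replicas for a given Redis list length."""
    wanted = math.ceil(queue_length / list_length)
    return max(min_replicas, min(max_replicas, wanted))


# A burst of 45 queued jobs with listLength=10.
print(worker_replicas(45))
```

Any sustained queue depth above `max_replicas × list_length` (200 jobs with these values) cannot be absorbed by autoscaling alone.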
### What Does NOT Autoscale
**Manual scaling required:**
- PostgreSQL instance size (vertical scaling only)
- Redis cluster size (add nodes manually)
- ClickHouse nodes (StatefulSet scaling requires data rebalancing)
- Kubernetes nodes (cluster autoscaler helps, but has limits)
**Key principle:** Monitor these services and scale proactively based on capacity planning, not reactively based on alerts.
---
## Section 6: Observability & Early Warning Signals
### Three Layers of Observability
**1. Kubernetes Layer:**
- Pod status, restarts, resource usage
- Node capacity and utilization
- Events and warnings
- **Tools:** `kubectl`, `kubectl top`, cluster monitoring
**2. LangSmith Application Layer:**
- Request rates, latencies, error rates
- Trace ingestion rates
- Worker queue depths
- **Tools:** Application metrics, logs, dashboards
**3. Data Store Layer:**
- PostgreSQL connection counts, query performance
- Redis memory usage, hit rates
- ClickHouse query performance, table sizes
- **Tools:** Cloud provider monitoring, database metrics
### Early Warning Signals
See `docs/shared/ops_signals_and_thresholds.md` for complete signal catalog.
**Critical signals (red flags):**
- Pod restart count > 5 in 1 hour
- Pending pods > 0 for > 5 minutes
- API server CPU > 80% for > 10 minutes
- Worker queue depth > 100
- PostgreSQL connections > 80% of max
- Redis memory > 90%
- ClickHouse query latency > 5 seconds (p95)
**Warning signals (yellow flags):**
- Pod restart count > 2 in 1 hour
- API server CPU > 70% for > 10 minutes
- Worker queue depth > 50
- PostgreSQL connections > 60% of max
- Redis memory > 75%
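
The warning and critical thresholds above translate directly into a small classifier. The sketch below covers only the signals that appear in both lists; the signal key names are hypothetical and should be mapped to whatever your metrics source exposes.

```python
# (warning, critical) thresholds copied from the lists in this section.
THRESHOLDS = {
    "pod_restarts_per_hour": (2, 5),
    "worker_queue_depth": (50, 100),
    "postgres_connection_pct": (60, 80),
    "redis_memory_pct": (75, 90),
}


def classify(signal: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single signal reading."""
    warning, critical = THRESHOLDS[signal]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Duration-based signals, such as API server CPU sustained for more than 10 minutes, need a time window and are left out of this sketch.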
### Red Flag Thresholds
**Immediate action required:**
- Any pod in `CrashLoopBackOff` state
- Any pod `Pending` for > 10 minutes
- API server error rate > 5%
- Worker queue depth > 200
- PostgreSQL connection pool exhausted
- Redis out of memory
- ClickHouse query timeout > 10 seconds
**Escalation evidence:**
- Pod logs (last 100 lines)
- Recent events (`kubectl get events --sort-by=.lastTimestamp`)
- Resource usage (`kubectl top pods`)
- Application metrics snapshot
- Database connection counts
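
The Kubernetes portion of this evidence can be gathered with a fixed set of commands. The sketch below only builds the argument lists, using `kubectl` subcommands already shown in this module; `evidence_commands` is a hypothetical helper, and running the commands is left to the caller.

```python
def evidence_commands(namespace: str, pods: list[str]) -> list[list[str]]:
    """Build kubectl commands for events, resource usage, and recent pod logs."""
    commands = [
        ["kubectl", "get", "events", "--sort-by=.lastTimestamp", "-n", namespace],
        ["kubectl", "top", "pods", "-n", namespace],
    ]
    for pod in pods:
        # Last 100 log lines per pod, matching the escalation list above.
        commands.append(["kubectl", "logs", pod, "--tail=100", "-n", namespace])
    return commands
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the output written to a timestamped file before any restart or redeploy.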
---
## Section 7: Backups, DR, and Failure Domains
### What Backups Cover
**PostgreSQL backups (managed services):**
- Automated daily backups (RDS/Azure Database)
- Point-in-time recovery (PITR) for last 7 days
- Cross-region backup replication (if configured)
- **Covers:** Database schema, user data, workspace/project metadata
**ClickHouse backups:**
- Manual backups via `clickhouse-backup` tool
- Cloud storage snapshots (if using managed ClickHouse)
- **Covers:** Trace data, span data, evaluation results
**Blob storage:**
- Versioning enabled (S3/Azure Blob)
- Lifecycle policies for cost optimization
- Cross-region replication (if configured)
- **Covers:** Large trace payloads, artifacts, files
### What Backups Do NOT Cover
**Not backed up automatically:**
- Kubernetes secrets (stored in cluster, not in backups)
- Helm values (stored in Git, not in backups)
- In-cluster ClickHouse data (unless backup job configured)
- Redis data (ephemeral cache, not backed up)
- Application configuration (ConfigMaps, stored in cluster)
**Manual backup required:**
- Kubernetes secrets (export to encrypted storage)
- Helm values files (store in Git)
- In-cluster ClickHouse (configure backup job)
- Application logs (export to log aggregation service)
### Failure Domains
**Availability Zone (AZ) failures:**
- **Impact:** Pods in one AZ unavailable
- **Mitigation:** Multi-AZ deployment (pods spread across AZs)
- **Recovery:** Kubernetes reschedules pods to healthy AZs
**Node failures:**
- **Impact:** All pods on failed node unavailable
- **Mitigation:** Multiple nodes, pod anti-affinity rules
- **Recovery:** Kubernetes reschedules pods to healthy nodes
**Database failures:**
- **Impact:** Application cannot read/write data
- **Mitigation:** Multi-AZ RDS, automated failover
- **Recovery:** RDS promotes standby to primary (typically 1-2 minutes)
**Region failures:**
- **Impact:** Entire deployment unavailable
- **Mitigation:** Multi-region deployment (advanced, out of scope)
- **Recovery:** Manual failover to secondary region
**Reality check:** Most failures are AZ or node-level. Region failures are rare but catastrophic. Plan accordingly.
---
## Section 8: Production Readiness Checklist
See `docs/shared/production_readiness_checklist.md` for complete checklist.
**Each checklist item maps to real incidents:**
1. **Blob storage configured** → Prevents ClickHouse table explosion
2. **PostgreSQL HA enabled** → Prevents database downtime
3. **Redis cluster mode** → Prevents cache failures
4. **ClickHouse replication** → Prevents data loss
5. **HPA configured** → Prevents API server overload
6. **KEDA configured** → Prevents worker queue saturation
7. **Monitoring enabled** → Enables early detection
8. **Backups configured** → Enables data recovery
9. **Private networking** → Meets security requirements
10. **Resource limits set** → Prevents resource exhaustion
**Validation:**
- Run `notebooks/module-3/01_ops_sanity_checks.ipynb` to validate each item
- Review cloud provider console for managed service configuration
- Check Helm values for application configuration
- Verify monitoring dashboards show expected metrics
---
## Section 9: Sidecars & Service Mesh (Istio)
### When Sidecars Are Needed
**Use cases:**
- **Egress control:** Restrict outbound traffic to approved destinations
- **mTLS:** Encrypt traffic between services
- **Policy enforcement:** Rate limiting, circuit breakers
- **Observability:** Distributed tracing, metrics collection
**When NOT needed:**
- Simple deployments without egress requirements
- Development environments
- Proof-of-concept deployments
### How to Enable Injection Safely
**Namespace-level injection (recommended for LangSmith):**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: langsmith
  labels:
    istio-injection: enabled
    istio-discovery: enabled
```
**Per-workload annotation (for selective injection):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
```
**Revision-based injection (for canary/blue-green):**
```yaml
labels:
  # Do not also set istio-injection here; it takes precedence over istio.io/rev
  istio.io/rev: default
```
### Operational Implications
**Logging and kubectl logs:**
- Multi-container pods require container selection
- **App logs:** `kubectl logs <pod> -c <container-name> -n <namespace>`
- **Proxy logs:** `kubectl logs <pod> -c istio-proxy -n <namespace>`
- **All logs:** `kubectl logs <pod> --all-containers=true -n <namespace>`
**Common issue:** If logs appear to be missing after injection, you are likely looking at the wrong container.
**Health probes and timeouts:**
- Sidecar adds latency to health checks
- Increase probe timeouts if sidecars are enabled
- Verify readiness probes account for sidecar startup
**Egress to external databases:**
- Configure `ServiceEntry` for external PostgreSQL/Redis endpoints
- Configure `DestinationRule` for traffic policies
- Verify egress rules allow database connections
See `docs/shared/sidecars_and_service_mesh.md` for detailed guidance.
---
## Section 10: Transition to Incident Response
Module 3 establishes the baseline for production operations. The next step is **incident response**:
**What you'll learn:**
- How to diagnose common failure modes
- How to gather evidence for support
- How to implement runbooks
- How to perform post-incident reviews
**Prerequisites:**
- Module 3 complete (production readiness validated)
- Monitoring and alerting configured
- On-call rotation established
---
## Artifacts Participants Leave With
1. **Production readiness checklist** (completed)
2. **Service sizing documentation** (baselines documented)
3. **Autoscaling configuration** (HPA and KEDA configured)
4. **Observability setup** (signals and thresholds documented)
5. **Backup strategy** (backups configured and tested)
6. **Ops sanity checks notebook** (validation results)
---
## Next Steps
1. **Run the ops sanity checks notebook:**
- `notebooks/module-3/01_ops_sanity_checks.ipynb`
2. **Review production readiness checklist:**
- `docs/shared/production_readiness_checklist.md`
3. **Document your thresholds:**
- `docs/shared/ops_signals_and_thresholds.md`
4. **Configure monitoring and alerting** (next module)
5. **Proceed to incident response training** (next module)
---
## References
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/docs/)
- [Istio Service Mesh](https://istio.io/latest/docs/)
- LangSmith Helm Chart Documentation
- Cloud Provider Documentation (AWS RDS, Azure Database, etc.)
+426
@@ -0,0 +1,426 @@
# Module 4: Troubleshooting & Incident Response
**Goal:** Teach operators how to diagnose LangSmith self-hosted issues under pressure, collect the right evidence, and resolve incidents efficiently—either independently or with LangChain Support.
**Duration:** ~3-4 hours (with optional full incident drill)
**Audience:** On-call engineers, platform owners, SREs, and anyone responsible for keeping LangSmith healthy
**Prerequisites:**
- Module 1 complete: LangSmith deployed and reachable
- Module 2 complete: Authentication configured
- Module 3 complete: Production operations concepts understood
- Participants own day-2 operations
---
## Overview
Module 4 is hands-on: learners will introduce subtle but noticeable failures and debug them using standard tools and the canonical diagnostics bundle. This module builds the muscle memory needed for real incidents.
**What you'll accomplish:**
- Understand common failure modes and their symptoms
- Master the "first 10 minutes" incident response checklist
- Learn to collect canonical diagnostics bundles
- Practice debugging with guided failure labs
- Know when and how to escalate to Support
**What this module avoids:**
- Deep dives into specific monitoring tools (assumes basic kubectl/helm)
- Performance optimization (covered in Module 3)
- Infrastructure provisioning (covered in Module 1)
- Authentication configuration (covered in Module 2)
---
## Section 1: Incident Reality Check
### The Mindset
**Incidents happen.** Even with perfect configuration, production systems fail. The difference between a 30-minute incident and a 4-hour outage is often preparation and process.
**Key principles:**
1. **Collect evidence first.** Don't redeploy, restart, or reconfigure until you understand what's wrong.
2. **Time is evidence.** Every minute that passes without collecting diagnostics is lost information.
3. **Symptoms are clues.** The same root cause can manifest differently depending on load, timing, and configuration.
4. **Support needs context.** A good diagnostics bundle is worth more than a perfect description.
### What Makes Incidents Hard
**Pressure:**
- Users are impacted
- Management is asking for updates
- You're on-call and tired
- Multiple systems are involved
**Complexity:**
- Distributed systems have many moving parts
- Failures cascade (one service fails, others follow)
- Symptoms don't always point to root cause
- Configuration drift accumulates over time
**Tooling:**
- Too many tools (which one shows the truth?)
- Too few tools (missing critical information)
- Tools that hide the problem (aggregation, sampling)
**This module prepares you for all of these.**
---
## Section 2: Common Failure Modes
### Ingestion & Tracing Failures
**Symptoms:**
- Traces appear delayed or missing
- Worker pods show errors in logs
- ClickHouse insert errors
- Queue backlogs
**Common causes:**
- ClickHouse connectivity issues (network, credentials, resource limits)
- Blob storage misconfiguration (large payloads fail)
- Worker resource exhaustion (CPU/memory limits)
- Redis connectivity (job queue backing up)
**What to check first:**
- Worker pod logs
- ClickHouse pod status and logs
- Redis connectivity and latency
- Blob storage configuration
### UI & API Failures
**Symptoms:**
- UI returns 5xx errors
- API endpoints timeout
- Login fails or redirects loop
- Specific features don't work
**Common causes:**
- Database connectivity (PostgreSQL unreachable)
- Authentication misconfiguration (OIDC/SAML)
- Ingress/load balancer issues
- API pod crashes or resource limits
**What to check first:**
- API pod logs
- Database connectivity
- Ingress status and configuration
- Authentication configuration (Module 2 validation)
### Authentication Failures
**Symptoms:**
- Users can't log in
- Redirect loops
- 403 errors after successful login
- Session timeouts
**Common causes:**
- IdP connectivity issues
- OIDC/SAML configuration drift
- Secret rotation without updating LangSmith
- Network policies blocking egress
**What to check first:**
- Auth pod logs
- IdP connectivity (curl to issuer URL)
- OIDC/SAML configuration (Module 2 validation)
- Network policies
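
The IdP connectivity check above can be sketched as a small shell helper. This is a minimal sketch, assuming an OIDC provider that publishes the standard discovery document at `/.well-known/openid-configuration`. The function names `oidc_discovery_url` and `check_idp` are hypothetical and are not part of LangSmith.

```bash
#!/usr/bin/env bash
# Build the OIDC discovery URL from an issuer URL (hypothetical helper).
# Strips one trailing slash so the result never contains "//.well-known".
oidc_discovery_url() {
  local issuer="${1%/}"
  printf '%s/.well-known/openid-configuration\n' "$issuer"
}

# Probe the discovery endpoint and print the HTTP status code.
# curl prints 000 when the connection itself fails (DNS, TLS, blocked egress).
check_idp() {
  curl -s -o /dev/null -m 10 -w '%{http_code}\n' "$(oidc_discovery_url "$1")"
}

# Usage (assumes OIDC_ISSUER holds your issuer URL):
#   check_idp "$OIDC_ISSUER"
```

Running the same check from a workstation and from inside the cluster helps separate an IdP outage from blocked egress: a `200` outside the cluster and `000` inside it points at network policy rather than the IdP.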
---
## Section 3: First 10 Minutes Checklist
**The first 10 minutes of an incident are critical.** This is when you collect the most valuable evidence and make decisions that determine how long the incident lasts.
### What NOT to Do
**Resist the urge to:**
- Run `helm upgrade` or `kubectl rollout restart`
- Delete pods "to see if they come back"
- Scale resources up/down
- Change configuration
**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
### The Checklist
See [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md) for the complete reference.
**Quick summary:**
1. **Minute 0-2:** Triage & scope (what's broken, who's impacted)
2. **Minute 2-5:** Quick health check (pods, events, ingress)
3. **Minute 5-8:** Collect diagnostics bundle (canonical script + snapshots)
4. **Minute 8-10:** Identify likely root cause (symptoms → checks)
**Key insight:** This checklist is not about fixing the issue—it's about collecting evidence and making informed decisions.
---
## Section 4: Standard Diagnostics Collection
### The Canonical Script
LangChain provides an official diagnostics script that captures everything Support needs:
**Location:**
```
https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
```
**What it captures:**
- Pod logs (all containers)
- Events (sorted by timestamp)
- Resource usage (CPU, memory)
- Configuration (deployments, services, ingress)
- Storage (PVCs, storage classes)
- Network (services, endpoints)
**How to use it:**
```bash
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
chmod +x get_k8s_debugging_info.sh
./get_k8s_debugging_info.sh <namespace>
```
**Important:** Always run this script before making changes. The bundle it creates is your evidence.
### What Good Debugging Looks Like
**Good debugging:**
- Starts with a baseline (what was working before)
- Collects evidence systematically (checklist-driven)
- Documents hypotheses and tests them
- Preserves evidence (saves diagnostics bundles)
- Escalates with context (diagnostics + timeline)
**Bad debugging:**
- Changes things without understanding
- Doesn't collect evidence
- Jumps to conclusions
- Destroys evidence (redeploys, deletes)
- Escalates without context ("it's broken, fix it")
**The difference:** Good debugging produces a clear root cause and fix. Bad debugging produces more incidents.
---
## Section 5: Working with Support
### What Speeds Up Support
**Good escalation includes:**
- Diagnostics bundle (canonical script output)
- Timeline (when did it start, what changed)
- Symptoms (what's broken, who's impacted)
- What you've tried (investigation steps, results)
- Environment details (versions, configuration)
**Use the [Support Escalation Template](../shared/support_escalation_template.md).**
### What Slows Down Support
**Poor escalation includes:**
- No diagnostics bundle ("just look at it")
- Vague symptoms ("it's slow")
- No timeline ("it broke")
- No environment details ("it's on Kubernetes")
- Secrets in logs (security risk)
**Result:** Support has to ask for information you could have provided, delaying resolution.
### Required Metadata
**Support will always ask for:**
1. Diagnostics bundle (canonical script)
2. Helm chart version
3. Image tags (if known)
4. Recent changes (deployments, config, infrastructure)
5. Cloud provider and region
6. Kubernetes version
7. What you've tried and results
**Provide this upfront to speed resolution.**
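
Most of the metadata listed above can be gathered in one pass. The following is a sketch rather than an official tool: the function name `collect_support_metadata` and the output file name are assumptions, and it relies only on standard `helm list`, `kubectl version`, and `kubectl get pods` calls.

```bash
#!/usr/bin/env bash
# Gather the metadata Support asks for into one text file (hypothetical helper).
# Assumes kubectl and helm are already pointed at the affected cluster.
collect_support_metadata() {
  local namespace="$1" out="$2"
  {
    echo "== Collected: $(date -u +%Y-%m-%dT%H:%M:%SZ) =="
    echo "== Helm releases (chart and app version) =="
    helm list -n "$namespace"
    echo "== Kubernetes version =="
    kubectl version
    echo "== Container images in use =="
    kubectl get pods -n "$namespace" \
      -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | sort -u
  } > "$out" 2>&1
}

# Usage:
#   collect_support_metadata langsmith support-metadata.txt
```

Recent changes, cloud provider and region, and the steps you have already tried still need to be written down by hand. Attach the file next to the diagnostics bundle.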
---
## Section 6: Preventing Repeat Incidents
### Post-Incident Review
**After an incident is resolved:**
1. **Document the root cause** (what actually broke)
2. **Identify contributing factors** (what made it worse)
3. **List what worked** (what helped you debug)
4. **List what didn't work** (what slowed you down)
5. **Create action items** (what to change to prevent recurrence)
**Key questions:**
- Could we have detected this earlier? (monitoring, alerts)
- Could we have prevented this? (configuration, testing)
- Could we have fixed it faster? (runbooks, tooling)
- What did we learn? (new failure mode, new tool)
### Common Patterns
**Configuration drift:**
- Secrets rotate, but LangSmith config isn't updated
- Infrastructure changes, but Helm values aren't updated
- IdP settings change, but OIDC/SAML config isn't updated
**Prevention:** Automated validation (Module 2, Module 3 notebooks), configuration as code, regular audits.
**Resource exhaustion:**
- ClickHouse runs out of disk
- PostgreSQL hits connection limits
- Workers hit CPU/memory limits
**Prevention:** Monitoring (Module 3), autoscaling (Module 3), capacity planning.
**Network issues:**
- Egress blocked by NetworkPolicy
- Load balancer misconfiguration
- DNS resolution failures
**Prevention:** Network policy testing, ingress validation (Module 1), DNS checks.
---
## Section 7: Hands-on Failure Labs
**This is where you practice.** Each lab follows the same pattern:
1. **Baseline snapshot:** Capture what "good" looks like
2. **Introduce failure:** Apply a subtle but noticeable fault
3. **Observe symptoms:** See how the failure manifests
4. **Collect diagnostics:** Run the canonical script and gather evidence
5. **Hypothesize root cause:** Based on symptoms, identify likely cause
6. **Verify with targeted checks:** Confirm your hypothesis
7. **Remediate:** Revert the failure
8. **Confirm recovery:** Verify everything is working again
9. **Capture lessons learned:** Document what you discovered
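
The baseline and comparison steps in the pattern above can be sketched with two small helpers. This is an illustration under stated assumptions: the function names, the `artifacts` directory, and the file naming are hypothetical, and the lab notebooks may capture baselines differently.

```bash
#!/usr/bin/env bash
# Save a labelled snapshot of pod state and events (hypothetical helper).
snapshot_pods() {
  local namespace="$1" label="$2" dir="${3:-artifacts}"
  mkdir -p "$dir"
  kubectl get pods -n "$namespace" -o wide > "$dir/pods-$label.txt"
  kubectl get events -n "$namespace" --sort-by='.lastTimestamp' > "$dir/events-$label.txt"
}

# Show what changed between two snapshots. diff exits 1 when files differ,
# so "|| true" keeps the helper from aborting a script that uses "set -e".
compare_snapshots() {
  local dir="${3:-artifacts}"
  diff -u "$dir/pods-$1.txt" "$dir/pods-$2.txt" || true
}

# Usage:
#   snapshot_pods langsmith baseline     # step 1, before injecting the fault
#   snapshot_pods langsmith incident     # step 4, while symptoms are visible
#   compare_snapshots baseline incident
```

Lines prefixed with `-` in the diff come from the baseline and lines prefixed with `+` from the incident snapshot, which makes status changes and restart count changes easy to spot.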
### Lab Structure
**Each failure lab includes:**
- **What this service does for LangSmith:** Context on the service's role
- **Expected symptoms when it fails:** What you'll see when it breaks
- **Failure injection options:** Two levels (subtle vs. obvious)
- **Do the drill:** Step-by-step debugging process
- **What Support will ask for:** Service-specific evidence
### Available Labs
1. **PostgreSQL Failure Lab** (`10_failure_lab_postgres.ipynb`)
- Connection failures, wrong credentials, network isolation
- Symptoms: API 5xx, login failures, connection exhaustion
2. **Redis Failure Lab** (`20_failure_lab_redis.ipynb`)
- Connectivity issues, wrong credentials
- Symptoms: Intermittent ingestion, latency spikes, worker backlog
3. **ClickHouse Failure Lab** (`30_failure_lab_clickhouse.ipynb`)
- Endpoint misconfiguration, network isolation, resource limits
- Symptoms: Traces delayed/missing, insert errors, UI loads but traces don't appear
4. **Blob Storage Failure Lab** (`40_failure_lab_blob_storage.ipynb`)
- Credential misconfiguration, bucket name errors
- Symptoms: Large payload traces degrade ClickHouse, warnings in logs
5. **Full Incident Drill** (`90_full_incident_drill.ipynb`) (optional)
- Combined failure + timeline pressure
- Practice "first 10 minutes" checklist
- Produce incident summary using escalation template
---
## Section 8: Workshop Wrap-up
### What You've Learned
- How to respond to incidents systematically
- How to collect canonical diagnostics bundles
- How to debug common failure modes
- How to escalate effectively to Support
- How to prevent repeat incidents
### Next Steps
**Immediate:**
- Run through failure labs to build muscle memory
- Customize the "first 10 minutes" checklist for your environment
- Set up monitoring and alerts (Module 3)
**Ongoing:**
- Practice incident response regularly (drills)
- Keep diagnostics script updated
- Document your own failure modes and fixes
- Share learnings with your team
### Resources
- [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)
- [Support Escalation Template](../shared/support_escalation_template.md)
- [Canonical Diagnostics Script](https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh)
- Module 1: Deployment & Baseline Validation
- Module 2: Identity & Authentication
- Module 3: Production Operations & Scaling
---
## Artifacts
**Participants leave with:**
- A working incident response process
- Experience debugging real failure modes
- A diagnostics bundle collection workflow
- An escalation template customized for their environment
- Confidence to handle incidents independently
---
## Common Pitfalls
**Don't:**
- Skip the baseline snapshot (you need "before" to compare to "after")
- Redeploy before collecting evidence (destroys diagnostics)
- Ignore error messages (they're clues)
- Escalate without diagnostics bundle (slows Support)
- Delete evidence (you'll need it for post-incident review)
**Do:**
- Follow the checklist (it's battle-tested)
- Collect diagnostics early (time is evidence)
- Document your investigation (helps you and Support)
- Test your process (run drills)
- Learn from each incident (prevent repeats)
---
## Troubleshooting
**"The diagnostics script fails":**
- Check kubectl access and namespace
- Verify script is up-to-date (check GitHub)
- Run with shell tracing to see which command fails: `bash -x get_k8s_debugging_info.sh <namespace>`
**"I can't reproduce the failure":**
- Check that failure injection was applied correctly
- Verify symptoms match expected behavior
- Try a different failure injection method (Level 2 if Level 1 didn't work)
**"The remediation doesn't work":**
- Verify you reverted the exact change you made
- Check for cascading failures (one failure caused another)
- Collect post-remediation diagnostics to compare
**"I don't understand the symptoms":**
- Review the service's role in LangSmith (lab introduction)
- Check logs for error patterns
- Compare to baseline snapshot (what changed?)
---
**Remember:** Incident response is a skill. Practice makes perfect. The more you drill, the better you'll be when real incidents happen.
+366
View File
@@ -0,0 +1,366 @@
# Authentication Troubleshooting Playbook
**Purpose:** Triage tree for common authentication failures
**Audience:** Operators troubleshooting SSO issues
---
## Triage Tree
### 1. Login Loop
**Symptoms:**
- User redirected to IdP
- User authenticates successfully
- Redirected back to LangSmith
- Immediately redirected to IdP again (infinite loop)
**Likely Causes:**
1. Redirect URI mismatch (most common)
2. Session cookie not being set (TLS/cookie issues)
3. Token validation failure
4. State parameter mismatch
**Evidence Gathering:**
```bash
# Check pod logs for redirect errors
kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i "redirect\|callback\|auth"
# Check ingress configuration
kubectl get ingress -n <namespace> -o yaml
# Test redirect URI exactness
curl -I https://<domain>/auth/callback
# Check browser console for cookie errors
# (Manual check in browser developer tools)
```
**Commands:**
```bash
# Verify redirect URI in Helm values
helm get values <release> -n <namespace> | grep -i redirect
# Check environment variables
kubectl exec <pod> -n <namespace> -- env | grep -i redirect
# Verify IdP whitelist (manual check in IdP admin console)
```
**Fix:**
1. Verify redirect URI matches **exactly** (case, trailing slashes, protocol)
2. Check IdP whitelist includes exact redirect URI
3. Verify TLS certificate is valid (browser must accept cookies)
4. Check session cookie settings (SameSite, Secure flags)
---
### 2. 403/Unauthorized After Login
**Symptoms:**
- User authenticates successfully at IdP
- Redirected back to LangSmith
- Receives 403 Forbidden or "Unauthorized" error
- Cannot access any resources
**Likely Causes:**
1. Role mapping not configured
2. User not in any mapped groups
3. Workspace not assigned to user's organization
4. Claims/attributes not being sent by IdP
**Evidence Gathering:**
```bash
# Check pod logs for authorization errors
kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i "403\|unauthorized\|forbidden\|role"
# Check role mapping configuration
helm get values <release> -n <namespace> | grep -i "role\|mapping\|group"
# Check user's group membership (from IdP)
# (Manual check - verify user is in expected groups)
```
**Commands:**
```bash
# Verify role mapping in Helm values
helm get values <release> -n <namespace> | grep -A 10 "roleMapping"
# Check environment variables for claim mappings
kubectl exec <pod> -n <namespace> -- env | grep -i "claim\|attribute\|group"
# Test with different user (in mapped group)
```
**Fix:**
1. Verify user is in a group that's mapped to a role
2. Check role mapping configuration in Helm values
3. Verify IdP is sending group claims/attributes
4. Assign user to appropriate group in IdP
5. Verify workspace assignment in LangSmith
---
### 3. SAML Assertion Missing Attributes
**Symptoms:**
- User authenticates successfully
- Login completes but user has no email/name
- Role mapping doesn't work
- User cannot access resources
**Likely Causes:**
1. IdP not configured to send required attributes
2. Attribute names don't match configuration
3. Attribute mapping incorrect in LangSmith
**Evidence Gathering:**
```bash
# Check logs for missing attribute errors
kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i "attribute\|missing\|email\|name"
# Check SAML attribute mapping
helm get values <release> -n <namespace> | grep -i "saml.*attribute"
# Verify SAML metadata includes attribute definitions
# (Check IdP metadata XML)
```
**Commands:**
```bash
# Verify attribute mapping configuration
kubectl exec <pod> -n <namespace> -- env | grep -i "SAML.*ATTRIBUTE"
# Check SAML metadata for attribute definitions
curl <SAML_METADATA_URL> | grep -i "Attribute"
# Test with SAML tracer (browser extension) to see actual assertion
```
**Fix:**
1. Configure IdP to send required attributes (email, name, groups)
2. Verify attribute names match LangSmith configuration exactly
3. Update attribute mapping in Helm values if names differ
4. Test with SAML tracer to verify attributes in assertion
---
### 4. Redirect Mismatch
**Symptoms:**
- Login attempt fails immediately
- Error: "redirect_uri_mismatch" or similar
- User never reaches IdP login page
**Likely Causes:**
1. Redirect URI in LangSmith doesn't match IdP whitelist
2. Trailing slash mismatch
3. Protocol mismatch (http vs https)
4. Domain mismatch
**Evidence Gathering:**
```bash
# Check configured redirect URI
helm get values <release> -n <namespace> | grep -i redirect
# Verify exact redirect URI format
kubectl exec <pod> -n <namespace> -- env | grep -i REDIRECT
# Test redirect URI endpoint
curl -I https://<domain>/auth/callback
```
**Commands:**
```bash
# Compare redirect URIs
echo "LangSmith config:"
kubectl exec <pod> -n <namespace> -- env | grep OIDC_REDIRECT_URI
echo "IdP whitelist:"
# (Manual check in IdP admin console)
# Verify exact match (including trailing slashes, case)
```
**Fix:**
1. Get exact redirect URI from LangSmith configuration
2. Verify it matches IdP whitelist **exactly** (character-by-character)
3. Update IdP whitelist if needed
4. Restart LangSmith pods after configuration change
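
A character-by-character comparison is easy to get wrong by eye. The sketch below compares the two values exactly and names the most likely kind of mismatch. The function name and the category labels are hypothetical, and the IdP value still has to be copied by hand from the IdP admin console.

```bash
#!/usr/bin/env bash
# Compare two redirect URIs exactly and name the likely kind of mismatch.
# Hypothetical helper: the labels are illustrative, not LangSmith output.
compare_redirect_uris() {
  local configured="$1" registered="$2"
  if [ "$configured" = "$registered" ]; then
    echo "match"
  elif [ "${configured%/}" = "${registered%/}" ]; then
    echo "mismatch: trailing slash"
  elif [ "${configured#*://}" = "${registered#*://}" ]; then
    echo "mismatch: protocol"
  elif [ "$(printf '%s' "$configured" | tr '[:upper:]' '[:lower:]')" = \
         "$(printf '%s' "$registered" | tr '[:upper:]' '[:lower:]')" ]; then
    echo "mismatch: letter case"
  else
    echo "mismatch: other (check domain and path)"
  fi
}

# Usage:
#   compare_redirect_uris "https://<domain>/auth/callback" "<value from IdP console>"
```

The checks run from most to least specific, so a URI that differs in several ways is reported under the first category that applies.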
---
### 5. TLS/Callback Issues
**Symptoms:**
- IdP callback fails with TLS errors
- Browser shows "Not Secure" warning
- Certificate errors in browser console
- Callback never completes
**Likely Causes:**
1. Self-signed certificate (browser rejects)
2. Certificate chain incomplete
3. Certificate expired
4. Certificate doesn't match domain
**Evidence Gathering:**
```bash
# Check TLS certificate
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null
# Check certificate expiration
echo | openssl s_client -connect <domain>:443 -servername <domain> 2>/dev/null | \
openssl x509 -noout -dates
# Check ingress TLS configuration
kubectl get ingress -n <namespace> -o yaml | grep -A 5 tls
```
**Commands:**
```bash
# Verify certificate validity
kubectl get ingress -n <namespace> -o jsonpath='{.items[0].spec.tls[0].secretName}'
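# Decode only the public certificate (tls.crt); dumping the whole secret would also expose the private key
kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -dates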
# Test certificate from pod
kubectl exec <pod> -n <namespace> -- openssl s_client -connect <domain>:443 -servername <domain>
```
**Fix:**
1. Use valid TLS certificate from trusted CA (not self-signed)
2. Ensure full certificate chain is present
3. Renew certificate if expired
4. Verify certificate matches domain exactly
5. Update ingress TLS secret if needed
---
## What Support Will Ask For
When contacting LangSmith support for authentication issues, provide:
### Minimal Evidence Bundle
1. **Configuration Summary (redacted)**
- Auth provider type (OIDC/SAML)
- Issuer/metadata URL (no secrets)
- Domain
- Claim/attribute mappings
- Role mapping configuration
2. **Pod Logs**
- Last 200 lines from API/server pods
- Filtered for auth-related errors
- Timestamp of failure
3. **Recent Events**
```bash
kubectl get events -n <namespace> --sort-by=.lastTimestamp > events.txt
```
4. **Ingress Configuration**
```bash
kubectl get ingress -n <namespace> -o yaml > ingress.yaml
```
5. **Helm Values (redacted)**
```bash
helm get values <release> -n <namespace> > helm-values.txt
# Manually redact secrets before sending
```
### Do NOT Include
- Client secrets
- Tokens
- Passwords
- Private keys
- Full certificate chains (public certs OK)
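
A first-pass filter can reduce the risk of leaking the items above when exporting Helm values. The sketch below is a heuristic, not a substitute for manual review: the function name `redact_values` and the keyword list are assumptions, and it only handles single-line `key: value` entries.

```bash
#!/usr/bin/env bash
# First-pass redaction for YAML-style "key: value" lines (hypothetical helper).
# Masks the value of any key whose name contains secret, password, token, or key.
# It does not catch secrets embedded in URLs or multi-line values, and it may
# over-redact harmless keys whose names happen to contain one of those words.
redact_values() {
  awk '{
    line = $0
    if (match(tolower(line), /^[ \t-]*[a-z0-9_.-]*(secret|password|token|key)[a-z0-9_.-]*:[ \t]*[^ \t]/)) {
      # Keep everything up to and including the first colon, mask the rest.
      print substr(line, 1, index(line, ":")) " <REDACTED>"
    } else {
      print line
    }
  }'
}

# Usage:
#   helm get values <release> -n <namespace> | redact_values > helm-values.txt
```

Always read the resulting file before sending it. The filter only lowers the chance of an accidental leak.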
### Support Bundle Script
```bash
#!/bin/bash
# Collect minimal auth troubleshooting bundle
NAMESPACE="${NAMESPACE:-langsmith}"
RELEASE="${HELM_RELEASE:-langsmith}"
OUTPUT_DIR="auth-support-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUTPUT_DIR"
# Pod logs (last 200 lines, auth-related)
kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' | \
tr ' ' '\n' | grep -E "(api|server|backend)" | head -3 | while read -r pod; do
kubectl logs "$pod" -n "$NAMESPACE" --tail=200 | \
grep -i -E "(auth|oidc|saml|sso|login|redirect)" > "$OUTPUT_DIR/${pod}-auth-logs.txt" || true
done
# Events
kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp > "$OUTPUT_DIR/events.txt"
# Ingress
kubectl get ingress -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/ingress.yaml"
# Helm values (redact secrets manually)
helm get values "$RELEASE" -n "$NAMESPACE" > "$OUTPUT_DIR/helm-values.txt"
echo "⚠️ REDACT SECRETS FROM helm-values.txt BEFORE SENDING"
# Configuration summary
cat > "$OUTPUT_DIR/config-summary.txt" <<EOF
Auth Configuration Summary
Generated: $(date -Iseconds)
Namespace: $NAMESPACE
Release: $RELEASE
Domain: ${LANGSMITH_DOMAIN:-N/A}
Provider: ${AUTH_PROVIDER:-N/A}
Note: Secrets not included for security.
EOF
echo "Support bundle saved to: $OUTPUT_DIR"
echo "⚠️ Review and redact secrets before sending to support"
```
---
## Quick Reference
### OIDC Issues
- **Redirect mismatch:** Check exact URI match
- **Token validation:** Check issuer URL, clock skew
- **Missing claims:** Verify scopes and IdP configuration
### SAML Issues
- **Missing attributes:** Check IdP attribute configuration
- **Signature failure:** Verify certificate in metadata
- **Entity ID mismatch:** Check entity ID configuration
### Common Commands
```bash
# Check auth configuration
kubectl exec <pod> -n <namespace> -- env | grep -i -E "(auth|oidc|saml)"
# Check logs
kubectl logs <pod> -n <namespace> --tail=100 | grep -i auth
# Check Helm values
helm get values <release> -n <namespace>
# Restart pods (after config change)
kubectl rollout restart deployment -n <namespace>
```
---
## Escalation
If issues persist after following this playbook:
1. Collect minimal evidence bundle (see above)
2. Document exact steps to reproduce
3. Note any recent configuration changes
4. Contact LangSmith support with evidence bundle
+110
View File
@@ -0,0 +1,110 @@
# Authentication Validation Checklist
**Purpose:** Operator-friendly checklist for validating SSO configuration
**Use:** Complete this checklist after running the validation notebook(s)
---
## Preconditions
- [ ] DNS configured and resolving correctly
- [ ] TLS certificate valid and trusted (not self-signed in production)
- [ ] Ingress configured and accessible
- [ ] LangSmith deployment healthy (all pods running, PVCs bound)
---
## Configuration Inputs
### OIDC Configuration
- [ ] `OIDC_ISSUER` set and accessible
- [ ] `OIDC_CLIENT_ID` set
- [ ] `OIDC_CLIENT_SECRET` set (stored in Kubernetes secret)
- [ ] `OIDC_REDIRECT_URI` matches exactly between LangSmith and IdP
- [ ] `OIDC_SCOPES` includes `openid` and `email`
- [ ] `OIDC_SCOPES` includes `groups` (if using group-based role mapping)
### SAML Configuration
- [ ] `SAML_METADATA_URL` accessible OR `SAML_METADATA_FILE` exists
- [ ] SAML metadata XML is valid
- [ ] Entity ID matches between LangSmith and IdP
- [ ] Signing certificate present in metadata
- [ ] SSO endpoints found in metadata
### Common to Both
- [ ] `LANGSMITH_DOMAIN` matches actual domain
- [ ] Claim/attribute mappings configured
- [ ] Role mapping configured (groups or users)
---
## Role Mapping
- [ ] Group-to-role mapping configured (preferred)
- [ ] Admin groups identified and mapped
- [ ] Member groups identified and mapped
- [ ] Viewer groups identified and mapped (if applicable)
- [ ] Minimal admin principle followed (1-2 admins to start)
---
## Login Validation
### Admin User
- [ ] Admin user can log in via SSO
- [ ] Admin user sees correct role (admin)
- [ ] Admin user can access organization settings
- [ ] Admin user can manage workspaces
- [ ] Admin user can manage users (if applicable)
### Standard User
- [ ] Standard user can log in via SSO
- [ ] Standard user sees correct role (member/viewer)
- [ ] Standard user can access assigned workspaces
- [ ] Standard user cannot access organization settings
- [ ] Standard user cannot manage users
---
## Session Management
- [ ] Logout works correctly
- [ ] Session invalidation works (logout from IdP invalidates LangSmith session)
- [ ] Session timeout configured appropriately
- [ ] Multiple browser sessions work independently
---
## Audit Evidence
- [ ] Authentication events logged
- [ ] Role assignments logged
- [ ] Session creation/destruction logged
- [ ] Failed authentication attempts logged
- [ ] Logs exportable to SIEM (if required)
---
## Documentation
- [ ] Helm values file saved (with secrets redacted)
- [ ] IdP settings documented
- [ ] Group-to-role mapping table created
- [ ] Admin assignments documented
- [ ] Troubleshooting playbook bookmarked
---
## Sign-Off
**Validated by:** _________________
**Date:** _________________
**Notes:** _________________
---
**Next Steps:**
- Proceed to Module 3 (if applicable)
- Schedule regular access reviews
- Document any deviations from standard configuration
+163
View File
@@ -0,0 +1,163 @@
# First 10 Minutes: Incident Response Checklist
**When:** You detect or are alerted to a LangSmith self-hosted issue.
**Goal:** Collect evidence, stabilize if possible, and prepare for escalation—without making things worse.
---
## ⚠️ Critical: Do NOT Redeploy
**Resist the urge to:**
- Run `helm upgrade` or `kubectl rollout restart`
- Delete pods "to see if they come back"
- Scale resources up/down
- Change configuration
**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
---
## Minute 0-2: Triage & Scope
- [ ] **Confirm the issue:** What's broken? (UI down, API 5xx, traces missing, auth failing)
- [ ] **Check who's impacted:** All users, specific endpoints, specific features?
- [ ] **Note the time:** Record detection time and any recent changes (deployments, config changes, infrastructure changes)
- [ ] **Check basic connectivity:**
```bash
kubectl cluster-info
kubectl get nodes
kubectl get pods -n <namespace>
```
---
## Minute 2-5: Quick Health Check
- [ ] **Pod status:**
```bash
kubectl get pods -n <namespace> -o wide
```
Look for: CrashLoopBackOff, Pending, Error states
- [ ] **Recent events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
Look for: Failed scheduling, image pull errors, resource limits
- [ ] **Ingress/Load Balancer:**
```bash
kubectl get ingress -n <namespace>
```
Check if endpoint is reachable (curl or browser)
- [ ] **Key deployments:**
```bash
kubectl get deployments -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>
```
---
## Minute 5-8: Collect Diagnostics Bundle
- [ ] **Run canonical diagnostics script:**
```bash
# Download and run the official script
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
chmod +x get_k8s_debugging_info.sh
./get_k8s_debugging_info.sh <namespace>
```
This captures: pod logs, events, resource usage, configuration
- [ ] **Save timestamped snapshot:**
```bash
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p artifacts/incident-$TIMESTAMP
kubectl get all -n <namespace> -o yaml > artifacts/incident-$TIMESTAMP/all-resources.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > artifacts/incident-$TIMESTAMP/events.txt
```
- [ ] **Check logs for obvious errors:**
```bash
# Check API server logs
kubectl logs -n <namespace> -l app=langsmith-api --tail=100
# Check worker logs
kubectl logs -n <namespace> -l app=langsmith-worker --tail=100
```
Look for: connection errors, timeouts, authentication failures, resource exhaustion
---
## Minute 8-10: Identify Likely Root Cause
Based on symptoms, check the most likely culprits:
### If UI/API is down:
- [ ] Check ingress/load balancer status (via cloud helper or kubectl)
- [ ] Check API pod logs for startup errors
- [ ] Verify external services (PostgreSQL, Redis) are reachable
### If traces are missing/delayed:
- [ ] Check ClickHouse connectivity and logs
- [ ] Check worker pod logs for insert errors
- [ ] Verify blob storage configuration (if large payloads)
### If authentication fails:
- [ ] Check OIDC/SAML configuration (Module 2 validation)
- [ ] Check IdP connectivity
- [ ] Review auth-related pod logs
### If ingestion is slow:
- [ ] Check Redis connectivity and latency
- [ ] Check worker pod resource usage
- [ ] Look for queue backlogs
---
## After 10 Minutes: Decision Point
**If you've identified and can safely fix the issue:**
- Document what you changed
- Verify recovery
- Collect post-recovery diagnostics
**If you need help:**
- Use the [Support Escalation Template](../shared/support_escalation_template.md)
- Include the diagnostics bundle
- Note what you've tried and the results
**If the issue is critical and escalating:**
- Continue collecting evidence every 5-10 minutes
- Document timeline of symptoms
- Prepare escalation with all evidence
---
## What NOT to Do
- ❌ Don't delete namespaces or persistent volumes
- ❌ Don't change database passwords or connection strings
- ❌ Don't scale resources without understanding the bottleneck
- ❌ Don't ignore error messages—they're evidence
- ❌ Don't skip the diagnostics bundle—Support will ask for it
---
## Quick Reference: Common Failure Patterns
| Symptom | Likely Cause | First Check |
|---------|--------------|-------------|
| All pods CrashLoopBackOff | Config error, missing secret | `kubectl describe pod` |
| API 5xx errors | Database/Redis connection | Pod logs, service endpoints |
| Traces not appearing | ClickHouse connectivity | ClickHouse pod logs |
| Slow ingestion | Redis latency, worker backlog | Worker logs, Redis metrics |
| Auth redirect loop | OIDC/SAML misconfiguration | Auth pod logs, IdP connectivity |
---
**Remember:** The goal is evidence collection and safe triage, not immediate resolution. A good diagnostics bundle is worth more than a hasty fix.
+312
View File
@@ -0,0 +1,312 @@
# Operations Signals and Thresholds
**Purpose:** Define early warning signals and red flag thresholds for LangSmith operations
**Use:** Configure monitoring and alerting based on these thresholds
**Frequency:** Review quarterly and adjust based on historical data
---
## Signal Categories
### Critical Signals (Red Flags - Immediate Action)
**Pod Health:**
- Pod in `CrashLoopBackOff` state → **IMMEDIATE**
- Pod `Pending` for > 10 minutes → **IMMEDIATE**
- Pod restart count > 5 in 1 hour → **IMMEDIATE**
- Pod `ImagePullBackOff` → **IMMEDIATE**
**Resource Saturation:**
- Node CPU > 90% for > 5 minutes → **IMMEDIATE**
- Node memory > 95% for > 5 minutes → **IMMEDIATE**
- Pod CPU > 90% for > 10 minutes → **IMMEDIATE**
- Pod memory > 95% for > 10 minutes → **IMMEDIATE**
**Application Health:**
- API server error rate > 5% → **IMMEDIATE**
- API server latency p95 > 5 seconds → **IMMEDIATE**
- Worker queue depth > 200 → **IMMEDIATE**
- Worker processing rate < 10 jobs/minute → **IMMEDIATE**
**Data Store Health:**
- PostgreSQL connection pool exhausted → **IMMEDIATE**
- PostgreSQL query timeout > 10 seconds → **IMMEDIATE**
- Redis out of memory → **IMMEDIATE**
- Redis connection refused → **IMMEDIATE**
- ClickHouse query timeout > 10 seconds → **IMMEDIATE**
- ClickHouse table size > 1 TB (single table) → **IMMEDIATE**
### Warning Signals (Yellow Flags - Monitor Closely)
**Pod Health:**
- Pod restart count > 2 in 1 hour → **WARNING**
- Pod `Pending` for > 5 minutes → **WARNING**
- Pod CPU > 70% for > 10 minutes → **WARNING**
- Pod memory > 80% for > 10 minutes → **WARNING**
**Application Health:**
- API server error rate > 1% → **WARNING**
- API server latency p95 > 2 seconds → **WARNING**
- Worker queue depth > 50 → **WARNING**
- Worker processing rate < 50 jobs/minute → **WARNING**
**Data Store Health:**
- PostgreSQL connections > 80% of max → **WARNING**
- PostgreSQL query latency p95 > 2 seconds → **WARNING**
- Redis memory > 90% → **WARNING**
- Redis hit rate < 80% → **WARNING**
- ClickHouse query latency p95 > 3 seconds → **WARNING**
- ClickHouse disk usage > 80% → **WARNING**
---
## Threshold Definitions
### Pod Restart Count
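**Measurement:** Container restart counts from `kubectl get pods -n <namespace>` (the `RESTARTS` column). The count is cumulative since pod creation, so compare two samples taken an hour apart to get a per-hour rate.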
**Calculation:**
```bash
# Prints one restart count per container, so sidecar restarts are not missed
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'
```
**Thresholds:**
- **Critical:** > 5 restarts in 1 hour
- **Warning:** > 2 restarts in 1 hour
**Action:**
- Check pod logs: `kubectl logs <pod> -n <namespace> --tail=100`
- Check events: `kubectl get events -n <namespace> --sort-by=.lastTimestamp`
- Check resource limits: `kubectl describe pod <pod> -n <namespace>`
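
Because `restartCount` is cumulative, the per-hour thresholds above need two samples. The following is a minimal sketch, assuming `kubectl` is on the PATH and that the caller stores the earlier sample somewhere; the function names are illustrative and not part of the repository's shared modules.

```python
import json
import subprocess


def restart_counts(pods_json: dict) -> dict:
    """Sum restartCount over every container in each pod, sidecars included."""
    counts = {}
    for pod in pods_json.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        counts[pod["metadata"]["name"]] = sum(s.get("restartCount", 0) for s in statuses)
    return counts


def restarts_in_window(before: dict, after: dict) -> dict:
    """Restarts per pod between two samples; pods absent from `before` count from zero."""
    return {name: now - before.get(name, 0) for name, now in after.items()}


def fetch_pods(namespace: str) -> dict:
    """Read live pod state. Requires kubectl access to the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `restart_counts(fetch_pods("<namespace>"))` once an hour and passing consecutive results to `restarts_in_window` yields the per-pod value to compare against the 2 and 5 restart thresholds.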
### Pending Pods
**Measurement:** Pods in `Pending` state
**Calculation:**
```bash
kubectl get pods -n <namespace> --field-selector=status.phase=Pending
```
**Thresholds:**
- **Critical:** Pending for > 10 minutes
- **Warning:** Pending for > 5 minutes
**Action:**
- Check events: `kubectl describe pod <pod> -n <namespace>`
- Check node capacity: `kubectl describe nodes` (allocatable vs. requested resources); check current usage with `kubectl top nodes`
- Check PVC binding: `kubectl get pvc -n <namespace>`
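
`kubectl get pods` lists Pending pods but not how long they have waited. The sketch below derives a duration from the JSON form of that output (`-o json`). It assumes the pod has been Pending since creation, which holds for pods that were never scheduled; the function name is illustrative.

```python
from datetime import datetime, timezone


def pending_minutes(pods_json: dict, now: datetime) -> dict:
    """Minutes since creation for each pod whose phase is Pending."""
    waiting = {}
    for pod in pods_json.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # Kubernetes serializes timestamps as RFC 3339 UTC, e.g. 2026-01-06T16:46:31Z
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        waiting[pod["metadata"]["name"]] = (now - created).total_seconds() / 60
    return waiting
```

Values above 5 correspond to the warning threshold and values above 10 to the critical threshold.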
### API Server Error Rate
**Measurement:** HTTP 5xx responses / total requests
**Calculation:**
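- Application metrics: `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` (aggregate both sides with `sum`; otherwise the division only matches 5xx series against themselves)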
- Or: Check application logs for error patterns
**Thresholds:**
- **Critical:** > 5% error rate
- **Warning:** > 1% error rate
**Action:**
- Check pod logs: `kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i error`
- Check downstream services (PostgreSQL, Redis, ClickHouse)
- Check resource usage: `kubectl top pod <api-pod> -n <namespace>`
### Worker Queue Depth
**Measurement:** Number of jobs in Redis queue
**Calculation:**
```bash
# Redis CLI
redis-cli LLEN langsmith:jobs:traces
```
**Or via application metrics:**
- KEDA metrics: `redis_queue_length`
**Thresholds:**
- **Critical:** > 200 jobs
- **Warning:** > 50 jobs
**Action:**
- Scale workers: Check KEDA ScaledObject
- Check worker processing rate
- Check for stuck jobs
### PostgreSQL Connection Count
**Measurement:** Active connections / max connections
**Calculation:**
```sql
SELECT count(*) FROM pg_stat_activity;
SELECT setting FROM pg_settings WHERE name = 'max_connections';
```
**Or via cloud provider metrics:**
- AWS RDS: `DatabaseConnections` metric
- Azure Database: `active_connections` metric
**Thresholds:**
- **Critical:** > 90% of max connections
- **Warning:** > 80% of max connections
**Action:**
- Check for connection leaks
- Review connection pool configuration
- Consider increasing max connections (if justified)
### Redis Memory Usage
**Measurement:** Used memory / max memory
**Calculation:**
```bash
redis-cli INFO memory
# used_memory / maxmemory
```
**Or via cloud provider metrics:**
- AWS ElastiCache: `DatabaseMemoryUsagePercentage`
- Azure Cache: `usedmemorypercentage`
**Thresholds:**
- **Critical:** > 95% memory usage
- **Warning:** > 90% memory usage
**Action:**
- Check for memory leaks
- Review key expiration policies
- Consider scaling up instance size
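
The `used_memory / maxmemory` ratio can be computed from the text that `redis-cli INFO memory` prints. This is a small sketch under the assumption that the output is captured as a string; the function name is illustrative. Note that `maxmemory:0` means no limit is configured, in which case the ratio is undefined and the cloud provider metric is the better source.

```python
from typing import Optional


def redis_memory_ratio(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory from `redis-cli INFO memory` output.

    Returns None when maxmemory is 0, which Redis uses to mean "no limit".
    """
    fields = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    return used / limit if limit else None
```

A result above 0.90 maps to the warning threshold and above 0.95 to the critical threshold.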
### ClickHouse Query Latency
**Measurement:** p95 query latency
**Calculation:**
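- ClickHouse system tables: `SELECT quantile(0.95)(query_duration_ms) FROM system.query_log WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR` (the `type` filter excludes `QueryStart` rows, which have no duration yet)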
**Thresholds:**
- **Critical:** p95 > 10 seconds
- **Warning:** p95 > 3 seconds
**Action:**
- Check table sizes (may need partitioning)
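- Check for slow queries: `SELECT * FROM system.query_log WHERE type = 'QueryFinish' AND query_duration_ms > 5000 ORDER BY query_duration_ms DESC LIMIT 10`
- Check CPU and memory: `kubectl top pod <clickhouse-pod> -n <namespace>` (`kubectl top` does not report disk I/O; use node-level or storage-level metrics for that)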
---
## Log Patterns to Monitor
### Common Failure Patterns
**Connection Refused:**
```bash
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "connection refused"
```
**Timeouts:**
```bash
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "timeout"
```
**Out of Memory:**
```bash
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "out of memory\|OOM"
```
**Database Errors:**
```bash
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "database\|postgres\|redis\|clickhouse" | grep -i "error\|fail"
```
**Authentication Errors:**
```bash
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "unauthorized\|forbidden\|auth"
```
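
The five `grep` checks above can be run in one pass over a captured log. The sketch below mirrors the same case-insensitive patterns; the category names and the function name are illustrative, and like the `grep` versions it will over-match (for example, `auth` also matches "author").

```python
import re
from collections import Counter

# Same patterns as the grep commands, one compiled regex per category
PATTERNS = {
    "connection_refused": re.compile(r"connection refused", re.I),
    "timeout": re.compile(r"timeout", re.I),
    "out_of_memory": re.compile(r"out of memory|OOM", re.I),
    # Equivalent of piping one grep into another: both groups must match the line
    "database_error": re.compile(
        r"(?=.*(database|postgres|redis|clickhouse))(?=.*(error|fail))", re.I
    ),
    "auth_error": re.compile(r"unauthorized|forbidden|auth", re.I),
}


def count_patterns(log_text: str) -> Counter:
    """Count log lines matching each failure pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feeding it the output of `kubectl logs <pod> -n <namespace> --tail=100` gives a quick per-category tally to include in an escalation.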
---
## Escalation Evidence
When escalating to support, gather:
1. **Pod Status:**
```bash
kubectl get pods -n <namespace> -o wide
kubectl describe pod <problem-pod> -n <namespace>
```
2. **Recent Events:**
```bash
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -50
```
3. **Resource Usage:**
```bash
kubectl top pods -n <namespace>
kubectl top nodes
```
4. **Pod Logs:**
```bash
kubectl logs <pod> -n <namespace> --tail=200
```
5. **Application Metrics:**
- Error rates, latencies, queue depths
- Database connection counts
- Cache hit rates
6. **Configuration:**
- Helm values (redacted)
- Environment variables (redacted)
- Resource requests/limits
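
Redacting Helm values by hand is error-prone. The following is a rough sketch of a line-based redactor for `key: value` YAML; the key pattern and function name are assumptions chosen for illustration. It does not handle multi-line values or credentials embedded in URLs, so review the result before sharing.

```python
import re

# Keys whose values should never leave the cluster (illustrative list)
SENSITIVE_KEY = re.compile(
    r"(password|secret|token|api[_-]?key|connection[_-]?string)", re.I
)


def redact(text: str) -> str:
    """Replace the value of any sensitive-looking YAML key with [REDACTED]."""
    redacted = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Only rewrite lines that carry an inline value; indentation stays in `key`
        if sep and value.strip() and SENSITIVE_KEY.search(key):
            line = f"{key}: [REDACTED]"
        redacted.append(line)
    return "\n".join(redacted)
```

The function errs on the side of over-redaction: a key such as `existingSecret` is masked even though its value is only a secret name.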
---
## Threshold Tuning
**Initial thresholds:** Use the values above as starting points.
**Tuning process:**
1. Monitor for 1-2 weeks
2. Identify false positives (alerts that don't require action)
3. Identify missed incidents (issues that should have alerted)
4. Adjust thresholds based on historical data
5. Document threshold changes and rationale
**Factors to consider:**
- Workload patterns (peak hours, batch jobs)
- Growth trajectory (user growth, data growth)
- Resource capacity (cluster size, database size)
- Business requirements (SLA, RTO, RPO)
---
## Quick Reference
| Signal | Critical | Warning | Measurement |
|--------|----------|---------|-------------|
| Pod restarts | > 5/hour | > 2/hour | `kubectl get pods` |
| Pending pods | > 10 min | > 5 min | `kubectl get pods` |
| API error rate | > 5% | > 1% | Application metrics |
| Worker queue | > 200 | > 50 | Redis queue length |
| PostgreSQL connections | > 90% max | > 80% max | Database metrics |
| Redis memory | > 95% | > 90% | Redis INFO memory |
| ClickHouse latency | > 10s p95 | > 3s p95 | Query log |
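
The table can be encoded directly so that notebooks and alert scripts share one source of truth. This is a minimal sketch; the signal keys are illustrative names, and the numbers are copied from the table above.

```python
# (warning, critical) pairs taken from the Quick Reference table
THRESHOLDS = {
    "pod_restarts_per_hour": (2, 5),
    "pending_minutes": (5, 10),
    "api_error_rate_pct": (1, 5),
    "worker_queue_depth": (50, 200),
    "postgres_connections_pct": (80, 90),
    "redis_memory_pct": (90, 95),
    "clickhouse_p95_seconds": (3, 10),
}


def classify(signal: str, value: float) -> str:
    """Return CRITICAL, WARNING, or OK for a measured value."""
    warning, critical = THRESHOLDS[signal]
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"
```

All thresholds are strict (`>`), matching the table, so a queue depth of exactly 200 is still a warning.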
---
## Next Steps
1. **Configure alerts** based on these thresholds
2. **Test alerts** to ensure they fire correctly
3. **Document runbooks** for each alert type
4. **Review quarterly** and adjust based on experience
@@ -0,0 +1,238 @@
# Production Readiness Checklist
**Purpose:** Validate that LangSmith deployment meets production requirements
**Use:** Complete this checklist before declaring production-ready
**Frequency:** Review quarterly or after significant changes
---
## Infrastructure & Networking
### Cloud Provider Configuration
- [ ] Correct cloud account/subscription (verified)
- [ ] Correct region selected (verified)
- [ ] Private networking configured (all data stores in private subnets)
- [ ] VPC/VNet peering configured (if multi-VPC deployment)
- [ ] Security groups/NSGs configured correctly
- [ ] IAM roles/Managed Identities configured (no access keys)
### Kubernetes Cluster
- [ ] Cluster version supported (check compatibility matrix)
- [ ] Node capacity sufficient (headroom for scaling)
- [ ] Cluster autoscaling enabled (if applicable)
- [ ] CSI storage drivers installed (EBS CSI for AWS, Azure Disk CSI for Azure)
- [ ] Network policies configured (if required)
- [ ] Resource quotas set (if multi-tenant)
---
## Data Stores
### PostgreSQL
- [ ] Instance size meets baseline (db.r5.xlarge minimum for production)
- [ ] Multi-AZ enabled (RDS) or read replicas configured (Azure)
- [ ] Storage autoscaling enabled
- [ ] Automated backups configured (7-day retention minimum)
- [ ] Connection pool configured (100+ connections)
- [ ] Private networking (no public access)
- [ ] Encryption at rest enabled
- [ ] Performance insights/monitoring enabled
### Redis
- [ ] Instance type meets baseline (cache.r6g.xlarge minimum for production)
- [ ] Cluster mode enabled (3+ nodes for production)
- [ ] AOF persistence enabled (production)
- [ ] Memory headroom sufficient (50% free)
- [ ] Private networking (no public access)
- [ ] Encryption at rest enabled
### ClickHouse
- [ ] Deployment type: Managed (ClickHouse Cloud) OR in-cluster with proper sizing
- [ ] In-cluster: 3-node cluster minimum (for HA)
- [ ] Resources per node: 8 CPU, 32 GB RAM, 1 TB storage (production)
- [ ] Replication factor: 2x (6 total pods for 3-node cluster)
- [ ] Storage class: EBS gp3 with 3000 IOPS (AWS) or Premium SSD (Azure)
- [ ] Backups configured (if in-cluster)
- [ ] Private networking (no public access)
### Blob Storage (REQUIRED)
- [ ] Blob storage provider configured (S3 or Azure Blob Storage)
- [ ] NOT using local filesystem or in-cluster storage
- [ ] Bucket/container created and accessible
- [ ] IAM role/Managed Identity configured (no access keys)
- [ ] Versioning enabled
- [ ] Encryption at rest enabled
- [ ] Lifecycle policies configured (cost optimization)
- [ ] Health check passes (see ops sanity checks notebook)
**Critical:** Blob storage is REQUIRED for production. Without it, ClickHouse will become unusable under load.
---
## Application Configuration
### Helm Configuration
- [ ] Helm values file reviewed and documented
- [ ] Resource requests/limits set for all containers
- [ ] Replica counts set appropriately (min 2 for HA)
- [ ] Environment variables documented
- [ ] Secrets stored in Kubernetes (not in values file)
- [ ] Values file version controlled (Git)
### High Availability
- [ ] API server replicas: 2+ (for HA)
- [ ] Worker replicas: 1+ (scaled via KEDA)
- [ ] Pod anti-affinity rules configured (spread across nodes/AZs)
- [ ] Readiness probes configured correctly
- [ ] Liveness probes configured correctly
### Autoscaling
- [ ] HPA configured for API servers (CPU/memory targets)
- [ ] HPA min replicas: 2
- [ ] HPA max replicas: 10+ (adjust based on capacity)
- [ ] KEDA ScaledObject configured for workers (queue depth)
- [ ] KEDA min replicas: 1
- [ ] KEDA max replicas: 20+ (adjust based on workload)
---
## Observability
### Monitoring
- [ ] Kubernetes metrics available (pod CPU/memory)
- [ ] Application metrics exposed (request rates, latencies)
- [ ] Database metrics available (connection counts, query performance)
- [ ] Redis metrics available (memory usage, hit rates)
- [ ] ClickHouse metrics available (query latency, table sizes)
- [ ] Log aggregation configured (CloudWatch, Azure Monitor, etc.)
### Alerting
- [ ] Critical alerts configured (pod crashes, high error rates)
- [ ] Warning alerts configured (resource saturation, queue depth)
- [ ] Alert thresholds documented (see ops_signals_and_thresholds.md)
- [ ] On-call rotation configured
- [ ] Escalation paths defined
### Dashboards
- [ ] Kubernetes dashboard (pod status, resource usage)
- [ ] Application dashboard (request rates, error rates)
- [ ] Database dashboard (connection counts, query performance)
- [ ] Queue depth dashboard (worker queue metrics)
---
## Security
### Authentication & Authorization
- [ ] SSO configured (OIDC or SAML)
- [ ] Local auth disabled (production)
- [ ] Role mapping configured correctly
- [ ] Admin access restricted (minimal admins)
### Network Security
- [ ] Ingress TLS configured (valid certificate)
- [ ] mTLS enabled (if service mesh used)
- [ ] Egress policies configured (if service mesh used)
- [ ] Network policies configured (if required)
### Secrets Management
- [ ] Secrets stored in Kubernetes (not in code)
- [ ] Secrets rotation process documented
- [ ] Access to secrets restricted (RBAC)
---
## Backup & Disaster Recovery
### Backups
- [ ] PostgreSQL backups automated (daily, 7-day retention)
- [ ] ClickHouse backups configured (if in-cluster)
- [ ] Blob storage versioning enabled
- [ ] Backup restoration tested (within the last 6 months)
### Disaster Recovery
- [ ] DR plan documented
- [ ] RTO/RPO defined
- [ ] Failover procedure tested
- [ ] Cross-region replication configured (if required)
---
## Operational Readiness
### Documentation
- [ ] Runbooks documented (common operations)
- [ ] Incident response procedures documented
- [ ] Escalation paths documented
- [ ] Service sizing baselines documented
### Testing
- [ ] Load testing performed (validates scaling)
- [ ] Failover testing performed (validates HA)
- [ ] Backup restoration tested
- [ ] Ops sanity checks notebook run (all checks pass)
### Team Readiness
- [ ] On-call rotation established
- [ ] Team trained on operations
- [ ] Access to cloud console (for managed services)
- [ ] Access to monitoring/alerting tools
---
## Service Mesh (If Applicable)
### Istio Configuration
- [ ] Istio installed and configured
- [ ] Sidecar injection enabled (namespace or per-workload)
- [ ] ServiceEntry configured (for external databases)
- [ ] DestinationRule configured (traffic policies)
- [ ] Egress policies configured (if required)
- [ ] mTLS enabled (if required)
### Operational Considerations
- [ ] Log selection documented (app vs proxy logs)
- [ ] Health probe timeouts adjusted (account for sidecar)
- [ ] Multi-container pod logging understood
---
## Sign-Off
**Validated by:** _________________
**Date:** _________________
**Next Review Date:** _________________
**Notes:** _________________
---
## Post-Checklist Actions
1. **Run ops sanity checks notebook:**
- `notebooks/module-3/01_ops_sanity_checks.ipynb`
- Address any failures before production
2. **Document thresholds:**
- Update `docs/shared/ops_signals_and_thresholds.md`
- Configure alerts based on thresholds
3. **Schedule quarterly reviews:**
- Review checklist quarterly
- Update baselines as workload grows
- Adjust thresholds based on historical data
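
For quarterly reviews, it can help to list which boxes are still open. The sketch below parses a copy of this checklist, assuming it keeps the `- [ ]` / `- [x]` task-list syntax; the function name is illustrative.

```python
import re

ITEM = re.compile(r"^- \[( |x|X)\] (.+)$")


def unchecked_by_section(markdown: str) -> dict:
    """Group unchecked checklist items under their nearest preceding heading."""
    section = "(no section)"
    open_items = {}
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            section = line.lstrip("#").strip()
            continue
        match = ITEM.match(line)
        if match and match.group(1) == " ":
            open_items.setdefault(section, []).append(match.group(2))
    return open_items
```

An empty result means every item is checked, which is a reasonable precondition for the sign-off block.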
---
## Common Gaps
**Most common production readiness gaps:**
1. Blob storage not configured (CRITICAL)
2. PostgreSQL single-AZ (no HA)
3. Redis single node (no cluster mode)
4. No autoscaling configured
5. No monitoring/alerting
6. Backups not tested
7. Resource limits not set
**Address these before declaring production-ready.**
+468
View File
@@ -0,0 +1,468 @@
# Sidecars & Service Mesh (Istio)
**Purpose:** Guide for enabling and operating Istio sidecars in LangSmith deployments
**Audience:** Platform engineers and operators managing service mesh configurations
**Prerequisites:** Istio installed in cluster (out of scope for this guide)
---
## When Sidecars Are Needed
### Use Cases
**Egress Control:**
- Restrict outbound traffic to approved destinations only
- Prevent pods from accessing unauthorized external services
- Enforce network policies at the service mesh level
**mTLS (Mutual TLS):**
- Encrypt traffic between services within the cluster
- Provide service-to-service authentication
- Meet compliance requirements for encrypted communication
**Policy Enforcement:**
- Rate limiting between services
- Circuit breakers for fault tolerance
- Traffic splitting for canary deployments
**Observability:**
- Distributed tracing across services
- Service-level metrics collection
- Request/response logging
### When NOT Needed
**Simple deployments:**
- Development environments
- Proof-of-concept deployments
- Single-service deployments
**No egress requirements:**
- All traffic stays within cluster
- No external database connections
- No outbound API calls
**Alternative solutions:**
- Network policies (Kubernetes native)
- Ingress controllers (for north-south traffic)
- Application-level rate limiting
---
## How to Enable Injection Safely
### Namespace-Level Injection (Recommended)
**Best for:** LangSmith namespace (all workloads need sidecars)
**Configuration:**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: langsmith
  labels:
    istio-injection: enabled
    istio-discovery: enabled
```
**Apply:**
```bash
kubectl label namespace langsmith istio-injection=enabled istio-discovery=enabled
```
**Verification:**
```bash
kubectl get namespace langsmith --show-labels
```
**Behavior:**
- All new pods in namespace get sidecars automatically
- Existing pods require restart to get sidecars
- Pods can opt out with annotation: `sidecar.istio.io/inject: "false"`
### Per-Workload Annotation (Selective Injection)
**Best for:** Specific workloads that need sidecars
**Configuration:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: api
          # ... container spec
```
**Apply:**
```bash
kubectl patch deployment langsmith-api -n langsmith -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"true"}}}}}'
```
**Behavior:**
- Only annotated workloads get sidecars
- Works even if namespace injection is disabled
- More granular control
### Revision-Based Injection (Canary/Blue-Green)
**Best for:** Gradual rollout or canary deployments
**Configuration:**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: langsmith
  labels:
    # Do not also set istio-injection here: it takes precedence over istio.io/rev
    istio.io/rev: default  # or a specific revision
```
**Behavior:**
- Allows multiple Istio control planes
- Enables gradual migration
- Supports canary deployments
---
## Operational Implications
### Logging and kubectl logs Behavior
**Multi-container pods:** After sidecar injection, pods have multiple containers:
- Application container (e.g., `langsmith-api`)
- Sidecar container (`istio-proxy`)
**Default behavior:**
```bash
# Without -c, kubectl picks a default container (usually the application)
kubectl logs <pod> -n <namespace>
# Show logs from the Istio proxy sidecar
kubectl logs <pod> -n <namespace> -c istio-proxy
# Show logs from specific container
kubectl logs <pod> -n <namespace> -c <container-name>
# Show logs from all containers
kubectl logs <pod> -n <namespace> --all-containers=true
```
**Common issue:** If logs appear to be missing after injection, you are probably looking at the wrong container.
**Solution:**
```bash
# List containers in pod
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
# Get logs from application container
kubectl logs <pod> -n <namespace> -c langsmith-api
# Get logs from proxy container
kubectl logs <pod> -n <namespace> -c istio-proxy
```
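
When scripting log collection, the application container can be picked out by excluding the known sidecar name. This is a small sketch that works on the JSON from `kubectl get pod <pod> -o json`; the function names are illustrative.

```python
def app_containers(pod: dict, sidecars=("istio-proxy",)) -> list:
    """Names of the pod's containers, excluding injected sidecars."""
    names = [c["name"] for c in pod.get("spec", {}).get("containers", [])]
    return [name for name in names if name not in sidecars]


def logs_command(pod_name: str, namespace: str, container: str, tail: int = 100) -> list:
    """Build an explicit `kubectl logs` invocation for one container."""
    return ["kubectl", "logs", pod_name, "-n", namespace, "-c", container, f"--tail={tail}"]
```

Passing the result of `logs_command` to `subprocess.run` avoids relying on whichever container kubectl would choose by default.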
### Health Probes and Timeouts
**Sidecar adds latency:**
- Sidecar intercepts health check requests
- Adds ~10-50ms latency per request
- May cause probe timeouts if thresholds are too low
**Adjust probe timeouts:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
spec:
  template:
    spec:
      containers:
        - name: api
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5  # Increase if sidecars enabled
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5  # Increase if sidecars enabled
            failureThreshold: 3
```
**Verification:**
```bash
# Check probe success rate
kubectl get pods -n <namespace> -o wide
# Look for pods in Ready state
# Check probe failures
kubectl describe pod <pod> -n <namespace> | grep -A 5 "Liveness\|Readiness"
```
### Egress to External Databases
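**Problem:** When the mesh's outbound traffic policy is set to `REGISTRY_ONLY`, sidecars block traffic to any host that is not in the service registry. (Istio's default policy, `ALLOW_ANY`, permits it.)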
**Solution:** Configure `ServiceEntry` for external endpoints.
**Example ServiceEntry for PostgreSQL:**
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: postgres-external
  namespace: langsmith
spec:
  hosts:
    - <postgres-hostname>.rds.amazonaws.com
  ports:
    - number: 5432
      name: postgres
      protocol: TCP
  location: MESH_EXTERNAL
  resolution: DNS
```
**Example ServiceEntry for Redis:**
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: redis-external
  namespace: langsmith
spec:
  hosts:
    - <redis-endpoint>.cache.amazonaws.com
  ports:
    - number: 6379
      name: redis
      protocol: TCP
  location: MESH_EXTERNAL
  resolution: DNS
```
**Apply:**
```bash
kubectl apply -f serviceentry-postgres.yaml -n langsmith
kubectl apply -f serviceentry-redis.yaml -n langsmith
```
**Verification:**
```bash
# Check ServiceEntry
kubectl get serviceentry -n langsmith
# Test connectivity from pod
kubectl exec -it <pod> -n langsmith -c <app-container> -- nc -zv <db-host> <port>
```
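
The two ServiceEntry manifests differ only in name, host, and port, so they can be generated rather than hand-copied. The sketch below builds the same structure as a dictionary; the function name and the example hostname are hypothetical. `kubectl apply -f -` accepts JSON, so the serialized output can be piped to it directly.

```python
import json


def service_entry(name: str, namespace: str, host: str, port: int, port_name: str) -> dict:
    """Build a ServiceEntry for one external TCP endpoint, resolved via DNS."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "ServiceEntry",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "ports": [{"number": port, "name": port_name, "protocol": "TCP"}],
            "location": "MESH_EXTERNAL",
            "resolution": "DNS",
        },
    }


def to_manifest(entry: dict) -> str:
    """Serialize for `kubectl apply -f -`."""
    return json.dumps(entry, indent=2)
```

Generating one entry per external dependency (PostgreSQL, Redis, and any managed ClickHouse endpoint) keeps the port names and namespaces consistent.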
### DestinationRule for Traffic Policies
**Example DestinationRule:**
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: postgres-dr
  namespace: langsmith
spec:
  host: <postgres-hostname>.rds.amazonaws.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        http2MaxRequests: 100
    tls:
      mode: SIMPLE
```
---
## Sample Labels and Annotations
### Namespace Labels
```yaml
labels:
  istio-injection: enabled
  istio-discovery: enabled
```
### Pod Annotations
```yaml
annotations:
  sidecar.istio.io/inject: "true"   # Enable injection
  # or
  sidecar.istio.io/inject: "false"  # Disable injection
```
### Complete Example
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
  namespace: langsmith
spec:
  replicas: 2
  selector:
    matchLabels:
      app: langsmith-api  # illustrative label; use the labels from your chart
  template:
    metadata:
      labels:
        app: langsmith-api
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: api
          image: langsmith/api:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
```
---
## Verification Commands
### Check Sidecar Injection
```bash
# List pods and containers
kubectl get pods -n langsmith -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'
# Check for istio-proxy container
kubectl get pod <pod> -n langsmith -o jsonpath='{.spec.containers[*].name}' | grep istio-proxy
# Describe pod to see all containers
kubectl describe pod <pod> -n langsmith | grep -A 10 "Containers:"
```
### Check ServiceEntry
```bash
# List ServiceEntries
kubectl get serviceentry -n langsmith
# Describe ServiceEntry
kubectl describe serviceentry <name> -n langsmith
```
### Check DestinationRule
```bash
# List DestinationRules
kubectl get destinationrule -n langsmith
# Describe DestinationRule
kubectl describe destinationrule <name> -n langsmith
```
### Test Connectivity
```bash
# Test from application container
kubectl exec -it <pod> -n langsmith -c <app-container> -- curl -v <external-url>
# Test from proxy container (if needed)
kubectl exec -it <pod> -n langsmith -c istio-proxy -- curl -v <external-url>
```
---
## Troubleshooting
### Logs Appear Missing
**Symptom:** `kubectl logs <pod>` shows no output or wrong logs.
**Cause:** Looking at wrong container (proxy instead of app).
**Solution:**
```bash
# List containers
kubectl get pod <pod> -n langsmith -o jsonpath='{.spec.containers[*].name}'
# Get logs from correct container
kubectl logs <pod> -n langsmith -c <app-container-name>
```
### Health Probes Failing
**Symptom:** Pods not becoming Ready after sidecar injection.
**Cause:** Probe timeouts too low for sidecar latency.
**Solution:** Increase `timeoutSeconds` in probe configuration.
### External Database Connection Refused
**Symptom:** Cannot connect to external PostgreSQL/Redis.
**Cause:** ServiceEntry not configured or incorrect.
**Solution:**
1. Check ServiceEntry exists: `kubectl get serviceentry -n langsmith`
2. Verify hostname matches: `kubectl describe serviceentry <name> -n langsmith`
3. Check egress policies: `kubectl get authorizationpolicy -n langsmith`
### High Latency After Injection
**Symptom:** Request latency increased after sidecar injection.
**Cause:** Normal sidecar overhead (10-50ms per request).
**Solution:** This is expected. If latency is excessive (>100ms), check:
- Proxy resource limits
- Network policies
- mTLS overhead
---
## Best Practices
1. **Start with namespace-level injection** for simplicity
2. **Adjust health probe timeouts** after injection
3. **Configure ServiceEntry** for all external dependencies
4. **Monitor proxy resource usage** (CPU/memory)
5. **Document container names** for log access
6. **Test connectivity** after configuration changes
7. **Use per-workload annotation** when only some workloads need sidecars
---
## References
- [Istio Documentation](https://istio.io/latest/docs/)
- [Istio Sidecar Injection](https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/)
- [Istio ServiceEntry](https://istio.io/latest/docs/reference/config/networking/service-entry/)
- [Istio DestinationRule](https://istio.io/latest/docs/reference/config/networking/destination-rule/)
+185
View File
@@ -0,0 +1,185 @@
# Support Escalation Template
**Use this template when escalating an incident to LangChain Support.**
Copy and fill in each section. Include the diagnostics bundle and any relevant evidence.
---
## Incident Summary
**Start Time:** `YYYY-MM-DD HH:MM:SS UTC`
**Detection Time:** `YYYY-MM-DD HH:MM:SS UTC`
**Current Status:** `[Investigating / Escalating / Resolved]`
**Brief Description:**
```
[One-sentence summary of the issue]
```
---
## Symptoms
**Who is impacted:**
- [ ] All users
- [ ] Specific user(s) or workspace(s)
- [ ] Specific endpoints or features
- [ ] Internal operations only
**What's broken:**
- [ ] UI is unreachable or returns errors
- [ ] API endpoints return 5xx errors
- [ ] Traces are missing or delayed
- [ ] Authentication/authorization failures
- [ ] Ingestion is slow or failing
- [ ] Other: `[describe]`
**Error messages observed:**
```
[Paste relevant error messages, redacting any secrets]
```
**User-facing impact:**
```
[Describe what users experience]
```
---
## Recent Changes
**Deployments/Releases:**
- [ ] Helm upgrade/chart change: `[version/date]`
- [ ] Configuration change: `[what changed]`
- [ ] Infrastructure change: `[what changed]`
- [ ] No recent changes
**Timeline:**
```
[Chronological list of changes leading up to the incident]
```
---
## Environment Details
**Cloud Provider:** `[AWS / Azure / GCP / Other]`
**Region/Location:** `[region]`
**Kubernetes Service:** `[EKS / AKS / GKE / Other]`
**Cluster Name:** `[cluster-name]`
**Namespace:** `[namespace]`
**LangSmith Version:**
- Helm Chart Version: `[version]`
- Image Tags: `[if known]`
- Deployment Method: `[Helm / kubectl / Other]`
**Infrastructure:**
- PostgreSQL: `[RDS / Azure Database / In-cluster / Other]`
- Redis: `[ElastiCache / Azure Cache / In-cluster / Other]`
- ClickHouse: `[Managed / In-cluster]`
- Blob Storage: `[S3 / Azure Blob / GCS / Other]`
---
## Diagnostics Bundle
**Bundle Location:** `[path or URL to diagnostics bundle]`
**Bundle Contents:**
- [ ] Canonical diagnostics script output (`get_k8s_debugging_info.sh`)
- [ ] `kubectl get all -o yaml` snapshot
- [ ] Recent events (`kubectl get events`)
- [ ] Pod logs (API, workers, ClickHouse)
- [ ] Resource usage snapshot (`kubectl top pods/nodes`)
- [ ] Ingress/load balancer configuration
- [ ] Helm values (redacted)
**Bundle Timestamp:** `YYYY-MM-DD HH:MM:SS UTC`
---
## What We've Tried
**Investigation Steps:**
1. `[What you checked and what you found]`
2. `[Next step and result]`
3. `[Continue as needed]`
**Remediation Attempts:**
- [ ] Restarted pods: `[which pods, result]`
- [ ] Checked external service connectivity: `[result]`
- [ ] Verified configuration: `[result]`
- [ ] Other: `[describe]`
**Current Hypothesis:**
```
[Your best guess at the root cause, with evidence]
```
---
## Evidence & Logs
**Key Log Excerpts (redact secrets):**
```
[Paste relevant log lines with timestamps]
```
**Error Patterns:**
```
[Describe patterns you've observed]
```
**Metrics/Signals:**
```
[Any metrics or signals that indicate the issue]
```
---
## Questions for Support
1. `[Your question]`
2. `[Another question]`
3. `[Continue as needed]`
---
## Additional Context
**Related Issues:**
- Previous similar incidents: `[reference]`
- Known limitations: `[describe]`
- Custom configurations: `[describe, redact secrets]`
**Priority:**
- [ ] Critical (service down, all users impacted)
- [ ] High (major feature broken, many users impacted)
- [ ] Medium (degraded performance, some users impacted)
- [ ] Low (minor issue, workaround available)
---
## Next Steps
**What we need from Support:**
- [ ] Root cause analysis
- [ ] Remediation steps
- [ ] Configuration guidance
- [ ] Performance optimization
- [ ] Other: `[describe]`
**Our availability:**
- Timezone: `[timezone]`
- Best time to contact: `[time range]`
- Escalation contact: `[name/email]`
---
**Template Version:** 1.0
**Last Updated:** `[date]`
**Note:** Always redact secrets, API keys, passwords, and connection strings before sharing. Use `[REDACTED]` or similar markers.
+37
View File
@@ -35,3 +35,40 @@ VALUES_FILE="./helm/langsmith-values/values.aws-demo.yaml"
ARTIFACTS_DIR="./artifacts"
LOG_LEVEL="info" # info|debug
DRY_RUN="true" # true by default; notebooks should flip this explicitly when applying
# ===== OIDC SSO Configuration (Module 2) =====
# Required: Get these values from your IdP team
# LangSmith domain (must match your ingress domain)
LANGSMITH_DOMAIN="langsmith.example.com"
# OIDC Configuration (required)
OIDC_ISSUER="https://your-org.okta.com/oauth2/default" # IdP issuer URL
OIDC_CLIENT_ID="your-client-id" # OAuth2 client ID (public)
OIDC_CLIENT_SECRET="your-client-secret" # OAuth2 client secret (store in K8s secret, never commit)
OIDC_REDIRECT_URI="https://langsmith.example.com/auth/callback" # Must match EXACTLY in IdP whitelist
# OIDC Scopes (optional, defaults shown)
OIDC_SCOPES="openid,email,profile,groups" # Include 'groups' for group-based role mapping
# Claim Mappings (optional, defaults shown)
OIDC_EMAIL_CLAIM="email" # Claim name for user email (required)
OIDC_NAME_CLAIM="name" # Claim name for user display name (optional)
OIDC_GROUPS_CLAIM="groups" # Claim name for group membership (optional, for role mapping)
# ===== SAML SSO Configuration (Module 2 - Alternative) =====
# Use SAML if your IdP doesn't support OIDC or enterprise policy requires SAML
# SAML_METADATA_URL="https://your-idp.com/saml/metadata" # Preferred: metadata URL
# SAML_METADATA_FILE="/path/to/metadata.xml" # Alternative: metadata file path
# SAML_ENTITY_ID="https://langsmith.example.com" # Optional: entity ID
# SAML_EMAIL_ATTRIBUTE="email" # Optional: email attribute name
# SAML_NAME_ATTRIBUTE="name" # Optional: name attribute name
# SAML_GROUPS_ATTRIBUTE="groups" # Optional: groups attribute name
# ===== Notes =====
# 1. OIDC_CLIENT_SECRET should be stored in Kubernetes secret, not in this file
# 2. Redirect URI must match EXACTLY (case, trailing slashes, protocol)
# 3. IdP team must whitelist the redirect URI
# 4. For production, use HTTPS for all URLs
# 5. See docs/modules/module-2.md for complete configuration guide
+1 -1
View File
@@ -15,7 +15,7 @@ AWS_REGION="us-east-1"
AWS_ACCOUNT_ID=""
# Naming (used by notebooks for display + validation)
CLUSTER_NAME="langsmith-workshop"
#CLUSTER_NAME=""
# Local repo paths (absolute is safest)
TERRAFORM_REPO_DIR="$HOME/src/langchain-ai/terraform"
-589
View File
@@ -1,589 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 1: Preflight Checks\n",
"\n",
"## Overview\n",
"\n",
"This notebook validates your environment before deploying LangSmith. Most self-hosted failures occur **before** users ever touch the product due to:\n",
"\n",
"- Mis-sized clusters\n",
"- Unsupported ingress setups\n",
"- In-cluster databases used past their limits\n",
"- Missing storage primitives (blob, PVs)\n",
"\n",
"This preflight ensures you start from a **supported baseline**.\n",
"\n",
"## What We'll Check\n",
"\n",
"1. ✅ Tooling validation (cloud CLI, terraform, kubectl, helm, jq)\n",
"2. ✅ Cloud provider credentials & region sanity check\n",
"3. ✅ Cluster capacity expectations\n",
"4. ✅ Storage prerequisites (CSI drivers, StorageClasses)\n",
"5. ✅ Blob storage requirement (cloud object storage)\n",
"\n",
"**Estimated time:** 20-30 minutes\n",
"\n",
"**Supported Cloud Providers:** AWS, Azure (GCP coming soon)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"# Find the notebooks directory by looking for the shared folder\n",
"possible_paths = [\n",
" Path.cwd().parent, # If cwd is module-1, go up one level to notebooks\n",
" Path.cwd(), # If cwd is already notebooks\n",
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" # Fallback: try workspace root\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap: loads env, checks tools, validates AWS, creates artifacts dir\n",
"bootstrap_info = bootstrap()\n",
"print(f\"\\nBootstrap complete! Artifacts directory: {bootstrap_info['artifacts_dir']}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cloud Provider Account & Region Validation\n",
"\n",
"Verify you're using the correct cloud provider account/subscription and region. This is critical for avoiding accidental deployments to production or wrong regions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"from shared._cloud_helpers import (\n",
" get_cloud_provider,\n",
" get_region,\n",
" get_identity,\n",
" assert_account,\n",
")\n",
"from shared._validation import require_env, print_config, ok, warn\n",
"\n",
"# Get cloud configuration\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"provider_display = provider.upper()\n",
"print(f\"### Current {provider_display} Session\")\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
" print(f\"Account ID: {identity['Account']}\")\n",
" print(f\"User ARN: {identity['Arn']}\")\n",
" account_var = \"AWS_ACCOUNT_ID\"\n",
"elif provider == \"azure\":\n",
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"\")\n",
" subscription_name = identity.get(\"SubscriptionName\", \"\")\n",
" print(f\"Subscription ID: {subscription_id}\")\n",
" print(f\"Subscription Name: {subscription_name}\")\n",
" account_var = \"AZURE_SUBSCRIPTION_ID\"\n",
"else:\n",
" account_var = None\n",
"\n",
"# Optional: Validate against expected account/subscription\n",
"if account_var:\n",
" expected_account = os.environ.get(account_var, \"\").strip()\n",
" if expected_account:\n",
" assert_account(expected_account)\n",
" else:\n",
" warn(f\"{account_var} not set in environment - skipping account validation\")\n",
" print(f\"💡 Tip: Set {account_var} in your .env file to add a guardrail against wrong account deployments\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Required Environment Variables\n",
"\n",
"Verify that all required configuration is present. These values will be used throughout the deployment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check required environment variables\n",
"from shared._cloud_helpers import get_cloud_provider\n",
"\n",
"provider = get_cloud_provider()\n",
"\n",
"# Base required vars (cloud-agnostic)\n",
"required_vars = [\n",
" \"WORKSHOP_NAME\",\n",
" \"NAMESPACE\",\n",
" \"CLUSTER_NAME\",\n",
" \"TERRAFORM_DIR\",\n",
" \"HELM_RELEASE\",\n",
" \"HELM_NAMESPACE\",\n",
" \"HELM_CHART_REF\",\n",
"]\n",
"\n",
"# Add cloud-specific required vars\n",
"if provider == \"aws\":\n",
" required_vars.append(\"AWS_REGION\")\n",
"elif provider == \"azure\":\n",
" required_vars.append(\"AZURE_LOCATION\")\n",
"\n",
"config = require_env(*required_vars)\n",
"\n",
"# Optional but recommended (cloud-specific)\n",
"optional_vars = {}\n",
"if provider == \"aws\":\n",
" optional_vars = {\n",
" \"AWS_PROFILE\": os.environ.get(\"AWS_PROFILE\", \"\"),\n",
" \"AWS_ACCOUNT_ID\": os.environ.get(\"AWS_ACCOUNT_ID\", \"\"),\n",
" \"VALUES_FILE\": os.environ.get(\"VALUES_FILE\", \"\"),\n",
" }\n",
"elif provider == \"azure\":\n",
" optional_vars = {\n",
" \"AZURE_SUBSCRIPTION_ID\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\", \"\"),\n",
" \"AZURE_RESOURCE_GROUP\": os.environ.get(\"AZURE_RESOURCE_GROUP\", \"\"),\n",
" \"VALUES_FILE\": os.environ.get(\"VALUES_FILE\", \"\"),\n",
" }\n",
"\n",
"print(\"\\n### Configuration Summary\")\n",
"print(f\"Cloud Provider: {provider.upper()}\")\n",
"print_config(config, redact_keys={\"AWS_PROFILE\"})\n",
"print(\"\\n### Optional Configuration\")\n",
"for k, v in optional_vars.items():\n",
" if v:\n",
" print(f\"- {k}: {v}\")\n",
" else:\n",
" print(f\"- {k}: (not set)\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cluster Capacity Expectations\n",
"\n",
"LangSmith requires adequate cluster resources. Before deploying, understand what you'll need:\n",
"\n",
"- **Minimum:** 3 nodes, 4 vCPU, 16GB RAM each (for development/testing)\n",
"- **Recommended:** 3 nodes, 8 vCPU, 32GB RAM each (for production workloads)\n",
"- **Storage:** EBS CSI driver required for ClickHouse PVCs\n",
"\n",
"Let's check if a cluster already exists and validate its configuration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._aws_helpers import eks_cluster_exists\n",
"from shared._shell import run\n",
"\n",
"cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
"region = aws_region()\n",
"\n",
"print(f\"### Checking EKS Cluster: {cluster_name}\")\n",
"print(f\"Region: {region}\\n\")\n",
"\n",
"if eks_cluster_exists(cluster_name):\n",
" ok(f\"Cluster '{cluster_name}' exists\")\n",
" \n",
" # Get cluster details\n",
" result = run(\n",
" [\"aws\", \"eks\", \"describe-cluster\", \"--name\", cluster_name, \"--region\", region, \"--output\", \"json\"],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" cluster_info = json.loads(result.stdout)[\"cluster\"]\n",
" \n",
" print(f\"\\nCluster Status: {cluster_info['status']}\")\n",
" print(f\"Kubernetes Version: {cluster_info['version']}\")\n",
" print(f\"Platform Version: {cluster_info.get('platformVersion', 'N/A')}\")\n",
" \n",
" # Check node groups\n",
" print(\"\\n### Node Groups\")\n",
" ng_result = run(\n",
" [\"aws\", \"eks\", \"list-nodegroups\", \"--cluster-name\", cluster_name, \"--region\", region, \"--output\", \"json\"],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" nodegroups = json.loads(ng_result.stdout).get(\"nodegroups\", [])\n",
" \n",
" if nodegroups:\n",
" for ng in nodegroups:\n",
" ng_detail = run(\n",
" [\"aws\", \"eks\", \"describe-nodegroup\", \"--cluster-name\", cluster_name, \n",
" \"--nodegroup-name\", ng, \"--region\", region, \"--output\", \"json\"],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" ng_info = json.loads(ng_detail.stdout)[\"nodegroup\"]\n",
" scaling = ng_info.get(\"scalingConfig\", {})\n",
" print(f\"\\n Node Group: {ng}\")\n",
" print(f\" Status: {ng_info['status']}\")\n",
" print(f\" Desired: {scaling.get('desiredSize', 'N/A')}\")\n",
" print(f\" Min: {scaling.get('minSize', 'N/A')}\")\n",
" print(f\" Max: {scaling.get('maxSize', 'N/A')}\")\n",
" print(f\" Instance Types: {', '.join(ng_info.get('instanceTypes', []))}\")\n",
" else:\n",
" warn(\"No node groups found\")\n",
" print(\"💡 You'll need to create node groups when deploying with Terraform\")\n",
"else:\n",
" warn(f\"Cluster '{cluster_name}' does not exist yet\")\n",
" print(\"💡 This is expected if you haven't run Terraform yet. Proceed to notebook 02_terraform_apply.ipynb\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Storage Prerequisites\n",
"\n",
"LangSmith requires persistent storage for ClickHouse. The cloud storage CSI driver must be installed and StorageClasses must be configured.\n",
"\n",
"**Why this matters:** Without the appropriate CSI driver, ClickHouse PVCs will remain in `Pending` state forever.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check if kubectl is configured for the cluster\n",
"from shared._cloud_helpers import (\n",
" get_cloud_provider,\n",
" get_region,\n",
" configure_kubectl,\n",
" get_storage_driver_name,\n",
")\n",
"\n",
"provider = get_cloud_provider()\n",
"cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
"region = get_region()\n",
"storage_driver = get_storage_driver_name()\n",
"\n",
"k8s_service = \"EKS\" if provider == \"aws\" else \"AKS\" if provider == \"azure\" else \"Kubernetes\"\n",
"print(f\"### Configuring kubectl for {k8s_service} cluster\")\n",
"try:\n",
" # Configure kubectl (cloud-agnostic)\n",
" configure_kubectl(cluster_name, region)\n",
" ok(\"kubectl configured for cluster\")\n",
" \n",
" # Check CSI driver (cloud-specific labels)\n",
" print(f\"\\n### Checking {storage_driver} Driver\")\n",
" \n",
" if provider == \"aws\":\n",
" driver_label = \"app=ebs-csi-controller\"\n",
" driver_name = \"EBS CSI\"\n",
" elif provider == \"azure\":\n",
" driver_label = \"app=csi-azuredisk-controller\"\n",
" driver_name = \"Azure Disk CSI\"\n",
" else:\n",
" driver_label = None\n",
" driver_name = \"Storage CSI\"\n",
" \n",
" if driver_label:\n",
" result = run(\n",
" [\"kubectl\", \"get\", \"daemonset\", \"-n\", \"kube-system\", \"-l\", driver_label, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0 and result.stdout.strip():\n",
" ds_info = json.loads(result.stdout)\n",
" if ds_info.get(\"items\"):\n",
" ok(f\"{driver_name} driver is installed\")\n",
" print(f\" DaemonSet: {ds_info['items'][0]['metadata']['name']}\")\n",
" else:\n",
" warn(f\"{driver_name} driver not found\")\n",
" print(f\"💡 {driver_name} driver must be installed before deploying LangSmith\")\n",
" print(\" The Terraform module should handle this, but verify after deployment\")\n",
" else:\n",
" warn(f\"{driver_name} driver not found\")\n",
" print(f\"💡 {driver_name} driver must be installed before deploying LangSmith\")\n",
" \n",
" # Check StorageClasses\n",
" print(\"\\n### Checking StorageClasses\")\n",
" result = run(\n",
" [\"kubectl\", \"get\", \"storageclass\", \"-o\", \"json\"],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" sc_list = json.loads(result.stdout)\n",
" \n",
" # Find cloud-specific storage classes\n",
" if provider == \"aws\":\n",
" storage_scs = [sc for sc in sc_list.get(\"items\", []) if \"ebs\" in sc[\"metadata\"][\"name\"].lower() or \n",
" sc.get(\"provisioner\", \"\").endswith(\"ebs.csi.aws.com\")]\n",
" elif provider == \"azure\":\n",
" storage_scs = [sc for sc in sc_list.get(\"items\", []) if \"disk\" in sc[\"metadata\"][\"name\"].lower() or \n",
" sc.get(\"provisioner\", \"\").endswith(\"disk.csi.azure.com\")]\n",
" else:\n",
" storage_scs = []\n",
" \n",
" if storage_scs:\n",
" ok(f\"Found {len(storage_scs)} {storage_driver} StorageClass(es):\")\n",
" for sc in storage_scs:\n",
" name = sc[\"metadata\"][\"name\"]\n",
" default = sc.get(\"metadata\", {}).get(\"annotations\", {}).get(\"storageclass.kubernetes.io/is-default-class\", \"false\")\n",
" print(f\" - {name} (default: {default})\")\n",
" else:\n",
" warn(f\"No {storage_driver} StorageClasses found\")\n",
" print(f\"💡 At least one {storage_driver} StorageClass is required for ClickHouse PVCs\")\n",
" \n",
"except Exception as e:\n",
" warn(f\"Could not check storage prerequisites: {e}\")\n",
" print(\"💡 This is expected if the cluster doesn't exist yet\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Blob Storage Requirement\n",
"\n",
"**Critical:** LangSmith requires cloud object storage (S3, Blob Storage, etc.) for blob storage in production. Inline trace payloads will explode ClickHouse if blob storage is not configured.\n",
"\n",
"Let's verify access to your cloud provider's object storage service and check if a storage account/bucket exists or needs to be created.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._cloud_helpers import (\n",
" get_cloud_provider,\n",
" get_region,\n",
" get_blob_storage_service_name,\n",
" verify_blob_storage_access,\n",
")\n",
"from shared._shell import run\n",
"import json\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"blob_service = get_blob_storage_service_name()\n",
"\n",
"print(f\"### {blob_service} Access Check\")\n",
"print(f\"Cloud Provider: {provider.upper()}\")\n",
"print(f\"Region: {region}\\n\")\n",
"\n",
"# Test blob storage access\n",
"try:\n",
" if provider == \"aws\":\n",
" result = run(\n",
" [\"aws\", \"s3\", \"ls\", \"--region\", region],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" ok(f\"{blob_service} access verified\")\n",
" \n",
" # List buckets\n",
" buckets_result = run(\n",
" [\"aws\", \"s3api\", \"list-buckets\", \"--output\", \"json\"],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" buckets = json.loads(buckets_result.stdout).get(\"Buckets\", [])\n",
" \n",
" print(f\"\\nFound {len(buckets)} S3 bucket(s):\")\n",
" for bucket in buckets[:10]: # Show first 10\n",
" print(f\" - {bucket['Name']} (created: {bucket['CreationDate']})\")\n",
" \n",
" if len(buckets) > 10:\n",
" print(f\" ... and {len(buckets) - 10} more\")\n",
" \n",
" elif provider == \"azure\":\n",
" result = run(\n",
" [\"az\", \"storage\", \"account\", \"list\", \"--output\", \"json\"],\n",
" check=True,\n",
" stream=False\n",
" )\n",
" ok(f\"{blob_service} access verified\")\n",
" \n",
" # List storage accounts\n",
" accounts = json.loads(result.stdout)\n",
" \n",
" print(f\"\\nFound {len(accounts)} Storage Account(s):\")\n",
" for account in accounts[:10]: # Show first 10\n",
" name = account.get(\"name\", \"N/A\")\n",
" location = account.get(\"location\", \"N/A\")\n",
" print(f\" - {name} (location: {location})\")\n",
" \n",
" if len(accounts) > 10:\n",
" print(f\" ... and {len(accounts) - 10} more\")\n",
" \n",
" print(f\"\\n💡 Note: The Terraform module should create a {blob_service} resource for LangSmith blob storage\")\n",
" print(\" Verify the resource exists after Terraform deployment\")\n",
" \n",
"except Exception as e:\n",
" warn(f\"{blob_service} access check failed: {e}\")\n",
" if provider == \"aws\":\n",
" print(\"💡 Ensure your AWS credentials have S3 permissions\")\n",
" elif provider == \"azure\":\n",
" print(\"💡 Ensure your Azure credentials have Storage Account permissions\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Terraform & Helm Repository Paths\n",
"\n",
"Verify that the Terraform and Helm repository paths are correctly configured and accessible.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from pathlib import Path\n",
"from shared._validation import ok, warn\n",
"\n",
"def expand_env_vars(path_str: str) -> str:\n",
" \"\"\"Expand environment variable references in a path string.\"\"\"\n",
" # Expand $VAR and ${VAR} references\n",
" def replace_var(match):\n",
" var_name = match.group(1) or match.group(2)\n",
" return os.environ.get(var_name, match.group(0))\n",
" \n",
" # Replace $VAR and ${VAR} patterns\n",
" path_str = re.sub(r'\\$\\{([^}]+)\\}|\\$([a-zA-Z_][a-zA-Z0-9_]*)', replace_var, path_str)\n",
" return path_str\n",
"\n",
"# Expand environment variables in paths (e.g., $TERRAFORM_REPO_DIR, $HELM_REPO_DIR, $HOME)\n",
"terraform_dir_str = expand_env_vars(os.environ[\"TERRAFORM_DIR\"])\n",
"terraform_dir = Path(terraform_dir_str).expanduser().resolve()\n",
"\n",
"helm_chart_ref_str = expand_env_vars(os.environ[\"HELM_CHART_REF\"])\n",
"helm_chart_ref = Path(helm_chart_ref_str).expanduser().resolve()\n",
"\n",
"print(\"### Repository Paths Check\\n\")\n",
"\n",
"# Check Terraform directory\n",
"print(f\"Terraform Directory: {terraform_dir}\")\n",
"if terraform_dir.exists():\n",
" ok(f\"Terraform directory exists\")\n",
" \n",
" # Check for main.tf or similar\n",
" tf_files = list(terraform_dir.glob(\"*.tf\"))\n",
" if tf_files:\n",
" print(f\" Found {len(tf_files)} Terraform file(s)\")\n",
" else:\n",
" warn(\"No .tf files found in Terraform directory\")\n",
" print(\"💡 Ensure you're pointing to the correct Terraform module path\")\n",
"else:\n",
" warn(f\"Terraform directory does not exist: {terraform_dir}\")\n",
" print(\"💡 Update TERRAFORM_DIR in your .env file to point to the langchain-ai/terraform repo\")\n",
"\n",
"# Check Helm chart\n",
"print(f\"\\nHelm Chart Reference: {helm_chart_ref}\")\n",
"if helm_chart_ref.exists():\n",
" ok(f\"Helm chart path exists\")\n",
" \n",
" # Check for Chart.yaml\n",
" chart_yaml = helm_chart_ref / \"Chart.yaml\"\n",
" if chart_yaml.exists():\n",
" print(f\" Found Chart.yaml\")\n",
" else:\n",
" warn(\"Chart.yaml not found\")\n",
" print(\"💡 Ensure you're pointing to the correct Helm chart path\")\n",
"else:\n",
" warn(f\"Helm chart path does not exist: {helm_chart_ref}\")\n",
" print(\"💡 Update HELM_CHART_REF in your .env file to point to the langchain-ai/helm chart\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preflight Summary\n",
"\n",
"Review the checklist below. All items should be ✅ before proceeding to Terraform deployment.\n",
"\n",
"### ✅ Checklist\n",
"\n",
"- [ ] All required tools installed (cloud CLI, terraform, kubectl, helm, jq)\n",
"- [ ] Cloud provider credentials valid and correct account/subscription/region\n",
"- [ ] Required environment variables set\n",
"- [ ] Terraform directory path correct\n",
"- [ ] Helm chart path correct\n",
"- [ ] Blob storage access verified (S3/Blob Storage)\n",
"- [ ] (If cluster exists) Storage CSI driver installed\n",
"- [ ] (If cluster exists) StorageClasses configured\n",
"\n",
"### Next Steps\n",
"\n",
"If all checks pass, proceed to **02_terraform_apply.ipynb** to deploy the infrastructure.\n",
"\n",
"If any checks failed, review the warnings above and fix the issues before continuing.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
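The path-expansion helper in the notebook above can be exercised on its own. The following is a minimal sketch rather than the notebook's exact code: it adds a hypothetical `env` parameter so the function can be tried without touching `os.environ`, and the example paths are made up. As in the notebook, unknown variables are left untouched instead of being replaced with an empty string.

```python
import re
from typing import Mapping

# Matches ${VAR} (group 1) or $VAR (group 2).
_VAR_PATTERN = re.compile(r"\$\{([^}]+)\}|\$([a-zA-Z_][a-zA-Z0-9_]*)")


def expand_env_vars(path_str: str, env: Mapping[str, str]) -> str:
    """Expand $VAR and ${VAR} references; leave unknown variables as written."""

    def replace_var(match: re.Match) -> str:
        var_name = match.group(1) or match.group(2)
        return env.get(var_name, match.group(0))

    return _VAR_PATTERN.sub(replace_var, path_str)


# Hypothetical environment for illustration only.
example_env = {
    "HOME": "/home/student",
    "TERRAFORM_REPO_DIR": "/home/student/src/terraform",
}
print(expand_env_vars("${TERRAFORM_REPO_DIR}/modules/aws", example_env))
print(expand_env_vars("$HOME/src/$UNSET_VAR", example_env))
```

Leaving unresolved references in place makes a missing variable visible in the resulting path and in the "directory does not exist" warning, rather than silently producing a shorter path.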
+90 -2
View File
@@ -129,6 +129,77 @@
" print(f\"💡 Tip: Set {account_var} in your .env file to add a guardrail against wrong account deployments\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Workshop Identifier Setup\n",
"\n",
"To ensure unique resource names and enable idempotent deployments, we need a unique identifier for your workshop deployment. This identifier will be used for all Terraform resources.\n",
"\n",
"**We'll use your email address** (hashed for privacy) to create a deterministic identifier that:\n",
"- ✅ Stays the same across notebook runs (idempotent)\n",
"- ✅ Is unique per student\n",
"- ✅ Works with the date-based prefix for resource naming\n",
"\n",
"Enter your email address below. It will be hashed and used to generate your unique workshop identifier.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate deterministic workshop identifier from email\n",
"import hashlib\n",
"import json\n",
"from datetime import date\n",
"from pathlib import Path\n",
"\n",
"print(\"### Workshop Identifier Setup\\n\")\n",
"print(\"Enter your email address to generate a unique, deterministic identifier for your deployment.\\n\")\n",
"print(\"This identifier will be used for all Terraform resources and ensures:\")\n",
"print(\" - Same email = same identifier (idempotent)\")\n",
"print(\" - Different emails = different identifiers (unique)\")\n",
"print(\" - No additional environment variables needed\\n\")\n",
"\n",
"# Prompt for email (using input() - works in Jupyter)\n",
"email = input(\"Enter your email address: \").strip().lower()\n",
"\n",
"if not email or \"@\" not in email:\n",
" raise ValueError(\"Invalid email address. Please enter a valid email.\")\n",
"\n",
"# Hash email for privacy and determinism\n",
"email_hash = hashlib.md5(email.encode()).hexdigest()[:6]\n",
"\n",
"# Build identifier: -workshop-YYYYMMDD-<hash>\n",
"today = date.today()\n",
"date_str = today.strftime('%Y%m%d')\n",
"workshop_identifier = f\"-workshop-{date_str}-{email_hash}\"\n",
"\n",
"# Save to artifacts directory for use in Terraform notebook\n",
"identifier_file = Path(bootstrap_info['artifacts_dir']) / \"workshop_identifier.json\"\n",
"identifier_data = {\n",
" \"email_hash\": email_hash,\n",
" \"identifier\": workshop_identifier,\n",
" \"date\": date_str,\n",
" \"created_at\": date.today().isoformat()\n",
"}\n",
"\n",
"with open(identifier_file, 'w') as f:\n",
" json.dump(identifier_data, f, indent=2)\n",
"\n",
"print(f\"\\n✅ Workshop identifier generated:\")\n",
"print(f\" Identifier: {workshop_identifier}\")\n",
"print(f\" Date component: {date_str}\")\n",
"print(f\" Hash (from email): {email_hash}\")\n",
"print(f\"\\n💡 This identifier will be used for all Terraform resources\")\n",
"print(f\" Saved to: {identifier_file}\")\n",
"print(f\"\\n⚠️ IMPORTANT: Use the same email address if you re-run this notebook\")\n",
"print(f\" to ensure Terraform can manage existing resources.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -153,7 +224,6 @@
"required_vars = [\n",
" \"WORKSHOP_NAME\",\n",
" \"NAMESPACE\",\n",
" \"CLUSTER_NAME\",\n",
" \"TERRAFORM_DIR\",\n",
" \"HELM_RELEASE\",\n",
" \"HELM_NAMESPACE\",\n",
@@ -222,9 +292,27 @@
" get_kubernetes_service_name,\n",
")\n",
"from shared._shell import run\n",
"from shared._validation import ok, warn, require_env\n",
"from pathlib import Path\n",
"import json\n",
"\n",
"# Load workshop identifier if it exists (from identifier setup cell)\n",
"identifier_file = Path(bootstrap_info['artifacts_dir']) / \"workshop_identifier.json\"\n",
"if identifier_file.exists():\n",
" with open(identifier_file) as f:\n",
" identifier_data = json.load(f)\n",
" workshop_identifier = identifier_data[\"identifier\"]\n",
" # Compute expected cluster name: langsmith-eks${identifier}\n",
" cluster_name = f\"langsmith-eks{workshop_identifier}\"\n",
" print(f\"💡 Using cluster name from workshop identifier: {cluster_name}\\n\")\n",
"else:\n",
" # Fallback to CLUSTER_NAME env var if identifier not set yet\n",
" config = require_env(\"CLUSTER_NAME\")\n",
" cluster_name = config[\"CLUSTER_NAME\"]\n",
" warn(\"Workshop identifier not found - using CLUSTER_NAME from environment\")\n",
" print(\"💡 Run the 'Workshop Identifier Setup' cell above to generate a unique identifier\\n\")\n",
"\n",
"provider = get_cloud_provider()\n",
"cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
"region = get_region()\n",
"k8s_service = get_kubernetes_service_name()\n",
"\n",
+274 -4
View File
@@ -97,6 +97,7 @@
"source": [
"import os\n",
"import re\n",
"import json\n",
"from pathlib import Path\n",
"from shared._validation import require_env, ok, warn, fail\n",
"from shared._shell import run\n",
@@ -131,15 +132,41 @@
"terraform_dir_str = expand_env_vars(config[\"TERRAFORM_DIR\"])\n",
"terraform_dir = Path(terraform_dir_str).expanduser().resolve()\n",
"\n",
"cluster_name = config[\"CLUSTER_NAME\"]\n",
"# Load workshop identifier from preflight notebook\n",
"identifier_file = artifacts_dir / \"workshop_identifier.json\"\n",
"if not identifier_file.exists():\n",
" fail(f\"Workshop identifier not found: {identifier_file}\")\n",
" print(\"\\n💡 To fix this:\")\n",
" print(\" 1. Run the preflight notebook (01_preflight.ipynb) first\")\n",
" print(\" 2. Complete the 'Workshop Identifier Setup' cell\")\n",
" print(\" 3. Then return to this notebook\")\n",
" raise RuntimeError(f\"Workshop identifier not found. Please run 01_preflight.ipynb first.\")\n",
"\n",
"with open(identifier_file) as f:\n",
" identifier_data = json.load(f)\n",
"\n",
"workshop_identifier = identifier_data[\"identifier\"]\n",
"print(f\"✅ Loaded workshop identifier: {workshop_identifier}\")\n",
"\n",
"# Compute expected cluster name for validation/display\n",
"# Terraform computes: cluster_name = \"langsmith-eks${local.identifier}\"\n",
"cluster_name = f\"langsmith-eks{workshop_identifier}\"\n",
"\n",
"region = config[region_var]\n",
"workshop_name = config[\"WORKSHOP_NAME\"]\n",
"\n",
"print(\"### Terraform Configuration\")\n",
"print(\"\\n### Terraform Configuration\")\n",
"print(f\"Terraform Directory: {terraform_dir}\")\n",
"print(f\"Cluster Name: {cluster_name}\")\n",
"print(f\"Workshop Identifier: {workshop_identifier}\")\n",
"print(f\"Expected Cluster Name: {cluster_name}\")\n",
"print(f\"Region: {region}\")\n",
"print(f\"Workshop Name: {workshop_name}\\n\")\n",
"print(f\"Workshop Name: {workshop_name}\")\n",
"print(f\"\\n💡 Terraform will use this identifier for all resource names:\")\n",
"print(f\" Cluster: langsmith-eks{workshop_identifier}\")\n",
"print(f\" Redis: langsmith-redis{workshop_identifier}\")\n",
"print(f\" S3: langsmith-s3{workshop_identifier}\")\n",
"print(f\" Postgres: langsmith-postgres{workshop_identifier}\")\n",
"print(f\" VPC: langsmith-vpc{workshop_identifier}\\n\")\n",
"\n",
"if not terraform_dir.exists():\n",
" fail(f\"Terraform directory does not exist: {terraform_dir}\")\n",
@@ -371,6 +398,8 @@
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"\n",
"# Create terraform plan\n",
"plan_file = artifacts_dir / \"terraform-plan.txt\"\n",
"\n",
@@ -383,7 +412,22 @@
"postgres_username = os.environ.get(\"POSTGRES_USERNAME\", \"\").strip()\n",
"postgres_password = os.environ.get(\"POSTGRES_PASSWORD\", \"\").strip()\n",
"\n",
"if not postgres_username:\n",
" print(\"Please provide a PostgreSQL username: \")\n",
" postgres_username = input().strip()\n",
"\n",
"if not postgres_password:\n",
" print(\"Please provide a PostgreSQL password: \")\n",
" postgres_password = getpass.getpass().strip()\n",
"\n",
"print(\"### Terraform Variables\\n\")\n",
"\n",
"# Pass workshop identifier to Terraform\n",
"# This is the key variable that controls all resource naming\n",
"terraform_vars.extend([\"-var\", f\"identifier={workshop_identifier}\"])\n",
"print(f\"✅ IDENTIFIER: {workshop_identifier}\")\n",
"print(f\" This will be used for all resource names (cluster, redis, s3, postgres, vpc)\\n\")\n",
"\n",
"missing_vars = []\n",
"\n",
"if postgres_username:\n",
@@ -439,6 +483,232 @@
" print(\"💡 Review the errors above before proceeding\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-Apply Safety Check\n",
"\n",
"**⚠️ CRITICAL:** Before applying Terraform, verify that resources don't already exist. This prevents accidentally modifying or overwriting existing infrastructure.\n",
"\n",
"This check will:\n",
"- Verify the cluster doesn't already exist (or warn if it does)\n",
"- Check for existing RDS/Redis/S3 resources that might conflict\n",
"- Require explicit confirmation if resources are found\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pre-apply safety check: Verify resources don't already exist\n",
"from shared._cloud_helpers import (\n",
" get_cloud_provider,\n",
" get_region,\n",
" cluster_exists,\n",
" get_kubernetes_service_name,\n",
" get_database_service_name,\n",
" get_cache_service_name,\n",
" get_blob_storage_service_name,\n",
")\n",
"from shared._validation import ok, warn, fail\n",
"from shared._shell import run\n",
"import json\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"k8s_service = get_kubernetes_service_name()\n",
"\n",
"print(\"### Pre-Apply Resource Existence Check\\n\")\n",
"print(\"Checking for existing resources that might conflict...\\n\")\n",
"\n",
"existing_resources = []\n",
"warnings = []\n",
"\n",
"# Check if cluster already exists\n",
"if cluster_exists(cluster_name):\n",
" existing_resources.append(f\"{k8s_service} cluster: {cluster_name}\")\n",
" warn(f\"⚠️ Cluster '{cluster_name}' already exists!\")\n",
" print(f\" If you proceed with Terraform apply, Terraform may attempt to:\")\n",
" print(f\" - Import the existing cluster into state, OR\")\n",
" print(f\" - Modify the existing cluster configuration\")\n",
" print(f\" This could cause unexpected changes to your existing infrastructure.\\n\")\n",
" print(f\" 💡 If this is intentional, ensure your Terraform configuration matches the existing cluster\")\n",
" print(f\" 💡 If this is NOT intentional, STOP and update CLUSTER_NAME in your .env file\")\n",
"else:\n",
" ok(f\"Cluster '{cluster_name}' does not exist (safe to create)\")\n",
"\n",
"# Check for existing RDS instances (AWS) or PostgreSQL servers (Azure)\n",
"db_service = get_database_service_name()\n",
"print(f\"\\n### Checking for Existing {db_service} Resources\\n\")\n",
"\n",
"if provider == \"aws\":\n",
" # Check for RDS instances that might match our naming pattern\n",
" # We'll check for instances in the same region\n",
" try:\n",
" result = run(\n",
" [\"aws\", \"rds\", \"describe-db-instances\", \"--region\", region, \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" rds_instances = json.loads(result.stdout).get(\"DBInstances\", [])\n",
" # Check if any instance name might conflict (exact match or similar pattern)\n",
" # Terraform typically uses cluster_name or workshop_name in resource names\n",
" for instance in rds_instances:\n",
" db_id = instance.get(\"DBInstanceIdentifier\", \"\")\n",
" # Check if instance name contains cluster_name or workshop_name\n",
" if cluster_name.lower() in db_id.lower() or workshop_name.lower() in db_id.lower():\n",
" existing_resources.append(f\"RDS instance: {db_id}\")\n",
" warnings.append(f\"Found RDS instance '{db_id}' that may conflict with Terraform resources\")\n",
" except Exception as e:\n",
" warn(f\"Could not check for RDS instances: {e}\")\n",
" print(\" 💡 This is OK - proceeding with caution\")\n",
"\n",
"elif provider == \"azure\":\n",
" # Check for PostgreSQL servers\n",
" try:\n",
" result = run(\n",
" [\"az\", \"postgres\", \"server\", \"list\", \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" postgres_servers = json.loads(result.stdout)\n",
" for server in postgres_servers:\n",
" server_name = server.get(\"name\", \"\")\n",
" server_location = server.get(\"location\", \"\")\n",
" # Check if server is in same location and name might conflict\n",
" if server_location.lower() == region.lower():\n",
" if cluster_name.lower() in server_name.lower() or workshop_name.lower() in server_name.lower():\n",
" existing_resources.append(f\"PostgreSQL server: {server_name}\")\n",
" warnings.append(f\"Found PostgreSQL server '{server_name}' that may conflict with Terraform resources\")\n",
" except Exception as e:\n",
" warn(f\"Could not check for PostgreSQL servers: {e}\")\n",
" print(\" 💡 This is OK - proceeding with caution\")\n",
"\n",
"# Check for existing Redis/ElastiCache clusters\n",
"cache_service = get_cache_service_name()\n",
"print(f\"\\n### Checking for Existing {cache_service} Resources\\n\")\n",
"\n",
"if provider == \"aws\":\n",
" # Check for ElastiCache clusters\n",
" try:\n",
" result = run(\n",
" [\"aws\", \"elasticache\", \"describe-cache-clusters\", \"--region\", region, \"--output\", \"json\", \"--show-cache-node-info\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" cache_clusters = json.loads(result.stdout).get(\"CacheClusters\", [])\n",
" for cluster in cache_clusters:\n",
" cluster_id = cluster.get(\"CacheClusterId\", \"\")\n",
" if cluster_name.lower() in cluster_id.lower() or workshop_name.lower() in cluster_id.lower():\n",
" existing_resources.append(f\"ElastiCache cluster: {cluster_id}\")\n",
" warnings.append(f\"Found ElastiCache cluster '{cluster_id}' that may conflict with Terraform resources\")\n",
" except Exception as e:\n",
" warn(f\"Could not check for ElastiCache clusters: {e}\")\n",
" print(\" 💡 This is OK - proceeding with caution\")\n",
"\n",
"elif provider == \"azure\":\n",
" # Check for Redis caches\n",
" try:\n",
" result = run(\n",
" [\"az\", \"redis\", \"list\", \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" redis_caches = json.loads(result.stdout)\n",
" for cache in redis_caches:\n",
" cache_name = cache.get(\"name\", \"\")\n",
" cache_location = cache.get(\"location\", \"\")\n",
" if cache_location.lower() == region.lower():\n",
" if cluster_name.lower() in cache_name.lower() or workshop_name.lower() in cache_name.lower():\n",
" existing_resources.append(f\"Redis cache: {cache_name}\")\n",
" warnings.append(f\"Found Redis cache '{cache_name}' that may conflict with Terraform resources\")\n",
" except Exception as e:\n",
" warn(f\"Could not check for Redis caches: {e}\")\n",
" print(\" 💡 This is OK - proceeding with caution\")\n",
"\n",
"# Check for existing S3 buckets (AWS) or Storage Accounts (Azure)\n",
"blob_service = get_blob_storage_service_name()\n",
"print(f\"\\n### Checking for Existing {blob_service} Resources\\n\")\n",
"\n",
"if provider == \"aws\":\n",
" # Check for S3 buckets\n",
" try:\n",
" result = run(\n",
" [\"aws\", \"s3api\", \"list-buckets\", \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" buckets = json.loads(result.stdout).get(\"Buckets\", [])\n",
" # Check if any bucket name might conflict\n",
" for bucket in buckets:\n",
" bucket_name = bucket.get(\"Name\", \"\")\n",
" if cluster_name.lower() in bucket_name.lower() or workshop_name.lower() in bucket_name.lower():\n",
" existing_resources.append(f\"S3 bucket: {bucket_name}\")\n",
" warnings.append(f\"Found S3 bucket '{bucket_name}' that may conflict with Terraform resources\")\n",
" except Exception as e:\n",
" warn(f\"Could not check for S3 buckets: {e}\")\n",
" print(\" 💡 This is OK - proceeding with caution\")\n",
"\n",
"elif provider == \"azure\":\n",
" # Check for Storage Accounts\n",
" try:\n",
" result = run(\n",
" [\"az\", \"storage\", \"account\", \"list\", \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" storage_accounts = json.loads(result.stdout)\n",
" for account in storage_accounts:\n",
" account_name = account.get(\"name\", \"\")\n",
" account_location = account.get(\"location\", \"\")\n",
" if account_location.lower() == region.lower():\n",
" if cluster_name.lower() in account_name.lower() or workshop_name.lower() in account_name.lower():\n",
" existing_resources.append(f\"Storage account: {account_name}\")\n",
" warnings.append(f\"Found Storage account '{account_name}' that may conflict with Terraform resources\")\n",
" except Exception as e:\n",
" warn(f\"Could not check for Storage accounts: {e}\")\n",
" print(\" 💡 This is OK - proceeding with caution\")\n",
"\n",
"# Summary and decision\n",
"print(\"\\n\" + \"=\" * 60)\n",
"print(\"### Safety Check Summary\")\n",
"print(\"=\" * 60)\n",
"\n",
"if existing_resources:\n",
" fail(f\"Found {len(existing_resources)} existing resource(s) that may conflict:\")\n",
" for resource in existing_resources:\n",
" print(f\" ⚠️ {resource}\")\n",
" \n",
" print(\"\\n\" + \"=\" * 60)\n",
" print(\"⚠️ WARNING: Proceeding with Terraform apply may:\")\n",
" print(\" - Modify existing infrastructure\")\n",
" print(\" - Import existing resources into Terraform state\")\n",
" print(\" - Cause unexpected changes or conflicts\")\n",
" print(\"=\" * 60)\n",
" print(\"\\n💡 Recommendations:\")\n",
" print(\" 1. If these resources are from a previous deployment, that's OK\")\n",
" print(\" 2. If these resources are UNRELATED to this deployment:\")\n",
" print(f\" - Update CLUSTER_NAME or WORKSHOP_NAME in your .env file\")\n",
" print(f\" - Use different resource names to avoid conflicts\")\n",
" print(\" 3. Review the Terraform plan carefully before applying\")\n",
" print(\" 4. Consider using Terraform import if you want to manage existing resources\")\n",
" print(\"\\n⚠️ You must explicitly confirm you understand the risks before proceeding.\")\n",
" print(\" Review the plan output and ensure you're comfortable with the changes.\")\n",
"else:\n",
" ok(\"No conflicting resources found\")\n",
" print(\"✅ Safe to proceed with Terraform apply\")\n",
" print(\"💡 Still review the Terraform plan before applying to ensure it's correct\")\n"
]
},
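The conflict checks above apply the same case-insensitive substring test to every resource type. A minimal sketch of how that test could be factored into one helper follows; `find_name_conflicts` is a hypothetical name and is not part of the repository's shared modules. It also skips empty identifiers, because in Python an empty string is a substring of every string, so an unset `WORKSHOP_NAME` could otherwise flag every resource.

```python
def find_name_conflicts(resource_names, identifiers):
    """Return the resource names that contain any identifier, ignoring case.

    Hypothetical helper for illustration only. Empty identifiers are skipped
    because "" is a substring of every string.
    """
    needles = [ident.lower() for ident in identifiers if ident]
    return [
        name
        for name in resource_names
        if any(needle in name.lower() for needle in needles)
    ]


# Example with made-up resource names and one empty identifier
buckets = ["LangSmith-Workshop-artifacts", "unrelated-logs"]
conflicts = find_name_conflicts(buckets, ["langsmith-workshop", ""])
print(conflicts)
```

Each provider branch could then pass the names it collected (RDS identifiers, bucket names, storage accounts) through the same function instead of repeating the comparison inline.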
{
"cell_type": "markdown",
"metadata": {},
@@ -449,60 +449,72 @@
"if not license_key:\n",
" raise RuntimeError(\"❌ LANGSMITH_LICENSE_KEY is required\")\n",
"\n",
"# Helper function to create or update secret\n",
"def create_or_update_secret(secret_name: str, literals: dict, namespace: str):\n",
" \"\"\"Create a secret if it doesn't exist, or update it if it does.\"\"\"\n",
" # Check if secret exists\n",
" check_result = run(\n",
" [\"kubectl\", \"get\", \"secret\", secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" # Build kubectl command\n",
" cmd = [\"kubectl\", \"create\", \"secret\", \"generic\", secret_name, \"-n\", namespace]\n",
" for key, value in literals.items():\n",
" cmd.extend([\"--from-literal\", f\"{key}={value}\"])\n",
" \n",
" if check_result.returncode == 0:\n",
" # Secret exists - update it using apply\n",
" print(f\" Secret '{secret_name}' exists, updating...\")\n",
" # Generate YAML using dry-run, then apply it\n",
" create_cmd = cmd + [\"--dry-run=client\", \"-o\", \"yaml\"]\n",
" result = run(create_cmd, check=True, stream=False)\n",
" \n",
" # Apply the YAML\n",
" apply_result = run(\n",
" [\"kubectl\", \"apply\", \"-f\", \"-\"],\n",
" input=result.stdout,\n",
" check=True,\n",
" stream=True\n",
" )\n",
" return \"updated\"\n",
" else:\n",
" # Secret doesn't exist - create it\n",
" print(f\" Secret '{secret_name}' does not exist, creating...\")\n",
" run(cmd, check=True, stream=True)\n",
" return \"created\"\n",
"\n",
"# Create license key secret\n",
"print(\"Creating license key secret...\")\n",
"run(\n",
" [\n",
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-license\",\n",
" f\"--from-literal=license-key={license_key}\",\n",
" \"-n\", namespace,\n",
" \"--dry-run=client\", \"-o\", \"yaml\"\n",
" ],\n",
" check=True,\n",
" stream=False\n",
"action = create_or_update_secret(\n",
" \"langsmith-license\",\n",
" {\"license-key\": license_key},\n",
" namespace\n",
")\n",
"# Actually create it (remove dry-run)\n",
"run(\n",
" [\n",
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-license\",\n",
" f\"--from-literal=license-key={license_key}\",\n",
" \"-n\", namespace\n",
" ],\n",
" check=False, # May already exist\n",
" stream=True\n",
")\n",
"ok(\"License key secret created/updated\")\n",
"ok(f\"License key secret {action}\")\n",
"\n",
"# Create database secret if credentials provided\n",
"if db_user and db_password:\n",
" print(\"\\nCreating database secret...\")\n",
" run(\n",
" [\n",
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-db\",\n",
" f\"--from-literal=username={db_user}\",\n",
" f\"--from-literal=password={db_password}\",\n",
" \"-n\", namespace\n",
" ],\n",
" check=False, # May already exist\n",
" stream=True\n",
" action = create_or_update_secret(\n",
" \"langsmith-db\",\n",
" {\"username\": db_user, \"password\": db_password},\n",
" namespace\n",
" )\n",
" ok(\"Database secret created/updated\")\n",
" ok(f\"Database secret {action}\")\n",
"else:\n",
" print(\"💡 Skipping database secret (using IAM auth or not needed)\")\n",
"\n",
"# Create Redis secret if password provided\n",
"if redis_password:\n",
" print(\"\\nCreating Redis secret...\")\n",
" run(\n",
" [\n",
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-redis\",\n",
" f\"--from-literal=password={redis_password}\",\n",
" \"-n\", namespace\n",
" ],\n",
" check=False, # May already exist\n",
" stream=True\n",
" action = create_or_update_secret(\n",
" \"langsmith-redis\",\n",
" {\"password\": redis_password},\n",
" namespace\n",
" )\n",
" ok(\"Redis secret created/updated\")\n",
" ok(f\"Redis secret {action}\")\n",
"else:\n",
" print(\"💡 Skipping Redis secret (using IAM auth or not needed)\")\n",
"\n",
@@ -568,6 +580,79 @@
" print(\"💡 Review the errors above before proceeding\")\n"
]
},
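The `create_or_update_secret` helper in the secrets cell passes secret values to `kubectl` as `--from-literal` arguments. As an alternative sketch, the same Secret could be described as a manifest and piped to `kubectl apply -f -`, which keeps the values off the command line and handles both the create and the update case. `build_secret_manifest` is a hypothetical name introduced here for illustration, and the namespace and value are placeholders.

```python
import json


def build_secret_manifest(name, namespace, literals):
    """Build a Kubernetes Secret manifest as a plain dict (illustrative helper).

    `stringData` lets the API server handle base64 encoding of the values.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "stringData": dict(literals),
    }


# Placeholder values; a real notebook would read these from the environment
manifest = build_secret_manifest(
    "langsmith-license", "langsmith", {"license-key": "example-value"}
)
payload = json.dumps(manifest)
# `payload` could then be sent to `kubectl apply -f -` on stdin,
# since kubectl accepts JSON manifests as well as YAML.
```

Whether this fits depends on how the notebook's `run` helper handles stdin; the existing cell already passes `input=` to it, so the same mechanism would presumably apply.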
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-Install Safety Check\n",
"\n",
"**⚠️ CRITICAL:** Before installing with Helm, verify that a release doesn't already exist. This prevents accidentally overwriting or conflicting with existing deployments.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pre-install safety check: Verify Helm release doesn't already exist\n",
"from shared._validation import ok, warn, fail\n",
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Pre-Install Helm Release Check\\n\")\n",
"print(\"Checking if Helm release already exists...\\n\")\n",
"\n",
"# Check if Helm release exists\n",
"result = run(\n",
" [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False, # Don't fail if namespace doesn't exist yet\n",
" stream=False\n",
")\n",
"\n",
"releases = []\n",
"if result.returncode == 0:\n",
" try:\n",
" releases = json.loads(result.stdout)\n",
" except json.JSONDecodeError:\n",
" # Empty output or invalid JSON\n",
" releases = []\n",
"elif \"not found\" in result.stderr.lower() or \"does not exist\" in result.stderr.lower():\n",
" # Namespace doesn't exist, which is fine\n",
" ok(f\"Namespace '{namespace}' does not exist (will be created)\")\n",
" releases = []\n",
"else:\n",
" # Some other error\n",
" warn(f\"Could not check for Helm releases: {result.stderr}\")\n",
" print(\"💡 Proceeding with caution\")\n",
"\n",
"# Check if our release name already exists\n",
"langsmith_releases = [r for r in releases if r.get(\"name\") == helm_release]\n",
"\n",
"if langsmith_releases:\n",
" release = langsmith_releases[0]\n",
" fail(f\"⚠️ Helm release '{helm_release}' already exists in namespace '{namespace}'!\")\n",
" print(f\"\\nRelease Details:\")\n",
" print(f\" Name: {release.get('name', 'N/A')}\")\n",
" print(f\" Status: {release.get('status', 'N/A')}\")\n",
" print(f\" Chart: {release.get('chart', 'N/A')}\")\n",
" print(f\" Revision: {release.get('revision', 'N/A')}\")\n",
" print(f\" Namespace: {release.get('namespace', 'N/A')}\")\n",
" \n",
" print(\"\\n\" + \"=\" * 60)\n",
" print(\"⚠️ WARNING: Cannot install - release already exists!\")\n",
" print(\"=\" * 60)\n",
" print(\"\\n💡 Options:\")\n",
" print(f\" 1. To upgrade the existing release, use: helm upgrade {helm_release} ...\")\n",
" print(f\" 2. To reinstall, first uninstall: helm uninstall {helm_release} -n {namespace}\")\n",
" print(f\" 3. To use a different release name, update HELM_RELEASE in your .env file\")\n",
" print(\"\\n❌ Do NOT proceed with 'helm install' - it will fail.\")\n",
" raise RuntimeError(f\"Helm release '{helm_release}' already exists. Use 'helm upgrade' or uninstall first.\")\n",
"else:\n",
" ok(f\"Helm release '{helm_release}' does not exist (safe to install)\")\n",
" print(\"✅ Safe to proceed with Helm install\")\n"
]
},
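The release lookup in the pre-install check can be read as a small pure function over the JSON that `helm list -o json` prints. The sketch below isolates that step so it can be exercised without a cluster; `find_release` is a hypothetical name, and the sample output is made up but uses the field names the cell above reads (`name`, `namespace`, `revision`, `status`, `chart`).

```python
import json


def find_release(helm_list_stdout, release_name):
    """Return the matching release dict from `helm list -o json` output, or None.

    Illustrative helper. Empty output is treated as "no releases"; malformed
    JSON is allowed to raise so that a failed check is not mistaken for an
    absent release.
    """
    text = (helm_list_stdout or "").strip()
    releases = json.loads(text) if text else []
    for release in releases or []:
        if release.get("name") == release_name:
            return release
    return None


# Made-up sample output for illustration
sample = json.dumps([
    {
        "name": "langsmith",
        "namespace": "langsmith",
        "revision": "3",
        "status": "deployed",
        "chart": "langsmith-0.0.0",
    }
])

found = find_release(sample, "langsmith")
missing = find_release(sample, "other-release")
print(found["status"], missing)
```

Keeping "could not parse" distinct from "not found" matters here, because only the second outcome means it is safe to run `helm install`.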
{
"cell_type": "markdown",
"metadata": {},
@@ -13,10 +13,13 @@
"### What We'll Validate\n",
"\n",
"1. ✅ Pod readiness (all pods running)\n",
"2. ✅ PVC binding (storage provisioned)\n",
"3. ✅ Ingress provisioning (ALB created)\n",
"4. ✅ Endpoint reachability (services accessible)\n",
"5. ✅ Basic UI availability (web interface works)\n",
"2. ✅ License key validation (properly configured)\n",
"3. ✅ PVC binding (storage provisioned)\n",
"4. ✅ External services connectivity (PostgreSQL, Redis, blob storage)\n",
"5. ✅ Ingress provisioning (load balancer created)\n",
"6. ✅ Endpoint reachability (services accessible)\n",
"7. ✅ Basic UI availability (web interface works)\n",
"8. ✅ Basic functional test (optional trace submission)\n",
"\n",
"### Why This Matters\n",
"\n",
@@ -72,7 +75,7 @@
"source": [
"## Setting Up Cluster Access\n",
"\n",
"Ensure kubectl is configured for the EKS cluster.\n"
"Ensure kubectl is configured for the Kubernetes cluster.\n"
]
},
{
@@ -82,12 +85,13 @@
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env, ok\n",
"from shared._validation import require_env, ok, warn\n",
"from shared._cloud_helpers import (\n",
" get_cloud_provider,\n",
" get_region,\n",
" configure_kubectl,\n",
")\n",
"from shared._shell import run\n",
"\n",
"provider = get_cloud_provider()\n",
"\n",
@@ -131,6 +135,8 @@
"outputs": [],
"source": [
"from shared._k8s_helpers import get_pods, wait_for_deployments_ready, require_namespace\n",
"from shared._validation import warn\n",
"from shared._shell import run\n",
"import json\n",
"\n",
"# Ensure namespace exists\n",
@@ -201,11 +207,279 @@
" run([\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"], check=False, stream=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.5. License Key Validation\n",
"\n",
"**Critical:** Verify that the LangSmith license key is properly configured and valid. License issues will prevent the system from functioning correctly.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check license key secret\n",
"print(\"### License Key Validation\\n\")\n",
"\n",
"# Check if license secret exists\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", \"langsmith-license\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" ok(\"License key secret exists\")\n",
" \n",
" # Try to check if license key is set (without revealing it)\n",
" secret_data = json.loads(result.stdout)\n",
" if \"data\" in secret_data and \"license-key\" in secret_data[\"data\"]:\n",
" ok(\"License key is present in secret\")\n",
" else:\n",
" warn(\"License key secret exists but 'license-key' field not found\")\n",
" print(\"💡 Secret may use a different key name\")\n",
"else:\n",
" warn(\"License key secret not found\")\n",
" print(\"💡 License secret 'langsmith-license' should exist in namespace\")\n",
" print(\" Check that you created the secret during Helm installation\")\n",
"\n",
"# Check pod logs for license-related errors\n",
"print(\"\\n### Checking Pod Logs for License Errors\\n\")\n",
"\n",
"# Get all pods in namespace\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0 and result.stdout.strip():\n",
" pod_names = result.stdout.strip().split()\n",
" license_errors_found = False\n",
" \n",
" # Check logs from a few key pods (limit to first 3 to avoid too much output)\n",
" key_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])][:3]\n",
" if not key_pods:\n",
" key_pods = pod_names[:3] # Fallback to first 3 pods\n",
" \n",
" for pod_name in key_pods:\n",
" try:\n",
" # Get recent logs (last 50 lines)\n",
" log_result = run(\n",
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=50\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if log_result.returncode == 0:\n",
" logs = log_result.stdout.lower()\n",
" # Look for common license-related error patterns\n",
" license_error_patterns = [\n",
" \"license\",\n",
" \"unauthorized\",\n",
" \"invalid license\",\n",
" \"license expired\",\n",
" \"license key\",\n",
" \"beacon.langchain.com\",\n",
" ]\n",
" \n",
" for pattern in license_error_patterns:\n",
" if pattern in logs:\n",
" # Check if it's actually an error (not just a log message)\n",
" lines = log_result.stdout.split(\"\\n\")\n",
" error_lines = [line for line in lines if pattern in line.lower() and any(err in line.lower() for err in [\"error\", \"fail\", \"invalid\", \"unauthorized\"])]\n",
" if error_lines:\n",
" license_errors_found = True\n",
" warn(f\"Potential license issue found in {pod_name} logs\")\n",
" print(f\" Pattern: '{pattern}'\")\n",
" print(f\" Sample: {error_lines[0][:100]}...\")\n",
" break\n",
" except Exception as e:\n",
" # Skip pods that can't be logged (may not be ready)\n",
" pass\n",
" \n",
" if not license_errors_found:\n",
" ok(\"No obvious license-related errors found in pod logs\")\n",
" else:\n",
" print(\"\\n💡 If license errors are present, verify:\")\n",
" print(\" - License key is valid and not expired\")\n",
" print(\" - Egress to https://beacon.langchain.com is allowed (if not air-gapped)\")\n",
" print(\" - License secret is correctly mounted in pods\")\n",
"else:\n",
" warn(\"Could not retrieve pod names to check logs\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.5. External Services Connectivity\n",
"\n",
"**Important:** Verify that external services (PostgreSQL, Redis, blob storage) are accessible from the cluster. These are critical dependencies for LangSmith.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._cloud_helpers import (\n",
" get_database_service_name,\n",
" get_cache_service_name,\n",
" get_blob_storage_service_name,\n",
")\n",
"\n",
"# Check external services connectivity\n",
"print(\"### External Services Connectivity Check\\n\")\n",
"\n",
"# Try to load Terraform outputs to get service endpoints\n",
"terraform_outputs_file = artifacts_dir / \"terraform-outputs.json\"\n",
"terraform_outputs = {}\n",
"\n",
"if terraform_outputs_file.exists():\n",
" try:\n",
" with open(terraform_outputs_file) as f:\n",
" terraform_outputs_raw = json.load(f)\n",
" \n",
" # Unwrap Terraform output format\n",
" for key, value in terraform_outputs_raw.items():\n",
" if isinstance(value, dict) and \"value\" in value:\n",
" terraform_outputs[key] = value[\"value\"]\n",
" else:\n",
" terraform_outputs[key] = value\n",
" \n",
" print(\"💡 Loaded Terraform outputs for service endpoints\\n\")\n",
" except Exception as e:\n",
" warn(f\"Could not parse Terraform outputs: {e}\")\n",
" print(\"💡 Will attempt basic connectivity checks without endpoint details\")\n",
"else:\n",
" print(\"💡 Terraform outputs file not found - will check service connectivity from cluster\\n\")\n",
"\n",
"# Check PostgreSQL connectivity\n",
"print(\"### PostgreSQL/Database Connectivity\\n\")\n",
"db_service = get_database_service_name()\n",
"\n",
"# Try to find a pod we can exec into for connectivity tests\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0 and result.stdout.strip():\n",
" test_pod = result.stdout.strip()\n",
" \n",
" # Check if we can reach database (basic connectivity test)\n",
" # This is a simple test - actual connection requires credentials\n",
" db_endpoint = None\n",
" if \"rds_endpoint\" in terraform_outputs:\n",
" db_endpoint = terraform_outputs[\"rds_endpoint\"]\n",
" elif \"postgres_endpoint\" in terraform_outputs:\n",
" db_endpoint = terraform_outputs[\"postgres_endpoint\"]\n",
" elif \"database_endpoint\" in terraform_outputs:\n",
" db_endpoint = terraform_outputs[\"database_endpoint\"]\n",
" \n",
" if db_endpoint:\n",
" # Extract hostname from endpoint (remove port if present)\n",
" db_host = db_endpoint.split(\":\")[0] if \":\" in db_endpoint else db_endpoint\n",
" print(f\"Testing connectivity to {db_service} at {db_host}...\")\n",
" \n",
" # Try a simple DNS lookup or ping test\n",
" dns_result = run(\n",
" [\"kubectl\", \"exec\", \"-n\", namespace, test_pod, \"--\", \"nslookup\", db_host],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if dns_result.returncode == 0:\n",
" ok(f\"{db_service} hostname resolves: {db_host}\")\n",
" else:\n",
" warn(f\"Could not resolve {db_service} hostname\")\n",
" print(\"💡 This may be normal if DNS is not fully configured yet\")\n",
" else:\n",
" print(f\"💡 {db_service} endpoint not found in Terraform outputs\")\n",
" print(\" Verify database is accessible from cluster in cloud console\")\n",
"else:\n",
" print(\"💡 Could not find pod for connectivity testing\")\n",
" print(f\" Manually verify {db_service} is accessible from cluster\")\n",
"\n",
"# Check Redis connectivity\n",
"print(\"\\n### Redis/Cache Connectivity\\n\")\n",
"cache_service = get_cache_service_name()\n",
"\n",
"if result.returncode == 0 and result.stdout.strip():\n",
" redis_endpoint = None\n",
" if \"redis_endpoint\" in terraform_outputs:\n",
" redis_endpoint = terraform_outputs[\"redis_endpoint\"]\n",
" elif \"cache_endpoint\" in terraform_outputs:\n",
" redis_endpoint = terraform_outputs[\"cache_endpoint\"]\n",
" elif \"elasticache_endpoint\" in terraform_outputs:\n",
" redis_endpoint = terraform_outputs[\"elasticache_endpoint\"]\n",
" \n",
" if redis_endpoint:\n",
" # Extract hostname from endpoint\n",
" redis_host = redis_endpoint.split(\":\")[0] if \":\" in redis_endpoint else redis_endpoint\n",
" print(f\"Testing connectivity to {cache_service} at {redis_host}...\")\n",
" \n",
" dns_result = run(\n",
" [\"kubectl\", \"exec\", \"-n\", namespace, test_pod, \"--\", \"nslookup\", redis_host],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if dns_result.returncode == 0:\n",
" ok(f\"{cache_service} hostname resolves: {redis_host}\")\n",
" else:\n",
" warn(f\"Could not resolve {cache_service} hostname\")\n",
" print(\"💡 This may be normal if DNS is not fully configured yet\")\n",
" else:\n",
" print(f\"💡 {cache_service} endpoint not found in Terraform outputs\")\n",
" print(\" Verify cache is accessible from cluster in cloud console\")\n",
"\n",
"# Check blob storage (S3/Azure Blob)\n",
"print(\"\\n### Blob Storage Connectivity\\n\")\n",
"blob_service = get_blob_storage_service_name()\n",
"\n",
"# Check if blob storage secret exists (indicates it's configured)\n",
"blob_secret_result = run(\n",
" [\"kubectl\", \"get\", \"secret\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if blob_secret_result.returncode == 0:\n",
" secrets = blob_secret_result.stdout.split()\n",
" blob_secrets = [s for s in secrets if any(keyword in s.lower() for keyword in [\"s3\", \"storage\", \"blob\", \"aws\"])]\n",
" if blob_secrets:\n",
" ok(f\"Blob storage secrets found: {', '.join(blob_secrets)}\")\n",
" else:\n",
" print(\"💡 Blob storage secrets not found (may use IAM roles instead)\")\n",
"\n",
"# Check for S3 bucket or blob storage account in Terraform outputs\n",
"if \"s3_bucket\" in terraform_outputs or \"bucket_name\" in terraform_outputs:\n",
" bucket_name = terraform_outputs.get(\"s3_bucket\") or terraform_outputs.get(\"bucket_name\")\n",
" ok(f\"Blob storage bucket/container configured: {bucket_name}\")\n",
"elif \"storage_account\" in terraform_outputs:\n",
" storage_account = terraform_outputs[\"storage_account\"]\n",
" ok(f\"Azure storage account configured: {storage_account}\")\n",
"else:\n",
" print(f\"💡 Verify {blob_service} is configured and accessible\")\n",
" print(\" Check Terraform outputs or cloud console for storage resource\")\n",
"\n",
"print(\"\\n💡 For comprehensive functional testing of external services,\")\n",
"print(\" see the validation guide for trace submission and attachment tests\")\n"
]
},
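The connectivity cell above does three things before it runs `nslookup`: it unwraps the `{"value": ...}` envelope that `terraform output -json` places around each output, tries several candidate key names, and strips the port from the endpoint. A compact sketch of those steps is shown here. The function names are hypothetical, and the hostnames are placeholders rather than real endpoints.

```python
def unwrap_outputs(raw):
    """Flatten `terraform output -json` style entries to plain values."""
    return {
        key: value["value"] if isinstance(value, dict) and "value" in value else value
        for key, value in raw.items()
    }


def first_output(outputs, candidate_keys):
    """Return the first non-empty output among the candidate keys, else None."""
    for key in candidate_keys:
        if outputs.get(key):
            return outputs[key]
    return None


def endpoint_host(endpoint):
    """Drop a trailing ':port' from a 'host:port' endpoint string."""
    return endpoint.split(":")[0]


# Placeholder data shaped like Terraform's JSON output
raw_outputs = {
    "rds_endpoint": {"value": "db.example.internal:5432", "type": "string", "sensitive": False},
    "cache_endpoint": "cache.example.internal",
}
outputs = unwrap_outputs(raw_outputs)
db_host = endpoint_host(first_output(outputs, ["rds_endpoint", "postgres_endpoint", "database_endpoint"]))
redis_host = endpoint_host(first_output(outputs, ["redis_endpoint", "cache_endpoint", "elasticache_endpoint"]))
print(db_host, redis_host)
```

Note that `endpoint_host` assumes a bare `host:port` string, as the cell does; an endpoint written as a URL with a scheme would need different parsing.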
{
"cell_type": "code",
"execution_count": null,
@@ -269,9 +543,9 @@
"source": [
"## 3. Ingress Provisioning Check\n",
"\n",
"**Critical:** The AWS ALB (Application Load Balancer) must be provisioned. This is how external traffic reaches LangSmith.\n",
"**Critical:** The load balancer (ALB for AWS, Application Gateway for Azure) must be provisioned. This is how external traffic reaches LangSmith.\n",
"\n",
"Common issue: ALB never appears due to wrong ingress assumptions.\n"
"Common issue: Load balancer never appears due to wrong ingress assumptions.\n"
]
},
{
@@ -317,16 +591,25 @@
" if ingress_hosts:\n",
" print(f\" Hosts: {', '.join(ingress_hosts)}\")\n",
" \n",
" # Check for ALB address\n",
" # Check for load balancer address (cloud-agnostic)\n",
" if load_balancer.get(\"ingress\"):\n",
" alb_addresses = [ing.get(\"hostname\", ing.get(\"ip\", \"\")) for ing in load_balancer[\"ingress\"]]\n",
" if alb_addresses:\n",
" ok(f\"ALB provisioned: {', '.join(alb_addresses)}\")\n",
" print(f\" 💡 Access LangSmith at: https://{alb_addresses[0]}\")\n",
" lb_addresses = [ing.get(\"hostname\", ing.get(\"ip\", \"\")) for ing in load_balancer[\"ingress\"]]\n",
" if lb_addresses:\n",
" # Determine load balancer type based on address format\n",
" lb_type = \"Load Balancer\"\n",
" if provider == \"aws\":\n",
" if \".elb.\" in lb_addresses[0] or \".amazonaws.com\" in lb_addresses[0]:\n",
" lb_type = \"ALB (Application Load Balancer)\"\n",
" elif provider == \"azure\":\n",
" if \".azure.com\" in lb_addresses[0] or \"appgw\" in lb_addresses[0]:\n",
" lb_type = \"Application Gateway\"\n",
" \n",
" ok(f\"{lb_type} provisioned: {', '.join(lb_addresses)}\")\n",
" print(f\" 💡 Access LangSmith at: https://{lb_addresses[0]}\")\n",
" else:\n",
" warn(\"ALB ingress entry exists but no address found\")\n",
" warn(\"Load balancer ingress entry exists but no address found\")\n",
" else:\n",
" warn(\"ALB not yet provisioned (may take a few minutes)\")\n",
" warn(\"Load balancer not yet provisioned (may take a few minutes)\")\n",
" print(\" 💡 Wait a few minutes and check again\")\n",
" else:\n",
" warn(\"No ingress resources found\")\n",
@@ -335,24 +618,62 @@
" warn(\"Could not retrieve ingress resources\")\n",
" print(\"💡 Ingress may not exist yet or namespace is incorrect\")\n",
"\n",
"# Also check for ALB Ingress Controller\n",
"print(\"\\n### ALB Ingress Controller\\n\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app.kubernetes.io/name=aws-load-balancer-controller\", \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"# Check for ingress controller (cloud-agnostic)\n",
"print(\"\\n### Ingress Controller\\n\")\n",
"\n",
"if result.returncode == 0:\n",
" controller_data = json.loads(result.stdout)\n",
" controllers = controller_data.get(\"items\", [])\n",
" if controllers:\n",
" ok(f\"ALB Ingress Controller found ({len(controllers)} pod(s))\")\n",
"if provider == \"aws\":\n",
" # Check for ALB Ingress Controller\n",
" result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app.kubernetes.io/name=aws-load-balancer-controller\", \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" controller_data = json.loads(result.stdout)\n",
" controllers = controller_data.get(\"items\", [])\n",
" if controllers:\n",
" ok(f\"ALB Ingress Controller found ({len(controllers)} pod(s))\")\n",
" else:\n",
" warn(\"ALB Ingress Controller not found\")\n",
" print(\"💡 ALB Ingress Controller must be installed for ingress to work\")\n",
" else:\n",
" warn(\"ALB Ingress Controller not found\")\n",
" print(\"💡 ALB Ingress Controller must be installed for ingress to work\")\n",
" warn(\"Could not check ALB Ingress Controller status\")\n",
"\n",
"elif provider == \"azure\":\n",
" # Check for Azure Application Gateway Ingress Controller\n",
" result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app=ingress-azure\", \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" controller_data = json.loads(result.stdout)\n",
" controllers = controller_data.get(\"items\", [])\n",
" if controllers:\n",
" ok(f\"Azure Application Gateway Ingress Controller found ({len(controllers)} pod(s))\")\n",
" else:\n",
" # Also check for AGIC (Application Gateway Ingress Controller)\n",
" result2 = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app=ingress-appgw\", \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result2.returncode == 0:\n",
" controller_data2 = json.loads(result2.stdout)\n",
" controllers2 = controller_data2.get(\"items\", [])\n",
" if controllers2:\n",
" ok(f\"Application Gateway Ingress Controller found ({len(controllers2)} pod(s))\")\n",
" else:\n",
" warn(\"Application Gateway Ingress Controller not found\")\n",
" print(\"💡 Application Gateway Ingress Controller must be installed for ingress to work\")\n",
" else:\n",
" warn(\"Could not check Application Gateway Ingress Controller status\")\n",
" else:\n",
" warn(\"Could not check Application Gateway Ingress Controller status\")\n",
"else:\n",
" warn(\"Could not check ALB Ingress Controller status\")\n"
" print(\"💡 Verify ingress controller is installed for your cloud provider\")\n"
]
},
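The ingress check above reads the load balancer address from each Ingress object's `status.loadBalancer.ingress` list, where an entry typically carries a `hostname` (common on AWS) or an `ip` (common on Azure). The sketch below shows one way to collect those addresses while skipping entries that have neither; `load_balancer_addresses` is a hypothetical helper and the sample objects are made up.

```python
def load_balancer_addresses(ingress):
    """Return hostnames or IPs published in an Ingress object's status.

    Illustrative helper: entries without a hostname or IP are skipped, so an
    empty result means no address has been assigned yet.
    """
    entries = ingress.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    addresses = []
    for entry in entries:
        address = entry.get("hostname") or entry.get("ip")
        if address:
            addresses.append(address)
    return addresses


# Made-up Ingress fragments for illustration
aws_style = {"status": {"loadBalancer": {"ingress": [{"hostname": "k8s-example.us-west-2.elb.amazonaws.com"}]}}}
azure_style = {"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}}
pending = {"status": {"loadBalancer": {}}}

print(load_balancer_addresses(aws_style))
print(load_balancer_addresses(azure_style))
print(load_balancer_addresses(pending))
```

An empty list here corresponds to the "not yet provisioned" warning in the cell, which usually just means the controller has not finished creating the load balancer.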
{
@@ -424,6 +745,129 @@
" warn(\"No services found\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Basic Functional Test (Optional)\n",
"\n",
"**Optional:** Submit a simple test trace to verify the end-to-end pipeline is working. This validates that traces can be ingested, stored, and retrieved.\n",
"\n",
"> **Note:** For comprehensive functional testing (traces, attachments, feedback, datasets), see the full validation guide.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: Basic functional test\n",
"print(\"### Basic Functional Test (Optional)\\n\")\n",
"\n",
"# Check if we have the necessary information to run a test\n",
"ingress_result = run(\n",
" [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"jsonpath={.items[0].status.loadBalancer.ingress[0].hostname}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if ingress_result.returncode == 0 and ingress_result.stdout.strip():\n",
" ingress_host = ingress_result.stdout.strip()\n",
" langsmith_endpoint = f\"https://{ingress_host}/api\"\n",
" \n",
" print(f\"LangSmith endpoint: {langsmith_endpoint}\")\n",
" print(\"\\n💡 To run a basic functional test:\")\n",
" print(\" 1. Generate an API key from the LangSmith UI\")\n",
" print(\" 2. Set LANGSMITH_API_KEY environment variable\")\n",
" print(\" 3. Run the test script below (or see validation guide for comprehensive tests)\\n\")\n",
" \n",
" # Check if API key is available\n",
" api_key = os.environ.get(\"LANGSMITH_API_KEY\", \"\").strip()\n",
" \n",
" if api_key:\n",
" print(\"✅ LANGSMITH_API_KEY found - attempting basic trace submission...\\n\")\n",
" \n",
" try:\n",
" # Simple test: submit a basic trace\n",
" test_code = f'''\n",
"import os\n",
"import requests\n",
"from langsmith import traceable\n",
"\n",
"# Configure LangSmith\n",
"os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
"os.environ[\"LANGSMITH_API_KEY\"] = \"{api_key}\"\n",
"os.environ[\"LANGSMITH_ENDPOINT\"] = \"{langsmith_endpoint}\"\n",
"os.environ[\"LANGSMITH_PROJECT\"] = \"validation-test\"\n",
"\n",
"# Simple traced function\n",
"@traceable(name=\"test_basic_function\")\n",
"def test_function():\n",
" return \"Hello from LangSmith validation test!\"\n",
"\n",
"# Run test\n",
"try:\n",
" result = test_function()\n",
" print(f\"✅ Test trace submitted successfully: {{result}}\")\n",
" print(f\"💡 Check the LangSmith UI at https://{{ingress_host}} to see the trace\")\n",
" print(\" Navigate to the 'validation-test' project\")\n",
"except Exception as e:\n",
" print(f\"⚠️ Error submitting trace: {{e}}\")\n",
" print(\"💡 This may be normal if LangSmith is still initializing\")\n",
"'''\n",
" \n",
" # Try to import langsmith to see if it's available\n",
" try:\n",
" import langsmith\n",
" print(\"Running basic trace test...\")\n",
" exec(test_code)\n",
" ok(\"Basic functional test completed\")\n",
" except ImportError:\n",
" print(\"⚠️ langsmith package not installed\")\n",
" print(\"💡 Install with: pip install langsmith\")\n",
" print(\"\\nTest script (save and run separately):\")\n",
" print(\"=\" * 60)\n",
" print(test_code)\n",
" print(\"=\" * 60)\n",
" except Exception as e:\n",
" warn(f\"Could not run functional test: {e}\")\n",
" print(\"💡 This is optional - you can test functionality manually in the UI\")\n",
" else:\n",
" print(\"💡 To enable automated testing, set LANGSMITH_API_KEY in your environment\")\n",
" print(\" Get an API key from: https://{ingress_host}/settings/api-keys\")\n",
" print(\"\\nExample test script (run after getting API key):\")\n",
" print(\"=\" * 60)\n",
" print(f'''\n",
"import os\n",
"from langsmith import traceable\n",
"\n",
"os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
"os.environ[\"LANGSMITH_API_KEY\"] = \"<your-api-key>\"\n",
"os.environ[\"LANGSMITH_ENDPOINT\"] = \"{langsmith_endpoint}\"\n",
"os.environ[\"LANGSMITH_PROJECT\"] = \"validation-test\"\n",
"\n",
"@traceable(name=\"test_basic_function\")\n",
"def test_function():\n",
" return \"Hello from LangSmith!\"\n",
"\n",
"test_function()\n",
"print(\"Check the UI for the trace!\")\n",
"''')\n",
" print(\"=\" * 60)\n",
"else:\n",
" print(\"💡 Ingress not available yet - functional test requires accessible endpoint\")\n",
" print(\" Complete ingress validation first, then return to this section\")\n",
"\n",
"print(\"\\n💡 For comprehensive functional testing including:\")\n",
"print(\" - Trace submission & ClickHouse analytics\")\n",
"print(\" - Attachments & blob storage\")\n",
"print(\" - Feedback system\")\n",
"print(\" - Dataset management\")\n",
"print(\" - Agent deployments\")\n",
"print(\" See the full validation guide for detailed test scripts\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -460,7 +904,14 @@
" # Try to access the UI (HTTPS)\n",
" ui_url = f\"https://{ingress_host}\"\n",
" print(f\"\\nTesting UI availability at: {ui_url}\")\n",
" print(\"(This may take a moment if ALB is still provisioning...)\\n\")\n",
" \n",
" # Cloud-specific messaging\n",
" if provider == \"aws\":\n",
" print(\"(This may take a moment if ALB is still provisioning...)\\n\")\n",
" elif provider == \"azure\":\n",
" print(\"(This may take a moment if Application Gateway is still provisioning...)\\n\")\n",
" else:\n",
" print(\"(This may take a moment if load balancer is still provisioning...)\\n\")\n",
" \n",
" try:\n",
" # Use a short timeout and allow redirects\n",
@@ -482,19 +933,36 @@
" print(\" Browser may show security warning - this is normal for self-signed certs\")\n",
" except requests.exceptions.Timeout:\n",
" warn(\"UI request timed out\")\n",
" print(\"💡 ALB may still be provisioning, or ingress is not fully configured\")\n",
" print(f\" Try again in a few minutes: {ui_url}\")\n",
" if provider == \"aws\":\n",
" print(\"💡 ALB may still be provisioning, or ingress is not fully configured\")\n",
" print(f\" Check AWS console for ALB status, then try: {ui_url}\")\n",
" elif provider == \"azure\":\n",
" print(\"💡 Application Gateway may still be provisioning, or ingress is not fully configured\")\n",
" print(f\" Check Azure portal for Application Gateway status, then try: {ui_url}\")\n",
" else:\n",
" print(f\" Try again in a few minutes: {ui_url}\")\n",
" except requests.exceptions.ConnectionError as e:\n",
" warn(f\"Could not connect to UI: {e}\")\n",
" print(\"💡 ALB may still be provisioning\")\n",
" print(f\" Check AWS console for ALB status, then try: {ui_url}\")\n",
" if provider == \"aws\":\n",
" print(\"💡 ALB may still be provisioning\")\n",
" print(f\" Check AWS console for ALB status, then try: {ui_url}\")\n",
" elif provider == \"azure\":\n",
" print(\"💡 Application Gateway may still be provisioning\")\n",
" print(f\" Check Azure portal for Application Gateway status, then try: {ui_url}\")\n",
" else:\n",
" print(f\" Try again in a few minutes: {ui_url}\")\n",
" except Exception as e:\n",
" warn(f\"Error accessing UI: {e}\")\n",
" print(f\"💡 Manual check: Open {ui_url} in a browser\")\n",
"else:\n",
" warn(\"Could not determine ingress hostname\")\n",
" print(\"💡 Ingress may not be provisioned yet\")\n",
" print(\" Run the ingress check above and wait for ALB to be created\")\n"
" if provider == \"aws\":\n",
" print(\" Run the ingress check above and wait for ALB to be created\")\n",
" elif provider == \"azure\":\n",
" print(\" Run the ingress check above and wait for Application Gateway to be created\")\n",
" else:\n",
" print(\" Run the ingress check above and wait for load balancer to be created\")\n"
]
},
{
@@ -561,10 +1029,13 @@
"### ✅ Validation Checklist\n",
"\n",
"- [ ] All pods are running and ready\n",
"- [ ] License key is properly configured (no errors in logs)\n",
"- [ ] All PVCs are bound\n",
"- [ ] Ingress/ALB is provisioned\n",
"- [ ] External services are accessible (PostgreSQL, Redis, blob storage)\n",
"- [ ] Ingress/load balancer is provisioned\n",
"- [ ] Services are accessible\n",
"- [ ] UI is reachable (or ALB is provisioning)\n",
"- [ ] UI is reachable (or load balancer is provisioning)\n",
"- [ ] Basic functional test passed (optional)\n",
"- [ ] Diagnostic artifacts collected\n",
"\n",
"### 🎯 Next Steps\n",
@@ -573,15 +1044,18 @@
"- ✅ You have a working baseline deployment\n",
"- ✅ You're on a supported path\n",
"- ✅ Ready to proceed to Module 2 (SSO/OIDC configuration)\n",
"- 💡 For comprehensive functional testing, see the full validation guide\n",
"\n",
"**If checks fail:**\n",
"- Review the warnings above\n",
"- Check diagnostic artifacts\n",
"- Common issues:\n",
" - **PVCs pending:** EBS CSI driver not installed\n",
" - **ALB not appearing:** Wrong ingress configuration\n",
" - **Pods not ready:** Check events and logs\n",
" - **UI not accessible:** Wait for ALB provisioning (can take 5-10 minutes)\n",
" - **PVCs pending:** Storage CSI driver not installed\n",
"- **Load balancer not appearing:** Wrong ingress configuration\n",
"- **Pods not ready:** Check events and logs\n",
"- **UI not accessible:** Wait for load balancer provisioning (can take 5-10 minutes)\n",
"- **License errors:** Verify license key is valid and secret is correctly mounted\n",
"- **External services unreachable:** Check network connectivity and security groups\n",
"\n",
"### 📋 Baseline Reference\n",
"\n",
+1 -1
@@ -425,7 +425,7 @@
"\n",
"If you want to start over:\n",
"1. Review and update your `.env` file\n",
"2. Run `01_aws_preflight.ipynb` again\n",
"2. Run `01_preflight.ipynb` again\n",
"3. Proceed through the module notebooks\n",
"\n",
"**Thank you for completing Module 1!**\n"
File diff suppressed because it is too large
@@ -0,0 +1,863 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 2: SAML SSO Validation (Optional)\n",
"\n",
"## Overview\n",
"\n",
"This notebook validates SAML SSO configuration for your LangSmith deployment. Use this if your IdP only supports SAML or if enterprise policy requires SAML.\n",
"\n",
"**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. It performs validation checks only and does NOT modify any infrastructure, Helm values, secrets, or deployments. All operations are safe to run against production environments.\n",
"\n",
"**Prerequisites:**\n",
"- Module 1 deployment is healthy and accessible\n",
"- DNS configured and resolving correctly\n",
"- TLS certificate valid and trusted\n",
"- Ingress configured and working\n",
"- IdP team has provided SAML metadata or metadata URL\n",
"\n",
"## What We'll Validate\n",
"\n",
"1. ✅ Environment configuration (SAML settings, redacted)\n",
"2. ✅ Preflight checks (tools, kubectl, namespace, Helm release)\n",
"3. ✅ Current auth configuration (without leaking secrets)\n",
"4. ✅ Ingress/TLS preconditions (domain, HTTPS)\n",
"5. ✅ SAML metadata validation (URL reachability, XML parsing, required attributes)\n",
"6. ✅ Deployment verification (pods, logs, endpoints)\n",
"7. ✅ Common failure signatures\n",
"8. ✅ Support bundle pointers\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**Important:** \n",
"- This notebook never prints secrets. All sensitive values are redacted.\n",
"- This notebook does NOT modify any resources. It is safe for production use.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"possible_paths = [\n",
" Path.cwd().parent, # If cwd is module-2, go up one level to notebooks\n",
" Path.cwd(), # If cwd is already notebooks\n",
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration\n",
"\n",
"Load and validate SAML configuration from environment variables. All secrets are redacted in output.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"from pathlib import Path\n",
"from shared._validation import require_env, print_config, redact, ok, warn\n",
"from shared._shell import run\n",
"\n",
"# Required SAML configuration variables\n",
"required_vars = [\n",
" \"NAMESPACE\",\n",
" \"SAML_METADATA_URL\", # OR SAML_METADATA_FILE (one must be provided)\n",
" \"LANGSMITH_DOMAIN\",\n",
"]\n",
"\n",
"# Optional but recommended\n",
"optional_vars = [\n",
" \"SAML_ENTITY_ID\",\n",
" \"SAML_EMAIL_ATTRIBUTE\",\n",
" \"SAML_NAME_ATTRIBUTE\",\n",
" \"SAML_GROUPS_ATTRIBUTE\",\n",
"]\n",
"\n",
"print(\"### Loading SAML Configuration\\n\")\n",
"\n",
"# Load required variables\n",
"config = {}\n",
"missing = []\n",
"\n",
"for var in required_vars:\n",
" value = os.environ.get(var, \"\").strip()\n",
" if not value:\n",
" missing.append(var)\n",
" config[var] = value\n",
"\n",
"# Check if SAML_METADATA_FILE is provided as alternative\n",
"saml_metadata_file = os.environ.get(\"SAML_METADATA_FILE\", \"\").strip()\n",
"if not config.get(\"SAML_METADATA_URL\") and not saml_metadata_file:\n",
" missing.append(\"SAML_METADATA_URL or SAML_METADATA_FILE\")\n",
"\n",
"if missing:\n",
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
" f\"💡 Copy env-samples/oidc.env.example to your .env file and fill in SAML values\")\n",
"\n",
"# Load optional variables\n",
"for var in optional_vars:\n",
" config[var] = os.environ.get(var, \"\").strip()\n",
"\n",
"# Set defaults for optional variables\n",
"if not config.get(\"SAML_EMAIL_ATTRIBUTE\"):\n",
" config[\"SAML_EMAIL_ATTRIBUTE\"] = \"email\"\n",
"if not config.get(\"SAML_NAME_ATTRIBUTE\"):\n",
" config[\"SAML_NAME_ATTRIBUTE\"] = \"name\"\n",
"if not config.get(\"SAML_GROUPS_ATTRIBUTE\"):\n",
" config[\"SAML_GROUPS_ATTRIBUTE\"] = \"groups\"\n",
"\n",
"# Print configuration (redacted)\n",
"print_config(config, redact_keys=set())\n",
"\n",
"ok(\"Configuration loaded\")\n",
"\n",
"# Validate metadata source\n",
"if config.get(\"SAML_METADATA_URL\"):\n",
" metadata_url = config[\"SAML_METADATA_URL\"]\n",
" if not metadata_url.startswith(\"https://\"):\n",
" warn(\"SAML metadata URL should use HTTPS\")\n",
" print(f\"\\n💡 Using metadata URL: {metadata_url}\")\n",
"elif saml_metadata_file:\n",
" metadata_path = Path(saml_metadata_file)\n",
" if not metadata_path.exists():\n",
" raise RuntimeError(f\"❌ SAML metadata file not found: {saml_metadata_file}\")\n",
" print(f\"\\n💡 Using metadata file: {saml_metadata_file}\")\n",
"else:\n",
" raise RuntimeError(\"❌ Either SAML_METADATA_URL or SAML_METADATA_FILE must be provided\")\n",
"\n",
"domain = config[\"LANGSMITH_DOMAIN\"]\n",
"print(f\"\\n💡 Verify these values match your IdP configuration:\")\n",
"print(f\" - Entity ID: {config.get('SAML_ENTITY_ID', 'N/A')}\")\n",
"print(f\" - Metadata URL/File: {config.get('SAML_METADATA_URL', saml_metadata_file)}\")\n",
"print(f\" - Domain: {domain}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Safety Check: Verify Environment\n",
"\n",
"Before proceeding with validation, confirm you're working with the correct environment and that auth configuration is appropriate to validate.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Safety check: Verify environment and auth configuration state\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn\n",
"from shared._shell import run\n",
"import json\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"### Environment Safety Check\\n\")\n",
"\n",
"# Show current environment\n",
"provider_display = provider.upper()\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
" print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
"elif provider == \"azure\":\n",
" print(f\"Subscription ID: {identity.get('SubscriptionId', identity.get('Account', 'N/A'))}\")\n",
" print(f\"Subscription Name: {identity.get('SubscriptionName', 'N/A')}\")\n",
"\n",
"print(\"\\n\" + \"=\" * 60)\n",
"print(\"⚠️ IMPORTANT: This notebook is READ-ONLY\")\n",
"print(\"=\" * 60)\n",
"print(\"\\nThis notebook will:\")\n",
"print(\" ✅ Validate SAML configuration\")\n",
"print(\" ✅ Check deployment status\")\n",
"print(\" ✅ Inspect current auth settings (secrets redacted)\")\n",
"print(\" ✅ Collect support bundles\")\n",
"print(\"\\nThis notebook will NOT:\")\n",
"print(\" ❌ Modify Helm values or releases\")\n",
"print(\" ❌ Create or update secrets\")\n",
"print(\" ❌ Restart pods or deployments\")\n",
"print(\" ❌ Change any infrastructure\")\n",
"print(\"\\n\" + \"=\" * 60)\n",
"\n",
"# Check if auth is already configured\n",
"print(\"\\n### Checking Current Auth Configuration State\\n\")\n",
"namespace = config.get(\"NAMESPACE\", \"\")\n",
"helm_release = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"# Check for auth-related secrets\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"auth_configured = False\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" auth_secrets = [s for s in secrets.get(\"items\", [])\n",
" if any(keyword in s.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
" for keyword in [\"auth\", \"saml\", \"sso\"])]\n",
" \n",
" if auth_secrets:\n",
" auth_configured = True\n",
" ok(f\"Found {len(auth_secrets)} auth-related secret(s) - auth appears configured\")\n",
" print(\" 💡 This validation will check if your SAML configuration matches existing setup\")\n",
" else:\n",
" warn(\"No auth-related secrets found - auth may not be configured yet\")\n",
" print(\" 💡 This validation will verify your SAML configuration is ready to apply\")\n",
"\n",
"# Check Helm values for auth config\n",
"result = run(\n",
" [\"helm\", \"get\", \"values\", helm_release, \"-n\", namespace, \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" try:\n",
" values = json.loads(result.stdout)\n",
" if \"auth\" in str(values).lower() or \"saml\" in str(values).lower():\n",
" if not auth_configured:\n",
" auth_configured = True\n",
" ok(\"Helm values contain auth configuration\")\n",
" else:\n",
" warn(\"No auth configuration found in Helm values\")\n",
" except json.JSONDecodeError:\n",
" pass\n",
"\n",
"if auth_configured:\n",
" print(\"\\n\" + \"=\" * 60)\n",
" print(\"⚠️ Auth is already configured in this environment\")\n",
" print(\"=\" * 60)\n",
" print(\"\\nThis validation will:\")\n",
" print(\" - Verify your SAML settings match the existing configuration\")\n",
" print(\" - Check if authentication is working correctly\")\n",
" print(\" - Identify any configuration mismatches\")\n",
" print(\"\\n💡 If you need to CHANGE auth configuration, use Helm upgrade separately\")\n",
" print(\" This notebook only validates, it does not modify configuration\")\n",
"else:\n",
" print(\"\\n\" + \"=\" * 60)\n",
" print(\"️ Auth not yet configured\")\n",
" print(\"=\" * 60)\n",
" print(\"\\nThis validation will:\")\n",
" print(\" - Verify your SAML settings are correct\")\n",
" print(\" - Check prerequisites (DNS, TLS, ingress)\")\n",
" print(\" - Validate IdP metadata\")\n",
" print(\"\\n💡 After validation passes, apply configuration using Helm upgrade\")\n",
" print(\" This notebook only validates, it does not apply configuration\")\n",
"\n",
"ok(\"Environment safety check complete\")\n",
"print(\"\\n✅ Safe to proceed with validation\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Preflight Checks\n",
"\n",
"Same as OIDC notebook - verify tools, kubectl context, namespace, and Helm release.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._validation import ok, warn\n",
"from shared._k8s_helpers import require_namespace, namespace_exists\n",
"from shared._shell import run\n",
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(\"### Preflight Checks\\n\")\n",
"\n",
"# Check kubectl is available\n",
"print(\"1. Checking kubectl...\")\n",
"result = run([\"kubectl\", \"version\", \"--client\", \"--short\"], check=False, stream=False)\n",
"if result.returncode == 0:\n",
" ok(\"kubectl is available\")\n",
" print(f\" {result.stdout.strip()}\")\n",
"else:\n",
" raise RuntimeError(\"❌ kubectl is not available or not working\")\n",
"\n",
"# Check kubectl context\n",
"print(\"\\n2. Checking kubectl context...\")\n",
"result = run([\"kubectl\", \"config\", \"current-context\"], check=False, stream=False)\n",
"if result.returncode == 0:\n",
" context = result.stdout.strip()\n",
" ok(f\"Current context: {context}\")\n",
"else:\n",
" warn(\"Could not determine kubectl context\")\n",
"\n",
"# Check namespace exists\n",
"print(f\"\\n3. Checking namespace '{namespace}'...\")\n",
"if namespace_exists(namespace):\n",
" ok(f\"Namespace '{namespace}' exists\")\n",
"else:\n",
" raise RuntimeError(f\"❌ Namespace '{namespace}' does not exist. Complete Module 1 first.\")\n",
"\n",
"# Check Helm release\n",
"print(f\"\\n4. Checking Helm release...\")\n",
"helm_release = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"result = run(\n",
" [\"helm\", \"list\", \"-n\", namespace, \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" try:\n",
" releases = json.loads(result.stdout)\n",
" release_names = [r.get(\"name\") for r in releases]\n",
" if helm_release in release_names:\n",
" ok(f\"Helm release '{helm_release}' exists\")\n",
" else:\n",
" raise RuntimeError(f\"❌ Helm release '{helm_release}' not found\")\n",
" except json.JSONDecodeError:\n",
" warn(\"Could not parse Helm release list\")\n",
"\n",
"ok(\"Preflight checks complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Inspect Current Auth Configuration\n",
"\n",
"Examine the current authentication configuration without leaking secrets.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Inspecting Current Auth Configuration\\n\")\n",
"\n",
"# Check for auth-related environment variables in deployments\n",
"print(\"1. Checking deployment environment variables...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" deployments = json.loads(result.stdout)\n",
" auth_vars_found = False\n",
" \n",
" for deployment in deployments.get(\"items\", []):\n",
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
" containers = deployment.get(\"spec\", {}).get(\"template\", {}).get(\"spec\", {}).get(\"containers\", [])\n",
" \n",
" for container in containers:\n",
" env_vars = container.get(\"env\", [])\n",
" auth_env = [e for e in env_vars if any(keyword in e.get(\"name\", \"\").upper() for keyword in [\"AUTH\", \"SAML\", \"SSO\"])]\n",
" \n",
" if auth_env:\n",
" auth_vars_found = True\n",
" print(f\"\\n Deployment: {name}\")\n",
" for env in auth_env:\n",
" env_name = env.get(\"name\", \"\")\n",
" if \"SECRET\" in env_name.upper() or \"PASSWORD\" in env_name.upper():\n",
" print(f\" - {env_name}: <redacted>\")\n",
" elif env.get(\"valueFrom\"):\n",
" print(f\" - {env_name}: <from secret/configmap>\")\n",
" \n",
" if not auth_vars_found:\n",
" warn(\"No auth-related environment variables found\")\n",
"\n",
"# Check for auth-related secrets (names only)\n",
"print(\"\\n2. Checking for auth-related secrets...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" auth_secrets = [s for s in secrets.get(\"items\", [])\n",
" if any(keyword in s.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
" for keyword in [\"auth\", \"saml\", \"sso\"])]\n",
" \n",
" if auth_secrets:\n",
" ok(f\"Found {len(auth_secrets)} auth-related secret(s)\")\n",
" for secret in auth_secrets:\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" print(f\" - {name} (values not displayed)\")\n",
"\n",
"ok(\"Auth configuration inspection complete (no secrets displayed)\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Validate Ingress/TLS Preconditions\n",
"\n",
"Verify domain resolution, HTTPS accessibility, and TLS certificate validity.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import socket\n",
"import ssl\n",
"import requests\n",
"\n",
"domain = config[\"LANGSMITH_DOMAIN\"]\n",
"print(f\"### Validating Ingress/TLS for {domain}\\n\")\n",
"\n",
"# 1. DNS Resolution\n",
"print(\"1. Checking DNS resolution...\")\n",
"try:\n",
" ip_address = socket.gethostbyname(domain)\n",
" ok(f\"Domain resolves to: {ip_address}\")\n",
"except socket.gaierror as e:\n",
" raise RuntimeError(f\"❌ DNS resolution failed for {domain}: {e}\")\n",
"\n",
"# 2. HTTPS Reachability\n",
"print(f\"\\n2. Checking HTTPS reachability...\")\n",
"https_url = f\"https://{domain}\"\n",
"\n",
"try:\n",
" response = requests.get(https_url, timeout=10, verify=True, allow_redirects=True)\n",
" ok(f\"HTTPS accessible: {response.status_code}\")\n",
"except requests.exceptions.SSLError as e:\n",
" warn(f\"SSL verification failed: {e}\")\n",
" print(\" 💡 Certificate may be self-signed or invalid\")\n",
"except requests.exceptions.RequestException as e:\n",
" raise RuntimeError(f\"❌ Could not connect to {domain}: {e}\")\n",
"\n",
"# 3. TLS Certificate Check\n",
"print(f\"\\n3. Checking TLS certificate...\")\n",
"try:\n",
" context = ssl.create_default_context()\n",
" with socket.create_connection((domain, 443), timeout=10) as sock:\n",
" with context.wrap_socket(sock, server_hostname=domain) as ssock:\n",
" cert = ssock.getpeercert()\n",
" subject = dict(x[0] for x in cert['subject'])\n",
" \n",
" import datetime\n",
" not_after = datetime.datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z')\n",
" days_until_expiry = (not_after - datetime.datetime.now()).days\n",
" \n",
" if days_until_expiry > 30:\n",
" ok(f\"Certificate valid for {days_until_expiry} more days\")\n",
" elif days_until_expiry > 0:\n",
" warn(f\"Certificate expires in {days_until_expiry} days\")\n",
" else:\n",
" raise RuntimeError(f\"❌ Certificate expired\")\n",
"except Exception as e:\n",
" warn(f\"Could not verify TLS certificate: {e}\")\n",
"\n",
"ok(\"Ingress/TLS preconditions validated\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. SAML Metadata Validation\n",
"\n",
"Validate SAML metadata URL reachability, XML parsing, and required attributes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET\n",
"import requests\n",
"\n",
"print(\"### Validating SAML Metadata\\n\")\n",
"\n",
"metadata_url = config.get(\"SAML_METADATA_URL\", \"\")\n",
"metadata_file = os.environ.get(\"SAML_METADATA_FILE\", \"\").strip()\n",
"\n",
"# 1. Fetch or Load Metadata\n",
"print(\"1. Loading SAML metadata...\")\n",
"metadata_xml = None\n",
"\n",
"if metadata_url:\n",
" print(f\" Fetching from URL: {metadata_url}\")\n",
" try:\n",
" response = requests.get(metadata_url, timeout=10, verify=True)\n",
" if response.status_code == 200:\n",
" ok(\"Metadata URL accessible\")\n",
" metadata_xml = response.text\n",
" else:\n",
" raise RuntimeError(f\"❌ Metadata URL returned {response.status_code}\")\n",
" except requests.exceptions.RequestException as e:\n",
" raise RuntimeError(f\"❌ Could not fetch metadata URL: {e}\")\n",
"elif metadata_file:\n",
" print(f\" Loading from file: {metadata_file}\")\n",
" try:\n",
" with open(metadata_file, \"r\") as f:\n",
" metadata_xml = f.read()\n",
" ok(\"Metadata file loaded\")\n",
" except Exception as e:\n",
" raise RuntimeError(f\"❌ Could not load metadata file: {e}\")\n",
"\n",
"if not metadata_xml:\n",
" raise RuntimeError(\"❌ No metadata XML available\")\n",
"\n",
"# 2. Parse XML\n",
"print(\"\\n2. Parsing SAML metadata XML...\")\n",
"try:\n",
" # Register namespaces\n",
" namespaces = {\n",
" 'md': 'urn:oasis:names:tc:SAML:2.0:metadata',\n",
" 'ds': 'http://www.w3.org/2000/09/xmldsig#',\n",
" }\n",
" \n",
" root = ET.fromstring(metadata_xml)\n",
" ok(\"Metadata XML is valid\")\n",
"except ET.ParseError as e:\n",
" raise RuntimeError(f\"❌ Invalid XML: {e}\")\n",
"\n",
"# 3. Extract Entity Descriptor\n",
"print(\"\\n3. Extracting entity information...\")\n",
"entity_id = None\n",
"try:\n",
" entity_descriptor = root.find('.//md:EntityDescriptor', namespaces)\n",
" if entity_descriptor is not None:\n",
" entity_id = entity_descriptor.get('entityID')\n",
" if entity_id:\n",
" ok(f\"Entity ID found: {entity_id}\")\n",
" if config.get(\"SAML_ENTITY_ID\") and entity_id != config.get(\"SAML_ENTITY_ID\"):\n",
" warn(f\"Entity ID mismatch: config={config.get('SAML_ENTITY_ID')}, metadata={entity_id}\")\n",
" else:\n",
" warn(\"Entity ID not found in metadata\")\n",
" else:\n",
" warn(\"EntityDescriptor not found in metadata\")\n",
"except Exception as e:\n",
" warn(f\"Could not extract entity information: {e}\")\n",
"\n",
"# 4. Extract IDP SSO Descriptor\n",
"print(\"\\n4. Extracting IdP SSO descriptor...\")\n",
"try:\n",
" idp_sso = root.find('.//md:IDPSSODescriptor', namespaces)\n",
" if idp_sso is not None:\n",
" ok(\"IdP SSO descriptor found\")\n",
" \n",
" # Extract SSO endpoints\n",
" sso_endpoints = idp_sso.findall('.//md:SingleSignOnService', namespaces)\n",
" if sso_endpoints:\n",
" print(f\" Found {len(sso_endpoints)} SSO endpoint(s):\")\n",
" for endpoint in sso_endpoints:\n",
" location = endpoint.get('Location', '')\n",
" binding = endpoint.get('Binding', '')\n",
" print(f\" - {binding}: {location}\")\n",
" else:\n",
" warn(\"No SSO endpoints found\")\n",
" else:\n",
" warn(\"IDPSSODescriptor not found - may not be IdP metadata\")\n",
"except Exception as e:\n",
" warn(f\"Could not extract IdP SSO descriptor: {e}\")\n",
"\n",
"# 5. Extract Certificates\n",
"print(\"\\n5. Checking for signing certificates...\")\n",
"try:\n",
" certificates = root.findall('.//ds:X509Certificate', namespaces)\n",
" if certificates:\n",
" ok(f\"Found {len(certificates)} certificate(s)\")\n",
" for i, cert in enumerate(certificates):\n",
" cert_text = cert.text.strip() if cert.text else \"\"\n",
" if cert_text:\n",
" print(f\" Certificate {i+1}: {len(cert_text)} characters\")\n",
" else:\n",
" warn(f\"Certificate {i+1} is empty\")\n",
" else:\n",
" warn(\"No signing certificates found\")\n",
" print(\" 💡 IdP must provide signing certificate for assertion validation\")\n",
"except Exception as e:\n",
" warn(f\"Could not extract certificates: {e}\")\n",
"\n",
"# 6. Validate Required Attributes\n",
"print(\"\\n6. Validating attribute configuration...\")\n",
"print(f\" Expected email attribute: {config['SAML_EMAIL_ATTRIBUTE']}\")\n",
"print(f\" Expected name attribute: {config['SAML_NAME_ATTRIBUTE']}\")\n",
"print(f\" Expected groups attribute: {config['SAML_GROUPS_ATTRIBUTE']}\")\n",
"\n",
"ok(\"SAML metadata validation complete\")\n",
"print(\"\\n💡 Verify your IdP sends these attributes in SAML assertions:\")\n",
"print(f\" - {config['SAML_EMAIL_ATTRIBUTE']} (required)\")\n",
"print(f\" - {config['SAML_NAME_ATTRIBUTE']} (optional)\")\n",
"print(f\" - {config['SAML_GROUPS_ATTRIBUTE']} (optional, for role mapping)\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"\n",
"print(\"### Checking for Common SAML Failure Signatures\\n\")\n",
"\n",
"# Get pod names\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0 and result.stdout.strip():\n",
" pod_names = result.stdout.strip().split()\n",
" api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
" \n",
" if not api_pods:\n",
" api_pods = pod_names[:2]\n",
" \n",
" failure_patterns = {\n",
" \"Missing Attributes\": [\n",
" \"missing attribute\",\n",
" \"attribute not found\",\n",
" \"email attribute\",\n",
" \"required attribute\",\n",
" ],\n",
" \"Signature Validation\": [\n",
" \"signature validation failed\",\n",
" \"invalid signature\",\n",
" \"certificate\",\n",
" \"signing key\",\n",
" ],\n",
" \"Assertion Expired\": [\n",
" \"assertion expired\",\n",
" \"notonorafter\",\n",
" \"clock skew\",\n",
" \"timeout\",\n",
" ],\n",
" \"Entity ID Mismatch\": [\n",
" \"entity id\",\n",
" \"issuer mismatch\",\n",
" \"audience\",\n",
" ],\n",
" \"Metadata Issues\": [\n",
" \"metadata\",\n",
" \"xml parse\",\n",
" \"invalid metadata\",\n",
" ],\n",
" }\n",
" \n",
" found_issues = []\n",
" \n",
" for pod_name in api_pods[:2]:\n",
" try:\n",
" log_result = run(\n",
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=200\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if log_result.returncode == 0:\n",
" logs_lower = log_result.stdout.lower()\n",
" \n",
" for category, patterns in failure_patterns.items():\n",
" for pattern in patterns:\n",
" if pattern in logs_lower:\n",
" # Check if it's actually an error (not just a log message)\n",
" lines = log_result.stdout.split(\"\\n\")\n",
" error_lines = [line for line in lines \n",
" if pattern in line.lower() \n",
" and any(err in line.lower() for err in [\"error\", \"fail\", \"invalid\", \"missing\"])]\n",
" \n",
" if error_lines and category not in found_issues:\n",
" found_issues.append(category)\n",
" warn(f\"Potential {category} issue found in {pod_name} logs\")\n",
" print(f\" Pattern: '{pattern}'\")\n",
" # Don't print full log line as it may contain sensitive data\n",
" break\n",
" except Exception:\n",
" pass\n",
" \n",
" if not found_issues:\n",
" ok(\"No common SAML failure signatures found in logs\")\n",
" else:\n",
" print(f\"\\n💡 Found potential issues: {', '.join(found_issues)}\")\n",
" print(\" Review logs manually for details:\")\n",
" print(f\" kubectl logs <pod-name> -n {namespace} --tail=100 | grep -i saml\")\n",
"else:\n",
" warn(\"Could not retrieve pod names\")\n",
"\n",
"print(\"\\n💡 Common SAML failure causes:\")\n",
"print(\" 1. Missing required attributes in assertion\")\n",
"print(\" 2. Certificate mismatch or expired certificate\")\n",
"print(\" 3. Clock skew between LangSmith and IdP\")\n",
"print(\" 4. Entity ID mismatch\")\n",
"print(\" 5. Attribute name mismatch\")\n",
"print(\"\\n See docs/shared/auth_troubleshooting.md for detailed troubleshooting\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Deployment Verification & Support Bundle\n",
"\n",
"Same as OIDC notebook - verify pods, check logs, collect support bundle.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._k8s_helpers import get_pods, wait_for_deployments_ready\n",
"from datetime import datetime\n",
"import requests\n",
"\n",
"print(\"### Deployment Verification\\n\")\n",
"\n",
"# 1. Pod Readiness\n",
"print(\"1. Checking pod readiness...\")\n",
"require_namespace(namespace)\n",
"\n",
"try:\n",
" wait_for_deployments_ready(namespace, timeout=\"5m\")\n",
" ok(\"All deployments ready\")\n",
"except Exception as e:\n",
" warn(f\"Some deployments may not be ready: {e}\")\n",
"\n",
"pods_output = get_pods(namespace)\n",
"print(\"\\nPod Status:\")\n",
"print(pods_output)\n",
"\n",
"# 2. Test Endpoint Auth Behavior\n",
"print(f\"\\n2. Testing endpoint auth behavior...\")\n",
"domain = config[\"LANGSMITH_DOMAIN\"]\n",
"test_url = f\"https://{domain}/api/v1/me\"\n",
"\n",
"try:\n",
" response = requests.get(test_url, timeout=10, verify=True, allow_redirects=False)\n",
" if response.status_code in [401, 403]:\n",
" ok(f\"Endpoint requires authentication ({response.status_code})\")\n",
" elif response.status_code in [301, 302, 307, 308]:\n",
" redirect_location = response.headers.get(\"Location\", \"\")\n",
" if \"login\" in redirect_location.lower() or \"saml\" in redirect_location.lower():\n",
" ok(\"Endpoint redirects to authentication\")\n",
" else:\n",
" warn(f\"Endpoint redirects but not to auth: {redirect_location}\")\n",
" else:\n",
" warn(f\"Unexpected status code: {response.status_code}\")\n",
"except requests.exceptions.RequestException as e:\n",
" warn(f\"Could not test endpoint: {e}\")\n",
"\n",
"# 3. Support Bundle\n",
"print(f\"\\n3. Collecting support bundle...\")\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"support_dir = artifacts_dir / f\"saml-support-{timestamp}\"\n",
"support_dir.mkdir(exist_ok=True)\n",
"\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0 and result.stdout.strip():\n",
" pod_names = result.stdout.strip().split()\n",
" api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
" \n",
" for pod_name in (api_pods[:3] if api_pods else pod_names[:3]):\n",
" try:\n",
" log_result = run(\n",
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=200\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if log_result.returncode == 0:\n",
" log_file = support_dir / f\"{pod_name}-logs.txt\"\n",
" with open(log_file, \"w\") as f:\n",
" f.write(log_result.stdout)\n",
" print(f\" ✅ Saved logs for {pod_name}\")\n",
" except Exception:\n",
" pass\n",
"\n",
"ok(f\"Support bundle saved to: {support_dir}\")\n",
"print(\"\\n💡 Include pod logs and configuration when contacting support\")\n",
"print(\" See docs/shared/auth_troubleshooting.md for complete bundle procedure\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,933 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 3: Operations Sanity Checks\n",
"\n",
"## Overview\n",
"\n",
"This notebook performs read-only validation and signal checks for your LangSmith production deployment. It assumes Module 1 and Module 2 are complete.\n",
"\n",
"**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. It performs validation checks only and does NOT modify any infrastructure, Helm values, secrets, deployments, or resources. All operations are safe to run against production environments.\n",
"\n",
"**Prerequisites:**\n",
"- Module 1 deployment is healthy and accessible\n",
"- Module 2 authentication is configured\n",
"- kubectl access to the cluster\n",
"- Read access to cloud provider APIs (for managed services)\n",
"\n",
"## What We'll Check\n",
"\n",
"1. ✅ Configuration (environment variables, redacted)\n",
"2. ✅ Preflight (kubectl context, namespace, deployments)\n",
"3. ✅ Current state snapshot (pods, services, events)\n",
"4. ✅ Early warning signals (restarts, pending pods, resource saturation)\n",
"5. ✅ Storage/durability checks (blob storage, backups)\n",
"6. ✅ Sidecar checks (Istio, if applicable)\n",
"\n",
"**Estimated time:** 15-20 minutes\n",
"\n",
"**Important:** \n",
"- This notebook is read-only and safe to run. It does not modify any resources.\n",
"- All operations are read-only: `kubectl get`, `kubectl logs`, `kubectl top`, `helm get values`\n",
"- Artifacts are saved locally only (no cluster modifications)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"possible_paths = [\n",
" Path.cwd().parent, # If cwd is module-3, go up one level to notebooks\n",
" Path.cwd(), # If cwd is already notebooks\n",
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Safety Check: Verify Environment\n",
"\n",
"Before proceeding with validation, confirm you're working with the correct environment. This notebook is read-only and safe for production use.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Safety check: Verify environment and confirm read-only operations\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"### Environment Safety Check\\n\")\n",
"\n",
"# Show current environment\n",
"provider_display = provider.upper()\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
" print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
"elif provider == \"azure\":\n",
" print(f\"Subscription ID: {identity.get('SubscriptionId', identity.get('Account', 'N/A'))}\")\n",
" print(f\"Subscription Name: {identity.get('SubscriptionName', 'N/A')}\")\n",
"\n",
"print(\"\\n\" + \"=\" * 60)\n",
"print(\"⚠️ IMPORTANT: This notebook is READ-ONLY\")\n",
"print(\"=\" * 60)\n",
"print(\"\\nThis notebook will:\")\n",
"print(\" ✅ Validate production readiness\")\n",
"print(\" ✅ Check deployment status and health\")\n",
"print(\" ✅ Inspect resource usage and signals\")\n",
"print(\" ✅ Verify storage and backup configuration\")\n",
"print(\" ✅ Collect state snapshots (saved locally)\")\n",
"print(\"\\nThis notebook will NOT:\")\n",
"print(\" ❌ Modify Helm values or releases\")\n",
"print(\" ❌ Create or update secrets\")\n",
"print(\" ❌ Restart pods or deployments\")\n",
"print(\" ❌ Change any infrastructure\")\n",
"print(\" ❌ Modify any cluster resources\")\n",
"print(\"\\nAll operations are read-only:\")\n",
"print(\" - kubectl get (read resources)\")\n",
"print(\" - kubectl logs (read logs)\")\n",
"print(\" - kubectl top (read metrics)\")\n",
"print(\" - helm get values (read configuration)\")\n",
"print(\" - Write artifacts to local directory only\")\n",
"print(\"\\n\" + \"=\" * 60)\n",
"\n",
"ok(\"Environment safety check complete\")\n",
"print(\"\\n✅ Safe to proceed with read-only validation\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration\n",
"\n",
"Load and validate configuration from environment variables. All secrets are redacted in output.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"from shared._validation import require_env, print_config, redact, ok, warn\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"\n",
"# Required configuration variables\n",
"required_vars = [\n",
" \"NAMESPACE\",\n",
" \"CLUSTER_NAME\",\n",
"]\n",
"\n",
"# Optional but recommended\n",
"optional_vars = [\n",
" \"HELM_RELEASE\",\n",
" \"LANGSMITH_DOMAIN\",\n",
"]\n",
"\n",
"print(\"### Loading Configuration\\n\")\n",
"\n",
"# Load required variables\n",
"config = {}\n",
"missing = []\n",
"\n",
"for var in required_vars:\n",
" value = os.environ.get(var, \"\").strip()\n",
" if not value:\n",
" missing.append(var)\n",
" config[var] = value\n",
"\n",
"if missing:\n",
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
" f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
"\n",
"# Load optional variables\n",
"for var in optional_vars:\n",
" config[var] = os.environ.get(var, \"\").strip()\n",
"\n",
"# Set defaults\n",
"if not config.get(\"HELM_RELEASE\"):\n",
" config[\"HELM_RELEASE\"] = \"langsmith\"\n",
"\n",
"# Print configuration (redacted)\n",
"print_config(config, redact_keys=set())\n",
"\n",
"# Show cloud provider info\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"provider_display = provider.upper()\n",
"print(f\"\\n### Current {provider_display} Session\")\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
"elif provider == \"azure\":\n",
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"\")\n",
" print(f\"Subscription ID: {subscription_id}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Preflight Checks\n",
"\n",
"Verify kubectl context, namespace exists, and deployments are ready.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._validation import ok, warn\n",
"from shared._k8s_helpers import require_namespace, namespace_exists, wait_for_deployments_ready\n",
"from shared._shell import run\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"helm_release = config[\"HELM_RELEASE\"]\n",
"\n",
"print(\"### Preflight Checks\\n\")\n",
"\n",
"# Check kubectl is available\n",
"print(\"1. Checking kubectl...\")\n",
"result = run([\"kubectl\", \"version\", \"--client\", \"--short\"], check=False, stream=False)\n",
"if result.returncode == 0:\n",
" ok(\"kubectl is available\")\n",
" print(f\" {result.stdout.strip()}\")\n",
"else:\n",
" raise RuntimeError(\"❌ kubectl is not available or not working\")\n",
"\n",
"# Check kubectl context\n",
"print(\"\\n2. Checking kubectl context...\")\n",
"result = run([\"kubectl\", \"config\", \"current-context\"], check=False, stream=False)\n",
"if result.returncode == 0:\n",
" context = result.stdout.strip()\n",
" ok(f\"Current context: {context}\")\n",
"else:\n",
" warn(\"Could not determine kubectl context\")\n",
" print(\" 💡 Run: kubectl config get-contexts\")\n",
"\n",
"# Check namespace exists\n",
"print(f\"\\n3. Checking namespace '{namespace}'...\")\n",
"if namespace_exists(namespace):\n",
" ok(f\"Namespace '{namespace}' exists\")\n",
"else:\n",
" raise RuntimeError(f\"❌ Namespace '{namespace}' does not exist. Complete Module 1 first.\")\n",
"\n",
"# Check deployments are ready\n",
"print(f\"\\n4. Checking deployments...\")\n",
"require_namespace(namespace)\n",
"\n",
"try:\n",
" wait_for_deployments_ready(namespace, timeout=\"2m\")\n",
" ok(\"All deployments ready\")\n",
"except Exception as e:\n",
" warn(f\"Some deployments may not be ready: {e}\")\n",
" print(\" 💡 Check pod status manually: kubectl get pods -n {namespace}\")\n",
"\n",
"ok(\"Preflight checks complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Snapshot Current State\n",
"\n",
"Capture current cluster state for baseline reference.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"from shared._k8s_helpers import get_pods\n",
"from shared._shell import run\n",
"\n",
"print(\"### Snapshotting Current State\\n\")\n",
"\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"snapshot_dir = artifacts_dir / f\"ops-snapshot-{timestamp}\"\n",
"snapshot_dir.mkdir(exist_ok=True)\n",
"\n",
"print(f\"Saving snapshot to: {snapshot_dir}\\n\")\n",
"\n",
"# 1. Get all resources\n",
"print(\"1. Capturing all resources...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(snapshot_dir / \"all-resources.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(\"All resources captured\")\n",
" print(result.stdout)\n",
"else:\n",
" warn(\"Could not capture all resources\")\n",
"\n",
"# 2. Get events (sorted by timestamp)\n",
"print(\"\\n2. Capturing recent events...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(snapshot_dir / \"events.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(\"Events captured\")\n",
" \n",
" # Show recent events\n",
" lines = result.stdout.strip().split(\"\\n\")\n",
" if len(lines) > 1:\n",
" print(f\"\\n Last 10 events:\")\n",
" for line in lines[-10:]:\n",
" print(f\" {line}\")\n",
"else:\n",
" warn(\"Could not capture events\")\n",
"\n",
"# 3. Get node and pod resource usage (if metrics available)\n",
"print(\"\\n3. Checking resource usage...\")\n",
"result = run(\n",
" [\"kubectl\", \"top\", \"nodes\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(snapshot_dir / \"node-usage.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(\"Node usage captured\")\n",
" print(result.stdout)\n",
"else:\n",
" warn(\"Node metrics not available (metrics-server may not be installed)\")\n",
"\n",
"result = run(\n",
" [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(snapshot_dir / \"pod-usage.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" ok(\"Pod usage captured\")\n",
" print(result.stdout)\n",
"else:\n",
" warn(\"Pod metrics not available\")\n",
"\n",
"# 4. Check for data store services\n",
"print(\"\\n4. Checking data store services...\")\n",
"data_stores = {\n",
" \"postgres\": [\"postgres\", \"postgresql\", \"database\", \"db\"],\n",
" \"redis\": [\"redis\", \"cache\"],\n",
" \"clickhouse\": [\"clickhouse\", \"ch\"],\n",
"}\n",
"\n",
"found_stores = []\n",
"for store_type, keywords in data_stores.items():\n",
" result = run(\n",
" [\"kubectl\", \"get\", \"svc\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" services = json.loads(result.stdout)\n",
" for svc in services.get(\"items\", []):\n",
" name = svc.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
" if any(keyword in name for keyword in keywords):\n",
" found_stores.append((store_type, name))\n",
" print(f\" ✅ Found {store_type} service: {name}\")\n",
"\n",
"if found_stores:\n",
" ok(f\"Found {len(found_stores)} data store service(s)\")\n",
"else:\n",
" warn(\"No in-cluster data stores found (may be using managed services)\")\n",
"\n",
"ok(f\"State snapshot saved to: {snapshot_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._validation import ok, warn\n",
"from shared._shell import run\n",
"\n",
"print(\"### Early Warning Signal Checks\\n\")\n",
"\n",
"# 1. Check pod restarts\n",
"print(\"1. Checking pod restart counts...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"critical_restarts = []\n",
"warning_restarts = []\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" for pod in pods.get(\"items\", []):\n",
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
" status = pod.get(\"status\", {})\n",
" phase = status.get(\"phase\", \"\")\n",
" \n",
" # Check restart count\n",
" container_statuses = status.get(\"containerStatuses\", [])\n",
" for cs in container_statuses:\n",
" restart_count = cs.get(\"restartCount\", 0)\n",
" if restart_count > 5:\n",
" critical_restarts.append((name, restart_count))\n",
" elif restart_count > 2:\n",
" warning_restarts.append((name, restart_count))\n",
" \n",
" # Check pod phase\n",
" if phase == \"CrashLoopBackOff\":\n",
" critical_restarts.append((name, \"CrashLoopBackOff\"))\n",
" elif phase == \"Pending\":\n",
" # Check how long it's been pending\n",
" conditions = status.get(\"conditions\", [])\n",
" for cond in conditions:\n",
" if cond.get(\"type\") == \"PodScheduled\" and cond.get(\"status\") != \"True\":\n",
" # Pod is pending\n",
" warning_restarts.append((name, \"Pending\"))\n",
"\n",
"if critical_restarts:\n",
" warn(f\"❌ Critical: Found {len(critical_restarts)} pod(s) with critical issues\")\n",
" for pod_name, issue in critical_restarts:\n",
" print(f\" - {pod_name}: {issue}\")\n",
" print(\"\\n 💡 Action required: Check pod logs and events\")\n",
"elif warning_restarts:\n",
" warn(f\"Found {len(warning_restarts)} pod(s) with warnings\")\n",
" for pod_name, issue in warning_restarts:\n",
" print(f\" - {pod_name}: {issue}\")\n",
" print(\"\\n 💡 Monitor these pods closely\")\n",
"else:\n",
" ok(\"No critical pod restart issues found\")\n",
"\n",
"# 2. Check for pending pods\n",
"print(\"\\n2. Checking for pending pods...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"--field-selector=status.phase=Pending\", \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" pending = pods.get(\"items\", [])\n",
" if pending:\n",
" warn(f\"Found {len(pending)} pending pod(s)\")\n",
" for pod in pending:\n",
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
" print(f\" - {name}\")\n",
" print(\"\\n 💡 Check events: kubectl describe pod <name> -n {namespace}\")\n",
" else:\n",
" ok(\"No pending pods\")\n",
"\n",
"# 3. Check resource saturation (if metrics available)\n",
"print(\"\\n3. Checking resource saturation...\")\n",
"result = run(\n",
" [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" lines = result.stdout.strip().split(\"\\n\")[1:] # Skip header\n",
" saturated_pods = []\n",
" \n",
" for line in lines:\n",
" parts = line.split()\n",
" if len(parts) >= 3:\n",
" pod_name = parts[0]\n",
" cpu = parts[1]\n",
" memory = parts[2]\n",
" \n",
" # Parse CPU (handle \"m\" suffix for millicores)\n",
" try:\n",
" if cpu.endswith(\"m\"):\n",
" cpu_val = int(cpu[:-1])\n",
" else:\n",
" cpu_val = int(float(cpu.replace(\"Gi\", \"\").replace(\"Mi\", \"\")))\n",
" \n",
" # Parse memory (handle \"Mi\" or \"Gi\" suffix)\n",
" if \"Gi\" in memory:\n",
" mem_val = float(memory.replace(\"Gi\", \"\")) * 1024\n",
" elif \"Mi\" in memory:\n",
" mem_val = float(memory.replace(\"Mi\", \"\"))\n",
" else:\n",
" mem_val = 0\n",
" \n",
" # Check thresholds (simplified - would need requests/limits for accurate %)\n",
" # For now, just flag very high absolute values\n",
" if cpu_val > 2000: # > 2 cores\n",
" saturated_pods.append((pod_name, f\"High CPU: {cpu}\"))\n",
" if mem_val > 4096: # > 4 Gi\n",
" saturated_pods.append((pod_name, f\"High Memory: {memory}\"))\n",
" except (ValueError, IndexError):\n",
" pass\n",
" \n",
" if saturated_pods:\n",
" warn(f\"Found {len(saturated_pods)} pod(s) with high resource usage\")\n",
" for pod_name, issue in saturated_pods:\n",
" print(f\" - {pod_name}: {issue}\")\n",
" print(\"\\n 💡 Review resource requests/limits and consider scaling\")\n",
" else:\n",
" ok(\"No obvious resource saturation detected\")\n",
"else:\n",
" warn(\"Resource metrics not available (cannot check saturation)\")\n",
"\n",
"# 4. Check logs for common failure patterns\n",
"print(\"\\n4. Checking logs for common failure patterns...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"failure_patterns = {\n",
" \"connection refused\": [],\n",
" \"timeout\": [],\n",
" \"out of memory\": [],\n",
" \"database error\": [],\n",
"}\n",
"\n",
"if result.returncode == 0 and result.stdout.strip():\n",
" pod_names = result.stdout.strip().split()\n",
" # Check API and worker pods (most likely to have issues)\n",
" api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
" worker_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"worker\", \"processor\"])]\n",
" \n",
" pods_to_check = (api_pods[:2] if api_pods else []) + (worker_pods[:2] if worker_pods else [])\n",
" \n",
" for pod_name in pods_to_check[:4]: # Check up to 4 pods\n",
" try:\n",
" log_result = run(\n",
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=50\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if log_result.returncode == 0:\n",
" logs_lower = log_result.stdout.lower()\n",
" \n",
" for pattern, matches in failure_patterns.items():\n",
" if pattern in logs_lower:\n",
" # Check if it's actually an error (not just a log message)\n",
" lines = log_result.stdout.split(\"\\n\")\n",
" error_lines = [line for line in lines \n",
" if pattern in line.lower() \n",
" and any(err in line.lower() for err in [\"error\", \"fail\", \"refused\", \"timeout\"])]\n",
" \n",
" if error_lines:\n",
" matches.append((pod_name, len(error_lines)))\n",
" except Exception:\n",
" pass\n",
" \n",
" found_issues = False\n",
" for pattern, matches in failure_patterns.items():\n",
" if matches:\n",
" found_issues = True\n",
" warn(f\"Found '{pattern}' pattern in {len(matches)} pod(s)\")\n",
" for pod_name, count in matches:\n",
" print(f\" - {pod_name}: {count} occurrence(s)\")\n",
" \n",
" if not found_issues:\n",
" ok(\"No common failure patterns found in recent logs\")\n",
" else:\n",
" print(\"\\n 💡 Review pod logs for details: kubectl logs <pod> -n {namespace} --tail=100\")\n",
"else:\n",
" warn(\"Could not retrieve pod names for log checking\")\n",
"\n",
"ok(\"Early warning signal checks complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._validation import ok, warn\n",
"from shared._shell import run\n",
"\n",
"print(\"### Storage / Durability Checks\\n\")\n",
"\n",
"# 1. Check blob storage configuration\n",
"print(\"1. Checking blob storage configuration...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"blob_storage_configured = False\n",
"blob_storage_provider = None\n",
"\n",
"if result.returncode == 0:\n",
" deployments = json.loads(result.stdout)\n",
" for deployment in deployments.get(\"items\", []):\n",
" containers = deployment.get(\"spec\", {}).get(\"template\", {}).get(\"spec\", {}).get(\"containers\", [])\n",
" \n",
" for container in containers:\n",
" env_vars = container.get(\"env\", [])\n",
" for env in env_vars:\n",
" env_name = env.get(\"name\", \"\").upper()\n",
" env_value = env.get(\"value\", \"\")\n",
" \n",
" # Check for blob storage configuration\n",
" if \"BLOB\" in env_name or \"S3\" in env_name or \"STORAGE\" in env_name:\n",
" if \"PROVIDER\" in env_name:\n",
" blob_storage_provider = env_value\n",
" blob_storage_configured = True\n",
" elif env_value and env_value not in [\"local\", \"filesystem\", \"\"]:\n",
" blob_storage_configured = True\n",
"\n",
"# Also check Helm values if accessible\n",
"helm_release = config.get(\"HELM_RELEASE\", \"langsmith\")\n",
"result = run(\n",
" [\"helm\", \"get\", \"values\", helm_release, \"-n\", namespace, \"--output\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" try:\n",
" values = json.loads(result.stdout)\n",
" values_str = str(values).lower()\n",
" \n",
" # Look for blob storage configuration\n",
" if \"blob\" in values_str or \"s3\" in values_str:\n",
" if \"local\" not in values_str and \"filesystem\" not in values_str:\n",
" blob_storage_configured = True\n",
" if \"s3\" in values_str:\n",
" blob_storage_provider = \"s3\"\n",
" elif \"azure\" in values_str:\n",
" blob_storage_provider = \"azure\"\n",
" except json.JSONDecodeError:\n",
" pass\n",
"\n",
"if blob_storage_configured:\n",
" if blob_storage_provider:\n",
" ok(f\"Blob storage configured: {blob_storage_provider}\")\n",
" else:\n",
" ok(\"Blob storage appears configured (provider not detected)\")\n",
" print(\" 💡 Verify blob storage is NOT using local filesystem in production\")\n",
"else:\n",
" warn(\"❌ CRITICAL: Blob storage may not be configured\")\n",
" print(\" 💡 Blob storage is REQUIRED for production\")\n",
" print(\" 💡 Without it, ClickHouse will become unusable under load\")\n",
" print(\" 💡 Configure S3 (AWS) or Azure Blob Storage (Azure)\")\n",
" print(\" 💡 Check Helm values: helm get values <release> -n <namespace>\")\n",
"\n",
"# 2. Check for backup configuration indicators\n",
"print(\"\\n2. Checking backup configuration...\")\n",
"print(\" Note: Backup configuration verification depends on deployment type\")\n",
"\n",
"# For managed services, we can't verify from cluster\n",
"# For in-cluster services, we can check for backup jobs\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"cronjobs,jobs\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"backup_jobs_found = False\n",
"if result.returncode == 0:\n",
" resources = json.loads(result.stdout)\n",
" for item in resources.get(\"items\", []):\n",
" name = item.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
" if \"backup\" in name:\n",
" backup_jobs_found = True\n",
" print(f\" ✅ Found backup job: {name}\")\n",
"\n",
"if backup_jobs_found:\n",
" ok(\"Backup jobs found in cluster\")\n",
"else:\n",
" warn(\"No backup jobs found in cluster\")\n",
" print(\" 💡 For managed services (RDS, Azure Database), backups are automated\")\n",
" print(\" 💡 Verify backups in cloud provider console:\")\n",
" if provider == \"aws\":\n",
" print(\" - AWS RDS: Check automated backups in RDS console\")\n",
" print(\" - AWS ElastiCache: Check snapshot configuration\")\n",
" elif provider == \"azure\":\n",
" print(\" - Azure Database: Check backup configuration in Azure portal\")\n",
" print(\" - Azure Cache: Check backup configuration\")\n",
" print(\" 💡 For in-cluster ClickHouse, configure backup CronJob\")\n",
"\n",
"# 3. Check PVCs (for in-cluster storage)\n",
"print(\"\\n3. Checking persistent volume claims...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pvc\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pvcs = json.loads(result.stdout)\n",
" pvc_items = pvcs.get(\"items\", [])\n",
" \n",
" if pvc_items:\n",
" ok(f\"Found {len(pvc_items)} PVC(s)\")\n",
" for pvc in pvc_items:\n",
" name = pvc.get(\"metadata\", {}).get(\"name\", \"\")\n",
" status = pvc.get(\"status\", {}).get(\"phase\", \"\")\n",
" size = pvc.get(\"spec\", {}).get(\"resources\", {}).get(\"requests\", {}).get(\"storage\", \"N/A\")\n",
" print(f\" - {name}: {status}, {size}\")\n",
" \n",
" # Check for unbound PVCs\n",
" unbound = [pvc for pvc in pvc_items if pvc.get(\"status\", {}).get(\"phase\") != \"Bound\"]\n",
" if unbound:\n",
" warn(f\"Found {len(unbound)} unbound PVC(s)\")\n",
" print(\" 💡 Check storage class and node capacity\")\n",
" else:\n",
" print(\" No PVCs found (may be using managed services or ephemeral storage)\")\n",
"\n",
"ok(\"Storage / durability checks complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Sidecar Checks (Istio)\n",
"\n",
"Detect if Istio sidecars are present and provide guidance on log access.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from shared._validation import ok, warn\n",
"from shared._shell import run\n",
"\n",
"print(\"### Sidecar Checks (Istio)\\n\")\n",
"\n",
"# Check if Istio is installed (look for an istiod deployment in any namespace)\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"deployment\", \"-A\", \"-o\", \"jsonpath={.items[?(@.metadata.name==\\\"istiod\\\")].metadata.name}\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"istio_installed = False\n",
"if result.returncode == 0 and result.stdout.strip():\n",
"    istio_installed = True\n",
"    ok(\"Istio appears to be installed\")\n",
"else:\n",
"    print(\" Istio not detected (no istiod deployment found in any namespace)\")\n",
"    print(\" 💡 Sidecar checks will be skipped\")\n",
"\n",
"if istio_installed:\n",
"    # Check for sidecar injection in namespace\n",
"    print(\"\\n1. Checking namespace-level injection...\")\n",
"    result = run(\n",
"        [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    namespace_injection = False\n",
"    if result.returncode == 0:\n",
"        ns = json.loads(result.stdout)\n",
"        labels = ns.get(\"metadata\", {}).get(\"labels\", {})\n",
"        # Injection is enabled by istio-injection=enabled or by a revision label (istio.io/rev)\n",
"        if labels.get(\"istio-injection\") == \"enabled\" or \"istio.io/rev\" in labels:\n",
"            namespace_injection = True\n",
"            ok(\"Namespace-level sidecar injection enabled\")\n",
"            print(f\" Labels: {labels}\")\n",
"        else:\n",
"            print(\" Namespace-level injection not enabled\")\n",
"            print(\" 💡 Sidecars may be injected per-workload\")\n",
"\n",
"    # Check for sidecars in pods\n",
"    print(\"\\n2. Checking for sidecars in pods...\")\n",
"    result = run(\n",
"        [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    pods_with_sidecars = []\n",
"    pods_without_sidecars = []\n",
"\n",
"    if result.returncode == 0:\n",
"        pods = json.loads(result.stdout)\n",
"        for pod in pods.get(\"items\", []):\n",
"            name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"            containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
"            container_names = [c.get(\"name\", \"\") for c in containers]\n",
"\n",
"            if \"istio-proxy\" in container_names:\n",
"                pods_with_sidecars.append((name, container_names))\n",
"            else:\n",
"                pods_without_sidecars.append((name, container_names))\n",
"\n",
"    if pods_with_sidecars:\n",
"        ok(f\"Found {len(pods_with_sidecars)} pod(s) with sidecars\")\n",
"        print(\"\\n Pods with sidecars:\")\n",
"        for pod_name, containers in pods_with_sidecars[:5]:  # Show first 5\n",
"            app_containers = [c for c in containers if c != \"istio-proxy\"]\n",
"            print(f\" - {pod_name}: {', '.join(app_containers)} + istio-proxy\")\n",
"\n",
"        print(\"\\n 💡 Important: When fetching logs, specify container name:\")\n",
"        print(\" kubectl logs <pod> -n <namespace> -c <container-name>\")\n",
"        print(\" kubectl logs <pod> -n <namespace> -c istio-proxy # for proxy logs\")\n",
"        print(\" kubectl logs <pod> -n <namespace> --all-containers=true # for all logs\")\n",
"        print(\"\\n ⚠️ If logs appear missing, you're likely looking at the wrong container!\")\n",
"\n",
"        if pods_without_sidecars:\n",
"            warn(f\"Found {len(pods_without_sidecars)} pod(s) without sidecars\")\n",
"            print(\" 💡 These pods may need sidecar injection or are opted out\")\n",
"    else:\n",
"        if namespace_injection:\n",
"            warn(\"No pods with sidecars found (may need pod restart)\")\n",
"            print(\" 💡 Existing pods require restart to get sidecars\")\n",
"        else:\n",
"            print(\" No sidecars detected (Istio may not be used or injection disabled)\")\n",
"\n",
"ok(\"Sidecar checks complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"### ✅ Sanity Checks Complete\n",
"\n",
"This notebook has validated:\n",
"- ✅ Configuration loaded\n",
"- ✅ Preflight checks passed\n",
"- ✅ Current state snapshotted\n",
"- ✅ Early warning signals checked\n",
"- ✅ Storage/durability verified\n",
"- ✅ Sidecar status checked (if applicable)\n",
"\n",
"### 🎯 Next Steps\n",
"\n",
"1. **Review production readiness checklist:**\n",
"   - See `docs/shared/production_readiness_checklist.md`\n",
"   - Address any gaps identified\n",
"\n",
"2. **Review signals and thresholds:**\n",
"   - See `docs/shared/ops_signals_and_thresholds.md`\n",
"   - Configure alerts based on thresholds\n",
"\n",
"3. **Review sidecar documentation (if using Istio):**\n",
"   - See `docs/shared/sidecars_and_service_mesh.md`\n",
"   - Verify ServiceEntry configuration for external databases\n",
"\n",
"4. **Document your baselines:**\n",
"   - Record current resource usage\n",
"   - Document scaling thresholds\n",
"   - Update runbooks with findings\n",
"\n",
"### 📋 Common Issues Found\n",
"\n",
"If checks failed, common issues include:\n",
"- Blob storage not configured (CRITICAL for production)\n",
"- Pods restarting (check logs and resource limits)\n",
"- Pending pods (check node capacity and PVC binding)\n",
"- High resource usage (review requests/limits)\n",
"- Missing backups (verify in cloud console)\n",
"\n",
"### 🔍 Evidence for Support\n",
"\n",
"When contacting support, include:\n",
"- State snapshot from this notebook\n",
"- Pod logs (from correct container if sidecars enabled)\n",
"- Recent events\n",
"- Resource usage metrics\n",
"- Configuration summary (redacted)\n",
"\n",
"See `docs/shared/ops_signals_and_thresholds.md` for escalation evidence requirements.\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,666 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Diagnostics Baseline\n",
"\n",
"## Overview\n",
"\n",
"**This notebook teaches \"baseline first\" discipline.** Before introducing failures or debugging issues, you must capture what \"good\" looks like. This baseline becomes your reference point for all troubleshooting.\n",
"\n",
"**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. However, Module 4 failure labs will modify your environment. Ensure you've completed the safety check in `../shared/00_setup_or_resume_environment.ipynb` before proceeding.\n",
"\n",
"**What This Notebook Does:**\n",
"1. Captures cluster state snapshot (pods, services, deployments)\n",
"2. Collects recent events and resource usage\n",
"3. Runs the canonical diagnostics script\n",
"4. Performs basic health checks\n",
"5. Saves everything to a timestamped directory\n",
"\n",
"**Why This Matters:**\n",
"- You need \"before\" to compare to \"after\"\n",
"- Support will ask for baseline diagnostics\n",
"- Good debugging starts with understanding normal state\n",
"- Evidence collection is time-sensitive\n",
"\n",
"**Estimated time:** 15-20 minutes\n",
"\n",
"**Important:** \n",
"- Run this notebook BEFORE starting any failure labs. It's your evidence baseline.\n",
"- This notebook is read-only and safe to run.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"from datetime import datetime\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"possible_paths = [\n",
"    Path.cwd().parent,  # If cwd is module-4, go up one level to notebooks\n",
"    Path.cwd(),  # If cwd is already notebooks\n",
"    Path.cwd() / \"notebooks\",  # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
"    if (path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        notebooks_path = path\n",
"        break\n",
"\n",
"if not notebooks_path:\n",
"    raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
"    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"\n",
"# Create timestamped directory for this baseline\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"baseline_dir = artifacts_dir / \"module-4\" / f\"baseline-{timestamp}\"\n",
"baseline_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"\\nBaseline directory: {baseline_dir}\")\n",
"print(\"All diagnostics will be saved here.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Safety Check: Environment Verification\n",
"\n",
"Verify you're in a safe environment before collecting baseline. Module 4 failure labs will modify your environment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Safety check: Verify environment is safe for Module 4\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn\n",
"import os\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"### Environment Safety Check\\n\")\n",
"\n",
"# Show environment details\n",
"provider_display = provider.upper()\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
"    print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
"    print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
"elif provider == \"azure\":\n",
"    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
"    print(f\"Subscription ID: {subscription_id}\")\n",
"\n",
"# Show environment variables\n",
"print(\"\\n### Environment Variables\")\n",
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
"\n",
"# Check for Module 4 safety flag\n",
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
"if module4_safe in [\"true\", \"yes\", \"1\"]:\n",
"    ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
"    print(\" ✅ Environment verified as safe for Module 4 failure labs\")\n",
"else:\n",
"    warn(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
"    print(\" 💡 This notebook is read-only, but failure labs require this flag\")\n",
"    print(\" 💡 Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
"    print(\" 💡 Complete safety check in ../shared/00_setup_or_resume_environment.ipynb first\")\n",
"\n",
"print(\"\\n⚠️ REMINDER: This notebook is read-only.\")\n",
"print(\" Failure labs in Module 4 will modify secrets and cause disruptions.\")\n",
"print(\" Only run failure labs in TEST/NON-PRODUCTION environments.\")\n",
"\n",
"ok(\"Environment check complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration\n",
"\n",
"Load and validate configuration from environment variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import ok, warn\n",
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
"\n",
"# Required configuration\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"\n",
"print(\"### Loading Configuration\\n\")\n",
"\n",
"config = {}\n",
"missing = []\n",
"\n",
"for var in required_vars:\n",
"    value = os.environ.get(var, \"\").strip()\n",
"    if not value:\n",
"        missing.append(var)\n",
"    config[var] = value\n",
"\n",
"if missing:\n",
"    raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
"                       f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
"\n",
"# Optional but recommended\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"# Show cloud provider info\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"\n",
"print(f\"Cloud Provider: {provider.upper()}\")\n",
"print(f\"Region: {region}\")\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"if config[\"LANGSMITH_DOMAIN\"]:\n",
"    print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
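{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Cluster State Snapshot\n",
"\n",
"Capture a snapshot of the workload resources in the namespace. Note that `kubectl get all` covers pods, services, deployments, and similar workload types, but not Secrets, ConfigMaps, PVCs, or Ingresses. This is your \"before\" picture.\n"
]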
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Capturing Cluster State Snapshot\\n\")\n",
"\n",
"# Get all resources\n",
"print(\"1. Collecting all resources...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    snapshot_file = baseline_dir / \"all-resources.txt\"\n",
"    with open(snapshot_file, \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    ok(f\"Saved resource snapshot to {snapshot_file.name}\")\n",
"    print(f\" Resources captured: {len(result.stdout.splitlines())} lines\")\n",
"else:\n",
"    warn(\"Could not capture resource snapshot\")\n",
"\n",
"# Get all resources as YAML (more detailed)\n",
"print(\"\\n2. Collecting detailed YAML...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"yaml\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    yaml_file = baseline_dir / \"all-resources.yaml\"\n",
"    with open(yaml_file, \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    ok(f\"Saved detailed YAML to {yaml_file.name}\")\n",
"else:\n",
"    warn(\"Could not capture detailed YAML\")\n"
]
},
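{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snapshot is most useful once there is something to compare it with. As a minimal sketch (the helper name `diff_snapshots` and the sample text are illustrative and not part of the workshop's shared modules), two saved `all-resources.txt` snapshots could be compared with the standard library. Columns such as AGE change on every run, so expect some noise when diffing real `-o wide` output.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import difflib\n",
"\n",
"\n",
"def diff_snapshots(before_text: str, after_text: str) -> list[str]:\n",
"    \"\"\"Return unified-diff lines between two snapshot texts (hypothetical helper).\"\"\"\n",
"    return list(\n",
"        difflib.unified_diff(\n",
"            before_text.splitlines(),\n",
"            after_text.splitlines(),\n",
"            fromfile=\"baseline\",\n",
"            tofile=\"current\",\n",
"            lineterm=\"\",\n",
"        )\n",
"    )\n",
"\n",
"\n",
"# Illustrative sample text, not real cluster output\n",
"sample_before = \"pod/backend-0 1/1 Running 0\\npod/frontend-0 1/1 Running 0\"\n",
"sample_after = \"pod/backend-0 0/1 CrashLoopBackOff 4\\npod/frontend-0 1/1 Running 0\"\n",
"\n",
"for diff_line in diff_snapshots(sample_before, sample_after):\n",
"    print(diff_line)\n"
]
},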
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Key Deployments Description\n",
"\n",
"Get detailed information about key deployments.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Describing Key Deployments\\n\")\n",
"\n",
"# Get list of deployments\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    deployments = json.loads(result.stdout)\n",
"    deployment_items = deployments.get(\"items\", [])\n",
"\n",
"    if deployment_items:\n",
"        ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
"\n",
"        # Describe each deployment\n",
"        for deployment in deployment_items:\n",
"            name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
"            print(f\"\\n3. Describing deployment: {name}\")\n",
"\n",
"            result = run(\n",
"                [\"kubectl\", \"describe\", \"deployment\", name, \"-n\", namespace],\n",
"                check=False,\n",
"                stream=False\n",
"            )\n",
"\n",
"            if result.returncode == 0:\n",
"                desc_file = baseline_dir / f\"deployment-{name}.txt\"\n",
"                with open(desc_file, \"w\") as f:\n",
"                    f.write(result.stdout)\n",
"                print(f\" ✅ Saved description to {desc_file.name}\")\n",
"            else:\n",
"                warn(f\"Could not describe deployment {name}\")\n",
"    else:\n",
"        warn(\"No deployments found\")\n",
"else:\n",
"    warn(\"Could not list deployments\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Recent Events\n",
"\n",
"Capture recent events sorted by timestamp. Events often contain the first clues about what's happening.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Collecting Recent Events\\n\")\n",
"\n",
"# Get events sorted by timestamp (quotes around the field are only needed in an interactive shell)\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    events_file = baseline_dir / \"events.txt\"\n",
"    with open(events_file, \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    ok(f\"Saved events to {events_file.name}\")\n",
"\n",
"    # Count events (the first line of the table is the header)\n",
"    lines = result.stdout.strip().split(\"\\n\")\n",
"    if len(lines) > 1:  # Header + events\n",
"        event_lines = lines[1:]\n",
"        event_count = len(event_lines)\n",
"        print(f\" Captured {event_count} event(s)\")\n",
"\n",
"        # Show last few events (header excluded)\n",
"        print(\"\\n Last 5 events:\")\n",
"        for line in event_lines[-5:]:\n",
"            if line.strip():\n",
"                print(f\" {line}\")\n",
"    else:\n",
"        print(\" No events found (this is normal for a healthy cluster)\")\n",
"else:\n",
"    warn(\"Could not collect events\")\n"
]
},
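{
"cell_type": "markdown",
"metadata": {},
"source": [
"Warning events are usually the ones worth reading first. As a minimal sketch, assuming events were fetched with `kubectl get events -o json` instead of the table output above (the helper name and the sample payload are illustrative), events can be tallied by their `type` field:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"\n",
"def count_events_by_type(events: dict) -> Counter:\n",
"    \"\"\"Count events by their `type` field, typically Normal or Warning (hypothetical helper).\"\"\"\n",
"    return Counter(item.get(\"type\", \"Unknown\") for item in events.get(\"items\", []))\n",
"\n",
"\n",
"# Illustrative payload shaped like `kubectl get events -o json` output\n",
"sample_events = {\n",
"    \"items\": [\n",
"        {\"type\": \"Normal\", \"reason\": \"Scheduled\"},\n",
"        {\"type\": \"Warning\", \"reason\": \"BackOff\"},\n",
"        {\"type\": \"Warning\", \"reason\": \"FailedMount\"},\n",
"    ]\n",
"}\n",
"\n",
"event_counts = count_events_by_type(sample_events)\n",
"print(f\"Normal: {event_counts['Normal']}, Warning: {event_counts['Warning']}\")\n"
]
},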
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Resource Usage\n",
"\n",
"Capture resource usage (CPU, memory) if metrics are available.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Collecting Resource Usage\\n\")\n",
"\n",
"# Top pods\n",
"print(\"1. Checking pod resource usage...\")\n",
"result = run(\n",
"    [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    top_pods_file = baseline_dir / \"top-pods.txt\"\n",
"    with open(top_pods_file, \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    ok(f\"Saved pod resource usage to {top_pods_file.name}\")\n",
"    print(result.stdout)\n",
"else:\n",
"    warn(\"Could not get pod resource usage (metrics server may not be available)\")\n",
"    print(\" 💡 This is OK - metrics are optional for baseline collection\")\n",
"\n",
"# Top nodes (if available)\n",
"print(\"\\n2. Checking node resource usage...\")\n",
"result = run(\n",
"    [\"kubectl\", \"top\", \"nodes\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    top_nodes_file = baseline_dir / \"top-nodes.txt\"\n",
"    with open(top_nodes_file, \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    ok(f\"Saved node resource usage to {top_nodes_file.name}\")\n",
"    print(result.stdout)\n",
"else:\n",
"    warn(\"Could not get node resource usage (metrics server may not be available)\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Canonical Diagnostics Script\n",
"\n",
"**This is the most important step.** Run the official LangChain diagnostics script that Support expects.\n",
"\n",
"The script captures:\n",
"- Pod logs (all containers)\n",
"- Events (sorted by timestamp)\n",
"- Resource usage (CPU, memory)\n",
"- Configuration (deployments, services, ingress)\n",
"- Storage (PVCs, storage classes)\n",
"- Network (services, endpoints)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"# URL to the canonical script\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = baseline_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"print(f\"1. Downloading script from: {script_url}\")\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    ok(f\"Downloaded script to {script_path.name}\")\n",
"\n",
"    # Make executable\n",
"    script_path.chmod(0o755)\n",
"\n",
"    # Run the script\n",
"    print(f\"\\n2. Running diagnostics script for namespace: {namespace}\")\n",
"    print(\" (This may take a few minutes...)\")\n",
"\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True  # Stream output so user can see progress\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed successfully\")\n",
"\n",
"        # The script creates a tarball - find it\n",
"        diagnostics_tarball = None\n",
"        for file in baseline_dir.parent.iterdir():\n",
"            # Path.suffix only returns the last extension (\".gz\"), so match on the full name\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                diagnostics_tarball = file\n",
"                break\n",
"\n",
"        if diagnostics_tarball:\n",
"            # Move it to our baseline directory\n",
"            target_path = baseline_dir / diagnostics_tarball.name\n",
"            diagnostics_tarball.rename(target_path)\n",
"            ok(f\"Diagnostics bundle saved to: {target_path.name}\")\n",
"            print(f\" Size: {target_path.stat().st_size / 1024 / 1024:.2f} MB\")\n",
"        else:\n",
"            warn(\"Could not find diagnostics tarball (check script output above)\")\n",
"    else:\n",
"        warn(f\"Diagnostics script returned non-zero exit code: {result.returncode}\")\n",
"        print(\" Check the output above for errors\")\n",
"        print(\" 💡 The script may still have collected useful information\")\n",
"\n",
"except urllib.request.URLError as e:\n",
"    warn(f\"Could not download diagnostics script: {e}\")\n",
"    print(\" 💡 You can download it manually and run it:\")\n",
"    print(f\" curl -O {script_url}\")\n",
"    print(\" chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n",
"except Exception as e:\n",
"    warn(f\"Error running diagnostics script: {e}\")\n"
]
},
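{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before sending a bundle to Support, it can help to check what it actually contains. The sketch below lists the members of a gzip-compressed tar archive with the standard library. The helper name is illustrative, and the archive is built in memory only so that the cell runs without a real bundle; for a real file, `tarfile.open(path, \"r:gz\")` would be the equivalent call.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import tarfile\n",
"\n",
"\n",
"def list_bundle_members(data: bytes) -> list[str]:\n",
"    \"\"\"Return member names from a gzip-compressed tar archive (hypothetical helper).\"\"\"\n",
"    with tarfile.open(fileobj=io.BytesIO(data), mode=\"r:gz\") as tar:\n",
"        return tar.getnames()\n",
"\n",
"\n",
"# Build a tiny in-memory archive as stand-in data (not a real diagnostics bundle)\n",
"sample_buffer = io.BytesIO()\n",
"with tarfile.open(fileobj=sample_buffer, mode=\"w:gz\") as tar:\n",
"    payload = b\"example log line\\n\"\n",
"    info = tarfile.TarInfo(name=\"logs/example.log\")\n",
"    info.size = len(payload)\n",
"    tar.addfile(info, io.BytesIO(payload))\n",
"\n",
"print(list_bundle_members(sample_buffer.getvalue()))\n"
]
},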
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Basic Health Check\n",
"\n",
"Perform a basic HTTP check to verify the LangSmith endpoint is reachable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import urllib3\n",
"\n",
"# Disable SSL warnings for self-signed certs\n",
"urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
"\n",
"print(\"### Testing Endpoint Reachability\\n\")\n",
"\n",
"# Determine endpoint URL (stays None if nothing usable is found)\n",
"test_url = None\n",
"if config[\"LANGSMITH_DOMAIN\"]:\n",
"    test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
"else:\n",
"    # Try to get from ingress: use the first rule that declares a host\n",
"    result = run(\n",
"        [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ingresses = json.loads(result.stdout)\n",
"        hosts = [\n",
"            rule.get(\"host\", \"\")\n",
"            for ingress in ingresses.get(\"items\", [])\n",
"            for rule in ingress.get(\"spec\", {}).get(\"rules\", [])\n",
"            if rule.get(\"host\")\n",
"        ]\n",
"        if hosts:\n",
"            test_url = f\"https://{hosts[0]}\"\n",
"\n",
"if test_url:\n",
"    print(f\"Testing: {test_url}\")\n",
"    try:\n",
"        response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
"\n",
"        health_file = baseline_dir / \"endpoint-health.txt\"\n",
"        with open(health_file, \"w\") as f:\n",
"            f.write(f\"URL: {test_url}\\n\")\n",
"            f.write(f\"Status Code: {response.status_code}\\n\")\n",
"            f.write(f\"Response Headers:\\n{json.dumps(dict(response.headers), indent=2)}\\n\")\n",
"\n",
"        if response.status_code in [200, 302, 401, 403]:\n",
"            ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
"            print(f\" Response saved to {health_file.name}\")\n",
"        else:\n",
"            warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
"    except requests.exceptions.SSLError:\n",
"        warn(\"TLS error while connecting (certificate verification is disabled for this check)\")\n",
"        print(\" 💡 Check the ingress TLS configuration. In production, use proper TLS certificates.\")\n",
"    except requests.exceptions.RequestException as e:\n",
"        warn(f\"Could not reach endpoint: {e}\")\n",
"        print(\" 💡 Endpoint may still be provisioning or DNS not configured\")\n",
"else:\n",
"    warn(\"No endpoint URL available for testing\")\n",
"    print(\" 💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. What Good Looks Like\n",
"\n",
"Quick validation checks to confirm the baseline is healthy.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._validation import ok, warn\n",
"\n",
"print(\"### Quick Health Validation\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"healthy_pods = 0\n",
"unhealthy_pods = []\n",
"crash_loops = []\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    for pod in pods.get(\"items\", []):\n",
"        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        phase = pod.get(\"status\", {}).get(\"phase\", \"\")\n",
"        container_statuses = pod.get(\"status\", {}).get(\"containerStatuses\", [])\n",
"\n",
"        # Completed Job pods (for example, migrations) are expected, not unhealthy\n",
"        if phase == \"Succeeded\":\n",
"            continue\n",
"\n",
"        is_ready = all(cs.get(\"ready\", False) for cs in container_statuses)\n",
"\n",
"        # CrashLoopBackOff is a container waiting reason, not a pod phase\n",
"        if any(\n",
"            cs.get(\"state\", {}).get(\"waiting\", {}).get(\"reason\") == \"CrashLoopBackOff\"\n",
"            for cs in container_statuses\n",
"        ):\n",
"            crash_loops.append(name)\n",
"\n",
"        if phase == \"Running\" and is_ready:\n",
"            healthy_pods += 1\n",
"        else:\n",
"            unhealthy_pods.append((name, phase, is_ready))\n",
"\n",
"    if unhealthy_pods:\n",
"        warn(f\"Found {len(unhealthy_pods)} pod(s) that are not healthy:\")\n",
"        for name, phase, ready in unhealthy_pods:\n",
"            print(f\" - {name}: phase={phase}, ready={ready}\")\n",
"    else:\n",
"        ok(f\"All {healthy_pods} pod(s) are healthy and ready\")\n",
"else:\n",
"    warn(\"Could not check pod status\")\n",
"\n",
"# Check for CrashLoopBackOff\n",
"if crash_loops:\n",
"    warn(f\"Found {len(crash_loops)} pod(s) in CrashLoopBackOff:\")\n",
"    for name in crash_loops:\n",
"        print(f\" - {name}\")\n",
"    print(\" 💡 Check pod logs to understand why they're crashing\")\n",
"\n",
"# Check for Pending pods\n",
"pending = [name for name, phase, _ in unhealthy_pods if phase == \"Pending\"]\n",
"if pending:\n",
"    warn(f\"Found {len(pending)} pod(s) in Pending state:\")\n",
"    for name in pending:\n",
"        print(f\" - {name}\")\n",
"    print(\" 💡 Check events and resource availability\")\n",
"\n",
"print(\"\\n### Baseline Summary\\n\")\n",
"print(f\"✅ Baseline captured at: {timestamp}\")\n",
"print(f\"📁 Baseline directory: {baseline_dir}\")\n",
"print(\"📊 Resources captured:\")\n",
"print(\" - Cluster state snapshot\")\n",
"print(\" - Deployment descriptions\")\n",
"print(\" - Recent events\")\n",
"print(\" - Resource usage (if available)\")\n",
"print(\" - Canonical diagnostics bundle\")\n",
"print(\" - Endpoint health check\")\n",
"\n",
"ok(\"Baseline collection complete!\")\n",
"print(\"\\n💡 Use this baseline as your reference point for all failure labs.\")\n",
"print(\" Compare future diagnostics to this baseline to identify what changed.\")\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,944 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - PostgreSQL\n",
"\n",
"## ⚠️ CRITICAL SAFETY WARNING\n",
"\n",
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
"- **Modifies Kubernetes secrets** (breaks PostgreSQL password)\n",
"- **Causes service disruptions** (API failures, login failures)\n",
"- **Requires remediation** to restore functionality\n",
"\n",
"**REQUIREMENTS:**\n",
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
"- ✅ **Backup/restore plan** available\n",
"\n",
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug PostgreSQL connectivity failures in LangSmith.**\n",
"\n",
"PostgreSQL is LangSmith's primary metadata store. It holds:\n",
"- User accounts and workspaces\n",
"- Project definitions\n",
"- API keys and permissions\n",
"- Trace metadata (not the traces themselves, which go to ClickHouse)\n",
"\n",
"**When PostgreSQL fails, you'll see:**\n",
"- API endpoints return 5xx errors\n",
"- Login/authentication may fail\n",
"- UI may load but actions fail\n",
"- Connection exhaustion patterns in logs\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how PostgreSQL failures manifest\n",
"2. Practice collecting diagnostics for database issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** \n",
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
"    Path.cwd().parent,\n",
"    Path.cwd(),\n",
"    Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
"    if (path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        notebooks_path = path\n",
"        break\n",
"\n",
"if not notebooks_path:\n",
"    raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
"    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ⚠️ CRITICAL: Environment Safety Verification\n",
"\n",
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn, fail\n",
"import os\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"=\" * 70)\n",
"print(\"⚠️ CRITICAL SAFETY CHECK - POSTGRESQL FAILURE LAB\")\n",
"print(\"=\" * 70)\n",
"\n",
"# Show environment details prominently\n",
"provider_display = provider.upper()\n",
"print(\"\\n### Current Environment Configuration\")\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
"    account_id = identity.get('Account', 'N/A')\n",
"    user_arn = identity.get('Arn', 'N/A')\n",
"    print(f\"Account ID: {account_id}\")\n",
"    print(f\"User ARN: {user_arn}\")\n",
"elif provider == \"azure\":\n",
"    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
"    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
"    print(f\"Subscription ID: {subscription_id}\")\n",
"    print(f\"Subscription Name: {subscription_name}\")\n",
"\n",
"# Show all relevant environment variables\n",
"print(\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
"print(\"=\" * 70)\n",
"print(\"\\nThis failure lab will:\")\n",
"print(\" 1. Find the PostgreSQL secret in your namespace\")\n",
"print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
"print(\" 3. MODIFY the secret to set an INVALID password\")\n",
"print(\" 4. Apply the modified secret (breaks database connectivity)\")\n",
"print(\" 5. Cause API failures and login failures\")\n",
"print(\" 6. Require remediation to restore (restore original secret)\")\n",
"print(\"\\n\" + \"=\" * 70)\n",
"\n",
"# Check for Module 4 safety flag\n",
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
"    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
"    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
"    print(\"\\nTo run this failure lab, you MUST:\")\n",
"    print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
"    print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
"    print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
"    print(\" 4. Re-run this cell to confirm\")\n",
"    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
"    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
"\n",
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
"print(\"\\n⚠️ REMINDER: This lab will break PostgreSQL connectivity.\")\n",
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
"print(\" Original secret will be backed up automatically.\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"✅ Environment verified - ready for PostgreSQL failure lab\")\n",
"print(\"=\" * 70)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What This Service Does for LangSmith\n",
"\n",
"PostgreSQL is LangSmith's **primary metadata store**. It holds:\n",
"\n",
"- **User accounts and authentication data**\n",
"- **Workspaces and projects** (organizational structure)\n",
"- **API keys and permissions** (access control)\n",
"- **Trace metadata** (not the trace data itself, which goes to ClickHouse)\n",
"- **Evaluation results and feedback**\n",
"\n",
"**Why it matters:**\n",
"- Without PostgreSQL, users can't log in\n",
"- API calls fail (no authentication, no project lookups)\n",
"- UI loads but can't perform actions\n",
"- All LangSmith functionality depends on it\n",
"\n",
"**How LangSmith connects:**\n",
"- Connection string stored in Kubernetes Secrets\n",
"- Connection pool managed by application\n",
"- Connection limits are critical (PostgreSQL has max connections)\n"
]
},
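{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Sketch: how Secret values are encoded.** Kubernetes stores each value under a Secret's `data` field as base64 text, which is why the injection step later in this lab encodes the replacement password before writing it. The snippet below is a minimal illustration of that round trip. `example_password` is a made-up value for illustration, not something read from your cluster. Keep in mind that base64 is an encoding, not encryption.\n",
"\n",
"```python\n",
"import base64\n",
"\n",
"# Hypothetical value, for illustration only\n",
"example_password = \"s3cr3t-example\"\n",
"\n",
"# Encode the way a value appears under `data:` in a Secret manifest\n",
"encoded = base64.b64encode(example_password.encode()).decode()\n",
"print(f\"Encoded: {encoded}\")\n",
"\n",
"# Decoding recovers the original text, so treat encoded values as sensitive\n",
"decoded = base64.b64decode(encoded).decode()\n",
"print(f\"Round trip OK: {decoded == example_password}\")\n",
"```\n",
"\n",
"Anyone who can read the Secret manifest can run the decode step, so avoid pasting `-o yaml` output into tickets or chat.\n"
]
},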
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Expected Symptoms When PostgreSQL Fails\n",
"\n",
"**What you'll see:**\n",
"\n",
"1. **API 5xx errors:**\n",
" - `/api/v1/...` endpoints return 500 or 503\n",
" - Error messages mention \"database\" or \"connection\"\n",
"\n",
"2. **Login failures:**\n",
" - Users can't authenticate\n",
" - OIDC/SAML may work (redirects) but session creation fails\n",
"\n",
"3. **UI loads but actions fail:**\n",
" - Pages render (static content)\n",
" - API calls fail (can't load projects, traces, etc.)\n",
"\n",
"4. **Log patterns:**\n",
" - Connection timeout errors\n",
" - \"connection refused\" or \"connection reset\"\n",
" - \"too many connections\" (if connection pool exhausted)\n",
" - \"authentication failed\" (if credentials wrong)\n",
"\n",
"**Timeline:**\n",
"- Symptoms appear within seconds of failure\n",
"- API calls start failing immediately\n",
"- Existing connections may work briefly, then fail\n"
]
},
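{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Sketch: sorting log lines by likely cause.** The log patterns above map loosely onto three causes: bad credentials, an exhausted connection limit, or an unreachable database. The helper below is a minimal sketch of that mapping. `classify_postgres_error`, its keyword lists, and the sample lines are illustrative assumptions, not part of LangSmith or the shared workshop modules, and real log messages may be worded differently.\n",
"\n",
"```python\n",
"def classify_postgres_error(log_line: str) -> str:\n",
"    \"\"\"Return a rough failure category for a single log line.\"\"\"\n",
"    line = log_line.lower()\n",
"    if \"authentication failed\" in line:\n",
"        return \"credentials\"\n",
"    if \"too many connections\" in line or \"too many clients\" in line:\n",
"        return \"connection-limit\"\n",
"    network_keywords = (\"connection refused\", \"connection reset\", \"timeout\", \"timed out\")\n",
"    if any(kw in line for kw in network_keywords):\n",
"        return \"network\"\n",
"    return \"unknown\"\n",
"\n",
"\n",
"# Hypothetical sample lines, for illustration only\n",
"samples = [\n",
"    'FATAL: password authentication failed for user \"example\"',\n",
"    \"could not connect to server: Connection refused\",\n",
"    \"FATAL: sorry, too many clients already\",\n",
"]\n",
"for sample in samples:\n",
"    print(f\"{classify_postgres_error(sample)}: {sample}\")\n",
"```\n",
"\n",
"The credential check runs first because an authentication error means the network path already works, which narrows the search to the Secret.\n"
]
},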
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong Database Password**\n",
"- Modify the PostgreSQL password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, connection refused\n",
"\n",
"**Option B: Wrong Database Host**\n",
"- Point connection string to non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures\n",
"\n",
"**Option C: Network Isolation (if NetworkPolicy supported)**\n",
"- Apply NetworkPolicy blocking egress to PostgreSQL\n",
"- Symptoms: Connection timeout, no route to host\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option D: Remove Secret Entirely**\n",
"- Delete the PostgreSQL connection secret\n",
"- Symptoms: Pods crash on startup, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" healthy = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" print(f\"Pods: {healthy}/{total} running\")\n",
" \n",
" if healthy == total and total > 0:\n",
" ok(\"Baseline looks healthy\")\n",
" else:\n",
" warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
" warn(\"Could not check pod status\")\n",
"\n",
"# Check for PostgreSQL secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"postgres_secrets = []\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" if \"postgres\" in name.lower() or \"database\" in name.lower() or \"db\" in name.lower():\n",
" postgres_secrets.append(name)\n",
"\n",
"if postgres_secrets:\n",
" ok(f\"Found {len(postgres_secrets)} PostgreSQL-related secret(s)\")\n",
" for secret_name in postgres_secrets:\n",
" print(f\" - {secret_name}\")\n",
"else:\n",
" warn(\"No PostgreSQL secrets found\")\n",
" print(\" 💡 PostgreSQL connection may be configured differently\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong Database Password\n",
"# This cell modifies the PostgreSQL password secret to an invalid value\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Find PostgreSQL secret (look for common names)\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"postgres_secret_name = None\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" # Common patterns: postgres, database, db, langsmith-db\n",
" if any(keyword in name.lower() for keyword in [\"postgres\", \"database\", \"db\"]):\n",
" # Check if it has password-related keys\n",
" data = secret.get(\"data\", {})\n",
" if any(key in data for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\"]):\n",
" postgres_secret_name = name\n",
" break\n",
"\n",
"if not postgres_secret_name:\n",
" raise RuntimeError(\"❌ Could not find PostgreSQL secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found PostgreSQL secret: {postgres_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
" check=True,\n",
" stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"postgres-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
" raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find password key (could be password, POSTGRES_PASSWORD, DB_PASSWORD, etc.)\n",
"password_key = None\n",
"for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]:\n",
" if key in secret_data[\"data\"]:\n",
" password_key = key\n",
" break\n",
"\n",
"if not password_key:\n",
" raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"postgres-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
" yaml.dump(secret_data, f)\n",
"\n",
"print(\"=\" * 70)\n",
"print(\"⚠️ READY TO APPLY FAILURE INJECTION\")\n",
"print(\"=\" * 70)\n",
"print(f\"\\nThis will modify secret: {postgres_secret_name}\")\n",
"print(f\"Modified secret saved to: {temp_secret_file.name}\")\n",
"print(f\"Backup saved to: {backup_file.name}\")\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"⚠️ FINAL WARNING BEFORE FAILURE INJECTION\")\n",
"print(\"=\" * 70)\n",
"print(\"\\nThis will:\")\n",
"print(\" ❌ Break PostgreSQL connectivity\")\n",
"print(\" ❌ Cause API 5xx errors\")\n",
"print(\" ❌ Break login/authentication\")\n",
"print(\" ❌ Disrupt LangSmith functionality\")\n",
"print(\"\\nTo apply the failure:\")\n",
"print(\" 1. Verify MODULE4_SAFE_ENVIRONMENT=true is set\")\n",
"print(\" 2. Verify you're in a TEST environment\")\n",
"print(\" 3. Uncomment the code in the next cell\")\n",
"print(\" 4. Run the next cell to apply\")\n",
"print(\"\\nTo restore after the lab:\")\n",
"print(f\" - Use the backup file: {backup_file.name}\")\n",
"print(\" - See the 'Remediation' section below\")\n",
"print(\"\\n\" + \"=\" * 70)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"# \n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Failure injection applied - PostgreSQL password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Pod logs for connection errors\n",
"2. API endpoint responses\n",
"3. UI behavior\n",
"4. Events for pod restarts\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"postgres-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" print(result.stdout)\n",
" \n",
" # Check for restarts\n",
" lines = result.stdout.split(\"\\n\")\n",
" restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
" if restarts:\n",
" print(\"\\n Pod restart counts:\")\n",
" for line in restarts[1:]: # Skip header\n",
" if line.strip():\n",
" parts = line.split()\n",
" if len(parts) > 3:\n",
" print(f\" {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by='.lastTimestamp'\", \"--field-selector=type!=Normal\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(incident_dir / \"events.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" if result.stdout.strip():\n",
" print(\" Recent warning/error events:\")\n",
" for line in result.stdout.split(\"\\n\")[-5:]:\n",
" if line.strip():\n",
" print(f\" {line}\")\n",
"\n",
"# 3. Check API pod logs for database errors\n",
"print(\"\\n3. Checking API pod logs for database errors...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if api_pod:\n",
" result = run(\n",
" [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=50\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" logs_file = incident_dir / f\"api-pod-{api_pod}-logs.txt\"\n",
" with open(logs_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" \n",
" # Look for database-related errors\n",
" error_keywords = [\"database\", \"postgres\", \"connection\", \"timeout\", \"refused\", \"authentication\"]\n",
" error_lines = [l for l in result.stdout.split(\"\\n\") \n",
" if any(kw in l.lower() for kw in error_keywords)]\n",
" \n",
" if error_lines:\n",
" print(\" Found database-related errors:\")\n",
" for line in error_lines[-5:]:\n",
" print(f\" {line}\")\n",
" else:\n",
" print(\" No obvious database errors in recent logs\")\n",
"else:\n",
" warn(\"Could not find API pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
" urllib.request.urlretrieve(script_url, script_path)\n",
" script_path.chmod(0o755)\n",
" \n",
" print(f\"Running diagnostics script for namespace: {namespace}\")\n",
" result = run(\n",
" [str(script_path), namespace],\n",
" check=False,\n",
" stream=True\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" ok(\"Diagnostics script completed\")\n",
" \n",
" # Find and move tarball\n",
" for file in incident_dir.parent.iterdir():\n",
" if file.name.startswith(\"langsmith-debug-\") and file.suffix == \".tar.gz\":\n",
" target_path = incident_dir / file.name\n",
" file.rename(target_path)\n",
" ok(f\"Diagnostics bundle: {target_path.name}\")\n",
" break\n",
" else:\n",
" warn(\"Diagnostics script had errors (check output above)\")\n",
" \n",
"except Exception as e:\n",
" warn(f\"Could not run diagnostics script: {e}\")\n",
" print(\" 💡 You can run it manually:\")\n",
" print(f\" curl -O {script_url}\")\n",
" print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
" print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for PostgreSQL issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check pod logs for connection errors:\")\n",
"print(f\" kubectl logs -n {namespace} <pod-name> | grep -i 'database\\\\|postgres\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify secret exists and has correct keys:\")\n",
"print(f\" kubectl get secret {postgres_secret_name} -n {namespace} -o yaml\")\n",
"print(\" (Don't print the actual values - they're base64 encoded)\")\n",
"print()\n",
"\n",
"print(\"3. Check for pod restarts (indicates startup failures):\")\n",
"print(f\" kubectl get pods -n {namespace}\")\n",
"print()\n",
"\n",
"print(\"4. Test database connectivity from a pod (if possible):\")\n",
"print(\" kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \\\\\")\n",
"print(\" psql -h <db-host> -U <user> -d <database>\")\n",
"print()\n",
"\n",
"print(\"5. Check events for authentication/connection errors:\")\n",
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" ok(f\"Secret '{postgres_secret_name}' still exists\")\n",
" secret_data = json.loads(result.stdout)\n",
" keys = list(secret_data.get(\"data\", {}).keys())\n",
" print(f\" Secret keys: {', '.join(keys)}\")\n",
"else:\n",
" warn(f\"Secret '{postgres_secret_name}' not found!\")\n",
"\n",
"# Check for pods with database connection env vars\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" db_related_pods = []\n",
" for pod in pods.get(\"items\", []):\n",
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
" for container in containers:\n",
" env = container.get(\"env\", [])\n",
" db_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
" for kw in [\"DB\", \"POSTGRES\", \"DATABASE\"])]\n",
" if db_env:\n",
" db_related_pods.append(name)\n",
" break\n",
" \n",
" if db_related_pods:\n",
" print(f\"\\n Pods with database environment variables:\")\n",
" for pod_name in set(db_related_pods):\n",
" print(f\" - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# if backup_file.exists():\n",
"# print(f\"Restoring original secret from: {backup_file.name}\")\n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Original secret restored\")\n",
"# print(\"\\n💡 Pods will restart to pick up the correct secret.\")\n",
"# print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# import time\n",
"# print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"# time.sleep(60)\n",
"# else:\n",
"# warn(f\"Backup file not found: {backup_file}\")\n",
"# print(\" 💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
"print(f\" Backup file: {backup_file.name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" running = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" \n",
" if running == total and total > 0:\n",
" ok(f\"All {total} pod(s) are running\")\n",
" else:\n",
" warn(f\"Only {running}/{total} pod(s) running\")\n",
" print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in logs\n",
"print(\"\\nChecking for recent errors...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if api_pod:\n",
" result = run(\n",
" [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=20\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" error_keywords = [\"error\", \"fail\", \"database\", \"postgres\", \"connection\"]\n",
" recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
" if any(kw in l.lower() for kw in error_keywords)]\n",
" \n",
" if recent_errors:\n",
" warn(\"Still seeing some errors in logs:\")\n",
" for line in recent_errors[-3:]:\n",
" print(f\" {line}\")\n",
" else:\n",
" ok(\"No recent errors in API logs\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a PostgreSQL issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **PostgreSQL connection details:**\n",
" - Host/endpoint (redacted)\n",
" - Database name\n",
" - Username (redacted)\n",
" - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
" - Full error text (not just \"connection failed\")\n",
" - Timestamps of first occurrence\n",
"4. **Recent changes:**\n",
" - Secret rotations\n",
" - Database migrations\n",
" - Network policy changes\n",
"5. **Connection pool status:**\n",
" - Current connections vs. max connections\n",
" - Connection pool exhaustion patterns\n",
"6. **Database health (if accessible):**\n",
" - PostgreSQL version\n",
" - Active connections\n",
" - Lock contention\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Pod logs with database errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- Database endpoint connectivity test\n",
"- Connection pool metrics (if available)\n",
"- PostgreSQL logs (if accessible via cloud provider)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **PostgreSQL failures manifest quickly** - API calls fail within seconds\n",
"2. **Logs are your friend** - Connection errors appear in pod logs immediately\n",
"3. **Secrets matter** - Wrong credentials cause authentication failures\n",
"4. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"5. **Diagnostics bundle is essential** - Support needs it for root cause analysis\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Changing multiple things at once (hard to identify root cause)\n",
"- ❌ Not collecting diagnostics before remediation\n",
"- ❌ Ignoring connection pool limits\n",
"- ❌ Not testing database connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the Redis, ClickHouse, or Blob Storage failure labs\n",
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,941 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - Redis\n",
"\n",
"## ⚠️ CRITICAL SAFETY WARNING\n",
"\n",
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
"- **Modifies Kubernetes secrets** (breaks Redis password)\n",
"- **Causes service disruptions** (intermittent ingestion, worker backlog)\n",
"- **Requires remediation** to restore functionality\n",
"\n",
"**REQUIREMENTS:**\n",
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
"- ✅ **Backup/restore plan** available\n",
"\n",
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug Redis connectivity failures in LangSmith.**\n",
"\n",
"Redis is LangSmith's **cache and job queue**. It handles:\n",
"- Job queue for asynchronous trace processing\n",
"- Caching for frequently accessed data\n",
"- Rate limiting and session management\n",
"- Worker coordination\n",
"\n",
"**When Redis fails, you'll see:**\n",
"- Intermittent ingestion issues\n",
"- Latency spikes and retries\n",
"- Worker backlog (jobs piling up)\n",
"- Traces may be delayed or missing\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how Redis failures manifest\n",
"2. Practice collecting diagnostics for cache/queue issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** \n",
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
" Path.cwd().parent,\n",
" Path.cwd(),\n",
" Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ⚠️ CRITICAL: Environment Safety Verification\n",
"\n",
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn, fail\n",
"import os\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"=\" * 70)\n",
"print(\"⚠️ CRITICAL SAFETY CHECK - REDIS FAILURE LAB\")\n",
"print(\"=\" * 70)\n",
"\n",
"# Show environment details prominently\n",
"provider_display = provider.upper()\n",
"print(f\"\\n### Current Environment Configuration\")\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
" account_id = identity.get('Account', 'N/A')\n",
" user_arn = identity.get('Arn', 'N/A')\n",
" print(f\"Account ID: {account_id}\")\n",
" print(f\"User ARN: {user_arn}\")\n",
"elif provider == \"azure\":\n",
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
" subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
" print(f\"Subscription ID: {subscription_id}\")\n",
" print(f\"Subscription Name: {subscription_name}\")\n",
"\n",
"# Show all relevant environment variables\n",
"print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
"print(\"=\" * 70)\n",
"print(\"\\nThis failure lab will:\")\n",
"print(\" 1. Find the Redis secret in your namespace\")\n",
"print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
"print(\" 3. MODIFY the secret to set an INVALID password\")\n",
"print(\" 4. Apply the modified secret (breaks Redis connectivity)\")\n",
"print(\" 5. Cause intermittent ingestion issues and worker backlog\")\n",
"print(\" 6. Require remediation to restore (restore original secret)\")\n",
"print(\"\\n\" + \"=\" * 70)\n",
"\n",
"# Check for Module 4 safety flag\n",
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
" fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
" print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
" print(\"\\nTo run this failure lab, you MUST:\")\n",
" print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
" print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
" print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
" print(\" 4. Re-run this cell to confirm\")\n",
" print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
" raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
"\n",
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
"print(\"\\n⚠️ REMINDER: This lab will break Redis connectivity.\")\n",
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
"print(\" Original secret will be backed up automatically.\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"✅ Environment verified - ready for Redis failure lab\")\n",
"print(\"=\" * 70)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. What This Service Does for LangSmith\n",
    "\n",
    "Redis is LangSmith's **cache and job queue**. It handles:\n",
    "\n",
    "- **Job queue for asynchronous processing:**\n",
    "  - Workers pull trace processing jobs from Redis\n",
    "  - Jobs are queued when traces arrive via API\n",
    "  - Queue backlog indicates processing delays\n",
    "\n",
    "- **Caching:**\n",
    "  - Frequently accessed data (project metadata, user info)\n",
    "  - Reduces load on PostgreSQL\n",
    "  - Improves response times\n",
    "\n",
    "- **Rate limiting and session management:**\n",
    "  - API rate limiting\n",
    "  - Session storage (if configured)\n",
    "\n",
    "- **Worker coordination:**\n",
    "  - Distributed locking\n",
    "  - Task distribution\n",
"\n",
"**Why it matters:**\n",
"- Without Redis, workers can't process traces\n",
"- Job queue fills up, causing delays\n",
"- Cache misses increase load on PostgreSQL\n",
"- Ingestion becomes unreliable\n",
"\n",
"**How LangSmith connects:**\n",
"- Connection string stored in Kubernetes Secrets\n",
"- Workers connect to Redis to pull jobs\n",
"- API servers use Redis for caching\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Expected Symptoms When Redis Fails\n",
    "\n",
    "**What you'll see:**\n",
    "\n",
    "1. **Intermittent ingestion issues:**\n",
    "   - Some traces process, others don't\n",
    "   - Inconsistent behavior (works sometimes, fails other times)\n",
    "   - Retries visible in logs\n",
    "\n",
    "2. **Latency spikes:**\n",
    "   - API responses slow down\n",
    "   - Worker processing delays\n",
    "   - Timeout errors\n",
    "\n",
    "3. **Worker backlog:**\n",
    "   - Jobs piling up in queue\n",
    "   - Workers unable to pull new jobs\n",
    "   - Queue length increasing\n",
    "\n",
    "4. **Log patterns:**\n",
    "   - Connection timeout errors\n",
    "   - \"connection refused\" or \"connection reset\"\n",
    "   - \"WRONGPASS\" or \"invalid password\" (wrong password), \"NOAUTH Authentication required\" (no password sent)\n",
    "   - Retry attempts in worker logs\n",
    "   - Cache miss patterns\n",
"\n",
"**Timeline:**\n",
"- Symptoms may be intermittent (connection pool retries)\n",
"- Worker backlog builds over time\n",
"- Cache misses cause cascading delays\n",
"- Full failure if connection pool exhausted\n"
]
},
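  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The log patterns above can be sorted into rough failure classes before digging deeper. The sketch below is illustrative only: `classify_redis_log_line` and its category names are hypothetical, and the substrings are common Redis client error fragments, not a complete list of what LangSmith logs.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper: map one log line to a coarse Redis failure class.\n",
    "# Order matters: the first matching category wins.\n",
    "PATTERNS = [\n",
    "    (\"auth\", (\"wrongpass\", \"noauth\", \"invalid password\")),\n",
    "    (\"network\", (\"connection refused\", \"connection reset\", \"no route to host\")),\n",
    "    (\"timeout\", (\"timed out\", \"timeout\")),\n",
    "]\n",
    "\n",
    "\n",
    "def classify_redis_log_line(line: str) -> str:\n",
    "    \"\"\"Return 'auth', 'network', 'timeout', or 'other' for one log line.\"\"\"\n",
    "    lowered = line.lower()  # patterns are lowercase, so compare in lowercase\n",
    "    for category, needles in PATTERNS:\n",
    "        if any(needle in lowered for needle in needles):\n",
    "            return category\n",
    "    return \"other\"\n",
    "```\n",
    "\n",
    "Counting categories over the last few hundred worker log lines gives a quick hint whether to check credentials first (`auth`) or network paths first (`network`, `timeout`).\n"
   ]
  },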
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong Redis Password**\n",
"- Modify the Redis password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, connection refused, intermittent failures\n",
"\n",
"**Option B: Block Egress to Redis Endpoint**\n",
"- Apply NetworkPolicy blocking egress to Redis (if NetworkPolicy supported)\n",
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option C: Wrong Redis Host/Endpoint**\n",
"- Point connection string to non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
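  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Option B can be sketched as a NetworkPolicy manifest built in Python. This is a minimal sketch under assumptions: `build_redis_egress_block`, the policy name, and the example CIDR are hypothetical; it only applies when Redis is reached at a known IP address; and the cluster's CNI plugin must enforce NetworkPolicy. Because NetworkPolicy rules are allow-lists, the policy allows all egress except the Redis address.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "\n",
    "def build_redis_egress_block(namespace: str, redis_cidr: str) -> dict:\n",
    "    \"\"\"Build a NetworkPolicy that allows all egress except to redis_cidr.\"\"\"\n",
    "    return {\n",
    "        \"apiVersion\": \"networking.k8s.io/v1\",\n",
    "        \"kind\": \"NetworkPolicy\",\n",
    "        \"metadata\": {\"name\": \"lab-block-redis-egress\", \"namespace\": namespace},\n",
    "        \"spec\": {\n",
    "            \"podSelector\": {},  # empty selector = every pod in the namespace\n",
    "            \"policyTypes\": [\"Egress\"],\n",
    "            \"egress\": [\n",
    "                {\"to\": [{\"ipBlock\": {\"cidr\": \"0.0.0.0/0\", \"except\": [redis_cidr]}}]}\n",
    "            ],\n",
    "        },\n",
    "    }\n",
    "\n",
    "\n",
    "# 10.0.12.34/32 is a placeholder, not a real Redis endpoint\n",
    "policy = build_redis_egress_block(\"langsmith-test\", \"10.0.12.34/32\")\n",
    "print(json.dumps(policy, indent=2))\n",
    "```\n",
    "\n",
    "`kubectl apply -f` accepts JSON as well as YAML, and deleting the policy reverses the injection. How `ipBlock` rules treat in-cluster pod IPs can vary by CNI plugin, so verify the effect in a test namespace first.\n"
   ]
  },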
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from shared._shell import run\n",
    "import json\n",
    "\n",
    "print(\"### Quick Baseline Check\\n\")\n",
    "\n",
    "# Check pod status\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "if result.returncode == 0:\n",
    "    pods = json.loads(result.stdout)\n",
    "    healthy = sum(1 for p in pods.get(\"items\", [])\n",
    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
    "    total = len(pods.get(\"items\", []))\n",
    "    print(f\"Pods: {healthy}/{total} running\")\n",
    "\n",
    "    if healthy == total and total > 0:\n",
    "        ok(\"Baseline looks healthy\")\n",
    "    else:\n",
    "        warn(\"Some pods are not running - check baseline first\")\n",
    "else:\n",
    "    warn(\"Could not check pod status\")\n",
    "\n",
    "# Check for Redis secret\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "redis_secrets = []\n",
    "if result.returncode == 0:\n",
    "    secrets = json.loads(result.stdout)\n",
    "    for secret in secrets.get(\"items\", []):\n",
    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
    "        if \"redis\" in name.lower() or \"cache\" in name.lower():\n",
    "            redis_secrets.append(name)\n",
    "\n",
    "if redis_secrets:\n",
    "    ok(f\"Found {len(redis_secrets)} Redis-related secret(s)\")\n",
    "    for secret_name in redis_secrets:\n",
    "        print(f\" - {secret_name}\")\n",
    "else:\n",
    "    warn(\"No Redis secrets found\")\n",
    "    print(\" 💡 Redis connection may be configured differently\")\n"
]
},
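  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A pod can report phase `Running` while its containers are crash-looping or failing readiness probes, so the phase check above can overstate health. A stricter check reads the pod's `Ready` condition. A minimal sketch, where `pod_is_ready` is a hypothetical helper that operates on the same `kubectl get pods -o json` output:\n",
    "\n",
    "```python\n",
    "def pod_is_ready(pod: dict) -> bool:\n",
    "    \"\"\"True when the pod's 'Ready' condition has status 'True'.\"\"\"\n",
    "    conditions = pod.get(\"status\", {}).get(\"conditions\", [])\n",
    "    return any(\n",
    "        c.get(\"type\") == \"Ready\" and c.get(\"status\") == \"True\"\n",
    "        for c in conditions\n",
    "    )\n",
    "\n",
    "\n",
    "# Example pod objects trimmed to the fields the helper reads\n",
    "pods = {\n",
    "    \"items\": [\n",
    "        {\"status\": {\"phase\": \"Running\", \"conditions\": [{\"type\": \"Ready\", \"status\": \"True\"}]}},\n",
    "        {\"status\": {\"phase\": \"Running\", \"conditions\": [{\"type\": \"Ready\", \"status\": \"False\"}]}},\n",
    "    ]\n",
    "}\n",
    "ready = sum(1 for p in pods[\"items\"] if pod_is_ready(p))\n",
    "print(f\"Pods ready: {ready}/{len(pods['items'])}\")\n",
    "```\n",
    "\n",
    "Pods from completed Jobs are never `Ready`, so a real check may need to skip pods whose phase is `Succeeded`.\n"
   ]
  },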
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# FAILURE INJECTION: Wrong Redis Password\n",
    "# This cell modifies the Redis password secret to an invalid value\n",
    "\n",
    "import base64\n",
    "import yaml\n",
    "from datetime import datetime\n",
    "\n",
    "# Find Redis secret (look for common names)\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "redis_secret_name = None\n",
    "if result.returncode == 0:\n",
    "    secrets = json.loads(result.stdout)\n",
    "    for secret in secrets.get(\"items\", []):\n",
    "        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
    "        # Common patterns: redis, cache\n",
    "        if any(keyword in name.lower() for keyword in [\"redis\", \"cache\"]):\n",
    "            # Check if it has password-related keys (same list as the lookup further down)\n",
    "            data = secret.get(\"data\", {})\n",
    "            if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]):\n",
    "                redis_secret_name = name\n",
    "                break\n",
    "\n",
    "if not redis_secret_name:\n",
    "    raise RuntimeError(\"❌ Could not find Redis secret. Check your deployment configuration.\")\n",
    "\n",
    "print(f\"Found Redis secret: {redis_secret_name}\")\n",
    "\n",
    "# Get current secret\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
    "    check=True,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "# Save original secret for restoration\n",
    "backup_file = artifacts_dir / \"module-4\" / f\"redis-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
    "backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
    "with open(backup_file, \"w\") as f:\n",
    "    f.write(result.stdout)\n",
    "\n",
    "ok(f\"Backed up original secret to: {backup_file.name}\")\n",
    "\n",
    "# Parse YAML and modify password\n",
    "secret_data = yaml.safe_load(result.stdout)\n",
    "if \"data\" not in secret_data:\n",
    "    raise RuntimeError(\"Secret has no data section\")\n",
    "\n",
    "# Find password key (could be password, REDIS_PASSWORD, CACHE_PASSWORD, etc.)\n",
    "password_key = None\n",
    "for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]:\n",
    "    if key in secret_data[\"data\"]:\n",
    "        password_key = key\n",
    "        break\n",
    "\n",
    "if not password_key:\n",
    "    raise RuntimeError(\"Could not find password key in secret\")\n",
    "\n",
    "# Set invalid password\n",
    "invalid_password = \"INVALID_PASSWORD_12345\"\n",
    "invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
    "\n",
    "# Modify secret\n",
    "secret_data[\"data\"][password_key] = invalid_password_b64\n",
    "\n",
    "# Save modified secret to temp file\n",
    "temp_secret_file = artifacts_dir / \"module-4\" / \"redis-secret-modified.yaml\"\n",
    "with open(temp_secret_file, \"w\") as f:\n",
    "    yaml.dump(secret_data, f)\n",
"\n",
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
"print(f\" This will set an invalid password in secret: {redis_secret_name}\")\n",
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
]
},
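  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A manifest saved with `kubectl get -o yaml` carries server-managed metadata such as `resourceVersion` and `uid`. Re-applying it after the live object has changed can be rejected as a conflict, which matters for the backup taken above. One way to guard against that is to strip those fields before `kubectl apply`. A minimal sketch, where `strip_server_fields` is a hypothetical helper and not part of the shared modules:\n",
    "\n",
    "```python\n",
    "import copy\n",
    "\n",
    "# Metadata fields set by the API server rather than by the manifest author\n",
    "SERVER_MANAGED = (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\", \"generation\", \"selfLink\")\n",
    "\n",
    "\n",
    "def strip_server_fields(manifest: dict) -> dict:\n",
    "    \"\"\"Return a copy of the manifest without server-managed metadata.\"\"\"\n",
    "    cleaned = copy.deepcopy(manifest)  # leave the caller's dict untouched\n",
    "    metadata = cleaned.get(\"metadata\", {})\n",
    "    for field in SERVER_MANAGED:\n",
    "        metadata.pop(field, None)\n",
    "    cleaned.pop(\"status\", None)\n",
    "    return cleaned\n",
    "\n",
    "\n",
    "# Example secret with placeholder values only\n",
    "example = {\n",
    "    \"apiVersion\": \"v1\",\n",
    "    \"kind\": \"Secret\",\n",
    "    \"metadata\": {\"name\": \"example-redis\", \"namespace\": \"langsmith-test\", \"resourceVersion\": \"12345\", \"uid\": \"abc\"},\n",
    "    \"data\": {\"password\": \"cGxhY2Vob2xkZXI=\"},\n",
    "}\n",
    "print(strip_server_fields(example)[\"metadata\"])\n",
    "```\n",
    "\n",
    "In this lab that would mean loading the backup with `yaml.safe_load`, passing it through the helper, and applying the cleaned result if a direct `kubectl apply -f` of the backup reports a conflict.\n"
   ]
  },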
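  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Do the Drill - Step 2 (continued): Apply the Modified Secret\n",
    "\n",
    "**Uncomment the cell below only when you are ready to break Redis connectivity.**\n"
   ]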
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNCOMMENT TO APPLY FAILURE INJECTION\n",
    "# \n",
    "# result = run(\n",
    "#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
    "#     check=True,\n",
    "#     stream=True\n",
    "# )\n",
"# \n",
"# ok(\"Failure injection applied - Redis password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Worker pod logs for Redis connection errors\n",
"2. Queue backlog (if visible)\n",
"3. Worker retry patterns\n",
"4. Latency in API responses\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "# Create incident directory for diagnostics\n",
    "incident_dir = artifacts_dir / \"module-4\" / f\"redis-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
    "incident_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "print(f\"### Collecting Failure Diagnostics\\n\")\n",
    "print(f\"Saving to: {incident_dir}\\n\")\n",
    "\n",
    "# 1. Check pod status\n",
    "print(\"1. Checking pod status...\")\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "if result.returncode == 0:\n",
    "    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
    "        f.write(result.stdout)\n",
    "    print(result.stdout)\n",
    "\n",
    "    # Check for restarts (RESTARTS is the fourth column of the table output)\n",
    "    lines = result.stdout.split(\"\\n\")\n",
    "    restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
    "    if restarts:\n",
    "        print(\"\\n Pod restart counts:\")\n",
    "        for line in restarts[1:]:  # Skip header\n",
    "            if line.strip():\n",
    "                parts = line.split()\n",
    "                if len(parts) > 3:\n",
    "                    print(f\" {parts[0]}: {parts[3]} restarts\")\n",
    "\n",
    "# 2. Check recent events\n",
    "print(\"\\n2. Checking recent events...\")\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "if result.returncode == 0:\n",
    "    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
    "        f.write(result.stdout)\n",
    "    if result.stdout.strip():\n",
    "        print(\" Recent warning/error events:\")\n",
    "        for line in result.stdout.split(\"\\n\")[-5:]:\n",
    "            if line.strip():\n",
    "                print(f\" {line}\")\n",
    "\n",
    "# 3. Check worker pod logs for Redis errors\n",
    "print(\"\\n3. Checking worker pod logs for Redis errors...\")\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
    "if worker_pod:\n",
    "    result = run(\n",
    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
    "        check=False,\n",
    "        stream=False\n",
    "    )\n",
    "\n",
    "    if result.returncode == 0:\n",
    "        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
    "        with open(logs_file, \"w\") as f:\n",
    "            f.write(result.stdout)\n",
    "\n",
    "        # Look for Redis-related errors\n",
    "        # Keywords must be lowercase: each log line is lowercased before matching\n",
    "        error_keywords = [\"redis\", \"cache\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"noauth\", \"wrongpass\", \"retry\"]\n",
    "        error_lines = [l for l in result.stdout.split(\"\\n\")\n",
    "                       if any(kw in l.lower() for kw in error_keywords)]\n",
    "\n",
    "        if error_lines:\n",
    "            print(\" Found Redis-related errors:\")\n",
    "            for line in error_lines[-5:]:\n",
    "                print(f\" {line}\")\n",
    "        else:\n",
    "            print(\" No obvious Redis errors in recent logs\")\n",
    "else:\n",
    "    warn(\"Could not find worker pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "\n",
    "print(\"### Running Canonical Diagnostics Script\\n\")\n",
    "\n",
    "script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
    "script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
    "\n",
    "try:\n",
    "    urllib.request.urlretrieve(script_url, script_path)\n",
    "    script_path.chmod(0o755)\n",
    "\n",
    "    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
    "    result = run(\n",
    "        [str(script_path), namespace],\n",
    "        check=False,\n",
    "        stream=True\n",
    "    )\n",
    "\n",
    "    if result.returncode == 0:\n",
    "        ok(\"Diagnostics script completed\")\n",
    "\n",
    "        # Find and move tarball\n",
    "        # Path.suffix only returns the last extension (\".gz\"), so match on the full name\n",
    "        for file in incident_dir.parent.iterdir():\n",
    "            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
    "                target_path = incident_dir / file.name\n",
    "                file.rename(target_path)\n",
    "                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
    "                break\n",
    "    else:\n",
    "        warn(\"Diagnostics script had errors (check output above)\")\n",
    "\n",
    "except Exception as e:\n",
    "    warn(f\"Could not run diagnostics script: {e}\")\n",
    "    print(\" 💡 You can run it manually:\")\n",
    "    print(f\" curl -O {script_url}\")\n",
    "    print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
    "    print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for Redis issues:**\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"### Guided Triage Steps\\n\")\n",
    "\n",
    "print(\"1. Check worker pod logs for Redis connection errors:\")\n",
    "print(f\" kubectl logs -n {namespace} <worker-pod-name> | grep -i 'redis\\\\|cache\\\\|connection'\")\n",
    "print()\n",
    "\n",
    "print(\"2. Verify secret exists and has correct keys:\")\n",
    "print(f\" kubectl describe secret {redis_secret_name} -n {namespace}\")\n",
    "print(\" (describe lists key names and sizes only; avoid -o yaml, because base64 values are trivially decoded)\")\n",
    "print()\n",
    "\n",
    "print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
    "print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
    "print()\n",
    "\n",
    "print(\"4. Test Redis connectivity from a pod (if possible):\")\n",
    "print(\" kubectl run -it --rm debug --image=redis:7 --restart=Never -- \\\\\")\n",
    "print(\" redis-cli -h <redis-host> -p <port> -a <password> ping\")\n",
    "print()\n",
    "\n",
    "print(\"5. Check events for connection/authentication errors:\")\n",
    "print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
    "print()\n",
    "\n",
    "# Check what we can automatically\n",
    "print(\"\\n### Automatic Checks\\n\")\n",
    "\n",
    "# Check secret still exists\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "if result.returncode == 0:\n",
    "    ok(f\"Secret '{redis_secret_name}' still exists\")\n",
    "    secret_data = json.loads(result.stdout)\n",
    "    keys = list(secret_data.get(\"data\", {}).keys())\n",
    "    print(f\" Secret keys: {', '.join(keys)}\")\n",
    "else:\n",
    "    warn(f\"Secret '{redis_secret_name}' not found!\")\n",
    "\n",
    "# Check for pods with Redis connection env vars\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "if result.returncode == 0:\n",
    "    pods = json.loads(result.stdout)\n",
    "    redis_related_pods = []\n",
    "    for pod in pods.get(\"items\", []):\n",
    "        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
    "        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
    "        for container in containers:\n",
    "            env = container.get(\"env\", [])\n",
    "            redis_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper()\n",
    "                                               for kw in [\"REDIS\", \"CACHE\"])]\n",
    "            if redis_env:\n",
    "                redis_related_pods.append(name)\n",
    "                break\n",
    "\n",
    "    if redis_related_pods:\n",
    "        print(f\"\\n Pods with Redis environment variables:\")\n",
    "        for pod_name in set(redis_related_pods):\n",
    "            print(f\" - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# REMEDIATION: Restore original secret\n",
    "# UNCOMMENT TO RESTORE\n",
    "\n",
    "# if backup_file.exists():\n",
    "#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
    "#     result = run(\n",
    "#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
    "#         check=True,\n",
    "#         stream=True\n",
    "#     )\n",
    "#     \n",
    "#     ok(\"Original secret restored\")\n",
    "#     print(\"\\n💡 Pods only pick up the restored secret after a restart. If they do not restart on their own:\")\n",
    "#     print(f\" kubectl rollout restart deployment -n {namespace}\")\n",
    "#     print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
    "#     print(f\" kubectl get pods -n {namespace} -w\")\n",
    "#     \n",
    "#     import time\n",
    "#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
    "#     time.sleep(60)\n",
    "# else:\n",
    "#     warn(f\"Backup file not found: {backup_file}\")\n",
    "#     print(\" 💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
"print(f\" Backup file: {backup_file.name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"### Verifying Recovery\\n\")\n",
    "\n",
    "# Check pod status\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "if result.returncode == 0:\n",
    "    pods = json.loads(result.stdout)\n",
    "    running = sum(1 for p in pods.get(\"items\", [])\n",
    "                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
    "    total = len(pods.get(\"items\", []))\n",
    "\n",
    "    if running == total and total > 0:\n",
    "        ok(f\"All {total} pod(s) are running\")\n",
    "    else:\n",
    "        warn(f\"Only {running}/{total} pod(s) running\")\n",
    "        print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
    "\n",
    "# Check for recent errors in worker logs\n",
    "print(\"\\nChecking for recent errors in worker logs...\")\n",
    "result = run(\n",
    "    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
    "    check=False,\n",
    "    stream=False\n",
    ")\n",
    "\n",
    "worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
    "if worker_pod:\n",
    "    result = run(\n",
    "        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
    "        check=False,\n",
    "        stream=False\n",
    "    )\n",
    "\n",
    "    if result.returncode == 0:\n",
    "        error_keywords = [\"error\", \"fail\", \"redis\", \"cache\", \"connection\"]\n",
    "        recent_errors = [l for l in result.stdout.split(\"\\n\")\n",
    "                         if any(kw in l.lower() for kw in error_keywords)]\n",
    "\n",
    "        if recent_errors:\n",
    "            warn(\"Still seeing some errors in logs:\")\n",
    "            for line in recent_errors[-3:]:\n",
    "                print(f\" {line}\")\n",
    "        else:\n",
    "            ok(\"No recent errors in worker logs\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13. What Support Will Ask For\n",
    "\n",
    "**When escalating a Redis issue, Support will need:**\n",
    "\n",
    "1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
    "2. **Redis connection details:**\n",
    "   - Host/endpoint (redacted)\n",
    "   - Port\n",
    "   - Password (redacted)\n",
    "   - Whether using SSL/TLS\n",
    "3. **Error messages from logs:**\n",
    "   - Full error text (not just \"connection failed\")\n",
    "   - Timestamps of first occurrence\n",
    "   - Retry patterns\n",
    "4. **Recent changes:**\n",
    "   - Secret rotations\n",
    "   - Network policy changes\n",
    "   - Redis configuration changes\n",
    "5. **Queue status (if accessible):**\n",
    "   - Queue length\n",
    "   - Worker processing rate\n",
    "   - Backlog growth rate\n",
    "6. **Redis health (if accessible):**\n",
    "   - Redis version\n",
    "   - Memory usage\n",
    "   - Connection count\n",
    "   - Slow queries\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Worker pod logs with Redis errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- Redis endpoint connectivity test\n",
"- Queue metrics (if available)\n",
"- Redis logs (if accessible via cloud provider)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **Redis failures can be intermittent** - Connection pool retries may mask issues\n",
"2. **Worker logs are critical** - Redis errors appear in worker pod logs\n",
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
"- ❌ Not checking worker logs (API logs may not show Redis errors)\n",
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
"- ❌ Not testing Redis connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the ClickHouse or Blob Storage failure labs\n",
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,942 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - ClickHouse\n",
"\n",
"## ⚠️ CRITICAL SAFETY WARNING\n",
"\n",
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
"- **Modifies Kubernetes secrets** (breaks ClickHouse password)\n",
"- **Causes service disruptions** (traces delayed/missing, insert errors)\n",
"- **Requires remediation** to restore functionality\n",
"\n",
"**REQUIREMENTS:**\n",
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
"- ✅ **Backup/restore plan** available\n",
"\n",
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug ClickHouse connectivity failures in LangSmith.**\n",
"\n",
"ClickHouse is LangSmith's **trace storage**. It handles:\n",
"- Storing trace data (spans, events, metadata)\n",
"- Time-series queries for trace search and filtering\n",
"- High-volume writes from workers\n",
"- Efficient querying for UI display\n",
"\n",
"**When ClickHouse fails, you'll see:**\n",
"- Traces delayed or missing\n",
"- Insert errors and merge/backlog hints\n",
"- UI loads but traces don't appear\n",
"- Query timeouts\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how ClickHouse failures manifest\n",
"2. Practice collecting diagnostics for trace storage issues\n",
"3. Learn to identify connection vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** \n",
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bootstrap environment\n",
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Add notebooks directory to path\n",
    "possible_paths = [\n",
    "    Path.cwd().parent,\n",
    "    Path.cwd(),\n",
    "    Path.cwd() / \"notebooks\",\n",
    "]\n",
    "\n",
    "notebooks_path = None\n",
    "for path in possible_paths:\n",
    "    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
    "        notebooks_path = path\n",
    "        break\n",
    "\n",
    "if not notebooks_path:\n",
    "    notebooks_path = Path.cwd() / \"notebooks\"\n",
    "    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
    "        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
    "\n",
    "if str(notebooks_path) not in sys.path:\n",
    "    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ⚠️ CRITICAL: Environment Safety Verification\n",
"\n",
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
    "from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
    "from shared._validation import ok, warn, fail\n",
    "import os\n",
    "\n",
    "provider = get_cloud_provider()\n",
    "region = get_region()\n",
    "identity = get_identity()\n",
    "\n",
    "print(\"=\" * 70)\n",
    "print(\"⚠️ CRITICAL SAFETY CHECK - CLICKHOUSE FAILURE LAB\")\n",
    "print(\"=\" * 70)\n",
    "\n",
    "# Show environment details prominently\n",
    "provider_display = provider.upper()\n",
    "print(f\"\\n### Current Environment Configuration\")\n",
    "print(f\"Cloud Provider: {provider_display}\")\n",
    "print(f\"Region: {region}\")\n",
    "\n",
    "if provider == \"aws\":\n",
    "    account_id = identity.get('Account', 'N/A')\n",
    "    user_arn = identity.get('Arn', 'N/A')\n",
    "    print(f\"Account ID: {account_id}\")\n",
    "    print(f\"User ARN: {user_arn}\")\n",
    "elif provider == \"azure\":\n",
    "    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
    "    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
    "    print(f\"Subscription ID: {subscription_id}\")\n",
    "    print(f\"Subscription Name: {subscription_name}\")\n",
    "\n",
    "# Show all relevant environment variables\n",
    "print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
    "print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
    "print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
    "print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
    "print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 70)\n",
    "print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
    "print(\"=\" * 70)\n",
    "print(\"\\nThis failure lab will:\")\n",
    "print(\" 1. Find the ClickHouse secret in your namespace\")\n",
    "print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
    "print(\" 3. MODIFY the secret to set an INVALID password\")\n",
    "print(\" 4. Apply the modified secret (breaks ClickHouse connectivity)\")\n",
    "print(\" 5. Cause trace ingestion failures and query timeouts\")\n",
    "print(\" 6. Require remediation to restore (restore original secret)\")\n",
    "print(\"\\n\" + \"=\" * 70)\n",
    "\n",
    "# Check for Module 4 safety flag\n",
    "module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
    "if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
    "    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
    "    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
    "    print(\"\\nTo run this failure lab, you MUST:\")\n",
    "    print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
    "    print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
    "    print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
    "    print(\" 4. Re-run this cell to confirm\")\n",
    "    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
    "    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
"\n",
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
"print(\"\\n⚠️ REMINDER: This lab will break ClickHouse connectivity.\")\n",
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
"print(\" Original secret will be backed up automatically.\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"✅ Environment verified - ready for ClickHouse failure lab\")\n",
"print(\"=\" * 70)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. What This Service Does for LangSmith\n",
    "\n",
    "ClickHouse is LangSmith's **trace storage**. It handles:\n",
    "\n",
    "- **Trace data storage:**\n",
    "  - Spans, events, and metadata for every trace\n",
    "  - High-volume writes from workers\n",
    "\n",
    "- **Trace queries:**\n",
    "  - Time-series queries for trace search and filtering\n",
    "  - Efficient querying for UI display\n",
    "\n",
    "**Why it matters:**\n",
    "- Without ClickHouse, workers can't write trace data\n",
    "- Traces are delayed or go missing\n",
    "- The UI loads, but traces don't appear\n",
    "- Trace queries time out\n",
    "\n",
    "**How LangSmith connects:**\n",
    "- Connection details stored in Kubernetes Secrets\n",
    "- Workers connect to ClickHouse to write trace data\n",
    "- Trace search and UI display read from ClickHouse\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Expected Symptoms When ClickHouse Fails\n",
    "\n",
    "**What you'll see:**\n",
    "\n",
    "1. **Traces delayed or missing:**\n",
    "   - Some traces appear, others don't\n",
    "   - UI loads but traces don't appear\n",
    "   - Retries visible in logs\n",
    "\n",
    "2. **Query timeouts:**\n",
    "   - Trace search and filtering slow down\n",
    "   - Timeout errors in API responses\n",
    "\n",
    "3. **Insert errors and backlog:**\n",
    "   - Worker inserts fail or are retried\n",
    "   - Merge/backlog hints in logs\n",
    "   - Unwritten traces accumulating\n",
    "\n",
    "4. **Log patterns:**\n",
    "   - Connection timeout errors\n",
    "   - \"connection refused\" or \"connection reset\"\n",
    "   - \"Authentication failed\" (ClickHouse error code 516, if password wrong)\n",
    "   - Retry attempts in worker logs\n",
    "\n",
    "**Timeline:**\n",
    "- Symptoms may be intermittent (connection pool retries)\n",
    "- Insert backlog builds over time\n",
    "- Trace queries slow down, then time out\n",
    "- Full failure if connection pool exhausted\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong ClickHouse Password**\n",
"- Modify the ClickHouse password in the Kubernetes Secret\n",
"- Symptoms: Authentication failures, connection refused, intermittent failures\n",
"\n",
"**Option B: Block Egress to ClickHouse Endpoint**\n",
"- Apply NetworkPolicy blocking egress to ClickHouse (if NetworkPolicy supported)\n",
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option C: Wrong ClickHouse Host/Endpoint**\n",
"- Point connection string to non-existent host\n",
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
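{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option B can be sketched as a manifest built in Python. This is a minimal sketch, not part of the lab's tooling: the policy name `lab-block-clickhouse-egress`, the `app=langsmith-worker` label, and the `203.0.113.10/32` address are placeholder assumptions, and the policy only takes effect on clusters whose CNI enforces NetworkPolicy. Because NetworkPolicy rules are allow-lists, the block is expressed as \"allow all egress except one CIDR\".\n",
"\n",
"```python\n",
"import json\n",
"\n",
"\n",
"def build_block_egress_policy(namespace, blocked_cidr, pod_labels):\n",
"    # Build a NetworkPolicy that allows all egress except blocked_cidr\n",
"    return {\n",
"        'apiVersion': 'networking.k8s.io/v1',\n",
"        'kind': 'NetworkPolicy',\n",
"        'metadata': {'name': 'lab-block-clickhouse-egress', 'namespace': namespace},\n",
"        'spec': {\n",
"            'podSelector': {'matchLabels': pod_labels},\n",
"            'policyTypes': ['Egress'],\n",
"            'egress': [\n",
"                {'to': [{'ipBlock': {'cidr': '0.0.0.0/0', 'except': [blocked_cidr]}}]}\n",
"            ],\n",
"        },\n",
"    }\n",
"\n",
"\n",
"# Placeholder values: replace with your namespace, ClickHouse address, and pod labels\n",
"policy = build_block_egress_policy(\n",
"    'langsmith-test', '203.0.113.10/32', {'app': 'langsmith-worker'}\n",
")\n",
"print(json.dumps(policy, indent=2))\n",
"```\n",
"\n",
"Saving the printed JSON to a file and running `kubectl apply -f` on it would inject the failure; `kubectl delete networkpolicy lab-block-clickhouse-egress -n <namespace>` would reverse it. An `ipBlock` rule suits an external ClickHouse endpoint. For an in-cluster ClickHouse, pod IPs change, so a selector-based rule is the safer choice.\n"
]
},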
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    healthy = sum(1 for p in pods.get(\"items\", [])\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(pods.get(\"items\", []))\n",
"    print(f\"Pods: {healthy}/{total} running\")\n",
"\n",
"    if healthy == total and total > 0:\n",
"        ok(\"Baseline looks healthy\")\n",
"    else:\n",
"        warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
"    warn(\"Could not check pod status\")\n",
"\n",
"# Check for ClickHouse secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"clickhouse_secrets = []\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        if \"clickhouse\" in name.lower():\n",
"            clickhouse_secrets.append(name)\n",
"\n",
"if clickhouse_secrets:\n",
"    ok(f\"Found {len(clickhouse_secrets)} ClickHouse-related secret(s)\")\n",
"    for secret_name in clickhouse_secrets:\n",
"        print(f\"  - {secret_name}\")\n",
"else:\n",
"    warn(\"No ClickHouse secrets found\")\n",
"    print(\"  💡 ClickHouse connection may be configured differently\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong ClickHouse Password\n",
"# This cell prepares a copy of the ClickHouse secret with an invalid password.\n",
"# Nothing is applied to the cluster until you run the next code cell.\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Candidate password keys (assumed names; adjust to match your secret)\n",
"password_keys = [\"password\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\", \"clickhouse_password\"]\n",
"\n",
"# Find ClickHouse secret (look for common names)\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"clickhouse_secret_name = None\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        if \"clickhouse\" in name.lower():\n",
"            # Only accept secrets that carry a password-like key\n",
"            data = secret.get(\"data\") or {}\n",
"            if any(key in data for key in password_keys):\n",
"                clickhouse_secret_name = name\n",
"                break\n",
"\n",
"if not clickhouse_secret_name:\n",
"    raise RuntimeError(\"❌ Could not find ClickHouse secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found ClickHouse secret: {clickhouse_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
"    check=True,\n",
"    stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"clickhouse-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
"    f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if not secret_data.get(\"data\"):\n",
"    raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find the first candidate password key present in the secret\n",
"password_key = next((key for key in password_keys if key in secret_data[\"data\"]), None)\n",
"\n",
"if not password_key:\n",
"    raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password (Secret values are base64-encoded)\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Drop server-managed metadata so a stale resourceVersion cannot cause an apply conflict\n",
"for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
"    secret_data.get(\"metadata\", {}).pop(field, None)\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"clickhouse-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
"    yaml.safe_dump(secret_data, f)\n",
"\n",
"print(\"\\n⚠️  READY TO APPLY FAILURE INJECTION\")\n",
"print(f\"   This will set an invalid password in secret: {clickhouse_secret_name}\")\n",
"print(f\"   Modified secret saved to: {temp_secret_file.name}\")\n",
"print(\"\\n   To apply, uncomment and run the next code cell.\")\n"
]
},
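{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding step in the cell above can be tried in isolation. The sketch below shows how a value is stored under a Secret's `data` field and how it is read back; the helper names `encode_secret_value` and `decode_secret_value` are illustrative and not part of the shared modules. It also makes the point that base64 is an encoding, not encryption: anyone who can read the Secret can recover the password.\n",
"\n",
"```python\n",
"import base64\n",
"\n",
"\n",
"def encode_secret_value(plaintext):\n",
"    # Kubernetes stores Secret data as base64-encoded bytes\n",
"    return base64.b64encode(plaintext.encode('utf-8')).decode('ascii')\n",
"\n",
"\n",
"def decode_secret_value(encoded):\n",
"    # Reverse of encode_secret_value; no key or password is needed\n",
"    return base64.b64decode(encoded).decode('utf-8')\n",
"\n",
"\n",
"encoded = encode_secret_value('INVALID_PASSWORD_12345')\n",
"assert decode_secret_value(encoded) == 'INVALID_PASSWORD_12345'\n",
"print(encoded)\n",
"```\n",
"\n",
"Treat any file that contains the encoded value, including the backup written by this lab, as sensitive.\n"
]
},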
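{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Do the Drill - Step 2 (continued): Apply the Modified Secret\n",
"\n",
"**The next cell is commented out on purpose. Uncomment it only when you are ready to break the ClickHouse connection.**\n"
]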
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"#\n",
"# result = run(\n",
"#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"#     check=True,\n",
"#     stream=True\n",
"# )\n",
"#\n",
"# ok(\"Failure injection applied - ClickHouse password is now invalid\")\n",
"# print(\"\\n💡 Pods that read the secret through environment variables keep the old value until they restart.\")\n",
"# print(\"   If nothing restarts on its own, trigger a rollout:\")\n",
"# print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
"# print(\"   Then watch the pods come back:\")\n",
"# print(f\"   kubectl get pods -n {namespace} -w\")\n",
"#\n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Worker pod logs for ClickHouse connection errors\n",
"2. Queue backlog (if visible)\n",
"3. Worker retry patterns\n",
"4. Latency in API responses\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"clickhouse-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    print(result.stdout)\n",
"\n",
"    # Report restart counts (RESTARTS is the 4th column of `kubectl get pods`)\n",
"    pod_lines = [l for l in result.stdout.split(\"\\n\")[1:] if l.strip()]  # skip header\n",
"    if pod_lines:\n",
"        print(\"\\n   Pod restart counts:\")\n",
"        for line in pod_lines:\n",
"            parts = line.split()\n",
"            if len(parts) > 3:\n",
"                print(f\"     {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
"    # No shell is involved, so the sort key must not be wrapped in quotes\n",
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    if result.stdout.strip():\n",
"        print(\"   Recent warning/error events:\")\n",
"        for line in result.stdout.split(\"\\n\")[-5:]:\n",
"            if line.strip():\n",
"                print(f\"     {line}\")\n",
"\n",
"# 3. Check worker pod logs for ClickHouse errors\n",
"print(\"\\n3. Checking worker pod logs for ClickHouse errors...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
"        with open(logs_file, \"w\") as f:\n",
"            f.write(result.stdout)\n",
"\n",
"        # Look for ClickHouse-related errors (keywords are lowercase because lines are lowercased)\n",
"        error_keywords = [\"clickhouse\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"retry\"]\n",
"        error_lines = [l for l in result.stdout.split(\"\\n\")\n",
"                       if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if error_lines:\n",
"            print(\"   Found ClickHouse-related errors:\")\n",
"            for line in error_lines[-5:]:\n",
"                print(f\"     {line}\")\n",
"        else:\n",
"            print(\"   No obvious ClickHouse errors in recent logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    script_path.chmod(0o755)\n",
"\n",
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed\")\n",
"\n",
"        # Find and move tarball\n",
"        for file in incident_dir.parent.iterdir():\n",
"            # Path.suffix is only \".gz\" for \"x.tar.gz\", so compare the full name\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                target_path = incident_dir / file.name\n",
"                file.rename(target_path)\n",
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
"                break\n",
"    else:\n",
"        warn(\"Diagnostics script had errors (check output above)\")\n",
"\n",
"except Exception as e:\n",
"    warn(f\"Could not run diagnostics script: {e}\")\n",
"    print(\"   💡 You can run it manually:\")\n",
"    print(f\"   curl -O {script_url}\")\n",
"    print(\"   chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\"   ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for ClickHouse issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check worker pod logs for ClickHouse connection errors:\")\n",
"print(f\"   kubectl logs -n {namespace} <worker-pod-name> | grep -i 'clickhouse\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify secret exists and has correct keys:\")\n",
"print(f\"   kubectl describe secret {clickhouse_secret_name} -n {namespace}\")\n",
"print(\"   (describe lists key names and sizes only; -o yaml would print the values, which are base64-encoded, not encrypted)\")\n",
"print()\n",
"\n",
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
"print(f\"   kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
"print()\n",
"\n",
"print(\"4. Test ClickHouse connectivity from a pod (if possible):\")\n",
"print(f\"   kubectl run -it --rm debug -n {namespace} --image=clickhouse/clickhouse-server --restart=Never -- \\\\\")\n",
"print(\"     clickhouse-client --host <clickhouse-host> --port <native-port> --user <user> --password <password> --query 'SELECT 1'\")\n",
"print()\n",
"\n",
"print(\"5. Check events for connection/authentication errors:\")\n",
"print(f\"   kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    ok(f\"Secret '{clickhouse_secret_name}' still exists\")\n",
"    secret_data = json.loads(result.stdout)\n",
"    keys = list(secret_data.get(\"data\", {}).keys())\n",
"    print(f\"   Secret keys: {', '.join(keys)}\")\n",
"else:\n",
"    warn(f\"Secret '{clickhouse_secret_name}' not found!\")\n",
"\n",
"# Check for pods with ClickHouse connection env vars\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    clickhouse_related_pods = []\n",
"    for pod in pods.get(\"items\", []):\n",
"        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
"        for container in containers:\n",
"            env = container.get(\"env\", [])\n",
"            clickhouse_env = [e for e in env if \"CLICKHOUSE\" in e.get(\"name\", \"\").upper()]\n",
"            if clickhouse_env:\n",
"                clickhouse_related_pods.append(name)\n",
"                break\n",
"\n",
"    if clickhouse_related_pods:\n",
"        print(\"\\n   Pods with ClickHouse environment variables:\")\n",
"        for pod_name in sorted(set(clickhouse_related_pods)):\n",
"            print(f\"     - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
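{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# if backup_file.exists():\n",
"#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
"#\n",
"#     # Strip server-managed metadata: the backup holds a resourceVersion that is\n",
"#     # stale after the injection, which can make kubectl apply fail with a conflict\n",
"#     with open(backup_file) as f:\n",
"#         original_secret = yaml.safe_load(f)\n",
"#     for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
"#         original_secret.get(\"metadata\", {}).pop(field, None)\n",
"#     restore_file = backup_file.with_name(\"clickhouse-secret-restore.yaml\")\n",
"#     with open(restore_file, \"w\") as f:\n",
"#         yaml.safe_dump(original_secret, f)\n",
"#\n",
"#     result = run(\n",
"#         [\"kubectl\", \"apply\", \"-f\", str(restore_file)],\n",
"#         check=True,\n",
"#         stream=True\n",
"#     )\n",
"#\n",
"#     ok(\"Original secret restored\")\n",
"#     print(\"\\n💡 Pods keep the invalid password until they restart. If needed, trigger a rollout:\")\n",
"#     print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
"#     print(\"   This may take 1-2 minutes. Monitor pod status:\")\n",
"#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
"#\n",
"#     import time\n",
"#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"#     time.sleep(60)\n",
"# else:\n",
"#     warn(f\"Backup file not found: {backup_file}\")\n",
"#     print(\"   💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️  To restore, uncomment the code above and run this cell.\")\n",
"if 'backup_file' in locals():\n",
"    print(f\"   Backup file: {backup_file.name}\")\n"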
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    running = sum(1 for p in pods.get(\"items\", [])\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(pods.get(\"items\", []))\n",
"\n",
"    if running == total and total > 0:\n",
"        ok(f\"All {total} pod(s) are running\")\n",
"    else:\n",
"        warn(f\"Only {running}/{total} pod(s) running\")\n",
"        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in worker logs\n",
"print(\"\\nChecking for recent errors in worker logs...\")\n",
"result = run(\n",
"    # No shell is involved, so the jsonpath template must not be wrapped in quotes\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        error_keywords = [\"error\", \"fail\", \"clickhouse\", \"connection\"]\n",
"        recent_errors = [l for l in result.stdout.split(\"\\n\")\n",
"                         if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if recent_errors:\n",
"            warn(\"Still seeing some errors in logs:\")\n",
"            for line in recent_errors[-3:]:\n",
"                print(f\"   {line}\")\n",
"        else:\n",
"            ok(\"No recent errors in worker logs\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
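{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recovery check above counts pods whose phase is `Running`. A pod stuck in a crash loop still reports that phase, and a completed Job pod reports `Succeeded`, so the count can mislead in both directions. A stricter check, sketched here under the assumption that the input is the parsed output of `kubectl get pods -o json`, looks at container readiness instead. The function name `summarize_pods` and the sample data are illustrative.\n",
"\n",
"```python\n",
"def summarize_pods(pod_list):\n",
"    # Split pods into ready and not-ready names, skipping completed Job pods\n",
"    ready, not_ready = [], []\n",
"    for pod in pod_list.get('items', []):\n",
"        name = pod.get('metadata', {}).get('name', '')\n",
"        status = pod.get('status', {})\n",
"        if status.get('phase') == 'Succeeded':\n",
"            continue\n",
"        containers = status.get('containerStatuses', [])\n",
"        if containers and all(c.get('ready') for c in containers):\n",
"            ready.append(name)\n",
"        else:\n",
"            not_ready.append(name)\n",
"    return ready, not_ready\n",
"\n",
"\n",
"# Hand-written sample in the shape of `kubectl get pods -o json`\n",
"sample = {\n",
"    'items': [\n",
"        {'metadata': {'name': 'worker-a'},\n",
"         'status': {'phase': 'Running', 'containerStatuses': [{'ready': True}]}},\n",
"        {'metadata': {'name': 'worker-b'},\n",
"         'status': {'phase': 'Running', 'containerStatuses': [{'ready': False}]}},\n",
"        {'metadata': {'name': 'migrate-job'},\n",
"         'status': {'phase': 'Succeeded', 'containerStatuses': [{'ready': False}]}},\n",
"    ]\n",
"}\n",
"\n",
"ready, not_ready = summarize_pods(sample)\n",
"print(f'ready: {ready}')\n",
"print(f'not ready: {not_ready}')\n",
"```\n",
"\n",
"In the lab, `summarize_pods(json.loads(result.stdout))` could replace the phase count once the pod list has been fetched.\n"
]
},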
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a ClickHouse issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **ClickHouse connection details:**\n",
"   - Host/endpoint (redacted)\n",
"   - Port\n",
"   - Password (redacted)\n",
"   - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
"   - Full error text (not just \"connection failed\")\n",
"   - Timestamps of first occurrence\n",
"   - Retry patterns\n",
"4. **Recent changes:**\n",
"   - Secret rotations\n",
"   - Network policy changes\n",
"   - ClickHouse configuration changes\n",
"5. **Queue status (if accessible):**\n",
"   - Queue length\n",
"   - Worker processing rate\n",
"   - Backlog growth rate\n",
"6. **ClickHouse health (if accessible):**\n",
"   - ClickHouse version\n",
"   - Memory usage\n",
"   - Connection count\n",
"   - Slow queries\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Worker pod logs with ClickHouse errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- ClickHouse endpoint connectivity test\n",
"- Queue metrics (if available)\n",
"- ClickHouse logs (if accessible via cloud provider)\n"
]
},
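{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sharing \"structure, not values\" can be done mechanically before a manifest leaves your environment. The helper below is a minimal sketch that assumes the Secret has already been parsed into a dictionary; `redact_secret` and the example secret name and key are hypothetical.\n",
"\n",
"```python\n",
"import copy\n",
"\n",
"\n",
"def redact_secret(secret):\n",
"    # Return a copy of a Secret manifest with every value replaced by a marker\n",
"    redacted = copy.deepcopy(secret)\n",
"    for section in ('data', 'stringData'):\n",
"        if redacted.get(section):\n",
"            redacted[section] = {key: '<redacted>' for key in redacted[section]}\n",
"    # kubectl apply stores the full manifest, values included, in this annotation\n",
"    annotations = redacted.get('metadata', {}).get('annotations') or {}\n",
"    annotations.pop('kubectl.kubernetes.io/last-applied-configuration', None)\n",
"    return redacted\n",
"\n",
"\n",
"# Hypothetical example secret; the value is the base64 encoding of 'secret'\n",
"example = {\n",
"    'kind': 'Secret',\n",
"    'metadata': {'name': 'example-clickhouse-secret'},\n",
"    'data': {'clickhouse_password': 'c2VjcmV0'},\n",
"}\n",
"print(redact_secret(example))\n",
"```\n",
"\n",
"The key names survive, which is usually what Support needs in order to spot a missing or misnamed key.\n"
]
},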
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **ClickHouse failures can be intermittent** - Connection pool retries may mask issues\n",
"2. **Worker logs are critical** - ClickHouse errors appear in worker pod logs\n",
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
"- ❌ Not checking worker logs (API logs may not show ClickHouse errors)\n",
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
"- ❌ Not testing ClickHouse connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the other failure labs, such as Blob Storage\n",
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,946 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Failure Lab - Blob Storage\n",
"\n",
"## ⚠️ CRITICAL SAFETY WARNING\n",
"\n",
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
"- **Modifies Kubernetes secrets** (breaks blob storage credentials)\n",
"- **Causes service disruptions** (large payload failures, ClickHouse degradation)\n",
"- **Requires remediation** to restore functionality\n",
"\n",
"**REQUIREMENTS:**\n",
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
"- ✅ **Backup/restore plan** available\n",
"\n",
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
"\n",
"## Overview\n",
"\n",
"**This lab teaches you how to debug Blob Storage configuration failures in LangSmith.**\n",
"\n",
"Blob Storage is LangSmith's **large payload storage**. It handles:\n",
"- Storing large trace payloads and artifacts\n",
"- Offloading large data from ClickHouse\n",
"- Providing durable storage for trace data\n",
"\n",
"**When Blob Storage fails, you'll see:**\n",
"- Large payload traces degrade ClickHouse performance\n",
"- Warnings/errors in logs about artifact storage\n",
"- Increased ClickHouse pressure and latency under load\n",
"- Traces with large payloads fail to store properly\n",
"\n",
"**Learning Objectives:**\n",
"1. Understand how Blob Storage failures manifest\n",
"2. Practice collecting diagnostics for blob storage issues\n",
"3. Learn to identify configuration vs. credential vs. network issues\n",
"4. Practice safe remediation\n",
"\n",
"**Estimated time:** 30-45 minutes\n",
"\n",
"**⚠️ Important:** \n",
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path\n",
"possible_paths = [\n",
"    Path.cwd().parent,\n",
"    Path.cwd(),\n",
"    Path.cwd() / \"notebooks\",\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
"    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        notebooks_path = path\n",
"        break\n",
"\n",
"if not notebooks_path:\n",
"    notebooks_path = Path.cwd() / \"notebooks\"\n",
"    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
"        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"if str(notebooks_path) not in sys.path:\n",
"    sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"from shared._validation import ok, warn\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ⚠️ CRITICAL: Environment Safety Verification\n",
"\n",
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn, fail\n",
"import os\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"=\" * 70)\n",
"print(\"⚠️  CRITICAL SAFETY CHECK - BLOB STORAGE FAILURE LAB\")\n",
"print(\"=\" * 70)\n",
"\n",
"# Show environment details prominently\n",
"provider_display = provider.upper()\n",
"print(\"\\n### Current Environment Configuration\")\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
"    account_id = identity.get('Account', 'N/A')\n",
"    user_arn = identity.get('Arn', 'N/A')\n",
"    print(f\"Account ID: {account_id}\")\n",
"    print(f\"User ARN: {user_arn}\")\n",
"elif provider == \"azure\":\n",
"    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
"    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
"    print(f\"Subscription ID: {subscription_id}\")\n",
"    print(f\"Subscription Name: {subscription_name}\")\n",
"\n",
"# Show all relevant environment variables\n",
"print(\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"⚠️  WHAT THIS LAB WILL DO:\")\n",
"print(\"=\" * 70)\n",
"print(\"\\nThis failure lab will:\")\n",
"print(\"  1. Find the Blob Storage secret in your namespace\")\n",
"print(\"  2. BACKUP the original secret (saved to artifacts)\")\n",
"print(\"  3. MODIFY the secret to set INVALID credentials\")\n",
"print(\"  4. Apply the modified secret (breaks blob storage connectivity)\")\n",
"print(\"  5. Cause large payload failures and ClickHouse degradation\")\n",
"print(\"  6. Require remediation to restore (restore original secret)\")\n",
"print(\"\\n\" + \"=\" * 70)\n",
"\n",
"# Check for Module 4 safety flag\n",
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
"    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
"    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
"    print(\"\\nTo run this failure lab, you MUST:\")\n",
"    print(\"  1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
"    print(\"  2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
"    print(\"  3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
"    print(\"  4. Re-run this cell to confirm\")\n",
"    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
"    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
"\n",
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
"print(\"\\n⚠️  REMINDER: This lab will break blob storage connectivity.\")\n",
"print(\"   Ensure you understand the remediation steps before proceeding.\")\n",
"print(\"   Original secret will be backed up automatically.\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"✅ Environment verified - ready for Blob Storage failure lab\")\n",
"print(\"=\" * 70)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration & Prerequisites\n",
"\n",
"Load configuration and verify prerequisites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env\n",
"\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"config = require_env(*required_vars)\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"\n",
"print(f\"Namespace: {namespace}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What This Service Does for LangSmith\n",
"\n",
"Blob Storage is LangSmith's **large payload storage**. It handles:\n",
"\n",
"- **Large trace payloads and artifacts:**\n",
"  - Workers upload large payloads and artifacts to the bucket\n",
"\n",
"- **Offloading ClickHouse:**\n",
"  - Large data is kept out of ClickHouse\n",
"  - Reduces insert, merge, and storage pressure\n",
"\n",
"- **Durable storage for trace data:**\n",
"  - Payloads persist independently of pod restarts\n",
"\n",
"**Why it matters:**\n",
"- Without Blob Storage, large payloads put pressure on ClickHouse or fail to store\n",
"- ClickHouse latency and storage usage climb under load\n",
"- Traces with large payloads become unreliable\n",
"\n",
"**How LangSmith connects:**\n",
"- Blob storage credentials are stored in Kubernetes Secrets\n",
"- Workers use them to upload payloads and artifacts to the bucket\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Expected Symptoms When Blob Storage Fails\n",
"\n",
"**What you'll see:**\n",
"\n",
"1. **Large payload traces degrade ClickHouse:**\n",
"   - ClickHouse performance degrades under load\n",
"   - Insert operations slow down\n",
"   - Query performance suffers\n",
"   - Storage pressure increases\n",
"\n",
"2. **Warnings/errors in logs about artifact storage:**\n",
"   - Worker logs show artifact upload failures\n",
"   - Bucket access errors\n",
"   - Credential errors\n",
"   - \"No such bucket\" or \"Access Denied\" errors\n",
"\n",
"3. **Increased ClickHouse pressure:**\n",
"   - ClickHouse latency increases\n",
"   - Merge operations backlog\n",
"   - Storage usage spikes\n",
"   - Query timeouts\n",
"\n",
"4. **Log patterns:**\n",
"   - Artifact storage errors in worker logs\n",
"   - S3/blob storage connection errors\n",
"   - Bucket access denied errors\n",
"   - Credential errors\n",
"   - Configuration errors\n",
"\n",
"**Timeline:**\n",
"- Symptoms appear gradually (under load)\n",
"- ClickHouse performance degrades over time\n",
"- Large traces fail or are rejected\n",
"- Full failure if blob storage is completely unavailable\n"
]
},
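{
"cell_type": "markdown",
"metadata": {},
"source": [
"Telling configuration, credential, and network problems apart is mostly a matter of reading the error text. The sketch below sorts log lines into those three groups by keyword. It is an illustration only: `classify_log_line` is a hypothetical helper, the keyword lists combine the messages named above with common S3 error codes, and the sample lines are made up rather than taken from real LangSmith logs.\n",
"\n",
"```python\n",
"# Lowercase keywords per failure category (assumed patterns; extend for your provider)\n",
"PATTERNS = {\n",
"    'credential': ['access denied', 'accessdenied', 'invalidaccesskeyid', 'signaturedoesnotmatch'],\n",
"    'configuration': ['no such bucket', 'nosuchbucket'],\n",
"    'network': ['timed out', 'timeout', 'connection refused', 'no route to host'],\n",
"}\n",
"\n",
"\n",
"def classify_log_line(line):\n",
"    # Return the first category whose keyword appears in the line\n",
"    lowered = line.lower()\n",
"    for category, needles in PATTERNS.items():\n",
"        if any(needle in lowered for needle in needles):\n",
"            return category\n",
"    return 'unknown'\n",
"\n",
"\n",
"# Made-up sample lines, one per category\n",
"for sample in [\n",
"    'upload failed: AccessDenied: Access Denied',\n",
"    'NoSuchBucket: The specified bucket does not exist',\n",
"    'dial tcp 203.0.113.10:443: i/o timeout',\n",
"]:\n",
"    print(classify_log_line(sample), '<-', sample)\n",
"```\n",
"\n",
"A credential result points at the Secret, a configuration result at the bucket name or endpoint setting, and a network result at egress rules or DNS.\n"
]
},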
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Failure Injection Options\n",
"\n",
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
"\n",
"### Level 1: Subtle Failure (Recommended for first run)\n",
"\n",
"**Option A: Wrong Blob Storage Credentials**\n",
"- Modify the Blob Storage credentials in the Kubernetes Secret\n",
"- Symptoms: Credential errors, \"Access Denied\", failures that show up only for large payloads\n",
"\n",
"**Option B: Block Egress to Blob Storage Endpoint**\n",
"- Apply NetworkPolicy blocking egress to Blob Storage (if NetworkPolicy supported)\n",
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
"\n",
"### Level 2: Obvious Failure\n",
"\n",
"**Option C: Wrong Bucket Name or Endpoint**\n",
"- Point the configuration to a non-existent bucket or host\n",
"- Symptoms: \"No such bucket\", DNS resolution failures, immediate failures\n",
"\n",
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
"\n",
"**Before injecting any failure, verify your baseline is healthy.**\n",
"\n",
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"import json\n",
"\n",
"print(\"### Quick Baseline Check\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" healthy = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" print(f\"Pods: {healthy}/{total} running\")\n",
" \n",
" if healthy == total and total > 0:\n",
" ok(\"Baseline looks healthy\")\n",
" else:\n",
" warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
" warn(\"Could not check pod status\")\n",
"\n",
"# Check for Blob Storage secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"blob_secrets = []\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" if \"blob\" in name.lower() or \"blob\" in name.lower():\n",
" blob_secrets.append(name)\n",
"\n",
"if blob_secrets:\n",
" ok(f\"Found {len(blob_secrets)} Blob Storage-related secret(s)\")\n",
" for secret_name in blob_secrets:\n",
" print(f\" - {secret_name}\")\n",
"else:\n",
" warn(\"No Blob Storage secrets found\")\n",
" print(\" 💡 Blob Storage connection may be configured differently\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
"\n",
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
"\n",
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FAILURE INJECTION: Wrong Blob Storage Password\n",
"# This cell modifies the Blob Storage password secret to an invalid value\n",
"\n",
"import base64\n",
"import yaml\n",
"from datetime import datetime\n",
"\n",
"# Find Blob Storage secret (look for common names)\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"blob_secret_name = None\n",
"if result.returncode == 0:\n",
" secrets = json.loads(result.stdout)\n",
" for secret in secrets.get(\"items\", []):\n",
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
" # Common patterns: blob, blob\n",
" if any(keyword in name.lower() for keyword in [\"blob\", \"blob\"]):\n",
" # Check if it has password-related keys\n",
" data = secret.get(\"data\", {})\n",
" if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"BLOB_PASSWORD\"]):\n",
" blob_secret_name = name\n",
" break\n",
"\n",
"if not blob_secret_name:\n",
" raise RuntimeError(\"❌ Could not find Blob Storage secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found Blob Storage secret: {blob_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
" check=True,\n",
" stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"blob-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
" raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find password key (could be password, REDIS_PASSWORD, BLOB_PASSWORD, etc.)\n",
"password_key = None\n",
"for key in [\"password\", \"REDIS_PASSWORD\", \"BLOB_PASSWORD\", \"blob-password\"]:\n",
" if key in secret_data[\"data\"]:\n",
" password_key = key\n",
" break\n",
"\n",
"if not password_key:\n",
" raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"blob-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
" yaml.dump(secret_data, f)\n",
"\n",
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
"print(f\" This will set an invalid password in secret: {blob_secret_name}\")\n",
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
"# \n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Failure injection applied - Blob Storage password is now invalid\")\n",
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# # Wait a moment for changes to propagate\n",
"# import time\n",
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
"# time.sleep(30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
"\n",
"**Now that the failure is injected, observe how it manifests.**\n",
"\n",
"Check:\n",
"1. Worker pod logs for Blob Storage connection errors\n",
"2. Queue backlog (if visible)\n",
"3. Worker retry patterns\n",
"4. Latency in API responses\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create incident directory for diagnostics\n",
"incident_dir = artifacts_dir / \"module-4\" / f\"blob-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
"print(f\"Saving to: {incident_dir}\\n\")\n",
"\n",
"# 1. Check pod status\n",
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" print(result.stdout)\n",
" \n",
" # Check for restarts\n",
" lines = result.stdout.split(\"\\n\")\n",
" restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
" if restarts:\n",
" print(\"\\n Pod restart counts:\")\n",
" for line in restarts[1:]: # Skip header\n",
" if line.strip():\n",
" parts = line.split()\n",
" if len(parts) > 3:\n",
" print(f\" {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by='.lastTimestamp'\", \"--field-selector=type!=Normal\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" with open(incident_dir / \"events.txt\", \"w\") as f:\n",
" f.write(result.stdout)\n",
" if result.stdout.strip():\n",
" print(\" Recent warning/error events:\")\n",
" for line in result.stdout.split(\"\\n\")[-5:]:\n",
" if line.strip():\n",
" print(f\" {line}\")\n",
"\n",
"# 3. Check worker pod logs for Blob Storage errors\n",
"print(\"\\n3. Checking worker pod logs for Blob Storage errors...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
" result = run(\n",
" [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
" with open(logs_file, \"w\") as f:\n",
" f.write(result.stdout)\n",
" \n",
" # Look for Blob Storage-related errors\n",
" error_keywords = [\"blob\", \"blob\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"NOAUTH\", \"retry\"]\n",
" error_lines = [l for l in result.stdout.split(\"\\n\") \n",
" if any(kw in l.lower() for kw in error_keywords)]\n",
" \n",
" if error_lines:\n",
" print(\" Found Blob Storage-related errors:\")\n",
" for line in error_lines[-5:]:\n",
" print(f\" {line}\")\n",
" else:\n",
" print(\" No obvious Blob Storage errors in recent logs\")\n",
"else:\n",
" warn(\"Could not find worker pod\")\n",
"\n",
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
"\n",
"**This is critical - Support will ask for this bundle.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
"\n",
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
"\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    script_path.chmod(0o755)\n",
"\n",
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed\")\n",
"\n",
"        # Find and move tarball.\n",
"        # Path.suffix is only \".gz\" for \"name.tar.gz\", so match on the full file name.\n",
"        for file in incident_dir.parent.iterdir():\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                target_path = incident_dir / file.name\n",
"                file.rename(target_path)\n",
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
"                break\n",
"    else:\n",
"        warn(\"Diagnostics script had errors (check output above)\")\n",
"\n",
"except Exception as e:\n",
"    warn(f\"Could not run diagnostics script: {e}\")\n",
"    print(\" 💡 You can run it manually:\")\n",
"    print(f\" curl -O {script_url}\")\n",
"    print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Do the Drill - Step 5: Guided Triage\n",
"\n",
"**Where to look first for Blob Storage issues:**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Guided Triage Steps\\n\")\n",
"\n",
"print(\"1. Check worker pod logs for Blob Storage connection errors:\")\n",
"print(f\" kubectl logs -n {namespace} <worker-pod-name> | grep -i 'blob\\\\|bucket\\\\|connection'\")\n",
"print()\n",
"\n",
"print(\"2. Verify secret exists and has correct keys:\")\n",
"print(f\" kubectl describe secret {blob_secret_name} -n {namespace}\")\n",
"print(\" (Lists keys and sizes only. Avoid sharing -o yaml output: base64 is encoding, not encryption)\")\n",
"print()\n",
"\n",
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
"print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
"print()\n",
"\n",
"print(\"4. Test Blob Storage connectivity from a pod (if possible):\")\n",
"print(\" kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \\\\\")\n",
"print(\" curl -sv https://<blob-endpoint>\")\n",
"print()\n",
"\n",
"print(\"5. Check events for connection/authentication errors:\")\n",
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
"print()\n",
"\n",
"# Check what we can automatically\n",
"print(\"\\n### Automatic Checks\\n\")\n",
"\n",
"# Check secret still exists\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" ok(f\"Secret '{blob_secret_name}' still exists\")\n",
" secret_data = json.loads(result.stdout)\n",
" keys = list(secret_data.get(\"data\", {}).keys())\n",
" print(f\" Secret keys: {', '.join(keys)}\")\n",
"else:\n",
" warn(f\"Secret '{blob_secret_name}' not found!\")\n",
"\n",
"# Check for pods with Blob Storage connection env vars\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" blob_related_pods = []\n",
" for pod in pods.get(\"items\", []):\n",
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
" for container in containers:\n",
" env = container.get(\"env\", [])\n",
" blob_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
" for kw in [\"REDIS\", \"BLOB\"])]\n",
" if blob_env:\n",
" blob_related_pods.append(name)\n",
" break\n",
" \n",
" if blob_related_pods:\n",
" print(f\"\\n Pods with Blob Storage environment variables:\")\n",
" for pod_name in set(blob_related_pods):\n",
" print(f\" - {pod_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Do the Drill - Step 6: Remediation\n",
"\n",
"**Restore the original secret to fix the issue.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REMEDIATION: Restore original secret\n",
"# UNCOMMENT TO RESTORE\n",
"\n",
"# if backup_file.exists():\n",
"# print(f\"Restoring original secret from: {backup_file.name}\")\n",
"# result = run(\n",
"# [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
"# check=True,\n",
"# stream=True\n",
"# )\n",
"# \n",
"# ok(\"Original secret restored\")\n",
"# print(\"\\n💡 Pods will restart to pick up the correct secret.\")\n",
"# print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
"# \n",
"# import time\n",
"# print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"# time.sleep(60)\n",
"# else:\n",
"# warn(f\"Backup file not found: {backup_file}\")\n",
"# print(\" 💡 You may need to manually restore the secret\")\n",
"\n",
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
"if 'backup_file' in locals() and backup_file:\n",
" print(f\" Backup file: {backup_file.name}\")\n",
"else:\n",
" print(\" 💡 If you modified Helm values, restore them manually\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
"\n",
"**Verify that everything is working again.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Verifying Recovery\\n\")\n",
"\n",
"# Check pod status\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" pods = json.loads(result.stdout)\n",
" running = sum(1 for p in pods.get(\"items\", [])\n",
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
" total = len(pods.get(\"items\", []))\n",
" \n",
" if running == total and total > 0:\n",
" ok(f\"All {total} pod(s) are running\")\n",
" else:\n",
" warn(f\"Only {running}/{total} pod(s) running\")\n",
" print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in worker logs\n",
"print(\"\\nChecking for recent errors in worker logs...\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
" result = run(\n",
" [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" error_keywords = [\"error\", \"fail\", \"blob\", \"blob\", \"connection\"]\n",
" recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
" if any(kw in l.lower() for kw in error_keywords)]\n",
" \n",
" if recent_errors:\n",
" warn(\"Still seeing some errors in logs:\")\n",
" for line in recent_errors[-3:]:\n",
" print(f\" {line}\")\n",
" else:\n",
" ok(\"No recent errors in worker logs\")\n",
"\n",
"ok(\"Recovery verification complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. What Support Will Ask For\n",
"\n",
"**When escalating a Blob Storage issue, Support will need:**\n",
"\n",
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **Blob Storage connection details:**\n",
" - Host/endpoint (redacted)\n",
" - Port\n",
" - Password (redacted)\n",
" - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
" - Full error text (not just \"connection failed\")\n",
" - Timestamps of first occurrence\n",
" - Retry patterns\n",
"4. **Recent changes:**\n",
" - Secret rotations\n",
" - Network policy changes\n",
" - Blob Storage configuration changes\n",
"5. **Queue status (if accessible):**\n",
" - Queue length\n",
" - Worker processing rate\n",
" - Backlog growth rate\n",
"6. **Blob Storage health (if accessible):**\n",
" - Blob Storage version\n",
" - Memory usage\n",
" - Connection count\n",
" - Slow queries\n",
"\n",
"**Evidence collected in this lab:**\n",
"- ✅ Diagnostics bundle\n",
"- ✅ Worker pod logs with Blob Storage errors\n",
"- ✅ Events showing failures\n",
"- ✅ Secret configuration (structure, not values)\n",
"\n",
"**Additional evidence to gather (if escalating):**\n",
"- Blob Storage endpoint connectivity test\n",
"- Queue metrics (if available)\n",
"- Blob Storage logs (if accessible via cloud provider)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Lessons Learned\n",
"\n",
"**Key takeaways from this lab:**\n",
"\n",
"1. **Blob Storage failures can be intermittent** - Connection pool retries may mask issues\n",
"2. **Worker logs are critical** - Blob Storage errors appear in worker pod logs\n",
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
"\n",
"**Common mistakes to avoid:**\n",
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
"- ❌ Not checking worker logs (API logs may not show Blob Storage errors)\n",
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
"- ❌ Not testing Blob Storage connectivity independently\n",
"\n",
"**Next steps:**\n",
"- Practice with other failure injection methods (Level 2)\n",
"- Try the ClickHouse or Blob Storage failure labs\n",
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
+37
View File
@@ -0,0 +1,37 @@
# Module 4: Troubleshooting & Incident Response
This directory contains notebooks for Module 4 of the LangSmith Self-Hosted Operator workshop.
## Notebooks
### Setup & Baseline
- **`../shared/00_setup_or_resume_environment.ipynb`** - Validates environment is ready (shared across modules 2, 3, 4)
- **`01_diagnostics_baseline.ipynb`** - Captures baseline diagnostics (run this first!)
### Failure Labs
- **`10_failure_lab_postgres.ipynb`** - PostgreSQL connectivity failure debugging
- **`20_failure_lab_redis.ipynb`** - Redis connectivity failure debugging
- **`30_failure_lab_clickhouse.ipynb`** - ClickHouse connectivity failure debugging
- **`40_failure_lab_blob_storage.ipynb`** - Blob storage configuration failure debugging
### Advanced
- **`90_full_incident_drill.ipynb`** - Complete incident simulation (optional)
## Workflow
1. Run `../shared/00_setup_or_resume_environment.ipynb` to verify your environment
2. Run `01_diagnostics_baseline.ipynb` to capture baseline
3. Run failure labs in order (10, 20, 30, 40) or pick specific ones
4. Optionally run `90_full_incident_drill.ipynb` for complete practice
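The numeric filename prefixes encode the order described above. A minimal sketch, assuming the notebook filenames listed in this README; `lab_order` is a hypothetical helper for illustration and is not part of the repository:

```python
import re


def lab_order(filenames):
    """Sort notebook filenames by their leading numeric prefix (01, 10, 20, ...)."""
    def prefix(name):
        match = re.match(r"(\d+)_", name)
        # Files without a numeric prefix sort last
        return int(match.group(1)) if match else float("inf")
    return sorted(filenames, key=prefix)


notebooks = [
    "90_full_incident_drill.ipynb",
    "20_failure_lab_redis.ipynb",
    "01_diagnostics_baseline.ipynb",
    "40_failure_lab_blob_storage.ipynb",
    "10_failure_lab_postgres.ipynb",
    "30_failure_lab_clickhouse.ipynb",
]

# Baseline comes first, failure labs follow, the full drill comes last
for name in lab_order(notebooks):
    print(name)
```

Sorting on the parsed integer rather than the raw string keeps the order stable even if a prefix is ever written without zero padding.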
## Important Notes
- **Always run baseline first** - You need "before" to compare to "after"
- **Failure injections are reversible** - All labs include remediation steps
- **Don't skip diagnostics collection** - Support will ask for the canonical bundle
- **Practice in test environments only** - These labs modify your deployment
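The shared setup notebook enforces the test-environment rule through the `MODULE4_SAFE_ENVIRONMENT` flag. The sketch below mirrors that check so it could be reused as a guard at the top of a lab; the function names are hypothetical, and the whitespace stripping is an addition that the setup notebook itself does not perform:

```python
import os

# Values the setup notebook accepts as an explicit acknowledgement
TRUTHY = {"true", "yes", "1"}


def module4_safe(env=None):
    """Return True only when MODULE4_SAFE_ENVIRONMENT is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("MODULE4_SAFE_ENVIRONMENT", "").strip().lower() in TRUTHY


def require_test_environment(env=None):
    """Raise before any failure injection if the flag is missing or falsy."""
    if not module4_safe(env):
        raise RuntimeError(
            "MODULE4_SAFE_ENVIRONMENT not set. This is required for Module 4 failure labs."
        )


# Example with an explicit mapping instead of the real process environment
print(module4_safe({"MODULE4_SAFE_ENVIRONMENT": "true"}))
print(module4_safe({}))
```

Failing closed (an unset flag counts as unsafe) means a lab started against the wrong cluster stops before it touches any secret.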
## Documentation
See `docs/modules/module-4.md` for complete module documentation.
@@ -0,0 +1,614 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup or Resume Environment\n",
"\n",
"## Overview\n",
"\n",
"This notebook helps you prepare for workshop modules 2 through 4 (Identity & Auth, Production Operations, or Troubleshooting). It validates that your LangSmith environment is running and accessible, or guides you to deploy it using Module 1.\n",
"\n",
"### About This Notebook\n",
"This notebook is **READ-ONLY** and safe to run. It performs validation checks only and does not modify any resources. \n",
"\n",
"### Module-Specific Notes\n",
"- **Modules 2 and 3** are read-only validation notebooks, perfect for understanding your current configuration\n",
"- **Module 4** includes hands-on failure labs that intentionally modify secrets to teach troubleshooting—these require a test environment\n",
"- Module-specific guidance is provided below to help you understand what to expect\n",
"\n",
"### Prerequisites\n",
"- Module 1 notebooks available (for deployment if needed)\n",
"- `kubectl` configured (if environment exists)\n",
"\n",
"### What This Notebook Does\n",
"1. Checks if LangSmith is already deployed\n",
"2. If not, provides links to Module 1 deployment notebooks\n",
"3. If yes, validates the environment is healthy and reachable\n",
"4. **Verifies you're in the correct environment** (shows account/region)\n",
"5. Shows module-specific safety warnings\n",
"\n",
"**Estimated time:** 10-15 minutes\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bootstrap environment\n",
"import sys\n",
"import os\n",
"from pathlib import Path\n",
"\n",
"# Add notebooks directory to path so we can import shared as a package\n",
"possible_paths = [\n",
" Path.cwd().parent, # If cwd is a module directory, go up one level to notebooks\n",
" Path.cwd(), # If cwd is already notebooks\n",
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
"]\n",
"\n",
"notebooks_path = None\n",
"for path in possible_paths:\n",
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
" notebooks_path = path\n",
" break\n",
"\n",
"if not notebooks_path:\n",
" notebooks_path = Path.cwd() / \"notebooks\"\n",
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
"\n",
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
"if str(notebooks_path) not in sys.path:\n",
" sys.path.insert(0, str(notebooks_path))\n",
"\n",
"from shared._bootstrap import bootstrap\n",
"\n",
"# Run bootstrap\n",
"bootstrap_info = bootstrap()\n",
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n",
"\n",
"# Detect which module is using this notebook\n",
"# Check current working directory or environment variable\n",
"current_module = None\n",
"cwd_str = str(Path.cwd())\n",
"if \"module-2\" in cwd_str:\n",
" current_module = \"2\"\n",
"elif \"module-3\" in cwd_str:\n",
" current_module = \"3\"\n",
"elif \"module-4\" in cwd_str:\n",
" current_module = \"4\"\n",
"else:\n",
" # Try environment variable\n",
" current_module = os.environ.get(\"CURRENT_MODULE\", \"\")\n",
" if not current_module:\n",
" # Default: assume generic use\n",
" current_module = None\n",
"\n",
"print(f\"\\nDetected module context: {current_module if current_module else 'Generic'}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment Safety Verification\n",
"\n",
"**Before proceeding, verify you're working with the correct environment.**\n",
"\n",
"**Module-specific notes:**\n",
"- **Module 2 (Identity & Auth):** Read-only validation - safe for production\n",
"- **Module 3 (Production Operations):** Read-only validation - safe for production\n",
"- **Module 4 (Troubleshooting):** Includes failure labs that modify secrets - **TEST environment ONLY**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Environment Safety Check: Verify environment and show module-specific warnings\n",
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
"from shared._validation import ok, warn, fail\n",
"\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"identity = get_identity()\n",
"\n",
"print(\"=\" * 70)\n",
"print(\"⚠️ ENVIRONMENT SAFETY CHECK\")\n",
"print(\"=\" * 70)\n",
"\n",
"# Show environment details prominently\n",
"provider_display = provider.upper()\n",
"print(f\"\\n### Current Environment Configuration\")\n",
"print(f\"Cloud Provider: {provider_display}\")\n",
"print(f\"Region: {region}\")\n",
"\n",
"if provider == \"aws\":\n",
" account_id = identity.get('Account', 'N/A')\n",
" user_arn = identity.get('Arn', 'N/A')\n",
" print(f\"Account ID: {account_id}\")\n",
" print(f\"User ARN: {user_arn}\")\n",
"elif provider == \"azure\":\n",
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
" subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
" print(f\"Subscription ID: {subscription_id}\")\n",
" print(f\"Subscription Name: {subscription_name}\")\n",
"\n",
"# Show all relevant environment variables\n",
"print(f\"\\n### Environment Variables (for verification)\")\n",
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
"\n",
"# Module-specific safety checks\n",
"if current_module == \"4\":\n",
" # Module 4: Failure labs require TEST environment\n",
" print(\"\\n\" + \"=\" * 70)\n",
" print(\"⚠️ CRITICAL: Module 4 Failure Labs Will Modify Your Environment\")\n",
" print(\"=\" * 70)\n",
" print(\"\\nThe failure labs in Module 4 will:\")\n",
" print(\" ❌ Modify Kubernetes secrets (break passwords/credentials)\")\n",
" print(\" ❌ Cause service disruptions (API failures, login failures)\")\n",
" print(\" ❌ Require remediation to restore functionality\")\n",
" print(\"\\nThis is INTENTIONAL for learning troubleshooting, but:\")\n",
" print(\" ⚠️ ONLY run in TEST/NON-PRODUCTION environments\")\n",
" print(\" ⚠️ DO NOT run against production systems\")\n",
" print(\" ⚠️ Ensure you can restore the environment after labs\")\n",
" print(\"\\n\" + \"=\" * 70)\n",
" \n",
" # Require explicit confirmation for Module 4\n",
" print(\"\\n### Environment Verification Required for Module 4\")\n",
" print(\"\\nPlease confirm:\")\n",
" print(\" 1. ✅ This is a TEST/NON-PRODUCTION environment\")\n",
" print(\" 2. ✅ You understand failure labs will modify secrets\")\n",
" print(\" 3. ✅ You have a way to restore the environment (backup/teardown)\")\n",
" print(\" 4. ✅ You will NOT run these labs against production\")\n",
" \n",
" # Check if user has explicitly acknowledged\n",
" module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
" if module4_safe in [\"true\", \"yes\", \"1\"]:\n",
" ok(\"MODULE4_SAFE_ENVIRONMENT flag is set - proceeding\")\n",
" print(\"\\n✅ Safety check passed - environment marked as safe for Module 4\")\n",
" else:\n",
" fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
" print(\"\\n❌ SAFETY CHECK FAILED\")\n",
" print(\"\\nTo proceed with Module 4 failure labs, you MUST:\")\n",
" print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
" print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
" print(\" 3. Re-run this cell to confirm\")\n",
" print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
" raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. This is required for Module 4 failure labs.\")\n",
" \n",
" print(\"\\n\" + \"=\" * 70)\n",
" print(\"✅ Environment verified as safe for Module 4 failure labs\")\n",
" print(\"=\" * 70)\n",
"elif current_module in [\"2\", \"3\"]:\n",
" # Modules 2 and 3: Read-only, safe for production\n",
" print(\"\\n\" + \"=\" * 70)\n",
" print(f\"✅ Module {current_module} is READ-ONLY\")\n",
" print(\"=\" * 70)\n",
" if current_module == \"2\":\n",
" print(\"\\nModule 2 (Identity & Auth) notebooks:\")\n",
" print(\" ✅ Perform read-only validation checks\")\n",
" print(\" ✅ Do NOT modify any infrastructure or secrets\")\n",
" print(\" ✅ Safe to run against production environments\")\n",
" elif current_module == \"3\":\n",
" print(\"\\nModule 3 (Production Operations) notebooks:\")\n",
" print(\" ✅ Perform read-only validation and signal checks\")\n",
" print(\" ✅ Do NOT modify any infrastructure or resources\")\n",
" print(\" ✅ Safe to run against production environments\")\n",
" print(\"\\n\" + \"=\" * 70)\n",
" ok(\"Environment check complete - safe to proceed with read-only validation\")\n",
"else:\n",
" # Generic use - show general warning\n",
" print(\"\\n\" + \"=\" * 70)\n",
" print(\"⚠️ MODULE CONTEXT NOT DETECTED\")\n",
" print(\"=\" * 70)\n",
" print(\"\\nThis notebook is used by multiple modules:\")\n",
" print(\" - Module 2: Read-only validation (safe for production)\")\n",
" print(\" - Module 3: Read-only validation (safe for production)\")\n",
" print(\" - Module 4: Failure labs (TEST environment ONLY)\")\n",
" print(\"\\n💡 If using Module 4, ensure MODULE4_SAFE_ENVIRONMENT=true is set\")\n",
" print(\"\\n\" + \"=\" * 70)\n",
" ok(\"Environment check complete\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Configuration\n",
"\n",
"Load and validate configuration from environment variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from shared._validation import require_env, ok, warn\n",
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
"\n",
"# Required configuration\n",
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
"\n",
"print(\"### Loading Configuration\\n\")\n",
"\n",
"config = {}\n",
"missing = []\n",
"\n",
"for var in required_vars:\n",
" value = os.environ.get(var, \"\").strip()\n",
" if not value:\n",
" missing.append(var)\n",
" config[var] = value\n",
"\n",
"if missing:\n",
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
" f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
"\n",
"# Optional but recommended\n",
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
"config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
"\n",
"# Show cloud provider info\n",
"provider = get_cloud_provider()\n",
"region = get_region()\n",
"\n",
"print(f\"Cloud Provider: {provider.upper()}\")\n",
"print(f\"Region: {region}\")\n",
"print(f\"Namespace: {config['NAMESPACE']}\")\n",
"print(f\"Cluster: {config['CLUSTER_NAME']}\")\n",
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
"\n",
"if config[\"LANGSMITH_DOMAIN\"]:\n",
" print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
"\n",
"ok(\"Configuration loaded\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Check if Environment Exists\n",
"\n",
"We'll check if LangSmith is already deployed. If not, we'll provide instructions to deploy using Module 1.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shared._shell import run\n",
"from shared._cloud_helpers import cluster_exists, configure_kubectl, get_kubernetes_service_name\n",
"\n",
"namespace = config[\"NAMESPACE\"]\n",
"cluster_name = config[\"CLUSTER_NAME\"]\n",
"k8s_service = get_kubernetes_service_name()\n",
"\n",
"print(f\"### Checking {k8s_service} Cluster\\n\")\n",
"\n",
"# Check if cluster exists\n",
"if cluster_exists(cluster_name):\n",
" ok(f\"Cluster '{cluster_name}' exists\")\n",
" \n",
" # Configure kubectl\n",
" print(f\"\\n### Configuring kubectl\\n\")\n",
" try:\n",
" configure_kubectl(cluster_name, region)\n",
" ok(\"kubectl configured\")\n",
" except Exception as e:\n",
" warn(f\"Could not configure kubectl: {e}\")\n",
" print(\"💡 Make sure you have proper cloud provider credentials\")\n",
" raise\n",
"else:\n",
" warn(f\"Cluster '{cluster_name}' not found\")\n",
" print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
" print(\" See the 'Deploy Environment' section below.\")\n",
" raise RuntimeError(\"Cluster not found. Deploy using Module 1 first.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Verify Namespace and Helm Release\n",
"\n",
"Check that the LangSmith namespace exists and Helm release is installed.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"helm_release = config[\"HELM_RELEASE\"]\n",
"\n",
"print(\"### Checking Namespace\\n\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" ok(f\"Namespace '{namespace}' exists\")\n",
"else:\n",
" warn(f\"Namespace '{namespace}' not found\")\n",
" print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
" print(\" See the 'Deploy Environment' section below.\")\n",
" raise RuntimeError(\"Namespace not found. Deploy using Module 1 first.\")\n",
"\n",
"print(\"\\n### Checking Helm Release\\n\")\n",
"result = run(\n",
" [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" releases = json.loads(result.stdout)\n",
" release_found = any(r.get(\"name\") == helm_release for r in releases)\n",
" \n",
" if release_found:\n",
" ok(f\"Helm release '{helm_release}' found\")\n",
" # Get release info\n",
" result = run(\n",
" [\"helm\", \"status\", helm_release, \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
" )\n",
" if result.returncode == 0:\n",
" release_info = json.loads(result.stdout)\n",
" print(f\" Status: {release_info.get('info', {}).get('status', 'unknown')}\")\n",
" print(f\" Chart: {release_info.get('chart', {}).get('metadata', {}).get('name', 'unknown')}\")\n",
" print(f\" Version: {release_info.get('chart', {}).get('metadata', {}).get('version', 'unknown')}\")\n",
" else:\n",
" warn(f\"Helm release '{helm_release}' not found in namespace '{namespace}'\")\n",
" print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
" print(\" See the 'Deploy Environment' section below.\")\n",
" raise RuntimeError(\"Helm release not found. Deploy using Module 1 first.\")\n",
"else:\n",
" warn(\"Could not list Helm releases\")\n",
" print(\"💡 Make sure Helm is installed and kubectl is configured correctly\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Verify Ingress Endpoint\n",
"\n",
"Check that the LangSmith ingress is configured and reachable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from urllib.parse import urlparse\n",
"\n",
"print(\"### Checking Ingress\\n\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"ingress_found = False\n",
"ingress_host = None\n",
"\n",
"if result.returncode == 0:\n",
"    ingresses = json.loads(result.stdout)\n",
"    for ingress in ingresses.get(\"items\", []):\n",
"        rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
"        for rule in rules:\n",
"            host = rule.get(\"host\", \"\")\n",
"            if host:\n",
"                ingress_found = True\n",
"                ingress_host = host\n",
"                print(f\" Found ingress with host: {host}\")\n",
"                break\n",
"        if ingress_found:\n",
"            # Stop at the first ingress that defines a host\n",
"            break\n",
"\n",
"if not ingress_found:\n",
"    warn(\"No ingress found\")\n",
"    print(\"💡 Ingress may still be provisioning. Check Module 1 validation notebook.\")\n",
"else:\n",
"    ok(f\"Ingress configured with host: {ingress_host}\")\n",
"    \n",
"    # Try to reach the endpoint\n",
"    if config[\"LANGSMITH_DOMAIN\"]:\n",
"        test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
"    elif ingress_host:\n",
"        test_url = f\"https://{ingress_host}\"\n",
"    else:\n",
"        test_url = None\n",
"    \n",
"    if test_url:\n",
"        print(f\"\\n### Testing Endpoint Reachability\\n\")\n",
"        print(f\"Testing: {test_url}\")\n",
"        try:\n",
"            # Allow redirects, don't verify SSL (may be self-signed)\n",
"            response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
"            if response.status_code in [200, 302, 401, 403]:\n",
"                ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
"            else:\n",
"                warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
"        except requests.exceptions.SSLError:\n",
"            # SSL error is OK if using self-signed certs\n",
"            warn(\"SSL verification failed (may be self-signed certificate)\")\n",
"            print(\"💡 This is OK for testing. In production, use proper TLS certificates.\")\n",
"        except requests.exceptions.RequestException as e:\n",
"            warn(f\"Could not reach endpoint: {e}\")\n",
"            print(\"💡 Ingress may still be provisioning. Wait a few minutes and try again.\")\n",
"    else:\n",
"        warn(\"No domain configured for testing\")\n",
"        print(\"💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Quick Health Check\n",
"\n",
"Verify that key deployments are running.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"### Checking Key Deployments\\n\")\n",
"result = run(\n",
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
" check=False,\n",
" stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
" deployments = json.loads(result.stdout)\n",
" deployment_items = deployments.get(\"items\", [])\n",
" \n",
" if deployment_items:\n",
" ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
" print(\"\\nDeployment Status:\")\n",
" for deployment in deployment_items:\n",
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
" spec_replicas = deployment.get(\"spec\", {}).get(\"replicas\", 0)\n",
" status_replicas = deployment.get(\"status\", {}).get(\"replicas\", 0)\n",
" ready_replicas = deployment.get(\"status\", {}).get(\"readyReplicas\", 0)\n",
" available_replicas = deployment.get(\"status\", {}).get(\"availableReplicas\", 0)\n",
" \n",
" status_icon = \"✅\" if ready_replicas == spec_replicas and available_replicas == spec_replicas else \"⚠️\"\n",
" print(f\" {status_icon} {name}: {ready_replicas}/{spec_replicas} ready, {available_replicas}/{spec_replicas} available\")\n",
" else:\n",
" warn(\"No deployments found\")\n",
" print(\"💡 LangSmith may not be fully deployed. Check Module 1 validation notebook.\")\n",
"else:\n",
" warn(\"Could not list deployments\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ✅ Environment Ready\n",
"\n",
"Your LangSmith environment is running and accessible. You're ready to proceed with your module.\n",
"\n",
"**Next Steps (Module-Specific):**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show module-specific next steps\n",
"print(\"### Module-Specific Next Steps\\n\")\n",
"\n",
"if current_module == \"2\":\n",
" print(\"**Module 2: Identity & Auth**\\n\")\n",
" print(\"1. Run `01_sso_oidc_validation.ipynb` to validate OIDC SSO configuration\")\n",
" print(\"2. (Optional) Run `02_sso_saml_validation.ipynb` if using SAML\")\n",
" print(\"\\n💡 These notebooks are read-only and safe for production use.\")\n",
"elif current_module == \"3\":\n",
" print(\"**Module 3: Production Operations & Scaling**\\n\")\n",
" print(\"1. Run `01_ops_sanity_checks.ipynb` to validate production readiness\")\n",
" print(\"2. Review production readiness checklist: `docs/shared/production_readiness_checklist.md`\")\n",
" print(\"3. Review signals and thresholds: `docs/shared/ops_signals_and_thresholds.md`\")\n",
" print(\"\\n💡 This notebook is read-only and safe for production use.\")\n",
"elif current_module == \"4\":\n",
" print(\"**Module 4: Troubleshooting & Incident Response**\\n\")\n",
" print(\"1. Run `01_diagnostics_baseline.ipynb` to capture a baseline snapshot\")\n",
" print(\"2. Proceed with failure labs (10, 20, 30, 40)\")\n",
" print(\"3. Optionally run `90_full_incident_drill.ipynb` for a complete incident simulation\")\n",
" print(\"\\n⚠️ REMINDER: Module 4 failure labs modify secrets and cause disruptions.\")\n",
" print(\" Only run in TEST/NON-PRODUCTION environments.\")\n",
"else:\n",
" print(\"**Generic Use**\\n\")\n",
" print(\"This notebook can be used by:\")\n",
" print(\" - Module 2: Identity & Auth validation (read-only)\")\n",
" print(\" - Module 3: Production Operations checks (read-only)\")\n",
" print(\" - Module 4: Troubleshooting failure labs (modifies environment)\")\n",
" print(\"\\n💡 Navigate to the appropriate module directory and run this notebook from there.\")\n",
"\n",
"print(\"\\n\" + \"=\" * 70)\n",
"print(\"📝 Important Reminder\")\n",
"print(\"=\" * 70)\n",
"print(\"\\n**When finished with workshop modules, run Module 1's `99_teardown.ipynb`\")\n",
"print(\"to delete the environment and avoid ongoing cloud costs.**\")\n",
"print(\"\\nThe teardown notebook will:\")\n",
"print(\" - Remove Helm release\")\n",
"print(\" - Destroy Terraform-managed infrastructure (Kubernetes cluster, database, cache, blob storage, etc.)\")\n",
"print(\" - Clean up any remaining resources\")\n",
"print(\"\\n**Location:** `../module-1/99_teardown.ipynb`\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🚀 Deploy Environment (If Not Already Deployed)\n",
"\n",
"If your environment is not running, follow these steps to deploy LangSmith using Module 1:\n",
"\n",
"### Step 1: Preflight Checks\n",
"Run `../module-1/01_preflight.ipynb` to validate your environment.\n",
"\n",
"### Step 2: Provision Infrastructure\n",
"Run `../module-1/02_terraform_apply.ipynb` to deploy cloud infrastructure (Kubernetes cluster, database, cache, blob storage).\n",
"\n",
"### Step 3: Install LangSmith\n",
"Run `../module-1/03_helm_install_langsmith.ipynb` to install LangSmith using Helm.\n",
"\n",
"### Step 4: Validate Deployment\n",
"Run `../module-1/04_validate_ingress_and_ui.ipynb` to verify everything is working.\n",
"\n",
"### Step 5: Return Here\n",
"Once deployment is complete, return to this notebook and re-run the cells above to verify your environment is ready.\n",
"\n",
"---\n",
"\n",
"**Note:** If you encounter errors during deployment, refer to Module 1 documentation and troubleshooting guides.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
+4
@@ -1,5 +1,6 @@
from __future__ import annotations
import os
from datetime import date
def ok(msg: str) -> None:
    print(f"{msg}")
@@ -18,6 +19,9 @@ def require_env(*keys: str) -> dict:
        if not v:
            missing.append(k)
        cfg[k] = v
        if k == 'CLUSTER_NAME':
            # Prefix the cluster name with a fixed string and today's date (YYYYMMDD)
            cfg[k] = f"langsmith-workshop-{date.today().strftime('%Y%m%d')}-{v}"
    if missing:
        fail(f"Missing required environment variables: {', '.join(missing)}")
    return cfg
+11
@@ -0,0 +1,11 @@
# Test artifacts
artifacts/
*.pyc
__pycache__/
.pytest_cache/
.coverage
htmlcov/
# Notebook execution outputs
*.ipynb_checkpoints
+127
@@ -0,0 +1,127 @@
# Tests for LangSmith Self-Hosted Workshops
This directory contains tests for validating notebook execution and syntax.
## Test Structure
- `conftest.py`: Pytest configuration and fixtures
- `test_notebook_execution.py`: Notebook execution tests
- `requirements.txt`: Test dependencies
- `artifacts/`: Directory for test artifacts (created automatically)
## Running Tests Locally
### Prerequisites
```bash
# Install test dependencies
pip install -r tests/requirements.txt
# Install system dependencies (if needed)
# macOS: brew install jq
# Ubuntu: sudo apt-get install jq
```
### Run All Tests
```bash
# Run syntax tests only (fast, no infrastructure required)
CI_SKIP_EXECUTION=true pytest tests/ -v
# Run full execution tests (requires infrastructure)
pytest tests/ -v
```
### Run Specific Test Suites
```bash
# Test Module 1 notebooks
pytest tests/test_notebook_execution.py::TestModule1Notebooks -v
# Test Module 2 notebooks
pytest tests/test_notebook_execution.py::TestModule2Notebooks -v
```
### Run Individual Notebook Tests
```bash
# Test specific notebook syntax
pytest tests/test_notebook_execution.py::TestModule1Notebooks::test_module1_notebook_syntax -v
```
## CI/CD Integration
Tests run automatically on:
- Pull requests to `main`/`master`
- Pushes to `main`/`master`
- Manual workflow dispatch
### GitHub Actions Workflow
The `.github/workflows/test-notebooks.yml` workflow:
1. **Test Notebook Syntax**: Validates JSON structure and code cells
2. **Test Module 1 Preflight**: Validates preflight notebook structure
3. **Test Module 2 Syntax**: Validates auth validation notebooks
4. **Lint Python Code**: Runs flake8 and black checks
### Environment Variables
The workflow uses test environment variables. For full execution tests, set:
```yaml
# In GitHub Actions secrets/variables
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_REGION
CLUSTER_NAME
NAMESPACE
# ... etc
```
## Test Strategy
### Syntax Tests (Always Run)
- Validate notebook JSON structure
- Check for code cells
- Verify imports can be resolved
- No infrastructure required
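
The checks listed above can be sketched roughly as follows. This is an illustration only: the helper name `check_notebook_syntax` is hypothetical and does not exist in `test_notebook_execution.py`. It parses each code cell with `ast.parse` and drops lines starting with `%` or `!`, on the assumption that notebooks may contain IPython magics or shell escapes that are not valid Python.

```python
import ast
import json
from pathlib import Path


def check_notebook_syntax(notebook_path: Path) -> list[str]:
    """Return a list of problems found in the notebook (empty if none).

    Hypothetical helper for illustration; not part of the test suite.
    """
    problems = []
    with open(notebook_path, "r", encoding="utf-8") as f:
        nb = json.load(f)  # raises json.JSONDecodeError on invalid JSON

    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    if not code_cells:
        problems.append("notebook has no code cells")

    for index, cell in enumerate(code_cells):
        source = cell.get("source", [])
        # nbformat stores source as a list of lines or as a single string
        text = "".join(source) if isinstance(source, list) else source
        # Drop IPython magics and shell escapes, which ast cannot parse
        lines = [
            line for line in text.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            ast.parse("\n".join(lines))
        except SyntaxError as exc:
            problems.append(f"code cell {index}: {exc.msg} (line {exc.lineno})")
    return problems
```

Parsing catches syntax errors without running anything, so it stays safe to run without a cluster. It does not prove that imports resolve at runtime.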
### Execution Tests (Conditional)
- Full notebook execution
- Requires actual infrastructure (cluster, IdP, etc.)
- Skipped in CI by default (`CI_SKIP_EXECUTION=true`)
- Can be enabled for integration testing environments
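
The skip condition can be expressed as a small predicate. This is a sketch under the assumption that it mirrors the `skipif` markers in `test_notebook_execution.py`, which compare the variable to the exact string `"true"`; the function name `execution_enabled` is hypothetical.

```python
import os


def execution_enabled(env=None) -> bool:
    """Return True when full notebook execution tests should run.

    Execution is skipped only when CI_SKIP_EXECUTION is exactly "true";
    values such as "True" or "1" do not trigger the skip.
    """
    env = os.environ if env is None else env
    return env.get("CI_SKIP_EXECUTION") != "true"
```

Because `pytest.mark.skipif` evaluates a boolean condition when the test module is imported, the variable has to be set before `pytest` starts, as in the `CI_SKIP_EXECUTION=true pytest tests/ -v` command shown earlier.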
## Adding New Tests
1. Add notebook to appropriate test class in `test_notebook_execution.py`
2. Update the `pytest.mark.parametrize` decorator with the notebook name
3. Add any required environment variables to `conftest.py`
4. Update GitHub Actions workflow if needed
## Troubleshooting
### Import Errors
If tests fail with import errors:
- Ensure `notebooks/` (the parent of `shared/`) is on the Python path, so that `from shared._validation import ...` resolves
- Check that `conftest.py` is setting up paths correctly
- Verify all required packages are in `requirements.txt`
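
A quick way to diagnose a path problem is to check whether a package can be located once its parent directory is on `sys.path`. The sketch below uses hypothetical helper names (`ensure_on_path`, `can_import`) and only standard-library calls.

```python
import importlib.util
import sys
from pathlib import Path


def ensure_on_path(directory: Path) -> bool:
    """Put `directory` at the front of sys.path if it is not already listed.

    Returns True if the path was added, False if it was already present.
    """
    entry = str(directory)
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


def can_import(module_name: str) -> bool:
    """Report whether a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None
```

For example, after `ensure_on_path(Path("notebooks"))`, a `can_import("shared")` result of `False` suggests the directory layout or working directory is not what the tests expect.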
### Timeout Errors
If notebook execution times out:
- Increase timeout in `execute_notebook()` function
- Check for infinite loops or long-running operations
- Consider mocking external API calls
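
Two limits apply when a notebook runs through nbconvert: the `subprocess.run` timeout bounds the whole run, and `--ExecutePreprocessor.timeout` bounds each cell. A minimal sketch of deriving the command from a single `timeout` value follows. The helper name `build_nbconvert_command` is hypothetical; the flags are the ones `execute_notebook()` already uses.

```python
import sys
from pathlib import Path


def build_nbconvert_command(notebook_path: Path, timeout: int = 600) -> list[str]:
    """Build an nbconvert command whose per-cell timeout follows `timeout`."""
    return [
        sys.executable, "-m", "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        f"--ExecutePreprocessor.timeout={timeout}",
        "--ExecutePreprocessor.kernel_name=python3",
        str(notebook_path),
    ]
```

Passing the same value to `subprocess.run(..., timeout=timeout)` keeps the two limits consistent. A slightly larger process timeout may be preferable so that nbconvert can report a cell timeout before the process is killed.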
### Environment Variable Issues
If tests fail due to missing env vars:
- Check `conftest.py` for default values
- Verify GitHub Actions workflow sets required variables
- Add variables to test fixtures if needed
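
To see which variables are missing before running the suite, a small check along these lines may help. The function name `missing_env` is hypothetical; it treats blank values as missing, on the assumption that the shared validation helpers strip values the same way.

```python
import os


def missing_env(keys, env=None) -> list[str]:
    """Return the names in `keys` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [k for k in keys if not env.get(k, "").strip()]
```

For example, `missing_env(["NAMESPACE", "CLUSTER_NAME"])` returns an empty list when both are set, and otherwise the names to add to `conftest.py` or the workflow.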
+2
@@ -0,0 +1,2 @@
# Tests for LangSmith Self-Hosted Workshops notebooks
+28
@@ -0,0 +1,28 @@
"""
Pytest configuration and fixtures for notebook testing.
"""
import os
import sys
from pathlib import Path
# Add notebooks directory to path
repo_root = Path(__file__).parent.parent
notebooks_dir = repo_root / "notebooks"
if str(notebooks_dir) not in sys.path:
    sys.path.insert(0, str(notebooks_dir))
# Set test environment variables
os.environ.setdefault("NAMESPACE", "langsmith-test")
os.environ.setdefault("CLUSTER_NAME", "test-cluster")
os.environ.setdefault("HELM_RELEASE", "langsmith")
os.environ.setdefault("ARTIFACTS_DIR", str(repo_root / "tests" / "artifacts"))
# Cloud provider defaults (can be overridden by GitHub Actions)
os.environ.setdefault("CLOUD_PROVIDER", "aws")
os.environ.setdefault("AWS_REGION", "us-west-2")
os.environ.setdefault("AZURE_LOCATION", "eastus")
# Create artifacts directory
artifacts_dir = Path(os.environ["ARTIFACTS_DIR"])
artifacts_dir.mkdir(parents=True, exist_ok=True)
+11
@@ -0,0 +1,11 @@
# Test dependencies for notebook execution
pytest>=7.0.0
jupyter>=1.0.0
nbconvert>=6.0.0
ipykernel>=6.0.0
# Notebook dependencies (should match what notebooks need)
python-dotenv>=1.0.0
pyyaml>=6.0
requests>=2.28.0
+283
@@ -0,0 +1,283 @@
"""
Test notebook execution using nbconvert.
This module executes notebooks and validates they complete without errors.
"""
import json
import os
import subprocess
import sys
from pathlib import Path
import pytest
# Repository root
REPO_ROOT = Path(__file__).parent.parent
NOTEBOOKS_DIR = REPO_ROOT / "notebooks"
def execute_notebook(notebook_path: Path, timeout: int = 600) -> tuple[bool, str]:
    """
    Execute a Jupyter notebook using nbconvert.

    Args:
        notebook_path: Path to the notebook file
        timeout: Maximum execution time in seconds, applied both to each
            cell (ExecutePreprocessor.timeout) and to the nbconvert process

    Returns:
        Tuple of (success: bool, output: str)
    """
    try:
        # Use nbconvert to execute the notebook
        result = subprocess.run(
            [
                sys.executable,
                "-m",
                "jupyter",
                "nbconvert",
                "--to",
                "notebook",
                "--execute",
                "--inplace",
                # Honor the caller's timeout instead of a fixed 600 seconds
                f"--ExecutePreprocessor.timeout={timeout}",
                "--ExecutePreprocessor.kernel_name=python3",
                str(notebook_path),
            ],
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=str(notebook_path.parent),
        )
        if result.returncode == 0:
            return True, result.stdout
        else:
            error_msg = f"STDOUT:\n{result.stdout}\n\nSTDERR:\n{result.stderr}"
            return False, error_msg
    except subprocess.TimeoutExpired:
        return False, f"Notebook execution timed out after {timeout} seconds"
    except Exception as e:
        return False, f"Error executing notebook: {str(e)}"
def get_notebook_cells(notebook_path: Path) -> list:
    """Get all code cells from a notebook."""
    # Notebooks are UTF-8 JSON; be explicit so non-ASCII output text loads everywhere
    with open(notebook_path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    return [cell for cell in nb.get("cells", []) if cell.get("cell_type") == "code"]
class TestNotebookExecution:
    """Base class for notebook execution tests."""

    @pytest.fixture(autouse=True)
    def setup_test_env(self, monkeypatch):
        """Set up test environment variables."""
        # Set minimal required env vars for testing
        test_env = {
            "NAMESPACE": "langsmith-test",
            "CLUSTER_NAME": "test-cluster",
            "HELM_RELEASE": "langsmith",
            "ARTIFACTS_DIR": str(REPO_ROOT / "tests" / "artifacts"),
            "CLOUD_PROVIDER": os.environ.get("CLOUD_PROVIDER", "aws"),
            "AWS_REGION": os.environ.get("AWS_REGION", "us-west-2"),
            "AZURE_LOCATION": os.environ.get("AZURE_LOCATION", "eastus"),
            # Mock values for testing (will fail actual operations but allow syntax checks)
            "LANGSMITH_DOMAIN": "test.langsmith.example.com",
            "OIDC_ISSUER": "https://test-idp.example.com/oauth2/default",
            "OIDC_CLIENT_ID": "test-client-id",
            "OIDC_CLIENT_SECRET": "test-client-secret",
            "OIDC_REDIRECT_URI": "https://test.langsmith.example.com/auth/callback",
        }
        for key, value in test_env.items():
            monkeypatch.setenv(key, value)

    def _validate_notebook_syntax(self, notebook_path: Path):
        """Helper method to validate notebook has valid JSON structure and code cells."""
        assert notebook_path.exists(), f"Notebook not found: {notebook_path}"
        with open(notebook_path, "r") as f:
            nb = json.load(f)
        assert "cells" in nb, "Notebook missing cells"
        assert len(nb["cells"]) > 0, "Notebook has no cells"
        code_cells = [c for c in nb["cells"] if c.get("cell_type") == "code"]
        assert len(code_cells) > 0, "Notebook has no code cells"
# Module 1 tests
class TestModule1Notebooks(TestNotebookExecution):
    """Test Module 1 notebooks."""

    @pytest.mark.parametrize("notebook", [
        "01_preflight.ipynb",
        "99_teardown.ipynb",  # Always test syntax, even if execution is skipped
        # Note: Skip terraform/helm/validation notebooks in CI as they require actual infrastructure
        # "02_terraform_apply.ipynb",
        # "03_helm_install_langsmith.ipynb",
        # "04_validate_ingress_and_ui.ipynb",
    ])
    def test_module1_notebook_syntax(self, notebook):
        """Test Module 1 notebook syntax."""
        notebook_path = NOTEBOOKS_DIR / "module-1" / notebook
        self._validate_notebook_syntax(notebook_path)

    @pytest.mark.skipif(
        os.environ.get("CI_SKIP_EXECUTION") == "true",
        reason="Skipping execution in CI (requires infrastructure)"
    )
    @pytest.mark.parametrize("notebook", [
        "01_preflight.ipynb",
    ])
    def test_module1_notebook_execution(self, notebook):
        """Test Module 1 notebook execution (only if infrastructure available)."""
        notebook_path = NOTEBOOKS_DIR / "module-1" / notebook
        success, output = execute_notebook(notebook_path, timeout=300)
        assert success, f"Notebook execution failed:\n{output}"

    @pytest.mark.skipif(
        os.environ.get("CI_SKIP_EXECUTION") == "true",
        reason="Skipping execution in CI (requires infrastructure)"
    )
    def test_module1_teardown_execution(self):
        """
        Test Module 1 teardown notebook execution.

        Runs only when CI_SKIP_EXECUTION is not "true".

        IMPORTANT: Run this test AFTER the other execution tests. Once the
        destroy sections of the notebook are uncommented, it tears down all
        infrastructure created during testing.

        Note: As shipped, the destructive sections of the teardown notebook are
        commented out, so this test validates the notebook's structure and
        execution flow without destroying anything. Actual resource destruction
        requires uncommenting those sections in the notebook itself.
        """
        notebook_path = NOTEBOOKS_DIR / "module-1" / "99_teardown.ipynb"
        # Teardown may take longer, especially for Terraform destroy
        # Using 30 minutes timeout to allow for full infrastructure teardown
        success, output = execute_notebook(notebook_path, timeout=1800)  # 30 minutes
        assert success, f"Teardown notebook execution failed:\n{output}"
# Module 2 tests
class TestModule2Notebooks(TestNotebookExecution):
    """Test Module 2 notebooks."""

    @pytest.mark.parametrize("notebook", [
        "01_sso_oidc_validation.ipynb",
        "02_sso_saml_validation.ipynb",
    ])
    def test_module2_notebook_syntax(self, notebook):
        """Test Module 2 notebook syntax."""
        notebook_path = NOTEBOOKS_DIR / "module-2" / notebook
        self._validate_notebook_syntax(notebook_path)

    @pytest.mark.skipif(
        os.environ.get("CI_SKIP_EXECUTION") == "true",
        reason="Skipping execution in CI (requires infrastructure)"
    )
    @pytest.mark.parametrize("notebook", [
        "01_sso_oidc_validation.ipynb",
        "02_sso_saml_validation.ipynb",
    ])
    def test_module2_notebook_execution(self, notebook):
        """Test Module 2 notebook execution (only if infrastructure available)."""
        notebook_path = NOTEBOOKS_DIR / "module-2" / notebook
        success, output = execute_notebook(notebook_path, timeout=300)
        assert success, f"Notebook execution failed:\n{output}"
# Module 3 tests
class TestModule3Notebooks(TestNotebookExecution):
    """Test Module 3 notebooks."""

    @pytest.mark.parametrize("notebook", [
        "01_ops_sanity_checks.ipynb",
    ])
    def test_module3_notebook_syntax(self, notebook):
        """Test Module 3 notebook syntax."""
        notebook_path = NOTEBOOKS_DIR / "module-3" / notebook
        self._validate_notebook_syntax(notebook_path)

    @pytest.mark.skipif(
        os.environ.get("CI_SKIP_EXECUTION") == "true",
        reason="Skipping execution in CI (requires infrastructure)"
    )
    @pytest.mark.parametrize("notebook", [
        "01_ops_sanity_checks.ipynb",
    ])
    def test_module3_notebook_execution(self, notebook):
        """Test Module 3 notebook execution (only if infrastructure available)."""
        notebook_path = NOTEBOOKS_DIR / "module-3" / notebook
        # Ops sanity checks may take longer due to resource usage checks
        success, output = execute_notebook(notebook_path, timeout=600)
        assert success, f"Notebook execution failed:\n{output}"
# Module 4 tests
class TestModule4Notebooks(TestNotebookExecution):
    """Test Module 4 notebooks."""

    @pytest.mark.parametrize("notebook", [
        "00_setup_or_resume_environment.ipynb",
        "01_diagnostics_baseline.ipynb",
        "10_failure_lab_postgres.ipynb",
        "20_failure_lab_redis.ipynb",
        "30_failure_lab_clickhouse.ipynb",
        "40_failure_lab_blob_storage.ipynb",
    ])
    def test_module4_notebook_syntax(self, notebook):
        """Test Module 4 notebook syntax."""
        notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
        self._validate_notebook_syntax(notebook_path)

    @pytest.mark.skipif(
        os.environ.get("CI_SKIP_EXECUTION") == "true",
        reason="Skipping execution in CI (requires infrastructure)"
    )
    @pytest.mark.parametrize("notebook", [
        "00_setup_or_resume_environment.ipynb",
        "01_diagnostics_baseline.ipynb",
    ])
    def test_module4_notebook_execution(self, notebook):
        """
        Test Module 4 notebook execution (only if infrastructure available).

        Covers the setup and baseline notebooks, which perform read-only validation.
        Failure labs are executed separately in test_module4_failure_lab_execution
        and must only run against test environments.
        """
        notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
        # Setup and baseline checks may take longer due to diagnostics collection
        success, output = execute_notebook(notebook_path, timeout=600)
        assert success, f"Notebook execution failed:\n{output}"

    @pytest.mark.skipif(
        os.environ.get("CI_SKIP_EXECUTION") == "true",
        reason="Skipping execution in CI (requires infrastructure and failure injection)"
    )
    @pytest.mark.parametrize("notebook", [
        "10_failure_lab_postgres.ipynb",
        "20_failure_lab_redis.ipynb",
        "30_failure_lab_clickhouse.ipynb",
        "40_failure_lab_blob_storage.ipynb",
    ])
    def test_module4_failure_lab_execution(self, notebook):
        """
        Test Module 4 failure lab notebook execution (only if infrastructure available).

        WARNING: These notebooks inject failures by modifying secrets and configurations.
        They should only be run in test environments, not production.

        These tests validate that failure injection and remediation workflows function
        correctly. The notebooks include safety mechanisms (commented-out injection code)
        but should still be used with caution.
        """
        notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
        # Failure labs may take longer due to failure injection, observation, and remediation
        success, output = execute_notebook(notebook_path, timeout=900)  # 15 minutes
        assert success, f"Notebook execution failed:\n{output}"