mirror of
https://github.com/langchain-ai/langsmith-self-hosted-workshops.git
synced 2026-07-01 20:44:14 -04:00
Merge pull request #1 from langchain-ai/cwaddingham/refinements
Refinements to the initial creation
@@ -0,0 +1,3 @@
|
||||
# Automatically strip output cells from Jupyter notebooks before committing
|
||||
*.ipynb filter=nbstripout
|
||||
|
||||
@@ -0,0 +1,120 @@
|
||||
# GitHub Actions Workflows
|
||||
|
||||
This directory contains CI/CD workflows for the LangSmith Self-Hosted Workshops repository.
|
||||
|
||||
## Workflows
|
||||
|
||||
### `test-notebooks.yml`
|
||||
|
||||
Main workflow for testing notebook syntax and execution.
|
||||
|
||||
**Triggers:**
|
||||
- Pull requests to `main`/`master`
|
||||
- Pushes to `main`/`master`
|
||||
- Manual workflow dispatch
|
||||
|
||||
**Jobs:**
|
||||
1. **test-notebook-syntax**: Validates notebook JSON structure
|
||||
2. **test-module-1-preflight**: Tests Module 1 preflight notebook
|
||||
3. **test-module-2-syntax**: Tests Module 2 auth validation notebooks
|
||||
4. **lint-python**: Lints Python code in shared modules
|
||||
|
||||
## Environment Variables
|
||||
|
||||
### Required for Syntax Tests (Always Available)
|
||||
|
||||
These are set in the workflow and don't require secrets:
|
||||
|
||||
```yaml
NAMESPACE: "langsmith-test"
CLUSTER_NAME: "test-cluster"
HELM_RELEASE: "langsmith"
CLOUD_PROVIDER: "aws"
AWS_REGION: "us-west-2"
LANGSMITH_DOMAIN: "test.langsmith.example.com"
```
|
||||
|
||||
### Required for Full Execution Tests (Optional)
|
||||
|
||||
For full notebook execution, set these in GitHub Secrets:
|
||||
|
||||
**AWS:**
|
||||
- `AWS_ACCESS_KEY_ID`
|
||||
- `AWS_SECRET_ACCESS_KEY`
|
||||
- `AWS_REGION`
|
||||
- `AWS_ACCOUNT_ID` (optional, for validation)
|
||||
|
||||
**Azure:**
|
||||
- `AZURE_CLIENT_ID`
|
||||
- `AZURE_CLIENT_SECRET`
|
||||
- `AZURE_TENANT_ID`
|
||||
- `AZURE_SUBSCRIPTION_ID`
|
||||
- `AZURE_LOCATION`
|
||||
|
||||
**Infrastructure:**
|
||||
- `CLUSTER_NAME`
|
||||
- `NAMESPACE`
|
||||
- `TERRAFORM_REPO_DIR`
|
||||
- `HELM_REPO_DIR`
|
||||
|
||||
**OIDC/SAML (Module 2):**
|
||||
- `OIDC_ISSUER`
|
||||
- `OIDC_CLIENT_ID`
|
||||
- `OIDC_CLIENT_SECRET`
|
||||
- `OIDC_REDIRECT_URI`
|
||||
- `SAML_METADATA_URL` (if using SAML)
|
||||
|
||||
## Customizing Workflows
|
||||
|
||||
### Adding New Test Jobs
|
||||
|
||||
1. Add job to `test-notebooks.yml`
|
||||
2. Set appropriate `needs:` dependencies
|
||||
3. Configure environment variables
|
||||
4. Add artifact uploads if needed
|
||||
|
||||
### Enabling Full Execution Tests
|
||||
|
||||
To enable full notebook execution in CI:
|
||||
|
||||
1. Set required secrets in GitHub repository settings
|
||||
2. Remove or modify `CI_SKIP_EXECUTION` environment variable
|
||||
3. Update test conditions in `test_notebook_execution.py`
|
||||
|
||||
### Adding New Modules
|
||||
|
||||
When adding Module 3, 4, etc.:
|
||||
|
||||
1. Create new test class in `test_notebook_execution.py`
|
||||
2. Add parametrized tests for new notebooks
|
||||
3. Add new job in GitHub Actions workflow
|
||||
4. Update this README
|
||||
|
||||
## Workflow Status
|
||||
|
||||
Workflow status badges can be added to README:
|
||||
|
||||
```markdown
|
||||

|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Workflow Fails on Syntax Tests
|
||||
|
||||
- Check notebook JSON is valid
|
||||
- Verify all imports are available
|
||||
- Check Python version compatibility
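
To reproduce the JSON check locally before pushing, a minimal sketch along the lines of the workflow's inline validation might look like the following. The helper name `check_notebook` is an assumption made for illustration; it is not a function that exists in this repository.

```python
import json
from pathlib import Path


def check_notebook(path):
    """Return a list of problems found in a notebook file (empty list means OK).

    Hypothetical helper: it only repeats the structural checks that the
    workflow performs (file exists, valid JSON, non-empty `cells` list).
    """
    nb_path = Path(path)
    if not nb_path.exists():
        return [f"notebook not found: {nb_path}"]
    try:
        nb = json.loads(nb_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON in {nb_path}: {exc}"]
    if not isinstance(nb, dict):
        return [f"top-level JSON value is not an object in {nb_path}"]
    cells = nb.get("cells")
    if not isinstance(cells, list) or not cells:
        return [f"no cells in {nb_path}"]
    return []
```

Calling `check_notebook("notebooks/module-1/01_preflight.ipynb")` from the repository root returns an empty list when the file passes the same structural checks as CI.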
|
||||
|
||||
### Workflow Times Out
|
||||
|
||||
- Increase `timeout-minutes` in job definition
|
||||
- Check for long-running operations
|
||||
- Consider splitting into smaller jobs
|
||||
|
||||
### Environment Variable Issues
|
||||
|
||||
- Verify secrets are set in repository settings
|
||||
- Check variable names match exactly
|
||||
- Ensure secrets are accessible to workflow
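
A quick way to narrow down which variable is the problem is to list the names that are unset or empty. The sketch below is illustrative only: the grouping and the function name `missing_variables` are assumptions, while the variable names come from the lists earlier in this README. Empty values are treated as missing because GitHub Actions expands an unset secret to an empty string.

```python
import os

# Variable names copied from "Required for Full Execution Tests" above.
# The grouping keys are hypothetical labels used only in this example.
REQUIRED_BY_GROUP = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    "infrastructure": ["CLUSTER_NAME", "NAMESPACE", "TERRAFORM_REPO_DIR", "HELM_REPO_DIR"],
    "oidc": ["OIDC_ISSUER", "OIDC_CLIENT_ID", "OIDC_CLIENT_SECRET", "OIDC_REDIRECT_URI"],
}


def missing_variables(groups, environ=None):
    """Return required variable names that are unset or empty, in list order."""
    env = os.environ if environ is None else environ
    missing = []
    for group in groups:
        for name in REQUIRED_BY_GROUP[group]:
            if not env.get(name):
                missing.append(name)
    return missing
```

Running `missing_variables(["aws", "oidc"])` inside a workflow step reports only names, never values, so the result is safe to print in logs.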
|
||||
|
||||
@@ -0,0 +1,256 @@
|
||||
name: Test Notebooks
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
branches:
|
||||
- main
|
||||
- master
|
||||
push:
|
||||
branches:
|
||||
- main
|
||||
- master
|
||||
workflow_dispatch: # Allow manual triggering
|
||||
|
||||
jobs:
|
||||
test-notebook-syntax:
|
||||
name: Test Notebook Syntax
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 30
|
||||
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.10'
|
||||
cache: 'pip'
|
||||
|
||||
- name: Install system dependencies
|
||||
run: |
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y jq
|
||||
|
||||
- name: Install test dependencies
|
||||
run: |
|
||||
pip install -r tests/requirements.txt
|
||||
|
||||
- name: Run notebook syntax tests
|
||||
env:
|
||||
CI_SKIP_EXECUTION: "true" # Skip full execution, only test syntax
|
||||
NAMESPACE: "langsmith-test"
|
||||
CLUSTER_NAME: "test-cluster"
|
||||
HELM_RELEASE: "langsmith"
|
||||
CLOUD_PROVIDER: "aws"
|
||||
AWS_REGION: "us-west-2"
|
||||
AZURE_LOCATION: "eastus"
|
||||
LANGSMITH_DOMAIN: "test.langsmith.example.com"
|
||||
OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
|
||||
OIDC_CLIENT_ID: "test-client-id"
|
||||
OIDC_CLIENT_SECRET: "test-client-secret"
|
||||
OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"
|
||||
run: |
|
||||
pytest tests/test_notebook_execution.py::TestModule1Notebooks::test_module1_notebook_syntax -v
|
||||
pytest tests/test_notebook_execution.py::TestModule2Notebooks::test_module2_notebook_syntax -v
|
||||
pytest tests/test_notebook_execution.py::TestModule3Notebooks::test_module3_notebook_syntax -v
|
||||
pytest tests/test_notebook_execution.py::TestModule4Notebooks::test_module4_notebook_syntax -v
|
||||
|
||||
- name: Upload test artifacts
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: test-artifacts
|
||||
path: tests/artifacts/
|
||||
retention-days: 1
|
||||
|
||||
test-module-1-preflight:
|
||||
name: Test Module 1 Preflight (Dry Run)
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 15
|
||||
needs: test-notebook-syntax
|
||||
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.10'
|
||||
cache: 'pip'
|
||||
|
||||
- name: Install system dependencies
|
||||
run: |
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y jq
|
||||
# Stub the CLI tools with /bin/true so PATH lookups succeed (the stubs accept any arguments and do nothing)
|
||||
sudo ln -sf /bin/true /usr/local/bin/aws || true
|
||||
sudo ln -sf /bin/true /usr/local/bin/terraform || true
|
||||
sudo ln -sf /bin/true /usr/local/bin/helm || true
|
||||
sudo ln -sf /bin/true /usr/local/bin/kubectl || true
|
||||
|
||||
- name: Install test dependencies
|
||||
run: |
|
||||
pip install -r tests/requirements.txt
|
||||
|
||||
- name: Create test environment file
|
||||
run: |
|
||||
mkdir -p notebooks
|
||||
cat > notebooks/workshop.env <<EOF
|
||||
WORKSHOP_NAME="langsmith-test"
|
||||
NAMESPACE="langsmith-test"
|
||||
CLUSTER_NAME="test-cluster"
|
||||
AWS_REGION="us-west-2"
|
||||
HELM_RELEASE="langsmith"
|
||||
ARTIFACTS_DIR="./tests/artifacts"
|
||||
DRY_RUN="true"
|
||||
EOF
|
||||
|
||||
- name: Run Module 1 preflight notebook (syntax only)
|
||||
env:
|
||||
CI_SKIP_EXECUTION: "true"
|
||||
NAMESPACE: "langsmith-test"
|
||||
CLUSTER_NAME: "test-cluster"
|
||||
HELM_RELEASE: "langsmith"
|
||||
CLOUD_PROVIDER: "aws"
|
||||
AWS_REGION: "us-west-2"
|
||||
run: |
|
||||
# Test that notebook can be loaded and parsed
|
||||
python -c "
import json
import sys
from pathlib import Path

nb_path = Path('notebooks/module-1/01_preflight.ipynb')
with open(nb_path) as f:
    nb = json.load(f)

# Validate structure
assert 'cells' in nb
assert len(nb['cells']) > 0
print(f'✅ Notebook structure valid: {len(nb[\"cells\"])} cells')
sys.exit(0)
"
|
||||
|
||||
- name: Upload test artifacts
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: module-1-artifacts
|
||||
path: tests/artifacts/
|
||||
retention-days: 1
|
||||
|
||||
test-module-2-syntax:
|
||||
name: Test Module 2 Notebooks (Syntax)
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 15
|
||||
needs: test-notebook-syntax
|
||||
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.10'
|
||||
cache: 'pip'
|
||||
|
||||
- name: Install test dependencies
|
||||
run: |
|
||||
pip install -r tests/requirements.txt
|
||||
|
||||
- name: Create test environment file
|
||||
run: |
|
||||
mkdir -p notebooks
|
||||
cat > notebooks/workshop.env <<EOF
|
||||
WORKSHOP_NAME="langsmith-test"
|
||||
NAMESPACE="langsmith-test"
|
||||
CLUSTER_NAME="test-cluster"
|
||||
AWS_REGION="us-west-2"
|
||||
HELM_RELEASE="langsmith"
|
||||
ARTIFACTS_DIR="./tests/artifacts"
|
||||
LANGSMITH_DOMAIN="test.langsmith.example.com"
|
||||
OIDC_ISSUER="https://test-idp.example.com/oauth2/default"
|
||||
OIDC_CLIENT_ID="test-client-id"
|
||||
OIDC_CLIENT_SECRET="test-client-secret"
|
||||
OIDC_REDIRECT_URI="https://test.langsmith.example.com/auth/callback"
|
||||
EOF
|
||||
|
||||
- name: Validate Module 2 notebooks
|
||||
env:
|
||||
CI_SKIP_EXECUTION: "true"
|
||||
NAMESPACE: "langsmith-test"
|
||||
CLUSTER_NAME: "test-cluster"
|
||||
HELM_RELEASE: "langsmith"
|
||||
LANGSMITH_DOMAIN: "test.langsmith.example.com"
|
||||
OIDC_ISSUER: "https://test-idp.example.com/oauth2/default"
|
||||
OIDC_CLIENT_ID: "test-client-id"
|
||||
OIDC_CLIENT_SECRET: "test-client-secret"
|
||||
OIDC_REDIRECT_URI: "https://test.langsmith.example.com/auth/callback"
|
||||
run: |
|
||||
python -c "
import json
import sys
from pathlib import Path

notebooks = [
    'notebooks/module-2/01_sso_oidc_validation.ipynb',
    'notebooks/module-2/02_sso_saml_validation.ipynb',
]

for nb_path_str in notebooks:
    nb_path = Path(nb_path_str)
    if not nb_path.exists():
        print(f'❌ Notebook not found: {nb_path}')
        sys.exit(1)

    with open(nb_path) as f:
        nb = json.load(f)

    assert 'cells' in nb, f'Missing cells in {nb_path}'
    assert len(nb['cells']) > 0, f'No cells in {nb_path}'

    code_cells = [c for c in nb['cells'] if c.get('cell_type') == 'code']
    print(f'✅ {nb_path.name}: {len(code_cells)} code cells, {len(nb[\"cells\"])} total cells')

print('✅ All Module 2 notebooks validated')
sys.exit(0)
"
|
||||
|
||||
- name: Upload test artifacts
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: module-2-artifacts
|
||||
path: tests/artifacts/
|
||||
retention-days: 1
|
||||
|
||||
lint-python:
|
||||
name: Lint Python Code
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 10
|
||||
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.10'
|
||||
cache: 'pip'
|
||||
|
||||
- name: Install linting tools
|
||||
run: |
|
||||
pip install flake8 black isort
|
||||
|
||||
- name: Run flake8
|
||||
run: |
|
||||
flake8 notebooks/shared/ tests/ --max-line-length=120 --ignore=E501,W503 || true
|
||||
|
||||
- name: Check code formatting with black
|
||||
run: |
|
||||
black --check notebooks/shared/ tests/ || true
|
||||
|
||||
@@ -6,6 +6,8 @@ The workshop is designed for **platform, infrastructure, and MLOps engineers** r
|
||||
|
||||
> **Note:** This workshop assumes deployment on *NIX-based servers, preferably Linux. If you must use Windows, please raise an issue in the [GitHub](https://github.com/langchain-ai/langsmith-self-hosted-workshops) repo and LangChain engineers will address it.
|
||||
|
||||
> **Note:** This workshop uses Jupyter notebooks for its demonstrations. You can run them locally on your own [Jupyter server](https://jupyter.org/) or open them through Google's [GitHub-to-Colab tool](https://githubtocolab.com) with your existing Google Suite account.
|
||||
|
||||
This repo complements (but does not replace) the high-level deployment instructions in [the LangSmith documentation](https://docs.langchain.com). Where the docs explain *what* to do, this workshop focuses on *how to do it safely and repeatedly*.
|
||||
|
||||
---
|
||||
@@ -178,7 +180,7 @@ git clone https://github.com/langchain-ai/helm.git <your-helm-path>
|
||||
### 3. Start the Workshop
|
||||
|
||||
1. Read `docs/modules/module-1.md` for module overview and context
|
||||
-2. Open `notebooks/module-1/01_aws_preflight.ipynb` in Jupyter
+2. Open `notebooks/module-1/01_preflight.ipynb` in Jupyter
|
||||
3. Run the bootstrap cell (first cell) to validate your environment
|
||||
4. Follow the notebook cells sequentially
|
||||
|
||||
|
||||
@@ -0,0 +1,606 @@
|
||||
# Module 1: Deployment & Baseline Validation
|
||||
|
||||
**Goal:** Deploy LangSmith self-hosted using the official Terraform and Helm repositories, establishing a supported baseline configuration.
|
||||
|
||||
**Duration:** ~2 hours
|
||||
**Audience:** Platform engineers, infrastructure teams, and operators deploying LangSmith for the first time
|
||||
**Prerequisites:**
|
||||
- Cloud provider account with appropriate permissions
|
||||
- Local tooling installed (`aws`/`az`, `terraform`, `kubectl`, `helm`, `jq`)
|
||||
- LangSmith self-hosted license key
|
||||
- Basic familiarity with Kubernetes (pods, services, ingress)
|
||||
|
||||
---
|
||||
|
||||
## Motivation
|
||||
|
||||
Most self-hosted LangSmith failures occur **before** users ever touch the product:
|
||||
|
||||
- Mis-sized clusters that "work" until users arrive
|
||||
- Unsupported ingress setups causing connectivity issues
|
||||
- In-cluster databases used past their limits
|
||||
- Missing storage primitives (blob storage, persistent volumes)
|
||||
- Incorrect infrastructure configuration leading to data loss
|
||||
|
||||
Module 1 exists to ensure every deployment starts from a **supported baseline** using the **official Terraform and Helm repositories**. This baseline becomes the foundation for production operations (Module 3) and authentication (Module 2).
|
||||
|
||||
---
|
||||
|
||||
## Outcomes
|
||||
|
||||
By the end of this module, participants will:
|
||||
|
||||
- Deploy cloud infrastructure using the official `langchain-ai/terraform` repository
|
||||
- Install LangSmith using the official `langchain-ai/helm` chart
|
||||
- Validate cluster readiness, storage, and ingress
|
||||
- Understand *why* specific architectural choices are required
|
||||
- Establish a baseline configuration for future modules
|
||||
- Be ready to layer in authentication (Module 2) and production operations (Module 3)
|
||||
|
||||
---
|
||||
|
||||
## What This Module Avoids
|
||||
|
||||
- **SSO / OIDC / SAML:** Covered in Module 2
|
||||
- **HA tuning beyond defaults:** Covered in Module 3
|
||||
- **Advanced autoscaling (KEDA):** Covered in Module 3
|
||||
- **Performance benchmarking:** Out of scope
|
||||
- **Custom infrastructure:** We use official Terraform modules only
|
||||
- **Forked repositories:** We reference official repos directly
|
||||
|
||||
This keeps the baseline clean, repeatable, and supportable.
|
||||
|
||||
---
|
||||
|
||||
## Architecture Baseline (What We Support)
|
||||
|
||||
This workshop uses a **single, opinionated baseline**:
|
||||
|
||||
### Compute
|
||||
- **AWS:** Amazon EKS (Elastic Kubernetes Service)
|
||||
- **Azure:** Azure Kubernetes Service (AKS)
|
||||
- **GCP:** Google Kubernetes Engine (GKE) - coming soon
|
||||
|
||||
### Ingress
|
||||
- **AWS:** AWS Application Load Balancer (ALB) - cloud-native load balancer only
|
||||
- **Azure:** Azure Application Gateway - cloud-native load balancer only
|
||||
- **Why:** Cloud-native load balancers provide automatic scaling, health checks, and integration with cloud provider services
|
||||
|
||||
### Datastores
|
||||
- **PostgreSQL:** Managed service (RDS for AWS, Azure Database for PostgreSQL)
|
||||
- **Redis:** Managed service (ElastiCache for AWS, Azure Cache for Redis)
|
||||
- **ClickHouse:** Managed service (ClickHouse Cloud) OR in-cluster with EBS CSI/Azure Disk CSI
|
||||
- **Why:** Managed services reduce operational overhead and provide automated backups
|
||||
|
||||
### Blob Storage
|
||||
- **AWS:** S3 (Simple Storage Service) - **required for production**
|
||||
- **Azure:** Azure Blob Storage - **required for production**
|
||||
- **Why:** Without blob storage, ClickHouse table size explodes under load, making the system unusable
|
||||
|
||||
### Provisioning
|
||||
- **Infrastructure:** Terraform (official `langchain-ai/terraform` repository)
|
||||
- **Application:** Helm (official `langchain-ai/helm` chart)
|
||||
|
||||
### Deviations
|
||||
|
||||
Deviations from this baseline are discussed in advanced modules but not used here. This ensures:
|
||||
- Support can help troubleshoot standard configurations
|
||||
- Updates and security patches are straightforward
|
||||
- Documentation and runbooks apply directly
|
||||
|
||||
---
|
||||
|
||||
## Workshop Flow
|
||||
|
||||
### 1️⃣ Environment Readiness & Preflight (20–30 min)
|
||||
|
||||
**Notebook:** `01_preflight.ipynb`
|
||||
|
||||
**What we validate:**
|
||||
- Tooling validation (cloud CLI, terraform, kubectl, helm, jq)
|
||||
- Cloud provider credentials & region sanity check
|
||||
- Cluster capacity expectations
|
||||
- Storage prerequisites (CSI drivers, StorageClasses)
|
||||
- Blob storage requirement (cloud object storage)
|
||||
|
||||
**Key emphasis:**
|
||||
- Verify you're using the correct cloud account/subscription
|
||||
- Ensure all required tools are available and in PATH
|
||||
- Validate storage CSI drivers are installed
|
||||
- Confirm blob storage is accessible
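
The tooling check can be pictured as a PATH lookup for each required executable. The sketch below is an assumption about how such a check could be written, not the notebook's actual code; the names `find_missing_tools`, `COMMON_TOOLS`, and `CLOUD_CLI` are hypothetical.

```python
import shutil

# Tools named in the prerequisites; the cloud CLI depends on the provider.
COMMON_TOOLS = ["terraform", "kubectl", "helm", "jq"]
CLOUD_CLI = {"aws": "aws", "azure": "az"}


def find_missing_tools(cloud_provider, which=shutil.which):
    """Return the required executables that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    tools = [CLOUD_CLI[cloud_provider]] + COMMON_TOOLS
    return [tool for tool in tools if which(tool) is None]
```

An empty result from `find_missing_tools("aws")` only shows that the executables are present; it says nothing about their versions or whether credentials are configured.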
|
||||
|
||||
**Output:**
|
||||
- Environment validated and ready
|
||||
- Artifacts directory created
|
||||
- Cloud provider identity confirmed
|
||||
|
||||
---
|
||||
|
||||
### 2️⃣ Terraform: Provisioning the Platform Substrate (45–60 min)
|
||||
|
||||
**Notebook:** `02_terraform_apply.ipynb`
|
||||
|
||||
**What we deploy:**
|
||||
- Managed Kubernetes cluster (EKS/AKS)
|
||||
- Managed PostgreSQL database (RDS/Azure Database)
|
||||
- Managed Redis cache (ElastiCache/Azure Cache)
|
||||
- Object storage for blob storage (S3/Azure Blob Storage)
|
||||
- IAM/RBAC roles and policies
|
||||
- Storage CSI driver addon
|
||||
|
||||
**Key principles:**
|
||||
- Use the **official** Terraform repo (do not fork)
|
||||
- Pin module versions for reproducibility
|
||||
- Use remote state & locking
|
||||
- Plan before applying
|
||||
- Capture outputs needed for Helm
|
||||
|
||||
**Workflow:**
|
||||
1. Clone and navigate to official Terraform repository
|
||||
2. Identify correct module path for your cloud provider
|
||||
3. Pin module versions in `versions.tf`
|
||||
4. Configure Terraform variables (region, cluster name, database credentials)
|
||||
5. Initialize Terraform (`terraform init`)
|
||||
6. Create Terraform plan (`terraform plan`)
|
||||
7. Review plan carefully
|
||||
8. Apply infrastructure (`terraform apply`)
|
||||
9. Capture outputs for Helm configuration
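
For the last step, `terraform output -json` prints every output as an object with `value`, `type`, and `sensitive` fields. A minimal sketch for splitting that document into plain and sensitive values is shown below. The function name `extract_outputs` and the output names in the usage note are assumptions; the real names depend on the Terraform module you apply.

```python
import json


def extract_outputs(terraform_output_json):
    """Split `terraform output -json` text into (plain, sensitive) dicts.

    Sensitive outputs are returned separately so they can be routed to
    Kubernetes secrets rather than written into a values file.
    """
    raw = json.loads(terraform_output_json)
    plain, sensitive = {}, {}
    for name, entry in raw.items():
        target = sensitive if entry.get("sensitive") else plain
        target[name] = entry["value"]
    return plain, sensitive
```

With hypothetical outputs named `cluster_name` and `postgres_password`, the first would land in `plain` and the second in `sensitive`. Note that `terraform output -json` prints sensitive values in clear text, so avoid saving its raw output in the artifacts directory.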
|
||||
|
||||
**Key emphasis:**
|
||||
- Why we do *not* fork upstream
|
||||
- Why remote state & locking matter
|
||||
- What support will expect to see later
|
||||
- How to interpret Terraform outputs
|
||||
|
||||
**Output:**
|
||||
- Infrastructure deployed and healthy
|
||||
- Terraform outputs captured
|
||||
- Cluster accessible via kubectl
|
||||
|
||||
---
|
||||
|
||||
### 3️⃣ Helm: Installing LangSmith (45–60 min)
|
||||
|
||||
**Notebook:** `03_helm_install_langsmith.ipynb`
|
||||
|
||||
**What we install:**
|
||||
- LangSmith application components
|
||||
- External service connections (PostgreSQL, Redis, blob storage)
|
||||
- Resource requests & limits
|
||||
- Ingress configuration
|
||||
|
||||
**Key principles:**
|
||||
- Use the **official** Helm chart (do not fork)
|
||||
- Pin chart versions for reproducibility
|
||||
- Create minimal, sane values file
|
||||
- Inject required secrets properly
|
||||
- Render templates before install
|
||||
- Understand that "helm install succeeded" ≠ "system is healthy"
|
||||
|
||||
**Workflow:**
|
||||
1. Clone and navigate to official Helm repository
|
||||
2. Identify correct chart path
|
||||
3. Pin chart version
|
||||
4. Create minimal values file:
|
||||
- External service connections (database, cache, blob storage)
|
||||
- Resource requests & limits
|
||||
- Ingress configuration
|
||||
- Required secrets
|
||||
5. Create Kubernetes secrets for sensitive values
|
||||
6. Render templates (`helm template`) to validate
|
||||
7. Install chart (`helm install`)
|
||||
8. Verify installation (`helm status`)
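
The render, install, and verify steps above can be sketched as an ordered list of commands. This is an illustration under stated assumptions rather than the notebook's implementation: the helper names are hypothetical, and the chart path and values file are placeholders you would replace with your own.

```python
def helm_command(action, release, chart_path, values_file, namespace):
    """Build the argv list for `helm template` or `helm install`."""
    if action not in ("template", "install"):
        raise ValueError(f"unsupported helm action: {action}")
    return ["helm", action, release, chart_path, "-f", values_file, "-n", namespace]


def render_then_install_plan(release, chart_path, values_file, namespace):
    """Return the commands in workflow order: render, install, then status."""
    return [
        helm_command("template", release, chart_path, values_file, namespace),
        helm_command("install", release, chart_path, values_file, namespace),
        ["helm", "status", release, "-n", namespace],
    ]
```

Each command can be passed to `subprocess.run(cmd, check=True)` in order, so a failed render stops the sequence before anything is installed.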
|
||||
|
||||
**Key emphasis:**
|
||||
- External services wiring (why managed services matter)
|
||||
- Resource requests & limits (why they're required)
|
||||
- Why "helm install succeeded" ≠ "system is healthy"
|
||||
- Start with minimal values file and only configure what you need
|
||||
|
||||
**Output:**
|
||||
- LangSmith application deployed
|
||||
- Pods starting (may not be ready yet)
|
||||
- Helm release created
|
||||
|
||||
---
|
||||
|
||||
### 4️⃣ Validation & Go/No-Go Checklist (20–30 min)
|
||||
|
||||
**Notebook:** `04_validate_ingress_and_ui.ipynb`
|
||||
|
||||
**What we validate:**
|
||||
1. Pod readiness (all pods running)
|
||||
2. License key validation (properly configured)
|
||||
3. PVC binding (storage provisioned)
|
||||
4. External services connectivity (PostgreSQL, Redis, blob storage)
|
||||
5. Ingress provisioning (load balancer created)
|
||||
6. Endpoint reachability (services accessible)
|
||||
7. Basic UI availability (web interface works)
|
||||
8. Basic functional test (optional trace submission)
|
||||
|
||||
**Key emphasis:**
|
||||
- This checklist becomes your **baseline reference** for future troubleshooting
|
||||
- Most issues are caught here, before real users onboard
|
||||
- Validation ensures you're on a **supported path**
|
||||
|
||||
**Workflow:**
|
||||
1. Verify all pods are running and ready
|
||||
2. Validate license key is configured correctly
|
||||
3. Check PVCs are bound (storage provisioned)
|
||||
4. Test connectivity to external services
|
||||
5. Verify ingress is provisioned and accessible
|
||||
6. Test endpoint reachability (HTTPS)
|
||||
7. Verify UI is accessible
|
||||
8. Optional: Submit test trace to validate functionality
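
The first step, pod readiness, can be checked from the JSON that `kubectl get pods -n <namespace> -o json` returns. The following is a minimal sketch, assuming the parsed output is passed in as a dictionary; the function name `unready_pods` is hypothetical.

```python
def unready_pods(pod_list):
    """Return names of pods that are neither Ready nor completed.

    `pod_list` is the parsed output of `kubectl get pods -n <namespace> -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Job pods are not expected to be Ready
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names
```

An empty list corresponds to the "all pods running and ready" item in the checklist. A non-empty list tells you which pods to inspect with `kubectl describe pod`.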
|
||||
|
||||
**Output:**
|
||||
- Deployment validated and healthy
|
||||
- Baseline reference established
|
||||
- Ready for Module 2 (authentication)
|
||||
|
||||
---
|
||||
|
||||
### 5️⃣ Teardown & Cleanup (Optional, 30–45 min)
|
||||
|
||||
**Notebook:** `99_teardown.ipynb`
|
||||
|
||||
**What we clean up:**
|
||||
- Helm release (LangSmith application)
|
||||
- Kubernetes resources (secrets, PVCs)
|
||||
- Terraform-managed infrastructure (cluster, database, cache, blob storage)
|
||||
|
||||
**Key emphasis:**
|
||||
- Avoid ongoing cloud costs
|
||||
- Practice proper resource lifecycle management
|
||||
- Verify cleanup completed successfully
|
||||
|
||||
**Workflow:**
|
||||
1. Uninstall Helm release
|
||||
2. Clean up remaining Kubernetes resources
|
||||
3. Destroy Terraform infrastructure
|
||||
4. Verify all resources removed
|
||||
|
||||
**Output:**
|
||||
- All resources destroyed
|
||||
- No ongoing costs
|
||||
- Clean slate for re-deployment
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls Addressed in Module 1
|
||||
|
||||
### ClickHouse PVCs Stuck in `Pending`
|
||||
|
||||
**Symptom:** ClickHouse pods cannot start, PVCs remain in `Pending` state.
|
||||
|
||||
**Cause:** Missing EBS CSI driver (AWS) or Azure Disk CSI driver (Azure).
|
||||
|
||||
**Fix:** Install CSI driver addon before deploying LangSmith.
|
||||
|
||||
**Prevention:** Preflight checks validate CSI driver installation.
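
One way to confirm the driver is present is to look for its registered name in the output of `kubectl get csidrivers -o json`. The sketch below assumes the standard driver names `ebs.csi.aws.com` and `disk.csi.azure.com`; the function name is hypothetical and the notebook's own check may differ.

```python
# Registered CSIDriver names for the block-storage drivers on each cloud.
EXPECTED_CSI_DRIVER = {"aws": "ebs.csi.aws.com", "azure": "disk.csi.azure.com"}


def csi_driver_installed(csidriver_list, cloud_provider):
    """Return True if the expected CSIDriver object exists in the cluster.

    `csidriver_list` is the parsed output of `kubectl get csidrivers -o json`.
    """
    expected = EXPECTED_CSI_DRIVER[cloud_provider]
    names = {item["metadata"]["name"] for item in csidriver_list.get("items", [])}
    return expected in names
```

A `False` result before installing LangSmith is a strong hint that ClickHouse PVCs would stay in `Pending`.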
|
||||
|
||||
### Load Balancer Never Appears
|
||||
|
||||
**Symptom:** Ingress created but no load balancer provisioned.
|
||||
|
||||
**Cause:** Wrong ingress class or missing ingress controller.
|
||||
|
||||
**Fix:** Use cloud-native ingress class (AWS: `alb`, Azure: `azure/application-gateway`).
|
||||
|
||||
**Prevention:** Preflight checks validate ingress controller installation.
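
Whether a load balancer was actually provisioned shows up in the Ingress object's `status.loadBalancer.ingress` field. A minimal sketch for reading it from `kubectl get ingress <name> -n <namespace> -o json` follows; the function name is an assumption for illustration.

```python
def load_balancer_addresses(ingress):
    """Return hostnames or IPs assigned to an Ingress (empty list if none).

    `ingress` is the parsed output of `kubectl get ingress <name> -o json`.
    """
    entries = ingress.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    addresses = []
    for entry in entries:
        address = entry.get("hostname") or entry.get("ip")
        if address:
            addresses.append(address)
    return addresses
```

An empty list several minutes after the Ingress was created matches the symptom described above and usually points at the ingress class or a missing controller.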
|
||||
|
||||
### Inline Trace Payloads Exploding ClickHouse
|
||||
|
||||
**Symptom:** ClickHouse table size grows rapidly, queries slow down.
|
||||
|
||||
**Cause:** Blob storage not configured, large payloads stored inline in ClickHouse.
|
||||
|
||||
**Fix:** Configure S3 (AWS) or Azure Blob Storage (Azure) before deployment.
|
||||
|
||||
**Prevention:** Preflight checks validate blob storage accessibility.
|
||||
|
||||
### Under-Sized Clusters That "Work" Until Users Arrive
|
||||
|
||||
**Symptom:** Deployment works initially but fails under load.
|
||||
|
||||
**Cause:** Cluster nodes too small, insufficient resources.
|
||||
|
||||
**Fix:** Use recommended node sizes (see the Service Sizing Baselines section of this module, and Module 3 for production tuning).
|
||||
|
||||
**Prevention:** Preflight checks validate cluster capacity expectations.
|
||||
|
||||
### Terraform State Lock Issues
|
||||
|
||||
**Symptom:** `terraform apply` fails with state lock error.
|
||||
|
||||
**Cause:** Another process holds the lock, or previous operation didn't release lock.
|
||||
|
||||
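**Fix:** Confirm that no other Terraform run is in progress, then release the stale lock with `terraform force-unlock <LOCK_ID>` (the lock ID is printed in the error message). Keep using a remote state backend with locking (S3 + DynamoDB for AWS, Azure Storage for Azure).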
|
||||
|
||||
**Prevention:** Terraform configuration uses remote state by default.
|
||||
|
||||
---
|
||||
|
||||
## Service Sizing Baselines
|
||||
|
||||
### Kubernetes Cluster
|
||||
|
||||
**Production baseline:**
|
||||
- **Node instance type:** m5.xlarge (4 vCPU, 16 GB RAM) minimum
|
||||
- **Node count:** 3+ nodes (for HA)
|
||||
- **Storage:** EBS gp3 (AWS) or Premium SSD (Azure) with 100+ GB per node
|
||||
|
||||
**Non-production guidance:**
|
||||
- m5.large (2 vCPU, 8 GB RAM) acceptable for development
|
||||
- 2 nodes sufficient for non-production
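
The cluster baselines above can be expressed as data and compared against a planned node pool. This is a sketch only: the class and function names are hypothetical, and the numbers are copied from the two lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeBaseline:
    min_nodes: int
    min_vcpu: int
    min_memory_gb: int


# m5.xlarge x 3 for production, m5.large x 2 for non-production (from the lists above).
BASELINES = {
    "production": NodeBaseline(min_nodes=3, min_vcpu=4, min_memory_gb=16),
    "non-production": NodeBaseline(min_nodes=2, min_vcpu=2, min_memory_gb=8),
}


def sizing_gaps(tier, node_count, vcpu_per_node, memory_gb_per_node):
    """Return human-readable gaps between a node pool and the tier's baseline."""
    baseline = BASELINES[tier]
    gaps = []
    if node_count < baseline.min_nodes:
        gaps.append(f"node count {node_count} < {baseline.min_nodes}")
    if vcpu_per_node < baseline.min_vcpu:
        gaps.append(f"vCPU per node {vcpu_per_node} < {baseline.min_vcpu}")
    if memory_gb_per_node < baseline.min_memory_gb:
        gaps.append(f"memory per node {memory_gb_per_node} GB < {baseline.min_memory_gb} GB")
    return gaps
```

For example, two m5.xlarge nodes meet the non-production baseline but leave one gap against production, the node count.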
|
||||
|
||||
### PostgreSQL
|
||||
|
||||
**Production baseline:**
|
||||
- **Instance size:** db.r5.xlarge (4 vCPU, 32 GB RAM) minimum
|
||||
- **Storage:** 500 GB+ with autoscaling enabled
|
||||
- **High availability:** Multi-AZ deployment (RDS) or read replicas (Azure)
|
||||
|
||||
**Non-production guidance:**
|
||||
- db.t3.medium (2 vCPU, 4 GB RAM) acceptable for development
|
||||
- Single-AZ acceptable for non-production
|
||||
|
||||
### Redis
|
||||
|
||||
**Production baseline:**
|
||||
- **Instance type:** cache.r6g.xlarge (4 vCPU, 26.32 GiB RAM) minimum
|
||||
- **High availability:** Redis Cluster mode enabled (3+ nodes)
|
||||
|
||||
**Non-production guidance:**
|
||||
- cache.t3.micro acceptable for development
|
||||
- Single node acceptable for non-production
|
||||
|
||||
### ClickHouse
|
||||
|
||||
**Production baseline:**
|
||||
- **Deployment:** Managed ClickHouse (ClickHouse Cloud) OR in-cluster with EBS CSI (AWS) or Azure Disk CSI (Azure)
|
||||
- **In-cluster sizing:** 3-node cluster minimum (for HA)
|
||||
- **Resources per node:** 8 CPU, 32 GB RAM, 1 TB storage
|
||||
|
||||
**Non-production guidance:**
|
||||
- Single node acceptable for development
|
||||
- 4 CPU, 16 GB RAM per node sufficient
|
||||
|
||||
---
|
||||
|
||||
## Blob Storage Requirement
|
||||
|
||||
### Why Blob Storage is Required
|
||||
|
||||
**Problem without blob storage:**
|
||||
- Large trace payloads stored inline in ClickHouse
|
||||
- ClickHouse table size explodes
|
||||
- Query performance degrades
|
||||
- Storage costs increase dramatically
|
||||
- System becomes unusable under load
|
||||
|
||||
**Solution with blob storage:**
|
||||
- Large payloads stored in S3/Azure Blob Storage
|
||||
- ClickHouse stores only references (small strings)
|
||||
- Query performance remains stable
|
||||
- Storage costs scale linearly
|
||||
- System handles production load
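
The reference pattern described above can be illustrated with a toy example. This is not how LangSmith decides what to offload: the threshold, the key format, and the function name are all assumptions chosen to show the idea, and a plain dictionary stands in for S3 or Azure Blob Storage.

```python
import hashlib

INLINE_LIMIT_BYTES = 1024  # hypothetical threshold, chosen only for this example


def store_payload(payload, blob_store, inline_limit=INLINE_LIMIT_BYTES):
    """Return what a database row would hold for `payload` (bytes).

    Small payloads stay inline. Large payloads are written to `blob_store`
    (a dict standing in for object storage) and the row keeps only a key.
    """
    if len(payload) <= inline_limit:
        return {"inline": payload}
    key = "payloads/" + hashlib.sha256(payload).hexdigest()
    blob_store[key] = payload
    return {"ref": key}
```

A multi-megabyte trace payload then costs the database a key of a few dozen characters, which is why table size and query performance stay stable once blob storage is configured.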
|
||||
|
||||
### Requirements
|
||||
|
||||
**Production:**
|
||||
- **Service:** S3 (AWS) or Azure Blob Storage (Azure)
|
||||
- **Bucket/Container:** Dedicated bucket for LangSmith
|
||||
- **Access:** IAM roles (AWS) or Managed Identity (Azure) - no access keys
|
||||
- **Versioning:** Enabled for data protection
|
||||
- **Encryption:** Server-side encryption enabled
|
||||
|
||||
**Non-production:**
|
||||
- Local MinIO or in-cluster object storage acceptable
|
||||
- Access keys acceptable (not for production)
|
||||
- No versioning required
|
||||
|
||||
---
|
||||
|
||||
## Terraform Best Practices
|
||||
|
||||
### Use Official Repository
|
||||
|
||||
**Why:**
|
||||
- Support expects standard configurations
|
||||
- Updates and security patches are provided
|
||||
- Documentation and examples are maintained
|
||||
- Compatibility with Helm chart is guaranteed
|
||||
|
||||
**How:**
|
||||
- Clone `langchain-ai/terraform` repository
|
||||
- Reference modules directly (do not fork)
|
||||
- Pin module versions in `versions.tf`
|
||||
|
||||
### Remote State & Locking
|
||||
|
||||
**Why:**
|
||||
- Prevents concurrent modifications
|
||||
- Enables team collaboration
|
||||
- Provides state history
|
||||
- Prevents state corruption
|
||||
|
||||
**Configuration:**
|
||||
- **AWS:** S3 backend with DynamoDB table for locking
|
||||
- **Azure:** Azure Storage backend with blob container
|
||||
|
||||
### Plan Before Apply
|
||||
|
||||
**Why:**
|
||||
- Review changes before applying
|
||||
- Catch configuration errors early
|
||||
- Understand resource impact
|
||||
- Validate variable values
|
||||
|
||||
**Workflow:**
|
||||
1. `terraform init` - Initialize backend and modules
|
||||
2. `terraform plan` - Generate execution plan
|
||||
3. Review plan carefully
|
||||
4. `terraform apply` - Apply changes
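
The review step is easier when the plan is summarized first. If the plan is saved with `terraform plan -out=tfplan`, `terraform show -json tfplan` prints a machine-readable document whose `resource_changes` entries list the planned actions. The sketch below counts them and flags anything that deletes a resource; the function name and the resource addresses in the usage note are hypothetical.

```python
import json
from collections import Counter


def summarize_plan(plan_json):
    """Summarize `terraform show -json <planfile>` output.

    Returns (counts by action, addresses of resources that would be deleted
    or replaced) so destructive changes stand out during review.
    """
    plan = json.loads(plan_json)
    counts = Counter()
    destructive = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue
        if "delete" in actions:
            destructive.append(change["address"])
        # A replacement appears as a delete/create pair in either order.
        label = "replace" if set(actions) == {"create", "delete"} else actions[0]
        counts[label] += 1
    return dict(counts), destructive
```

If a resource such as a database instance shows up in the second return value, stop and confirm the replacement is intended before running `terraform apply`.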
|
||||
|
||||
---
|
||||
|
||||
## Helm Best Practices
|
||||
|
||||
### Use Official Chart
|
||||
|
||||
**Why:**
|
||||
- Support expects standard configurations
|
||||
- Updates and security patches are provided
|
||||
- Documentation and examples are maintained
|
||||
- Compatibility with Terraform outputs is guaranteed
|
||||
|
||||
**How:**
|
||||
- Clone `langchain-ai/helm` repository
|
||||
- Reference chart directly (do not fork)
|
||||
- Pin chart version
|
||||
|
||||
### Minimal Values File
|
||||
|
||||
**Principle:** Start with minimal configuration and only add what you need.
|
||||
|
||||
**Why:**
|
||||
- Reduces complexity
|
||||
- Fewer points of failure
|
||||
- Easier to troubleshoot
|
||||
- Clearer configuration intent
|
||||
|
||||
**What to include:**
|
||||
- External service connections (database, cache, blob storage)
|
||||
- Resource requests & limits
|
||||
- Ingress configuration
|
||||
- Required secrets
|
||||
|
||||
**What to avoid:**
|
||||
- Configuration for services you're not using
|
||||
- Over-optimization before baseline works
|
||||
- Custom modifications without justification
|
||||
|
||||
### Render Before Install
|
||||
|
||||
**Why:**
|
||||
- Validate template syntax
|
||||
- Review generated manifests
|
||||
- Catch configuration errors early
|
||||
- Understand what will be deployed
|
||||
|
||||
**Command:**
|
||||
```bash
helm template <release-name> <chart-path> -f <values-file> -n <namespace>
```
|
||||
|
||||
---
|
||||
|
||||
## Validation Checklist
|
||||
|
||||
See `notebooks/module-1/04_validate_ingress_and_ui.ipynb` for complete validation.
|
||||
|
||||
**Quick checklist:**
|
||||
- [ ] All pods running and ready
|
||||
- [ ] License key configured correctly
|
||||
- [ ] PVCs bound (storage provisioned)
|
||||
- [ ] External services accessible (PostgreSQL, Redis, blob storage)
|
||||
- [ ] Ingress provisioned and accessible
|
||||
- [ ] Endpoint reachable via HTTPS
|
||||
- [ ] UI accessible in browser
|
||||
- [ ] Basic functional test passes (optional)
|
||||
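The first checklist item can be checked mechanically. The following is a minimal sketch, assuming the output of `kubectl get pods -n <namespace> -o json` has already been parsed into a dictionary; the function name is hypothetical and the validation notebook may perform this check differently.

```python
from typing import List


def not_ready_pods(pod_list: dict) -> List[str]:
    """Return names of pods that are not Running with a Ready=True condition.

    `pod_list` follows the Kubernetes PodList shape (`items[].status.phase`
    and `items[].status.conditions`). Pods in phase Succeeded, such as
    completed Jobs, are skipped.
    """
    failing = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if phase != "Running" or not ready:
            failing.append(pod["metadata"]["name"])
    return failing
```

An empty list means every long-running pod reports ready; any name returned is a candidate for `kubectl describe pod` and `kubectl logs`.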
|
||||
---
|
||||
|
||||
## Artifacts Participants Leave With
|
||||
|
||||
1. **Working baseline deployment**
|
||||
- LangSmith accessible via HTTPS
|
||||
- All services healthy and connected
|
||||
- Ingress configured correctly
|
||||
|
||||
2. **Pinned Terraform + Helm configuration**
|
||||
- Terraform module versions documented
|
||||
- Helm chart version documented
|
||||
- Values file saved and version controlled
|
||||
|
||||
3. **Validated ingress endpoint**
|
||||
- HTTPS URL accessible
|
||||
- TLS certificate valid
|
||||
- DNS configured correctly
|
||||
|
||||
4. **Readiness checklist**
|
||||
- Validation results documented
|
||||
- Baseline reference established
|
||||
- Troubleshooting evidence collected
|
||||
|
||||
5. **Confidence they're on a supported path**
|
||||
- Official repositories used
|
||||
- Standard configuration applied
|
||||
- Support can help troubleshoot
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Run the validation notebook:**
|
||||
- `notebooks/module-1/04_validate_ingress_and_ui.ipynb`
|
||||
- Address any failures before proceeding
|
||||
|
||||
2. **Proceed to Module 2:**
|
||||
- Configure authentication (OIDC/SAML)
|
||||
- Set up role mapping
|
||||
- Validate SSO flows
|
||||
|
||||
3. **Proceed to Module 3:**
|
||||
- Configure production operations
|
||||
- Set up autoscaling
|
||||
- Establish observability
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Official Terraform Repository](https://github.com/langchain-ai/terraform)
|
||||
- [Official Helm Repository](https://github.com/langchain-ai/helm)
|
||||
- LangSmith Self-Hosted Documentation
|
||||
- Cloud Provider Documentation (AWS EKS, Azure AKS)
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Terraform apply fails:**
|
||||
- Check cloud provider credentials
|
||||
- Verify IAM permissions
|
||||
- Review Terraform plan for errors
|
||||
- Check remote state backend configuration
|
||||
|
||||
**Helm install fails:**
|
||||
- Verify chart path is correct
|
||||
- Check values file syntax
|
||||
- Validate secrets exist
|
||||
- Review Helm template output
|
||||
|
||||
**Pods not starting:**
|
||||
- Check pod logs: `kubectl logs <pod> -n <namespace>`
|
||||
- Check events: `kubectl get events -n <namespace>`
|
||||
- Verify resource requests/limits
|
||||
- Check PVC binding status
|
||||
|
||||
**Ingress not accessible:**
|
||||
- Verify ingress controller installed
|
||||
- Check ingress class matches controller
|
||||
- Verify DNS configuration
|
||||
- Check TLS certificate validity
|
||||
|
||||
**External services not accessible:**
|
||||
- Verify network connectivity (VPC/VNet)
|
||||
- Check security group/NSG rules
|
||||
- Validate connection strings
|
||||
- Test connectivity from pod
|
||||
|
||||
For detailed troubleshooting, see the validation notebook and Module 3 operations guide.
|
||||
|
||||
@@ -0,0 +1,541 @@
|
||||
# Module 2: Identity & Authentication
|
||||
|
||||
**Duration:** ~2 hours
|
||||
**Audience:** Operators deploying and managing LangSmith self-hosted
|
||||
**Prerequisite:** Module 1 complete (working deployment with DNS/TLS/Ingress configured)
|
||||
|
||||
---
|
||||
|
||||
## Motivation
|
||||
|
||||
Most production LangSmith deployments require centralized identity management. Configuring SSO **before** onboarding users prevents:
|
||||
|
||||
- Manual user provisioning overhead
|
||||
- Security gaps from shared credentials
|
||||
- Compliance violations from unmanaged access
|
||||
- Operational toil from authentication failures
|
||||
|
||||
This module ensures your authentication setup is **correct from day one**, not retrofitted after users are already in the system.
|
||||
|
||||
---
|
||||
|
||||
## Outcomes
|
||||
|
||||
By the end of this module, participants will:
|
||||
|
||||
- Understand LangSmith's authentication and authorization model
|
||||
- Configure OIDC or SAML SSO with their identity provider
|
||||
- Validate authentication flows end-to-end
|
||||
- Map identity provider groups to LangSmith roles
|
||||
- Troubleshoot common authentication failures
|
||||
- Maintain authentication configuration as code
|
||||
|
||||
---
|
||||
|
||||
## What This Module Avoids
|
||||
|
||||
- **IdP admin tutorials:** We assume your IdP team provides required configuration values
|
||||
- **SCIM deep-dive:** User provisioning via SCIM is out of scope
|
||||
- **Multi-IdP scenarios:** We focus on single IdP configuration
|
||||
- **Local auth production use:** Local authentication is discouraged for production deployments
|
||||
|
||||
---
|
||||
|
||||
## Supported Identity Models
|
||||
|
||||
### OIDC (Preferred)
|
||||
- **When to use:** Modern IdPs (Okta, Azure AD, Google Workspace, Auth0)
|
||||
- **Advantages:** Standard protocol, easier debugging, better error messages
|
||||
- **Requirements:** OIDC-compliant IdP with client credentials
|
||||
|
||||
### SAML (Fallback)
|
||||
- **When to use:** Legacy IdPs or enterprise requirements
|
||||
- **Advantages:** Widely supported, enterprise-standard
|
||||
- **Requirements:** SAML 2.0 IdP with metadata endpoint or XML file
|
||||
|
||||
### Local Authentication (Discouraged)
|
||||
- **When to use:** Development/testing only
|
||||
- **Limitations:** No centralized management, manual user creation, security risk
|
||||
- **Note:** This module does not cover local auth configuration
|
||||
|
||||
---
|
||||
|
||||
## Authentication Request Flow
|
||||
|
||||
```
|
||||
┌─────────┐ ┌──────────────┐ ┌─────────────┐
|
||||
│ Browser │ │ LangSmith │ │ Identity │
|
||||
│ │ │ (Ingress) │ │ Provider │
|
||||
└────┬────┘ └──────┬───────┘ └──────┬──────┘
|
||||
│ │ │
|
||||
│ 1. GET /login │ │
|
||||
├────────────────────>│ │
|
||||
│ │ │
|
||||
│ 2. Redirect to IdP │ │
|
||||
│ (with state) │ │
|
||||
│<────────────────────┤ │
|
||||
│ │ │
|
||||
│ 3. GET /authorize │ │
|
||||
├───────────────────────────────────────────────>│
|
||||
│ │ │
|
||||
│ 4. User authenticates │
|
||||
│ (IdP UI) │
|
||||
│ │ │
|
||||
│ 5. Callback with code/token │
|
||||
│<───────────────────────────────────────────────┤
|
||||
│ │ │
|
||||
│ 6. POST /callback │ │
|
||||
├────────────────────>│ │
|
||||
│ │ │
|
||||
│ 7. Exchange code for token │
|
||||
│ ├─────────────────────────>│
|
||||
│ │<─────────────────────────┤
|
||||
│ │ │
|
||||
│ 8. Validate token & extract claims │
|
||||
│ │ │
|
||||
│ 9. Create/update user session │
|
||||
│ │ │
|
||||
│ 10. Redirect to dashboard │
|
||||
│<────────────────────┤ │
|
||||
│ │ │
|
||||
```
|
||||
|
||||
**Key Points:**
|
||||
- Redirect URI must match **exactly** (protocol, domain, path, trailing slashes)
|
||||
- State parameter prevents CSRF attacks
|
||||
- Token validation includes signature, expiration, and issuer verification
|
||||
- Claims mapping determines user roles and workspace access
|
||||
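To make the token validation point concrete, here is a hedged sketch of the issuer, audience, and expiration checks applied to already-decoded ID token claims. It is not LangSmith's implementation; the function name and the 300-second leeway are assumptions for illustration, and signature verification is deliberately left out because it needs the IdP's published keys and a JWT library.

```python
import time
from typing import List, Optional


def check_claims(claims: dict, issuer: str, client_id: str,
                 now: Optional[float] = None, leeway_seconds: int = 300) -> List[str]:
    """Return a list of problems found in decoded ID token claims.

    Does NOT verify the token signature.
    """
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("issuer mismatch")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if client_id not in audiences:
        problems.append("audience does not include client ID")
    exp = claims.get("exp")
    # The leeway tolerates small clock differences between LangSmith and the IdP.
    if not isinstance(exp, (int, float)) or exp + leeway_seconds < now:
        problems.append("token expired or missing exp")
    return problems
```

An empty result means these three checks passed; it says nothing about whether the signature is valid.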
|
||||
---
|
||||
|
||||
## Workshop Flow
|
||||
|
||||
### 1. LangSmith Authentication Model
|
||||
|
||||
**Authentication vs Authorization:**
|
||||
- **Authentication (AuthN):** "Who are you?" - Verified by IdP
|
||||
- **Authorization (AuthZ):** "What can you do?" - Determined by role mapping
|
||||
|
||||
**Roles:**
|
||||
- **Admin:** Full system access, workspace management, user management
|
||||
- **Member:** Workspace access, project creation, trace viewing
|
||||
- **Viewer:** Read-only access to assigned workspaces
|
||||
|
||||
**Workspaces & Organizations:**
|
||||
- Users belong to **organizations** (top-level container)
|
||||
- Users access **workspaces** within organizations
|
||||
- Role mapping determines which workspaces a user can access
|
||||
- **No shared admin accounts** - each user authenticates individually
|
||||
|
||||
**Key Principle:** Authentication is centralized (IdP); authorization is application-level (LangSmith role mapping).
|
||||
|
||||
---
|
||||
|
||||
### 2. Choosing OIDC vs SAML
|
||||
|
||||
**Decision Rule:**
|
||||
|
||||
```
|
||||
IF IdP supports OIDC AND you can configure OIDC client
|
||||
→ Use OIDC (preferred)
|
||||
ELSE IF IdP only supports SAML OR enterprise requires SAML
|
||||
→ Use SAML (fallback)
|
||||
ELSE
|
||||
→ Re-evaluate IdP choice
|
||||
```
|
||||
|
||||
**OIDC Advantages:**
|
||||
- Better error messages
|
||||
- Easier debugging (standard endpoints)
|
||||
- Modern protocol with better security defaults
|
||||
- Simpler configuration
|
||||
|
||||
**SAML Advantages:**
|
||||
- Enterprise-standard
|
||||
- Widely supported
|
||||
- Mature protocol
|
||||
|
||||
**Recommendation:** Start with OIDC unless blocked by IdP limitations or policy.
|
||||
|
||||
---
|
||||
|
||||
### 3. Configuring OIDC
|
||||
|
||||
#### Required IdP Inputs
|
||||
|
||||
Your IdP team must provide:
|
||||
|
||||
1. **Issuer URL** (e.g., `https://your-org.okta.com/oauth2/default`)
|
||||
- Must be HTTPS
|
||||
- Must be reachable from LangSmith pods
|
||||
- Used for discovery and token validation
|
||||
|
||||
2. **Client ID**
|
||||
- OAuth2 client identifier
|
||||
- Public value (safe to log)
|
||||
|
||||
3. **Client Secret**
|
||||
- OAuth2 client secret
|
||||
- **Never log or print**
|
||||
- Store in Kubernetes secret
|
||||
|
||||
4. **Redirect URI**
|
||||
- **Exact format:** `https://your-langsmith-domain.com/auth/callback`
|
||||
- Must match **exactly** (case-sensitive; no trailing slash unless the URI registered with the IdP includes one)
|
||||
- IdP team must whitelist this URI
|
||||
|
||||
5. **Required Claims**
|
||||
- `email` (required): User email address
|
||||
- `name` (optional): Display name
|
||||
- `groups` (optional): Group membership for role mapping
|
||||
|
||||
6. **Scopes**
|
||||
- `openid` (required)
|
||||
- `email` (required)
|
||||
- `profile` (optional)
|
||||
- `groups` (optional, if using group-based role mapping)
|
||||
|
||||
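Because the issuer URL drives discovery, it is worth deriving the discovery endpoint before touching Helm values. A minimal sketch, assuming the standard OpenID Connect Discovery location (`/.well-known/openid-configuration` appended to the issuer); the function name is illustrative.

```python
from urllib.parse import urlsplit


def discovery_url(issuer: str) -> str:
    """Build the OIDC discovery document URL for an issuer.

    Rejects issuers that are not absolute HTTPS URLs, matching the
    requirement listed above.
    """
    parts = urlsplit(issuer)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"issuer must be an absolute HTTPS URL: {issuer!r}")
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


if __name__ == "__main__":
    print(discovery_url("https://your-org.okta.com/oauth2/default"))
```

Fetching the resulting URL from inside a LangSmith pod confirms reachability, and the `issuer` field in the returned JSON should equal the configured issuer string.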
#### Helm/Environment Configuration
|
||||
|
||||
**Helm Values (recommended):**
|
||||
|
||||
```yaml
|
||||
auth:
|
||||
provider: oidc
|
||||
oidc:
|
||||
issuer: "https://your-org.okta.com/oauth2/default"
|
||||
clientId: "your-client-id"
|
||||
clientSecret:
|
||||
secretName: langsmith-oidc-secret
|
||||
secretKey: client-secret
|
||||
redirectURI: "https://your-langsmith-domain.com/auth/callback"
|
||||
scopes:
|
||||
- openid
|
||||
- email
|
||||
- profile
|
||||
- groups
|
||||
claimMapping:
|
||||
email: email
|
||||
name: name
|
||||
groups: groups
|
||||
```
|
||||
|
||||
**Environment Variables (alternative):**
|
||||
|
||||
```bash
|
||||
AUTH_PROVIDER=oidc
|
||||
OIDC_ISSUER=https://your-org.okta.com/oauth2/default
|
||||
OIDC_CLIENT_ID=your-client-id
|
||||
OIDC_CLIENT_SECRET=<from-secret>
|
||||
OIDC_REDIRECT_URI=https://your-langsmith-domain.com/auth/callback
|
||||
OIDC_SCOPES=openid,email,profile,groups
|
||||
```
|
||||
|
||||
#### Redirect URI Exactness
|
||||
|
||||
**Critical:** The redirect URI must match **exactly** between:
|
||||
- LangSmith configuration
|
||||
- IdP whitelist
|
||||
- Actual callback URL
|
||||
|
||||
**Common Mistakes:**
|
||||
- Trailing slash mismatch: `/auth/callback` vs `/auth/callback/`
|
||||
- Protocol mismatch: `http://` vs `https://`
|
||||
- Domain mismatch: `langsmith.example.com` vs `www.langsmith.example.com`
|
||||
- Port mismatch: `:443` vs no port
|
||||
|
||||
**Validation:** Use the validation notebook to verify exact match.
|
||||
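The mismatches listed above can be pinpointed by comparing the two URIs component by component. This is a sketch with a hypothetical helper name, not the notebook's own check; it compares without normalizing case, default ports, or trailing slashes.

```python
from typing import List
from urllib.parse import urlsplit


def redirect_uri_differences(configured: str, registered: str) -> List[str]:
    """List the URI components that differ between two redirect URIs."""
    if configured == registered:
        return []
    a, b = urlsplit(configured), urlsplit(registered)
    fields = ("scheme", "netloc", "path", "query", "fragment")
    diffs = [
        f"{name}: {getattr(a, name)!r} != {getattr(b, name)!r}"
        for name in fields
        if getattr(a, name) != getattr(b, name)
    ]
    # urlsplit lowercases the scheme, so fall back to a generic message
    # when the raw strings differ but the parsed components do not.
    return diffs or ["raw strings differ (letter case or whitespace)"]


if __name__ == "__main__":
    for line in redirect_uri_differences(
        "https://langsmith.example.com/auth/callback",
        "https://langsmith.example.com/auth/callback/",
    ):
        print(line)
```

Run it with the value from the Helm values file as `configured` and the value registered with the IdP as `registered`; an empty list means the strings are identical.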
|
||||
#### TLS Requirements
|
||||
|
||||
- IdP issuer URL must be HTTPS
|
||||
- LangSmith domain must have valid TLS certificate
|
||||
- Certificate must be trusted by browser (not self-signed for production)
|
||||
- Certificate must cover the LangSmith domain exactly (a wildcard such as `*.example.com` matches only one subdomain level)
|
||||
|
||||
#### Clock Skew
|
||||
|
||||
- LangSmith and IdP clocks must be synchronized
|
||||
- Maximum allowed skew: typically 5 minutes
|
||||
- Use NTP on Kubernetes nodes
|
||||
- Validate with: `kubectl exec <pod> -n <namespace> -- date -u` vs IdP server time
|
||||
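One way to estimate skew without IdP admin access is to compare the pod's UTC time with the `Date` header returned by the IdP (for example from `curl -sI <issuer-url>`). The sketch below assumes that header is a reasonable proxy for the IdP's clock; the helper names and the 5-minute default are illustrative.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def clock_skew(local_now: datetime, idp_date_header: str) -> timedelta:
    """Return local time minus the time reported in an HTTP Date header.

    `local_now` must be timezone-aware (UTC recommended).
    """
    return local_now - parsedate_to_datetime(idp_date_header)


def within_tolerance(skew: timedelta,
                     max_skew: timedelta = timedelta(minutes=5)) -> bool:
    """True when the absolute skew does not exceed the allowed maximum."""
    return abs(skew) <= max_skew


if __name__ == "__main__":
    local = datetime(2026, 7, 1, 12, 3, 0, tzinfo=timezone.utc)
    skew = clock_skew(local, "Wed, 01 Jul 2026 12:00:00 GMT")
    print(skew, within_tolerance(skew))
```

The HTTP `Date` header has one-second resolution and includes network latency, so this is only suitable for detecting skew on the order of minutes.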
|
||||
---
|
||||
|
||||
### 4. Role Mapping
|
||||
|
||||
**Principle:** Map IdP groups to LangSmith roles, not individual users.
|
||||
|
||||
#### Group-Based Mapping (Recommended)
|
||||
|
||||
```yaml
|
||||
auth:
|
||||
roleMapping:
|
||||
groups:
|
||||
- group: "langsmith-admins"
|
||||
role: "admin"
|
||||
- group: "langsmith-members"
|
||||
role: "member"
|
||||
- group: "langsmith-viewers"
|
||||
role: "viewer"
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- Centralized management in IdP
|
||||
- Easier audit trail
|
||||
- Scales to large organizations
|
||||
|
||||
#### User-Based Mapping (Fallback)
|
||||
|
||||
```yaml
|
||||
auth:
|
||||
roleMapping:
|
||||
users:
|
||||
- email: "admin@example.com"
|
||||
role: "admin"
|
||||
```
|
||||
|
||||
**Use only when:**
|
||||
- Group claims unavailable
|
||||
- Temporary workaround
|
||||
- Small team (< 10 users)
|
||||
|
||||
#### Minimal Admins Principle
|
||||
|
||||
- **Start with 1-2 admins**
|
||||
- Add admins only when necessary
|
||||
- Use group-based mapping for admins
|
||||
- Document admin assignments
|
||||
|
||||
#### Mapping Claims to Roles
|
||||
|
||||
**Claim Structure:**
|
||||
|
||||
```json
|
||||
{
|
||||
"email": "user@example.com",
|
||||
"name": "John Doe",
|
||||
"groups": ["langsmith-members", "engineering"]
|
||||
}
|
||||
```
|
||||
|
||||
**Mapping Logic:**
|
||||
1. Extract `groups` claim
|
||||
2. Match against role mapping configuration
|
||||
3. Assign highest privilege role found
|
||||
4. Default to "member" if no match
|
||||
|
||||
**Validation:** Test with users in different groups to verify mapping.
|
||||
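The four mapping steps can be sketched as a small function. This is an illustration of the logic described above, not LangSmith's code; the privilege order (viewer < member < admin) is inferred from the role descriptions in this module, and the names are hypothetical.

```python
# Privilege order, lowest to highest (assumed from the role descriptions).
ROLE_RANK = {"viewer": 0, "member": 1, "admin": 2}


def resolve_role(groups, group_to_role, default="member"):
    """Pick the highest-privilege role among the user's mapped groups.

    `groups` is the value of the `groups` claim (may be None or empty);
    `group_to_role` mirrors the `roleMapping.groups` configuration.
    """
    matched = [group_to_role[g] for g in (groups or []) if g in group_to_role]
    if not matched:
        return default
    return max(matched, key=ROLE_RANK.__getitem__)


if __name__ == "__main__":
    mapping = {
        "langsmith-admins": "admin",
        "langsmith-members": "member",
        "langsmith-viewers": "viewer",
    }
    print(resolve_role(["langsmith-members", "engineering"], mapping))
```

Walking a few test users through this function by hand is a quick way to predict what the validation step should show for each of them.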
|
||||
---
|
||||
|
||||
### 5. SAML Configuration
|
||||
|
||||
#### Required Metadata
|
||||
|
||||
Your IdP team must provide:
|
||||
|
||||
1. **SAML Metadata URL** (preferred)
|
||||
- HTTPS endpoint serving XML metadata
|
||||
- Must be reachable from LangSmith pods
|
||||
- Auto-refreshes configuration
|
||||
|
||||
2. **SAML Metadata XML** (fallback)
|
||||
- Static XML file
|
||||
- Must be updated manually when IdP changes
|
||||
- Store in Kubernetes secret or ConfigMap
|
||||
|
||||
#### Expected Attributes
|
||||
|
||||
**Required:**
|
||||
- `email` or `http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress`
|
||||
- `name` or `http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name`
|
||||
|
||||
**Optional (for role mapping):**
|
||||
- `groups` or `http://schemas.microsoft.com/ws/2008/06/identity/claims/groups`
|
||||
- Custom attribute names (must match exactly)
|
||||
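Because IdPs may send either the short or the URI-style attribute names, it can help to normalize them before checking for required attributes. A minimal sketch using only the names listed above; the helper names are hypothetical and the attributes are assumed to be already extracted from the SAML assertion into a dictionary.

```python
from typing import List

# URI-style names from the lists above, mapped to their short equivalents.
ATTRIBUTE_ALIASES = {
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress": "email",
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name": "name",
    "http://schemas.microsoft.com/ws/2008/06/identity/claims/groups": "groups",
}
REQUIRED = ("email", "name")


def normalize_attributes(assertion_attrs: dict) -> dict:
    """Rename URI-style attribute names to short form; keep others as-is."""
    return {ATTRIBUTE_ALIASES.get(key, key): value
            for key, value in assertion_attrs.items()}


def missing_required(assertion_attrs: dict) -> List[str]:
    """Return required attribute names that are absent or empty."""
    normalized = normalize_attributes(assertion_attrs)
    return [name for name in REQUIRED if not normalized.get(name)]
```

A non-empty result from `missing_required` corresponds to the "Missing Attributes" failure described in the next subsection.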
|
||||
#### Common Failures
|
||||
|
||||
1. **Missing Attributes**
|
||||
- Symptom: User authenticates but has no email/name
|
||||
- Cause: IdP not sending required attributes
|
||||
- Fix: Configure IdP to send required attributes
|
||||
|
||||
2. **Attribute Name Mismatch**
|
||||
- Symptom: Claims not mapped correctly
|
||||
- Cause: LangSmith expects different attribute name
|
||||
- Fix: Update attribute mapping in Helm values
|
||||
|
||||
3. **Signature Validation Failure**
|
||||
- Symptom: Authentication fails with "invalid signature"
|
||||
- Cause: Certificate mismatch or expired certificate
|
||||
- Fix: Update IdP certificate in metadata
|
||||
|
||||
4. **Assertion Expired**
|
||||
- Symptom: Authentication times out
|
||||
- Cause: Clock skew or assertion validity window too short
|
||||
- Fix: Synchronize clocks, adjust validity window
|
||||
|
||||
---
|
||||
|
||||
### 6. Validation & Failure Drills
|
||||
|
||||
#### Validation Checklist
|
||||
|
||||
See `docs/shared/auth_validation_checklist.md` for complete checklist.
|
||||
|
||||
**Quick Validation:**
|
||||
1. ✅ Ingress/TLS configured correctly
|
||||
2. ✅ Redirect URI matches exactly
|
||||
3. ✅ IdP issuer reachable
|
||||
4. ✅ Client credentials valid
|
||||
5. ✅ Role mapping configured
|
||||
6. ✅ Login flow works end-to-end
|
||||
7. ✅ Logout works
|
||||
8. ✅ Session invalidation works
|
||||
|
||||
#### Failure Drills
|
||||
|
||||
**Purpose:** Understand failure modes and recovery procedures.
|
||||
|
||||
**Drill 1: Redirect URI Mismatch**
|
||||
- **Change:** Modify redirect URI in Helm values (add trailing slash)
|
||||
- **Observe:** Login redirect fails
|
||||
- **Recover:** Revert change, restart pods
|
||||
- **Validate:** Login works again
|
||||
|
||||
**Drill 2: Missing Claim**
|
||||
- **Change:** Remove `groups` claim from IdP configuration
|
||||
- **Observe:** Users authenticate, but group-based role mapping no longer applies (they fall back to the default role)
|
||||
- **Recover:** Restore `groups` claim
|
||||
- **Validate:** Role mapping works again
|
||||
|
||||
**Drill 3: Incorrect Secret Rotation**
|
||||
- **Change:** Update client secret in IdP but not in LangSmith
|
||||
- **Observe:** Authentication fails with "invalid client"
|
||||
- **Recover:** Update Kubernetes secret, restart pods
|
||||
- **Validate:** Authentication works again
|
||||
|
||||
**Note:** These drills are **optional** and should only be run in non-production environments.
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### Login Loop
|
||||
**Symptom:** User redirected to IdP, then back to LangSmith, then to IdP again (infinite loop)
|
||||
|
||||
**Causes:**
|
||||
- Redirect URI mismatch
|
||||
- Session cookie not set (TLS/cookie issues)
|
||||
- Token validation failure
|
||||
|
||||
**Fix:** Check redirect URI exactness, verify TLS certificate, check token validation logs
|
||||
|
||||
### No Data After Login
|
||||
**Symptom:** User authenticates successfully but sees empty workspace
|
||||
|
||||
**Causes:**
|
||||
- Role mapping not configured
|
||||
- User not in any mapped groups
|
||||
- Workspace not assigned to user's organization
|
||||
|
||||
**Fix:** Verify role mapping configuration, check user's group membership, verify workspace assignment
|
||||
|
||||
### TLS Callback Issues
|
||||
**Symptom:** IdP callback fails with TLS errors
|
||||
|
||||
**Causes:**
|
||||
- Self-signed certificate on LangSmith domain
|
||||
- Certificate chain incomplete
|
||||
- Certificate expired
|
||||
|
||||
**Fix:** Use valid TLS certificate from trusted CA, ensure full chain is present
|
||||
|
||||
### Multiple IdPs
|
||||
**Symptom:** Confusion about which IdP to use
|
||||
|
||||
**Causes:**
|
||||
- Multiple IdP configurations present
|
||||
- Configuration precedence unclear
|
||||
|
||||
**Fix:** Use single IdP configuration, remove unused configurations
|
||||
|
||||
---
|
||||
|
||||
## Security & Compliance Callouts
|
||||
|
||||
### Least Privilege
|
||||
- Start with minimal admin access
|
||||
- Use group-based role mapping
|
||||
- Regular access reviews
|
||||
- Document all admin assignments
|
||||
|
||||
### Auditability
|
||||
- All authentication events logged
|
||||
- Role changes tracked
|
||||
- Session creation/destruction logged
|
||||
- Export logs to SIEM for compliance
|
||||
|
||||
### Centralized Identity Governance
|
||||
- Manage users in IdP, not LangSmith
|
||||
- Use IdP groups for access control
|
||||
- Regular access reviews in IdP
|
||||
- Deprovision users in IdP when they leave
|
||||
|
||||
---
|
||||
|
||||
## Artifacts Participants Leave With
|
||||
|
||||
1. **SSO Configuration**
|
||||
- Helm values file with auth configuration
|
||||
- Kubernetes secrets for client credentials
|
||||
- Documentation of IdP settings
|
||||
|
||||
2. **IdP Settings Document**
|
||||
- Redirect URI whitelisted
|
||||
- Required claims configured
|
||||
- Scopes configured
|
||||
- Group structure documented
|
||||
|
||||
3. **Mapping Reference**
|
||||
- Group-to-role mapping table
|
||||
- Admin assignments documented
|
||||
- Workspace access rules
|
||||
|
||||
4. **Validation Checklist**
|
||||
- Completed validation checklist
|
||||
- Test results for admin and standard user
|
||||
- Logout/session invalidation verified
|
||||
|
||||
5. **Debugging Playbook**
|
||||
- Troubleshooting guide reference
|
||||
- Log locations documented
|
||||
- Support bundle procedure
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Run the validation notebook:**
|
||||
- `notebooks/module-2/01_sso_oidc_validation.ipynb` (OIDC)
|
||||
- `notebooks/module-2/02_sso_saml_validation.ipynb` (SAML)
|
||||
|
||||
2. **Complete the validation checklist:**
|
||||
- `docs/shared/auth_validation_checklist.md`
|
||||
|
||||
3. **Review troubleshooting guide:**
|
||||
- `docs/shared/auth_troubleshooting.md`
|
||||
|
||||
4. **Proceed to Module 3** (if applicable)
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)
|
||||
- [SAML 2.0 Specification](http://docs.oasis-open.org/security/saml/v2.0/)
|
||||
- LangSmith Helm Chart Documentation
|
||||
- Your IdP's OIDC/SAML documentation
|
||||
|
||||
@@ -0,0 +1,679 @@
|
||||
# Module 3: Production Operations & Scaling
|
||||
|
||||
**Goal:** Enable operators to run LangSmith reliably under real production load, understand scaling domains, and respond effectively when things go wrong (day-2 operations).
|
||||
|
||||
**Duration:** ~2 hours
|
||||
**Audience:** Platform engineers, infrastructure teams, SREs, and on-call operators
|
||||
**Prerequisites:**
|
||||
- Module 1 complete: LangSmith deployed and reachable (AWS/EKS or Azure/AKS baseline)
|
||||
- Module 2 complete: Authentication and authorization configured (OIDC/SAML)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Module 3 transitions from "it works" to "it works reliably under load." This module covers production operations, scaling strategies, observability, and the mental models needed for day-2 operations.
|
||||
|
||||
**What you'll accomplish:**
|
||||
- Understand LangSmith's distributed architecture and scaling domains
|
||||
- Configure production-grade service sizing and HA
|
||||
- Implement autoscaling strategies (HPA and KEDA)
|
||||
- Set up observability and early warning signals
|
||||
- Validate production readiness
|
||||
- Prepare for incident response
|
||||
|
||||
**What this module avoids:**
|
||||
- Deep dives into specific monitoring tools (Prometheus/Grafana setup)
|
||||
- Custom alerting rule creation (covered in incident response)
|
||||
- Performance tuning and optimization (out of scope)
|
||||
- Multi-region deployments (advanced topic)
|
||||
|
||||
---
|
||||
|
||||
## Section 1: Production Mental Model
|
||||
|
||||
### Distributed System Reality
|
||||
|
||||
LangSmith is a **distributed system** with multiple services that must coordinate:
|
||||
|
||||
- **API Server:** Handles HTTP requests, authentication, routing
|
||||
- **Workers:** Process traces, spans, and evaluations asynchronously
|
||||
- **ClickHouse:** Time-series data storage and queries
|
||||
- **PostgreSQL:** Metadata, users, workspaces, projects
|
||||
- **Redis:** Caching, rate limiting, job queues
|
||||
- **Blob Storage:** Large payload storage (traces, artifacts)
|
||||
|
||||
**Key insight:** These services have different scaling characteristics and failure modes. Understanding these differences is critical for production operations.
|
||||
|
||||
### Scaling Domains
|
||||
|
||||
**Scaling domains** are groups of resources that scale together or have shared bottlenecks:
|
||||
|
||||
1. **Ingestion Domain:**
|
||||
- API server pods (stateless, horizontal scaling)
|
||||
- Ingress/Load Balancer (cloud-managed, scales automatically)
|
||||
- **Bottleneck:** API server CPU/memory under high request volume
|
||||
|
||||
2. **Processing Domain:**
|
||||
- Worker pods (stateless, horizontal scaling)
|
||||
- Redis (single instance or cluster, vertical scaling)
|
||||
- **Bottleneck:** Worker capacity and Redis throughput
|
||||
|
||||
3. **Storage Domain:**
|
||||
- ClickHouse (stateful, complex scaling)
|
||||
- PostgreSQL (stateful, vertical scaling + read replicas)
|
||||
- Blob Storage (cloud-managed, effectively unlimited)
|
||||
- **Bottleneck:** ClickHouse query performance, PostgreSQL connection limits
|
||||
|
||||
4. **Control Plane:**
|
||||
- Kubernetes cluster (managed service)
|
||||
- Helm releases, ConfigMaps, Secrets
|
||||
- **Bottleneck:** Cluster capacity and node resources
|
||||
|
||||
**Critical understanding:** Scaling one domain without addressing downstream bottlenecks creates cascading failures.
|
||||
|
||||
---
|
||||
|
||||
## Section 2: Scaling Model
|
||||
|
||||
### What Scales Well
|
||||
|
||||
**Horizontal scaling (add more pods):**
|
||||
- API server pods (stateless HTTP handlers)
|
||||
- Worker pods (stateless job processors)
|
||||
- Ingress controllers (cloud-managed load balancers)
|
||||
|
||||
**Why:** These services are stateless and can be scaled independently based on load.
|
||||
|
||||
### What Does NOT Autoscale
|
||||
|
||||
**Vertical scaling only (increase resources per instance):**
|
||||
- PostgreSQL (managed RDS/Azure Database)
|
||||
- Redis (managed ElastiCache/Azure Cache)
|
||||
- ClickHouse (in-cluster or managed, complex scaling)
|
||||
|
||||
**Why:** These are stateful services with data locality requirements. Scaling requires careful planning and may involve downtime.
|
||||
|
||||
**Manual scaling required:**
|
||||
- Kubernetes node capacity (cluster autoscaling helps, but has limits)
|
||||
- Blob storage buckets (unlimited capacity, but requires configuration)
|
||||
- Network bandwidth (cloud-managed, but has limits)
|
||||
|
||||
### Failure Pattern: HPA Increases Ingestion → Downstream Saturation
|
||||
|
||||
**Common anti-pattern:**
|
||||
|
||||
1. High request volume triggers HPA to scale API server pods
|
||||
2. API servers successfully handle more requests
|
||||
3. Workers cannot keep up with increased trace volume
|
||||
4. Redis queue fills up
|
||||
5. ClickHouse ingestion rate saturates
|
||||
6. PostgreSQL connection pool exhausts
|
||||
7. System degrades despite "scaled" API servers
|
||||
|
||||
**Solution:** Scale all domains together, or implement backpressure and rate limiting.
|
||||
|
||||
**Key principle:** Monitor downstream services, not just upstream services.
|
||||
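The failure pattern above is easy to see with a toy model of the job queue: if traces arrive faster than workers drain them, depth grows without bound no matter how many API pods exist. The rates below are made-up numbers for illustration, not LangSmith measurements.

```python
from typing import List


def queue_depth(ingest_per_sec: float, process_per_sec: float,
                seconds: int, start: float = 0.0) -> List[float]:
    """Queue depth at the end of each second for constant rates."""
    depth = start
    history = []
    for _ in range(seconds):
        # Depth cannot go negative: workers idle once the queue is empty.
        depth = max(0.0, depth + ingest_per_sec - process_per_sec)
        history.append(depth)
    return history


if __name__ == "__main__":
    # Hypothetical: API tier scaled to accept 150 jobs/s, workers drain 100 jobs/s.
    print(queue_depth(150, 100, 5))
```

A steadily rising series is the signal to scale workers (or apply rate limiting) rather than add more API pods.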
|
||||
---
|
||||
|
||||
## Section 3: Service Sizing Baselines
|
||||
|
||||
### PostgreSQL (Database)
|
||||
|
||||
**Production baseline:**
|
||||
- **Instance size:** db.r5.xlarge (4 vCPU, 32 GB RAM) minimum
|
||||
- **Storage:** 500 GB+ with autoscaling enabled
|
||||
- **High availability:** Multi-AZ deployment (RDS) or zone-redundant high availability (Azure)
|
||||
- **Connection pool:** 100+ connections configured in LangSmith
|
||||
- **Backups:** Automated daily backups with 7-day retention minimum
|
||||
|
||||
**Non-production guidance:**
|
||||
- db.t3.medium (2 vCPU, 4 GB RAM) acceptable for development
|
||||
- Single-AZ acceptable for non-production
|
||||
- A shorter backup retention period than production is sufficient
|
||||
|
||||
**Verification:**
|
||||
```bash
|
||||
# AWS RDS
|
||||
aws rds describe-db-instances --db-instance-identifier <instance-id>
|
||||
|
||||
# Azure Database
|
||||
az postgres flexible-server show --name <server-name> --resource-group <rg>
|
||||
```
|
||||
|
||||
**What to check:**
|
||||
- Instance class/size
|
||||
- Multi-AZ status
|
||||
- Storage autoscaling enabled
|
||||
- Backup retention period
|
||||
- Private networking (VPC/subnet configuration)
|
||||
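Connection pool sizing interacts with autoscaling: every additional pod can open its own pool. The arithmetic below is a sketch for sanity-checking the worst case against the database's connection limit. All inputs (replica counts, per-pod pool size, reserved headroom) are illustrative assumptions, not LangSmith settings.

```python
def connection_budget(replicas_by_service: dict, pool_size_per_pod: int,
                      max_connections: int, reserved: int = 10) -> dict:
    """Compare worst-case client connections with the database limit.

    `replicas_by_service` should use the maximum replica counts (for
    example the HPA `maxReplicas`); `reserved` is slack kept for admin
    sessions and migrations.
    """
    needed = sum(replicas_by_service.values()) * pool_size_per_pod
    available = max_connections - reserved
    return {"needed": needed, "available": available, "fits": needed <= available}


if __name__ == "__main__":
    # Hypothetical: 10 API pods and 5 workers, 20 connections each.
    print(connection_budget({"api": 10, "worker": 5}, 20, max_connections=400))
```

If `fits` is false at maximum scale, either lower the per-pod pool size, cap replicas, or place a connection pooler in front of PostgreSQL.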
|
||||
### Redis (Cache)
|
||||
|
||||
**Production baseline:**
|
||||
- **Instance type:** cache.r6g.xlarge (4 vCPU, 26.32 GiB RAM) minimum
|
||||
- **High availability:** Redis Cluster mode enabled (3+ nodes)
|
||||
- **Memory:** 50% headroom for growth
|
||||
- **Persistence:** AOF (Append Only File) enabled for durability
|
||||
|
||||
**Non-production guidance:**
|
||||
- cache.t3.micro acceptable for development
|
||||
- Single node acceptable for non-production
|
||||
- RDB snapshots sufficient (no AOF required)
|
||||
|
||||
**Verification:**
|
||||
```bash
|
||||
# AWS ElastiCache
|
||||
aws elasticache describe-cache-clusters --cache-cluster-id <cluster-id>
|
||||
|
||||
# Azure Cache
|
||||
az redis show --name <cache-name> --resource-group <rg>
|
||||
```
|
||||
|
||||
**What to check:**
|
||||
- Node type and memory size
|
||||
- Cluster mode enabled (production)
|
||||
- AOF persistence enabled
|
||||
- Private networking
|
||||
|
||||
### ClickHouse
|
||||
|
||||
**Production baseline:**
|
||||
- **Deployment:** Managed ClickHouse (ClickHouse Cloud) OR in-cluster with EBS CSI
|
||||
- **In-cluster sizing:** 3-node cluster minimum (for HA)
|
||||
- **Resources per node:** 8 CPU, 32 GB RAM, 1 TB storage
|
||||
- **Storage:** EBS gp3 volumes with 3000 IOPS
|
||||
- **Replication:** 2x replication factor (for example, 3 shards with 2 replicas each, 6 pods total)
|
||||
|
||||
**Non-production guidance:**
|
||||
- Single node acceptable for development
|
||||
- 4 CPU, 16 GB RAM per node sufficient
|
||||
- 100 GB storage per node
|
||||
|
||||
**Verification:**
|
||||
```bash
|
||||
# In-cluster ClickHouse
|
||||
kubectl get statefulset -n <namespace> | grep clickhouse
|
||||
kubectl get pvc -n <namespace> | grep clickhouse
|
||||
|
||||
# Check ClickHouse cluster status
|
||||
kubectl exec -it <clickhouse-pod> -n <namespace> -- clickhouse-client --query "SELECT * FROM system.clusters"
|
||||
```
|
||||
|
||||
**What to check:**
|
||||
- StatefulSet replica count
|
||||
- PVC size and storage class
|
||||
- Resource requests/limits
|
||||
- Replication factor
|
||||
|
||||
### Managed vs In-Cluster
|
||||
|
||||
**Managed services (recommended for production):**
|
||||
- PostgreSQL: RDS (AWS) or Azure Database for PostgreSQL
|
||||
- Redis: ElastiCache (AWS) or Azure Cache for Redis
|
||||
- ClickHouse: ClickHouse Cloud (managed service)
|
||||
|
||||
**Benefits:**
|
||||
- Automated backups and maintenance
|
||||
- High availability built-in
|
||||
- Security patches applied automatically
|
||||
- Monitoring and alerting included
|
||||
|
||||
**In-cluster services (acceptable for non-production):**
|
||||
- PostgreSQL: Postgres operator (Crunchy Data, Zalando)
|
||||
- Redis: Redis operator or Helm chart
|
||||
- ClickHouse: ClickHouse operator
|
||||
|
||||
**Trade-offs:**
|
||||
- More operational overhead
|
||||
- Requires backup strategy
|
||||
- Manual HA configuration
|
||||
- Lower cost for development
|
||||
|
||||
### Private Networking
|
||||
|
||||
**Production requirement:** All data stores must be in private subnets with no public internet access.
|
||||
|
||||
**Why:**
|
||||
- Security: Reduces attack surface
|
||||
- Compliance: Required for many compliance frameworks
|
||||
- Performance: Lower latency within VPC/VNet
|
||||
|
||||
**Verification:**
|
||||
- RDS/Azure Database: Check subnet group (private subnets only)
|
||||
- ElastiCache/Azure Cache: Check subnet group (private subnets only)
|
||||
- ClickHouse: Check pod network policies and service mesh egress rules
|
||||
|
||||
---
|
||||
|
||||
## Section 4: Blob Storage REQUIRED for Production
|
||||
|
||||
### Why Blob Storage is Required
|
||||
|
||||
**Problem without blob storage:**
|
||||
- Large trace payloads stored inline in ClickHouse
|
||||
- ClickHouse table size explodes
|
||||
- Query performance degrades
|
||||
- Storage costs increase dramatically
|
||||
- System becomes unusable under load
|
||||
|
||||
**Solution with blob storage:**
|
||||
- Large payloads stored in S3/Azure Blob Storage
|
||||
- ClickHouse stores only references (small strings)
|
||||
- Query performance remains stable
|
||||
- Storage costs scale linearly
|
||||
- System handles production load
|
||||
|
||||
### Requirements
|
||||
|
||||
**Production:**
|
||||
- **Service:** S3 (AWS) or Azure Blob Storage (Azure)
|
||||
- **Bucket/Container:** Dedicated bucket for LangSmith
|
||||
- **Access:** IAM roles (AWS) or Managed Identity (Azure) - no access keys
|
||||
- **Lifecycle policies:** Configured for cost optimization (move to Glacier/Cool tier after 90 days)
|
||||
- **Versioning:** Enabled for data protection
|
||||
- **Encryption:** Server-side encryption enabled
|
||||
|
||||
**Non-production:**
|
||||
- Local MinIO or in-cluster object storage acceptable
|
||||
- Access keys acceptable (not for production)
|
||||
- No lifecycle policies required
|
||||
|
||||
### Verification
|
||||
|
||||
**Check Helm values:**
|
||||
```yaml
blobStorage:
  provider: s3  # or azure
  bucket: langsmith-traces
  region: us-west-2
  # IAM role ARN (not access keys)
  iamRoleArn: arn:aws:iam::<account>:role/langsmith-blob-storage
```
|
||||
|
||||
**Check environment variables:**
|
||||
```bash
|
||||
kubectl exec <api-pod> -n <namespace> -- env | grep -i blob
|
||||
kubectl exec <api-pod> -n <namespace> -- env | grep -i s3
|
||||
```
|
||||
|
||||
**What to verify:**
|
||||
- Blob storage provider configured (not "local" or "filesystem")
|
||||
- Bucket/container name present
|
||||
- IAM role or managed identity configured (no access keys)
|
||||
- Blob storage health check passes (see ops sanity checks notebook)
|
||||
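The checks listed above can be scripted against a parsed Helm values document. The sketch below is illustrative only: it assumes the `blobStorage` layout shown in the example values, and the static-credential key names (`accessKey`, `accessKeyId`, `secretAccessKey`) are hypothetical placeholders, so adjust both to match your chart.

```python
LOCAL_PROVIDERS = {"", "local", "filesystem"}
# Hypothetical key names for static credentials; adjust to your chart.
STATIC_KEY_FIELDS = ("accessKey", "accessKeyId", "secretAccessKey")


def check_blob_storage(values):
    """Return a list of problems found in a parsed Helm values dict."""
    blob = values.get("blobStorage") or {}
    problems = []
    provider = str(blob.get("provider", "")).lower()
    if provider in LOCAL_PROVIDERS:
        problems.append("provider is missing or local")
    if not blob.get("bucket"):
        problems.append("bucket/container name is missing")
    if any(blob.get(field) for field in STATIC_KEY_FIELDS):
        problems.append("static access keys are configured")
    # Only the S3 example in this guide names its identity field.
    if provider == "s3" and not blob.get("iamRoleArn"):
        problems.append("no IAM role configured")
    return problems
```

One way to obtain the input is to parse the output of `helm get values <release> -n <namespace> -o json` with `json.loads`. An empty result means none of these checks failed; it does not replace the blob storage health check in the ops sanity checks notebook.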
|
||||
---
|
||||
|
||||
## Section 5: Autoscaling Strategy
|
||||
|
||||
### HPA (Horizontal Pod Autoscaler) for API Servers
|
||||
|
||||
**Use case:** Scale API server pods based on CPU/memory utilization.
|
||||
|
||||
**Configuration:**
|
||||
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langsmith-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langsmith-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
|
||||
|
||||
**Baseline:**
|
||||
- **Min replicas:** 2 (for HA)
|
||||
- **Max replicas:** 10 (adjust based on cluster capacity)
|
||||
- **CPU target:** 70% average utilization
|
||||
- **Memory target:** 80% average utilization
|
||||
|
||||
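To see how these targets translate into replica counts, the sketch below applies the formula from the Kubernetes HPA documentation, `desired = ceil(current * currentMetric / targetMetric)`, to the CPU and memory targets in this baseline. It is a simplification: the real controller also applies a tolerance band and stabilization windows, and the function name and sample numbers are illustrative rather than taken from LangSmith.

```python
import math


def hpa_desired_replicas(current_replicas, utilization, targets,
                         min_replicas=2, max_replicas=10):
    """Approximate the HPA decision for several resource metrics.

    utilization and targets map a metric name to a percentage.
    The highest per-metric proposal wins, then the result is clamped.
    """
    proposals = [
        math.ceil(current_replicas * utilization[name] / targets[name])
        for name in targets
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


targets = {"cpu": 70, "memory": 80}
# 4 pods averaging 105% of requested CPU and 60% of requested memory
print(hpa_desired_replicas(4, {"cpu": 105, "memory": 60}, targets))
```

Utilization is measured against each container's resource requests, so the percentages only mean something when requests are set on the API server pods.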
**Verification:**
|
||||
```bash
|
||||
kubectl get hpa -n <namespace>
|
||||
kubectl describe hpa langsmith-api -n <namespace>
|
||||
```
|
||||
|
||||
### KEDA for Bursty Worker Scaling
|
||||
|
||||
**Why KEDA instead of HPA:**
|
||||
- Workers process jobs from Redis queues
|
||||
- Queue depth is a better scaling signal than CPU/memory
|
||||
- Bursty workloads need rapid scaling (seconds, not minutes)
|
||||
- KEDA supports Redis queue depth metrics
|
||||
|
||||
**Configuration:**
|
||||
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: langsmith-workers
spec:
  scaleTargetRef:
    name: langsmith-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: <redis-host>:6379
        listName: langsmith:jobs:traces
        listLength: "10"  # Target average queue length per worker replica
```
|
||||
|
||||
**Baseline:**
|
||||
- **Min replicas:** 1
|
||||
- **Max replicas:** 20 (adjust based on workload)
|
||||
- **Queue depth threshold:** 10 jobs (adjust based on processing time)
|
||||
- **Cooldown period:** 30 seconds
|
||||
|
||||
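A rough way to reason about the queue depth threshold: KEDA treats `listLength` as a target number of queued jobs per replica and hands the resulting metric to an HPA. The sketch below is an approximation under that assumption; it ignores the polling interval, tolerance, and scale-down stabilization, and the function name is hypothetical.

```python
import math


def keda_desired_workers(queue_length, list_length=10,
                         min_replicas=1, max_replicas=20):
    """Approximate worker count for a Redis list trigger.

    list_length is the target number of queued jobs per worker.
    """
    wanted = math.ceil(queue_length / list_length)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 10, 45, 500):
    print(depth, keda_desired_workers(depth))
```

With the baseline values, a backlog of 45 jobs asks for 5 workers, and anything beyond 200 queued jobs is capped at the 20-replica maximum, which is why the maximum should be sized against expected burst volume.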
**Verification:**
|
||||
```bash
|
||||
kubectl get scaledobject -n <namespace>
|
||||
kubectl describe scaledobject langsmith-workers -n <namespace>
|
||||
```
|
||||
|
||||
### What Does NOT Autoscale
|
||||
|
||||
**Manual scaling required:**
|
||||
- PostgreSQL instance size (vertical scaling only)
|
||||
- Redis cluster size (add nodes manually)
|
||||
- ClickHouse nodes (StatefulSet scaling requires data rebalancing)
|
||||
- Kubernetes nodes (cluster autoscaler helps, but has limits)
|
||||
|
||||
**Key principle:** Monitor these services and scale proactively based on capacity planning, not reactively based on alerts.
|
||||
|
||||
---
|
||||
|
||||
## Section 6: Observability & Early Warning Signals
|
||||
|
||||
### Three Layers of Observability
|
||||
|
||||
**1. Kubernetes Layer:**
|
||||
- Pod status, restarts, resource usage
|
||||
- Node capacity and utilization
|
||||
- Events and warnings
|
||||
- **Tools:** `kubectl`, `kubectl top`, cluster monitoring
|
||||
|
||||
**2. LangSmith Application Layer:**
|
||||
- Request rates, latencies, error rates
|
||||
- Trace ingestion rates
|
||||
- Worker queue depths
|
||||
- **Tools:** Application metrics, logs, dashboards
|
||||
|
||||
**3. Data Store Layer:**
|
||||
- PostgreSQL connection counts, query performance
|
||||
- Redis memory usage, hit rates
|
||||
- ClickHouse query performance, table sizes
|
||||
- **Tools:** Cloud provider monitoring, database metrics
|
||||
|
||||
### Early Warning Signals
|
||||
|
||||
See `docs/shared/ops_signals_and_thresholds.md` for complete signal catalog.
|
||||
|
||||
**Critical signals (red flags):**
|
||||
- Pod restart count > 5 in 1 hour
|
||||
- Pending pods > 0 for > 5 minutes
|
||||
- API server CPU > 80% for > 10 minutes
|
||||
- Worker queue depth > 100
|
||||
- PostgreSQL connections > 80% of max
|
||||
- Redis memory > 90%
|
||||
- ClickHouse query latency > 5 seconds (p95)
|
||||
|
||||
**Warning signals (yellow flags):**
|
||||
- Pod restart count > 2 in 1 hour
|
||||
- API server CPU > 70% for > 10 minutes
|
||||
- Worker queue depth > 50
|
||||
- PostgreSQL connections > 60% of max
|
||||
- Redis memory > 75%
|
||||
|
||||
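The warning and critical levels above can be captured in a small lookup so that a notebook or alerting script classifies each sample the same way. This is a minimal sketch: the signal names are hypothetical labels, the numbers are copied from the lists above, and the duration conditions (for example "for > 10 minutes") are deliberately left out.

```python
# (warning, critical) thresholds copied from the signal lists.
THRESHOLDS = {
    "pod_restarts_per_hour": (2, 5),
    "api_cpu_percent": (70, 80),
    "worker_queue_depth": (50, 100),
    "postgres_connections_percent": (60, 80),
    "redis_memory_percent": (75, 90),
}


def classify(signal, value):
    """Return 'critical', 'warning', or 'ok' for one sampled value."""
    warning, critical = THRESHOLDS[signal]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, `classify("worker_queue_depth", 75)` falls between the two levels and returns `"warning"`.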
### Red Flag Thresholds
|
||||
|
||||
**Immediate action required:**
|
||||
- Any pod in `CrashLoopBackOff` state
|
||||
- Any pod `Pending` for > 10 minutes
|
||||
- API server error rate > 5%
|
||||
- Worker queue depth > 200
|
||||
- PostgreSQL connection pool exhausted
|
||||
- Redis out of memory
|
||||
- ClickHouse queries timing out (> 10 seconds)
|
||||
|
||||
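The two pod-level conditions in this list can be detected from the JSON that `kubectl get pods -n <namespace> -o json` returns. The sketch below only reads the standard `status.phase` and `status.containerStatuses[].state.waiting.reason` fields; it does not track how long a pod has been `Pending`, and the function name is illustrative.

```python
def unhealthy_pods(pod_list):
    """Return (name, reason) for Pending or CrashLoopBackOff pods.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    found = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        if status.get("phase") == "Pending":
            found.append((name, "Pending"))
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((name, "CrashLoopBackOff"))
    return found
```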
**Escalation evidence:**
|
||||
- Pod logs (last 100 lines)
|
||||
- Recent events (`kubectl get events --sort-by=.lastTimestamp`)
|
||||
- Resource usage (`kubectl top pods`)
|
||||
- Application metrics snapshot
|
||||
- Database connection counts
|
||||
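The Kubernetes part of this evidence list can be gathered with read-only commands. The following sketch builds those commands and, optionally, runs them; it assumes `kubectl` is on the `PATH` and already pointed at the right cluster, that metrics-server is installed for `kubectl top`, and the function names are hypothetical.

```python
import subprocess


def evidence_commands(namespace, pod):
    """Build the read-only kubectl commands for an escalation bundle."""
    return {
        "pod_logs": ["kubectl", "logs", pod, "-n", namespace,
                     "--all-containers=true", "--tail=100"],
        "events": ["kubectl", "get", "events", "-n", namespace,
                   "--sort-by=.lastTimestamp"],
        "resource_usage": ["kubectl", "top", "pods", "-n", namespace],
    }


def collect(namespace, pod):
    """Run each command and return its combined output, keyed by name."""
    results = {}
    for name, cmd in evidence_commands(namespace, pod).items():
        done = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = done.stdout + done.stderr
    return results
```

Application metrics and database connection counts come from your monitoring stack and are not covered here.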
|
||||
---
|
||||
|
||||
## Section 7: Backups, DR, and Failure Domains
|
||||
|
||||
### What Backups Cover
|
||||
|
||||
**PostgreSQL backups (managed services):**
|
||||
- Automated daily backups (RDS/Azure Database)
|
||||
- Point-in-time recovery (PITR) for last 7 days
|
||||
- Cross-region backup replication (if configured)
|
||||
- **Covers:** Database schema, user data, workspace/project metadata
|
||||
|
||||
**ClickHouse backups:**
|
||||
- Manual backups via `clickhouse-backup` tool
|
||||
- Cloud storage snapshots (if using managed ClickHouse)
|
||||
- **Covers:** Trace data, span data, evaluation results
|
||||
|
||||
**Blob storage:**
|
||||
- Versioning enabled (S3/Azure Blob)
|
||||
- Lifecycle policies for cost optimization
|
||||
- Cross-region replication (if configured)
|
||||
- **Covers:** Large trace payloads, artifacts, files
|
||||
|
||||
### What Backups Do NOT Cover
|
||||
|
||||
**Not backed up automatically:**
|
||||
- Kubernetes secrets (stored in cluster, not in backups)
|
||||
- Helm values (stored in Git, not in backups)
|
||||
- In-cluster ClickHouse data (unless backup job configured)
|
||||
- Redis data (ephemeral cache, not backed up)
|
||||
- Application configuration (ConfigMaps, stored in cluster)
|
||||
|
||||
**Manual backup required:**
|
||||
- Kubernetes secrets (export to encrypted storage)
|
||||
- Helm values files (store in Git)
|
||||
- In-cluster ClickHouse (configure backup job)
|
||||
- Application logs (export to log aggregation service)
|
||||
|
||||
### Failure Domains
|
||||
|
||||
**Availability Zone (AZ) failures:**
|
||||
- **Impact:** Pods in one AZ unavailable
|
||||
- **Mitigation:** Multi-AZ deployment (pods spread across AZs)
|
||||
- **Recovery:** Kubernetes reschedules pods to healthy AZs
|
||||
|
||||
**Node failures:**
|
||||
- **Impact:** All pods on failed node unavailable
|
||||
- **Mitigation:** Multiple nodes, pod anti-affinity rules
|
||||
- **Recovery:** Kubernetes reschedules pods to healthy nodes
|
||||
|
||||
**Database failures:**
|
||||
- **Impact:** Application cannot read/write data
|
||||
- **Mitigation:** Multi-AZ RDS, automated failover
|
||||
- **Recovery:** RDS promotes standby to primary (typically 1-2 minutes for Multi-AZ instances; longer if recovery is lengthy)
|
||||
|
||||
**Region failures:**
|
||||
- **Impact:** Entire deployment unavailable
|
||||
- **Mitigation:** Multi-region deployment (advanced, out of scope)
|
||||
- **Recovery:** Manual failover to secondary region
|
||||
|
||||
**Reality check:** Most failures are AZ or node-level. Region failures are rare but catastrophic. Plan accordingly.
|
||||
|
||||
---
|
||||
|
||||
## Section 8: Production Readiness Checklist
|
||||
|
||||
See `docs/shared/production_readiness_checklist.md` for complete checklist.
|
||||
|
||||
**Each checklist item maps to real incidents:**
|
||||
|
||||
1. **Blob storage configured** → Prevents ClickHouse table explosion
|
||||
2. **PostgreSQL HA enabled** → Prevents database downtime
|
||||
3. **Redis cluster mode** → Prevents cache failures
|
||||
4. **ClickHouse replication** → Prevents data loss
|
||||
5. **HPA configured** → Prevents API server overload
|
||||
6. **KEDA configured** → Prevents worker queue saturation
|
||||
7. **Monitoring enabled** → Enables early detection
|
||||
8. **Backups configured** → Enables data recovery
|
||||
9. **Private networking** → Meets security requirements
|
||||
10. **Resource limits set** → Prevents resource exhaustion
|
||||
|
||||
**Validation:**
|
||||
- Run `notebooks/module-3/01_ops_sanity_checks.ipynb` to validate each item
|
||||
- Review cloud provider console for managed service configuration
|
||||
- Check Helm values for application configuration
|
||||
- Verify monitoring dashboards show expected metrics
|
||||
|
||||
---
|
||||
|
||||
## Section 9: Sidecars & Service Mesh (Istio)
|
||||
|
||||
### When Sidecars Are Needed
|
||||
|
||||
**Use cases:**
|
||||
- **Egress control:** Restrict outbound traffic to approved destinations
|
||||
- **mTLS:** Encrypt traffic between services
|
||||
- **Policy enforcement:** Rate limiting, circuit breakers
|
||||
- **Observability:** Distributed tracing, metrics collection
|
||||
|
||||
**When NOT needed:**
|
||||
- Simple deployments without egress requirements
|
||||
- Development environments
|
||||
- Proof-of-concept deployments
|
||||
|
||||
### How to Enable Injection Safely
|
||||
|
||||
**Namespace-level injection (recommended for LangSmith):**
|
||||
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: langsmith
  labels:
    istio-injection: enabled
    istio-discovery: enabled
```
|
||||
|
||||
**Per-workload annotation (for selective injection):**
|
||||
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
```
|
||||
|
||||
**Revision-based injection (for canary/blue-green):**
|
||||
```yaml
labels:
  # Do not also set istio-injection here: it takes precedence over istio.io/rev
  istio.io/rev: default
```
|
||||
|
||||
### Operational Implications
|
||||
|
||||
**Logging and kubectl logs:**
|
||||
- Multi-container pods require container selection
|
||||
- **App logs:** `kubectl logs <pod> -c <container-name> -n <namespace>`
|
||||
- **Proxy logs:** `kubectl logs <pod> -c istio-proxy -n <namespace>`
|
||||
- **All logs:** `kubectl logs <pod> --all-containers=true -n <namespace>`
|
||||
|
||||
**Common issue:** "If logs appear missing after injection, you're likely looking at the wrong container."
|
||||
|
||||
**Health probes and timeouts:**
|
||||
- Sidecar adds latency to health checks
|
||||
- Increase probe timeouts if sidecars are enabled
|
||||
- Verify readiness probes account for sidecar startup
|
||||
|
||||
**Egress to external databases:**
|
||||
- Configure `ServiceEntry` for external PostgreSQL/Redis endpoints
|
||||
- Configure `DestinationRule` for traffic policies
|
||||
- Verify egress rules allow database connections
|
||||
|
||||
See `docs/shared/sidecars_and_service_mesh.md` for detailed guidance.
|
||||
|
||||
---
|
||||
|
||||
## Section 10: Transition to Incident Response
|
||||
|
||||
Module 3 establishes the baseline for production operations. The next step is **incident response**:
|
||||
|
||||
**What you'll learn:**
|
||||
- How to diagnose common failure modes
|
||||
- How to gather evidence for support
|
||||
- How to implement runbooks
|
||||
- How to perform post-incident reviews
|
||||
|
||||
**Prerequisites:**
|
||||
- Module 3 complete (production readiness validated)
|
||||
- Monitoring and alerting configured
|
||||
- On-call rotation established
|
||||
|
||||
---
|
||||
|
||||
## Artifacts Participants Leave With
|
||||
|
||||
1. **Production readiness checklist** (completed)
|
||||
2. **Service sizing documentation** (baselines documented)
|
||||
3. **Autoscaling configuration** (HPA and KEDA configured)
|
||||
4. **Observability setup** (signals and thresholds documented)
|
||||
5. **Backup strategy** (backups configured and tested)
|
||||
6. **Ops sanity checks notebook** (validation results)
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Run the ops sanity checks notebook:**
   - `notebooks/module-3/01_ops_sanity_checks.ipynb`

2. **Review production readiness checklist:**
   - `docs/shared/production_readiness_checklist.md`

3. **Document your thresholds:**
   - `docs/shared/ops_signals_and_thresholds.md`

4. **Configure monitoring and alerting** (next module)

5. **Proceed to incident response training** (next module)
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
|
||||
- [KEDA Documentation](https://keda.sh/docs/)
|
||||
- [Istio Service Mesh](https://istio.io/latest/docs/)
|
||||
- LangSmith Helm Chart Documentation
|
||||
- Cloud Provider Documentation (AWS RDS, Azure Database, etc.)
|
||||
|
||||
@@ -0,0 +1,426 @@
|
||||
# Module 4: Troubleshooting & Incident Response
|
||||
|
||||
**Goal:** Teach operators how to diagnose LangSmith self-hosted issues under pressure, collect the right evidence, and resolve incidents efficiently—either independently or with LangChain Support.
|
||||
|
||||
**Duration:** ~3-4 hours (with optional full incident drill)
|
||||
**Audience:** On-call engineers, platform owners, SREs, and anyone responsible for keeping LangSmith healthy
|
||||
**Prerequisites:**
|
||||
- Module 1 complete: LangSmith deployed and reachable
|
||||
- Module 2 complete: Authentication configured
|
||||
- Module 3 complete: Production operations concepts understood
|
||||
- Participants own day-2 operations
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Module 4 is hands-on: learners will introduce subtle but noticeable failures and debug them using standard tools and the canonical diagnostics bundle. This module builds the muscle memory needed for real incidents.
|
||||
|
||||
**What you'll accomplish:**
|
||||
- Understand common failure modes and their symptoms
|
||||
- Master the "first 10 minutes" incident response checklist
|
||||
- Learn to collect canonical diagnostics bundles
|
||||
- Practice debugging with guided failure labs
|
||||
- Know when and how to escalate to Support
|
||||
|
||||
**What this module avoids:**
|
||||
- Deep dives into specific monitoring tools (assumes basic kubectl/helm)
|
||||
- Performance optimization (covered in Module 3)
|
||||
- Infrastructure provisioning (covered in Module 1)
|
||||
- Authentication configuration (covered in Module 2)
|
||||
|
||||
---
|
||||
|
||||
## Section 1: Incident Reality Check
|
||||
|
||||
### The Mindset
|
||||
|
||||
**Incidents happen.** Even with perfect configuration, production systems fail. The difference between a 30-minute incident and a 4-hour outage is often preparation and process.
|
||||
|
||||
**Key principles:**
|
||||
1. **Collect evidence first.** Don't redeploy, restart, or reconfigure until you understand what's wrong.
|
||||
2. **Time is evidence.** Every minute that passes without collecting diagnostics is lost information.
|
||||
3. **Symptoms are clues.** The same root cause can manifest differently depending on load, timing, and configuration.
|
||||
4. **Support needs context.** A good diagnostics bundle is worth more than a perfect description.
|
||||
|
||||
### What Makes Incidents Hard
|
||||
|
||||
**Pressure:**
|
||||
- Users are impacted
|
||||
- Management is asking for updates
|
||||
- You're on-call and tired
|
||||
- Multiple systems are involved
|
||||
|
||||
**Complexity:**
|
||||
- Distributed systems have many moving parts
|
||||
- Failures cascade (one service fails, others follow)
|
||||
- Symptoms don't always point to root cause
|
||||
- Configuration drift accumulates over time
|
||||
|
||||
**Tooling:**
|
||||
- Too many tools (which one shows the truth?)
|
||||
- Too few tools (missing critical information)
|
||||
- Tools that hide the problem (aggregation, sampling)
|
||||
|
||||
**This module prepares you for all of these.**
|
||||
|
||||
---
|
||||
|
||||
## Section 2: Common Failure Modes
|
||||
|
||||
### Ingestion & Tracing Failures
|
||||
|
||||
**Symptoms:**
|
||||
- Traces appear delayed or missing
|
||||
- Worker pods show errors in logs
|
||||
- ClickHouse insert errors
|
||||
- Queue backlogs
|
||||
|
||||
**Common causes:**
|
||||
- ClickHouse connectivity issues (network, credentials, resource limits)
|
||||
- Blob storage misconfiguration (large payloads fail)
|
||||
- Worker resource exhaustion (CPU/memory limits)
|
||||
- Redis connectivity (job queue backing up)
|
||||
|
||||
**What to check first:**
|
||||
- Worker pod logs
|
||||
- ClickHouse pod status and logs
|
||||
- Redis connectivity and latency
|
||||
- Blob storage configuration
|
||||
|
||||
### UI & API Failures
|
||||
|
||||
**Symptoms:**
|
||||
- UI returns 5xx errors
|
||||
- API endpoints timeout
|
||||
- Login fails or redirects loop
|
||||
- Specific features don't work
|
||||
|
||||
**Common causes:**
|
||||
- Database connectivity (PostgreSQL unreachable)
|
||||
- Authentication misconfiguration (OIDC/SAML)
|
||||
- Ingress/load balancer issues
|
||||
- API pod crashes or resource limits
|
||||
|
||||
**What to check first:**
|
||||
- API pod logs
|
||||
- Database connectivity
|
||||
- Ingress status and configuration
|
||||
- Authentication configuration (Module 2 validation)
|
||||
|
||||
### Authentication Failures
|
||||
|
||||
**Symptoms:**
|
||||
- Users can't log in
|
||||
- Redirect loops
|
||||
- 403 errors after successful login
|
||||
- Session timeouts
|
||||
|
||||
**Common causes:**
|
||||
- IdP connectivity issues
|
||||
- OIDC/SAML configuration drift
|
||||
- Secret rotation without updating LangSmith
|
||||
- Network policies blocking egress
|
||||
|
||||
**What to check first:**
|
||||
- Auth pod logs
|
||||
- IdP connectivity (curl to issuer URL)
|
||||
- OIDC/SAML configuration (Module 2 validation)
|
||||
- Network policies
|
||||
|
||||
---
|
||||
|
||||
## Section 3: First 10 Minutes Checklist
|
||||
|
||||
**The first 10 minutes of an incident are critical.** This is when you collect the most valuable evidence and make decisions that determine how long the incident lasts.
|
||||
|
||||
### What NOT to Do
|
||||
|
||||
**Resist the urge to:**
|
||||
- Run `helm upgrade` or `kubectl rollout restart`
|
||||
- Delete pods "to see if they come back"
|
||||
- Scale resources up/down
|
||||
- Change configuration
|
||||
|
||||
**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
|
||||
|
||||
### The Checklist
|
||||
|
||||
See [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md) for the complete reference.
|
||||
|
||||
**Quick summary:**
|
||||
1. **Minute 0-2:** Triage & scope (what's broken, who's impacted)
|
||||
2. **Minute 2-5:** Quick health check (pods, events, ingress)
|
||||
3. **Minute 5-8:** Collect diagnostics bundle (canonical script + snapshots)
|
||||
4. **Minute 8-10:** Identify likely root cause (symptoms → checks)
|
||||
|
||||
**Key insight:** This checklist is not about fixing the issue—it's about collecting evidence and making informed decisions.
|
||||
|
||||
---
|
||||
|
||||
## Section 4: Standard Diagnostics Collection
|
||||
|
||||
### The Canonical Script
|
||||
|
||||
LangChain provides an official diagnostics script that captures everything Support needs:
|
||||
|
||||
**Location:**
|
||||
```
|
||||
https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
|
||||
```
|
||||
|
||||
**What it captures:**
|
||||
- Pod logs (all containers)
|
||||
- Events (sorted by timestamp)
|
||||
- Resource usage (CPU, memory)
|
||||
- Configuration (deployments, services, ingress)
|
||||
- Storage (PVCs, storage classes)
|
||||
- Network (services, endpoints)
|
||||
|
||||
**How to use it:**
|
||||
```bash
|
||||
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
|
||||
chmod +x get_k8s_debugging_info.sh
|
||||
./get_k8s_debugging_info.sh <namespace>
|
||||
```
|
||||
|
||||
**Important:** Always run this script before making changes. The bundle it creates is your evidence.
|
||||
|
||||
### What Good Debugging Looks Like
|
||||
|
||||
**Good debugging:**
|
||||
- Starts with a baseline (what was working before)
|
||||
- Collects evidence systematically (checklist-driven)
|
||||
- Documents hypotheses and tests them
|
||||
- Preserves evidence (saves diagnostics bundles)
|
||||
- Escalates with context (diagnostics + timeline)
|
||||
|
||||
**Bad debugging:**
|
||||
- Changes things without understanding
|
||||
- Doesn't collect evidence
|
||||
- Jumps to conclusions
|
||||
- Destroys evidence (redeploys, deletes)
|
||||
- Escalates without context ("it's broken, fix it")
|
||||
|
||||
**The difference:** Good debugging produces a clear root cause and fix. Bad debugging produces more incidents.
|
||||
|
||||
---
|
||||
|
||||
## Section 5: Working with Support
|
||||
|
||||
### What Speeds Up Support
|
||||
|
||||
**Good escalation includes:**
|
||||
- Diagnostics bundle (canonical script output)
|
||||
- Timeline (when did it start, what changed)
|
||||
- Symptoms (what's broken, who's impacted)
|
||||
- What you've tried (investigation steps, results)
|
||||
- Environment details (versions, configuration)
|
||||
|
||||
**Use the [Support Escalation Template](../shared/support_escalation_template.md).**
|
||||
|
||||
### What Slows Down Support
|
||||
|
||||
**Poor escalation includes:**
|
||||
- No diagnostics bundle ("just look at it")
|
||||
- Vague symptoms ("it's slow")
|
||||
- No timeline ("it broke")
|
||||
- No environment details ("it's on Kubernetes")
|
||||
- Secrets in logs (security risk)
|
||||
|
||||
**Result:** Support has to ask for information you could have provided, delaying resolution.
|
||||
|
||||
### Required Metadata
|
||||
|
||||
**Support will always ask for:**
|
||||
1. Diagnostics bundle (canonical script)
|
||||
2. Helm chart version
|
||||
3. Image tags (if known)
|
||||
4. Recent changes (deployments, config, infrastructure)
|
||||
5. Cloud provider and region
|
||||
6. Kubernetes version
|
||||
7. What you've tried and results
|
||||
|
||||
**Provide this upfront to speed resolution.**
|
||||
|
||||
---
|
||||
|
||||
## Section 6: Preventing Repeat Incidents
|
||||
|
||||
### Post-Incident Review
|
||||
|
||||
**After an incident is resolved:**
|
||||
1. **Document the root cause** (what actually broke)
|
||||
2. **Identify contributing factors** (what made it worse)
|
||||
3. **List what worked** (what helped you debug)
|
||||
4. **List what didn't work** (what slowed you down)
|
||||
5. **Create action items** (what to change to prevent recurrence)
|
||||
|
||||
**Key questions:**
|
||||
- Could we have detected this earlier? (monitoring, alerts)
|
||||
- Could we have prevented this? (configuration, testing)
|
||||
- Could we have fixed it faster? (runbooks, tooling)
|
||||
- What did we learn? (new failure mode, new tool)
|
||||
|
||||
### Common Patterns
|
||||
|
||||
**Configuration drift:**
|
||||
- Secrets rotate, but LangSmith config isn't updated
|
||||
- Infrastructure changes, but Helm values aren't updated
|
||||
- IdP settings change, but OIDC/SAML config isn't updated
|
||||
|
||||
**Prevention:** Automated validation (Module 2, Module 3 notebooks), configuration as code, regular audits.
|
||||
|
||||
**Resource exhaustion:**
|
||||
- ClickHouse runs out of disk
|
||||
- PostgreSQL hits connection limits
|
||||
- Workers hit CPU/memory limits
|
||||
|
||||
**Prevention:** Monitoring (Module 3), autoscaling (Module 3), capacity planning.
|
||||
|
||||
**Network issues:**
|
||||
- Egress blocked by NetworkPolicy
|
||||
- Load balancer misconfiguration
|
||||
- DNS resolution failures
|
||||
|
||||
**Prevention:** Network policy testing, ingress validation (Module 1), DNS checks.
|
||||
|
||||
---
|
||||
|
||||
## Section 7: Hands-on Failure Labs
|
||||
|
||||
**This is where you practice.** Each lab follows the same pattern:
|
||||
|
||||
1. **Baseline snapshot:** Capture what "good" looks like
|
||||
2. **Introduce failure:** Apply a subtle but noticeable fault
|
||||
3. **Observe symptoms:** See how the failure manifests
|
||||
4. **Collect diagnostics:** Run the canonical script and gather evidence
|
||||
5. **Hypothesize root cause:** Based on symptoms, identify likely cause
|
||||
6. **Verify with targeted checks:** Confirm your hypothesis
|
||||
7. **Remediate:** Revert the failure
|
||||
8. **Confirm recovery:** Verify everything is working again
|
||||
9. **Capture lessons learned:** Document what you discovered
|
||||
|
||||
### Lab Structure
|
||||
|
||||
**Each failure lab includes:**
|
||||
- **What this service does for LangSmith:** Context on the service's role
|
||||
- **Expected symptoms when it fails:** What you'll see when it breaks
|
||||
- **Failure injection options:** Two levels (subtle vs. obvious)
|
||||
- **Do the drill:** Step-by-step debugging process
|
||||
- **What Support will ask for:** Service-specific evidence
|
||||
|
||||
### Available Labs
|
||||
|
||||
1. **PostgreSQL Failure Lab** (`10_failure_lab_postgres.ipynb`)
   - Connection failures, wrong credentials, network isolation
   - Symptoms: API 5xx, login failures, connection exhaustion

2. **Redis Failure Lab** (`20_failure_lab_redis.ipynb`)
   - Connectivity issues, wrong credentials
   - Symptoms: Intermittent ingestion, latency spikes, worker backlog

3. **ClickHouse Failure Lab** (`30_failure_lab_clickhouse.ipynb`)
   - Endpoint misconfiguration, network isolation, resource limits
   - Symptoms: Traces delayed/missing, insert errors, UI loads but traces don't appear

4. **Blob Storage Failure Lab** (`40_failure_lab_blob_storage.ipynb`)
   - Credential misconfiguration, bucket name errors
   - Symptoms: Large payload traces degrade ClickHouse, warnings in logs

5. **Full Incident Drill** (`90_full_incident_drill.ipynb`) (optional)
   - Combined failure + timeline pressure
   - Practice "first 10 minutes" checklist
   - Produce incident summary using escalation template
|
||||
|
||||
---
|
||||
|
||||
## Section 8: Workshop Wrap-up
|
||||
|
||||
### What You've Learned
|
||||
|
||||
- How to respond to incidents systematically
|
||||
- How to collect canonical diagnostics bundles
|
||||
- How to debug common failure modes
|
||||
- How to escalate effectively to Support
|
||||
- How to prevent repeat incidents
|
||||
|
||||
### Next Steps
|
||||
|
||||
**Immediate:**
|
||||
- Run through failure labs to build muscle memory
|
||||
- Customize the "first 10 minutes" checklist for your environment
|
||||
- Set up monitoring and alerts (Module 3)
|
||||
|
||||
**Ongoing:**
|
||||
- Practice incident response regularly (drills)
|
||||
- Keep diagnostics script updated
|
||||
- Document your own failure modes and fixes
|
||||
- Share learnings with your team
|
||||
|
||||
### Resources
|
||||
|
||||
- [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)
|
||||
- [Support Escalation Template](../shared/support_escalation_template.md)
|
||||
- [Canonical Diagnostics Script](https://github.com/langchain-ai/helm/blob/main/charts/langsmith/scripts/get_k8s_debugging_info.sh)
|
||||
- Module 1: Deployment & Baseline Validation
|
||||
- Module 2: Identity & Authentication
|
||||
- Module 3: Production Operations & Scaling
|
||||
|
||||
---
|
||||
|
||||
## Artifacts
|
||||
|
||||
**Participants leave with:**
|
||||
- A working incident response process
|
||||
- Experience debugging real failure modes
|
||||
- A diagnostics bundle collection workflow
|
||||
- An escalation template customized for their environment
|
||||
- Confidence to handle incidents independently
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
**Don't:**
|
||||
- Skip the baseline snapshot (you need "before" to compare to "after")
|
||||
- Redeploy before collecting evidence (destroys diagnostics)
|
||||
- Ignore error messages (they're clues)
|
||||
- Escalate without diagnostics bundle (slows Support)
|
||||
- Delete evidence (you'll need it for post-incident review)
|
||||
|
||||
**Do:**
|
||||
- Follow the checklist (it's battle-tested)
|
||||
- Collect diagnostics early (time is evidence)
|
||||
- Document your investigation (helps you and Support)
|
||||
- Test your process (run drills)
|
||||
- Learn from each incident (prevent repeats)
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**"The diagnostics script fails":**
|
||||
- Check kubectl access and namespace
|
||||
- Verify script is up-to-date (check GitHub)
|
||||
- Run with verbose output to see what's failing
|
||||
|
||||
**"I can't reproduce the failure":**
|
||||
- Check that failure injection was applied correctly
|
||||
- Verify symptoms match expected behavior
|
||||
- Try a different failure injection method (Level 2 if Level 1 didn't work)
|
||||
|
||||
**"The remediation doesn't work":**
|
||||
- Verify you reverted the exact change you made
|
||||
- Check for cascading failures (one failure caused another)
|
||||
- Collect post-remediation diagnostics to compare
|
||||
|
||||
**"I don't understand the symptoms":**
|
||||
- Review the service's role in LangSmith (lab introduction)
|
||||
- Check logs for error patterns
|
||||
- Compare to baseline snapshot (what changed?)
|
||||
|
||||
---
|
||||
|
||||
**Remember:** Incident response is a skill. Practice makes perfect. The more you drill, the better you'll be when real incidents happen.
|
||||
|
||||
@@ -0,0 +1,366 @@
|
||||
# Authentication Troubleshooting Playbook
|
||||
|
||||
**Purpose:** Triage tree for common authentication failures
|
||||
**Audience:** Operators troubleshooting SSO issues
|
||||
|
||||
---
|
||||
|
||||
## Triage Tree
|
||||
|
||||
### 1. Login Loop
|
||||
|
||||
**Symptoms:**
|
||||
- User redirected to IdP
|
||||
- User authenticates successfully
|
||||
- Redirected back to LangSmith
|
||||
- Immediately redirected to IdP again (infinite loop)
|
||||
|
||||
**Likely Causes:**
|
||||
1. Redirect URI mismatch (most common)
|
||||
2. Session cookie not being set (TLS/cookie issues)
|
||||
3. Token validation failure
|
||||
4. State parameter mismatch
|
||||
|
||||
**Evidence Gathering:**
|
||||
```bash
|
||||
# Check pod logs for redirect errors
|
||||
kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i "redirect\|callback\|auth"
|
||||
|
||||
# Check ingress configuration
|
||||
kubectl get ingress -n <namespace> -o yaml
|
||||
|
||||
# Confirm the callback endpoint responds (curl cannot verify that the URI matches the IdP registration)
|
||||
curl -I https://<domain>/auth/callback
|
||||
|
||||
# Check browser console for cookie errors
|
||||
# (Manual check in browser developer tools)
|
||||
```
|
||||
|
||||
**Commands:**
|
||||
```bash
|
||||
# Verify redirect URI in Helm values
|
||||
helm get values <release> -n <namespace> | grep -i redirect
|
||||
|
||||
# Check environment variables
|
||||
kubectl exec <pod> -n <namespace> -- env | grep -i redirect
|
||||
|
||||
# Verify IdP whitelist (manual check in IdP admin console)
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
1. Verify redirect URI matches **exactly** (case, trailing slashes, protocol)
|
||||
2. Check IdP whitelist includes exact redirect URI
|
||||
3. Verify TLS certificate is valid (browser must accept cookies)
|
||||
4. Check session cookie settings (SameSite, Secure flags)
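
The cookie check in the last step can be partly scripted. The sketch below is a hypothetical helper, not part of LangSmith: it reads raw HTTP response headers on stdin and reports the `Secure` and `SameSite` settings of every `Set-Cookie` header. The cookie names in your deployment are unknown here, so it simply lists whatever it finds.

```bash
# Hypothetical helper: summarize Secure/SameSite flags from response headers.
# Reads raw HTTP headers on stdin (for example from `curl -sI`).
check_cookie_flags() {
  tr -d '\r' | awk '
    tolower($0) ~ /^set-cookie:/ {
      line = $0
      sub(/^[^:]*:[ \t]*/, "", line)        # drop the header name
      name = line; sub(/=.*/, "", name)     # cookie name precedes the first "="
      lower = tolower(line)
      secure = (lower ~ /;[ \t]*secure/) ? "yes" : "no"
      samesite = "unset"
      if (match(lower, /samesite=[a-z]+/))
        samesite = substr(lower, RSTART + 9, RLENGTH - 9)
      printf "%s secure=%s samesite=%s\n", name, secure, samesite
    }'
}
```

Possible usage: `curl -sI https://<domain>/auth/callback | check_cookie_flags`. The session cookie may only be issued during a real login round trip, so an empty result from a bare request is not conclusive; the browser developer tools remain the authoritative view.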
|
||||
|
||||
---
|
||||
|
||||
### 2. 403/Unauthorized After Login
|
||||
|
||||
**Symptoms:**
|
||||
- User authenticates successfully at IdP
|
||||
- Redirected back to LangSmith
|
||||
- Receives 403 Forbidden or "Unauthorized" error
|
||||
- Cannot access any resources
|
||||
|
||||
**Likely Causes:**
|
||||
1. Role mapping not configured
|
||||
2. User not in any mapped groups
|
||||
3. Workspace not assigned to user's organization
|
||||
4. Claims/attributes not being sent by IdP
|
||||
|
||||
**Evidence Gathering:**
|
||||
```bash
|
||||
# Check pod logs for authorization errors
|
||||
kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i "403\|unauthorized\|forbidden\|role"
|
||||
|
||||
# Check role mapping configuration
|
||||
helm get values <release> -n <namespace> | grep -i "role\|mapping\|group"
|
||||
|
||||
# Check user's group membership (from IdP)
|
||||
# (Manual check - verify user is in expected groups)
|
||||
```
|
||||
|
||||
**Commands:**
|
||||
```bash
|
||||
# Verify role mapping in Helm values
|
||||
helm get values <release> -n <namespace> | grep -A 10 "roleMapping"
|
||||
|
||||
# Check environment variables for claim mappings
|
||||
kubectl exec <pod> -n <namespace> -- env | grep -i "claim\|attribute\|group"
|
||||
|
||||
# Test with different user (in mapped group)
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
1. Verify user is in a group that's mapped to a role
|
||||
2. Check role mapping configuration in Helm values
|
||||
3. Verify IdP is sending group claims/attributes
|
||||
4. Assign user to appropriate group in IdP
|
||||
5. Verify workspace assignment in LangSmith
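
For OIDC, one way to confirm whether the IdP is sending group claims is to decode the payload of an ID token captured during a test login. The function below is a minimal sketch under stated assumptions: `jwt_payload` is a hypothetical name, it assumes `base64 -d` is available (GNU coreutils or a recent macOS), and it does not verify the signature. Tokens are credentials, so decode them locally and never paste them into third-party sites or support tickets.

```bash
# Hypothetical helper: print the JSON payload (claims) of a JWT.
# Does NOT validate the signature; for inspection only.
jwt_payload() {
  local payload pad
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the "=" padding that base64url omits.
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    payload="${payload}="
    pad=$((pad - 1))
  done
  printf '%s' "$payload" | base64 -d
}
```

Possible usage: `jwt_payload "$ID_TOKEN"`, then look for the claim your role mapping expects (often named `groups`, though the exact claim name depends on the IdP and is an assumption here).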
|
||||
|
||||
---
|
||||
|
||||
### 3. SAML Assertion Missing Attributes
|
||||
|
||||
**Symptoms:**
|
||||
- User authenticates successfully
|
||||
- Login completes but user has no email/name
|
||||
- Role mapping doesn't work
|
||||
- User cannot access resources
|
||||
|
||||
**Likely Causes:**
|
||||
1. IdP not configured to send required attributes
|
||||
2. Attribute names don't match configuration
|
||||
3. Attribute mapping incorrect in LangSmith
|
||||
|
||||
**Evidence Gathering:**
|
||||
```bash
|
||||
# Check logs for missing attribute errors
|
||||
kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i "attribute\|missing\|email\|name"
|
||||
|
||||
# Check SAML attribute mapping
|
||||
helm get values <release> -n <namespace> | grep -i "saml.*attribute"
|
||||
|
||||
# Verify SAML metadata includes attribute definitions
|
||||
# (Check IdP metadata XML)
|
||||
```
|
||||
|
||||
**Commands:**
|
||||
```bash
|
||||
# Verify attribute mapping configuration
|
||||
kubectl exec <pod> -n <namespace> -- env | grep -i "SAML.*ATTRIBUTE"
|
||||
|
||||
# Check SAML metadata for attribute definitions
|
||||
curl <SAML_METADATA_URL> | grep -i "Attribute"
|
||||
|
||||
# Test with SAML tracer (browser extension) to see actual assertion
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
1. Configure IdP to send required attributes (email, name, groups)
|
||||
2. Verify attribute names match LangSmith configuration exactly
|
||||
3. Update attribute mapping in Helm values if names differ
|
||||
4. Test with SAML tracer to verify attributes in assertion
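
To compare attribute names against the LangSmith mapping, it can help to list the `Name` values that appear in the IdP metadata or in an assertion saved from SAML tracer. The sketch below is a crude text match rather than an XML parser, and `saml_attribute_names` is a hypothetical helper name; treat its output as a hint and confirm against the raw XML.

```bash
# Hypothetical helper: list unique Name="..." values of <Attribute> and
# <RequestedAttribute> elements found in SAML XML read from stdin.
saml_attribute_names() {
  grep -oE '<([A-Za-z0-9]+:)?(Requested)?Attribute([[:space:]][^>]*)?[[:space:]]Name="[^"]+"' \
    | sed -E 's/.*[[:space:]]Name="([^"]+)".*/\1/' \
    | sort -u
}
```

Possible usage: `curl -s <SAML_METADATA_URL> | saml_attribute_names`. Attribute names are case-sensitive, so compare the output with the configured mapping exactly as printed.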
|
||||
|
||||
---
|
||||
|
||||
### 4. Redirect Mismatch
|
||||
|
||||
**Symptoms:**
|
||||
- Login attempt fails immediately
|
||||
- Error: "redirect_uri_mismatch" or similar
|
||||
- User never reaches IdP login page
|
||||
|
||||
**Likely Causes:**
|
||||
1. Redirect URI in LangSmith doesn't match IdP whitelist
|
||||
2. Trailing slash mismatch
|
||||
3. Protocol mismatch (http vs https)
|
||||
4. Domain mismatch
|
||||
|
||||
**Evidence Gathering:**
|
||||
```bash
|
||||
# Check configured redirect URI
|
||||
helm get values <release> -n <namespace> | grep -i redirect
|
||||
|
||||
# Verify exact redirect URI format
|
||||
kubectl exec <pod> -n <namespace> -- env | grep -i REDIRECT
|
||||
|
||||
# Test redirect URI endpoint
|
||||
curl -I https://<domain>/auth/callback
|
||||
```
|
||||
|
||||
**Commands:**
|
||||
```bash
|
||||
# Compare redirect URIs
|
||||
echo "LangSmith config:"
|
||||
kubectl exec <pod> -n <namespace> -- env | grep OIDC_REDIRECT_URI
|
||||
|
||||
echo "IdP whitelist:"
|
||||
# (Manual check in IdP admin console)
|
||||
|
||||
# Verify exact match (including trailing slashes, case)
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
1. Get exact redirect URI from LangSmith configuration
|
||||
2. Verify it matches IdP whitelist **exactly** (character-by-character)
|
||||
3. Update IdP whitelist if needed
|
||||
4. Restart LangSmith pods after configuration change
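
The character-by-character comparison in step 2 is easy to get wrong by eye. The following sketch, with the hypothetical helper name `compare_redirect_uris`, takes the URI configured in LangSmith and the URI copied from the IdP console and names the kind of difference, covering the causes listed above (trailing slash, protocol, case).

```bash
# Hypothetical helper: explain how two redirect URIs differ.
# $1 = URI configured in LangSmith, $2 = URI registered in the IdP.
compare_redirect_uris() {
  local configured="$1" registered="$2"
  if [ "$configured" = "$registered" ]; then
    echo "MATCH"
    return 0
  fi
  # Lowercased copies let us detect case-only differences.
  local c_lower r_lower
  c_lower=$(printf '%s' "$configured" | tr '[:upper:]' '[:lower:]')
  r_lower=$(printf '%s' "$registered" | tr '[:upper:]' '[:lower:]')
  if [ "${configured%/}" = "${registered%/}" ]; then
    echo "MISMATCH: trailing slash"
  elif [ "${configured#*://}" = "${registered#*://}" ]; then
    echo "MISMATCH: protocol"
  elif [ "$c_lower" = "$r_lower" ]; then
    echo "MISMATCH: case"
  else
    echo "MISMATCH: other"
  fi
  return 1
}
```

Possible usage: `compare_redirect_uris "https://<domain>/auth/callback" "<URI from IdP console>"`. The function reports only the first difference it recognizes, so rerun it after each correction.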
|
||||
|
||||
---
|
||||
|
||||
### 5. TLS/Callback Issues
|
||||
|
||||
**Symptoms:**
|
||||
- IdP callback fails with TLS errors
|
||||
- Browser shows "Not Secure" warning
|
||||
- Certificate errors in browser console
|
||||
- Callback never completes
|
||||
|
||||
**Likely Causes:**
|
||||
1. Self-signed certificate (browser rejects)
|
||||
2. Certificate chain incomplete
|
||||
3. Certificate expired
|
||||
4. Certificate doesn't match domain
|
||||
|
||||
**Evidence Gathering:**
|
||||
```bash
|
||||
# Check TLS certificate
|
||||
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null
|
||||
|
||||
# Check certificate expiration
|
||||
echo | openssl s_client -connect <domain>:443 -servername <domain> 2>/dev/null | \
  openssl x509 -noout -dates
|
||||
|
||||
# Check ingress TLS configuration
|
||||
kubectl get ingress -n <namespace> -o yaml | grep -A 5 tls
|
||||
```
|
||||
|
||||
**Commands:**
|
||||
```bash
|
||||
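# Find the TLS secret referenced by the ingress
kubectl get ingress -n <namespace> -o jsonpath='{.items[0].spec.tls[0].secretName}'
# Inspect only the public certificate (avoid -o yaml, which also prints the private key)
kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -dates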
|
||||
|
||||
# Test certificate from pod
|
||||
kubectl exec <pod> -n <namespace> -- openssl s_client -connect <domain>:443 -servername <domain>
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
1. Use valid TLS certificate from trusted CA (not self-signed)
|
||||
2. Ensure full certificate chain is present
|
||||
3. Renew certificate if expired
|
||||
4. Verify certificate matches domain exactly
|
||||
5. Update ingress TLS secret if needed
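
To make the renewal step less reactive, the `notAfter` line printed by `openssl x509 -noout -enddate` can be turned into a days-remaining figure. This is a minimal sketch assuming GNU `date` (the `-d` flag); `days_until_expiry` is a hypothetical helper, and the optional second argument exists only so the calculation can be checked against a fixed clock.

```bash
# Hypothetical helper: days until a certificate expires.
# $1 = "notAfter=..." line from `openssl x509 -noout -enddate`
# $2 = optional current time as epoch seconds (defaults to now)
# Requires GNU date. A negative result means the certificate has expired.
days_until_expiry() {
  local not_after expiry_epoch now_epoch
  not_after="${1#notAfter=}"
  now_epoch="${2:-$(date +%s)}"
  expiry_epoch=$(date -d "$not_after" +%s) || return 1
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
```

Possible usage: `days_until_expiry "$(echo | openssl s_client -connect <domain>:443 -servername <domain> 2>/dev/null | openssl x509 -noout -enddate)"`.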
|
||||
|
||||
---
|
||||
|
||||
## What Support Will Ask For
|
||||
|
||||
When contacting LangSmith support for authentication issues, provide:
|
||||
|
||||
### Minimal Evidence Bundle
|
||||
|
||||
1. **Configuration Summary (redacted)**
|
||||
- Auth provider type (OIDC/SAML)
|
||||
- Issuer/metadata URL (no secrets)
|
||||
- Domain
|
||||
- Claim/attribute mappings
|
||||
- Role mapping configuration
|
||||
|
||||
2. **Pod Logs**
|
||||
- Last 200 lines from API/server pods
|
||||
- Filtered for auth-related errors
|
||||
- Timestamp of failure
|
||||
|
||||
3. **Recent Events**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by=.lastTimestamp > events.txt
|
||||
```
|
||||
|
||||
4. **Ingress Configuration**
|
||||
```bash
|
||||
kubectl get ingress -n <namespace> -o yaml > ingress.yaml
|
||||
```
|
||||
|
||||
5. **Helm Values (redacted)**
|
||||
```bash
|
||||
helm get values <release> -n <namespace> > helm-values.txt
|
||||
# Manually redact secrets before sending
|
||||
```
|
||||
|
||||
### Do NOT Include
|
||||
|
||||
- Client secrets
|
||||
- Tokens
|
||||
- Passwords
|
||||
- Private keys
|
||||
- Full certificate chains (public certs OK)
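
A first-pass redaction filter can reduce the chance of leaking the items above, although it does not replace a manual review. The sketch below is a hypothetical helper that masks the value of any `key: value` line whose key contains `secret`, `password`, `token`, or `key`. It over-redacts on purpose (a key such as `keycloakUrl` would also be masked) and does not handle multi-line YAML values or secrets embedded inside other values such as URLs.

```bash
# Hypothetical helper: mask values of sensitive-looking keys in YAML-like text.
# Reads stdin, writes the redacted text to stdout.
redact_values() {
  awk '
    {
      line = $0
      idx = index(line, ":")
      if (idx > 0) {
        key = tolower(substr(line, 1, idx - 1))
        rest = substr(line, idx + 1)
        # Only redact when the key looks sensitive and a value is present.
        if (key ~ /(secret|password|token|key)/ && rest ~ /[^ \t]/) {
          print substr(line, 1, idx) " <REDACTED>"
          next
        }
      }
      print line
    }'
}
```

Possible usage: `helm get values <release> -n <namespace> | redact_values > helm-values.txt`, followed by reading the file before sending it.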
|
||||
|
||||
### Support Bundle Script
|
||||
|
||||
```bash
#!/bin/bash
# Collect minimal auth troubleshooting bundle

NAMESPACE="${NAMESPACE:-langsmith}"
RELEASE="${HELM_RELEASE:-langsmith}"
OUTPUT_DIR="auth-support-$(date +%Y%m%d-%H%M%S)"

mkdir -p "$OUTPUT_DIR"

# Pod logs (last 200 lines, auth-related)
kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' | \
  tr ' ' '\n' | grep -E "(api|server|backend)" | head -3 | while read -r pod; do
    kubectl logs "$pod" -n "$NAMESPACE" --tail=200 | \
      grep -i -E "(auth|oidc|saml|sso|login|redirect)" > "$OUTPUT_DIR/${pod}-auth-logs.txt" || true
done

# Events
kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp > "$OUTPUT_DIR/events.txt"

# Ingress
kubectl get ingress -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/ingress.yaml"

# Helm values (redact secrets manually)
helm get values "$RELEASE" -n "$NAMESPACE" > "$OUTPUT_DIR/helm-values.txt"
echo "⚠️ REDACT SECRETS FROM helm-values.txt BEFORE SENDING"

# Configuration summary
cat > "$OUTPUT_DIR/config-summary.txt" <<EOF
Auth Configuration Summary
Generated: $(date -Iseconds)

Namespace: $NAMESPACE
Release: $RELEASE
Domain: ${LANGSMITH_DOMAIN:-N/A}
Provider: ${AUTH_PROVIDER:-N/A}

Note: Secrets not included for security.
EOF

echo "Support bundle saved to: $OUTPUT_DIR"
echo "⚠️ Review and redact secrets before sending to support"
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### OIDC Issues
|
||||
- **Redirect mismatch:** Check exact URI match
|
||||
- **Token validation:** Check issuer URL, clock skew
|
||||
- **Missing claims:** Verify scopes and IdP configuration
|
||||
|
||||
### SAML Issues
|
||||
- **Missing attributes:** Check IdP attribute configuration
|
||||
- **Signature failure:** Verify certificate in metadata
|
||||
- **Entity ID mismatch:** Check entity ID configuration
|
||||
|
||||
### Common Commands
|
||||
```bash
|
||||
# Check auth configuration
|
||||
kubectl exec <pod> -n <namespace> -- env | grep -i -E "(auth|oidc|saml)"
|
||||
|
||||
# Check logs
|
||||
kubectl logs <pod> -n <namespace> --tail=100 | grep -i auth
|
||||
|
||||
# Check Helm values
|
||||
helm get values <release> -n <namespace>
|
||||
|
||||
# Restart pods (after config change)
|
||||
kubectl rollout restart deployment -n <namespace>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation
|
||||
|
||||
If issues persist after following this playbook:
|
||||
|
||||
1. Collect minimal evidence bundle (see above)
|
||||
2. Document exact steps to reproduce
|
||||
3. Note any recent configuration changes
|
||||
4. Contact LangSmith support with evidence bundle
|
||||
|
||||
@@ -0,0 +1,110 @@
|
||||
# Authentication Validation Checklist
|
||||
|
||||
**Purpose:** Operator-friendly checklist for validating SSO configuration
|
||||
**Use:** Complete this checklist after running the validation notebook(s)
|
||||
|
||||
---
|
||||
|
||||
## Preconditions
|
||||
|
||||
- [ ] DNS configured and resolving correctly
|
||||
- [ ] TLS certificate valid and trusted (not self-signed in production)
|
||||
- [ ] Ingress configured and accessible
|
||||
- [ ] LangSmith deployment healthy (all pods running, PVCs bound)
|
||||
|
||||
---
|
||||
|
||||
## Configuration Inputs
|
||||
|
||||
### OIDC Configuration
|
||||
- [ ] `OIDC_ISSUER` set and accessible
|
||||
- [ ] `OIDC_CLIENT_ID` set
|
||||
- [ ] `OIDC_CLIENT_SECRET` set (stored in Kubernetes secret)
|
||||
- [ ] `OIDC_REDIRECT_URI` matches exactly between LangSmith and IdP
|
||||
- [ ] `OIDC_SCOPES` includes `openid` and `email`
|
||||
- [ ] `OIDC_SCOPES` includes `groups` (if using group-based role mapping)
|
||||
|
||||
### SAML Configuration
|
||||
- [ ] `SAML_METADATA_URL` accessible OR `SAML_METADATA_FILE` exists
|
||||
- [ ] SAML metadata XML is valid
|
||||
- [ ] Entity ID matches between LangSmith and IdP
|
||||
- [ ] Signing certificate present in metadata
|
||||
- [ ] SSO endpoints found in metadata
|
||||
|
||||
### Common to Both
|
||||
- [ ] `LANGSMITH_DOMAIN` matches actual domain
|
||||
- [ ] Claim/attribute mappings configured
|
||||
- [ ] Role mapping configured (groups or users)
|
||||
|
||||
---
|
||||
|
||||
## Role Mapping
|
||||
|
||||
- [ ] Group-to-role mapping configured (preferred)
|
||||
- [ ] Admin groups identified and mapped
|
||||
- [ ] Member groups identified and mapped
|
||||
- [ ] Viewer groups identified and mapped (if applicable)
|
||||
- [ ] Minimal admin principle followed (1-2 admins to start)
|
||||
|
||||
---
|
||||
|
||||
## Login Validation
|
||||
|
||||
### Admin User
|
||||
- [ ] Admin user can log in via SSO
|
||||
- [ ] Admin user sees correct role (admin)
|
||||
- [ ] Admin user can access organization settings
|
||||
- [ ] Admin user can manage workspaces
|
||||
- [ ] Admin user can manage users (if applicable)
|
||||
|
||||
### Standard User
|
||||
- [ ] Standard user can log in via SSO
|
||||
- [ ] Standard user sees correct role (member/viewer)
|
||||
- [ ] Standard user can access assigned workspaces
|
||||
- [ ] Standard user cannot access organization settings
|
||||
- [ ] Standard user cannot manage users
|
||||
|
||||
---
|
||||
|
||||
## Session Management
|
||||
|
||||
- [ ] Logout works correctly
|
||||
- [ ] Session invalidation works (logout from IdP invalidates LangSmith session)
|
||||
- [ ] Session timeout configured appropriately
|
||||
- [ ] Multiple browser sessions work independently
|
||||
|
||||
---
|
||||
|
||||
## Audit Evidence
|
||||
|
||||
- [ ] Authentication events logged
|
||||
- [ ] Role assignments logged
|
||||
- [ ] Session creation/destruction logged
|
||||
- [ ] Failed authentication attempts logged
|
||||
- [ ] Logs exportable to SIEM (if required)
|
||||
|
||||
---
|
||||
|
||||
## Documentation
|
||||
|
||||
- [ ] Helm values file saved (with secrets redacted)
|
||||
- [ ] IdP settings documented
|
||||
- [ ] Group-to-role mapping table created
|
||||
- [ ] Admin assignments documented
|
||||
- [ ] Troubleshooting playbook bookmarked
|
||||
|
||||
---
|
||||
|
||||
## Sign-Off
|
||||
|
||||
**Validated by:** _________________
|
||||
**Date:** _________________
|
||||
**Notes:** _________________
|
||||
|
||||
---
|
||||
|
||||
**Next Steps:**
|
||||
- Proceed to Module 3 (if applicable)
|
||||
- Schedule regular access reviews
|
||||
- Document any deviations from standard configuration
|
||||
|
||||
@@ -0,0 +1,163 @@
|
||||
# First 10 Minutes: Incident Response Checklist
|
||||
|
||||
**When:** You detect or are alerted to a LangSmith self-hosted issue.
|
||||
|
||||
**Goal:** Collect evidence, stabilize if possible, and prepare for escalation—without making things worse.
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Critical: Do NOT Redeploy
|
||||
|
||||
**Resist the urge to:**
|
||||
- Run `helm upgrade` or `kubectl rollout restart`
|
||||
- Delete pods "to see if they come back"
|
||||
- Scale resources up/down
|
||||
- Change configuration
|
||||
|
||||
**Why:** Redeploying destroys evidence and may mask the root cause. Collect diagnostics first.
|
||||
|
||||
---
|
||||
|
||||
## Minute 0-2: Triage & Scope
|
||||
|
||||
- [ ] **Confirm the issue:** What's broken? (UI down, API 5xx, traces missing, auth failing)
|
||||
- [ ] **Check who's impacted:** All users, specific endpoints, specific features?
|
||||
- [ ] **Note the time:** Record detection time and any recent changes (deployments, config changes, infrastructure changes)
|
||||
- [ ] **Check basic connectivity:**
|
||||
```bash
|
||||
kubectl cluster-info
|
||||
kubectl get nodes
|
||||
kubectl get pods -n <namespace>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Minute 2-5: Quick Health Check
|
||||
|
||||
- [ ] **Pod status:**
|
||||
```bash
|
||||
kubectl get pods -n <namespace> -o wide
|
||||
```
|
||||
Look for: CrashLoopBackOff, Pending, Error states
|
||||
|
||||
- [ ] **Recent events:**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
|
||||
```
|
||||
Look for: Failed scheduling, image pull errors, resource limits
|
||||
|
||||
- [ ] **Ingress/Load Balancer:**
|
||||
```bash
|
||||
kubectl get ingress -n <namespace>
|
||||
```
|
||||
Check if endpoint is reachable (curl or browser)
|
||||
|
||||
- [ ] **Key deployments:**
|
||||
```bash
|
||||
kubectl get deployments -n <namespace>
|
||||
kubectl describe deployment <deployment-name> -n <namespace>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Minute 5-8: Collect Diagnostics Bundle
|
||||
|
||||
- [ ] **Run canonical diagnostics script:**
|
||||
```bash
|
||||
# Download and run the official script
|
||||
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
|
||||
chmod +x get_k8s_debugging_info.sh
|
||||
./get_k8s_debugging_info.sh <namespace>
|
||||
```
|
||||
This captures: pod logs, events, resource usage, configuration
|
||||
|
||||
- [ ] **Save timestamped snapshot:**
|
||||
```bash
|
||||
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
|
||||
mkdir -p artifacts/incident-$TIMESTAMP
|
||||
|
||||
kubectl get all -n <namespace> -o yaml > artifacts/incident-$TIMESTAMP/all-resources.yaml
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > artifacts/incident-$TIMESTAMP/events.txt
|
||||
```
|
||||
|
||||
- [ ] **Check logs for obvious errors:**
|
||||
```bash
|
||||
# Check API server logs
|
||||
kubectl logs -n <namespace> -l app=langsmith-api --tail=100
|
||||
|
||||
# Check worker logs
|
||||
kubectl logs -n <namespace> -l app=langsmith-worker --tail=100
|
||||
```
|
||||
Look for: connection errors, timeouts, authentication failures, resource exhaustion
|
||||
|
||||
---
|
||||
|
||||
## Minute 8-10: Identify Likely Root Cause
|
||||
|
||||
Based on symptoms, check the most likely culprits:
|
||||
|
||||
### If UI/API is down:
|
||||
- [ ] Check ingress/load balancer status (via cloud helper or kubectl)
|
||||
- [ ] Check API pod logs for startup errors
|
||||
- [ ] Verify external services (PostgreSQL, Redis) are reachable
|
||||
|
||||
### If traces are missing/delayed:
|
||||
- [ ] Check ClickHouse connectivity and logs
|
||||
- [ ] Check worker pod logs for insert errors
|
||||
- [ ] Verify blob storage configuration (if large payloads)
|
||||
|
||||
### If authentication fails:
|
||||
- [ ] Check OIDC/SAML configuration (Module 2 validation)
|
||||
- [ ] Check IdP connectivity
|
||||
- [ ] Review auth-related pod logs
|
||||
|
||||
### If ingestion is slow:
|
||||
- [ ] Check Redis connectivity and latency
|
||||
- [ ] Check worker pod resource usage
|
||||
- [ ] Look for queue backlogs
|
||||
|
||||
---
|
||||
|
||||
## After 10 Minutes: Decision Point
|
||||
|
||||
**If you've identified and can safely fix the issue:**
|
||||
- Document what you changed
|
||||
- Verify recovery
|
||||
- Collect post-recovery diagnostics
|
||||
|
||||
**If you need help:**
|
||||
- Use the [Support Escalation Template](../shared/support_escalation_template.md)
|
||||
- Include the diagnostics bundle
|
||||
- Note what you've tried and the results
|
||||
|
||||
**If the issue is critical and escalating:**
|
||||
- Continue collecting evidence every 5-10 minutes
|
||||
- Document timeline of symptoms
|
||||
- Prepare escalation with all evidence
|
||||
|
||||
---
|
||||
|
||||
## What NOT to Do
|
||||
|
||||
- ❌ Don't delete namespaces or persistent volumes
|
||||
- ❌ Don't change database passwords or connection strings
|
||||
- ❌ Don't scale resources without understanding the bottleneck
|
||||
- ❌ Don't ignore error messages—they're evidence
|
||||
- ❌ Don't skip the diagnostics bundle—Support will ask for it
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference: Common Failure Patterns
|
||||
|
||||
| Symptom | Likely Cause | First Check |
|---------|--------------|-------------|
| All pods CrashLoopBackOff | Config error, missing secret | `kubectl describe pod` |
| API 5xx errors | Database/Redis connection | Pod logs, service endpoints |
| Traces not appearing | ClickHouse connectivity | ClickHouse pod logs |
| Slow ingestion | Redis latency, worker backlog | Worker logs, Redis metrics |
| Auth redirect loop | OIDC/SAML misconfiguration | Auth pod logs, IdP connectivity |
|
||||
|
||||
---
|
||||
|
||||
**Remember:** The goal is evidence collection and safe triage, not immediate resolution. A good diagnostics bundle is worth more than a hasty fix.
|
||||
|
||||
@@ -0,0 +1,312 @@
|
||||
# Operations Signals and Thresholds
|
||||
|
||||
**Purpose:** Define early warning signals and red flag thresholds for LangSmith operations
|
||||
**Use:** Configure monitoring and alerting based on these thresholds
|
||||
**Frequency:** Review quarterly and adjust based on historical data
|
||||
|
||||
---
|
||||
|
||||
## Signal Categories
|
||||
|
||||
### Critical Signals (Red Flags - Immediate Action)
|
||||
|
||||
**Pod Health:**
|
||||
- Pod in `CrashLoopBackOff` state → **IMMEDIATE**
|
||||
- Pod `Pending` for > 10 minutes → **IMMEDIATE**
|
||||
- Pod restart count > 5 in 1 hour → **IMMEDIATE**
|
||||
- Pod `ImagePullBackOff` → **IMMEDIATE**
|
||||
|
||||
**Resource Saturation:**
|
||||
- Node CPU > 90% for > 5 minutes → **IMMEDIATE**
|
||||
- Node memory > 95% for > 5 minutes → **IMMEDIATE**
|
||||
- Pod CPU > 90% for > 10 minutes → **IMMEDIATE**
|
||||
- Pod memory > 95% for > 10 minutes → **IMMEDIATE**
|
||||
|
||||
**Application Health:**
|
||||
- API server error rate > 5% → **IMMEDIATE**
|
||||
- API server latency p95 > 5 seconds → **IMMEDIATE**
|
||||
- Worker queue depth > 200 → **IMMEDIATE**
|
||||
- Worker processing rate < 10 jobs/minute → **IMMEDIATE**
|
||||
|
||||
**Data Store Health:**
|
||||
- PostgreSQL connection pool exhausted → **IMMEDIATE**
|
||||
- PostgreSQL query timeout > 10 seconds → **IMMEDIATE**
|
||||
- Redis out of memory → **IMMEDIATE**
|
||||
- Redis connection refused → **IMMEDIATE**
|
||||
- ClickHouse query timeout > 10 seconds → **IMMEDIATE**
|
||||
- ClickHouse table size > 1 TB (single table) → **IMMEDIATE**
|
||||
|
||||
### Warning Signals (Yellow Flags - Monitor Closely)
|
||||
|
||||
**Pod Health:**
|
||||
- Pod restart count > 2 in 1 hour → **WARNING**
|
||||
- Pod `Pending` for > 5 minutes → **WARNING**
|
||||
- Pod CPU > 70% for > 10 minutes → **WARNING**
|
||||
- Pod memory > 80% for > 10 minutes → **WARNING**
|
||||
|
||||
**Application Health:**
|
||||
- API server error rate > 1% → **WARNING**
|
||||
- API server latency p95 > 2 seconds → **WARNING**
|
||||
- Worker queue depth > 50 → **WARNING**
|
||||
- Worker processing rate < 50 jobs/minute → **WARNING**
|
||||
|
||||
**Data Store Health:**
|
||||
- PostgreSQL connections > 80% of max → **WARNING**
|
||||
- PostgreSQL query latency p95 > 2 seconds → **WARNING**
|
||||
- Redis memory > 90% → **WARNING**
|
||||
- Redis hit rate < 80% → **WARNING**
|
||||
- ClickHouse query latency p95 > 3 seconds → **WARNING**
|
||||
- ClickHouse disk usage > 80% → **WARNING**
|
||||
|
||||
---
|
||||
|
||||
## Threshold Definitions
|
||||
|
||||
### Pod Restart Count
|
||||
|
||||
**Measurement:** `kubectl get pods -n <namespace> --field-selector=status.phase=Running` → count restarts
|
||||
|
||||
**Calculation:**
|
||||
```bash
|
||||
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
|
||||
```
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** > 5 restarts in 1 hour
|
||||
- **Warning:** > 2 restarts in 1 hour
|
||||
|
||||
**Action:**
|
||||
- Check pod logs: `kubectl logs <pod> -n <namespace> --tail=100`
|
||||
- Check events: `kubectl get events -n <namespace> --sort-by=.lastTimestamp`
|
||||
- Check resource limits: `kubectl describe pod <pod> -n <namespace>`
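
Note that `restartCount` is cumulative since the pod was created, so the per-hour thresholds above need two samples taken an hour apart. The sketch below assumes each sample was saved from the jsonpath command in this section as tab-separated `name<TAB>count` lines; `restart_deltas` is a hypothetical helper that subtracts the older sample from the newer one and applies the thresholds.

```bash
# Hypothetical helper: restarts per pod between two snapshots.
# $1 = older snapshot file, $2 = newer snapshot file
# Each file holds lines of "<pod-name><TAB><restartCount>".
restart_deltas() {
  awk -F'\t' '
    FILENAME == ARGV[1] { before[$1] = $2; next }
    {
      # Pods missing from the older snapshot are treated as starting at 0.
      delta = $2 - (($1 in before) ? before[$1] : 0)
      level = "OK"
      if (delta > 5) level = "CRITICAL"
      else if (delta > 2) level = "WARNING"
      printf "%s\t%d\t%s\n", $1, delta, level
    }' "$1" "$2"
}
```

Possible usage: save the jsonpath output to `restarts-before.tsv`, repeat an hour later into `restarts-after.tsv`, then run `restart_deltas restarts-before.tsv restarts-after.tsv`. Only the first container of each pod is covered, matching the command above.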
|
||||
|
||||
### Pending Pods
|
||||
|
||||
**Measurement:** Pods in `Pending` state
|
||||
|
||||
**Calculation:**
|
||||
```bash
|
||||
kubectl get pods -n <namespace> --field-selector=status.phase=Pending
|
||||
```
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** Pending for > 10 minutes
|
||||
- **Warning:** Pending for > 5 minutes
|
||||
|
||||
**Action:**
|
||||
- Check events: `kubectl describe pod <pod> -n <namespace>`
|
||||
- Check node capacity: `kubectl top nodes`
|
||||
- Check PVC binding: `kubectl get pvc -n <namespace>`
|
||||
|
||||
### API Server Error Rate
|
||||
|
||||
**Measurement:** HTTP 5xx responses / total requests
|
||||
|
||||
**Calculation:**
|
||||
- Application metrics (PromQL, assuming an `http_requests_total` counter with a `status` label): `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
|
||||
- Or: Check application logs for error patterns
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** > 5% error rate
|
||||
- **Warning:** > 1% error rate
|
||||
|
||||
**Action:**
|
||||
- Check pod logs: `kubectl logs <api-pod> -n <namespace> --tail=100 | grep -i error`
|
||||
- Check downstream services (PostgreSQL, Redis, ClickHouse)
|
||||
- Check resource usage: `kubectl top pod <api-pod> -n <namespace>`
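
When no metrics backend is available, the same ratio can be computed by hand from request counts over a fixed window. This is a minimal sketch with a hypothetical helper name; how you obtain the two counts depends on your log format, which is not specified here.

```bash
# Hypothetical helper: classify an error rate against the thresholds above.
# $1 = total requests in the window, $2 = HTTP 5xx responses in the window
classify_error_rate() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) { print "no traffic"; exit }
    pct = errors * 100 / total
    level = (pct > 5) ? "CRITICAL" : (pct > 1) ? "WARNING" : "OK"
    printf "%.2f%% %s\n", pct, level
  }'
}
```

For example, `classify_error_rate 1000 60` prints `6.00% CRITICAL`.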
|
||||
|
||||
### Worker Queue Depth
|
||||
|
||||
**Measurement:** Number of jobs in Redis queue
|
||||
|
||||
**Calculation:**
|
||||
```bash
|
||||
# Redis CLI
|
||||
redis-cli LLEN langsmith:jobs:traces
|
||||
```
|
||||
|
||||
**Or via application metrics:**
|
||||
- KEDA metrics: `redis_queue_length`
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** > 200 jobs
|
||||
- **Warning:** > 50 jobs
|
||||
|
||||
**Action:**
|
||||
- Scale workers: Check KEDA ScaledObject
|
||||
- Check worker processing rate
|
||||
- Check for stuck jobs
|
||||
|
||||
### PostgreSQL Connection Count
|
||||
|
||||
**Measurement:** Active connections / max connections
|
||||
|
||||
**Calculation:**
|
||||
```sql
|
||||
SELECT count(*) FROM pg_stat_activity;
|
||||
SELECT setting FROM pg_settings WHERE name = 'max_connections';
|
||||
```
|
||||
|
||||
**Or via cloud provider metrics:**
|
||||
- AWS RDS: `DatabaseConnections` metric
|
||||
- Azure Database: `active_connections` metric
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** > 90% of max connections
|
||||
- **Warning:** > 80% of max connections
|
||||
|
||||
**Action:**
|
||||
- Check for connection leaks
|
||||
- Review connection pool configuration
|
||||
- Consider increasing max connections (if justified)
|
||||
|
||||
### Redis Memory Usage
|
||||
|
||||
**Measurement:** Used memory / max memory
|
||||
|
||||
**Calculation:**
|
||||
```bash
|
||||
redis-cli INFO memory
|
||||
# used_memory / maxmemory
|
||||
```
|
||||
|
||||
**Or via cloud provider metrics:**
|
||||
- AWS ElastiCache: `DatabaseMemoryUsagePercentage`
|
||||
- Azure Cache: `usedmemorypercentage`
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** > 95% memory usage
|
||||
- **Warning:** > 90% memory usage
|
||||
|
||||
**Action:**
|
||||
- Check for memory leaks
|
||||
- Review key expiration policies
|
||||
- Consider scaling up instance size
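
The `used_memory / maxmemory` ratio can be computed directly from the `INFO memory` output. The sketch below is a hypothetical helper that handles two practical details: the output uses CRLF line endings, and `maxmemory` is `0` when no limit is configured, in which case the ratio is undefined and a managed-service metric is the better source.

```bash
# Hypothetical helper: percentage of maxmemory in use.
# Reads `redis-cli INFO memory` output on stdin.
redis_memory_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "maxmemory unset (no limit configured)"; exit }
      printf "%.1f\n", used * 100 / max
    }'
}
```

Possible usage: `redis-cli INFO memory | redis_memory_pct`, then compare the result with the 90% and 95% thresholds above.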
|
||||
|
||||
### ClickHouse Query Latency
|
||||
|
||||
**Measurement:** p95 query latency
|
||||
|
||||
**Calculation:**
|
||||
- ClickHouse system tables: `SELECT quantile(0.95)(query_duration_ms) FROM system.query_log WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR`
|
||||
|
||||
**Thresholds:**
|
||||
- **Critical:** p95 > 10 seconds
|
||||
- **Warning:** p95 > 3 seconds
|
||||
|
||||
**Action:**
|
||||
- Check table sizes (may need partitioning)
|
||||
- Check for slow queries: `SELECT * FROM system.query_log WHERE query_duration_ms > 5000 ORDER BY query_duration_ms DESC LIMIT 10`
|
||||
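- Check CPU and memory pressure: `kubectl top pod <clickhouse-pod> -n <namespace>` (`kubectl top` does not report disk I/O; use node or storage-volume metrics for that)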
|
||||
|
||||
---
|
||||
|
||||
## Log Patterns to Monitor
|
||||
|
||||
### Common Failure Patterns
|
||||
|
||||
**Connection Refused:**
|
||||
```bash
|
||||
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "connection refused"
|
||||
```
|
||||
|
||||
**Timeouts:**
|
||||
```bash
|
||||
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "timeout"
|
||||
```
|
||||
|
||||
**Out of Memory:**
|
||||
```bash
|
||||
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "out of memory\|OOM"
|
||||
```
|
||||
|
||||
**Database Errors:**
|
||||
```bash
|
||||
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "database\|postgres\|redis\|clickhouse" | grep -i "error\|fail"
|
||||
```
|
||||
|
||||
**Authentication Errors:**
|
||||
```bash
|
||||
kubectl logs <pod> -n <namespace> --tail=100 | grep -i "unauthorized\|forbidden\|auth"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation Evidence
|
||||
|
||||
When escalating to support, gather:
|
||||
|
||||
1. **Pod Status:**
|
||||
```bash
|
||||
kubectl get pods -n <namespace> -o wide
|
||||
kubectl describe pod <problem-pod> -n <namespace>
|
||||
```
|
||||
|
||||
2. **Recent Events:**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -50
|
||||
```
|
||||
|
||||
3. **Resource Usage:**
|
||||
```bash
|
||||
kubectl top pods -n <namespace>
|
||||
kubectl top nodes
|
||||
```
|
||||
|
||||
4. **Pod Logs:**
|
||||
```bash
|
||||
kubectl logs <pod> -n <namespace> --tail=200
|
||||
```
|
||||
|
||||
5. **Application Metrics:**
|
||||
- Error rates, latencies, queue depths
|
||||
- Database connection counts
|
||||
- Cache hit rates
|
||||
|
||||
6. **Configuration:**
|
||||
- Helm values (redacted)
|
||||
- Environment variables (redacted)
|
||||
- Resource requests/limits
|
||||
|
||||
---
|
||||
|
||||
## Threshold Tuning
|
||||
|
||||
**Initial thresholds:** Use the values above as starting points.
|
||||
|
||||
**Tuning process:**
|
||||
1. Monitor for 1-2 weeks
|
||||
2. Identify false positives (alerts that don't require action)
|
||||
3. Identify missed incidents (issues that should have alerted)
|
||||
4. Adjust thresholds based on historical data
|
||||
5. Document threshold changes and rationale
|
||||
|
||||
**Factors to consider:**
|
||||
- Workload patterns (peak hours, batch jobs)
|
||||
- Growth trajectory (user growth, data growth)
|
||||
- Resource capacity (cluster size, database size)
|
||||
- Business requirements (SLA, RTO, RPO)
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Signal | Critical | Warning | Measurement |
|--------|----------|---------|-------------|
| Pod restarts | > 5/hour | > 2/hour | `kubectl get pods` |
| Pending pods | > 10 min | > 5 min | `kubectl get pods` |
| API error rate | > 5% | > 1% | Application metrics |
| Worker queue | > 200 | > 50 | Redis queue length |
| PostgreSQL connections | > 90% max | > 80% max | Database metrics |
| Redis memory | > 95% | > 90% | Redis INFO memory |
| ClickHouse latency | > 10s p95 | > 3s p95 | Query log |
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Configure alerts** based on these thresholds
|
||||
2. **Test alerts** to ensure they fire correctly
|
||||
3. **Document runbooks** for each alert type
|
||||
4. **Review quarterly** and adjust based on experience
|
||||
|
||||
@@ -0,0 +1,238 @@
|
||||
# Production Readiness Checklist
|
||||
|
||||
**Purpose:** Validate that LangSmith deployment meets production requirements
|
||||
**Use:** Complete this checklist before declaring production-ready
|
||||
**Frequency:** Review quarterly or after significant changes
|
||||
|
||||
---
|
||||
|
||||
## Infrastructure & Networking
|
||||
|
||||
### Cloud Provider Configuration
|
||||
- [ ] Correct cloud account/subscription (verified)
|
||||
- [ ] Correct region selected (verified)
|
||||
- [ ] Private networking configured (all data stores in private subnets)
|
||||
- [ ] VPC/VNet peering configured (if multi-VPC deployment)
|
||||
- [ ] Security groups/NSGs configured correctly
|
||||
- [ ] IAM roles/Managed Identities configured (no access keys)
|
||||
|
||||
### Kubernetes Cluster
|
||||
- [ ] Cluster version supported (check compatibility matrix)
|
||||
- [ ] Node capacity sufficient (headroom for scaling)
|
||||
- [ ] Cluster autoscaling enabled (if applicable)
|
||||
- [ ] CSI storage drivers installed (EBS CSI for AWS, Azure Disk CSI for Azure)
|
||||
- [ ] Network policies configured (if required)
|
||||
- [ ] Resource quotas set (if multi-tenant)
|
||||
|
||||
---
|
||||
|
||||
## Data Stores
|
||||
|
||||
### PostgreSQL
|
||||
- [ ] Instance size meets baseline (db.r5.xlarge minimum for production)
|
||||
- [ ] Multi-AZ enabled (RDS) or read replicas configured (Azure)
|
||||
- [ ] Storage autoscaling enabled
|
||||
- [ ] Automated backups configured (7-day retention minimum)
|
||||
- [ ] Connection pool configured (100+ connections)
|
||||
- [ ] Private networking (no public access)
|
||||
- [ ] Encryption at rest enabled
|
||||
- [ ] Performance insights/monitoring enabled
|
||||
|
||||
### Redis
|
||||
- [ ] Instance type meets baseline (cache.r6g.xlarge minimum for production)
|
||||
- [ ] Cluster mode enabled (3+ nodes for production)
|
||||
- [ ] AOF persistence enabled (production)
|
||||
- [ ] Memory headroom sufficient (50% free)
|
||||
- [ ] Private networking (no public access)
|
||||
- [ ] Encryption at rest enabled
|
||||
|
||||
### ClickHouse
|
||||
- [ ] Deployment type: Managed (ClickHouse Cloud) OR in-cluster with proper sizing
|
||||
- [ ] In-cluster: 3-node cluster minimum (for HA)
|
||||
- [ ] Resources per node: 8 CPU, 32 GB RAM, 1 TB storage (production)
|
||||
- [ ] Replication factor: 2x (6 total pods for 3-node cluster)
|
||||
- [ ] Storage class: EBS gp3 with 3000 IOPS (AWS) or Premium SSD (Azure)
|
||||
- [ ] Backups configured (if in-cluster)
|
||||
- [ ] Private networking (no public access)
|
||||
|
||||
### Blob Storage (REQUIRED)
|
||||
- [ ] Blob storage provider configured (S3 or Azure Blob Storage)
|
||||
- [ ] NOT using local filesystem or in-cluster storage
|
||||
- [ ] Bucket/container created and accessible
|
||||
- [ ] IAM role/Managed Identity configured (no access keys)
|
||||
- [ ] Versioning enabled
|
||||
- [ ] Encryption at rest enabled
|
||||
- [ ] Lifecycle policies configured (cost optimization)
|
||||
- [ ] Health check passes (see ops sanity checks notebook)
|
||||
|
||||
**Critical:** Blob storage is REQUIRED for production. Without it, ClickHouse will become unusable under load.
|
||||
|
||||
---
|
||||
|
||||
## Application Configuration
|
||||
|
||||
### Helm Configuration
|
||||
- [ ] Helm values file reviewed and documented
|
||||
- [ ] Resource requests/limits set for all containers
|
||||
- [ ] Replica counts set appropriately (min 2 for HA)
|
||||
- [ ] Environment variables documented
|
||||
- [ ] Secrets stored in Kubernetes (not in values file)
|
||||
- [ ] Values file version controlled (Git)
|
||||
|
||||
### High Availability
|
||||
- [ ] API server replicas: 2+ (for HA)
|
||||
- [ ] Worker replicas: 1+ (scaled via KEDA)
|
||||
- [ ] Pod anti-affinity rules configured (spread across nodes/AZs)
|
||||
- [ ] Readiness probes configured correctly
|
||||
- [ ] Liveness probes configured correctly
|
||||
|
||||
### Autoscaling
|
||||
- [ ] HPA configured for API servers (CPU/memory targets)
|
||||
- [ ] HPA min replicas: 2
|
||||
- [ ] HPA max replicas: 10+ (adjust based on capacity)
|
||||
- [ ] KEDA ScaledObject configured for workers (queue depth)
|
||||
- [ ] KEDA min replicas: 1
|
||||
- [ ] KEDA max replicas: 20+ (adjust based on workload)
|
||||
|
||||
---
|
||||
|
||||
## Observability
|
||||
|
||||
### Monitoring
|
||||
- [ ] Kubernetes metrics available (pod CPU/memory)
|
||||
- [ ] Application metrics exposed (request rates, latencies)
|
||||
- [ ] Database metrics available (connection counts, query performance)
|
||||
- [ ] Redis metrics available (memory usage, hit rates)
|
||||
- [ ] ClickHouse metrics available (query latency, table sizes)
|
||||
- [ ] Log aggregation configured (CloudWatch, Azure Monitor, etc.)
|
||||
|
||||
### Alerting
|
||||
- [ ] Critical alerts configured (pod crashes, high error rates)
|
||||
- [ ] Warning alerts configured (resource saturation, queue depth)
|
||||
- [ ] Alert thresholds documented (see ops_signals_and_thresholds.md)
|
||||
- [ ] On-call rotation configured
|
||||
- [ ] Escalation paths defined
|
||||
|
||||
### Dashboards
|
||||
- [ ] Kubernetes dashboard (pod status, resource usage)
|
||||
- [ ] Application dashboard (request rates, error rates)
|
||||
- [ ] Database dashboard (connection counts, query performance)
|
||||
- [ ] Queue depth dashboard (worker queue metrics)
|
||||
|
||||
---
|
||||
|
||||
## Security
|
||||
|
||||
### Authentication & Authorization
|
||||
- [ ] SSO configured (OIDC or SAML)
|
||||
- [ ] Local auth disabled (production)
|
||||
- [ ] Role mapping configured correctly
|
||||
- [ ] Admin access restricted (minimal admins)
|
||||
|
||||
### Network Security
|
||||
- [ ] Ingress TLS configured (valid certificate)
|
||||
- [ ] mTLS enabled (if service mesh used)
|
||||
- [ ] Egress policies configured (if service mesh used)
|
||||
- [ ] Network policies configured (if required)
|
||||
|
||||
### Secrets Management
|
||||
- [ ] Secrets stored in Kubernetes (not in code)
|
||||
- [ ] Secrets rotation process documented
|
||||
- [ ] Access to secrets restricted (RBAC)
|
||||
|
||||
---
|
||||
|
||||
## Backup & Disaster Recovery
|
||||
|
||||
### Backups
|
||||
- [ ] PostgreSQL backups automated (daily, 7-day retention)
|
||||
- [ ] ClickHouse backups configured (if in-cluster)
|
||||
- [ ] Blob storage versioning enabled
|
||||
- [ ] Backup restoration tested (within the last 6 months)
|
||||
|
||||
### Disaster Recovery
|
||||
- [ ] DR plan documented
|
||||
- [ ] RTO/RPO defined
|
||||
- [ ] Failover procedure tested
|
||||
- [ ] Cross-region replication configured (if required)
|
||||
|
||||
---
|
||||
|
||||
## Operational Readiness
|
||||
|
||||
### Documentation
|
||||
- [ ] Runbooks documented (common operations)
|
||||
- [ ] Incident response procedures documented
|
||||
- [ ] Escalation paths documented
|
||||
- [ ] Service sizing baselines documented
|
||||
|
||||
### Testing
|
||||
- [ ] Load testing performed (validates scaling)
|
||||
- [ ] Failover testing performed (validates HA)
|
||||
- [ ] Backup restoration tested
|
||||
- [ ] Ops sanity checks notebook run (all checks pass)
|
||||
|
||||
### Team Readiness
|
||||
- [ ] On-call rotation established
|
||||
- [ ] Team trained on operations
|
||||
- [ ] Access to cloud console (for managed services)
|
||||
- [ ] Access to monitoring/alerting tools
|
||||
|
||||
---
|
||||
|
||||
## Service Mesh (If Applicable)
|
||||
|
||||
### Istio Configuration
|
||||
- [ ] Istio installed and configured
|
||||
- [ ] Sidecar injection enabled (namespace or per-workload)
|
||||
- [ ] ServiceEntry configured (for external databases)
|
||||
- [ ] DestinationRule configured (traffic policies)
|
||||
- [ ] Egress policies configured (if required)
|
||||
- [ ] mTLS enabled (if required)
|
||||
|
||||
### Operational Considerations
|
||||
- [ ] Log selection documented (app vs proxy logs)
|
||||
- [ ] Health probe timeouts adjusted (account for sidecar)
|
||||
- [ ] Multi-container pod logging understood
|
||||
|
||||
---
|
||||
|
||||
## Sign-Off
|
||||
|
||||
**Validated by:** _________________
|
||||
**Date:** _________________
|
||||
**Next Review Date:** _________________
|
||||
**Notes:** _________________
|
||||
|
||||
---
|
||||
|
||||
## Post-Checklist Actions
|
||||
|
||||
1. **Run ops sanity checks notebook:**
   - `notebooks/module-3/01_ops_sanity_checks.ipynb`
   - Address any failures before production

2. **Document thresholds:**
   - Update `docs/shared/ops_signals_and_thresholds.md`
   - Configure alerts based on thresholds

3. **Schedule quarterly reviews:**
   - Review checklist quarterly
   - Update baselines as workload grows
   - Adjust thresholds based on historical data
|
||||
|
||||
---
|
||||
|
||||
## Common Gaps
|
||||
|
||||
**Most common production readiness gaps:**
|
||||
1. Blob storage not configured (CRITICAL)
|
||||
2. PostgreSQL single-AZ (no HA)
|
||||
3. Redis single node (no cluster mode)
|
||||
4. No autoscaling configured
|
||||
5. No monitoring/alerting
|
||||
6. Backups not tested
|
||||
7. Resource limits not set
|
||||
|
||||
**Address these before declaring production-ready.**
|
||||
|
||||
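One way to confirm the checklist is complete before sign-off is to scan the filled-in markdown for boxes that are still unchecked. This is a minimal sketch: the function name `unchecked_items` is hypothetical, and it assumes the checklist keeps the `- [ ]` / `- [x]` task-list syntax used in this document.

```python
import re

# Matches markdown task-list items such as "- [ ] text" or "- [x] text"
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def unchecked_items(markdown_text):
    """Return the text of every checklist item that is still unchecked."""
    items = []
    for line in markdown_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            items.append(match.group(2).strip())
    return items


if __name__ == "__main__":
    sample = "- [x] Versioning enabled\n- [ ] Encryption at rest enabled\n"
    for item in unchecked_items(sample):
        print("TODO:", item)
```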
@@ -0,0 +1,468 @@
|
||||
# Sidecars & Service Mesh (Istio)
|
||||
|
||||
**Purpose:** Guide for enabling and operating Istio sidecars in LangSmith deployments
|
||||
**Audience:** Platform engineers and operators managing service mesh configurations
|
||||
**Prerequisites:** Istio installed in cluster (out of scope for this guide)
|
||||
|
||||
---
|
||||
|
||||
## When Sidecars Are Needed
|
||||
|
||||
### Use Cases
|
||||
|
||||
**Egress Control:**
|
||||
- Restrict outbound traffic to approved destinations only
|
||||
- Prevent pods from accessing unauthorized external services
|
||||
- Enforce network policies at the service mesh level
|
||||
|
||||
**mTLS (Mutual TLS):**
|
||||
- Encrypt traffic between services within the cluster
|
||||
- Provide service-to-service authentication
|
||||
- Meet compliance requirements for encrypted communication
|
||||
|
||||
**Policy Enforcement:**
|
||||
- Rate limiting between services
|
||||
- Circuit breakers for fault tolerance
|
||||
- Traffic splitting for canary deployments
|
||||
|
||||
**Observability:**
|
||||
- Distributed tracing across services
|
||||
- Service-level metrics collection
|
||||
- Request/response logging
|
||||
|
||||
### When NOT Needed
|
||||
|
||||
**Simple deployments:**
|
||||
- Development environments
|
||||
- Proof-of-concept deployments
|
||||
- Single-service deployments
|
||||
|
||||
**No egress requirements:**
|
||||
- All traffic stays within cluster
|
||||
- No external database connections
|
||||
- No outbound API calls
|
||||
|
||||
**Alternative solutions:**
|
||||
- Network policies (Kubernetes native)
|
||||
- Ingress controllers (for north-south traffic)
|
||||
- Application-level rate limiting
|
||||
|
||||
---
|
||||
|
||||
## How to Enable Injection Safely
|
||||
|
||||
### Namespace-Level Injection (Recommended)
|
||||
|
||||
**Best for:** LangSmith namespace (all workloads need sidecars)
|
||||
|
||||
**Configuration:**
|
||||
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: langsmith
  labels:
    istio-injection: enabled
    istio-discovery: enabled
```
|
||||
|
||||
**Apply:**
|
||||
```bash
|
||||
kubectl label namespace langsmith istio-injection=enabled istio-discovery=enabled
|
||||
```
|
||||
|
||||
**Verification:**
|
||||
```bash
|
||||
kubectl get namespace langsmith --show-labels
|
||||
```
|
||||
|
||||
**Behavior:**
|
||||
- All new pods in namespace get sidecars automatically
|
||||
- Existing pods require restart to get sidecars
|
||||
- Pods can opt out with annotation: `sidecar.istio.io/inject: "false"`
|
||||
|
||||
### Per-Workload Annotation (Selective Injection)
|
||||
|
||||
**Best for:** Specific workloads that need sidecars
|
||||
|
||||
**Configuration:**
|
||||
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: api
          # ... container spec
```
|
||||
|
||||
**Apply:**
|
||||
```bash
|
||||
kubectl patch deployment langsmith-api -n langsmith -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"true"}}}}}'
|
||||
```
|
||||
|
||||
**Behavior:**
|
||||
- Only annotated workloads get sidecars
|
||||
- Works even if namespace injection is disabled
|
||||
- More granular control
|
||||
|
||||
### Revision-Based Injection (Canary/Blue-Green)
|
||||
|
||||
**Best for:** Gradual rollout or canary deployments
|
||||
|
||||
**Configuration:**
|
||||
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: langsmith
  labels:
    # Do not also set istio-injection here: that label takes precedence over istio.io/rev
    istio.io/rev: default # or a specific revision
```
|
||||
|
||||
**Behavior:**
|
||||
- Allows multiple Istio control planes
|
||||
- Enables gradual migration
|
||||
- Supports canary deployments
|
||||
|
||||
---
|
||||
|
||||
## Operational Implications
|
||||
|
||||
### Logging and kubectl logs Behavior
|
||||
|
||||
**Multi-container pods:** After sidecar injection, pods have multiple containers:
|
||||
- Application container (e.g., `langsmith-api`)
|
||||
- Sidecar container (`istio-proxy`)
|
||||
|
||||
**Default behavior:**
|
||||
```bash
# Without -c, kubectl shows logs from the pod's default container (usually the application)
kubectl logs <pod> -n <namespace>

# Show logs from the Istio sidecar proxy
kubectl logs <pod> -n <namespace> -c istio-proxy

# Show logs from a specific container
kubectl logs <pod> -n <namespace> -c <container-name>

# Show logs from all containers
kubectl logs <pod> -n <namespace> --all-containers=true
```
|
||||
|
||||
**Common issue:** "If logs appear missing after injection, you're likely looking at the wrong container."
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# List containers in pod
|
||||
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
|
||||
|
||||
# Get logs from application container
|
||||
kubectl logs <pod> -n <namespace> -c langsmith-api
|
||||
|
||||
# Get logs from proxy container
|
||||
kubectl logs <pod> -n <namespace> -c istio-proxy
|
||||
```
|
||||
|
||||
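When scripting log collection, the application container can be picked out of the pod spec instead of being hard-coded. The following is a minimal sketch: `app_containers` is a hypothetical helper, and it assumes its input is the JSON printed by `kubectl get pod <pod> -n <namespace> -o json`.

```python
import json


def app_containers(pod_json, sidecars=("istio-proxy",)):
    """Return container names from a pod spec, excluding known sidecars.

    pod_json is assumed to be the output of `kubectl get pod ... -o json`.
    """
    pod = json.loads(pod_json)
    return [
        container["name"]
        for container in pod["spec"]["containers"]
        if container["name"] not in sidecars
    ]


if __name__ == "__main__":
    example = json.dumps(
        {"spec": {"containers": [{"name": "langsmith-api"}, {"name": "istio-proxy"}]}}
    )
    print(app_containers(example))
```

Each returned name can then be passed to `kubectl logs` with `-c`.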
### Health Probes and Timeouts
|
||||
|
||||
**Sidecar adds latency:**
|
||||
- Sidecar intercepts health check requests
|
||||
- Adds ~10-50ms latency per request
|
||||
- May cause probe timeouts if thresholds are too low
|
||||
|
||||
**Adjust probe timeouts:**
|
||||
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
spec:
  template:
    spec:
      containers:
        - name: api
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5 # Increase if sidecars enabled
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5 # Increase if sidecars enabled
            failureThreshold: 3
```
|
||||
|
||||
**Verification:**
|
||||
```bash
|
||||
# Check probe success rate
|
||||
kubectl get pods -n <namespace> -o wide
|
||||
# Look for pods in Ready state
|
||||
|
||||
# Check probe failures
|
||||
kubectl describe pod <pod> -n <namespace> | grep -A 5 "Liveness\|Readiness"
|
||||
```
|
||||
|
||||
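Raising `timeoutSeconds` also lengthens the time Kubernetes needs to notice a genuinely unhealthy container, so it helps to estimate that delay before changing probe settings. This is a rough sketch under a simplifying assumption: the worst case is approximated as `failureThreshold` full probe periods plus one final timeout. The function name is hypothetical and the result is an estimate, not a guarantee from Kubernetes.

```python
def worst_case_detection_seconds(period_seconds, timeout_seconds, failure_threshold):
    """Approximate upper bound on the time to reach failureThreshold.

    Simplified model: the container fails just after a successful probe,
    then every following probe runs until it times out.
    """
    if min(period_seconds, timeout_seconds, failure_threshold) < 1:
        raise ValueError("probe settings must be >= 1")
    return failure_threshold * period_seconds + timeout_seconds


if __name__ == "__main__":
    # Values from the readiness and liveness probes shown above
    print("readiness:", worst_case_detection_seconds(10, 5, 3), "seconds")
    print("liveness:", worst_case_detection_seconds(30, 5, 3), "seconds")
```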
### Egress to External Databases
|
||||
|
||||
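**Problem:** Sidecars block outbound traffic to unregistered external hosts when the mesh's `outboundTrafficPolicy` is set to `REGISTRY_ONLY` (Istio's default, `ALLOW_ANY`, lets that traffic through).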
|
||||
|
||||
**Solution:** Configure `ServiceEntry` for external endpoints.
|
||||
|
||||
**Example ServiceEntry for PostgreSQL:**
|
||||
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: postgres-external
  namespace: langsmith
spec:
  hosts:
    - <postgres-hostname>.rds.amazonaws.com
  ports:
    - number: 5432
      name: postgres
      protocol: TCP
  location: MESH_EXTERNAL
  resolution: DNS
```
|
||||
|
||||
**Example ServiceEntry for Redis:**
|
||||
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: redis-external
  namespace: langsmith
spec:
  hosts:
    - <redis-endpoint>.cache.amazonaws.com
  ports:
    - number: 6379
      name: redis
      protocol: TCP
  location: MESH_EXTERNAL
  resolution: DNS
```
|
||||
|
||||
**Apply:**
|
||||
```bash
|
||||
kubectl apply -f serviceentry-postgres.yaml -n langsmith
|
||||
kubectl apply -f serviceentry-redis.yaml -n langsmith
|
||||
```
|
||||
|
||||
**Verification:**
|
||||
```bash
|
||||
# Check ServiceEntry
|
||||
kubectl get serviceentry -n langsmith
|
||||
|
||||
# Test connectivity from pod
|
||||
kubectl exec -it <pod> -n langsmith -c <app-container> -- nc -zv <db-host> <port>
|
||||
```
|
||||
|
||||
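Because the PostgreSQL and Redis entries differ only in name, host, and port, the manifests can be generated rather than hand-written. The following is a minimal sketch that mirrors the fields of the examples above: `service_entry` is a hypothetical helper, and the hostname in the usage example is a placeholder. It emits JSON, which `kubectl apply -f -` accepts alongside YAML.

```python
import json


def service_entry(name, namespace, host, port, port_name):
    """Build a ServiceEntry manifest for one external TCP endpoint."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "ServiceEntry",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "ports": [{"number": port, "name": port_name, "protocol": "TCP"}],
            "location": "MESH_EXTERNAL",
            "resolution": "DNS",
        },
    }


if __name__ == "__main__":
    # Placeholder hostname; substitute the real database endpoint
    manifest = service_entry(
        "postgres-external", "langsmith", "db.example.internal", 5432, "postgres"
    )
    print(json.dumps(manifest, indent=2))
```

Piping the script's output to `kubectl apply -f -` would create the resource; review the generated manifest first.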
### DestinationRule for Traffic Policies
|
||||
|
||||
**Example DestinationRule:**
|
||||
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: postgres-dr
  namespace: langsmith
spec:
  host: <postgres-hostname>.rds.amazonaws.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http: # HTTP settings have no effect on plain TCP traffic such as PostgreSQL
        http1MaxPendingRequests: 10
        http2MaxRequests: 100
    tls:
      mode: SIMPLE # Sidecar originates TLS; verify this suits your database client before enabling
```
|
||||
|
||||
---
|
||||
|
||||
## Sample Labels and Annotations
|
||||
|
||||
### Namespace Labels
|
||||
|
||||
```yaml
labels:
  istio-injection: enabled
  istio-discovery: enabled
```
|
||||
|
||||
### Pod Annotations
|
||||
|
||||
```yaml
annotations:
  sidecar.istio.io/inject: "true" # Enable injection
  # or
  sidecar.istio.io/inject: "false" # Disable injection
```
|
||||
|
||||
### Complete Example
|
||||
|
||||
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langsmith-api
  namespace: langsmith
spec:
  replicas: 2
  # apps/v1 requires a selector matching the pod template labels (label value is illustrative)
  selector:
    matchLabels:
      app: langsmith-api
  template:
    metadata:
      labels:
        app: langsmith-api
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: api
          image: langsmith/api:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
```
|
||||
|
||||
---
|
||||
|
||||
## Verification Commands
|
||||
|
||||
### Check Sidecar Injection
|
||||
|
||||
```bash
|
||||
# List pods and containers
|
||||
kubectl get pods -n langsmith -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'
|
||||
|
||||
# Check for istio-proxy container
|
||||
kubectl get pod <pod> -n langsmith -o jsonpath='{.spec.containers[*].name}' | grep istio-proxy
|
||||
|
||||
# Describe pod to see all containers
|
||||
kubectl describe pod <pod> -n langsmith | grep -A 10 "Containers:"
|
||||
```
|
||||
|
||||
### Check ServiceEntry
|
||||
|
||||
```bash
|
||||
# List ServiceEntries
|
||||
kubectl get serviceentry -n langsmith
|
||||
|
||||
# Describe ServiceEntry
|
||||
kubectl describe serviceentry <name> -n langsmith
|
||||
```
|
||||
|
||||
### Check DestinationRule
|
||||
|
||||
```bash
|
||||
# List DestinationRules
|
||||
kubectl get destinationrule -n langsmith
|
||||
|
||||
# Describe DestinationRule
|
||||
kubectl describe destinationrule <name> -n langsmith
|
||||
```
|
||||
|
||||
### Test Connectivity
|
||||
|
||||
```bash
|
||||
# Test from application container
|
||||
kubectl exec -it <pod> -n langsmith -c <app-container> -- curl -v <external-url>
|
||||
|
||||
# Test from proxy container (if needed)
|
||||
kubectl exec -it <pod> -n langsmith -c istio-proxy -- curl -v <external-url>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Logs Appear Missing
|
||||
|
||||
**Symptom:** `kubectl logs <pod>` shows no output or wrong logs.
|
||||
|
||||
**Cause:** Looking at wrong container (proxy instead of app).
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# List containers
|
||||
kubectl get pod <pod> -n langsmith -o jsonpath='{.spec.containers[*].name}'
|
||||
|
||||
# Get logs from correct container
|
||||
kubectl logs <pod> -n langsmith -c <app-container-name>
|
||||
```
|
||||
|
||||
### Health Probes Failing
|
||||
|
||||
**Symptom:** Pods not becoming Ready after sidecar injection.
|
||||
|
||||
**Cause:** Probe timeouts too low for sidecar latency.
|
||||
|
||||
**Solution:** Increase `timeoutSeconds` in probe configuration.
|
||||
|
||||
### External Database Connection Refused
|
||||
|
||||
**Symptom:** Cannot connect to external PostgreSQL/Redis.
|
||||
|
||||
**Cause:** ServiceEntry not configured or incorrect.
|
||||
|
||||
**Solution:**
|
||||
1. Check ServiceEntry exists: `kubectl get serviceentry -n langsmith`
|
||||
2. Verify hostname matches: `kubectl describe serviceentry <name> -n langsmith`
|
||||
3. Check egress policies: `kubectl get authorizationpolicy -n langsmith`
|
||||
|
||||
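If `nc` is not available in the application image, the same reachability check can be done with Python's standard library. This is a minimal sketch: `can_connect` is a hypothetical helper, and it only confirms that a TCP connection opens; it does not authenticate or speak the database protocol.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts
        return False


if __name__ == "__main__":
    # Placeholder endpoint; substitute the real database host and port
    print(can_connect("db.example.internal", 5432))
```

A `False` result from inside the pod, combined with a `True` result from outside the mesh, points toward the ServiceEntry or egress configuration rather than the database itself.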
### High Latency After Injection
|
||||
|
||||
**Symptom:** Request latency increased after sidecar injection.
|
||||
|
||||
**Cause:** Normal sidecar overhead (10-50ms per request).
|
||||
|
||||
**Solution:** This is expected. If latency is excessive (>100ms), check:
|
||||
- Proxy resource limits
|
||||
- Network policies
|
||||
- mTLS overhead
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Start with namespace-level injection** for simplicity
|
||||
2. **Adjust health probe timeouts** after injection
|
||||
3. **Configure ServiceEntry** for all external dependencies
|
||||
4. **Monitor proxy resource usage** (CPU/memory)
|
||||
5. **Document container names** for log access
|
||||
6. **Test connectivity** after configuration changes
|
||||
7. **Use per-workload annotation** for selective injection
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Istio Documentation](https://istio.io/latest/docs/)
|
||||
- [Istio Sidecar Injection](https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/)
|
||||
- [Istio ServiceEntry](https://istio.io/latest/docs/reference/config/networking/service-entry/)
|
||||
- [Istio DestinationRule](https://istio.io/latest/docs/reference/config/networking/destination-rule/)
|
||||
|
||||
@@ -0,0 +1,185 @@
|
||||
# Support Escalation Template
|
||||
|
||||
**Use this template when escalating an incident to LangChain Support.**
|
||||
|
||||
Copy and fill in each section. Include the diagnostics bundle and any relevant evidence.
|
||||
|
||||
---
|
||||
|
||||
## Incident Summary
|
||||
|
||||
**Start Time:** `YYYY-MM-DD HH:MM:SS UTC`
|
||||
**Detection Time:** `YYYY-MM-DD HH:MM:SS UTC`
|
||||
**Current Status:** `[Investigating / Escalating / Resolved]`
|
||||
|
||||
**Brief Description:**
|
||||
```
|
||||
[One-sentence summary of the issue]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
**Who is impacted:**
|
||||
- [ ] All users
|
||||
- [ ] Specific user(s) or workspace(s)
|
||||
- [ ] Specific endpoints or features
|
||||
- [ ] Internal operations only
|
||||
|
||||
**What's broken:**
|
||||
- [ ] UI is unreachable or returns errors
|
||||
- [ ] API endpoints return 5xx errors
|
||||
- [ ] Traces are missing or delayed
|
||||
- [ ] Authentication/authorization failures
|
||||
- [ ] Ingestion is slow or failing
|
||||
- [ ] Other: `[describe]`
|
||||
|
||||
**Error messages observed:**
|
||||
```
|
||||
[Paste relevant error messages, redacting any secrets]
|
||||
```
|
||||
|
||||
**User-facing impact:**
|
||||
```
|
||||
[Describe what users experience]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Recent Changes
|
||||
|
||||
**Deployments/Releases:**
|
||||
- [ ] Helm upgrade/chart change: `[version/date]`
|
||||
- [ ] Configuration change: `[what changed]`
|
||||
- [ ] Infrastructure change: `[what changed]`
|
||||
- [ ] No recent changes
|
||||
|
||||
**Timeline:**
|
||||
```
|
||||
[Chronological list of changes leading up to the incident]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Environment Details
|
||||
|
||||
**Cloud Provider:** `[AWS / Azure / GCP / Other]`
|
||||
**Region/Location:** `[region]`
|
||||
**Kubernetes Service:** `[EKS / AKS / GKE / Other]`
|
||||
**Cluster Name:** `[cluster-name]`
|
||||
**Namespace:** `[namespace]`
|
||||
|
||||
**LangSmith Version:**
|
||||
- Helm Chart Version: `[version]`
|
||||
- Image Tags: `[if known]`
|
||||
- Deployment Method: `[Helm / kubectl / Other]`
|
||||
|
||||
**Infrastructure:**
|
||||
- PostgreSQL: `[RDS / Azure Database / In-cluster / Other]`
|
||||
- Redis: `[ElastiCache / Azure Cache / In-cluster / Other]`
|
||||
- ClickHouse: `[Managed / In-cluster]`
|
||||
- Blob Storage: `[S3 / Azure Blob / GCS / Other]`
|
||||
|
||||
---
|
||||
|
||||
## Diagnostics Bundle
|
||||
|
||||
**Bundle Location:** `[path or URL to diagnostics bundle]`
|
||||
|
||||
**Bundle Contents:**
|
||||
- [ ] Canonical diagnostics script output (`get_k8s_debugging_info.sh`)
|
||||
- [ ] `kubectl get all -o yaml` snapshot
|
||||
- [ ] Recent events (`kubectl get events`)
|
||||
- [ ] Pod logs (API, workers, ClickHouse)
|
||||
- [ ] Resource usage snapshot (`kubectl top pods/nodes`)
|
||||
- [ ] Ingress/load balancer configuration
|
||||
- [ ] Helm values (redacted)
|
||||
|
||||
**Bundle Timestamp:** `YYYY-MM-DD HH:MM:SS UTC`
|
||||
|
||||
---
|
||||
|
||||
## What We've Tried
|
||||
|
||||
**Investigation Steps:**
|
||||
1. `[What you checked and what you found]`
|
||||
2. `[Next step and result]`
|
||||
3. `[Continue as needed]`
|
||||
|
||||
**Remediation Attempts:**
|
||||
- [ ] Restarted pods: `[which pods, result]`
|
||||
- [ ] Checked external service connectivity: `[result]`
|
||||
- [ ] Verified configuration: `[result]`
|
||||
- [ ] Other: `[describe]`
|
||||
|
||||
**Current Hypothesis:**
|
||||
```
|
||||
[Your best guess at the root cause, with evidence]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Evidence & Logs
|
||||
|
||||
**Key Log Excerpts (redact secrets):**
|
||||
```
|
||||
[Paste relevant log lines with timestamps]
|
||||
```
|
||||
|
||||
**Error Patterns:**
|
||||
```
|
||||
[Describe patterns you've observed]
|
||||
```
|
||||
|
||||
**Metrics/Signals:**
|
||||
```
|
||||
[Any metrics or signals that indicate the issue]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Questions for Support
|
||||
|
||||
1. `[Your question]`
|
||||
2. `[Another question]`
|
||||
3. `[Continue as needed]`
|
||||
|
||||
---
|
||||
|
||||
## Additional Context
|
||||
|
||||
**Related Issues:**
|
||||
- Previous similar incidents: `[reference]`
|
||||
- Known limitations: `[describe]`
|
||||
- Custom configurations: `[describe, redact secrets]`
|
||||
|
||||
**Priority:**
|
||||
- [ ] Critical (service down, all users impacted)
|
||||
- [ ] High (major feature broken, many users impacted)
|
||||
- [ ] Medium (degraded performance, some users impacted)
|
||||
- [ ] Low (minor issue, workaround available)
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
**What we need from Support:**
|
||||
- [ ] Root cause analysis
|
||||
- [ ] Remediation steps
|
||||
- [ ] Configuration guidance
|
||||
- [ ] Performance optimization
|
||||
- [ ] Other: `[describe]`
|
||||
|
||||
**Our availability:**
|
||||
- Timezone: `[timezone]`
|
||||
- Best time to contact: `[time range]`
|
||||
- Escalation contact: `[name/email]`
|
||||
|
||||
---
|
||||
|
||||
**Template Version:** 1.0
|
||||
**Last Updated:** `[date]`
|
||||
|
||||
**Note:** Always redact secrets, API keys, passwords, and connection strings before sharing. Use `[REDACTED]` or similar markers.
|
||||
|
||||
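Redaction can be partly automated before logs or Helm values are pasted into the template. The following is a minimal sketch, not a complete scrubber: the `redact` function and its two patterns are assumptions chosen for illustration, they cover only `key=value` style secrets and credentials embedded in URLs, and the output still needs a manual review.

```python
import re

# key=value or key: value pairs whose key mentions a secret-like word
KEY_VALUE = re.compile(
    r"(?i)\b([\w-]*(?:password|secret|token|api[_-]?key)[\w-]*)(\s*[=:]\s*)(\S+)"
)
# scheme://user:password@host style connection strings
URL_CREDS = re.compile(r"(?i)\b([a-z][a-z0-9+.-]*://)([^\s/@:]+):([^\s/@]+)@")


def redact(text):
    """Replace likely secrets in text with the [REDACTED] marker."""
    text = URL_CREDS.sub(r"\1\2:[REDACTED]@", text)
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    return text


if __name__ == "__main__":
    print(redact("OIDC_CLIENT_SECRET=abc123"))
    print(redact("postgresql://user:hunter2@db.example.com:5432/app"))
```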
@@ -35,3 +35,40 @@ VALUES_FILE="./helm/langsmith-values/values.aws-demo.yaml"
|
||||
ARTIFACTS_DIR="./artifacts"
|
||||
LOG_LEVEL="info" # info|debug
|
||||
DRY_RUN="true" # true by default; notebooks should flip this explicitly when applying
|
||||
|
||||
# ===== OIDC SSO Configuration (Module 2) =====
|
||||
# Required: Get these values from your IdP team
|
||||
|
||||
# LangSmith domain (must match your ingress domain)
|
||||
LANGSMITH_DOMAIN="langsmith.example.com"
|
||||
|
||||
# OIDC Configuration (required)
|
||||
OIDC_ISSUER="https://your-org.okta.com/oauth2/default" # IdP issuer URL
|
||||
OIDC_CLIENT_ID="your-client-id" # OAuth2 client ID (public)
|
||||
OIDC_CLIENT_SECRET="your-client-secret" # OAuth2 client secret (store in K8s secret, never commit)
|
||||
OIDC_REDIRECT_URI="https://langsmith.example.com/auth/callback" # Must match EXACTLY in IdP whitelist
|
||||
|
||||
# OIDC Scopes (optional, defaults shown)
|
||||
OIDC_SCOPES="openid,email,profile,groups" # Include 'groups' for group-based role mapping
|
||||
|
||||
# Claim Mappings (optional, defaults shown)
|
||||
OIDC_EMAIL_CLAIM="email" # Claim name for user email (required)
|
||||
OIDC_NAME_CLAIM="name" # Claim name for user display name (optional)
|
||||
OIDC_GROUPS_CLAIM="groups" # Claim name for group membership (optional, for role mapping)
|
||||
|
||||
# ===== SAML SSO Configuration (Module 2 - Alternative) =====
|
||||
# Use SAML if your IdP doesn't support OIDC or enterprise policy requires SAML
|
||||
|
||||
# SAML_METADATA_URL="https://your-idp.com/saml/metadata" # Preferred: metadata URL
|
||||
# SAML_METADATA_FILE="/path/to/metadata.xml" # Alternative: metadata file path
|
||||
# SAML_ENTITY_ID="https://langsmith.example.com" # Optional: entity ID
|
||||
# SAML_EMAIL_ATTRIBUTE="email" # Optional: email attribute name
|
||||
# SAML_NAME_ATTRIBUTE="name" # Optional: name attribute name
|
||||
# SAML_GROUPS_ATTRIBUTE="groups" # Optional: groups attribute name
|
||||
|
||||
# ===== Notes =====
|
||||
# 1. OIDC_CLIENT_SECRET should be stored in Kubernetes secret, not in this file
|
||||
# 2. Redirect URI must match EXACTLY (case, trailing slashes, protocol)
|
||||
# 3. IdP team must whitelist the redirect URI
|
||||
# 4. For production, use HTTPS for all URLs
|
||||
# 5. See docs/modules/module-2.md for complete configuration guide
|
||||
|
||||
@@ -15,7 +15,7 @@ AWS_REGION="us-east-1"
|
||||
AWS_ACCOUNT_ID=""
|
||||
|
||||
# Naming (used by notebooks for display + validation)
|
||||
CLUSTER_NAME="langsmith-workshop"
|
||||
#CLUSTER_NAME=""
|
||||
|
||||
# Local repo paths (absolute is safest)
|
||||
TERRAFORM_REPO_DIR="$HOME/src/langchain-ai/terraform"
|
||||
|
||||
@@ -1,589 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 1: Preflight Checks\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This notebook validates your environment before deploying LangSmith. Most self-hosted failures occur **before** users ever touch the product due to:\n",
|
||||
"\n",
|
||||
"- Mis-sized clusters\n",
|
||||
"- Unsupported ingress setups\n",
|
||||
"- In-cluster databases used past their limits\n",
|
||||
"- Missing storage primitives (blob, PVs)\n",
|
||||
"\n",
|
||||
"This preflight ensures you start from a **supported baseline**.\n",
|
||||
"\n",
|
||||
"## What We'll Check\n",
|
||||
"\n",
|
||||
"1. ✅ Tooling validation (cloud CLI, terraform, kubectl, helm, jq)\n",
|
||||
"2. ✅ Cloud provider credentials & region sanity check\n",
|
||||
"3. ✅ Cluster capacity expectations\n",
|
||||
"4. ✅ Storage prerequisites (CSI drivers, StorageClasses)\n",
|
||||
"5. ✅ Blob storage requirement (cloud object storage)\n",
|
||||
"\n",
|
||||
"**Estimated time:** 20-30 minutes\n",
|
||||
"\n",
|
||||
"**Supported Cloud Providers:** AWS, Azure (GCP coming soon)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so we can import shared as a package\n",
|
||||
"# Find the notebooks directory by looking for the shared folder\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent, # If cwd is module-1, go up one level to notebooks\n",
|
||||
" Path.cwd(), # If cwd is already notebooks\n",
|
||||
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" # Fallback: try workspace root\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"\n",
|
||||
"# Run bootstrap: loads env, checks tools, validates AWS, creates artifacts dir\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"print(f\"\\nBootstrap complete! Artifacts directory: {bootstrap_info['artifacts_dir']}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Cloud Provider Account & Region Validation\n",
|
||||
"\n",
|
||||
"Verify you're using the correct cloud provider account/subscription and region. This is critical for avoiding accidental deployments to production or wrong regions.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import json\n",
|
||||
"from shared._cloud_helpers import (\n",
|
||||
" get_cloud_provider,\n",
|
||||
" get_region,\n",
|
||||
" get_identity,\n",
|
||||
" assert_account,\n",
|
||||
")\n",
|
||||
"from shared._validation import require_env, print_config, ok, warn\n",
|
||||
"\n",
|
||||
"# Get cloud configuration\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"### Current {provider_display} Session\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" print(f\"Account ID: {identity['Account']}\")\n",
|
||||
" print(f\"User ARN: {identity['Arn']}\")\n",
|
||||
" account_var = \"AWS_ACCOUNT_ID\"\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"\")\n",
|
||||
" subscription_name = identity.get(\"SubscriptionName\", \"\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
" print(f\"Subscription Name: {subscription_name}\")\n",
|
||||
" account_var = \"AZURE_SUBSCRIPTION_ID\"\n",
|
||||
"else:\n",
|
||||
" account_var = None\n",
|
||||
"\n",
|
||||
"# Optional: Validate against expected account/subscription\n",
|
||||
"if account_var:\n",
|
||||
" expected_account = os.environ.get(account_var, \"\").strip()\n",
|
||||
" if expected_account:\n",
|
||||
" assert_account(expected_account)\n",
|
||||
" else:\n",
|
||||
" warn(f\"{account_var} not set in environment - skipping account validation\")\n",
|
||||
" print(f\"💡 Tip: Set {account_var} in your .env file to add a guardrail against wrong account deployments\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Required Environment Variables\n",
|
||||
"\n",
|
||||
"Verify that all required configuration is present. These values will be used throughout the deployment.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check required environment variables\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"\n",
|
||||
"# Base required vars (cloud-agnostic)\n",
|
||||
"required_vars = [\n",
|
||||
" \"WORKSHOP_NAME\",\n",
|
||||
" \"NAMESPACE\",\n",
|
||||
" \"CLUSTER_NAME\",\n",
|
||||
" \"TERRAFORM_DIR\",\n",
|
||||
" \"HELM_RELEASE\",\n",
|
||||
" \"HELM_NAMESPACE\",\n",
|
||||
" \"HELM_CHART_REF\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"# Add cloud-specific required vars\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" required_vars.append(\"AWS_REGION\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" required_vars.append(\"AZURE_LOCATION\")\n",
|
||||
"\n",
|
||||
"config = require_env(*required_vars)\n",
|
||||
"\n",
|
||||
"# Optional but recommended (cloud-specific)\n",
|
||||
"optional_vars = {}\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" optional_vars = {\n",
|
||||
" \"AWS_PROFILE\": os.environ.get(\"AWS_PROFILE\", \"\"),\n",
|
||||
" \"AWS_ACCOUNT_ID\": os.environ.get(\"AWS_ACCOUNT_ID\", \"\"),\n",
|
||||
" \"VALUES_FILE\": os.environ.get(\"VALUES_FILE\", \"\"),\n",
|
||||
" }\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" optional_vars = {\n",
|
||||
" \"AZURE_SUBSCRIPTION_ID\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\", \"\"),\n",
|
||||
" \"AZURE_RESOURCE_GROUP\": os.environ.get(\"AZURE_RESOURCE_GROUP\", \"\"),\n",
|
||||
" \"VALUES_FILE\": os.environ.get(\"VALUES_FILE\", \"\"),\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
"print(\"\\n### Configuration Summary\")\n",
|
||||
"print(f\"Cloud Provider: {provider.upper()}\")\n",
|
||||
"print_config(config, redact_keys={\"AWS_PROFILE\"})\n",
|
||||
"print(\"\\n### Optional Configuration\")\n",
|
||||
"for k, v in optional_vars.items():\n",
|
||||
" if v:\n",
|
||||
" print(f\"- {k}: {v}\")\n",
|
||||
" else:\n",
|
||||
" print(f\"- {k}: (not set)\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Cluster Capacity Expectations\n",
|
||||
"\n",
|
||||
"LangSmith requires adequate cluster resources. Before deploying, understand what you'll need:\n",
|
||||
"\n",
|
||||
"- **Minimum:** 3 nodes, 4 vCPU, 16GB RAM each (for development/testing)\n",
|
||||
"- **Recommended:** 3 nodes, 8 vCPU, 32GB RAM each (for production workloads)\n",
|
||||
"- **Storage:** EBS CSI driver required for ClickHouse PVCs\n",
|
||||
"\n",
|
||||
"Let's check if a cluster already exists and validate its configuration.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._aws_helpers import eks_cluster_exists\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
|
||||
"region = aws_region()\n",
|
||||
"\n",
|
||||
"print(f\"### Checking EKS Cluster: {cluster_name}\")\n",
|
||||
"print(f\"Region: {region}\\n\")\n",
|
||||
"\n",
|
||||
"if eks_cluster_exists(cluster_name):\n",
|
||||
" ok(f\"Cluster '{cluster_name}' exists\")\n",
|
||||
" \n",
|
||||
" # Get cluster details\n",
|
||||
" result = run(\n",
|
||||
" [\"aws\", \"eks\", \"describe-cluster\", \"--name\", cluster_name, \"--region\", region, \"--output\", \"json\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" cluster_info = json.loads(result.stdout)[\"cluster\"]\n",
|
||||
" \n",
|
||||
" print(f\"\\nCluster Status: {cluster_info['status']}\")\n",
|
||||
" print(f\"Kubernetes Version: {cluster_info['version']}\")\n",
|
||||
" print(f\"Platform Version: {cluster_info.get('platformVersion', 'N/A')}\")\n",
|
||||
" \n",
|
||||
" # Check node groups\n",
|
||||
" print(\"\\n### Node Groups\")\n",
|
||||
" ng_result = run(\n",
|
||||
" [\"aws\", \"eks\", \"list-nodegroups\", \"--cluster-name\", cluster_name, \"--region\", region, \"--output\", \"json\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" nodegroups = json.loads(ng_result.stdout).get(\"nodegroups\", [])\n",
|
||||
" \n",
|
||||
" if nodegroups:\n",
|
||||
" for ng in nodegroups:\n",
|
||||
" ng_detail = run(\n",
|
||||
" [\"aws\", \"eks\", \"describe-nodegroup\", \"--cluster-name\", cluster_name, \n",
|
||||
" \"--nodegroup-name\", ng, \"--region\", region, \"--output\", \"json\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" ng_info = json.loads(ng_detail.stdout)[\"nodegroup\"]\n",
|
||||
" scaling = ng_info.get(\"scalingConfig\", {})\n",
|
||||
" print(f\"\\n Node Group: {ng}\")\n",
|
||||
" print(f\" Status: {ng_info['status']}\")\n",
|
||||
" print(f\" Desired: {scaling.get('desiredSize', 'N/A')}\")\n",
|
||||
" print(f\" Min: {scaling.get('minSize', 'N/A')}\")\n",
|
||||
" print(f\" Max: {scaling.get('maxSize', 'N/A')}\")\n",
|
||||
" print(f\" Instance Types: {', '.join(ng_info.get('instanceTypes', []))}\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No node groups found\")\n",
|
||||
" print(\"💡 You'll need to create node groups when deploying with Terraform\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Cluster '{cluster_name}' does not exist yet\")\n",
|
||||
" print(\"💡 This is expected if you haven't run Terraform yet. Proceed to notebook 02_terraform_apply.ipynb\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Storage Prerequisites\n",
|
||||
"\n",
|
||||
"LangSmith requires persistent storage for ClickHouse. The cloud storage CSI driver must be installed and StorageClasses must be configured.\n",
|
||||
"\n",
|
||||
"**Why this matters:** Without the appropriate CSI driver, ClickHouse PVCs will remain in `Pending` state forever.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check if kubectl is configured for the cluster\n",
|
||||
"from shared._cloud_helpers import (\n",
|
||||
" get_cloud_provider,\n",
|
||||
" get_region,\n",
|
||||
" configure_kubectl,\n",
|
||||
" get_storage_driver_name,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
|
||||
"region = get_region()\n",
|
||||
"storage_driver = get_storage_driver_name()\n",
|
||||
"\n",
|
||||
"k8s_service = \"EKS\" if provider == \"aws\" else \"AKS\" if provider == \"azure\" else \"Kubernetes\"\n",
|
||||
"print(f\"### Configuring kubectl for {k8s_service} cluster\")\n",
|
||||
"try:\n",
|
||||
" # Configure kubectl (cloud-agnostic)\n",
|
||||
" configure_kubectl(cluster_name, region)\n",
|
||||
" ok(\"kubectl configured for cluster\")\n",
|
||||
" \n",
|
||||
" # Check CSI driver (cloud-specific labels)\n",
|
||||
" print(f\"\\n### Checking {storage_driver} Driver\")\n",
|
||||
" \n",
|
||||
" if provider == \"aws\":\n",
|
||||
" driver_label = \"app=ebs-csi-controller\"\n",
|
||||
" driver_name = \"EBS CSI\"\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" driver_label = \"app=csi-azuredisk-controller\"\n",
|
||||
" driver_name = \"Azure Disk CSI\"\n",
|
||||
" else:\n",
|
||||
" driver_label = None\n",
|
||||
" driver_name = \"Storage CSI\"\n",
|
||||
" \n",
|
||||
" if driver_label:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"daemonset\", \"-n\", \"kube-system\", \"-l\", driver_label, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" ds_info = json.loads(result.stdout)\n",
|
||||
" if ds_info.get(\"items\"):\n",
|
||||
" ok(f\"{driver_name} driver is installed\")\n",
|
||||
" print(f\" DaemonSet: {ds_info['items'][0]['metadata']['name']}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"{driver_name} driver not found\")\n",
|
||||
" print(f\"💡 {driver_name} driver must be installed before deploying LangSmith\")\n",
|
||||
" print(\" The Terraform module should handle this, but verify after deployment\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"{driver_name} driver not found\")\n",
|
||||
" print(f\"💡 {driver_name} driver must be installed before deploying LangSmith\")\n",
|
||||
" \n",
|
||||
" # Check StorageClasses\n",
|
||||
" print(\"\\n### Checking StorageClasses\")\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"storageclass\", \"-o\", \"json\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" sc_list = json.loads(result.stdout)\n",
|
||||
" \n",
|
||||
" # Find cloud-specific storage classes\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" storage_scs = [sc for sc in sc_list.get(\"items\", []) if \"ebs\" in sc[\"metadata\"][\"name\"].lower() or \n",
|
||||
" sc.get(\"provisioner\", \"\").endswith(\"ebs.csi.aws.com\")]\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" storage_scs = [sc for sc in sc_list.get(\"items\", []) if \"disk\" in sc[\"metadata\"][\"name\"].lower() or \n",
|
||||
" sc.get(\"provisioner\", \"\").endswith(\"disk.csi.azure.com\")]\n",
|
||||
" else:\n",
|
||||
" storage_scs = []\n",
|
||||
" \n",
|
||||
" if storage_scs:\n",
|
||||
" ok(f\"Found {len(storage_scs)} {storage_driver} StorageClass(es):\")\n",
|
||||
" for sc in storage_scs:\n",
|
||||
" name = sc[\"metadata\"][\"name\"]\n",
|
||||
" default = sc.get(\"metadata\", {}).get(\"annotations\", {}).get(\"storageclass.kubernetes.io/is-default-class\", \"false\")\n",
|
||||
" print(f\" - {name} (default: {default})\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"No {storage_driver} StorageClasses found\")\n",
|
||||
" print(f\"💡 At least one {storage_driver} StorageClass is required for ClickHouse PVCs\")\n",
|
||||
" \n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not check storage prerequisites: {e}\")\n",
|
||||
" print(\"💡 This is expected if the cluster doesn't exist yet\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Blob Storage Requirement\n",
|
||||
"\n",
|
||||
"**Critical:** LangSmith requires cloud object storage (S3, Blob Storage, etc.) for blob storage in production. Inline trace payloads will explode ClickHouse if blob storage is not configured.\n",
|
||||
"\n",
|
||||
"Let's verify access to your cloud provider's object storage service and check if a storage account/bucket exists or needs to be created.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._cloud_helpers import (\n",
|
||||
" get_cloud_provider,\n",
|
||||
" get_region,\n",
|
||||
" get_blob_storage_service_name,\n",
|
||||
" verify_blob_storage_access,\n",
|
||||
")\n",
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"blob_service = get_blob_storage_service_name()\n",
|
||||
"\n",
|
||||
"print(f\"### {blob_service} Access Check\")\n",
|
||||
"print(f\"Cloud Provider: {provider.upper()}\")\n",
|
||||
"print(f\"Region: {region}\\n\")\n",
|
||||
"\n",
|
||||
"# Test blob storage access\n",
|
||||
"try:\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" result = run(\n",
|
||||
" [\"aws\", \"s3\", \"ls\", \"--region\", region],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" ok(f\"{blob_service} access verified\")\n",
|
||||
" \n",
|
||||
" # List buckets\n",
|
||||
" buckets_result = run(\n",
|
||||
" [\"aws\", \"s3api\", \"list-buckets\", \"--output\", \"json\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" buckets = json.loads(buckets_result.stdout).get(\"Buckets\", [])\n",
|
||||
" \n",
|
||||
" print(f\"\\nFound {len(buckets)} S3 bucket(s):\")\n",
|
||||
" for bucket in buckets[:10]: # Show first 10\n",
|
||||
" print(f\" - {bucket['Name']} (created: {bucket['CreationDate']})\")\n",
|
||||
" \n",
|
||||
" if len(buckets) > 10:\n",
|
||||
" print(f\" ... and {len(buckets) - 10} more\")\n",
|
||||
" \n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" result = run(\n",
|
||||
" [\"az\", \"storage\", \"account\", \"list\", \"--output\", \"json\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" ok(f\"{blob_service} access verified\")\n",
|
||||
" \n",
|
||||
" # List storage accounts\n",
|
||||
" accounts = json.loads(result.stdout)\n",
|
||||
" \n",
|
||||
" print(f\"\\nFound {len(accounts)} Storage Account(s):\")\n",
|
||||
" for account in accounts[:10]: # Show first 10\n",
|
||||
" name = account.get(\"name\", \"N/A\")\n",
|
||||
" location = account.get(\"location\", \"N/A\")\n",
|
||||
" print(f\" - {name} (location: {location})\")\n",
|
||||
" \n",
|
||||
" if len(accounts) > 10:\n",
|
||||
" print(f\" ... and {len(accounts) - 10} more\")\n",
|
||||
" \n",
|
||||
" print(f\"\\n💡 Note: The Terraform module should create a {blob_service} resource for LangSmith blob storage\")\n",
|
||||
" print(\" Verify the resource exists after Terraform deployment\")\n",
|
||||
" \n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"{blob_service} access check failed: {e}\")\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" print(\"💡 Ensure your AWS credentials have S3 permissions\")\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" print(\"💡 Ensure your Azure credentials have Storage Account permissions\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Terraform & Helm Repository Paths\n",
|
||||
"\n",
|
||||
"Verify that the Terraform and Helm repository paths are correctly configured and accessible.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"from pathlib import Path\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"def expand_env_vars(path_str: str) -> str:\n",
|
||||
" \"\"\"Expand environment variable references in a path string.\"\"\"\n",
|
||||
" # Expand $VAR and ${VAR} references\n",
|
||||
" def replace_var(match):\n",
|
||||
" var_name = match.group(1) or match.group(2)\n",
|
||||
" return os.environ.get(var_name, match.group(0))\n",
|
||||
" \n",
|
||||
" # Replace $VAR and ${VAR} patterns\n",
|
||||
" path_str = re.sub(r'\\$\\{([^}]+)\\}|\\$([a-zA-Z_][a-zA-Z0-9_]*)', replace_var, path_str)\n",
|
||||
" return path_str\n",
|
||||
"\n",
|
||||
"# Expand environment variables in paths (e.g., $TERRAFORM_REPO_DIR, $HELM_REPO_DIR, $HOME)\n",
|
||||
"terraform_dir_str = expand_env_vars(os.environ[\"TERRAFORM_DIR\"])\n",
|
||||
"terraform_dir = Path(terraform_dir_str).expanduser().resolve()\n",
|
||||
"\n",
|
||||
"helm_chart_ref_str = expand_env_vars(os.environ[\"HELM_CHART_REF\"])\n",
|
||||
"helm_chart_ref = Path(helm_chart_ref_str).expanduser().resolve()\n",
|
||||
"\n",
|
||||
"print(\"### Repository Paths Check\\n\")\n",
|
||||
"\n",
|
||||
"# Check Terraform directory\n",
|
||||
"print(f\"Terraform Directory: {terraform_dir}\")\n",
|
||||
"if terraform_dir.exists():\n",
|
||||
" ok(f\"Terraform directory exists\")\n",
|
||||
" \n",
|
||||
" # Check for main.tf or similar\n",
|
||||
" tf_files = list(terraform_dir.glob(\"*.tf\"))\n",
|
||||
" if tf_files:\n",
|
||||
" print(f\" Found {len(tf_files)} Terraform file(s)\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No .tf files found in Terraform directory\")\n",
|
||||
" print(\"💡 Ensure you're pointing to the correct Terraform module path\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Terraform directory does not exist: {terraform_dir}\")\n",
|
||||
" print(\"💡 Update TERRAFORM_DIR in your .env file to point to the langchain-ai/terraform repo\")\n",
|
||||
"\n",
|
||||
"# Check Helm chart\n",
|
||||
"print(f\"\\nHelm Chart Reference: {helm_chart_ref}\")\n",
|
||||
"if helm_chart_ref.exists():\n",
|
||||
" ok(f\"Helm chart path exists\")\n",
|
||||
" \n",
|
||||
" # Check for Chart.yaml\n",
|
||||
" chart_yaml = helm_chart_ref / \"Chart.yaml\"\n",
|
||||
" if chart_yaml.exists():\n",
|
||||
" print(f\" Found Chart.yaml\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Chart.yaml not found\")\n",
|
||||
" print(\"💡 Ensure you're pointing to the correct Helm chart path\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Helm chart path does not exist: {helm_chart_ref}\")\n",
|
||||
" print(\"💡 Update HELM_CHART_REF in your .env file to point to the langchain-ai/helm chart\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Preflight Summary\n",
|
||||
"\n",
|
||||
"Review the checklist below. All items should be ✅ before proceeding to Terraform deployment.\n",
|
||||
"\n",
|
||||
"### ✅ Checklist\n",
|
||||
"\n",
|
||||
"- [ ] All required tools installed (cloud CLI, terraform, kubectl, helm, jq)\n",
|
||||
"- [ ] Cloud provider credentials valid and correct account/subscription/region\n",
|
||||
"- [ ] Required environment variables set\n",
|
||||
"- [ ] Terraform directory path correct\n",
|
||||
"- [ ] Helm chart path correct\n",
|
||||
"- [ ] Blob storage access verified (S3/Blob Storage)\n",
|
||||
"- [ ] (If cluster exists) Storage CSI driver installed\n",
|
||||
"- [ ] (If cluster exists) StorageClasses configured\n",
|
||||
"\n",
|
||||
"### Next Steps\n",
|
||||
"\n",
|
||||
"If all checks pass, proceed to **02_terraform_apply.ipynb** to deploy the infrastructure.\n",
|
||||
"\n",
|
||||
"If any checks failed, review the warnings above and fix the issues before continuing.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.14.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
@@ -129,6 +129,77 @@
|
||||
" print(f\"💡 Tip: Set {account_var} in your .env file to add a guardrail against wrong account deployments\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Workshop Identifier Setup\n",
|
||||
"\n",
|
||||
"To ensure unique resource names and enable idempotent deployments, we need a unique identifier for your workshop deployment. This identifier will be used for all Terraform resources.\n",
|
||||
"\n",
|
||||
"**We'll use your email address** (hashed for privacy) to create a deterministic identifier that:\n",
|
||||
"- ✅ Stays the same across notebook runs (idempotent)\n",
|
||||
"- ✅ Is unique per student\n",
|
||||
"- ✅ Works with the date-based prefix for resource naming\n",
|
||||
"\n",
|
||||
"Enter your email address below. It will be hashed and used to generate your unique workshop identifier.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Generate deterministic workshop identifier from email\n",
|
||||
"import hashlib\n",
|
||||
"import json\n",
|
||||
"from datetime import date\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"print(\"### Workshop Identifier Setup\\n\")\n",
|
||||
"print(\"Enter your email address to generate a unique, deterministic identifier for your deployment.\\n\")\n",
|
||||
"print(\"This identifier will be used for all Terraform resources and ensures:\")\n",
|
||||
"print(\" - Same email = same identifier (idempotent)\")\n",
|
||||
"print(\" - Different emails = different identifiers (unique)\")\n",
|
||||
"print(\" - No additional environment variables needed\\n\")\n",
|
||||
"\n",
|
||||
"# Prompt for email (using input() - works in Jupyter)\n",
|
||||
"email = input(\"Enter your email address: \").strip().lower()\n",
|
||||
"\n",
|
||||
"if not email or \"@\" not in email:\n",
|
||||
" raise ValueError(\"Invalid email address. Please enter a valid email.\")\n",
|
||||
"\n",
|
||||
"# Hash email for privacy and determinism\n",
|
||||
"email_hash = hashlib.md5(email.encode()).hexdigest()[:6]\n",
|
||||
"\n",
|
||||
"# Build identifier: -workshop-YYYYMMDD-<hash>\n",
|
||||
"today = date.today()\n",
|
||||
"date_str = today.strftime('%Y%m%d')\n",
|
||||
"workshop_identifier = f\"-workshop-{date_str}-{email_hash}\"\n",
|
||||
"\n",
|
||||
"# Save to artifacts directory for use in Terraform notebook\n",
|
||||
"identifier_file = Path(bootstrap_info['artifacts_dir']) / \"workshop_identifier.json\"\n",
|
||||
"identifier_data = {\n",
|
||||
" \"email_hash\": email_hash,\n",
|
||||
" \"identifier\": workshop_identifier,\n",
|
||||
" \"date\": date_str,\n",
|
||||
" \"created_at\": date.today().isoformat()\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"with open(identifier_file, 'w') as f:\n",
|
||||
" json.dump(identifier_data, f, indent=2)\n",
|
||||
"\n",
|
||||
"print(f\"\\n✅ Workshop identifier generated:\")\n",
|
||||
"print(f\" Identifier: {workshop_identifier}\")\n",
|
||||
"print(f\" Date component: {date_str}\")\n",
|
||||
"print(f\" Hash (from email): {email_hash}\")\n",
|
||||
"print(f\"\\n💡 This identifier will be used for all Terraform resources\")\n",
|
||||
"print(f\" Saved to: {identifier_file}\")\n",
|
||||
"print(f\"\\n⚠️ IMPORTANT: Use the same email address if you re-run this notebook\")\n",
|
||||
"print(f\" to ensure Terraform can manage existing resources.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
@@ -153,7 +224,6 @@
|
||||
"required_vars = [\n",
|
||||
" \"WORKSHOP_NAME\",\n",
|
||||
" \"NAMESPACE\",\n",
|
||||
" \"CLUSTER_NAME\",\n",
|
||||
" \"TERRAFORM_DIR\",\n",
|
||||
" \"HELM_RELEASE\",\n",
|
||||
" \"HELM_NAMESPACE\",\n",
|
||||
@@ -222,9 +292,27 @@
|
||||
" get_kubernetes_service_name,\n",
|
||||
")\n",
|
||||
"from shared._shell import run\n",
|
||||
"from shared._validation import ok, warn, require_env\n",
|
||||
"from pathlib import Path\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"# Load workshop identifier if it exists (from identifier setup cell)\n",
|
||||
"identifier_file = Path(bootstrap_info['artifacts_dir']) / \"workshop_identifier.json\"\n",
|
||||
"if identifier_file.exists():\n",
|
||||
" with open(identifier_file) as f:\n",
|
||||
" identifier_data = json.load(f)\n",
|
||||
" workshop_identifier = identifier_data[\"identifier\"]\n",
|
||||
" # Compute expected cluster name: langsmith-eks${identifier}\n",
|
||||
" cluster_name = f\"langsmith-eks{workshop_identifier}\"\n",
|
||||
" print(f\"💡 Using cluster name from workshop identifier: {cluster_name}\\n\")\n",
|
||||
"else:\n",
|
||||
" # Fallback to CLUSTER_NAME env var if identifier not set yet\n",
|
||||
" config = require_env(\"CLUSTER_NAME\")\n",
|
||||
" cluster_name = config[\"CLUSTER_NAME\"]\n",
|
||||
" warn(\"Workshop identifier not found - using CLUSTER_NAME from environment\")\n",
|
||||
" print(\"💡 Run the 'Workshop Identifier Setup' cell above to generate a unique identifier\\n\")\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"cluster_name = os.environ[\"CLUSTER_NAME\"]\n",
|
||||
"region = get_region()\n",
|
||||
"k8s_service = get_kubernetes_service_name()\n",
|
||||
"\n",
|
||||
|
||||
@@ -97,6 +97,7 @@
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import re\n",
|
||||
"import json\n",
|
||||
"from pathlib import Path\n",
|
||||
"from shared._validation import require_env, ok, warn, fail\n",
|
||||
"from shared._shell import run\n",
|
||||
@@ -131,15 +132,41 @@
|
||||
"terraform_dir_str = expand_env_vars(config[\"TERRAFORM_DIR\"])\n",
|
||||
"terraform_dir = Path(terraform_dir_str).expanduser().resolve()\n",
|
||||
"\n",
|
||||
"cluster_name = config[\"CLUSTER_NAME\"]\n",
|
||||
"# Load workshop identifier from preflight notebook\n",
|
||||
"identifier_file = artifacts_dir / \"workshop_identifier.json\"\n",
|
||||
"if not identifier_file.exists():\n",
|
||||
" fail(f\"Workshop identifier not found: {identifier_file}\")\n",
|
||||
" print(\"\\n💡 To fix this:\")\n",
|
||||
" print(\" 1. Run the preflight notebook (01_preflight.ipynb) first\")\n",
|
||||
" print(\" 2. Complete the 'Workshop Identifier Setup' cell\")\n",
|
||||
" print(\" 3. Then return to this notebook\")\n",
|
||||
" raise RuntimeError(f\"Workshop identifier not found. Please run 01_preflight.ipynb first.\")\n",
|
||||
"\n",
|
||||
"with open(identifier_file) as f:\n",
|
||||
" identifier_data = json.load(f)\n",
|
||||
"\n",
|
||||
"workshop_identifier = identifier_data[\"identifier\"]\n",
|
||||
"print(f\"✅ Loaded workshop identifier: {workshop_identifier}\")\n",
|
||||
"\n",
|
||||
"# Compute expected cluster name for validation/display\n",
|
||||
"# Terraform computes: cluster_name = \"langsmith-eks${local.identifier}\"\n",
|
||||
"cluster_name = f\"langsmith-eks{workshop_identifier}\"\n",
|
||||
"\n",
|
||||
"region = config[region_var]\n",
|
||||
"workshop_name = config[\"WORKSHOP_NAME\"]\n",
|
||||
"\n",
|
||||
"print(\"### Terraform Configuration\")\n",
|
||||
"print(\"\\n### Terraform Configuration\")\n",
|
||||
"print(f\"Terraform Directory: {terraform_dir}\")\n",
|
||||
"print(f\"Cluster Name: {cluster_name}\")\n",
|
||||
"print(f\"Workshop Identifier: {workshop_identifier}\")\n",
|
||||
"print(f\"Expected Cluster Name: {cluster_name}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"print(f\"Workshop Name: {workshop_name}\\n\")\n",
|
||||
"print(f\"Workshop Name: {workshop_name}\")\n",
|
||||
"print(f\"\\n💡 Terraform will use this identifier for all resource names:\")\n",
|
||||
"print(f\" Cluster: langsmith-eks{workshop_identifier}\")\n",
|
||||
"print(f\" Redis: langsmith-redis{workshop_identifier}\")\n",
|
||||
"print(f\" S3: langsmith-s3{workshop_identifier}\")\n",
|
||||
"print(f\" Postgres: langsmith-postgres{workshop_identifier}\")\n",
|
||||
"print(f\" VPC: langsmith-vpc{workshop_identifier}\\n\")\n",
|
||||
"\n",
|
||||
"if not terraform_dir.exists():\n",
|
||||
" fail(f\"Terraform directory does not exist: {terraform_dir}\")\n",
|
||||
@@ -371,6 +398,8 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import getpass\n",
|
||||
"\n",
|
||||
"# Create terraform plan\n",
|
||||
"plan_file = artifacts_dir / \"terraform-plan.txt\"\n",
|
||||
"\n",
|
||||
@@ -383,7 +412,22 @@
|
||||
"postgres_username = os.environ.get(\"POSTGRES_USERNAME\", \"\").strip()\n",
|
||||
"postgres_password = os.environ.get(\"POSTGRES_PASSWORD\", \"\").strip()\n",
|
||||
"\n",
|
||||
"if not postgres_username:\n",
|
||||
" print(\"Please provide a PostgreSQL username: \")\n",
|
||||
" postgres_username = input().strip()\n",
|
||||
"\n",
|
||||
"if not postgres_password:\n",
|
||||
" print(\"Please provide a PostgreSQL password: \")\n",
|
||||
" postgres_password = getpass.getpass().strip()\n",
|
||||
"\n",
|
||||
"print(\"### Terraform Variables\\n\")\n",
|
||||
"\n",
|
||||
"# Pass workshop identifier to Terraform\n",
|
||||
"# This is the key variable that controls all resource naming\n",
|
||||
"terraform_vars.extend([\"-var\", f\"identifier={workshop_identifier}\"])\n",
|
||||
"print(f\"✅ IDENTIFIER: {workshop_identifier}\")\n",
|
||||
"print(f\" This will be used for all resource names (cluster, redis, s3, postgres, vpc)\\n\")\n",
|
||||
"\n",
|
||||
"missing_vars = []\n",
|
||||
"\n",
|
||||
"if postgres_username:\n",
|
||||
@@ -439,6 +483,232 @@
|
||||
" print(\"💡 Review the errors above before proceeding\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Pre-Apply Safety Check\n",
|
||||
"\n",
|
||||
"**⚠️ CRITICAL:** Before applying Terraform, verify that resources don't already exist. This prevents accidentally modifying or overwriting existing infrastructure.\n",
|
||||
"\n",
|
||||
"This check will:\n",
|
||||
"- Verify the cluster doesn't already exist (or warn if it does)\n",
|
||||
"- Check for existing RDS/Redis/S3 resources that might conflict\n",
|
||||
"- Require explicit confirmation if resources are found\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Pre-apply safety check: Verify resources don't already exist\n",
|
||||
"from shared._cloud_helpers import (\n",
|
||||
" get_cloud_provider,\n",
|
||||
" get_region,\n",
|
||||
" cluster_exists,\n",
|
||||
" get_kubernetes_service_name,\n",
|
||||
" get_database_service_name,\n",
|
||||
" get_cache_service_name,\n",
|
||||
" get_blob_storage_service_name,\n",
|
||||
")\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"k8s_service = get_kubernetes_service_name()\n",
|
||||
"\n",
|
||||
"print(\"### Pre-Apply Resource Existence Check\\n\")\n",
|
||||
"print(\"Checking for existing resources that might conflict...\\n\")\n",
|
||||
"\n",
|
||||
"existing_resources = []\n",
|
||||
"warnings = []\n",
|
||||
"\n",
|
||||
"# Check if cluster already exists\n",
|
||||
"if cluster_exists(cluster_name):\n",
|
||||
" existing_resources.append(f\"{k8s_service} cluster: {cluster_name}\")\n",
|
||||
" warn(f\"⚠️ Cluster '{cluster_name}' already exists!\")\n",
|
||||
" print(\" If you proceed with Terraform apply, Terraform may:\")\n",
|
||||
" print(\" - Fail because the cluster exists outside Terraform state, OR\")\n",
|
||||
" print(\" - Modify the existing cluster if it is already tracked in state\")\n",
|
||||
" print(\" This could cause unexpected changes to your existing infrastructure.\\n\")\n",
|
||||
" print(\" 💡 If this is intentional, ensure your Terraform configuration matches the existing cluster\")\n",
|
||||
" print(\" 💡 If this is NOT intentional, STOP and update CLUSTER_NAME in your .env file\")\n",
|
||||
"else:\n",
|
||||
" ok(f\"Cluster '{cluster_name}' does not exist (safe to create)\")\n",
|
||||
"\n",
|
||||
"# Check for existing RDS instances (AWS) or PostgreSQL servers (Azure)\n",
|
||||
"db_service = get_database_service_name()\n",
|
||||
"print(f\"\\n### Checking for Existing {db_service} Resources\\n\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" # Check for RDS instances that might match our naming pattern\n",
|
||||
" # We'll check for instances in the same region\n",
|
||||
" try:\n",
|
||||
" result = run(\n",
|
||||
" [\"aws\", \"rds\", \"describe-db-instances\", \"--region\", region, \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" rds_instances = json.loads(result.stdout).get(\"DBInstances\", [])\n",
|
||||
" # Check if any instance name might conflict (exact match or similar pattern)\n",
|
||||
" # Terraform typically uses cluster_name or workshop_name in resource names\n",
|
||||
" for instance in rds_instances:\n",
|
||||
" db_id = instance.get(\"DBInstanceIdentifier\", \"\")\n",
|
||||
" # Check if instance name contains cluster_name or workshop_name\n",
|
||||
" if cluster_name.lower() in db_id.lower() or workshop_name.lower() in db_id.lower():\n",
|
||||
" existing_resources.append(f\"RDS instance: {db_id}\")\n",
|
||||
" warnings.append(f\"Found RDS instance '{db_id}' that may conflict with Terraform resources\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not check for RDS instances: {e}\")\n",
|
||||
" print(\" 💡 This is OK - proceeding with caution\")\n",
|
||||
"\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" # Check for PostgreSQL servers\n",
|
||||
" try:\n",
|
||||
" result = run(\n",
|
||||
" [\"az\", \"postgres\", \"server\", \"list\", \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" postgres_servers = json.loads(result.stdout)\n",
|
||||
" for server in postgres_servers:\n",
|
||||
" server_name = server.get(\"name\", \"\")\n",
|
||||
" server_location = server.get(\"location\", \"\")\n",
|
||||
" # Check if server is in same location and name might conflict\n",
|
||||
" if server_location.lower() == region.lower():\n",
|
||||
" if cluster_name.lower() in server_name.lower() or workshop_name.lower() in server_name.lower():\n",
|
||||
" existing_resources.append(f\"PostgreSQL server: {server_name}\")\n",
|
||||
" warnings.append(f\"Found PostgreSQL server '{server_name}' that may conflict with Terraform resources\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not check for PostgreSQL servers: {e}\")\n",
|
||||
" print(\" 💡 This is OK - proceeding with caution\")\n",
|
||||
"\n",
|
||||
"# Check for existing Redis/ElastiCache clusters\n",
|
||||
"cache_service = get_cache_service_name()\n",
|
||||
"print(f\"\\n### Checking for Existing {cache_service} Resources\\n\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" # Check for ElastiCache clusters\n",
|
||||
" try:\n",
|
||||
" result = run(\n",
|
||||
" [\"aws\", \"elasticache\", \"describe-cache-clusters\", \"--region\", region, \"--output\", \"json\", \"--show-cache-node-info\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" cache_clusters = json.loads(result.stdout).get(\"CacheClusters\", [])\n",
|
||||
" for cluster in cache_clusters:\n",
|
||||
" cluster_id = cluster.get(\"CacheClusterId\", \"\")\n",
|
||||
" if cluster_name.lower() in cluster_id.lower() or workshop_name.lower() in cluster_id.lower():\n",
|
||||
" existing_resources.append(f\"ElastiCache cluster: {cluster_id}\")\n",
|
||||
" warnings.append(f\"Found ElastiCache cluster '{cluster_id}' that may conflict with Terraform resources\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not check for ElastiCache clusters: {e}\")\n",
|
||||
" print(\" 💡 This is OK - proceeding with caution\")\n",
|
||||
"\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" # Check for Redis caches\n",
|
||||
" try:\n",
|
||||
" result = run(\n",
|
||||
" [\"az\", \"redis\", \"list\", \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" redis_caches = json.loads(result.stdout)\n",
|
||||
" for cache in redis_caches:\n",
|
||||
" cache_name = cache.get(\"name\", \"\")\n",
|
||||
" cache_location = cache.get(\"location\", \"\")\n",
|
||||
" if cache_location.lower() == region.lower():\n",
|
||||
" if cluster_name.lower() in cache_name.lower() or workshop_name.lower() in cache_name.lower():\n",
|
||||
" existing_resources.append(f\"Redis cache: {cache_name}\")\n",
|
||||
" warnings.append(f\"Found Redis cache '{cache_name}' that may conflict with Terraform resources\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not check for Redis caches: {e}\")\n",
|
||||
" print(\" 💡 This is OK - proceeding with caution\")\n",
|
||||
"\n",
|
||||
"# Check for existing S3 buckets (AWS) or Storage Accounts (Azure)\n",
|
||||
"blob_service = get_blob_storage_service_name()\n",
|
||||
"print(f\"\\n### Checking for Existing {blob_service} Resources\\n\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" # Check for S3 buckets\n",
|
||||
" try:\n",
|
||||
" result = run(\n",
|
||||
" [\"aws\", \"s3api\", \"list-buckets\", \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" buckets = json.loads(result.stdout).get(\"Buckets\", [])\n",
|
||||
" # Check if any bucket name might conflict\n",
|
||||
" for bucket in buckets:\n",
|
||||
" bucket_name = bucket.get(\"Name\", \"\")\n",
|
||||
" if cluster_name.lower() in bucket_name.lower() or workshop_name.lower() in bucket_name.lower():\n",
|
||||
" existing_resources.append(f\"S3 bucket: {bucket_name}\")\n",
|
||||
" warnings.append(f\"Found S3 bucket '{bucket_name}' that may conflict with Terraform resources\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not check for S3 buckets: {e}\")\n",
|
||||
" print(\" 💡 This is OK - proceeding with caution\")\n",
|
||||
"\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" # Check for Storage Accounts\n",
|
||||
" try:\n",
|
||||
" result = run(\n",
|
||||
" [\"az\", \"storage\", \"account\", \"list\", \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" storage_accounts = json.loads(result.stdout)\n",
|
||||
" for account in storage_accounts:\n",
|
||||
" account_name = account.get(\"name\", \"\")\n",
|
||||
" account_location = account.get(\"location\", \"\")\n",
|
||||
" if account_location.lower() == region.lower():\n",
|
||||
" if cluster_name.lower() in account_name.lower() or workshop_name.lower() in account_name.lower():\n",
|
||||
" existing_resources.append(f\"Storage account: {account_name}\")\n",
|
||||
" warnings.append(f\"Found Storage account '{account_name}' that may conflict with Terraform resources\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not check for Storage accounts: {e}\")\n",
|
||||
" print(\" 💡 This is OK - proceeding with caution\")\n",
|
||||
"\n",
|
||||
"# Summary and decision\n",
|
||||
"print(\"\\n\" + \"=\" * 60)\n",
|
||||
"print(\"### Safety Check Summary\")\n",
|
||||
"print(\"=\" * 60)\n",
|
||||
"\n",
|
||||
"if existing_resources:\n",
|
||||
" fail(f\"Found {len(existing_resources)} existing resource(s) that may conflict:\")\n",
|
||||
" for resource in existing_resources:\n",
|
||||
" print(f\" ⚠️ {resource}\")\n",
|
||||
" \n",
|
||||
" print(\"\\n\" + \"=\" * 60)\n",
|
||||
" print(\"⚠️ WARNING: Proceeding with Terraform apply may:\")\n",
|
||||
" print(\" - Modify existing infrastructure\")\n",
|
||||
" print(\" - Fail on resources that already exist outside Terraform state\")\n",
|
||||
" print(\" - Cause unexpected changes or conflicts\")\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" print(\"\\n💡 Recommendations:\")\n",
|
||||
" print(\" 1. If these resources are from a previous deployment, that's OK\")\n",
|
||||
" print(\" 2. If these resources are UNRELATED to this deployment:\")\n",
|
||||
" print(\" - Update CLUSTER_NAME or WORKSHOP_NAME in your .env file\")\n",
|
||||
" print(\" - Use different resource names to avoid conflicts\")\n",
|
||||
" print(\" 3. Review the Terraform plan carefully before applying\")\n",
|
||||
" print(\" 4. Consider using Terraform import if you want to manage existing resources\")\n",
|
||||
" print(\"\\n⚠️ You must explicitly confirm you understand the risks before proceeding.\")\n",
|
||||
" print(\" Review the plan output and ensure you're comfortable with the changes.\")\n",
|
||||
"else:\n",
|
||||
" ok(\"No conflicting resources found\")\n",
|
||||
" print(\"✅ Safe to proceed with Terraform apply\")\n",
|
||||
" print(\"💡 Still review the Terraform plan before applying to ensure it's correct\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
||||
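The pre-apply check above flags any resource whose name contains `cluster_name` or `workshop_name`. A minimal sketch of that matching step as a standalone helper follows. The helper name `find_conflicts` and the sample IDs are hypothetical and do not come from the repository. The empty-value guard is an added assumption: in Python `"" in s` is always true, so an unset `workshop_name` would otherwise flag every resource in the account.

```python
from typing import Iterable, List


def find_conflicts(resource_ids: Iterable[str], needles: Iterable[str]) -> List[str]:
    """Return the resource IDs that contain any non-empty needle (case-insensitive)."""
    # Drop empty or whitespace-only needles: "" is a substring of every string.
    lowered = [n.strip().lower() for n in needles if n and n.strip()]
    return [rid for rid in resource_ids if any(n in rid.lower() for n in lowered)]


# Hypothetical identifiers, for illustration only.
rds_ids = ["langsmith-postgres-ws1", "billing-db", "WS1-analytics"]
conflicts = find_conflicts(rds_ids, ["ws1", ""])
print(conflicts)
```

Substring matching is deliberately broad, so it can report unrelated resources that merely share a prefix; the results are best treated as prompts to review the Terraform plan rather than as proof of a conflict.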
@@ -449,60 +449,72 @@
|
||||
"if not license_key:\n",
|
||||
" raise RuntimeError(\"❌ LANGSMITH_LICENSE_KEY is required\")\n",
|
||||
"\n",
|
||||
"# Helper function to create or update secret\n",
|
||||
"def create_or_update_secret(secret_name: str, literals: dict, namespace: str):\n",
|
||||
" \"\"\"Create a secret if it doesn't exist, or update it if it does.\"\"\"\n",
|
||||
" # Check if secret exists\n",
|
||||
" check_result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" # Build kubectl command\n",
|
||||
" cmd = [\"kubectl\", \"create\", \"secret\", \"generic\", secret_name, \"-n\", namespace]\n",
|
||||
" for key, value in literals.items():\n",
|
||||
" cmd.extend([\"--from-literal\", f\"{key}={value}\"])\n",
|
||||
" \n",
|
||||
" if check_result.returncode == 0:\n",
|
||||
" # Secret exists - update it using apply\n",
|
||||
" print(f\" Secret '{secret_name}' exists, updating...\")\n",
|
||||
" # Generate YAML using dry-run, then apply it\n",
|
||||
" create_cmd = cmd + [\"--dry-run=client\", \"-o\", \"yaml\"]\n",
|
||||
" result = run(create_cmd, check=True, stream=False)\n",
|
||||
" \n",
|
||||
" # Apply the YAML\n",
|
||||
" apply_result = run(\n",
|
||||
" [\"kubectl\", \"apply\", \"-f\", \"-\"],\n",
|
||||
" input=result.stdout,\n",
|
||||
" check=True,\n",
|
||||
" stream=True\n",
|
||||
" )\n",
|
||||
" return \"updated\"\n",
|
||||
" else:\n",
|
||||
" # Secret doesn't exist - create it\n",
|
||||
" print(f\" Secret '{secret_name}' does not exist, creating...\")\n",
|
||||
" run(cmd, check=True, stream=True)\n",
|
||||
" return \"created\"\n",
|
||||
"\n",
|
||||
"# Create license key secret\n",
|
||||
"print(\"Creating license key secret...\")\n",
|
||||
"run(\n",
|
||||
" [\n",
|
||||
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-license\",\n",
|
||||
" f\"--from-literal=license-key={license_key}\",\n",
|
||||
" \"-n\", namespace,\n",
|
||||
" \"--dry-run=client\", \"-o\", \"yaml\"\n",
|
||||
" ],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
"action = create_or_update_secret(\n",
|
||||
" \"langsmith-license\",\n",
|
||||
" {\"license-key\": license_key},\n",
|
||||
" namespace\n",
|
||||
")\n",
|
||||
"# Actually create it (remove dry-run)\n",
|
||||
"run(\n",
|
||||
" [\n",
|
||||
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-license\",\n",
|
||||
" f\"--from-literal=license-key={license_key}\",\n",
|
||||
" \"-n\", namespace\n",
|
||||
" ],\n",
|
||||
" check=False, # May already exist\n",
|
||||
" stream=True\n",
|
||||
")\n",
|
||||
"ok(\"License key secret created/updated\")\n",
|
||||
"ok(f\"License key secret {action}\")\n",
|
||||
"\n",
|
||||
"# Create database secret if credentials provided\n",
|
||||
"if db_user and db_password:\n",
|
||||
" print(\"\\nCreating database secret...\")\n",
|
||||
" run(\n",
|
||||
" [\n",
|
||||
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-db\",\n",
|
||||
" f\"--from-literal=username={db_user}\",\n",
|
||||
" f\"--from-literal=password={db_password}\",\n",
|
||||
" \"-n\", namespace\n",
|
||||
" ],\n",
|
||||
" check=False, # May already exist\n",
|
||||
" stream=True\n",
|
||||
" action = create_or_update_secret(\n",
|
||||
" \"langsmith-db\",\n",
|
||||
" {\"username\": db_user, \"password\": db_password},\n",
|
||||
" namespace\n",
|
||||
" )\n",
|
||||
" ok(\"Database secret created/updated\")\n",
|
||||
" ok(f\"Database secret {action}\")\n",
|
||||
"else:\n",
|
||||
" print(\"💡 Skipping database secret (using IAM auth or not needed)\")\n",
|
||||
"\n",
|
||||
"# Create Redis secret if password provided\n",
|
||||
"if redis_password:\n",
|
||||
" print(\"\\nCreating Redis secret...\")\n",
|
||||
" run(\n",
|
||||
" [\n",
|
||||
" \"kubectl\", \"create\", \"secret\", \"generic\", \"langsmith-redis\",\n",
|
||||
" f\"--from-literal=password={redis_password}\",\n",
|
||||
" \"-n\", namespace\n",
|
||||
" ],\n",
|
||||
" check=False, # May already exist\n",
|
||||
" stream=True\n",
|
||||
" action = create_or_update_secret(\n",
|
||||
" \"langsmith-redis\",\n",
|
||||
" {\"password\": redis_password},\n",
|
||||
" namespace\n",
|
||||
" )\n",
|
||||
" ok(\"Redis secret created/updated\")\n",
|
||||
" ok(f\"Redis secret {action}\")\n",
|
||||
"else:\n",
|
||||
" print(\"💡 Skipping Redis secret (using IAM auth or not needed)\")\n",
|
||||
"\n",
|
||||
@@ -568,6 +580,79 @@
|
||||
" print(\"💡 Review the errors above before proceeding\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Pre-Install Safety Check\n",
|
||||
"\n",
|
||||
"**⚠️ CRITICAL:** Before installing with Helm, verify that a release doesn't already exist. This prevents accidentally overwriting or conflicting with existing deployments.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Pre-install safety check: Verify Helm release doesn't already exist\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\"### Pre-Install Helm Release Check\\n\")\n",
|
||||
"print(\"Checking if Helm release already exists...\\n\")\n",
|
||||
"\n",
|
||||
"# Check if Helm release exists\n",
|
||||
"result = run(\n",
|
||||
" [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False, # Don't fail if namespace doesn't exist yet\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"releases = []\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" try:\n",
|
||||
" releases = json.loads(result.stdout)\n",
|
||||
" except json.JSONDecodeError:\n",
|
||||
" # Empty output or invalid JSON\n",
|
||||
" releases = []\n",
|
||||
"elif \"not found\" in result.stderr.lower() or \"does not exist\" in result.stderr.lower():\n",
|
||||
" # Namespace doesn't exist, which is fine\n",
|
||||
" ok(f\"Namespace '{namespace}' does not exist (will be created)\")\n",
|
||||
" releases = []\n",
|
||||
"else:\n",
|
||||
" # Some other error\n",
|
||||
" warn(f\"Could not check for Helm releases: {result.stderr}\")\n",
|
||||
" print(\"💡 Proceeding with caution\")\n",
|
||||
"\n",
|
||||
"# Check if our release name already exists\n",
|
||||
"langsmith_releases = [r for r in releases if r.get(\"name\") == helm_release]\n",
|
||||
"\n",
|
||||
"if langsmith_releases:\n",
|
||||
" release = langsmith_releases[0]\n",
|
||||
" fail(f\"⚠️ Helm release '{helm_release}' already exists in namespace '{namespace}'!\")\n",
|
||||
" print(\"\\nRelease Details:\")\n",
|
||||
" print(f\" Name: {release.get('name', 'N/A')}\")\n",
|
||||
" print(f\" Status: {release.get('status', 'N/A')}\")\n",
|
||||
" print(f\" Chart: {release.get('chart', 'N/A')}\")\n",
|
||||
" print(f\" Revision: {release.get('revision', 'N/A')}\")\n",
|
||||
" print(f\" Namespace: {release.get('namespace', 'N/A')}\")\n",
|
||||
" \n",
|
||||
" print(\"\\n\" + \"=\" * 60)\n",
|
||||
" print(\"⚠️ WARNING: Cannot install - release already exists!\")\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" print(\"\\n💡 Options:\")\n",
|
||||
" print(f\" 1. To upgrade the existing release, use: helm upgrade {helm_release} ...\")\n",
|
||||
" print(f\" 2. To reinstall, first uninstall: helm uninstall {helm_release} -n {namespace}\")\n",
|
||||
" print(\" 3. To use a different release name, update HELM_RELEASE in your .env file\")\n",
|
||||
" print(\"\\n❌ Do NOT proceed with 'helm install' - it will fail.\")\n",
|
||||
" raise RuntimeError(f\"Helm release '{helm_release}' already exists. Use 'helm upgrade' or uninstall first.\")\n",
|
||||
"else:\n",
|
||||
" ok(f\"Helm release '{helm_release}' does not exist (safe to install)\")\n",
|
||||
" print(\"✅ Safe to proceed with Helm install\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
||||
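The pre-install check above parses the output of `helm list -o json` and looks for an entry whose `name` equals the configured release. A minimal sketch of that lookup as a pure function, so it can be exercised without a cluster, is shown below. The function name `find_release` and the sample payload (including the chart version) are hypothetical; only the `name`, `status`, `chart`, `revision`, and `namespace` keys are taken from the notebook code.

```python
import json
from typing import Optional


def find_release(helm_list_stdout: str, release_name: str) -> Optional[dict]:
    """Return the entry for release_name from `helm list -o json` output, or None."""
    try:
        releases = json.loads(helm_list_stdout) if helm_list_stdout.strip() else []
    except json.JSONDecodeError:
        # Mirrors the notebook: unparsable output is treated as "no releases".
        return None
    if not isinstance(releases, list):
        return None
    for release in releases:
        if isinstance(release, dict) and release.get("name") == release_name:
            return release
    return None


# Hypothetical sample payload, for illustration only.
sample = json.dumps([
    {"name": "langsmith", "namespace": "langsmith-test", "revision": "3",
     "status": "deployed", "chart": "langsmith-0.0.0"},
])
found = find_release(sample, "langsmith")
missing = find_release("[]", "langsmith")
```

Treating unparsable output as "not found" keeps the check permissive, which matches the notebook's behaviour; a stricter variant could raise instead, so that a broken `helm` invocation is not mistaken for a clean namespace.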
@@ -13,10 +13,13 @@
|
||||
"### What We'll Validate\n",
|
||||
"\n",
|
||||
"1. ✅ Pod readiness (all pods running)\n",
|
||||
"2. ✅ PVC binding (storage provisioned)\n",
|
||||
"3. ✅ Ingress provisioning (ALB created)\n",
|
||||
"4. ✅ Endpoint reachability (services accessible)\n",
|
||||
"5. ✅ Basic UI availability (web interface works)\n",
|
||||
"2. ✅ License key validation (properly configured)\n",
|
||||
"3. ✅ PVC binding (storage provisioned)\n",
|
||||
"4. ✅ External services connectivity (PostgreSQL, Redis, blob storage)\n",
|
||||
"5. ✅ Ingress provisioning (load balancer created)\n",
|
||||
"6. ✅ Endpoint reachability (services accessible)\n",
|
||||
"7. ✅ Basic UI availability (web interface works)\n",
|
||||
"8. ✅ Basic functional test (optional trace submission)\n",
|
||||
"\n",
|
||||
"### Why This Matters\n",
|
||||
"\n",
|
||||
@@ -72,7 +75,7 @@
|
||||
"source": [
|
||||
"## Setting Up Cluster Access\n",
|
||||
"\n",
|
||||
"Ensure kubectl is configured for the EKS cluster.\n"
|
||||
"Ensure kubectl is configured for the Kubernetes cluster.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -82,12 +85,13 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import require_env, ok\n",
|
||||
"from shared._validation import require_env, ok, warn\n",
|
||||
"from shared._cloud_helpers import (\n",
|
||||
" get_cloud_provider,\n",
|
||||
" get_region,\n",
|
||||
" configure_kubectl,\n",
|
||||
")\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"\n",
|
||||
@@ -131,6 +135,8 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._k8s_helpers import get_pods, wait_for_deployments_ready, require_namespace\n",
|
||||
"from shared._validation import warn\n",
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"# Ensure namespace exists\n",
|
||||
@@ -201,11 +207,279 @@
|
||||
" run([\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"], check=False, stream=True)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1.5. License Key Validation\n",
|
||||
"\n",
|
||||
"**Critical:** Verify that the LangSmith license key is properly configured and valid. License issues will prevent the system from functioning correctly.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check license key secret\n",
|
||||
"print(\"### License Key Validation\\n\")\n",
|
||||
"\n",
|
||||
"# Check if license secret exists\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", \"langsmith-license\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(\"License key secret exists\")\n",
|
||||
" \n",
|
||||
" # Try to check if license key is set (without revealing it)\n",
|
||||
" secret_data = json.loads(result.stdout)\n",
|
||||
" if \"data\" in secret_data and \"license-key\" in secret_data[\"data\"]:\n",
|
||||
" ok(\"License key is present in secret\")\n",
|
||||
" else:\n",
|
||||
" warn(\"License key secret exists but 'license-key' field not found\")\n",
|
||||
" print(\"💡 Secret may use a different key name\")\n",
|
||||
"else:\n",
|
||||
" warn(\"License key secret not found\")\n",
|
||||
" print(\"💡 License secret 'langsmith-license' should exist in namespace\")\n",
|
||||
" print(\" Check that you created the secret during Helm installation\")\n",
|
||||
"\n",
|
||||
"# Check pod logs for license-related errors\n",
|
||||
"print(\"\\n### Checking Pod Logs for License Errors\\n\")\n",
|
||||
"\n",
|
||||
"# Get all pods in namespace\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" pod_names = result.stdout.strip().split()\n",
|
||||
" license_errors_found = False\n",
|
||||
" \n",
|
||||
" # Check logs from a few key pods (limit to first 3 to avoid too much output)\n",
|
||||
" key_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])][:3]\n",
|
||||
" if not key_pods:\n",
|
||||
" key_pods = pod_names[:3] # Fallback to first 3 pods\n",
|
||||
" \n",
|
||||
" for pod_name in key_pods:\n",
|
||||
" try:\n",
|
||||
" # Get recent logs (last 50 lines)\n",
|
||||
" log_result = run(\n",
|
||||
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=50\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if log_result.returncode == 0:\n",
|
||||
" logs = log_result.stdout.lower()\n",
|
||||
" # Look for common license-related error patterns\n",
|
||||
" license_error_patterns = [\n",
|
||||
" \"license\",\n",
|
||||
" \"unauthorized\",\n",
|
||||
" \"invalid license\",\n",
|
||||
" \"license expired\",\n",
|
||||
" \"license key\",\n",
|
||||
" \"beacon.langchain.com\",\n",
|
||||
" ]\n",
|
||||
" \n",
|
||||
" for pattern in license_error_patterns:\n",
|
||||
" if pattern in logs:\n",
|
||||
" # Check if it's actually an error (not just a log message)\n",
|
||||
" lines = log_result.stdout.split(\"\\n\")\n",
|
||||
" error_lines = [line for line in lines if pattern in line.lower() and any(err in line.lower() for err in [\"error\", \"fail\", \"invalid\", \"unauthorized\"])]\n",
|
||||
" if error_lines:\n",
|
||||
" license_errors_found = True\n",
|
||||
" warn(f\"Potential license issue found in {pod_name} logs\")\n",
|
||||
" print(f\" Pattern: '{pattern}'\")\n",
|
||||
" print(f\" Sample: {error_lines[0][:100]}...\")\n",
|
||||
" break\n",
|
||||
" except Exception:\n",
|
||||
" # Skip pods that can't be logged (may not be ready)\n",
|
||||
" pass\n",
|
||||
" \n",
|
||||
" if not license_errors_found:\n",
|
||||
" ok(\"No obvious license-related errors found in pod logs\")\n",
|
||||
" else:\n",
|
||||
" print(\"\\n💡 If license errors are present, verify:\")\n",
|
||||
" print(\" - License key is valid and not expired\")\n",
|
||||
" print(\" - Egress to https://beacon.langchain.com is allowed (if not air-gapped)\")\n",
|
||||
" print(\" - License secret is correctly mounted in pods\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not retrieve pod names to check logs\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2.5. External Services Connectivity\n",
|
||||
"\n",
|
||||
"**Important:** Verify that external services (PostgreSQL, Redis, blob storage) are accessible from the cluster. These are critical dependencies for LangSmith.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._cloud_helpers import (\n",
|
||||
" get_database_service_name,\n",
|
||||
" get_cache_service_name,\n",
|
||||
" get_blob_storage_service_name,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Check external services connectivity\n",
|
||||
"print(\"### External Services Connectivity Check\\n\")\n",
|
||||
"\n",
|
||||
"# Try to load Terraform outputs to get service endpoints\n",
|
||||
"terraform_outputs_file = artifacts_dir / \"terraform-outputs.json\"\n",
|
||||
"terraform_outputs = {}\n",
|
||||
"\n",
|
||||
"if terraform_outputs_file.exists():\n",
|
||||
" try:\n",
|
||||
" with open(terraform_outputs_file) as f:\n",
|
||||
" terraform_outputs_raw = json.load(f)\n",
|
||||
" \n",
|
||||
" # Unwrap Terraform output format\n",
|
||||
" for key, value in terraform_outputs_raw.items():\n",
|
||||
" if isinstance(value, dict) and \"value\" in value:\n",
|
||||
" terraform_outputs[key] = value[\"value\"]\n",
|
||||
" else:\n",
|
||||
" terraform_outputs[key] = value\n",
|
||||
" \n",
|
||||
" print(\"💡 Loaded Terraform outputs for service endpoints\\n\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not parse Terraform outputs: {e}\")\n",
|
||||
" print(\"💡 Will attempt basic connectivity checks without endpoint details\")\n",
|
||||
"else:\n",
|
||||
" print(\"💡 Terraform outputs file not found - will check service connectivity from cluster\\n\")\n",
|
||||
"\n",
|
||||
"# Check PostgreSQL connectivity\n",
|
||||
"print(\"### PostgreSQL/Database Connectivity\\n\")\n",
|
||||
"db_service = get_database_service_name()\n",
|
||||
"\n",
|
||||
"# Try to find a pod we can exec into for connectivity tests\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[0].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" test_pod = result.stdout.strip()\n",
|
||||
" \n",
|
||||
" # Check if we can reach database (basic connectivity test)\n",
|
||||
" # This is a simple test - actual connection requires credentials\n",
|
||||
" db_endpoint = None\n",
|
||||
" if \"rds_endpoint\" in terraform_outputs:\n",
|
||||
" db_endpoint = terraform_outputs[\"rds_endpoint\"]\n",
|
||||
" elif \"postgres_endpoint\" in terraform_outputs:\n",
|
||||
" db_endpoint = terraform_outputs[\"postgres_endpoint\"]\n",
|
||||
" elif \"database_endpoint\" in terraform_outputs:\n",
|
||||
" db_endpoint = terraform_outputs[\"database_endpoint\"]\n",
|
||||
" \n",
|
||||
" if db_endpoint:\n",
|
||||
" # Extract hostname from endpoint (remove port if present)\n",
|
||||
" db_host = db_endpoint.split(\":\")[0]\n",
|
||||
" print(f\"Testing connectivity to {db_service} at {db_host}...\")\n",
|
||||
" \n",
|
||||
" # Try a simple DNS lookup or ping test\n",
|
||||
" dns_result = run(\n",
|
||||
" [\"kubectl\", \"exec\", \"-n\", namespace, test_pod, \"--\", \"nslookup\", db_host],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if dns_result.returncode == 0:\n",
|
||||
" ok(f\"{db_service} hostname resolves: {db_host}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Could not resolve {db_service} hostname\")\n",
|
||||
" print(\"💡 This may be normal if DNS is not fully configured yet, or if nslookup is not installed in the pod image\")\n",
|
||||
" else:\n",
|
||||
" print(f\"💡 {db_service} endpoint not found in Terraform outputs\")\n",
|
||||
" print(\" Verify database is accessible from cluster in cloud console\")\n",
|
||||
"else:\n",
|
||||
" print(\"💡 Could not find pod for connectivity testing\")\n",
|
||||
" print(f\" Manually verify {db_service} is accessible from cluster\")\n",
|
||||
"\n",
|
||||
"# Check Redis connectivity\n",
|
||||
"print(\"\\n### Redis/Cache Connectivity\\n\")\n",
|
||||
"cache_service = get_cache_service_name()\n",
|
||||
"\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" redis_endpoint = None\n",
|
||||
" if \"redis_endpoint\" in terraform_outputs:\n",
|
||||
" redis_endpoint = terraform_outputs[\"redis_endpoint\"]\n",
|
||||
" elif \"cache_endpoint\" in terraform_outputs:\n",
|
||||
" redis_endpoint = terraform_outputs[\"cache_endpoint\"]\n",
|
||||
" elif \"elasticache_endpoint\" in terraform_outputs:\n",
|
||||
" redis_endpoint = terraform_outputs[\"elasticache_endpoint\"]\n",
|
||||
" \n",
|
||||
" if redis_endpoint:\n",
|
||||
" # Extract hostname from endpoint\n",
|
||||
" redis_host = redis_endpoint.split(\":\")[0]\n",
|
||||
" print(f\"Testing connectivity to {cache_service} at {redis_host}...\")\n",
|
||||
" \n",
|
||||
" dns_result = run(\n",
|
||||
" [\"kubectl\", \"exec\", \"-n\", namespace, test_pod, \"--\", \"nslookup\", redis_host],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if dns_result.returncode == 0:\n",
|
||||
" ok(f\"{cache_service} hostname resolves: {redis_host}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Could not resolve {cache_service} hostname\")\n",
|
||||
" print(\"💡 This may be normal if DNS is not fully configured yet, or if nslookup is not installed in the pod image\")\n",
|
||||
" else:\n",
|
||||
" print(f\"💡 {cache_service} endpoint not found in Terraform outputs\")\n",
|
||||
" print(\" Verify cache is accessible from cluster in cloud console\")\n",
|
||||
"\n",
|
||||
"# Check blob storage (S3/Azure Blob)\n",
|
||||
"print(\"\\n### Blob Storage Connectivity\\n\")\n",
|
||||
"blob_service = get_blob_storage_service_name()\n",
|
||||
"\n",
|
||||
"# Check if blob storage secret exists (indicates it's configured)\n",
|
||||
"blob_secret_result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if blob_secret_result.returncode == 0:\n",
|
||||
" secrets = blob_secret_result.stdout.split()\n",
|
||||
" blob_secrets = [s for s in secrets if any(keyword in s.lower() for keyword in [\"s3\", \"storage\", \"blob\", \"aws\"])]\n",
|
||||
" if blob_secrets:\n",
|
||||
" ok(f\"Blob storage secrets found: {', '.join(blob_secrets)}\")\n",
|
||||
" else:\n",
|
||||
" print(\"💡 Blob storage secrets not found (may use IAM roles instead)\")\n",
|
||||
"\n",
|
||||
"# Check for S3 bucket or blob storage account in Terraform outputs\n",
|
||||
"if \"s3_bucket\" in terraform_outputs or \"bucket_name\" in terraform_outputs:\n",
|
||||
" bucket_name = terraform_outputs.get(\"s3_bucket\") or terraform_outputs.get(\"bucket_name\")\n",
|
||||
" ok(f\"Blob storage bucket/container configured: {bucket_name}\")\n",
|
||||
"elif \"storage_account\" in terraform_outputs:\n",
|
||||
" storage_account = terraform_outputs[\"storage_account\"]\n",
|
||||
" ok(f\"Azure storage account configured: {storage_account}\")\n",
|
||||
"else:\n",
|
||||
" print(f\"💡 Verify {blob_service} is configured and accessible\")\n",
|
||||
" print(\" Check Terraform outputs or cloud console for storage resource\")\n",
|
||||
"\n",
|
||||
"print(\"\\n💡 For comprehensive functional testing of external services,\")\n",
|
||||
"print(\" see the validation guide for trace submission and attachment tests\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@@ -269,9 +543,9 @@
|
||||
"source": [
|
||||
"## 3. Ingress Provisioning Check\n",
|
||||
"\n",
|
||||
"**Critical:** The AWS ALB (Application Load Balancer) must be provisioned. This is how external traffic reaches LangSmith.\n",
|
||||
"**Critical:** The load balancer (ALB for AWS, Application Gateway for Azure) must be provisioned. This is how external traffic reaches LangSmith.\n",
|
||||
"\n",
|
||||
"Common issue: ALB never appears due to wrong ingress assumptions.\n"
|
||||
"Common issue: Load balancer never appears due to wrong ingress assumptions.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -317,16 +591,25 @@
|
||||
" if ingress_hosts:\n",
|
||||
" print(f\" Hosts: {', '.join(ingress_hosts)}\")\n",
|
||||
" \n",
|
||||
" # Check for ALB address\n",
|
||||
" # Check for load balancer address (cloud-agnostic)\n",
|
||||
" if load_balancer.get(\"ingress\"):\n",
|
||||
" alb_addresses = [ing.get(\"hostname\", ing.get(\"ip\", \"\")) for ing in load_balancer[\"ingress\"]]\n",
|
||||
" if alb_addresses:\n",
|
||||
" ok(f\"ALB provisioned: {', '.join(alb_addresses)}\")\n",
|
||||
" print(f\" 💡 Access LangSmith at: https://{alb_addresses[0]}\")\n",
|
||||
" lb_addresses = [ing.get(\"hostname\", ing.get(\"ip\", \"\")) for ing in load_balancer[\"ingress\"]]\n",
|
||||
" if lb_addresses:\n",
|
||||
" # Determine load balancer type based on address format\n",
|
||||
" lb_type = \"Load Balancer\"\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" if \".elb.\" in lb_addresses[0] or \".amazonaws.com\" in lb_addresses[0]:\n",
|
||||
" lb_type = \"ALB (Application Load Balancer)\"\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" if \".azure.com\" in lb_addresses[0] or \"appgw\" in lb_addresses[0]:\n",
|
||||
" lb_type = \"Application Gateway\"\n",
|
||||
" \n",
|
||||
" ok(f\"{lb_type} provisioned: {', '.join(lb_addresses)}\")\n",
|
||||
" print(f\" 💡 Access LangSmith at: https://{lb_addresses[0]}\")\n",
|
||||
" else:\n",
|
||||
" warn(\"ALB ingress entry exists but no address found\")\n",
|
||||
" warn(\"Load balancer ingress entry exists but no address found\")\n",
|
||||
" else:\n",
|
||||
" warn(\"ALB not yet provisioned (may take a few minutes)\")\n",
|
||||
" warn(\"Load balancer not yet provisioned (may take a few minutes)\")\n",
|
||||
" print(\" 💡 Wait a few minutes and check again\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No ingress resources found\")\n",
|
||||
@@ -335,24 +618,62 @@
|
||||
" warn(\"Could not retrieve ingress resources\")\n",
|
||||
" print(\"💡 Ingress may not exist yet or namespace is incorrect\")\n",
|
||||
"\n",
|
||||
"# Also check for ALB Ingress Controller\n",
|
||||
"print(\"\\n### ALB Ingress Controller\\n\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app.kubernetes.io/name=aws-load-balancer-controller\", \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"# Check for ingress controller (cloud-agnostic)\n",
|
||||
"print(\"\\n### Ingress Controller\\n\")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" controller_data = json.loads(result.stdout)\n",
|
||||
" controllers = controller_data.get(\"items\", [])\n",
|
||||
" if controllers:\n",
|
||||
" ok(f\"ALB Ingress Controller found ({len(controllers)} pod(s))\")\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" # Check for ALB Ingress Controller\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app.kubernetes.io/name=aws-load-balancer-controller\", \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" controller_data = json.loads(result.stdout)\n",
|
||||
" controllers = controller_data.get(\"items\", [])\n",
|
||||
" if controllers:\n",
|
||||
" ok(f\"ALB Ingress Controller found ({len(controllers)} pod(s))\")\n",
|
||||
" else:\n",
|
||||
" warn(\"ALB Ingress Controller not found\")\n",
|
||||
" print(\"💡 ALB Ingress Controller must be installed for ingress to work\")\n",
|
||||
" else:\n",
|
||||
" warn(\"ALB Ingress Controller not found\")\n",
|
||||
" print(\"💡 ALB Ingress Controller must be installed for ingress to work\")\n",
|
||||
" warn(\"Could not check ALB Ingress Controller status\")\n",
|
||||
"\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" # Check for Azure Application Gateway Ingress Controller\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app=ingress-azure\", \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" controller_data = json.loads(result.stdout)\n",
|
||||
" controllers = controller_data.get(\"items\", [])\n",
|
||||
" if controllers:\n",
|
||||
" ok(f\"Azure Application Gateway Ingress Controller found ({len(controllers)} pod(s))\")\n",
|
||||
" else:\n",
|
||||
" # Also check for AGIC (Application Gateway Ingress Controller)\n",
|
||||
" result2 = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", \"kube-system\", \"-l\", \"app=ingress-appgw\", \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result2.returncode == 0:\n",
|
||||
" controller_data2 = json.loads(result2.stdout)\n",
|
||||
" controllers2 = controller_data2.get(\"items\", [])\n",
|
||||
" if controllers2:\n",
|
||||
" ok(f\"Application Gateway Ingress Controller found ({len(controllers2)} pod(s))\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Application Gateway Ingress Controller not found\")\n",
|
||||
" print(\"💡 Application Gateway Ingress Controller must be installed for ingress to work\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Could not check Application Gateway Ingress Controller status\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Could not check Application Gateway Ingress Controller status\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not check ALB Ingress Controller status\")\n"
|
||||
" print(\"💡 Verify ingress controller is installed for your cloud provider\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -424,6 +745,129 @@
|
||||
" warn(\"No services found\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Basic Functional Test (Optional)\n",
|
||||
"\n",
|
||||
"**Optional:** Submit a simple test trace to verify the end-to-end pipeline is working. This validates that traces can be ingested, stored, and retrieved.\n",
|
||||
"\n",
|
||||
"> **Note:** For comprehensive functional testing (traces, attachments, feedback, datasets), see the full validation guide.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Optional: Basic functional test\n",
|
||||
"print(\"### Basic Functional Test (Optional)\\n\")\n",
|
||||
"\n",
|
||||
"# Check if we have the necessary information to run a test\n",
|
||||
"ingress_result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"jsonpath={.items[0].status.loadBalancer.ingress[0].hostname}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if ingress_result.returncode == 0 and ingress_result.stdout.strip():\n",
|
||||
" ingress_host = ingress_result.stdout.strip()\n",
|
||||
" langsmith_endpoint = f\"https://{ingress_host}/api\"\n",
|
||||
" \n",
|
||||
" print(f\"LangSmith endpoint: {langsmith_endpoint}\")\n",
|
||||
" print(\"\\n💡 To run a basic functional test:\")\n",
|
||||
" print(\" 1. Generate an API key from the LangSmith UI\")\n",
|
||||
" print(\" 2. Set LANGSMITH_API_KEY environment variable\")\n",
|
||||
" print(\" 3. Run the test script below (or see validation guide for comprehensive tests)\\n\")\n",
|
||||
" \n",
|
||||
" # Check if API key is available\n",
|
||||
" api_key = os.environ.get(\"LANGSMITH_API_KEY\", \"\").strip()\n",
|
||||
" \n",
|
||||
" if api_key:\n",
|
||||
" print(\"✅ LANGSMITH_API_KEY found - attempting basic trace submission...\\n\")\n",
|
||||
" \n",
|
||||
" try:\n",
|
||||
" # Simple test: submit a basic trace\n",
|
||||
" test_code = f'''\n",
|
||||
"import os\n",
|
||||
"import requests\n",
|
||||
"from langsmith import traceable\n",
|
||||
"\n",
|
||||
"# Configure LangSmith\n",
|
||||
"os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
|
||||
"os.environ[\"LANGSMITH_API_KEY\"] = \"{api_key}\"\n",
|
||||
"os.environ[\"LANGSMITH_ENDPOINT\"] = \"{langsmith_endpoint}\"\n",
|
||||
"os.environ[\"LANGSMITH_PROJECT\"] = \"validation-test\"\n",
|
||||
"\n",
|
||||
"# Simple traced function\n",
|
||||
"@traceable(name=\"test_basic_function\")\n",
|
||||
"def test_function():\n",
|
||||
" return \"Hello from LangSmith validation test!\"\n",
|
||||
"\n",
|
||||
"# Run test\n",
|
||||
"try:\n",
|
||||
" result = test_function()\n",
|
||||
" print(f\"✅ Test trace submitted successfully: {{result}}\")\n",
|
||||
" print(f\"💡 Check the LangSmith UI at https://{{ingress_host}} to see the trace\")\n",
|
||||
" print(\" Navigate to the 'validation-test' project\")\n",
|
||||
"except Exception as e:\n",
|
||||
" print(f\"⚠️ Error submitting trace: {{e}}\")\n",
|
||||
" print(\"💡 This may be normal if LangSmith is still initializing\")\n",
|
||||
"'''\n",
|
||||
" \n",
|
||||
" # Try to import langsmith to see if it's available\n",
|
||||
" try:\n",
|
||||
" import langsmith\n",
|
||||
" print(\"Running basic trace test...\")\n",
|
||||
" exec(test_code)\n",
|
||||
" ok(\"Basic functional test completed\")\n",
|
||||
" except ImportError:\n",
|
||||
" print(\"⚠️ langsmith package not installed\")\n",
|
||||
" print(\"💡 Install with: pip install langsmith\")\n",
|
||||
" print(\"\\nTest script (save and run separately):\")\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" print(test_code)\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not run functional test: {e}\")\n",
|
||||
" print(\"💡 This is optional - you can test functionality manually in the UI\")\n",
|
||||
" else:\n",
|
||||
" print(\"💡 To enable automated testing, set LANGSMITH_API_KEY in your environment\")\n",
|
||||
" print(\" Get an API key from: https://{ingress_host}/settings/api-keys\")\n",
|
||||
" print(\"\\nExample test script (run after getting API key):\")\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" print(f'''\n",
|
||||
"import os\n",
|
||||
"from langsmith import traceable\n",
|
||||
"\n",
|
||||
"os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
|
||||
"os.environ[\"LANGSMITH_API_KEY\"] = \"<your-api-key>\"\n",
|
||||
"os.environ[\"LANGSMITH_ENDPOINT\"] = \"{langsmith_endpoint}\"\n",
|
||||
"os.environ[\"LANGSMITH_PROJECT\"] = \"validation-test\"\n",
|
||||
"\n",
|
||||
"@traceable(name=\"test_basic_function\")\n",
|
||||
"def test_function():\n",
|
||||
" return \"Hello from LangSmith!\"\n",
|
||||
"\n",
|
||||
"test_function()\n",
|
||||
"print(\"Check the UI for the trace!\")\n",
|
||||
"''')\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
"else:\n",
|
||||
" print(\"💡 Ingress not available yet - functional test requires accessible endpoint\")\n",
|
||||
" print(\" Complete ingress validation first, then return to this section\")\n",
|
||||
"\n",
|
||||
"print(\"\\n💡 For comprehensive functional testing including:\")\n",
|
||||
"print(\" - Trace submission & ClickHouse analytics\")\n",
|
||||
"print(\" - Attachments & blob storage\")\n",
|
||||
"print(\" - Feedback system\")\n",
|
||||
"print(\" - Dataset management\")\n",
|
||||
"print(\" - Agent deployments\")\n",
|
||||
"print(\" See the full validation guide for detailed test scripts\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
@@ -460,7 +904,14 @@
|
||||
" # Try to access the UI (HTTPS)\n",
|
||||
" ui_url = f\"https://{ingress_host}\"\n",
|
||||
" print(f\"\\nTesting UI availability at: {ui_url}\")\n",
|
||||
" print(\"(This may take a moment if ALB is still provisioning...)\\n\")\n",
|
||||
" \n",
|
||||
" # Cloud-specific messaging\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" print(\"(This may take a moment if ALB is still provisioning...)\\n\")\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" print(\"(This may take a moment if Application Gateway is still provisioning...)\\n\")\n",
|
||||
" else:\n",
|
||||
" print(\"(This may take a moment if load balancer is still provisioning...)\\n\")\n",
|
||||
" \n",
|
||||
" try:\n",
|
||||
" # Use a short timeout and allow redirects\n",
|
||||
@@ -482,19 +933,36 @@
|
||||
" print(\" Browser may show security warning - this is normal for self-signed certs\")\n",
|
||||
" except requests.exceptions.Timeout:\n",
|
||||
" warn(\"UI request timed out\")\n",
|
||||
" print(\"💡 ALB may still be provisioning, or ingress is not fully configured\")\n",
|
||||
" print(f\" Try again in a few minutes: {ui_url}\")\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" print(\"💡 ALB may still be provisioning, or ingress is not fully configured\")\n",
|
||||
" print(f\" Check AWS console for ALB status, then try: {ui_url}\")\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" print(\"💡 Application Gateway may still be provisioning, or ingress is not fully configured\")\n",
|
||||
" print(f\" Check Azure portal for Application Gateway status, then try: {ui_url}\")\n",
|
||||
" else:\n",
|
||||
" print(f\" Try again in a few minutes: {ui_url}\")\n",
|
||||
" except requests.exceptions.ConnectionError as e:\n",
|
||||
" warn(f\"Could not connect to UI: {e}\")\n",
|
||||
" print(\"💡 ALB may still be provisioning\")\n",
|
||||
" print(f\" Check AWS console for ALB status, then try: {ui_url}\")\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" print(\"💡 ALB may still be provisioning\")\n",
|
||||
" print(f\" Check AWS console for ALB status, then try: {ui_url}\")\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" print(\"💡 Application Gateway may still be provisioning\")\n",
|
||||
" print(f\" Check Azure portal for Application Gateway status, then try: {ui_url}\")\n",
|
||||
" else:\n",
|
||||
" print(f\" Try again in a few minutes: {ui_url}\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Error accessing UI: {e}\")\n",
|
||||
" print(f\"💡 Manual check: Open {ui_url} in a browser\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not determine ingress hostname\")\n",
|
||||
" print(\"💡 Ingress may not be provisioned yet\")\n",
|
||||
" print(\" Run the ingress check above and wait for ALB to be created\")\n"
|
||||
" if provider == \"aws\":\n",
|
||||
" print(\" Run the ingress check above and wait for ALB to be created\")\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" print(\" Run the ingress check above and wait for Application Gateway to be created\")\n",
|
||||
" else:\n",
|
||||
" print(\" Run the ingress check above and wait for load balancer to be created\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -561,10 +1029,13 @@
|
||||
"### ✅ Validation Checklist\n",
|
||||
"\n",
|
||||
"- [ ] All pods are running and ready\n",
|
||||
"- [ ] License key is properly configured (no errors in logs)\n",
|
||||
"- [ ] All PVCs are bound\n",
|
||||
"- [ ] Ingress/ALB is provisioned\n",
|
||||
"- [ ] External services are accessible (PostgreSQL, Redis, blob storage)\n",
|
||||
"- [ ] Ingress/load balancer is provisioned\n",
|
||||
"- [ ] Services are accessible\n",
|
||||
"- [ ] UI is reachable (or ALB is provisioning)\n",
|
||||
"- [ ] UI is reachable (or load balancer is provisioning)\n",
|
||||
"- [ ] Basic functional test passed (optional)\n",
|
||||
"- [ ] Diagnostic artifacts collected\n",
|
||||
"\n",
|
||||
"### 🎯 Next Steps\n",
|
||||
@@ -573,15 +1044,18 @@
|
||||
"- ✅ You have a working baseline deployment\n",
|
||||
"- ✅ You're on a supported path\n",
|
||||
"- ✅ Ready to proceed to Module 2 (SSO/OIDC configuration)\n",
|
||||
"- 💡 For comprehensive functional testing, see the full validation guide\n",
|
||||
"\n",
|
||||
"**If checks fail:**\n",
|
||||
"- Review the warnings above\n",
|
||||
"- Check diagnostic artifacts\n",
|
||||
"- Common issues:\n",
|
||||
" - **PVCs pending:** EBS CSI driver not installed\n",
|
||||
" - **ALB not appearing:** Wrong ingress configuration\n",
|
||||
" - **Pods not ready:** Check events and logs\n",
|
||||
" - **UI not accessible:** Wait for ALB provisioning (can take 5-10 minutes)\n",
|
||||
" - **PVCs pending:** Storage CSI driver not installed\n",
|
||||
"- **Load balancer not appearing:** Wrong ingress configuration\n",
|
||||
"- **Pods not ready:** Check events and logs\n",
|
||||
"- **UI not accessible:** Wait for load balancer provisioning (can take 5-10 minutes)\n",
|
||||
"- **License errors:** Verify license key is valid and secret is correctly mounted\n",
|
||||
"- **External services unreachable:** Check network connectivity and security groups\n",
|
||||
"\n",
|
||||
"### 📋 Baseline Reference\n",
|
||||
"\n",
|
||||
|
||||
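A note on the load balancer checks in the validation notebook diff above: the ingress check accepts either `hostname` or `ip` from each `status.loadBalancer.ingress` entry, while the functional-test cell queries only `hostname` through jsonpath, which may come back empty when the load balancer reports an IP. The sketch below is not part of the repository. It shows one possible cloud-agnostic way to collect addresses, assuming the input is the text printed by `kubectl get ingress -o json`; the helper name `ingress_addresses` and the sample payload are hypothetical.

```python
import json


def ingress_addresses(ingress_list_json):
    """Return every non-empty load balancer address (hostname or IP) found in
    the JSON text printed by `kubectl get ingress -o json`."""
    data = json.loads(ingress_list_json)
    addresses = []
    for item in data.get("items", []):
        load_balancer = item.get("status", {}).get("loadBalancer", {})
        # "ingress" is absent (or empty) until the load balancer is provisioned.
        for entry in load_balancer.get("ingress") or []:
            # Some load balancers report a hostname, others only an IP.
            address = entry.get("hostname") or entry.get("ip")
            if address:
                addresses.append(address)
    return addresses


# Hypothetical sample payload: one hostname, one IP, one ingress still pending.
sample = json.dumps({
    "items": [
        {"status": {"loadBalancer": {"ingress": [{"hostname": "k8s-example.elb.amazonaws.com"}]}}},
        {"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}},
        {"status": {"loadBalancer": {}}},
    ]
})

print(ingress_addresses(sample))
# ['k8s-example.elb.amazonaws.com', '203.0.113.10']
```

An empty result then means "not provisioned yet" regardless of cloud provider, so a caller can print a single waiting message instead of branching on the address format.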
@@ -425,7 +425,7 @@
|
||||
"\n",
|
||||
"If you want to start over:\n",
|
||||
"1. Review and update your `.env` file\n",
|
||||
"2. Run `01_aws_preflight.ipynb` again\n",
|
||||
"2. Run `01_preflight.ipynb` again\n",
|
||||
"3. Proceed through the module notebooks\n",
|
||||
"\n",
|
||||
"**Thank you for completing Module 1!**\n"
|
||||
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,863 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 2: SAML SSO Validation (Optional)\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This notebook validates SAML SSO configuration for your LangSmith deployment. Use this if your IdP only supports SAML or if enterprise policy requires SAML.\n",
|
||||
"\n",
|
||||
"**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. It performs validation checks only and does NOT modify any infrastructure, Helm values, secrets, or deployments. All operations are safe to run against production environments.\n",
|
||||
"\n",
|
||||
"**Prerequisites:**\n",
|
||||
"- Module 1 deployment is healthy and accessible\n",
|
||||
"- DNS configured and resolving correctly\n",
|
||||
"- TLS certificate valid and trusted\n",
|
||||
"- Ingress configured and working\n",
|
||||
"- IdP team has provided SAML metadata or metadata URL\n",
|
||||
"\n",
|
||||
"## What We'll Validate\n",
|
||||
"\n",
|
||||
"1. ✅ Environment configuration (SAML settings, redacted)\n",
|
||||
"2. ✅ Preflight checks (tools, kubectl, namespace, Helm release)\n",
|
||||
"3. ✅ Current auth configuration (without leaking secrets)\n",
|
||||
"4. ✅ Ingress/TLS preconditions (domain, HTTPS)\n",
|
||||
"5. ✅ SAML metadata validation (URL reachability, XML parsing, required attributes)\n",
|
||||
"6. ✅ Deployment verification (pods, logs, endpoints)\n",
|
||||
"7. ✅ Common failure signatures\n",
|
||||
"8. ✅ Support bundle pointers\n",
|
||||
"\n",
|
||||
"**Estimated time:** 30-45 minutes\n",
|
||||
"\n",
|
||||
"**Important:** \n",
|
||||
"- This notebook never prints secrets. All sensitive values are redacted.\n",
|
||||
"- This notebook does NOT modify any resources. It is safe for production use.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so we can import shared as a package\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent, # If cwd is module-2, go up one level to notebooks\n",
|
||||
" Path.cwd(), # If cwd is already notebooks\n",
|
||||
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration\n",
|
||||
"\n",
|
||||
"Load and validate SAML configuration from environment variables. All secrets are redacted in output.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import json\n",
|
||||
"from pathlib import Path\n",
|
||||
"from shared._validation import require_env, print_config, redact, ok, warn\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"# Required SAML configuration variables\n",
|
||||
"required_vars = [\n",
|
||||
" \"NAMESPACE\",\n",
|
||||
" \"SAML_METADATA_URL\", # OR SAML_METADATA_FILE (one must be provided)\n",
|
||||
" \"LANGSMITH_DOMAIN\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"# Optional but recommended\n",
|
||||
"optional_vars = [\n",
|
||||
" \"SAML_ENTITY_ID\",\n",
|
||||
" \"SAML_EMAIL_ATTRIBUTE\",\n",
|
||||
" \"SAML_NAME_ATTRIBUTE\",\n",
|
||||
" \"SAML_GROUPS_ATTRIBUTE\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"print(\"### Loading SAML Configuration\\n\")\n",
|
||||
"\n",
|
||||
"# Load required variables\n",
|
||||
"config = {}\n",
|
||||
"missing = []\n",
|
||||
"\n",
|
||||
"for var in required_vars:\n",
|
||||
" value = os.environ.get(var, \"\").strip()\n",
|
||||
" if not value:\n",
|
||||
" missing.append(var)\n",
|
||||
" config[var] = value\n",
|
||||
"\n",
|
||||
"# Check if SAML_METADATA_FILE is provided as alternative\n",
|
||||
"saml_metadata_file = os.environ.get(\"SAML_METADATA_FILE\", \"\").strip()\n",
|
||||
"if not config.get(\"SAML_METADATA_URL\") and not saml_metadata_file:\n",
|
||||
" missing.append(\"SAML_METADATA_URL or SAML_METADATA_FILE\")\n",
|
||||
"\n",
|
||||
"if missing:\n",
|
||||
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
|
||||
" f\"💡 Copy env-samples/oidc.env.example to your .env file and fill in SAML values\")\n",
|
||||
"\n",
|
||||
"# Load optional variables\n",
|
||||
"for var in optional_vars:\n",
|
||||
" config[var] = os.environ.get(var, \"\").strip()\n",
|
||||
"\n",
|
||||
"# Set defaults for optional variables\n",
|
||||
"if not config.get(\"SAML_EMAIL_ATTRIBUTE\"):\n",
|
||||
" config[\"SAML_EMAIL_ATTRIBUTE\"] = \"email\"\n",
|
||||
"if not config.get(\"SAML_NAME_ATTRIBUTE\"):\n",
|
||||
" config[\"SAML_NAME_ATTRIBUTE\"] = \"name\"\n",
|
||||
"if not config.get(\"SAML_GROUPS_ATTRIBUTE\"):\n",
|
||||
" config[\"SAML_GROUPS_ATTRIBUTE\"] = \"groups\"\n",
|
||||
"\n",
|
||||
"# Print configuration (redacted)\n",
|
||||
"print_config(config, redact_keys=set())\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n",
|
||||
"\n",
|
||||
"# Validate metadata source\n",
|
||||
"if config.get(\"SAML_METADATA_URL\"):\n",
|
||||
" metadata_url = config[\"SAML_METADATA_URL\"]\n",
|
||||
" if not metadata_url.startswith(\"https://\"):\n",
|
||||
" warn(\"SAML metadata URL should use HTTPS\")\n",
|
||||
" print(f\"\\n💡 Using metadata URL: {metadata_url}\")\n",
|
||||
"elif saml_metadata_file:\n",
|
||||
" metadata_path = Path(saml_metadata_file)\n",
|
||||
" if not metadata_path.exists():\n",
|
||||
" raise RuntimeError(f\"❌ SAML metadata file not found: {saml_metadata_file}\")\n",
|
||||
" print(f\"\\n💡 Using metadata file: {saml_metadata_file}\")\n",
|
||||
"else:\n",
|
||||
" raise RuntimeError(\"❌ Either SAML_METADATA_URL or SAML_METADATA_FILE must be provided\")\n",
|
||||
"\n",
|
||||
"domain = config[\"LANGSMITH_DOMAIN\"]\n",
|
||||
"print(f\"\\n💡 Verify these values match your IdP configuration:\")\n",
|
||||
"print(f\" - Entity ID: {config.get('SAML_ENTITY_ID', 'N/A')}\")\n",
|
||||
"print(f\" - Metadata URL/File: {config.get('SAML_METADATA_URL', saml_metadata_file)}\")\n",
|
||||
"print(f\" - Domain: {domain}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Safety Check: Verify Environment\n",
|
||||
"\n",
|
||||
"Before proceeding with validation, confirm you're working with the correct environment and that auth configuration is appropriate to validate.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Safety check: Verify environment and auth configuration state\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"### Environment Safety Check\\n\")\n",
|
||||
"\n",
|
||||
"# Show current environment\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
|
||||
" print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" print(f\"Subscription ID: {identity.get('SubscriptionId', identity.get('Account', 'N/A'))}\")\n",
|
||||
" print(f\"Subscription Name: {identity.get('SubscriptionName', 'N/A')}\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 60)\n",
|
||||
"print(\"⚠️ IMPORTANT: This notebook is READ-ONLY\")\n",
|
||||
"print(\"=\" * 60)\n",
|
||||
"print(\"\\nThis notebook will:\")\n",
|
||||
"print(\" ✅ Validate SAML configuration\")\n",
|
||||
"print(\" ✅ Check deployment status\")\n",
|
||||
"print(\" ✅ Inspect current auth settings (secrets redacted)\")\n",
|
||||
"print(\" ✅ Collect support bundles\")\n",
|
||||
"print(\"\\nThis notebook will NOT:\")\n",
|
||||
"print(\" ❌ Modify Helm values or releases\")\n",
|
||||
"print(\" ❌ Create or update secrets\")\n",
|
||||
"print(\" ❌ Restart pods or deployments\")\n",
|
||||
"print(\" ❌ Change any infrastructure\")\n",
|
||||
"print(\"\\n\" + \"=\" * 60)\n",
|
||||
"\n",
|
||||
"# Check if auth is already configured\n",
|
||||
"print(\"\\n### Checking Current Auth Configuration State\\n\")\n",
|
||||
"namespace = config.get(\"NAMESPACE\", \"\")\n",
|
||||
"helm_release = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"\n",
|
||||
"# Check for auth-related secrets\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"auth_configured = False\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" auth_secrets = [s for s in secrets.get(\"items\", [])\n",
|
||||
" if any(keyword in s.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
|
||||
" for keyword in [\"auth\", \"saml\", \"sso\"])]\n",
|
||||
" \n",
|
||||
" if auth_secrets:\n",
|
||||
" auth_configured = True\n",
|
||||
" ok(f\"Found {len(auth_secrets)} auth-related secret(s) - auth appears configured\")\n",
|
||||
" print(\" 💡 This validation will check if your SAML configuration matches existing setup\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No auth-related secrets found - auth may not be configured yet\")\n",
|
||||
" print(\" 💡 This validation will verify your SAML configuration is ready to apply\")\n",
|
||||
"\n",
|
||||
"# Check Helm values for auth config\n",
|
||||
"result = run(\n",
|
||||
" [\"helm\", \"get\", \"values\", helm_release, \"-n\", namespace, \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" try:\n",
|
||||
" values = json.loads(result.stdout)\n",
|
||||
" if \"auth\" in str(values).lower() or \"saml\" in str(values).lower():\n",
|
||||
" if not auth_configured:\n",
|
||||
" auth_configured = True\n",
|
||||
" ok(\"Helm values contain auth configuration\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No auth configuration found in Helm values\")\n",
|
||||
" except json.JSONDecodeError:\n",
|
||||
" pass\n",
|
||||
"\n",
|
||||
"if auth_configured:\n",
|
||||
" print(\"\\n\" + \"=\" * 60)\n",
|
||||
" print(\"⚠️ Auth is already configured in this environment\")\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" print(\"\\nThis validation will:\")\n",
|
||||
" print(\" - Verify your SAML settings match the existing configuration\")\n",
|
||||
" print(\" - Check if authentication is working correctly\")\n",
|
||||
" print(\" - Identify any configuration mismatches\")\n",
|
||||
" print(\"\\n💡 If you need to CHANGE auth configuration, use Helm upgrade separately\")\n",
|
||||
" print(\" This notebook only validates, it does not modify configuration\")\n",
|
||||
"else:\n",
|
||||
" print(\"\\n\" + \"=\" * 60)\n",
|
||||
" print(\"ℹ️ Auth not yet configured\")\n",
|
||||
" print(\"=\" * 60)\n",
|
||||
" print(\"\\nThis validation will:\")\n",
|
||||
" print(\" - Verify your SAML settings are correct\")\n",
|
||||
" print(\" - Check prerequisites (DNS, TLS, ingress)\")\n",
|
||||
" print(\" - Validate IdP metadata\")\n",
|
||||
" print(\"\\n💡 After validation passes, apply configuration using Helm upgrade\")\n",
|
||||
" print(\" This notebook only validates, it does not apply configuration\")\n",
|
||||
"\n",
|
||||
"ok(\"Environment safety check complete\")\n",
|
||||
"print(\"\\n✅ Safe to proceed with validation\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. Preflight Checks\n",
|
||||
"\n",
|
||||
"Same as OIDC notebook - verify tools, kubectl context, namespace, and Helm release.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._k8s_helpers import require_namespace, namespace_exists\n",
|
||||
"from shared._shell import run\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"\n",
|
||||
"print(\"### Preflight Checks\\n\")\n",
|
||||
"\n",
|
||||
"# Check kubectl is available\n",
|
||||
"print(\"1. Checking kubectl...\")\n",
|
||||
"result = run([\"kubectl\", \"version\", \"--client\", \"--short\"], check=False, stream=False)\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(\"kubectl is available\")\n",
|
||||
" print(f\" {result.stdout.strip()}\")\n",
|
||||
"else:\n",
|
||||
" raise RuntimeError(\"❌ kubectl is not available or not working\")\n",
|
||||
"\n",
|
||||
"# Check kubectl context\n",
|
||||
"print(\"\\n2. Checking kubectl context...\")\n",
|
||||
"result = run([\"kubectl\", \"config\", \"current-context\"], check=False, stream=False)\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" context = result.stdout.strip()\n",
|
||||
" ok(f\"Current context: {context}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not determine kubectl context\")\n",
|
||||
"\n",
|
||||
"# Check namespace exists\n",
|
||||
"print(f\"\\n3. Checking namespace '{namespace}'...\")\n",
|
||||
"if namespace_exists(namespace):\n",
|
||||
" ok(f\"Namespace '{namespace}' exists\")\n",
|
||||
"else:\n",
|
||||
" raise RuntimeError(f\"❌ Namespace '{namespace}' does not exist. Complete Module 1 first.\")\n",
|
||||
"\n",
|
||||
"# Check Helm release\n",
|
||||
"print(f\"\\n4. Checking Helm release...\")\n",
|
||||
"helm_release = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"result = run(\n",
|
||||
" [\"helm\", \"list\", \"-n\", namespace, \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" try:\n",
|
||||
" releases = json.loads(result.stdout)\n",
|
||||
" release_names = [r.get(\"name\") for r in releases]\n",
|
||||
" if helm_release in release_names:\n",
|
||||
" ok(f\"Helm release '{helm_release}' exists\")\n",
|
||||
" else:\n",
|
||||
" raise RuntimeError(f\"❌ Helm release '{helm_release}' not found\")\n",
|
||||
" except json.JSONDecodeError:\n",
|
||||
" warn(\"Could not parse Helm release list\")\n",
|
||||
"\n",
|
||||
"ok(\"Preflight checks complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Inspect Current Auth Configuration\n",
|
||||
"\n",
|
||||
"Examine the current authentication configuration without leaking secrets.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Inspecting Current Auth Configuration\\n\")\n",
|
||||
"\n",
|
||||
"# Check for auth-related environment variables in deployments\n",
|
||||
"print(\"1. Checking deployment environment variables...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" deployments = json.loads(result.stdout)\n",
|
||||
" auth_vars_found = False\n",
|
||||
" \n",
|
||||
" for deployment in deployments.get(\"items\", []):\n",
|
||||
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" containers = deployment.get(\"spec\", {}).get(\"template\", {}).get(\"spec\", {}).get(\"containers\", [])\n",
|
||||
" \n",
|
||||
" for container in containers:\n",
|
||||
" env_vars = container.get(\"env\", [])\n",
|
||||
" auth_env = [e for e in env_vars if any(keyword in e.get(\"name\", \"\").upper() for keyword in [\"AUTH\", \"SAML\", \"SSO\"])]\n",
|
||||
" \n",
|
||||
" if auth_env:\n",
|
||||
" auth_vars_found = True\n",
|
||||
" print(f\"\\n Deployment: {name}\")\n",
|
||||
" for env in auth_env:\n",
|
||||
" env_name = env.get(\"name\", \"\")\n",
|
||||
" if \"SECRET\" in env_name.upper() or \"PASSWORD\" in env_name.upper():\n",
|
||||
" print(f\" - {env_name}: <redacted>\")\n",
|
||||
" elif env.get(\"valueFrom\"):\n",
|
||||
" print(f\" - {env_name}: <from secret/configmap>\")\n",
|
||||
" \n",
|
||||
" if not auth_vars_found:\n",
|
||||
" warn(\"No auth-related environment variables found\")\n",
|
||||
"\n",
|
||||
"# Check for auth-related secrets (names only)\n",
|
||||
"print(\"\\n2. Checking for auth-related secrets...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" auth_secrets = [s for s in secrets.get(\"items\", [])\n",
|
||||
" if any(keyword in s.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
|
||||
" for keyword in [\"auth\", \"saml\", \"sso\"])]\n",
|
||||
" \n",
|
||||
" if auth_secrets:\n",
|
||||
" ok(f\"Found {len(auth_secrets)} auth-related secret(s)\")\n",
|
||||
" for secret in auth_secrets:\n",
|
||||
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" print(f\" - {name} (values not displayed)\")\n",
|
||||
"\n",
|
||||
"ok(\"Auth configuration inspection complete (no secrets displayed)\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Validate Ingress/TLS Preconditions\n",
|
||||
"\n",
|
||||
"Verify domain resolution, HTTPS accessibility, and TLS certificate validity.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import socket\n",
|
||||
"import ssl\n",
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"domain = config[\"LANGSMITH_DOMAIN\"]\n",
|
||||
"print(f\"### Validating Ingress/TLS for {domain}\\n\")\n",
|
||||
"\n",
|
||||
"# 1. DNS Resolution\n",
|
||||
"print(\"1. Checking DNS resolution...\")\n",
|
||||
"try:\n",
|
||||
" ip_address = socket.gethostbyname(domain)\n",
|
||||
" ok(f\"Domain resolves to: {ip_address}\")\n",
|
||||
"except socket.gaierror as e:\n",
|
||||
" raise RuntimeError(f\"❌ DNS resolution failed for {domain}: {e}\")\n",
|
||||
"\n",
|
||||
"# 2. HTTPS Reachability\n",
|
||||
"print(f\"\\n2. Checking HTTPS reachability...\")\n",
|
||||
"https_url = f\"https://{domain}\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" response = requests.get(https_url, timeout=10, verify=True, allow_redirects=True)\n",
|
||||
" ok(f\"HTTPS accessible: {response.status_code}\")\n",
|
||||
"except requests.exceptions.SSLError as e:\n",
|
||||
" warn(f\"SSL verification failed: {e}\")\n",
|
||||
" print(\" 💡 Certificate may be self-signed or invalid\")\n",
|
||||
"except requests.exceptions.RequestException as e:\n",
|
||||
" raise RuntimeError(f\"❌ Could not connect to {domain}: {e}\")\n",
|
||||
"\n",
|
||||
"# 3. TLS Certificate Check\n",
|
||||
"print(f\"\\n3. Checking TLS certificate...\")\n",
|
||||
"try:\n",
|
||||
" context = ssl.create_default_context()\n",
|
||||
" with socket.create_connection((domain, 443), timeout=10) as sock:\n",
|
||||
" with context.wrap_socket(sock, server_hostname=domain) as ssock:\n",
|
||||
" cert = ssock.getpeercert()\n",
|
||||
" subject = dict(x[0] for x in cert['subject'])\n",
|
||||
" \n",
|
||||
" import datetime\n",
|
||||
" not_after = datetime.datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z')\n",
|
||||
" days_until_expiry = (not_after - datetime.datetime.now()).days\n",
|
||||
" \n",
|
||||
" if days_until_expiry > 30:\n",
|
||||
" ok(f\"Certificate valid for {days_until_expiry} more days\")\n",
|
||||
" elif days_until_expiry > 0:\n",
|
||||
" warn(f\"Certificate expires in {days_until_expiry} days\")\n",
|
||||
" else:\n",
|
||||
" raise RuntimeError(f\"❌ Certificate expired\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not verify TLS certificate: {e}\")\n",
|
||||
"\n",
|
||||
"ok(\"Ingress/TLS preconditions validated\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. SAML Metadata Validation\n",
|
||||
"\n",
|
||||
"Validate SAML metadata URL reachability, XML parsing, and required attributes.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import xml.etree.ElementTree as ET\n",
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"print(\"### Validating SAML Metadata\\n\")\n",
|
||||
"\n",
|
||||
"metadata_url = config.get(\"SAML_METADATA_URL\", \"\")\n",
|
||||
"metadata_file = os.environ.get(\"SAML_METADATA_FILE\", \"\").strip()\n",
|
||||
"\n",
|
||||
"# 1. Fetch or Load Metadata\n",
|
||||
"print(\"1. Loading SAML metadata...\")\n",
|
||||
"metadata_xml = None\n",
|
||||
"\n",
|
||||
"if metadata_url:\n",
|
||||
" print(f\" Fetching from URL: {metadata_url}\")\n",
|
||||
" try:\n",
|
||||
" response = requests.get(metadata_url, timeout=10, verify=True)\n",
|
||||
" if response.status_code == 200:\n",
|
||||
" ok(\"Metadata URL accessible\")\n",
|
||||
" metadata_xml = response.text\n",
|
||||
" else:\n",
|
||||
" raise RuntimeError(f\"❌ Metadata URL returned {response.status_code}\")\n",
|
||||
" except requests.exceptions.RequestException as e:\n",
|
||||
" raise RuntimeError(f\"❌ Could not fetch metadata URL: {e}\")\n",
|
||||
"elif metadata_file:\n",
|
||||
" print(f\" Loading from file: {metadata_file}\")\n",
|
||||
" try:\n",
|
||||
" with open(metadata_file, \"r\") as f:\n",
|
||||
" metadata_xml = f.read()\n",
|
||||
" ok(\"Metadata file loaded\")\n",
|
||||
" except Exception as e:\n",
|
||||
" raise RuntimeError(f\"❌ Could not load metadata file: {e}\")\n",
|
||||
"\n",
|
||||
"if not metadata_xml:\n",
|
||||
" raise RuntimeError(\"❌ No metadata XML available\")\n",
|
||||
"\n",
|
||||
"# 2. Parse XML\n",
|
||||
"print(\"\\n2. Parsing SAML metadata XML...\")\n",
|
||||
"try:\n",
|
||||
" # Register namespaces\n",
|
||||
" namespaces = {\n",
|
||||
" 'md': 'urn:oasis:names:tc:SAML:2.0:metadata',\n",
|
||||
" 'ds': 'http://www.w3.org/2000/09/xmldsig#',\n",
|
||||
" }\n",
|
||||
" \n",
|
||||
" root = ET.fromstring(metadata_xml)\n",
|
||||
" ok(\"Metadata XML is valid\")\n",
|
||||
"except ET.ParseError as e:\n",
|
||||
" raise RuntimeError(f\"❌ Invalid XML: {e}\")\n",
|
||||
"\n",
|
||||
"# 3. Extract Entity Descriptor\n",
|
||||
"print(\"\\n3. Extracting entity information...\")\n",
|
||||
"entity_id = None\n",
|
||||
"try:\n",
|
||||
" entity_descriptor = root.find('.//md:EntityDescriptor', namespaces)\n",
|
||||
" if entity_descriptor is not None:\n",
|
||||
" entity_id = entity_descriptor.get('entityID')\n",
|
||||
" if entity_id:\n",
|
||||
" ok(f\"Entity ID found: {entity_id}\")\n",
|
||||
" if config.get(\"SAML_ENTITY_ID\") and entity_id != config.get(\"SAML_ENTITY_ID\"):\n",
|
||||
" warn(f\"Entity ID mismatch: config={config.get('SAML_ENTITY_ID')}, metadata={entity_id}\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Entity ID not found in metadata\")\n",
|
||||
" else:\n",
|
||||
" warn(\"EntityDescriptor not found in metadata\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not extract entity information: {e}\")\n",
|
||||
"\n",
|
||||
"# 4. Extract IDP SSO Descriptor\n",
|
||||
"print(\"\\n4. Extracting IdP SSO descriptor...\")\n",
|
||||
"try:\n",
|
||||
" idp_sso = root.find('.//md:IDPSSODescriptor', namespaces)\n",
|
||||
" if idp_sso is not None:\n",
|
||||
" ok(\"IdP SSO descriptor found\")\n",
|
||||
" \n",
|
||||
" # Extract SSO endpoints\n",
|
||||
" sso_endpoints = idp_sso.findall('.//md:SingleSignOnService', namespaces)\n",
|
||||
" if sso_endpoints:\n",
|
||||
" print(f\" Found {len(sso_endpoints)} SSO endpoint(s):\")\n",
|
||||
" for endpoint in sso_endpoints:\n",
|
||||
" location = endpoint.get('Location', '')\n",
|
||||
" binding = endpoint.get('Binding', '')\n",
|
||||
" print(f\" - {binding}: {location}\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No SSO endpoints found\")\n",
|
||||
" else:\n",
|
||||
" warn(\"IDPSSODescriptor not found - may not be IdP metadata\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not extract IdP SSO descriptor: {e}\")\n",
|
||||
"\n",
|
||||
"# 5. Extract Certificates\n",
|
||||
"print(\"\\n5. Checking for signing certificates...\")\n",
|
||||
"try:\n",
|
||||
" certificates = root.findall('.//ds:X509Certificate', namespaces)\n",
|
||||
" if certificates:\n",
|
||||
" ok(f\"Found {len(certificates)} certificate(s)\")\n",
|
||||
" for i, cert in enumerate(certificates):\n",
|
||||
" cert_text = cert.text.strip() if cert.text else \"\"\n",
|
||||
" if cert_text:\n",
|
||||
" print(f\" Certificate {i+1}: {len(cert_text)} characters\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Certificate {i+1} is empty\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No signing certificates found\")\n",
|
||||
" print(\" 💡 IdP must provide signing certificate for assertion validation\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not extract certificates: {e}\")\n",
|
||||
"\n",
|
||||
"# 6. Validate Required Attributes\n",
|
||||
"print(\"\\n6. Validating attribute configuration...\")\n",
|
||||
"print(f\" Expected email attribute: {config['SAML_EMAIL_ATTRIBUTE']}\")\n",
|
||||
"print(f\" Expected name attribute: {config['SAML_NAME_ATTRIBUTE']}\")\n",
|
||||
"print(f\" Expected groups attribute: {config['SAML_GROUPS_ATTRIBUTE']}\")\n",
|
||||
"\n",
|
||||
"ok(\"SAML metadata validation complete\")\n",
|
||||
"print(\"\\n💡 Verify your IdP sends these attributes in SAML assertions:\")\n",
|
||||
"print(f\" - {config['SAML_EMAIL_ATTRIBUTE']} (required)\")\n",
|
||||
"print(f\" - {config['SAML_NAME_ATTRIBUTE']} (optional)\")\n",
|
||||
"print(f\" - {config['SAML_GROUPS_ATTRIBUTE']} (optional, for role mapping)\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"print(\"### Checking for Common SAML Failure Signatures\\n\")\n",
|
||||
"\n",
|
||||
"# Get pod names\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" pod_names = result.stdout.strip().split()\n",
|
||||
" api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
|
||||
" \n",
|
||||
" if not api_pods:\n",
|
||||
" api_pods = pod_names[:2]\n",
|
||||
" \n",
|
||||
" failure_patterns = {\n",
|
||||
" \"Missing Attributes\": [\n",
|
||||
" \"missing attribute\",\n",
|
||||
" \"attribute not found\",\n",
|
||||
" \"email attribute\",\n",
|
||||
" \"required attribute\",\n",
|
||||
" ],\n",
|
||||
" \"Signature Validation\": [\n",
|
||||
" \"signature validation failed\",\n",
|
||||
" \"invalid signature\",\n",
|
||||
" \"certificate\",\n",
|
||||
" \"signing key\",\n",
|
||||
" ],\n",
|
||||
" \"Assertion Expired\": [\n",
|
||||
" \"assertion expired\",\n",
|
||||
" \"notonorafter\",\n",
|
||||
" \"clock skew\",\n",
|
||||
" \"timeout\",\n",
|
||||
" ],\n",
|
||||
" \"Entity ID Mismatch\": [\n",
|
||||
" \"entity id\",\n",
|
||||
" \"issuer mismatch\",\n",
|
||||
" \"audience\",\n",
|
||||
" ],\n",
|
||||
" \"Metadata Issues\": [\n",
|
||||
" \"metadata\",\n",
|
||||
" \"xml parse\",\n",
|
||||
" \"invalid metadata\",\n",
|
||||
" ],\n",
|
||||
" }\n",
|
||||
" \n",
|
||||
" found_issues = []\n",
|
||||
" \n",
|
||||
" for pod_name in api_pods[:2]:\n",
|
||||
" try:\n",
|
||||
" log_result = run(\n",
|
||||
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=200\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if log_result.returncode == 0:\n",
|
||||
" logs_lower = log_result.stdout.lower()\n",
|
||||
" \n",
|
||||
" for category, patterns in failure_patterns.items():\n",
|
||||
" for pattern in patterns:\n",
|
||||
" if pattern in logs_lower:\n",
|
||||
" # Check if it's actually an error (not just a log message)\n",
|
||||
" lines = log_result.stdout.split(\"\\n\")\n",
|
||||
" error_lines = [line for line in lines \n",
|
||||
" if pattern in line.lower() \n",
|
||||
" and any(err in line.lower() for err in [\"error\", \"fail\", \"invalid\", \"missing\"])]\n",
|
||||
" \n",
|
||||
" if error_lines and category not in found_issues:\n",
|
||||
" found_issues.append(category)\n",
|
||||
" warn(f\"Potential {category} issue found in {pod_name} logs\")\n",
|
||||
" print(f\" Pattern: '{pattern}'\")\n",
|
||||
" # Don't print full log line as it may contain sensitive data\n",
|
||||
" break\n",
|
||||
" except Exception:\n",
|
||||
" pass\n",
|
||||
" \n",
|
||||
" if not found_issues:\n",
|
||||
" ok(\"No common SAML failure signatures found in logs\")\n",
|
||||
" else:\n",
|
||||
" print(f\"\\n💡 Found potential issues: {', '.join(found_issues)}\")\n",
|
||||
" print(\" Review logs manually for details:\")\n",
|
||||
" print(f\" kubectl logs <pod-name> -n {namespace} --tail=100 | grep -i saml\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not retrieve pod names\")\n",
|
||||
"\n",
|
||||
"print(\"\\n💡 Common SAML failure causes:\")\n",
|
||||
"print(\" 1. Missing required attributes in assertion\")\n",
|
||||
"print(\" 2. Certificate mismatch or expired certificate\")\n",
|
||||
"print(\" 3. Clock skew between LangSmith and IdP\")\n",
|
||||
"print(\" 4. Entity ID mismatch\")\n",
|
||||
"print(\" 5. Attribute name mismatch\")\n",
|
||||
"print(\"\\n See docs/shared/auth_troubleshooting.md for detailed troubleshooting\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Deployment Verification & Support Bundle\n",
|
||||
"\n",
|
||||
"Same as OIDC notebook - verify pods, check logs, collect support bundle.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._k8s_helpers import get_pods, wait_for_deployments_ready\n",
|
||||
"from datetime import datetime\n",
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"print(\"### Deployment Verification\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Pod Readiness\n",
|
||||
"print(\"1. Checking pod readiness...\")\n",
|
||||
"require_namespace(namespace)\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" wait_for_deployments_ready(namespace, timeout=\"5m\")\n",
|
||||
" ok(\"All deployments ready\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Some deployments may not be ready: {e}\")\n",
|
||||
"\n",
|
||||
"pods_output = get_pods(namespace)\n",
|
||||
"print(\"\\nPod Status:\")\n",
|
||||
"print(pods_output)\n",
|
||||
"\n",
|
||||
"# 2. Test Endpoint Auth Behavior\n",
|
||||
"print(f\"\\n2. Testing endpoint auth behavior...\")\n",
|
||||
"domain = config[\"LANGSMITH_DOMAIN\"]\n",
|
||||
"test_url = f\"https://{domain}/api/v1/me\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" response = requests.get(test_url, timeout=10, verify=True, allow_redirects=False)\n",
|
||||
" if response.status_code in [401, 403]:\n",
|
||||
" ok(f\"Endpoint requires authentication ({response.status_code})\")\n",
|
||||
" elif response.status_code in [301, 302, 307, 308]:\n",
|
||||
" redirect_location = response.headers.get(\"Location\", \"\")\n",
|
||||
" if \"login\" in redirect_location.lower() or \"saml\" in redirect_location.lower():\n",
|
||||
" ok(\"Endpoint redirects to authentication\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Endpoint redirects but not to auth: {redirect_location}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Unexpected status code: {response.status_code}\")\n",
|
||||
"except requests.exceptions.RequestException as e:\n",
|
||||
" warn(f\"Could not test endpoint: {e}\")\n",
|
||||
"\n",
|
||||
"# 3. Support Bundle\n",
|
||||
"print(f\"\\n3. Collecting support bundle...\")\n",
|
||||
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
|
||||
"support_dir = artifacts_dir / f\"saml-support-{timestamp}\"\n",
|
||||
"support_dir.mkdir(exist_ok=True)\n",
|
||||
"\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" pod_names = result.stdout.strip().split()\n",
|
||||
" api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
|
||||
" \n",
|
||||
" for pod_name in (api_pods[:3] if api_pods else pod_names[:3]):\n",
|
||||
" try:\n",
|
||||
" log_result = run(\n",
|
||||
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=200\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if log_result.returncode == 0:\n",
|
||||
" log_file = support_dir / f\"{pod_name}-logs.txt\"\n",
|
||||
" with open(log_file, \"w\") as f:\n",
|
||||
" f.write(log_result.stdout)\n",
|
||||
" print(f\" ✅ Saved logs for {pod_name}\")\n",
|
||||
" except Exception:\n",
|
||||
" pass\n",
|
||||
"\n",
|
||||
"ok(f\"Support bundle saved to: {support_dir}\")\n",
|
||||
"print(\"\\n💡 Include pod logs and configuration when contacting support\")\n",
|
||||
"print(\" See docs/shared/auth_troubleshooting.md for complete bundle procedure\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,933 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 3: Operations Sanity Checks\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This notebook performs read-only validation and signal checks for your LangSmith production deployment. It assumes Module 1 and Module 2 are complete.\n",
|
||||
"\n",
|
||||
"**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. It performs validation checks only and does NOT modify any infrastructure, Helm values, secrets, deployments, or resources. All operations are safe to run against production environments.\n",
|
||||
"\n",
|
||||
"**Prerequisites:**\n",
|
||||
"- Module 1 deployment is healthy and accessible\n",
|
||||
"- Module 2 authentication is configured\n",
|
||||
"- kubectl access to the cluster\n",
|
||||
"- Read access to cloud provider APIs (for managed services)\n",
|
||||
"\n",
|
||||
"## What We'll Check\n",
|
||||
"\n",
|
||||
"1. ✅ Configuration (environment variables, redacted)\n",
|
||||
"2. ✅ Preflight (kubectl context, namespace, deployments)\n",
|
||||
"3. ✅ Current state snapshot (pods, services, events)\n",
|
||||
"4. ✅ Early warning signals (restarts, pending pods, resource saturation)\n",
|
||||
"5. ✅ Storage/durability checks (blob storage, backups)\n",
|
||||
"6. ✅ Sidecar checks (Istio, if applicable)\n",
|
||||
"\n",
|
||||
"**Estimated time:** 15-20 minutes\n",
|
||||
"\n",
|
||||
"**Important:** \n",
|
||||
"- This notebook is read-only and safe to run. It does not modify any resources.\n",
|
||||
"- All operations are read-only: `kubectl get`, `kubectl logs`, `kubectl top`, `helm get values`\n",
|
||||
"- Artifacts are saved locally only (no cluster modifications)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so we can import shared as a package\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent, # If cwd is module-3, go up one level to notebooks\n",
|
||||
" Path.cwd(), # If cwd is already notebooks\n",
|
||||
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Safety Check: Verify Environment\n",
|
||||
"\n",
|
||||
"Before proceeding with validation, confirm you're working with the correct environment. This notebook is read-only and safe for production use.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Safety check: Verify environment and confirm read-only operations\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"### Environment Safety Check\\n\")\n",
|
||||
"\n",
|
||||
"# Show current environment\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
|
||||
" print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" print(f\"Subscription ID: {identity.get('SubscriptionId', identity.get('Account', 'N/A'))}\")\n",
|
||||
" print(f\"Subscription Name: {identity.get('SubscriptionName', 'N/A')}\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 60)\n",
|
||||
"print(\"⚠️ IMPORTANT: This notebook is READ-ONLY\")\n",
|
||||
"print(\"=\" * 60)\n",
|
||||
"print(\"\\nThis notebook will:\")\n",
|
||||
"print(\" ✅ Validate production readiness\")\n",
|
||||
"print(\" ✅ Check deployment status and health\")\n",
|
||||
"print(\" ✅ Inspect resource usage and signals\")\n",
|
||||
"print(\" ✅ Verify storage and backup configuration\")\n",
|
||||
"print(\" ✅ Collect state snapshots (saved locally)\")\n",
|
||||
"print(\"\\nThis notebook will NOT:\")\n",
|
||||
"print(\" ❌ Modify Helm values or releases\")\n",
|
||||
"print(\" ❌ Create or update secrets\")\n",
|
||||
"print(\" ❌ Restart pods or deployments\")\n",
|
||||
"print(\" ❌ Change any infrastructure\")\n",
|
||||
"print(\" ❌ Modify any cluster resources\")\n",
|
||||
"print(\"\\nAll operations are read-only:\")\n",
|
||||
"print(\" - kubectl get (read resources)\")\n",
|
||||
"print(\" - kubectl logs (read logs)\")\n",
|
||||
"print(\" - kubectl top (read metrics)\")\n",
|
||||
"print(\" - helm get values (read configuration)\")\n",
|
||||
"print(\" - Write artifacts to local directory only\")\n",
|
||||
"print(\"\\n\" + \"=\" * 60)\n",
|
||||
"\n",
|
||||
"ok(\"Environment safety check complete\")\n",
|
||||
"print(\"\\n✅ Safe to proceed with read-only validation\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration\n",
|
||||
"\n",
|
||||
"Load and validate configuration from environment variables. All secrets are redacted in output.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import json\n",
|
||||
"from shared._validation import require_env, print_config, redact, ok, warn\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"\n",
|
||||
"# Required configuration variables\n",
|
||||
"required_vars = [\n",
|
||||
" \"NAMESPACE\",\n",
|
||||
" \"CLUSTER_NAME\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"# Optional but recommended\n",
|
||||
"optional_vars = [\n",
|
||||
" \"HELM_RELEASE\",\n",
|
||||
" \"LANGSMITH_DOMAIN\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"print(\"### Loading Configuration\\n\")\n",
|
||||
"\n",
|
||||
"# Load required variables\n",
|
||||
"config = {}\n",
|
||||
"missing = []\n",
|
||||
"\n",
|
||||
"for var in required_vars:\n",
|
||||
" value = os.environ.get(var, \"\").strip()\n",
|
||||
" if not value:\n",
|
||||
" missing.append(var)\n",
|
||||
" config[var] = value\n",
|
||||
"\n",
|
||||
"if missing:\n",
|
||||
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
|
||||
" f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
|
||||
"\n",
|
||||
"# Load optional variables\n",
|
||||
"for var in optional_vars:\n",
|
||||
" config[var] = os.environ.get(var, \"\").strip()\n",
|
||||
"\n",
|
||||
"# Set defaults\n",
|
||||
"if not config.get(\"HELM_RELEASE\"):\n",
|
||||
" config[\"HELM_RELEASE\"] = \"langsmith\"\n",
|
||||
"\n",
|
||||
"# Print configuration (redacted)\n",
|
||||
"print_config(config, redact_keys=set())\n",
|
||||
"\n",
|
||||
"# Show cloud provider info\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"\\n### Current {provider_display} Session\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. Preflight Checks\n",
|
||||
"\n",
|
||||
"Verify kubectl context, namespace exists, and deployments are ready.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._k8s_helpers import require_namespace, namespace_exists, wait_for_deployments_ready\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"helm_release = config[\"HELM_RELEASE\"]\n",
|
||||
"\n",
|
||||
"print(\"### Preflight Checks\\n\")\n",
|
||||
"\n",
|
||||
"# Check kubectl is available\n",
|
||||
"print(\"1. Checking kubectl...\")\n",
|
||||
"result = run([\"kubectl\", \"version\", \"--client\"], check=False, stream=False)\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(\"kubectl is available\")\n",
|
||||
" print(f\" {result.stdout.strip()}\")\n",
|
||||
"else:\n",
|
||||
" raise RuntimeError(\"❌ kubectl is not available or not working\")\n",
|
||||
"\n",
|
||||
"# Check kubectl context\n",
|
||||
"print(\"\\n2. Checking kubectl context...\")\n",
|
||||
"result = run([\"kubectl\", \"config\", \"current-context\"], check=False, stream=False)\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" context = result.stdout.strip()\n",
|
||||
" ok(f\"Current context: {context}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not determine kubectl context\")\n",
|
||||
" print(\" 💡 Run: kubectl config get-contexts\")\n",
|
||||
"\n",
|
||||
"# Check namespace exists\n",
|
||||
"print(f\"\\n3. Checking namespace '{namespace}'...\")\n",
|
||||
"if namespace_exists(namespace):\n",
|
||||
" ok(f\"Namespace '{namespace}' exists\")\n",
|
||||
"else:\n",
|
||||
" raise RuntimeError(f\"❌ Namespace '{namespace}' does not exist. Complete Module 1 first.\")\n",
|
||||
"\n",
|
||||
"# Check deployments are ready\n",
|
||||
"print(\"\\n4. Checking deployments...\")\n",
|
||||
"require_namespace(namespace)\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" wait_for_deployments_ready(namespace, timeout=\"2m\")\n",
|
||||
" ok(\"All deployments ready\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Some deployments may not be ready: {e}\")\n",
|
||||
"    print(f\"   💡 Check pod status manually: kubectl get pods -n {namespace}\")\n",
|
||||
"\n",
|
||||
"ok(\"Preflight checks complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Snapshot Current State\n",
|
||||
"\n",
|
||||
"Capture current cluster state for baseline reference.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from datetime import datetime\n",
|
||||
"from shared._k8s_helpers import get_pods\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"print(\"### Snapshotting Current State\\n\")\n",
|
||||
"\n",
|
||||
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
|
||||
"snapshot_dir = artifacts_dir / f\"ops-snapshot-{timestamp}\"\n",
|
||||
"snapshot_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"Saving snapshot to: {snapshot_dir}\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Get all resources\n",
|
||||
"print(\"1. Capturing all resources...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(snapshot_dir / \"all-resources.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(\"All resources captured\")\n",
|
||||
" print(result.stdout)\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not capture all resources\")\n",
|
||||
"\n",
|
||||
"# 2. Get events (sorted by timestamp)\n",
|
||||
"print(\"\\n2. Capturing recent events...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(snapshot_dir / \"events.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(\"Events captured\")\n",
|
||||
" \n",
|
||||
" # Show recent events\n",
|
||||
" lines = result.stdout.strip().split(\"\\n\")\n",
|
||||
" if len(lines) > 1:\n",
|
||||
"        print(\"\\n   Last 10 events:\")\n",
|
||||
" for line in lines[-10:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not capture events\")\n",
|
||||
"\n",
|
||||
"# 3. Get node and pod resource usage (if metrics available)\n",
|
||||
"print(\"\\n3. Checking resource usage...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"top\", \"nodes\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(snapshot_dir / \"node-usage.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(\"Node usage captured\")\n",
|
||||
" print(result.stdout)\n",
|
||||
"else:\n",
|
||||
" warn(\"Node metrics not available (metrics-server may not be installed)\")\n",
|
||||
"\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(snapshot_dir / \"pod-usage.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(\"Pod usage captured\")\n",
|
||||
" print(result.stdout)\n",
|
||||
"else:\n",
|
||||
" warn(\"Pod metrics not available\")\n",
|
||||
"\n",
|
||||
"# 4. Check for data store services\n",
|
||||
"print(\"\\n4. Checking data store services...\")\n",
|
||||
"data_stores = {\n",
|
||||
" \"postgres\": [\"postgres\", \"postgresql\", \"database\", \"db\"],\n",
|
||||
" \"redis\": [\"redis\", \"cache\"],\n",
|
||||
" \"clickhouse\": [\"clickhouse\", \"ch\"],\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"found_stores = []\n",
|
||||
"for store_type, keywords in data_stores.items():\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"svc\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" services = json.loads(result.stdout)\n",
|
||||
" for svc in services.get(\"items\", []):\n",
|
||||
" name = svc.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
|
||||
" if any(keyword in name for keyword in keywords):\n",
|
||||
" found_stores.append((store_type, name))\n",
|
||||
" print(f\" ✅ Found {store_type} service: {name}\")\n",
|
||||
"\n",
|
||||
"if found_stores:\n",
|
||||
" ok(f\"Found {len(found_stores)} data store service(s)\")\n",
|
||||
"else:\n",
|
||||
" warn(\"No in-cluster data stores found (may be using managed services)\")\n",
|
||||
"\n",
|
||||
"ok(f\"State snapshot saved to: {snapshot_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
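||||
"source": [
|
||||
"## 4. Early Warning Signal Checks\n",
|
||||
"\n",
|
||||
"Check pod restart counts, pending pods, resource saturation, and recent logs for common failure patterns.\n"
|
||||
]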
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"print(\"### Early Warning Signal Checks\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Check pod restarts\n",
|
||||
"print(\"1. Checking pod restart counts...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"critical_restarts = []\n",
|
||||
"warning_restarts = []\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" for pod in pods.get(\"items\", []):\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" status = pod.get(\"status\", {})\n",
|
||||
" phase = status.get(\"phase\", \"\")\n",
|
||||
" \n",
|
||||
" # Check restart count\n",
|
||||
" container_statuses = status.get(\"containerStatuses\", [])\n",
|
||||
" for cs in container_statuses:\n",
|
||||
" restart_count = cs.get(\"restartCount\", 0)\n",
|
||||
" if restart_count > 5:\n",
|
||||
" critical_restarts.append((name, restart_count))\n",
|
||||
" elif restart_count > 2:\n",
|
||||
" warning_restarts.append((name, restart_count))\n",
|
||||
" \n",
|
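||||
"        # Check for CrashLoopBackOff (reported as a container waiting reason, not a pod phase)\n",
|
||||
"        if any((cs.get(\"state\", {}).get(\"waiting\") or {}).get(\"reason\") == \"CrashLoopBackOff\" for cs in container_statuses):\n",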
|
||||
" critical_restarts.append((name, \"CrashLoopBackOff\"))\n",
|
||||
" elif phase == \"Pending\":\n",
|
||||
"            # Flag pods that the scheduler has not yet been able to place\n",
|
||||
" conditions = status.get(\"conditions\", [])\n",
|
||||
" for cond in conditions:\n",
|
||||
" if cond.get(\"type\") == \"PodScheduled\" and cond.get(\"status\") != \"True\":\n",
|
||||
" # Pod is pending\n",
|
||||
" warning_restarts.append((name, \"Pending\"))\n",
|
||||
"\n",
|
||||
"if critical_restarts:\n",
|
||||
" warn(f\"❌ Critical: Found {len(critical_restarts)} pod(s) with critical issues\")\n",
|
||||
" for pod_name, issue in critical_restarts:\n",
|
||||
" print(f\" - {pod_name}: {issue}\")\n",
|
||||
" print(\"\\n 💡 Action required: Check pod logs and events\")\n",
|
||||
"elif warning_restarts:\n",
|
||||
" warn(f\"Found {len(warning_restarts)} pod(s) with warnings\")\n",
|
||||
" for pod_name, issue in warning_restarts:\n",
|
||||
" print(f\" - {pod_name}: {issue}\")\n",
|
||||
" print(\"\\n 💡 Monitor these pods closely\")\n",
|
||||
"else:\n",
|
||||
" ok(\"No critical pod restart issues found\")\n",
|
||||
"\n",
|
||||
"# 2. Check for pending pods\n",
|
||||
"print(\"\\n2. Checking for pending pods...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"--field-selector=status.phase=Pending\", \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" pending = pods.get(\"items\", [])\n",
|
||||
" if pending:\n",
|
||||
" warn(f\"Found {len(pending)} pending pod(s)\")\n",
|
||||
" for pod in pending:\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" print(f\" - {name}\")\n",
|
||||
"        print(f\"\\n   💡 Check events: kubectl describe pod <name> -n {namespace}\")\n",
|
||||
" else:\n",
|
||||
" ok(\"No pending pods\")\n",
|
||||
"\n",
|
||||
"# 3. Check resource saturation (if metrics available)\n",
|
||||
"print(\"\\n3. Checking resource saturation...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" lines = result.stdout.strip().split(\"\\n\")[1:] # Skip header\n",
|
||||
" saturated_pods = []\n",
|
||||
" \n",
|
||||
" for line in lines:\n",
|
||||
" parts = line.split()\n",
|
||||
" if len(parts) >= 3:\n",
|
||||
" pod_name = parts[0]\n",
|
||||
" cpu = parts[1]\n",
|
||||
" memory = parts[2]\n",
|
||||
" \n",
|
||||
" # Parse CPU (handle \"m\" suffix for millicores)\n",
|
||||
" try:\n",
|
||||
" if cpu.endswith(\"m\"):\n",
|
||||
" cpu_val = int(cpu[:-1])\n",
|
||||
" else:\n",
|
||||
"                        cpu_val = int(float(cpu) * 1000)  # whole cores -> millicores\n",
|
||||
" \n",
|
||||
" # Parse memory (handle \"Mi\" or \"Gi\" suffix)\n",
|
||||
" if \"Gi\" in memory:\n",
|
||||
" mem_val = float(memory.replace(\"Gi\", \"\")) * 1024\n",
|
||||
" elif \"Mi\" in memory:\n",
|
||||
" mem_val = float(memory.replace(\"Mi\", \"\"))\n",
|
||||
" else:\n",
|
||||
" mem_val = 0\n",
|
||||
" \n",
|
||||
" # Check thresholds (simplified - would need requests/limits for accurate %)\n",
|
||||
" # For now, just flag very high absolute values\n",
|
||||
" if cpu_val > 2000: # > 2 cores\n",
|
||||
" saturated_pods.append((pod_name, f\"High CPU: {cpu}\"))\n",
|
||||
" if mem_val > 4096: # > 4 Gi\n",
|
||||
" saturated_pods.append((pod_name, f\"High Memory: {memory}\"))\n",
|
||||
" except (ValueError, IndexError):\n",
|
||||
" pass\n",
|
||||
" \n",
|
||||
" if saturated_pods:\n",
|
||||
" warn(f\"Found {len(saturated_pods)} pod(s) with high resource usage\")\n",
|
||||
" for pod_name, issue in saturated_pods:\n",
|
||||
" print(f\" - {pod_name}: {issue}\")\n",
|
||||
" print(\"\\n 💡 Review resource requests/limits and consider scaling\")\n",
|
||||
" else:\n",
|
||||
" ok(\"No obvious resource saturation detected\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Resource metrics not available (cannot check saturation)\")\n",
|
||||
"\n",
|
||||
"# 4. Check logs for common failure patterns\n",
|
||||
"print(\"\\n4. Checking logs for common failure patterns...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"jsonpath={.items[*].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"failure_patterns = {\n",
|
||||
" \"connection refused\": [],\n",
|
||||
" \"timeout\": [],\n",
|
||||
" \"out of memory\": [],\n",
|
||||
" \"database error\": [],\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" pod_names = result.stdout.strip().split()\n",
|
||||
" # Check API and worker pods (most likely to have issues)\n",
|
||||
" api_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"api\", \"server\", \"backend\"])]\n",
|
||||
" worker_pods = [p for p in pod_names if any(keyword in p.lower() for keyword in [\"worker\", \"processor\"])]\n",
|
||||
" \n",
|
||||
" pods_to_check = (api_pods[:2] if api_pods else []) + (worker_pods[:2] if worker_pods else [])\n",
|
||||
" \n",
|
||||
" for pod_name in pods_to_check[:4]: # Check up to 4 pods\n",
|
||||
" try:\n",
|
||||
" log_result = run(\n",
|
||||
" [\"kubectl\", \"logs\", pod_name, \"-n\", namespace, \"--tail=50\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if log_result.returncode == 0:\n",
|
||||
" logs_lower = log_result.stdout.lower()\n",
|
||||
" \n",
|
||||
" for pattern, matches in failure_patterns.items():\n",
|
||||
" if pattern in logs_lower:\n",
|
||||
" # Check if it's actually an error (not just a log message)\n",
|
||||
" lines = log_result.stdout.split(\"\\n\")\n",
|
||||
" error_lines = [line for line in lines \n",
|
||||
" if pattern in line.lower() \n",
|
||||
" and any(err in line.lower() for err in [\"error\", \"fail\", \"refused\", \"timeout\"])]\n",
|
||||
" \n",
|
||||
" if error_lines:\n",
|
||||
" matches.append((pod_name, len(error_lines)))\n",
|
||||
" except Exception:\n",
|
||||
" pass\n",
|
||||
" \n",
|
||||
" found_issues = False\n",
|
||||
" for pattern, matches in failure_patterns.items():\n",
|
||||
" if matches:\n",
|
||||
" found_issues = True\n",
|
||||
" warn(f\"Found '{pattern}' pattern in {len(matches)} pod(s)\")\n",
|
||||
" for pod_name, count in matches:\n",
|
||||
" print(f\" - {pod_name}: {count} occurrence(s)\")\n",
|
||||
" \n",
|
||||
" if not found_issues:\n",
|
||||
" ok(\"No common failure patterns found in recent logs\")\n",
|
||||
" else:\n",
|
||||
"        print(f\"\\n   💡 Review pod logs for details: kubectl logs <pod> -n {namespace} --tail=100\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not retrieve pod names for log checking\")\n",
|
||||
"\n",
|
||||
"ok(\"Early warning signal checks complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
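||||
"source": [
|
||||
"## 5. Storage / Durability Checks\n",
|
||||
"\n",
|
||||
"Check blob storage configuration, backup jobs, and persistent volume claims.\n"
|
||||
]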
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"print(\"### Storage / Durability Checks\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Check blob storage configuration\n",
|
||||
"print(\"1. Checking blob storage configuration...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"blob_storage_configured = False\n",
|
||||
"blob_storage_provider = None\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" deployments = json.loads(result.stdout)\n",
|
||||
" for deployment in deployments.get(\"items\", []):\n",
|
||||
" containers = deployment.get(\"spec\", {}).get(\"template\", {}).get(\"spec\", {}).get(\"containers\", [])\n",
|
||||
" \n",
|
||||
" for container in containers:\n",
|
||||
" env_vars = container.get(\"env\", [])\n",
|
||||
" for env in env_vars:\n",
|
||||
" env_name = env.get(\"name\", \"\").upper()\n",
|
||||
" env_value = env.get(\"value\", \"\")\n",
|
||||
" \n",
|
||||
" # Check for blob storage configuration\n",
|
||||
" if \"BLOB\" in env_name or \"S3\" in env_name or \"STORAGE\" in env_name:\n",
|
||||
" if \"PROVIDER\" in env_name:\n",
|
||||
" blob_storage_provider = env_value\n",
|
||||
" blob_storage_configured = True\n",
|
||||
" elif env_value and env_value not in [\"local\", \"filesystem\", \"\"]:\n",
|
||||
" blob_storage_configured = True\n",
|
||||
"\n",
|
||||
"# Also check Helm values if accessible\n",
|
||||
"helm_release = config.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"result = run(\n",
|
||||
" [\"helm\", \"get\", \"values\", helm_release, \"-n\", namespace, \"--output\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" try:\n",
|
||||
" values = json.loads(result.stdout)\n",
|
||||
" values_str = str(values).lower()\n",
|
||||
" \n",
|
||||
" # Look for blob storage configuration\n",
|
||||
" if \"blob\" in values_str or \"s3\" in values_str:\n",
|
||||
" if \"local\" not in values_str and \"filesystem\" not in values_str:\n",
|
||||
" blob_storage_configured = True\n",
|
||||
" if \"s3\" in values_str:\n",
|
||||
" blob_storage_provider = \"s3\"\n",
|
||||
" elif \"azure\" in values_str:\n",
|
||||
" blob_storage_provider = \"azure\"\n",
|
||||
" except json.JSONDecodeError:\n",
|
||||
" pass\n",
|
||||
"\n",
|
||||
"if blob_storage_configured:\n",
|
||||
" if blob_storage_provider:\n",
|
||||
" ok(f\"Blob storage configured: {blob_storage_provider}\")\n",
|
||||
" else:\n",
|
||||
" ok(\"Blob storage appears configured (provider not detected)\")\n",
|
||||
" print(\" 💡 Verify blob storage is NOT using local filesystem in production\")\n",
|
||||
"else:\n",
|
||||
" warn(\"❌ CRITICAL: Blob storage may not be configured\")\n",
|
||||
" print(\" 💡 Blob storage is REQUIRED for production\")\n",
|
||||
" print(\" 💡 Without it, ClickHouse will become unusable under load\")\n",
|
||||
" print(\" 💡 Configure S3 (AWS) or Azure Blob Storage (Azure)\")\n",
|
||||
" print(\" 💡 Check Helm values: helm get values <release> -n <namespace>\")\n",
|
||||
"\n",
|
||||
"# 2. Check for backup configuration indicators\n",
|
||||
"print(\"\\n2. Checking backup configuration...\")\n",
|
||||
"print(\" Note: Backup configuration verification depends on deployment type\")\n",
|
||||
"\n",
|
||||
"# For managed services, we can't verify from cluster\n",
|
||||
"# For in-cluster services, we can check for backup jobs\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"cronjobs,jobs\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"backup_jobs_found = False\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" resources = json.loads(result.stdout)\n",
|
||||
" for item in resources.get(\"items\", []):\n",
|
||||
" name = item.get(\"metadata\", {}).get(\"name\", \"\").lower()\n",
|
||||
" if \"backup\" in name:\n",
|
||||
" backup_jobs_found = True\n",
|
||||
" print(f\" ✅ Found backup job: {name}\")\n",
|
||||
"\n",
|
||||
"if backup_jobs_found:\n",
|
||||
" ok(\"Backup jobs found in cluster\")\n",
|
||||
"else:\n",
|
||||
" warn(\"No backup jobs found in cluster\")\n",
|
||||
"    print(\"   💡 For managed services (RDS, Azure Database), backups run outside the cluster and will not appear here\")\n",
|
||||
" print(\" 💡 Verify backups in cloud provider console:\")\n",
|
||||
" if provider == \"aws\":\n",
|
||||
" print(\" - AWS RDS: Check automated backups in RDS console\")\n",
|
||||
" print(\" - AWS ElastiCache: Check snapshot configuration\")\n",
|
||||
" elif provider == \"azure\":\n",
|
||||
" print(\" - Azure Database: Check backup configuration in Azure portal\")\n",
|
||||
" print(\" - Azure Cache: Check backup configuration\")\n",
|
||||
" print(\" 💡 For in-cluster ClickHouse, configure backup CronJob\")\n",
|
||||
"\n",
|
||||
"# 3. Check PVCs (for in-cluster storage)\n",
|
||||
"print(\"\\n3. Checking persistent volume claims...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pvc\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pvcs = json.loads(result.stdout)\n",
|
||||
" pvc_items = pvcs.get(\"items\", [])\n",
|
||||
" \n",
|
||||
" if pvc_items:\n",
|
||||
" ok(f\"Found {len(pvc_items)} PVC(s)\")\n",
|
||||
" for pvc in pvc_items:\n",
|
||||
" name = pvc.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" status = pvc.get(\"status\", {}).get(\"phase\", \"\")\n",
|
||||
" size = pvc.get(\"spec\", {}).get(\"resources\", {}).get(\"requests\", {}).get(\"storage\", \"N/A\")\n",
|
||||
" print(f\" - {name}: {status}, {size}\")\n",
|
||||
" \n",
|
||||
" # Check for unbound PVCs\n",
|
||||
" unbound = [pvc for pvc in pvc_items if pvc.get(\"status\", {}).get(\"phase\") != \"Bound\"]\n",
|
||||
" if unbound:\n",
|
||||
" warn(f\"Found {len(unbound)} unbound PVC(s)\")\n",
|
||||
" print(\" 💡 Check storage class and node capacity\")\n",
|
||||
" else:\n",
|
||||
" print(\" No PVCs found (may be using managed services or ephemeral storage)\")\n",
|
||||
"\n",
|
||||
"ok(\"Storage / durability checks complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Sidecar Checks (Istio)\n",
|
||||
"\n",
|
||||
"Detect if Istio sidecars are present and provide guidance on log access.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._shell import run\n",
|
||||
"\n",
|
||||
"print(\"### Sidecar Checks (Istio)\\n\")\n",
|
||||
"\n",
|
||||
"# Check if Istio is installed (check for istiod)\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"deployment\", \"-A\", \"-o\", \"jsonpath={.items[?(@.metadata.name==\\\"istiod\\\")].metadata.name}\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"istio_installed = False\n",
|
||||
"if result.returncode == 0 and result.stdout.strip():\n",
|
||||
" istio_installed = True\n",
|
||||
" ok(\"Istio appears to be installed\")\n",
|
||||
"else:\n",
|
||||
"    print(\"   Istio not detected (no 'istiod' deployment found in any namespace)\")\n",
|
||||
" print(\" 💡 Sidecar checks will be skipped\")\n",
|
||||
"\n",
|
||||
"if istio_installed:\n",
|
||||
" # Check for sidecar injection in namespace\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" namespace_injection = False\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" ns = json.loads(result.stdout)\n",
|
||||
" labels = ns.get(\"metadata\", {}).get(\"labels\", {})\n",
|
||||
"        if labels.get(\"istio-injection\") == \"enabled\" or \"istio.io/rev\" in labels:\n",
|
||||
" namespace_injection = True\n",
|
||||
" ok(\"Namespace-level sidecar injection enabled\")\n",
|
||||
" print(f\" Labels: {labels}\")\n",
|
||||
" else:\n",
|
||||
" print(\" Namespace-level injection not enabled\")\n",
|
||||
" print(\" 💡 Sidecars may be injected per-workload\")\n",
|
||||
" \n",
|
||||
" # Check for sidecars in pods\n",
|
||||
" print(\"\\n2. Checking for sidecars in pods...\")\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" pods_with_sidecars = []\n",
|
||||
" pods_without_sidecars = []\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" for pod in pods.get(\"items\", []):\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
|
||||
" container_names = [c.get(\"name\", \"\") for c in containers]\n",
|
||||
" \n",
|
||||
" if \"istio-proxy\" in container_names:\n",
|
||||
" pods_with_sidecars.append((name, container_names))\n",
|
||||
" else:\n",
|
||||
" pods_without_sidecars.append((name, container_names))\n",
|
||||
" \n",
|
||||
" if pods_with_sidecars:\n",
|
||||
" ok(f\"Found {len(pods_with_sidecars)} pod(s) with sidecars\")\n",
|
||||
" print(\"\\n Pods with sidecars:\")\n",
|
||||
" for pod_name, containers in pods_with_sidecars[:5]: # Show first 5\n",
|
||||
" app_containers = [c for c in containers if c != \"istio-proxy\"]\n",
|
||||
" print(f\" - {pod_name}: {', '.join(app_containers)} + istio-proxy\")\n",
|
||||
" \n",
|
||||
" print(\"\\n 💡 Important: When fetching logs, specify container name:\")\n",
|
||||
" print(\" kubectl logs <pod> -n <namespace> -c <container-name>\")\n",
|
||||
" print(\" kubectl logs <pod> -n <namespace> -c istio-proxy # for proxy logs\")\n",
|
||||
" print(\" kubectl logs <pod> -n <namespace> --all-containers=true # for all logs\")\n",
|
||||
" print(\"\\n ⚠️ If logs appear missing, you're likely looking at the wrong container!\")\n",
|
||||
" \n",
|
||||
" if pods_without_sidecars:\n",
|
||||
" warn(f\"Found {len(pods_without_sidecars)} pod(s) without sidecars\")\n",
|
||||
" print(\" 💡 These pods may need sidecar injection or are opted out\")\n",
|
||||
" else:\n",
|
||||
" if namespace_injection:\n",
|
||||
" warn(\"No pods with sidecars found (may need pod restart)\")\n",
|
||||
" print(\" 💡 Existing pods require restart to get sidecars\")\n",
|
||||
" else:\n",
|
||||
" print(\" No sidecars detected (Istio may not be used or injection disabled)\")\n",
|
||||
"\n",
|
||||
"ok(\"Sidecar checks complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Summary\n",
|
||||
"\n",
|
||||
"### ✅ Sanity Checks Complete\n",
|
||||
"\n",
|
||||
"This notebook has validated:\n",
|
||||
"- ✅ Configuration loaded\n",
|
||||
"- ✅ Preflight checks passed\n",
|
||||
"- ✅ Current state snapshotted\n",
|
||||
"- ✅ Early warning signals checked\n",
|
||||
"- ✅ Storage/durability verified\n",
|
||||
"- ✅ Sidecar status checked (if applicable)\n",
|
||||
"\n",
|
||||
"### 🎯 Next Steps\n",
|
||||
"\n",
|
||||
"1. **Review production readiness checklist:**\n",
|
||||
" - See `docs/shared/production_readiness_checklist.md`\n",
|
||||
" - Address any gaps identified\n",
|
||||
"\n",
|
||||
"2. **Review signals and thresholds:**\n",
|
||||
" - See `docs/shared/ops_signals_and_thresholds.md`\n",
|
||||
" - Configure alerts based on thresholds\n",
|
||||
"\n",
|
||||
"3. **Review sidecar documentation (if using Istio):**\n",
|
||||
" - See `docs/shared/sidecars_and_service_mesh.md`\n",
|
||||
" - Verify ServiceEntry configuration for external databases\n",
|
||||
"\n",
|
||||
"4. **Document your baselines:**\n",
|
||||
" - Record current resource usage\n",
|
||||
" - Document scaling thresholds\n",
|
||||
" - Update runbooks with findings\n",
|
||||
"\n",
|
||||
"### 📋 Common Issues Found\n",
|
||||
"\n",
|
||||
"If checks failed, common issues include:\n",
|
||||
"- Blob storage not configured (CRITICAL for production)\n",
|
||||
"- Pods restarting (check logs and resource limits)\n",
|
||||
"- Pending pods (check node capacity and PVC binding)\n",
|
||||
"- High resource usage (review requests/limits)\n",
|
||||
"- Missing backups (verify in cloud console)\n",
|
||||
"\n",
|
||||
"### 🔍 Evidence for Support\n",
|
||||
"\n",
|
||||
"When contacting support, include:\n",
|
||||
"- State snapshot from this notebook\n",
|
||||
"- Pod logs (from correct container if sidecars enabled)\n",
|
||||
"- Recent events\n",
|
||||
"- Resource usage metrics\n",
|
||||
"- Configuration summary (redacted)\n",
|
||||
"\n",
|
||||
"See `docs/shared/ops_signals_and_thresholds.md` for escalation evidence requirements.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,666 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 4: Diagnostics Baseline\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"**This notebook teaches \"baseline first\" discipline.** Before introducing failures or debugging issues, you must capture what \"good\" looks like. This baseline becomes your reference point for all troubleshooting.\n",
|
||||
"\n",
|
||||
"**⚠️ SAFETY NOTICE:** This notebook is **READ-ONLY**. However, Module 4 failure labs will modify your environment. Ensure you've completed the safety check in `../shared/00_setup_or_resume_environment.ipynb` before proceeding.\n",
|
||||
"\n",
|
||||
"**What This Notebook Does:**\n",
|
||||
"1. Captures cluster state snapshot (pods, services, deployments)\n",
|
||||
"2. Collects recent events and resource usage\n",
|
||||
"3. Runs the canonical diagnostics script\n",
|
||||
"4. Performs basic health checks\n",
|
||||
"5. Saves everything to a timestamped directory\n",
|
||||
"\n",
|
||||
"**Why This Matters:**\n",
|
||||
"- You need \"before\" to compare to \"after\"\n",
|
||||
"- Support will ask for baseline diagnostics\n",
|
||||
"- Good debugging starts with understanding normal state\n",
|
||||
"- Evidence collection is time-sensitive\n",
|
||||
"\n",
|
||||
"**Estimated time:** 15-20 minutes\n",
|
||||
"\n",
|
||||
"**Important:** \n",
|
||||
"- Run this notebook BEFORE starting any failure labs. It's your evidence baseline.\n",
|
||||
"- This notebook is read-only and safe to run.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so we can import shared as a package\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent, # If cwd is module-4, go up one level to notebooks\n",
|
||||
" Path.cwd(), # If cwd is already notebooks\n",
|
||||
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"\n",
|
||||
"# Create timestamped directory for this baseline\n",
|
||||
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
|
||||
"baseline_dir = artifacts_dir / \"module-4\" / f\"baseline-{timestamp}\"\n",
|
||||
"baseline_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"\\nBaseline directory: {baseline_dir}\")\n",
|
||||
"print(f\"All diagnostics will be saved here.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Safety Check: Environment Verification\n",
|
||||
"\n",
|
||||
"Verify you're in a safe environment before collecting the baseline. Module 4 failure labs will modify your environment.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Safety check: Verify environment is safe for Module 4\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"### Environment Safety Check\\n\")\n",
|
||||
"\n",
|
||||
"# Show environment details\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" print(f\"Account ID: {identity.get('Account', 'N/A')}\")\n",
|
||||
" print(f\"User ARN: {identity.get('Arn', 'N/A')}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
"\n",
|
||||
"# Show environment variables\n",
|
||||
"print(f\"\\n### Environment Variables\")\n",
|
||||
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
|
||||
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
|
||||
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
|
||||
"\n",
|
||||
"# Check for Module 4 safety flag\n",
|
||||
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
|
||||
"if module4_safe in [\"true\", \"yes\", \"1\"]:\n",
|
||||
" ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
|
||||
" print(\" ✅ Environment verified as safe for Module 4 failure labs\")\n",
|
||||
"else:\n",
|
||||
" warn(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
|
||||
" print(\" 💡 This notebook is read-only, but failure labs require this flag\")\n",
|
||||
" print(\" 💡 Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
|
||||
" print(\" 💡 Complete safety check in ../shared/00_setup_or_resume_environment.ipynb first\")\n",
|
||||
"\n",
|
||||
"print(\"\\n⚠️ REMINDER: This notebook is read-only.\")\n",
|
||||
"print(\" Failure labs in Module 4 will modify secrets and cause disruptions.\")\n",
|
||||
"print(\" Only run failure labs in TEST/NON-PRODUCTION environments.\")\n",
|
||||
"\n",
|
||||
"ok(\"Environment check complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration\n",
|
||||
"\n",
|
||||
"Load and validate configuration from environment variables.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
|
||||
"\n",
|
||||
"# Required configuration\n",
|
||||
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
|
||||
"\n",
|
||||
"print(\"### Loading Configuration\\n\")\n",
|
||||
"\n",
|
||||
"config = {}\n",
|
||||
"missing = []\n",
|
||||
"\n",
|
||||
"for var in required_vars:\n",
|
||||
" value = os.environ.get(var, \"\").strip()\n",
|
||||
" if not value:\n",
|
||||
" missing.append(var)\n",
|
||||
" config[var] = value\n",
|
||||
"\n",
|
||||
"if missing:\n",
|
||||
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
|
||||
" f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
|
||||
"\n",
|
||||
"# Optional but recommended\n",
|
||||
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"\n",
|
||||
"# Show cloud provider info\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"\n",
|
||||
"print(f\"Cloud Provider: {provider.upper()}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"print(f\"Namespace: {namespace}\")\n",
|
||||
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
|
||||
"\n",
|
||||
"if config[\"LANGSMITH_DOMAIN\"]:\n",
|
||||
" print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. Cluster State Snapshot\n",
|
||||
"\n",
|
||||
"Capture a complete snapshot of all resources in the namespace. This is your \"before\" picture.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\"### Capturing Cluster State Snapshot\\n\")\n",
|
||||
"\n",
|
||||
"# Get all resources\n",
|
||||
"print(\"1. Collecting all resources...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"wide\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" snapshot_file = baseline_dir / \"all-resources.txt\"\n",
|
||||
" with open(snapshot_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(f\"Saved resource snapshot to {snapshot_file.name}\")\n",
|
||||
" print(f\" Resources captured: {len(result.stdout.splitlines())} lines\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not capture resource snapshot\")\n",
|
||||
"\n",
|
||||
"# Get all resources as YAML (more detailed)\n",
|
||||
"print(\"\\n2. Collecting detailed YAML...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"all\", \"-n\", namespace, \"-o\", \"yaml\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" yaml_file = baseline_dir / \"all-resources.yaml\"\n",
|
||||
" with open(yaml_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(f\"Saved detailed YAML to {yaml_file.name}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not capture detailed YAML\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Key Deployments Description\n",
|
||||
"\n",
|
||||
"Get detailed information about key deployments.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Describing Key Deployments\\n\")\n",
|
||||
"\n",
|
||||
"# Get list of deployments\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" deployments = json.loads(result.stdout)\n",
|
||||
" deployment_items = deployments.get(\"items\", [])\n",
|
||||
" \n",
|
||||
" if deployment_items:\n",
|
||||
" ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
|
||||
" \n",
|
||||
" # Describe each deployment\n",
|
||||
" for deployment in deployment_items:\n",
|
||||
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" print(f\"\\n3. Describing deployment: {name}\")\n",
|
||||
" \n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"describe\", \"deployment\", name, \"-n\", namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" desc_file = baseline_dir / f\"deployment-{name}.txt\"\n",
|
||||
" with open(desc_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" print(f\" ✅ Saved description to {desc_file.name}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Could not describe deployment {name}\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No deployments found\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not list deployments\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Recent Events\n",
|
||||
"\n",
|
||||
"Capture recent events sorted by timestamp. Events often contain the first clues about what's happening.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Collecting Recent Events\\n\")\n",
|
||||
"\n",
|
||||
"# Get events sorted by timestamp\n",
|
||||
"result = run(\n",
|
||||
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" events_file = baseline_dir / \"events.txt\"\n",
|
||||
" with open(events_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(f\"Saved events to {events_file.name}\")\n",
|
||||
" \n",
|
||||
"    # Count events (every line after the header row)\n",
|
||||
" lines = result.stdout.strip().split(\"\\n\")\n",
|
||||
" if len(lines) > 1: # Header + events\n",
|
||||
" event_count = len(lines) - 1\n",
|
||||
" print(f\" Captured {event_count} event(s)\")\n",
|
||||
" \n",
|
||||
" # Show last few events\n",
|
||||
" if event_count > 0:\n",
|
||||
" print(\"\\n Last 5 events:\")\n",
|
||||
" for line in lines[-5:]:\n",
|
||||
" if line.strip():\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" print(\" No events found (this is normal for a healthy cluster)\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not collect events\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Resource Usage\n",
|
||||
"\n",
|
||||
"Capture resource usage (CPU, memory) if metrics are available.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Collecting Resource Usage\\n\")\n",
|
||||
"\n",
|
||||
"# Top pods\n",
|
||||
"print(\"1. Checking pod resource usage...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"top\", \"pods\", \"-n\", namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" top_pods_file = baseline_dir / \"top-pods.txt\"\n",
|
||||
" with open(top_pods_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(f\"Saved pod resource usage to {top_pods_file.name}\")\n",
|
||||
" print(result.stdout)\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not get pod resource usage (metrics server may not be available)\")\n",
|
||||
" print(\" 💡 This is OK - metrics are optional for baseline collection\")\n",
|
||||
"\n",
|
||||
"# Top nodes (if available)\n",
|
||||
"print(\"\\n2. Checking node resource usage...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"top\", \"nodes\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" top_nodes_file = baseline_dir / \"top-nodes.txt\"\n",
|
||||
" with open(top_nodes_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" ok(f\"Saved node resource usage to {top_nodes_file.name}\")\n",
|
||||
" print(result.stdout)\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not get node resource usage (metrics server may not be available)\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Canonical Diagnostics Script\n",
|
||||
"\n",
|
||||
"**This is the most important step.** Run the official LangChain diagnostics script that Support expects.\n",
|
||||
"\n",
|
||||
"The script captures:\n",
|
||||
"- Pod logs (all containers)\n",
|
||||
"- Events (sorted by timestamp)\n",
|
||||
"- Resource usage (CPU, memory)\n",
|
||||
"- Configuration (deployments, services, ingress)\n",
|
||||
"- Storage (PVCs, storage classes)\n",
|
||||
"- Network (services, endpoints)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"import subprocess\n",
|
||||
"\n",
|
||||
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
|
||||
"\n",
|
||||
"# URL to the canonical script\n",
|
||||
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
|
||||
"script_path = baseline_dir / \"get_k8s_debugging_info.sh\"\n",
|
||||
"\n",
|
||||
"print(f\"1. Downloading script from: {script_url}\")\n",
|
||||
"try:\n",
|
||||
" urllib.request.urlretrieve(script_url, script_path)\n",
|
||||
" ok(f\"Downloaded script to {script_path.name}\")\n",
|
||||
" \n",
|
||||
" # Make executable\n",
|
||||
" script_path.chmod(0o755)\n",
|
||||
" \n",
|
||||
" # Run the script\n",
|
||||
" print(f\"\\n2. Running diagnostics script for namespace: {namespace}\")\n",
|
||||
" print(\" (This may take a few minutes...)\")\n",
|
||||
" \n",
|
||||
" result = run(\n",
|
||||
" [str(script_path), namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=True # Stream output so user can see progress\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" ok(\"Diagnostics script completed successfully\")\n",
|
||||
" \n",
|
||||
" # The script creates a tarball - find it\n",
|
||||
" diagnostics_tarball = None\n",
|
||||
" for file in baseline_dir.parent.iterdir():\n",
|
||||
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
|
||||
" diagnostics_tarball = file\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" if diagnostics_tarball:\n",
|
||||
" # Move it to our baseline directory\n",
|
||||
" target_path = baseline_dir / diagnostics_tarball.name\n",
|
||||
" diagnostics_tarball.rename(target_path)\n",
|
||||
" ok(f\"Diagnostics bundle saved to: {target_path.name}\")\n",
|
||||
" print(f\" Size: {target_path.stat().st_size / 1024 / 1024:.2f} MB\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Could not find diagnostics tarball (check script output above)\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Diagnostics script returned non-zero exit code: {result.returncode}\")\n",
|
||||
" print(\" Check the output above for errors\")\n",
|
||||
" print(\" 💡 The script may still have collected useful information\")\n",
|
||||
" \n",
|
||||
"except urllib.request.URLError as e:\n",
|
||||
" warn(f\"Could not download diagnostics script: {e}\")\n",
|
||||
" print(\" 💡 You can download it manually and run it:\")\n",
|
||||
" print(f\" curl -O {script_url}\")\n",
|
||||
" print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
|
||||
" print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Error running diagnostics script: {e}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 7. Basic Health Check\n",
|
||||
"\n",
|
||||
"Perform a basic HTTP check to verify the LangSmith endpoint is reachable.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"import urllib3\n",
|
||||
"\n",
|
||||
"# Disable SSL warnings for self-signed certs\n",
|
||||
"urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
|
||||
"\n",
|
||||
"print(\"### Testing Endpoint Reachability\\n\")\n",
|
||||
"\n",
|
||||
"# Determine endpoint URL (default to None so the check below never sees an undefined name)\n",
|
||||
"test_url = None\n",
|
||||
"if config[\"LANGSMITH_DOMAIN\"]:\n",
|
||||
" test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
|
||||
"else:\n",
|
||||
" # Try to get from ingress\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" ingresses = json.loads(result.stdout)\n",
|
||||
" for ingress in ingresses.get(\"items\", []):\n",
|
||||
" rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
|
||||
" for rule in rules:\n",
|
||||
" host = rule.get(\"host\", \"\")\n",
|
||||
" if host:\n",
|
||||
" test_url = f\"https://{host}\"\n",
|
||||
" break\n",
|
||||
" else:\n",
|
||||
" test_url = None\n",
|
||||
"\n",
|
||||
"if test_url:\n",
|
||||
" print(f\"Testing: {test_url}\")\n",
|
||||
" try:\n",
|
||||
" response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
|
||||
" \n",
|
||||
" health_file = baseline_dir / \"endpoint-health.txt\"\n",
|
||||
" with open(health_file, \"w\") as f:\n",
|
||||
" f.write(f\"URL: {test_url}\\n\")\n",
|
||||
" f.write(f\"Status Code: {response.status_code}\\n\")\n",
|
||||
" f.write(f\"Response Headers:\\n{json.dumps(dict(response.headers), indent=2)}\\n\")\n",
|
||||
" \n",
|
||||
" if response.status_code in [200, 302, 401, 403]:\n",
|
||||
" ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
|
||||
" print(f\" Response saved to {health_file.name}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
|
||||
" except requests.exceptions.SSLError:\n",
|
||||
"        warn(\"TLS handshake failed (certificate verification is disabled, so this is likely a protocol or handshake problem)\")\n",
|
||||
"        print(\"   💡 Check the ingress TLS configuration. In production, use proper TLS certificates.\")\n",
|
||||
" except requests.exceptions.RequestException as e:\n",
|
||||
" warn(f\"Could not reach endpoint: {e}\")\n",
|
||||
" print(\" 💡 Endpoint may still be provisioning or DNS not configured\")\n",
|
||||
"else:\n",
|
||||
" warn(\"No endpoint URL available for testing\")\n",
|
||||
" print(\" 💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 8. What Good Looks Like\n",
|
||||
"\n",
|
||||
"Quick validation checks to confirm the baseline is healthy.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"print(\"### Quick Health Validation\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"healthy_pods = 0\n",
|
||||
"unhealthy_pods = []\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" for pod in pods.get(\"items\", []):\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" phase = pod.get(\"status\", {}).get(\"phase\", \"\")\n",
|
||||
" container_statuses = pod.get(\"status\", {}).get(\"containerStatuses\", [])\n",
|
||||
" \n",
|
||||
" is_ready = True\n",
|
||||
" for cs in container_statuses:\n",
|
||||
" if not cs.get(\"ready\", False):\n",
|
||||
" is_ready = False\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" if phase == \"Running\" and is_ready:\n",
|
||||
" healthy_pods += 1\n",
|
||||
" else:\n",
|
||||
" unhealthy_pods.append((name, phase, is_ready))\n",
|
||||
" \n",
|
||||
" if unhealthy_pods:\n",
|
||||
" warn(f\"Found {len(unhealthy_pods)} pod(s) that are not healthy:\")\n",
|
||||
" for name, phase, ready in unhealthy_pods:\n",
|
||||
" print(f\" - {name}: phase={phase}, ready={ready}\")\n",
|
||||
" else:\n",
|
||||
" ok(f\"All {healthy_pods} pod(s) are healthy and ready\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not check pod status\")\n",
|
||||
"\n",
|
||||
"# Check for CrashLoopBackOff\n",
|
||||
"if unhealthy_pods:\n",
|
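||||
"    # CrashLoopBackOff is a container waiting reason, not a pod phase, so inspect container states\n",
|
||||
"    crash_loops = [p.get(\"metadata\", {}).get(\"name\", \"\") for p in pods.get(\"items\", []) if any(cs.get(\"state\", {}).get(\"waiting\", {}).get(\"reason\") == \"CrashLoopBackOff\" for cs in p.get(\"status\", {}).get(\"containerStatuses\", []))]\n",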
|
||||
" if crash_loops:\n",
|
||||
" warn(f\"Found {len(crash_loops)} pod(s) in CrashLoopBackOff:\")\n",
|
||||
" for name in crash_loops:\n",
|
||||
" print(f\" - {name}\")\n",
|
||||
" print(\" 💡 Check pod logs to understand why they're crashing\")\n",
|
||||
"\n",
|
||||
"# Check for Pending pods\n",
|
||||
"pending = [name for name, phase, _ in unhealthy_pods if phase == \"Pending\"]\n",
|
||||
"if pending:\n",
|
||||
" warn(f\"Found {len(pending)} pod(s) in Pending state:\")\n",
|
||||
" for name in pending:\n",
|
||||
" print(f\" - {name}\")\n",
|
||||
" print(\" 💡 Check events and resource availability\")\n",
|
||||
"\n",
|
||||
"print(\"\\n### Baseline Summary\\n\")\n",
|
||||
"print(f\"✅ Baseline captured at: {timestamp}\")\n",
|
||||
"print(f\"📁 Baseline directory: {baseline_dir}\")\n",
|
||||
"print(f\"📊 Resources captured:\")\n",
|
||||
"print(f\" - Cluster state snapshot\")\n",
|
||||
"print(f\" - Deployment descriptions\")\n",
|
||||
"print(f\" - Recent events\")\n",
|
||||
"print(f\" - Resource usage (if available)\")\n",
|
||||
"print(f\" - Canonical diagnostics bundle\")\n",
|
||||
"print(f\" - Endpoint health check\")\n",
|
||||
"\n",
|
||||
"ok(\"Baseline collection complete!\")\n",
|
||||
"print(\"\\n💡 Use this baseline as your reference point for all failure labs.\")\n",
|
||||
"print(\" Compare future diagnostics to this baseline to identify what changed.\")\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,944 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 4: Failure Lab - PostgreSQL\n",
|
||||
"\n",
|
||||
"## ⚠️ CRITICAL SAFETY WARNING\n",
|
||||
"\n",
|
||||
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
|
||||
"- **Modifies Kubernetes secrets** (breaks PostgreSQL password)\n",
|
||||
"- **Causes service disruptions** (API failures, login failures)\n",
|
||||
"- **Requires remediation** to restore functionality\n",
|
||||
"\n",
|
||||
"**REQUIREMENTS:**\n",
|
||||
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
|
||||
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
|
||||
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
|
||||
"- ✅ **Backup/restore plan** available\n",
|
||||
"\n",
|
||||
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"**This lab teaches you how to debug PostgreSQL connectivity failures in LangSmith.**\n",
|
||||
"\n",
|
||||
"PostgreSQL is LangSmith's primary metadata store. It holds:\n",
|
||||
"- User accounts and workspaces\n",
|
||||
"- Project definitions\n",
|
||||
"- API keys and permissions\n",
|
||||
"- Trace metadata (not the traces themselves, which go to ClickHouse)\n",
|
||||
"\n",
|
||||
"**When PostgreSQL fails, you'll see:**\n",
|
||||
"- API endpoints return 5xx errors\n",
|
||||
"- Login/authentication may fail\n",
|
||||
"- UI may load but actions fail\n",
|
||||
"- Connection exhaustion patterns in logs\n",
|
||||
"\n",
|
||||
"**Learning Objectives:**\n",
|
||||
"1. Understand how PostgreSQL failures manifest\n",
|
||||
"2. Practice collecting diagnostics for database issues\n",
|
||||
"3. Learn to identify connection vs. credential vs. network issues\n",
|
||||
"4. Practice safe remediation\n",
|
||||
"\n",
|
||||
"**Estimated time:** 30-45 minutes\n",
|
||||
"\n",
|
||||
"**⚠️ Important:** \n",
|
||||
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
|
||||
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent,\n",
|
||||
" Path.cwd(),\n",
|
||||
" Path.cwd() / \"notebooks\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ⚠️ CRITICAL: Environment Safety Verification\n",
|
||||
"\n",
|
||||
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"⚠️ CRITICAL SAFETY CHECK - POSTGRESQL FAILURE LAB\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Show environment details prominently\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"\\n### Current Environment Configuration\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" account_id = identity.get('Account', 'N/A')\n",
|
||||
" user_arn = identity.get('Arn', 'N/A')\n",
|
||||
" print(f\"Account ID: {account_id}\")\n",
|
||||
" print(f\"User ARN: {user_arn}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
|
||||
" subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
" print(f\"Subscription Name: {subscription_name}\")\n",
|
||||
"\n",
|
||||
"# Show all relevant environment variables\n",
|
||||
"print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
|
||||
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
|
||||
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
|
||||
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
|
||||
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"\\nThis failure lab will:\")\n",
|
||||
"print(\" 1. Find the PostgreSQL secret in your namespace\")\n",
|
||||
"print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
|
||||
"print(\" 3. MODIFY the secret to set an INVALID password\")\n",
|
||||
"print(\" 4. Apply the modified secret (breaks database connectivity)\")\n",
|
||||
"print(\" 5. Cause API failures and login failures\")\n",
|
||||
"print(\" 6. Require remediation to restore (restore original secret)\")\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Check for Module 4 safety flag\n",
|
||||
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
|
||||
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
|
||||
" fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
|
||||
" print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
|
||||
" print(\"\\nTo run this failure lab, you MUST:\")\n",
|
||||
" print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
|
||||
" print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
|
||||
" print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
|
||||
" print(\" 4. Re-run this cell to confirm\")\n",
|
||||
" print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
|
||||
" raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
|
||||
"\n",
|
||||
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
|
||||
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
|
||||
"print(\"\\n⚠️ REMINDER: This lab will break PostgreSQL connectivity.\")\n",
|
||||
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
|
||||
"print(\" Original secret will be backed up automatically.\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"✅ Environment verified - ready for PostgreSQL failure lab\")\n",
|
||||
"print(\"=\" * 70)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration & Prerequisites\n",
|
||||
"\n",
|
||||
"Load configuration and verify prerequisites.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import require_env\n",
|
||||
"\n",
|
||||
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
|
||||
"config = require_env(*required_vars)\n",
|
||||
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"\n",
|
||||
"print(f\"Namespace: {namespace}\")\n",
|
||||
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. What This Service Does for LangSmith\n",
|
||||
"\n",
|
||||
"PostgreSQL is LangSmith's **primary metadata store**. It holds:\n",
|
||||
"\n",
|
||||
"- **User accounts and authentication data**\n",
|
||||
"- **Workspaces and projects** (organizational structure)\n",
|
||||
"- **API keys and permissions** (access control)\n",
|
||||
"- **Trace metadata** (not the trace data itself, which goes to ClickHouse)\n",
|
||||
"- **Evaluation results and feedback**\n",
|
||||
"\n",
|
||||
"**Why it matters:**\n",
|
||||
"- Without PostgreSQL, users can't log in\n",
|
||||
"- API calls fail (no authentication, no project lookups)\n",
|
||||
"- UI loads but can't perform actions\n",
|
||||
"- All LangSmith functionality depends on it\n",
|
||||
"\n",
|
||||
"**How LangSmith connects:**\n",
|
||||
"- Connection string stored in Kubernetes Secrets\n",
|
||||
"- Connection pool managed by application\n",
|
||||
"- Connection limits matter: PostgreSQL enforces `max_connections`, so an oversized or leaking pool can lock out every client\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Expected Symptoms When PostgreSQL Fails\n",
|
||||
"\n",
|
||||
"**What you'll see:**\n",
|
||||
"\n",
|
||||
"1. **API 5xx errors:**\n",
|
||||
" - `/api/v1/...` endpoints return 500 or 503\n",
|
||||
" - Error messages mention \"database\" or \"connection\"\n",
|
||||
"\n",
|
||||
"2. **Login failures:**\n",
|
||||
" - Users can't authenticate\n",
|
||||
" - OIDC/SAML may work (redirects) but session creation fails\n",
|
||||
"\n",
|
||||
"3. **UI loads but actions fail:**\n",
|
||||
" - Pages render (static content)\n",
|
||||
" - API calls fail (can't load projects, traces, etc.)\n",
|
||||
"\n",
|
||||
"4. **Log patterns:**\n",
|
||||
" - Connection timeout errors\n",
|
||||
" - \"connection refused\" or \"connection reset\"\n",
|
||||
" - \"too many clients already\" or \"remaining connection slots are reserved\" (if the server's connection limit is reached)\n",
|
||||
" - \"authentication failed\" (if credentials wrong)\n",
|
||||
"\n",
|
||||
"**Timeline:**\n",
|
||||
"- Symptoms appear within seconds of failure\n",
|
||||
"- API calls start failing immediately\n",
|
||||
"- Existing connections may work briefly, then fail\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Failure Injection Options\n",
|
||||
"\n",
|
||||
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
|
||||
"\n",
|
||||
"### Level 1: Subtle Failure (Recommended for first run)\n",
|
||||
"\n",
|
||||
"**Option A: Wrong Database Password**\n",
|
||||
"- Modify the PostgreSQL password in the Kubernetes Secret\n",
|
||||
"- Symptoms: `password authentication failed` errors in pod logs (the server is still reachable, so connections are rejected rather than refused)\n",
|
||||
"\n",
|
||||
"**Option B: Wrong Database Host**\n",
|
||||
"- Point connection string to non-existent host\n",
|
||||
"- Symptoms: Connection timeout, DNS resolution failures\n",
|
||||
"\n",
|
||||
"**Option C: Network Isolation (if NetworkPolicy supported)**\n",
|
||||
"- Apply NetworkPolicy blocking egress to PostgreSQL\n",
|
||||
"- Symptoms: Connection timeout, no route to host\n",
|
||||
"\n",
|
||||
"### Level 2: Obvious Failure\n",
|
||||
"\n",
|
||||
"**Option D: Remove Secret Entirely**\n",
|
||||
"- Delete the PostgreSQL connection secret\n",
|
||||
"- Symptoms: Pods crash on startup, immediate failures\n",
|
||||
"\n",
|
||||
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
|
||||
"\n",
|
||||
"**Before injecting any failure, verify your baseline is healthy.**\n",
|
||||
"\n",
|
||||
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\"### Quick Baseline Check\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
"    pods = json.loads(result.stdout)\n",
|
||||
"    # Completed Job pods report phase \"Succeeded\", so count them as healthy too\n",
|
||||
"    healthy = sum(1 for p in pods.get(\"items\", [])\n",
|
||||
"                  if p.get(\"status\", {}).get(\"phase\") in (\"Running\", \"Succeeded\"))\n",
|
||||
"    total = len(pods.get(\"items\", []))\n",
|
||||
"    print(f\"Pods: {healthy}/{total} running or completed\")\n",
|
||||
"\n",
|
||||
"    if healthy == total and total > 0:\n",
|
||||
"        ok(\"Baseline looks healthy\")\n",
|
||||
"    else:\n",
|
||||
"        warn(\"Some pods are not running - check baseline first\")\n",
|
||||
"else:\n",
|
||||
"    warn(\"Could not check pod status\")\n",
|
||||
"\n",
|
||||
"# Check for PostgreSQL secret\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"postgres_secrets = []\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" for secret in secrets.get(\"items\", []):\n",
|
||||
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" if \"postgres\" in name.lower() or \"database\" in name.lower() or \"db\" in name.lower():\n",
|
||||
" postgres_secrets.append(name)\n",
|
||||
"\n",
|
||||
"if postgres_secrets:\n",
|
||||
" ok(f\"Found {len(postgres_secrets)} PostgreSQL-related secret(s)\")\n",
|
||||
" for secret_name in postgres_secrets:\n",
|
||||
" print(f\" - {secret_name}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"No PostgreSQL secrets found\")\n",
|
||||
" print(\" 💡 PostgreSQL connection may be configured differently\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
|
||||
"\n",
|
||||
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
|
||||
"\n",
|
||||
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# FAILURE INJECTION: Wrong Database Password\n",
|
||||
"# This cell modifies the PostgreSQL password secret to an invalid value\n",
|
||||
"\n",
|
||||
"import base64\n",
|
||||
"import yaml\n",
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Find PostgreSQL secret (look for common names)\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"postgres_secret_name = None\n",
|
||||
"if result.returncode == 0:\n",
|
||||
"    secrets = json.loads(result.stdout)\n",
|
||||
"    for secret in secrets.get(\"items\", []):\n",
|
||||
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
"        # Common patterns: postgres, database, db, langsmith-db\n",
|
||||
"        if any(keyword in name.lower() for keyword in [\"postgres\", \"database\", \"db\"]):\n",
|
||||
"            # Check if it has password-related keys (same candidates as the key lookup below)\n",
|
||||
"            data = secret.get(\"data\", {})\n",
|
||||
"            if any(key in data for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]):\n",
|
||||
"                postgres_secret_name = name\n",
|
||||
"                break\n",
|
||||
"\n",
|
||||
"if not postgres_secret_name:\n",
|
||||
" raise RuntimeError(\"❌ Could not find PostgreSQL secret. Check your deployment configuration.\")\n",
|
||||
"\n",
|
||||
"print(f\"Found PostgreSQL secret: {postgres_secret_name}\")\n",
|
||||
"\n",
|
||||
"# Get current secret\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
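||||
"# Save original secret for restoration. Server-managed metadata is dropped so that\n",
|
||||
"# a later `kubectl apply -f` of the backup is not rejected for a stale resourceVersion.\n",
|
||||
"backup_data = yaml.safe_load(result.stdout)\n",
|
||||
"for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
|
||||
"    backup_data.get(\"metadata\", {}).pop(field, None)\n",
|
||||
"backup_file = artifacts_dir / \"module-4\" / f\"postgres-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
|
||||
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
||||
"with open(backup_file, \"w\") as f:\n",
|
||||
"    yaml.safe_dump(backup_data, f)\n",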
|
||||
"\n",
|
||||
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
|
||||
"\n",
|
||||
"# Parse YAML and modify password\n",
|
||||
"secret_data = yaml.safe_load(result.stdout)\n",
|
||||
"if \"data\" not in secret_data:\n",
|
||||
" raise RuntimeError(\"Secret has no data section\")\n",
|
||||
"\n",
|
||||
"# Find password key (could be password, POSTGRES_PASSWORD, DB_PASSWORD, etc.)\n",
|
||||
"password_key = None\n",
|
||||
"for key in [\"password\", \"POSTGRES_PASSWORD\", \"DB_PASSWORD\", \"postgres-password\"]:\n",
|
||||
" if key in secret_data[\"data\"]:\n",
|
||||
" password_key = key\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not password_key:\n",
|
||||
" raise RuntimeError(\"Could not find password key in secret\")\n",
|
||||
"\n",
|
||||
"# Set invalid password\n",
|
||||
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
|
||||
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
|
||||
"\n",
|
||||
"# Modify secret\n",
|
||||
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
|
||||
"\n",
|
||||
"# Save modified secret to temp file\n",
|
||||
"temp_secret_file = artifacts_dir / \"module-4\" / \"postgres-secret-modified.yaml\"\n",
|
||||
"with open(temp_secret_file, \"w\") as f:\n",
|
||||
" yaml.dump(secret_data, f)\n",
|
||||
"\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"⚠️ READY TO APPLY FAILURE INJECTION\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(f\"\\nThis will modify secret: {postgres_secret_name}\")\n",
|
||||
"print(f\"Modified secret saved to: {temp_secret_file.name}\")\n",
|
||||
"print(f\"Backup saved to: {backup_file.name}\")\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"⚠️ FINAL WARNING BEFORE FAILURE INJECTION\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"\\nThis will:\")\n",
|
||||
"print(\" ❌ Break PostgreSQL connectivity\")\n",
|
||||
"print(\" ❌ Cause API 5xx errors\")\n",
|
||||
"print(\" ❌ Break login/authentication\")\n",
|
||||
"print(\" ❌ Disrupt LangSmith functionality\")\n",
|
||||
"print(\"\\nTo apply the failure:\")\n",
|
||||
"print(\" 1. Verify MODULE4_SAFE_ENVIRONMENT=true is set\")\n",
|
||||
"print(\" 2. Verify you're in a TEST environment\")\n",
|
||||
"print(\" 3. Uncomment the code in the next cell\")\n",
|
||||
"print(\" 4. Run the next cell to apply\")\n",
|
||||
"print(\"\\nTo restore after the lab:\")\n",
|
||||
"print(f\" - Use the backup file: {backup_file.name}\")\n",
|
||||
"print(\" - See the 'Remediation' section below\")\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
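||||
"source": [
|
||||
"## 7. Do the Drill - Step 2 (continued): Apply the Failure\n",
|
||||
"\n",
|
||||
"**Uncomment the code in the next cell and run it to apply the modified secret.**\n"
|
||||
]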
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
|
||||
"# \n",
|
||||
"# result = run(\n",
|
||||
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
|
||||
"# check=True,\n",
|
||||
"# stream=True\n",
|
||||
"# )\n",
|
||||
"# \n",
|
||||
"# ok(\"Failure injection applied - PostgreSQL password is now invalid\")\n",
|
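||||
"# print(\"\\n💡 Pods that read the secret through environment variables only see the new value after a restart.\")\n",
|
||||
"# print(\"   If they do not restart on their own, trigger a rollout and watch it:\")\n",
|
||||
"# print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
|
||||
"# print(f\"   kubectl get pods -n {namespace} -w\")\n",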
|
||||
"# \n",
|
||||
"# # Wait a moment for changes to propagate\n",
|
||||
"# import time\n",
|
||||
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
|
||||
"# time.sleep(30)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
|
||||
"\n",
|
||||
"**Now that the failure is injected, observe how it manifests.**\n",
|
||||
"\n",
|
||||
"Check:\n",
|
||||
"1. Pod logs for connection errors\n",
|
||||
"2. API endpoint responses\n",
|
||||
"3. UI behavior\n",
|
||||
"4. Events for pod restarts\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Create incident directory for diagnostics\n",
|
||||
"incident_dir = artifacts_dir / \"module-4\" / f\"postgres-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
|
||||
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
|
||||
"print(f\"Saving to: {incident_dir}\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Check pod status\n",
|
||||
"print(\"1. Checking pod status...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" print(result.stdout)\n",
|
||||
" \n",
|
||||
" # Check for restarts\n",
|
||||
" lines = result.stdout.split(\"\\n\")\n",
|
||||
" restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
|
||||
" if restarts:\n",
|
||||
" print(\"\\n Pod restart counts:\")\n",
|
||||
" for line in restarts[1:]: # Skip header\n",
|
||||
" if line.strip():\n",
|
||||
" parts = line.split()\n",
|
||||
" if len(parts) > 3:\n",
|
||||
" print(f\" {parts[0]}: {parts[3]} restarts\")\n",
|
||||
"\n",
|
||||
"# 2. Check recent events\n",
|
||||
"print(\"\\n2. Checking recent events...\")\n",
|
||||
"result = run(\n",
|
||||
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(incident_dir / \"events.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" if result.stdout.strip():\n",
|
||||
" print(\" Recent warning/error events:\")\n",
|
||||
" for line in result.stdout.split(\"\\n\")[-5:]:\n",
|
||||
" if line.strip():\n",
|
||||
" print(f\" {line}\")\n",
|
||||
"\n",
|
||||
"# 3. Check API pod logs for database errors\n",
|
||||
"print(\"\\n3. Checking API pod logs for database errors...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
|
||||
"if api_pod:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=50\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" logs_file = incident_dir / f\"api-pod-{api_pod}-logs.txt\"\n",
|
||||
" with open(logs_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" \n",
|
||||
" # Look for database-related errors\n",
|
||||
" error_keywords = [\"database\", \"postgres\", \"connection\", \"timeout\", \"refused\", \"authentication\"]\n",
|
||||
" error_lines = [l for l in result.stdout.split(\"\\n\") \n",
|
||||
" if any(kw in l.lower() for kw in error_keywords)]\n",
|
||||
" \n",
|
||||
" if error_lines:\n",
|
||||
" print(\" Found database-related errors:\")\n",
|
||||
" for line in error_lines[-5:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" print(\" No obvious database errors in recent logs\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not find API pod\")\n",
|
||||
"\n",
|
||||
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
|
||||
"\n",
|
||||
"**This is critical - Support will ask for this bundle.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"\n",
|
||||
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
|
||||
"\n",
|
||||
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
|
||||
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
"    urllib.request.urlretrieve(script_url, script_path)\n",
|
||||
"    script_path.chmod(0o755)\n",
|
||||
"\n",
|
||||
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
|
||||
"    result = run(\n",
|
||||
"        [str(script_path), namespace],\n",
|
||||
"        check=False,\n",
|
||||
"        stream=True\n",
|
||||
"    )\n",
|
||||
"\n",
|
||||
"    if result.returncode == 0:\n",
|
||||
"        ok(\"Diagnostics script completed\")\n",
|
||||
"\n",
|
||||
"        # Find and move tarball (Path.suffix is only \".gz\", so match on the full file name)\n",
|
||||
"        for file in incident_dir.parent.iterdir():\n",
|
||||
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
|
||||
"                target_path = incident_dir / file.name\n",
|
||||
"                file.rename(target_path)\n",
|
||||
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
|
||||
"                break\n",
|
||||
"    else:\n",
|
||||
"        warn(\"Diagnostics script had errors (check output above)\")\n",
|
||||
"\n",
|
||||
"except Exception as e:\n",
|
||||
"    warn(f\"Could not run diagnostics script: {e}\")\n",
|
||||
"    print(\"   💡 You can run it manually:\")\n",
|
||||
"    print(f\"      curl -O {script_url}\")\n",
|
||||
"    print(\"      chmod +x get_k8s_debugging_info.sh\")\n",
|
||||
"    print(f\"      ./get_k8s_debugging_info.sh {namespace}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 10. Do the Drill - Step 5: Guided Triage\n",
|
||||
"\n",
|
||||
"**Where to look first for PostgreSQL issues:**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Guided Triage Steps\\n\")\n",
|
||||
"\n",
|
||||
"print(\"1. Check pod logs for connection errors:\")\n",
|
||||
"print(f\" kubectl logs -n {namespace} <pod-name> | grep -i 'database\\\\|postgres\\\\|connection'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"2. Verify secret exists and has correct keys:\")\n",
|
||||
"print(f\" kubectl get secret {postgres_secret_name} -n {namespace} -o yaml\")\n",
|
||||
"print(\"   (Avoid sharing this output: the values are only base64-encoded, not encrypted)\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"3. Check for pod restarts (indicates startup failures):\")\n",
|
||||
"print(f\" kubectl get pods -n {namespace}\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"4. Test database connectivity from a pod (if possible):\")\n",
|
||||
"print(\" kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \\\\\")\n",
|
||||
"print(\" psql -h <db-host> -U <user> -d <database>\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"5. Check events for authentication/connection errors:\")\n",
|
||||
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"# Check what we can automatically\n",
|
||||
"print(\"\\n### Automatic Checks\\n\")\n",
|
||||
"\n",
|
||||
"# Check secret still exists\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", postgres_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(f\"Secret '{postgres_secret_name}' still exists\")\n",
|
||||
" secret_data = json.loads(result.stdout)\n",
|
||||
" keys = list(secret_data.get(\"data\", {}).keys())\n",
|
||||
" print(f\" Secret keys: {', '.join(keys)}\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Secret '{postgres_secret_name}' not found!\")\n",
|
||||
"\n",
|
||||
"# Check for pods with database connection env vars\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" db_related_pods = []\n",
|
||||
" for pod in pods.get(\"items\", []):\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
|
||||
" for container in containers:\n",
|
||||
" env = container.get(\"env\", [])\n",
|
||||
" db_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
|
||||
" for kw in [\"DB\", \"POSTGRES\", \"DATABASE\"])]\n",
|
||||
" if db_env:\n",
|
||||
" db_related_pods.append(name)\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" if db_related_pods:\n",
|
||||
" print(f\"\\n Pods with database environment variables:\")\n",
|
||||
" for pod_name in set(db_related_pods):\n",
|
||||
" print(f\" - {pod_name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 11. Do the Drill - Step 6: Remediation\n",
|
||||
"\n",
|
||||
"**Restore the original secret to fix the issue.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# REMEDIATION: Restore original secret\n",
|
||||
"# UNCOMMENT TO RESTORE\n",
|
||||
"\n",
|
||||
"# if backup_file.exists():\n",
|
||||
"#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
|
||||
"#     result = run(\n",
|
||||
"#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
|
||||
"#         check=True,\n",
|
||||
"#         stream=True\n",
|
||||
"#     )\n",
|
||||
"#\n",
|
||||
"#     ok(\"Original secret restored\")\n",
|
||||
"#     print(\"\\n💡 Pods that read the secret through environment variables only see the restored value after a restart.\")\n",
|
||||
"#     print(\"   If they do not restart on their own, trigger a rollout and monitor pod status:\")\n",
|
||||
"#     print(f\"   kubectl rollout restart deployment -n {namespace}\")\n",
|
||||
"#     print(f\"   kubectl get pods -n {namespace} -w\")\n",
|
||||
"#\n",
|
||||
"#     import time\n",
|
||||
"#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
|
||||
"#     time.sleep(60)\n",
|
||||
"# else:\n",
|
||||
"#     warn(f\"Backup file not found: {backup_file}\")\n",
|
||||
"#     print(\"   💡 You may need to manually restore the secret\")\n",
|
||||
"\n",
|
||||
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
|
||||
"print(f\" Backup file: {backup_file.name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
|
||||
"\n",
|
||||
"**Verify that everything is working again.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Verifying Recovery\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
"    pods = json.loads(result.stdout)\n",
|
||||
"    # Completed Job pods report phase \"Succeeded\", so count them as recovered too\n",
|
||||
"    running = sum(1 for p in pods.get(\"items\", [])\n",
|
||||
"                  if p.get(\"status\", {}).get(\"phase\") in (\"Running\", \"Succeeded\"))\n",
|
||||
"    total = len(pods.get(\"items\", []))\n",
|
||||
"\n",
|
||||
"    if running == total and total > 0:\n",
|
||||
"        ok(f\"All {total} pod(s) are running or completed\")\n",
|
||||
"    else:\n",
|
||||
"        warn(f\"Only {running}/{total} pod(s) running or completed\")\n",
|
||||
"        print(\"   💡 Wait a bit longer for pods to fully recover\")\n",
|
||||
"\n",
|
||||
"# Check for recent errors in logs\n",
|
||||
"print(\"\\nChecking for recent errors...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-api\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"api_pod = result.stdout.strip().strip(\"'\\\"\")\n",
|
||||
"if api_pod:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"logs\", \"-n\", namespace, api_pod, \"--tail=20\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" error_keywords = [\"error\", \"fail\", \"database\", \"postgres\", \"connection\"]\n",
|
||||
" recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
|
||||
" if any(kw in l.lower() for kw in error_keywords)]\n",
|
||||
" \n",
|
||||
" if recent_errors:\n",
|
||||
" warn(\"Still seeing some errors in logs:\")\n",
|
||||
" for line in recent_errors[-3:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" ok(\"No recent errors in API logs\")\n",
|
||||
"\n",
|
||||
"ok(\"Recovery verification complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 13. What Support Will Ask For\n",
|
||||
"\n",
|
||||
"**When escalating a PostgreSQL issue, Support will need:**\n",
|
||||
"\n",
|
||||
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
|
||||
"2. **PostgreSQL connection details:**\n",
|
||||
" - Host/endpoint (redacted)\n",
|
||||
" - Database name\n",
|
||||
" - Username (redacted)\n",
|
||||
" - Whether using SSL/TLS\n",
|
||||
"3. **Error messages from logs:**\n",
|
||||
" - Full error text (not just \"connection failed\")\n",
|
||||
" - Timestamps of first occurrence\n",
|
||||
"4. **Recent changes:**\n",
|
||||
" - Secret rotations\n",
|
||||
" - Database migrations\n",
|
||||
" - Network policy changes\n",
|
||||
"5. **Connection pool status:**\n",
|
||||
" - Current connections vs. max connections\n",
|
||||
" - Connection pool exhaustion patterns\n",
|
||||
"6. **Database health (if accessible):**\n",
|
||||
" - PostgreSQL version\n",
|
||||
" - Active connections\n",
|
||||
" - Lock contention\n",
|
||||
"\n",
|
||||
"**Evidence collected in this lab:**\n",
|
||||
"- ✅ Diagnostics bundle\n",
|
||||
"- ✅ Pod logs with database errors\n",
|
||||
"- ✅ Events showing failures\n",
|
||||
"- ✅ Secret configuration (structure, not values)\n",
|
||||
"\n",
|
||||
"**Additional evidence to gather (if escalating):**\n",
|
||||
"- Database endpoint connectivity test\n",
|
||||
"- Connection pool metrics (if available)\n",
|
||||
"- PostgreSQL logs (if accessible via cloud provider)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 14. Lessons Learned\n",
|
||||
"\n",
|
||||
"**Key takeaways from this lab:**\n",
|
||||
"\n",
|
||||
"1. **PostgreSQL failures manifest quickly** - API calls fail within seconds\n",
|
||||
"2. **Logs are your friend** - Connection errors appear in pod logs immediately\n",
|
||||
"3. **Secrets matter** - Wrong credentials cause authentication failures\n",
|
||||
"4. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
|
||||
"5. **Diagnostics bundle is essential** - Support needs it for root cause analysis\n",
|
||||
"\n",
|
||||
"**Common mistakes to avoid:**\n",
|
||||
"- ❌ Changing multiple things at once (hard to identify root cause)\n",
|
||||
"- ❌ Not collecting diagnostics before remediation\n",
|
||||
"- ❌ Ignoring connection pool limits\n",
|
||||
"- ❌ Not testing database connectivity independently\n",
|
||||
"\n",
|
||||
"**Next steps:**\n",
|
||||
"- Practice with the other failure injection methods (Options B-D)\n",
|
||||
"- Try the Redis, ClickHouse, or Blob Storage failure labs\n",
|
||||
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,941 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 4: Failure Lab - Redis\n",
|
||||
"\n",
|
||||
"## ⚠️ CRITICAL SAFETY WARNING\n",
|
||||
"\n",
|
||||
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
|
||||
"- **Modifies Kubernetes secrets** (breaks Redis password)\n",
|
||||
"- **Causes service disruptions** (intermittent ingestion, worker backlog)\n",
|
||||
"- **Requires remediation** to restore functionality\n",
|
||||
"\n",
|
||||
"**REQUIREMENTS:**\n",
|
||||
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
|
||||
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
|
||||
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
|
||||
"- ✅ **Backup/restore plan** available\n",
|
||||
"\n",
|
||||
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"**This lab teaches you how to debug Redis connectivity failures in LangSmith.**\n",
|
||||
"\n",
|
||||
"Redis is LangSmith's **cache and job queue**. It handles:\n",
|
||||
"- Job queue for asynchronous trace processing\n",
|
||||
"- Caching for frequently accessed data\n",
|
||||
"- Rate limiting and session management\n",
|
||||
"- Worker coordination\n",
|
||||
"\n",
|
||||
"**When Redis fails, you'll see:**\n",
|
||||
"- Intermittent ingestion issues\n",
|
||||
"- Latency spikes and retries\n",
|
||||
"- Worker backlog (jobs piling up)\n",
|
||||
"- Traces may be delayed or missing\n",
|
||||
"\n",
|
||||
"**Learning Objectives:**\n",
|
||||
"1. Understand how Redis failures manifest\n",
|
||||
"2. Practice collecting diagnostics for cache/queue issues\n",
|
||||
"3. Learn to identify connection vs. credential vs. network issues\n",
|
||||
"4. Practice safe remediation\n",
|
||||
"\n",
|
||||
"**Estimated time:** 30-45 minutes\n",
|
||||
"\n",
|
||||
"**⚠️ Important:** \n",
|
||||
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
|
||||
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent,\n",
|
||||
" Path.cwd(),\n",
|
||||
" Path.cwd() / \"notebooks\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
"    if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
"        notebooks_path = path\n",
|
||||
"        break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
"    notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
"    if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
"        raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ⚠️ CRITICAL: Environment Safety Verification\n",
|
||||
"\n",
|
||||
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"⚠️ CRITICAL SAFETY CHECK - REDIS FAILURE LAB\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Show environment details prominently\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"\\n### Current Environment Configuration\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
"if provider == \"aws\":\n",
"    account_id = identity.get('Account', 'N/A')\n",
"    user_arn = identity.get('Arn', 'N/A')\n",
"    print(f\"Account ID: {account_id}\")\n",
"    print(f\"User ARN: {user_arn}\")\n",
"elif provider == \"azure\":\n",
"    subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
"    subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
"    print(f\"Subscription ID: {subscription_id}\")\n",
"    print(f\"Subscription Name: {subscription_name}\")\n",
|
||||
"\n",
|
||||
"# Show all relevant environment variables\n",
|
||||
"print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
|
||||
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
|
||||
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
|
||||
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
|
||||
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"\\nThis failure lab will:\")\n",
|
||||
"print(\" 1. Find the Redis secret in your namespace\")\n",
|
||||
"print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
|
||||
"print(\" 3. MODIFY the secret to set an INVALID password\")\n",
|
||||
"print(\" 4. Apply the modified secret (breaks Redis connectivity)\")\n",
|
||||
"print(\" 5. Cause intermittent ingestion issues and worker backlog\")\n",
|
||||
"print(\" 6. Require remediation to restore (restore original secret)\")\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Check for Module 4 safety flag\n",
|
||||
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
"    fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
"    print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
"    print(\"\\nTo run this failure lab, you MUST:\")\n",
"    print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
"    print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
"    print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
"    print(\" 4. Re-run this cell to confirm\")\n",
"    print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
"    raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
|
||||
"\n",
|
||||
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
|
||||
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
|
||||
"print(\"\\n⚠️ REMINDER: This lab will break Redis connectivity.\")\n",
|
||||
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
|
||||
"print(\" Original secret will be backed up automatically.\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"✅ Environment verified - ready for Redis failure lab\")\n",
|
||||
"print(\"=\" * 70)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration & Prerequisites\n",
|
||||
"\n",
|
||||
"Load configuration and verify prerequisites.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import require_env\n",
|
||||
"\n",
|
||||
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
|
||||
"config = require_env(*required_vars)\n",
|
||||
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"\n",
|
||||
"print(f\"Namespace: {namespace}\")\n",
|
||||
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. What This Service Does for LangSmith\n",
|
||||
"\n",
|
||||
"Redis is LangSmith's **cache and job queue**. It handles:\n",
|
||||
"\n",
"- **Job queue for asynchronous processing:**\n",
"  - Workers pull trace processing jobs from Redis\n",
"  - Jobs are queued when traces arrive via API\n",
"  - Queue backlog indicates processing delays\n",
"\n",
"- **Caching:**\n",
"  - Frequently accessed data (project metadata, user info)\n",
"  - Reduces load on PostgreSQL\n",
"  - Improves response times\n",
"\n",
"- **Rate limiting and session management:**\n",
"  - API rate limiting\n",
"  - Session storage (if configured)\n",
"\n",
"- **Worker coordination:**\n",
"  - Distributed locking\n",
"  - Task distribution\n",
|
||||
"\n",
|
||||
"**Why it matters:**\n",
|
||||
"- Without Redis, workers can't process traces\n",
|
||||
"- Job queue fills up, causing delays\n",
|
||||
"- Cache misses increase load on PostgreSQL\n",
|
||||
"- Ingestion becomes unreliable\n",
|
||||
"\n",
|
||||
"**How LangSmith connects:**\n",
|
||||
"- Connection string stored in Kubernetes Secrets\n",
|
||||
"- Workers connect to Redis to pull jobs\n",
|
||||
"- API servers use Redis for caching\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Expected Symptoms When Redis Fails\n",
|
||||
"\n",
|
||||
"**What you'll see:**\n",
|
||||
"\n",
"1. **Intermittent ingestion issues:**\n",
"   - Some traces process, others don't\n",
"   - Inconsistent behavior (works sometimes, fails other times)\n",
"   - Retries visible in logs\n",
"\n",
"2. **Latency spikes:**\n",
"   - API responses slow down\n",
"   - Worker processing delays\n",
"   - Timeout errors\n",
"\n",
"3. **Worker backlog:**\n",
"   - Jobs piling up in queue\n",
"   - Workers unable to pull new jobs\n",
"   - Queue length increasing\n",
"\n",
"4. **Log patterns:**\n",
"   - Connection timeout errors\n",
"   - \"connection refused\" or \"connection reset\"\n",
"   - \"WRONGPASS invalid username-password pair\" (wrong password, Redis 6+) or \"NOAUTH Authentication required\" (no password sent)\n",
"   - Retry attempts in worker logs\n",
"   - Cache miss patterns\n",
|
||||
"\n",
|
||||
"**Timeline:**\n",
|
||||
"- Symptoms may be intermittent (connection pool retries)\n",
|
||||
"- Worker backlog builds over time\n",
|
||||
"- Cache misses cause cascading delays\n",
|
||||
"- Full failure if connection pool exhausted\n"
|
||||
]
|
||||
},
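{
"cell_type": "markdown",
"metadata": {},
"source": [
"The log patterns above fall into three rough classes: credential, network, and connection errors. As a minimal sketch of that mapping, the hypothetical `classify_redis_error` helper below buckets a single log line. It is not part of the `shared` helpers; the keyword lists are assumptions based on common Redis and client error text, and real log wording varies by client library.\n",
"\n",
"```python\n",
"# Hypothetical helper (an assumption, not a LangSmith or shared API):\n",
"# bucket one log line by its most likely Redis failure class.\n",
"PATTERNS = {\n",
"    'credential': ('wrongpass', 'noauth', 'invalid password'),\n",
"    'network': ('timed out', 'timeout', 'no route to host', 'name or service not known'),\n",
"    'connection': ('connection refused', 'connection reset'),\n",
"}\n",
"\n",
"\n",
"def classify_redis_error(line):\n",
"    lowered = line.lower()\n",
"    for label, needles in PATTERNS.items():\n",
"        if any(needle in lowered for needle in needles):\n",
"            return label\n",
"    return 'unknown'\n",
"\n",
"\n",
"print(classify_redis_error('NOAUTH Authentication required.'))\n",
"```\n",
"\n",
"Patterns are checked in insertion order, so a line that matches several classes is reported under the first one.\n"
]
},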
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Failure Injection Options\n",
|
||||
"\n",
|
||||
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
|
||||
"\n",
|
||||
"### Level 1: Subtle Failure (Recommended for first run)\n",
|
||||
"\n",
|
||||
"**Option A: Wrong Redis Password**\n",
|
||||
"- Modify the Redis password in the Kubernetes Secret\n",
"- Symptoms: Authentication errors in logs, retries, intermittent failures\n",
|
||||
"\n",
|
||||
"**Option B: Block Egress to Redis Endpoint**\n",
|
||||
"- Apply NetworkPolicy blocking egress to Redis (if NetworkPolicy supported)\n",
|
||||
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
|
||||
"\n",
|
||||
"### Level 2: Obvious Failure\n",
|
||||
"\n",
|
||||
"**Option C: Wrong Redis Host/Endpoint**\n",
|
||||
"- Point connection string to non-existent host\n",
|
||||
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
|
||||
"\n",
|
||||
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
|
||||
"\n",
|
||||
"**Before injecting any failure, verify your baseline is healthy.**\n",
|
||||
"\n",
|
||||
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\"### Quick Baseline Check\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    healthy = sum(1 for p in pods.get(\"items\", [])\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(pods.get(\"items\", []))\n",
"    print(f\"Pods: {healthy}/{total} running\")\n",
"\n",
"    if healthy == total and total > 0:\n",
"        ok(\"Baseline looks healthy\")\n",
"    else:\n",
"        warn(\"Some pods are not running - check baseline first\")\n",
"else:\n",
"    warn(\"Could not check pod status\")\n",
"\n",
"# Check for Redis secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"redis_secrets = []\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        if \"redis\" in name.lower() or \"cache\" in name.lower():\n",
"            redis_secrets.append(name)\n",
"\n",
"if redis_secrets:\n",
"    ok(f\"Found {len(redis_secrets)} Redis-related secret(s)\")\n",
"    for secret_name in redis_secrets:\n",
"        print(f\" - {secret_name}\")\n",
"else:\n",
"    warn(\"No Redis secrets found\")\n",
"    print(\" 💡 Redis connection may be configured differently\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
|
||||
"\n",
|
||||
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
|
||||
"\n",
|
||||
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# FAILURE INJECTION: Wrong Redis Password\n",
|
||||
"# This cell modifies the Redis password secret to an invalid value\n",
|
||||
"\n",
|
||||
"import base64\n",
|
||||
"import yaml\n",
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Find Redis secret (look for common names)\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"redis_secret_name = None\n",
"if result.returncode == 0:\n",
"    secrets = json.loads(result.stdout)\n",
"    for secret in secrets.get(\"items\", []):\n",
"        name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        # Common patterns: redis, cache\n",
"        if any(keyword in name.lower() for keyword in [\"redis\", \"cache\"]):\n",
"            # Check if it has password-related keys (same list as the lookup below)\n",
"            data = secret.get(\"data\", {})\n",
"            if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]):\n",
"                redis_secret_name = name\n",
"                break\n",
"\n",
"if not redis_secret_name:\n",
"    raise RuntimeError(\"❌ Could not find Redis secret. Check your deployment configuration.\")\n",
"\n",
"print(f\"Found Redis secret: {redis_secret_name}\")\n",
"\n",
"# Get current secret\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
"    check=True,\n",
"    stream=False\n",
")\n",
"\n",
"# Save original secret for restoration\n",
"backup_file = artifacts_dir / \"module-4\" / f\"redis-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
"with open(backup_file, \"w\") as f:\n",
"    f.write(result.stdout)\n",
"\n",
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
"\n",
"# Parse YAML and modify password\n",
"secret_data = yaml.safe_load(result.stdout)\n",
"if \"data\" not in secret_data:\n",
"    raise RuntimeError(\"Secret has no data section\")\n",
"\n",
"# Find password key (could be password, REDIS_PASSWORD, CACHE_PASSWORD, etc.)\n",
"password_key = None\n",
"for key in [\"password\", \"REDIS_PASSWORD\", \"CACHE_PASSWORD\", \"redis-password\"]:\n",
"    if key in secret_data[\"data\"]:\n",
"        password_key = key\n",
"        break\n",
"\n",
"if not password_key:\n",
"    raise RuntimeError(\"Could not find password key in secret\")\n",
"\n",
"# Set invalid password\n",
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
"\n",
"# Modify secret\n",
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
"\n",
"# Save modified secret to temp file\n",
"temp_secret_file = artifacts_dir / \"module-4\" / \"redis-secret-modified.yaml\"\n",
"with open(temp_secret_file, \"w\") as f:\n",
"    yaml.dump(secret_data, f)\n",
|
||||
"\n",
|
||||
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
|
||||
"print(f\" This will set an invalid password in secret: {redis_secret_name}\")\n",
|
||||
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
|
||||
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
|
||||
]
|
||||
},
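{
"cell_type": "markdown",
"metadata": {},
"source": [
"A manifest captured with `kubectl get -o yaml`, like the backup above, carries server-managed metadata such as `resourceVersion` and `uid`. Re-applying such a file after the live object has changed may be rejected with a conflict. A minimal sketch of dropping those fields before writing a manifest follows, assuming a dict shaped like `secret_data`; `strip_server_fields` and `SERVER_FIELDS` are hypothetical names, not part of the `shared` helpers.\n",
"\n",
"```python\n",
"# Hypothetical helper (an assumption, not part of this repository):\n",
"# remove metadata fields that the API server manages on its own.\n",
"SERVER_FIELDS = ('resourceVersion', 'uid', 'creationTimestamp', 'managedFields')\n",
"\n",
"\n",
"def strip_server_fields(manifest):\n",
"    cleaned = dict(manifest)  # shallow copy; the input dict is left untouched\n",
"    cleaned['metadata'] = {\n",
"        key: value\n",
"        for key, value in manifest.get('metadata', {}).items()\n",
"        if key not in SERVER_FIELDS\n",
"    }\n",
"    return cleaned\n",
"\n",
"\n",
"example = {'kind': 'Secret', 'metadata': {'name': 'demo', 'uid': 'abc', 'resourceVersion': '42'}}\n",
"print(strip_server_fields(example)['metadata'])\n",
"```\n",
"\n",
"The helper builds a new `metadata` dict instead of editing in place, so the parsed backup stays available unchanged for comparison.\n"
]
},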
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
|
||||
"# \n",
"# result = run(\n",
"#     [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
"#     check=True,\n",
"#     stream=True\n",
"# )\n",
|
||||
"# \n",
|
||||
"# ok(\"Failure injection applied - Redis password is now invalid\")\n",
|
||||
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
|
||||
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
|
||||
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
|
||||
"# \n",
|
||||
"# # Wait a moment for changes to propagate\n",
|
||||
"# import time\n",
|
||||
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
|
||||
"# time.sleep(30)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
|
||||
"\n",
|
||||
"**Now that the failure is injected, observe how it manifests.**\n",
|
||||
"\n",
|
||||
"Check:\n",
|
||||
"1. Worker pod logs for Redis connection errors\n",
|
||||
"2. Queue backlog (if visible)\n",
|
||||
"3. Worker retry patterns\n",
|
||||
"4. Latency in API responses\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Create incident directory for diagnostics\n",
|
||||
"incident_dir = artifacts_dir / \"module-4\" / f\"redis-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
|
||||
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
|
||||
"print(f\"Saving to: {incident_dir}\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Check pod status\n",
|
||||
"print(\"1. Checking pod status...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    print(result.stdout)\n",
"\n",
"    # Check for restarts\n",
"    lines = result.stdout.split(\"\\n\")\n",
"    restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
"    if restarts:\n",
"        print(\"\\n Pod restart counts:\")\n",
"        for line in restarts[1:]:  # Skip header\n",
"            if line.strip():\n",
"                parts = line.split()\n",
"                if len(parts) > 3:\n",
"                    print(f\" {parts[0]}: {parts[3]} restarts\")\n",
"\n",
"# 2. Check recent events\n",
"# (no shell is involved, so the sort key is passed without shell-style quotes)\n",
"print(\"\\n2. Checking recent events...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    with open(incident_dir / \"events.txt\", \"w\") as f:\n",
"        f.write(result.stdout)\n",
"    if result.stdout.strip():\n",
"        print(\" Recent warning/error events:\")\n",
"        for line in result.stdout.split(\"\\n\")[-5:]:\n",
"            if line.strip():\n",
"                print(f\" {line}\")\n",
"\n",
"# 3. Check worker pod logs for Redis errors\n",
"print(\"\\n3. Checking worker pod logs for Redis errors...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
"        with open(logs_file, \"w\") as f:\n",
"            f.write(result.stdout)\n",
"\n",
"        # Look for Redis-related errors\n",
"        # Keywords are lowercase because each line is lowercased before matching\n",
"        error_keywords = [\"redis\", \"cache\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"noauth\", \"wrongpass\", \"retry\"]\n",
"        error_lines = [l for l in result.stdout.split(\"\\n\")\n",
"                       if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if error_lines:\n",
"            print(\" Found Redis-related errors:\")\n",
"            for line in error_lines[-5:]:\n",
"                print(f\" {line}\")\n",
"        else:\n",
"            print(\" No obvious Redis errors in recent logs\")\n",
"else:\n",
"    warn(\"Could not find worker pod\")\n",
|
||||
"\n",
|
||||
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
|
||||
"\n",
|
||||
"**This is critical - Support will ask for this bundle.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"\n",
|
||||
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
|
||||
"\n",
|
||||
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
|
||||
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
|
||||
"\n",
"try:\n",
"    urllib.request.urlretrieve(script_url, script_path)\n",
"    script_path.chmod(0o755)\n",
"\n",
"    print(f\"Running diagnostics script for namespace: {namespace}\")\n",
"    result = run(\n",
"        [str(script_path), namespace],\n",
"        check=False,\n",
"        stream=True\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        ok(\"Diagnostics script completed\")\n",
"\n",
"        # Find and move tarball\n",
"        # Path.suffix is only '.gz' here, so match the full '.tar.gz' ending\n",
"        for file in incident_dir.parent.iterdir():\n",
"            if file.name.startswith(\"langsmith-debug-\") and file.name.endswith(\".tar.gz\"):\n",
"                target_path = incident_dir / file.name\n",
"                file.rename(target_path)\n",
"                ok(f\"Diagnostics bundle: {target_path.name}\")\n",
"                break\n",
"    else:\n",
"        warn(\"Diagnostics script had errors (check output above)\")\n",
"\n",
"except Exception as e:\n",
"    warn(f\"Could not run diagnostics script: {e}\")\n",
"    print(\" 💡 You can run it manually:\")\n",
"    print(f\" curl -O {script_url}\")\n",
"    print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
"    print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 10. Do the Drill - Step 5: Guided Triage\n",
|
||||
"\n",
|
||||
"**Where to look first for Redis issues:**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Guided Triage Steps\\n\")\n",
|
||||
"\n",
|
||||
"print(\"1. Check worker pod logs for Redis connection errors:\")\n",
|
||||
"print(f\" kubectl logs -n {namespace} <worker-pod-name> | grep -i 'redis\\\\|cache\\\\|connection'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"2. Verify secret exists and has correct keys:\")\n",
|
||||
"print(f\" kubectl get secret {redis_secret_name} -n {namespace} -o yaml\")\n",
"print(\" (Don't share this output - the values are only base64-encoded, not encrypted)\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
|
||||
"print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"4. Test Redis connectivity from a pod (if possible):\")\n",
|
||||
"print(\" kubectl run -it --rm debug --image=redis:7 --restart=Never -- \\\\\")\n",
|
||||
"print(\" redis-cli -h <redis-host> -p <port> -a <password> ping\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"5. Check events for connection/authentication errors:\")\n",
|
||||
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"# Check what we can automatically\n",
|
||||
"print(\"\\n### Automatic Checks\\n\")\n",
|
||||
"\n",
|
||||
"# Check secret still exists\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"secret\", redis_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    ok(f\"Secret '{redis_secret_name}' still exists\")\n",
"    secret_data = json.loads(result.stdout)\n",
"    keys = list(secret_data.get(\"data\", {}).keys())\n",
"    print(f\" Secret keys: {', '.join(keys)}\")\n",
"else:\n",
"    warn(f\"Secret '{redis_secret_name}' not found!\")\n",
"\n",
"# Check for pods with Redis connection env vars\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    redis_related_pods = []\n",
"    for pod in pods.get(\"items\", []):\n",
"        name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
"        containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
"        for container in containers:\n",
"            env = container.get(\"env\", [])\n",
"            redis_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper()\n",
"                                               for kw in [\"REDIS\", \"CACHE\"])]\n",
"            if redis_env:\n",
"                redis_related_pods.append(name)\n",
"                break\n",
"\n",
"    if redis_related_pods:\n",
"        print(f\"\\n Pods with Redis environment variables:\")\n",
"        for pod_name in set(redis_related_pods):\n",
"            print(f\" - {pod_name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 11. Do the Drill - Step 6: Remediation\n",
|
||||
"\n",
|
||||
"**Restore the original secret to fix the issue.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# REMEDIATION: Restore original secret\n",
|
||||
"# UNCOMMENT TO RESTORE\n",
|
||||
"\n",
"# if backup_file.exists():\n",
"#     print(f\"Restoring original secret from: {backup_file.name}\")\n",
"#     result = run(\n",
"#         [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
"#         check=True,\n",
"#         stream=True\n",
"#     )\n",
"#\n",
"#     ok(\"Original secret restored\")\n",
"#     print(\"\\n💡 Pods will need to restart to pick up the restored secret.\")\n",
"#     print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
"#     print(f\" kubectl get pods -n {namespace} -w\")\n",
"#\n",
"#     import time\n",
"#     print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
"#     time.sleep(60)\n",
"# else:\n",
"#     warn(f\"Backup file not found: {backup_file}\")\n",
"#     print(\" 💡 You may need to manually restore the secret\")\n",
|
||||
"\n",
|
||||
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
|
||||
"print(f\" Backup file: {backup_file.name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
|
||||
"\n",
|
||||
"**Verify that everything is working again.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Verifying Recovery\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"if result.returncode == 0:\n",
"    pods = json.loads(result.stdout)\n",
"    running = sum(1 for p in pods.get(\"items\", [])\n",
"                  if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
"    total = len(pods.get(\"items\", []))\n",
"\n",
"    if running == total and total > 0:\n",
"        ok(f\"All {total} pod(s) are running\")\n",
"    else:\n",
"        warn(f\"Only {running}/{total} pod(s) running\")\n",
"        print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
"\n",
"# Check for recent errors in worker logs\n",
"print(\"\\nChecking for recent errors in worker logs...\")\n",
"result = run(\n",
"    [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
"    check=False,\n",
"    stream=False\n",
")\n",
"\n",
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
"if worker_pod:\n",
"    result = run(\n",
"        [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
"        check=False,\n",
"        stream=False\n",
"    )\n",
"\n",
"    if result.returncode == 0:\n",
"        error_keywords = [\"error\", \"fail\", \"redis\", \"cache\", \"connection\"]\n",
"        recent_errors = [l for l in result.stdout.split(\"\\n\")\n",
"                         if any(kw in l.lower() for kw in error_keywords)]\n",
"\n",
"        if recent_errors:\n",
"            warn(\"Still seeing some errors in logs:\")\n",
"            for line in recent_errors[-3:]:\n",
"                print(f\" {line}\")\n",
"        else:\n",
"            ok(\"No recent errors in worker logs\")\n",
|
||||
"\n",
|
||||
"ok(\"Recovery verification complete\")\n"
|
||||
]
|
||||
},
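{
"cell_type": "markdown",
"metadata": {},
"source": [
"A pod can report `phase: Running` while its containers are still failing readiness checks, so the phase count above can look healthy too early. A stricter sketch follows, assuming the same `kubectl get pods -o json` payload; `count_ready` is a hypothetical helper, not part of the `shared` modules, and it relies on the `containerStatuses[].ready` field of the Pod status.\n",
"\n",
"```python\n",
"# Hypothetical helper (an assumption, not part of this repository):\n",
"# count pods whose containers all report ready, not just phase == Running.\n",
"def count_ready(pods):\n",
"    items = pods.get('items', [])\n",
"    ready = 0\n",
"    for pod in items:\n",
"        statuses = pod.get('status', {}).get('containerStatuses', [])\n",
"        if statuses and all(status.get('ready') for status in statuses):\n",
"            ready += 1\n",
"    return ready, len(items)\n",
"\n",
"\n",
"sample = {'items': [{'status': {'phase': 'Running', 'containerStatuses': [{'ready': False}]}}]}\n",
"print(count_ready(sample))  # prints (0, 1)\n",
"```\n",
"\n",
"Completed Job pods also report not ready, so filter those out first if the namespace runs Jobs.\n"
]
},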
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 13. What Support Will Ask For\n",
|
||||
"\n",
|
||||
"**When escalating a Redis issue, Support will need:**\n",
|
||||
"\n",
|
||||
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
"2. **Redis connection details:**\n",
"   - Host/endpoint (redacted)\n",
"   - Port\n",
"   - Password (redacted)\n",
"   - Whether using SSL/TLS\n",
"3. **Error messages from logs:**\n",
"   - Full error text (not just \"connection failed\")\n",
"   - Timestamps of first occurrence\n",
"   - Retry patterns\n",
"4. **Recent changes:**\n",
"   - Secret rotations\n",
"   - Network policy changes\n",
"   - Redis configuration changes\n",
"5. **Queue status (if accessible):**\n",
"   - Queue length\n",
"   - Worker processing rate\n",
"   - Backlog growth rate\n",
"6. **Redis health (if accessible):**\n",
"   - Redis version\n",
"   - Memory usage\n",
"   - Connection count\n",
"   - Slow queries\n",
|
||||
"\n",
|
||||
"**Evidence collected in this lab:**\n",
|
||||
"- ✅ Diagnostics bundle\n",
|
||||
"- ✅ Worker pod logs with Redis errors\n",
|
||||
"- ✅ Events showing failures\n",
|
||||
"- ✅ Secret configuration (structure, not values)\n",
|
||||
"\n",
|
||||
"**Additional evidence to gather (if escalating):**\n",
|
||||
"- Redis endpoint connectivity test\n",
|
||||
"- Queue metrics (if available)\n",
|
||||
"- Redis logs (if accessible via cloud provider)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 14. Lessons Learned\n",
|
||||
"\n",
|
||||
"**Key takeaways from this lab:**\n",
|
||||
"\n",
|
||||
"1. **Redis failures can be intermittent** - Connection pool retries may mask issues\n",
|
||||
"2. **Worker logs are critical** - Redis errors appear in worker pod logs\n",
|
||||
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
|
||||
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
|
||||
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
|
||||
"\n",
|
||||
"**Common mistakes to avoid:**\n",
|
||||
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
|
||||
"- ❌ Not checking worker logs (API logs may not show Redis errors)\n",
|
||||
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
|
||||
"- ❌ Not testing Redis connectivity independently\n",
|
||||
"\n",
|
||||
"**Next steps:**\n",
|
||||
"- Practice with other failure injection methods (Level 2)\n",
|
||||
"- Try the ClickHouse or Blob Storage failure labs\n",
|
||||
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,942 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 4: Failure Lab - ClickHouse\n",
|
||||
"\n",
|
||||
"## ⚠️ CRITICAL SAFETY WARNING\n",
|
||||
"\n",
|
||||
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
|
||||
"- **Modifies Kubernetes secrets** (breaks ClickHouse password)\n",
|
||||
"- **Causes service disruptions** (traces delayed/missing, insert errors)\n",
|
||||
"- **Requires remediation** to restore functionality\n",
|
||||
"\n",
|
||||
"**REQUIREMENTS:**\n",
|
||||
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
|
||||
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
|
||||
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
|
||||
"- ✅ **Backup/restore plan** available\n",
|
||||
"\n",
|
||||
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"**This lab teaches you how to debug ClickHouse connectivity failures in LangSmith.**\n",
|
||||
"\n",
|
||||
"ClickHouse is LangSmith's **trace storage**. It handles:\n",
|
||||
"- Storing trace data (spans, events, metadata)\n",
|
||||
"- Time-series queries for trace search and filtering\n",
|
||||
"- High-volume writes from workers\n",
|
||||
"- Efficient querying for UI display\n",
|
||||
"\n",
|
||||
"**When ClickHouse fails, you'll see:**\n",
|
||||
"- Traces delayed or missing\n",
|
||||
"- Insert errors and merge/backlog hints\n",
|
||||
"- UI loads but traces don't appear\n",
|
||||
"- Query timeouts\n",
|
||||
"\n",
|
||||
"**Learning Objectives:**\n",
|
||||
"1. Understand how ClickHouse failures manifest\n",
|
||||
"2. Practice collecting diagnostics for trace storage issues\n",
|
||||
"3. Learn to identify connection vs. credential vs. network issues\n",
|
||||
"4. Practice safe remediation\n",
|
||||
"\n",
|
||||
"**Estimated time:** 30-45 minutes\n",
|
||||
"\n",
|
||||
"**⚠️ Important:** \n",
|
||||
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
|
||||
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent,\n",
|
||||
" Path.cwd(),\n",
|
||||
" Path.cwd() / \"notebooks\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ⚠️ CRITICAL: Environment Safety Verification\n",
|
||||
"\n",
|
||||
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"⚠️ CRITICAL SAFETY CHECK - CLICKHOUSE FAILURE LAB\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Show environment details prominently\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"\\n### Current Environment Configuration\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" account_id = identity.get('Account', 'N/A')\n",
|
||||
" user_arn = identity.get('Arn', 'N/A')\n",
|
||||
" print(f\"Account ID: {account_id}\")\n",
|
||||
" print(f\"User ARN: {user_arn}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
|
||||
" subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
" print(f\"Subscription Name: {subscription_name}\")\n",
|
||||
"\n",
|
||||
"# Show all relevant environment variables\n",
|
||||
"print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
|
||||
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
|
||||
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
|
||||
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
|
||||
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"\\nThis failure lab will:\")\n",
|
||||
"print(\" 1. Find the ClickHouse secret in your namespace\")\n",
|
||||
"print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
|
||||
"print(\" 3. MODIFY the secret to set an INVALID password\")\n",
|
||||
"print(\" 4. Apply the modified secret (breaks ClickHouse connectivity)\")\n",
|
||||
"print(\" 5. Cause trace ingestion failures and query timeouts\")\n",
|
||||
"print(\" 6. Require remediation to restore (restore original secret)\")\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Check for Module 4 safety flag\n",
|
||||
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
|
||||
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
|
||||
" fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
|
||||
" print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
|
||||
" print(\"\\nTo run this failure lab, you MUST:\")\n",
|
||||
" print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
|
||||
" print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
|
||||
" print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
|
||||
" print(\" 4. Re-run this cell to confirm\")\n",
|
||||
" print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
|
||||
" raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
|
||||
"\n",
|
||||
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
|
||||
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
|
||||
"print(\"\\n⚠️ REMINDER: This lab will break ClickHouse connectivity.\")\n",
|
||||
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
|
||||
"print(\" Original secret will be backed up automatically.\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"✅ Environment verified - ready for ClickHouse failure lab\")\n",
|
||||
"print(\"=\" * 70)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration & Prerequisites\n",
|
||||
"\n",
|
||||
"Load configuration and verify prerequisites.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import require_env\n",
|
||||
"\n",
|
||||
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
|
||||
"config = require_env(*required_vars)\n",
|
||||
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"\n",
|
||||
"print(f\"Namespace: {namespace}\")\n",
|
||||
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. What This Service Does for LangSmith\n",
|
||||
"\n",
|
||||
"ClickHouse is LangSmith's **clickhouse and job queue**. It handles:\n",
|
||||
"\n",
|
||||
"- **Job queue for asynchronous processing:**\n",
|
||||
" - Workers pull trace processing jobs from ClickHouse\n",
|
||||
" - Jobs are queued when traces arrive via API\n",
|
||||
" - Queue backlog indicates processing delays\n",
|
||||
"\n",
|
||||
"- **Caching:**\n",
|
||||
" - Frequently accessed data (project metadata, user info)\n",
|
||||
" - Reduces load on PostgreSQL\n",
|
||||
" - Improves response times\n",
|
||||
"\n",
|
||||
"- **Rate limiting and session management:**\n",
|
||||
" - API rate limiting\n",
|
||||
" - Session storage (if configured)\n",
|
||||
"\n",
|
||||
"- **Worker coordination:**\n",
|
||||
" - Distributed locking\n",
|
||||
" - Task distribution\n",
|
||||
"\n",
|
||||
"**Why it matters:**\n",
|
||||
"- Without ClickHouse, workers can't process traces\n",
|
||||
"- Job queue fills up, causing delays\n",
|
||||
"- Cache misses increase load on PostgreSQL\n",
|
||||
"- Ingestion becomes unreliable\n",
|
||||
"\n",
|
||||
"**How LangSmith connects:**\n",
|
||||
"- Connection string stored in Kubernetes Secrets\n",
|
||||
"- Workers connect to ClickHouse to pull jobs\n",
|
||||
"- API servers use ClickHouse for caching\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Expected Symptoms When ClickHouse Fails\n",
|
||||
"\n",
|
||||
"**What you'll see:**\n",
|
||||
"\n",
|
||||
"1. **Traces delayed or missing:**\n",
|
||||
" - Some traces process, others don't\n",
|
||||
" - Inconsistent behavior (works sometimes, fails other times)\n",
|
||||
" - Retries visible in logs\n",
|
||||
"\n",
|
||||
"2. **Latency spikes:**\n",
|
||||
" - API responses slow down\n",
|
||||
" - Worker processing delays\n",
|
||||
" - Timeout errors\n",
|
||||
"\n",
|
||||
"3. **Worker backlog:**\n",
|
||||
" - Jobs piling up in queue\n",
|
||||
" - Workers unable to pull new jobs\n",
|
||||
" - Queue length increasing\n",
|
||||
"\n",
|
||||
"4. **Log patterns:**\n",
|
||||
" - Connection timeout errors\n",
|
||||
" - \"connection refused\" or \"connection reset\"\n",
|
||||
" - \"NOAUTH Authentication required\" (if password wrong)\n",
|
||||
" - Retry attempts in worker logs\n",
|
||||
" - Cache miss patterns\n",
|
||||
"\n",
|
||||
"**Timeline:**\n",
|
||||
"- Symptoms may be intermittent (connection pool retries)\n",
|
||||
"- Worker backlog builds over time\n",
|
||||
"- Cache misses cause cascading delays\n",
|
||||
"- Full failure if connection pool exhausted\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Failure Injection Options\n",
|
||||
"\n",
|
||||
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
|
||||
"\n",
|
||||
"### Level 1: Subtle Failure (Recommended for first run)\n",
|
||||
"\n",
|
||||
"**Option A: Wrong ClickHouse Password**\n",
|
||||
"- Modify the ClickHouse password in the Kubernetes Secret\n",
|
||||
"- Symptoms: Authentication failures, connection refused, intermittent failures\n",
|
||||
"\n",
|
||||
"**Option B: Block Egress to ClickHouse Endpoint**\n",
|
||||
"- Apply NetworkPolicy blocking egress to ClickHouse (if NetworkPolicy supported)\n",
|
||||
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
|
||||
"\n",
|
||||
"### Level 2: Obvious Failure\n",
|
||||
"\n",
|
||||
"**Option C: Wrong ClickHouse Host/Endpoint**\n",
|
||||
"- Point connection string to non-existent host\n",
|
||||
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
|
||||
"\n",
|
||||
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
|
||||
"\n",
|
||||
"**Before injecting any failure, verify your baseline is healthy.**\n",
|
||||
"\n",
|
||||
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\"### Quick Baseline Check\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" healthy = sum(1 for p in pods.get(\"items\", [])\n",
|
||||
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
|
||||
" total = len(pods.get(\"items\", []))\n",
|
||||
" print(f\"Pods: {healthy}/{total} running\")\n",
|
||||
" \n",
|
||||
" if healthy == total and total > 0:\n",
|
||||
" ok(\"Baseline looks healthy\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Some pods are not running - check baseline first\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not check pod status\")\n",
|
||||
"\n",
|
||||
"# Check for ClickHouse secret\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"clickhouse_secrets = []\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" for secret in secrets.get(\"items\", []):\n",
|
||||
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" if \"clickhouse\" in name.lower() or \"clickhouse\" in name.lower():\n",
|
||||
" clickhouse_secrets.append(name)\n",
|
||||
"\n",
|
||||
"if clickhouse_secrets:\n",
|
||||
" ok(f\"Found {len(clickhouse_secrets)} ClickHouse-related secret(s)\")\n",
|
||||
" for secret_name in clickhouse_secrets:\n",
|
||||
" print(f\" - {secret_name}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"No ClickHouse secrets found\")\n",
|
||||
" print(\" 💡 ClickHouse connection may be configured differently\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
|
||||
"\n",
|
||||
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
|
||||
"\n",
|
||||
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# FAILURE INJECTION: Wrong ClickHouse Password\n",
|
||||
"# This cell modifies the ClickHouse password secret to an invalid value\n",
|
||||
"\n",
|
||||
"import base64\n",
|
||||
"import yaml\n",
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Find ClickHouse secret (look for common names)\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"clickhouse_secret_name = None\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" for secret in secrets.get(\"items\", []):\n",
|
||||
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" # Common patterns: clickhouse, clickhouse\n",
|
||||
" if any(keyword in name.lower() for keyword in [\"clickhouse\", \"clickhouse\"]):\n",
|
||||
" # Check if it has password-related keys\n",
|
||||
" data = secret.get(\"data\", {})\n",
|
||||
" if any(key in data for key in [\"password\", \"REDIS_PASSWORD\", \"CLICKHOUSE_PASSWORD\"]):\n",
|
||||
" clickhouse_secret_name = name\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not clickhouse_secret_name:\n",
|
||||
" raise RuntimeError(\"❌ Could not find ClickHouse secret. Check your deployment configuration.\")\n",
|
||||
"\n",
|
||||
"print(f\"Found ClickHouse secret: {clickhouse_secret_name}\")\n",
|
||||
"\n",
|
||||
"# Get current secret\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Save original secret for restoration\n",
|
||||
"backup_file = artifacts_dir / \"module-4\" / f\"clickhouse-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
|
||||
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
||||
"with open(backup_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
"\n",
|
||||
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
|
||||
"\n",
|
||||
"# Parse YAML and modify password\n",
|
||||
"secret_data = yaml.safe_load(result.stdout)\n",
|
||||
"if \"data\" not in secret_data:\n",
|
||||
" raise RuntimeError(\"Secret has no data section\")\n",
|
||||
"\n",
|
||||
"# Find password key (could be password, REDIS_PASSWORD, CLICKHOUSE_PASSWORD, etc.)\n",
|
||||
"password_key = None\n",
|
||||
"for key in [\"password\", \"REDIS_PASSWORD\", \"CLICKHOUSE_PASSWORD\", \"clickhouse-password\"]:\n",
|
||||
" if key in secret_data[\"data\"]:\n",
|
||||
" password_key = key\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not password_key:\n",
|
||||
" raise RuntimeError(\"Could not find password key in secret\")\n",
|
||||
"\n",
|
||||
"# Set invalid password\n",
|
||||
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
|
||||
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
|
||||
"\n",
|
||||
"# Modify secret\n",
|
||||
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
|
||||
"\n",
|
||||
"# Save modified secret to temp file\n",
|
||||
"temp_secret_file = artifacts_dir / \"module-4\" / \"clickhouse-secret-modified.yaml\"\n",
|
||||
"with open(temp_secret_file, \"w\") as f:\n",
|
||||
" yaml.dump(secret_data, f)\n",
|
||||
"\n",
|
||||
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
|
||||
"print(f\" This will set an invalid password in secret: {clickhouse_secret_name}\")\n",
|
||||
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
|
||||
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
|
||||
"# \n",
|
||||
"# result = run(\n",
|
||||
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
|
||||
"# check=True,\n",
|
||||
"# stream=True\n",
|
||||
"# )\n",
|
||||
"# \n",
|
||||
"# ok(\"Failure injection applied - ClickHouse password is now invalid\")\n",
|
||||
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
|
||||
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
|
||||
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
|
||||
"# \n",
|
||||
"# # Wait a moment for changes to propagate\n",
|
||||
"# import time\n",
|
||||
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
|
||||
"# time.sleep(30)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
|
||||
"\n",
|
||||
"**Now that the failure is injected, observe how it manifests.**\n",
|
||||
"\n",
|
||||
"Check:\n",
|
||||
"1. Worker pod logs for ClickHouse connection errors\n",
|
||||
"2. Queue backlog (if visible)\n",
|
||||
"3. Worker retry patterns\n",
|
||||
"4. Latency in API responses\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Create incident directory for diagnostics\n",
|
||||
"incident_dir = artifacts_dir / \"module-4\" / f\"clickhouse-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
|
||||
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
|
||||
"print(f\"Saving to: {incident_dir}\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Check pod status\n",
|
||||
"print(\"1. Checking pod status...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" print(result.stdout)\n",
|
||||
" \n",
|
||||
" # Check for restarts\n",
|
||||
" lines = result.stdout.split(\"\\n\")\n",
|
||||
" restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
|
||||
" if restarts:\n",
|
||||
" print(\"\\n Pod restart counts:\")\n",
|
||||
" for line in restarts[1:]: # Skip header\n",
|
||||
" if line.strip():\n",
|
||||
" parts = line.split()\n",
|
||||
" if len(parts) > 3:\n",
|
||||
" print(f\" {parts[0]}: {parts[3]} restarts\")\n",
|
||||
"\n",
|
||||
"# 2. Check recent events\n",
|
||||
"print(\"\\n2. Checking recent events...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by='.lastTimestamp'\", \"--field-selector=type!=Normal\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(incident_dir / \"events.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" if result.stdout.strip():\n",
|
||||
" print(\" Recent warning/error events:\")\n",
|
||||
" for line in result.stdout.split(\"\\n\")[-5:]:\n",
|
||||
" if line.strip():\n",
|
||||
" print(f\" {line}\")\n",
|
||||
"\n",
|
||||
"# 3. Check worker pod logs for ClickHouse errors\n",
|
||||
"print(\"\\n3. Checking worker pod logs for ClickHouse errors...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
|
||||
"if worker_pod:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
|
||||
" with open(logs_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" \n",
|
||||
" # Look for ClickHouse-related errors\n",
|
||||
" error_keywords = [\"clickhouse\", \"clickhouse\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"NOAUTH\", \"retry\"]\n",
|
||||
" error_lines = [l for l in result.stdout.split(\"\\n\") \n",
|
||||
" if any(kw in l.lower() for kw in error_keywords)]\n",
|
||||
" \n",
|
||||
" if error_lines:\n",
|
||||
" print(\" Found ClickHouse-related errors:\")\n",
|
||||
" for line in error_lines[-5:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" print(\" No obvious ClickHouse errors in recent logs\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not find worker pod\")\n",
|
||||
"\n",
|
||||
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
|
||||
"\n",
|
||||
"**This is critical - Support will ask for this bundle.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"\n",
|
||||
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
|
||||
"\n",
|
||||
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
|
||||
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" urllib.request.urlretrieve(script_url, script_path)\n",
|
||||
" script_path.chmod(0o755)\n",
|
||||
" \n",
|
||||
" print(f\"Running diagnostics script for namespace: {namespace}\")\n",
|
||||
" result = run(\n",
|
||||
" [str(script_path), namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=True\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" ok(\"Diagnostics script completed\")\n",
|
||||
" \n",
|
||||
" # Find and move tarball\n",
|
||||
" for file in incident_dir.parent.iterdir():\n",
|
||||
" if file.name.startswith(\"langsmith-debug-\") and file.suffix == \".tar.gz\":\n",
|
||||
" target_path = incident_dir / file.name\n",
|
||||
" file.rename(target_path)\n",
|
||||
" ok(f\"Diagnostics bundle: {target_path.name}\")\n",
|
||||
" break\n",
|
||||
" else:\n",
|
||||
" warn(\"Diagnostics script had errors (check output above)\")\n",
|
||||
" \n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not run diagnostics script: {e}\")\n",
|
||||
" print(\" 💡 You can run it manually:\")\n",
|
||||
" print(f\" curl -O {script_url}\")\n",
|
||||
" print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
|
||||
" print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 10. Do the Drill - Step 5: Guided Triage\n",
|
||||
"\n",
|
||||
"**Where to look first for ClickHouse issues:**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Guided Triage Steps\\n\")\n",
|
||||
"\n",
|
||||
"print(\"1. Check worker pod logs for ClickHouse connection errors:\")\n",
|
||||
"print(f\" kubectl logs -n {namespace} <worker-pod-name> | grep -i 'clickhouse\\\\|clickhouse\\\\|connection'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"2. Verify secret exists and has correct keys:\")\n",
|
||||
"print(f\" kubectl get secret {clickhouse_secret_name} -n {namespace} -o yaml\")\n",
|
||||
"print(\" (Don't print the actual values - they're base64 encoded)\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
|
||||
"print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"4. Test ClickHouse connectivity from a pod (if possible):\")\n",
|
||||
"print(\" kubectl run -it --rm debug --image=clickhouse:7 --restart=Never -- \\\\\")\n",
|
||||
"print(\" clickhouse-cli -h <clickhouse-host> -p <port> -a <password> ping\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"5. Check events for connection/authentication errors:\")\n",
|
||||
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"# Check what we can automatically\n",
|
||||
"print(\"\\n### Automatic Checks\\n\")\n",
|
||||
"\n",
|
||||
"# Check secret still exists\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", clickhouse_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(f\"Secret '{clickhouse_secret_name}' still exists\")\n",
|
||||
" secret_data = json.loads(result.stdout)\n",
|
||||
" keys = list(secret_data.get(\"data\", {}).keys())\n",
|
||||
" print(f\" Secret keys: {', '.join(keys)}\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Secret '{clickhouse_secret_name}' not found!\")\n",
|
||||
"\n",
|
||||
"# Check for pods with ClickHouse connection env vars\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" clickhouse_related_pods = []\n",
|
||||
" for pod in pods.get(\"items\", []):\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
|
||||
" for container in containers:\n",
|
||||
" env = container.get(\"env\", [])\n",
|
||||
" clickhouse_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
|
||||
" for kw in [\"REDIS\", \"CLICKHOUSE\"])]\n",
|
||||
" if clickhouse_env:\n",
|
||||
" clickhouse_related_pods.append(name)\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" if clickhouse_related_pods:\n",
|
||||
" print(f\"\\n Pods with ClickHouse environment variables:\")\n",
|
||||
" for pod_name in set(clickhouse_related_pods):\n",
|
||||
" print(f\" - {pod_name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 11. Do the Drill - Step 6: Remediation\n",
|
||||
"\n",
|
||||
"**Restore the original secret to fix the issue.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# REMEDIATION: Restore original secret\n",
|
||||
"# UNCOMMENT TO RESTORE\n",
|
||||
"\n",
|
||||
"# if backup_file.exists():\n",
|
||||
"# print(f\"Restoring original secret from: {backup_file.name}\")\n",
|
||||
"# result = run(\n",
|
||||
"# [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
|
||||
"# check=True,\n",
|
||||
"# stream=True\n",
|
||||
"# )\n",
|
||||
"# \n",
|
||||
"# ok(\"Original secret restored\")\n",
|
||||
"#     print(\"\\n💡 Pods will need to restart to pick up the restored secret.\")\n",
|
||||
"# print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
|
||||
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
|
||||
"# \n",
|
||||
"# import time\n",
|
||||
"# print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
|
||||
"# time.sleep(60)\n",
|
||||
"# else:\n",
|
||||
"# warn(f\"Backup file not found: {backup_file}\")\n",
|
||||
"# print(\" 💡 You may need to manually restore the secret\")\n",
|
||||
"\n",
|
||||
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
|
||||
"if 'backup_file' in locals():\n",
|
||||
" print(f\" Backup file: {backup_file.name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
|
||||
"\n",
|
||||
"**Verify that everything is working again.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Verifying Recovery\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" running = sum(1 for p in pods.get(\"items\", [])\n",
|
||||
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
|
||||
" total = len(pods.get(\"items\", []))\n",
|
||||
" \n",
|
||||
" if running == total and total > 0:\n",
|
||||
" ok(f\"All {total} pod(s) are running\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Only {running}/{total} pod(s) running\")\n",
|
||||
" print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
|
||||
"\n",
|
||||
"# Check for recent errors in worker logs\n",
|
||||
"print(\"\\nChecking for recent errors in worker logs...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
|
||||
"if worker_pod:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
"        error_keywords = [\"error\", \"fail\", \"clickhouse\", \"connection\"]\n",
|
||||
" recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
|
||||
" if any(kw in l.lower() for kw in error_keywords)]\n",
|
||||
" \n",
|
||||
" if recent_errors:\n",
|
||||
" warn(\"Still seeing some errors in logs:\")\n",
|
||||
" for line in recent_errors[-3:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" ok(\"No recent errors in worker logs\")\n",
|
||||
"\n",
|
||||
"ok(\"Recovery verification complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 13. What Support Will Ask For\n",
|
||||
"\n",
|
||||
"**When escalating a ClickHouse issue, Support will need:**\n",
|
||||
"\n",
|
||||
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
|
||||
"2. **ClickHouse connection details:**\n",
|
||||
" - Host/endpoint (redacted)\n",
|
||||
" - Port\n",
|
||||
" - Password (redacted)\n",
|
||||
" - Whether using SSL/TLS\n",
|
||||
"3. **Error messages from logs:**\n",
|
||||
" - Full error text (not just \"connection failed\")\n",
|
||||
" - Timestamps of first occurrence\n",
|
||||
" - Retry patterns\n",
|
||||
"4. **Recent changes:**\n",
|
||||
" - Secret rotations\n",
|
||||
" - Network policy changes\n",
|
||||
" - ClickHouse configuration changes\n",
|
||||
"5. **Queue status (if accessible):**\n",
|
||||
" - Queue length\n",
|
||||
" - Worker processing rate\n",
|
||||
" - Backlog growth rate\n",
|
||||
"6. **ClickHouse health (if accessible):**\n",
|
||||
" - ClickHouse version\n",
|
||||
" - Memory usage\n",
|
||||
" - Connection count\n",
|
||||
" - Slow queries\n",
|
||||
"\n",
|
||||
"**Evidence collected in this lab:**\n",
|
||||
"- ✅ Diagnostics bundle\n",
|
||||
"- ✅ Worker pod logs with ClickHouse errors\n",
|
||||
"- ✅ Events showing failures\n",
|
||||
"- ✅ Secret configuration (structure, not values)\n",
|
||||
"\n",
|
||||
"**Additional evidence to gather (if escalating):**\n",
|
||||
"- ClickHouse endpoint connectivity test\n",
|
||||
"- Queue metrics (if available)\n",
|
||||
"- ClickHouse logs (if accessible via cloud provider)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 14. Lessons Learned\n",
|
||||
"\n",
|
||||
"**Key takeaways from this lab:**\n",
|
||||
"\n",
|
||||
"1. **ClickHouse failures can be intermittent** - Connection pool retries may mask issues\n",
|
||||
"2. **Worker logs are critical** - ClickHouse errors appear in worker pod logs\n",
|
||||
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
|
||||
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
|
||||
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
|
||||
"\n",
|
||||
"**Common mistakes to avoid:**\n",
|
||||
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
|
||||
"- ❌ Not checking worker logs (API logs may not show ClickHouse errors)\n",
|
||||
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
|
||||
"- ❌ Not testing ClickHouse connectivity independently\n",
|
||||
"\n",
|
||||
"**Next steps:**\n",
|
||||
"- Practice with other failure injection methods (Level 2)\n",
|
||||
"- Try the Blob Storage failure lab\n",
|
||||
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,946 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Module 4: Failure Lab - Blob Storage\n",
|
||||
"\n",
|
||||
"## ⚠️ CRITICAL SAFETY WARNING\n",
|
||||
"\n",
|
||||
"**THIS NOTEBOOK WILL MODIFY YOUR ENVIRONMENT:**\n",
|
||||
"- **Modifies Kubernetes secrets** (breaks blob storage credentials)\n",
|
||||
"- **Causes service disruptions** (large payload failures, ClickHouse degradation)\n",
|
||||
"- **Requires remediation** to restore functionality\n",
|
||||
"\n",
|
||||
"**REQUIREMENTS:**\n",
|
||||
"- ✅ **TEST/NON-PRODUCTION environment ONLY**\n",
|
||||
"- ✅ **MODULE4_SAFE_ENVIRONMENT=true** must be set in .env\n",
|
||||
"- ✅ **Baseline diagnostics collected** (run `01_diagnostics_baseline.ipynb` first)\n",
|
||||
"- ✅ **Backup/restore plan** available\n",
|
||||
"\n",
|
||||
"**DO NOT RUN THIS LAB AGAINST PRODUCTION SYSTEMS.**\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"**This lab teaches you how to debug Blob Storage configuration failures in LangSmith.**\n",
|
||||
"\n",
|
||||
"Blob Storage is LangSmith's **large payload storage**. It handles:\n",
|
||||
"- Storing large trace payloads and artifacts\n",
|
||||
"- Offloading large data from ClickHouse\n",
|
||||
"- Providing durable storage for trace data\n",
|
||||
"\n",
|
||||
"**When Blob Storage fails, you'll see:**\n",
|
||||
"- Large payload traces degrade ClickHouse performance\n",
|
||||
"- Warnings/errors in logs about artifact storage\n",
|
||||
"- Increased ClickHouse pressure and latency under load\n",
|
||||
"- Traces with large payloads fail to store properly\n",
|
||||
"\n",
|
||||
"**Learning Objectives:**\n",
|
||||
"1. Understand how Blob Storage failures manifest\n",
|
||||
"2. Practice collecting diagnostics for blob storage issues\n",
|
||||
"3. Learn to identify configuration vs. credential vs. network issues\n",
|
||||
"4. Practice safe remediation\n",
|
||||
"\n",
|
||||
"**Estimated time:** 30-45 minutes\n",
|
||||
"\n",
|
||||
"**⚠️ Important:** \n",
|
||||
"- Run `01_diagnostics_baseline.ipynb` BEFORE starting this lab!\n",
|
||||
"- Complete safety check in `../shared/00_setup_or_resume_environment.ipynb` first!\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent,\n",
|
||||
" Path.cwd(),\n",
|
||||
" Path.cwd() / \"notebooks\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"from shared._validation import ok, warn\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ⚠️ CRITICAL: Environment Safety Verification\n",
|
||||
"\n",
|
||||
"**Before proceeding, verify you're in a TEST/NON-PRODUCTION environment and understand what will be modified.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# CRITICAL SAFETY CHECK: Verify environment is safe for failure injection\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"⚠️ CRITICAL SAFETY CHECK - BLOB STORAGE FAILURE LAB\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Show environment details prominently\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"\\n### Current Environment Configuration\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" account_id = identity.get('Account', 'N/A')\n",
|
||||
" user_arn = identity.get('Arn', 'N/A')\n",
|
||||
" print(f\"Account ID: {account_id}\")\n",
|
||||
" print(f\"User ARN: {user_arn}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
|
||||
" subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
" print(f\"Subscription Name: {subscription_name}\")\n",
|
||||
"\n",
|
||||
"# Show all relevant environment variables\n",
|
||||
"print(f\"\\n### Environment Variables (VERIFY THESE ARE CORRECT)\")\n",
|
||||
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
|
||||
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
|
||||
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
|
||||
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"⚠️ WHAT THIS LAB WILL DO:\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"\\nThis failure lab will:\")\n",
|
||||
"print(\" 1. Find the Blob Storage secret in your namespace\")\n",
|
||||
"print(\" 2. BACKUP the original secret (saved to artifacts)\")\n",
|
||||
"print(\" 3. MODIFY the secret to set INVALID credentials\")\n",
|
||||
"print(\" 4. Apply the modified secret (breaks blob storage connectivity)\")\n",
|
||||
"print(\" 5. Cause large payload failures and ClickHouse degradation\")\n",
|
||||
"print(\" 6. Require remediation to restore (restore original secret)\")\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Check for Module 4 safety flag\n",
|
||||
"module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
|
||||
"if module4_safe not in [\"true\", \"yes\", \"1\"]:\n",
|
||||
" fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
|
||||
" print(\"\\n❌ SAFETY CHECK FAILED - Cannot proceed\")\n",
|
||||
" print(\"\\nTo run this failure lab, you MUST:\")\n",
|
||||
" print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
|
||||
" print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
|
||||
" print(\" 3. Complete safety check in ../shared/00_setup_or_resume_environment.ipynb\")\n",
|
||||
" print(\" 4. Re-run this cell to confirm\")\n",
|
||||
" print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
|
||||
" raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. Required for failure labs.\")\n",
|
||||
"\n",
|
||||
"ok(\"MODULE4_SAFE_ENVIRONMENT flag is set\")\n",
|
||||
"print(\"\\n✅ Safety check passed - environment marked as safe for failure injection\")\n",
|
||||
"print(\"\\n⚠️ REMINDER: This lab will break blob storage connectivity.\")\n",
|
||||
"print(\" Ensure you understand the remediation steps before proceeding.\")\n",
|
||||
"print(\" Original secret will be backed up automatically.\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"✅ Environment verified - ready for Blob Storage failure lab\")\n",
|
||||
"print(\"=\" * 70)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration & Prerequisites\n",
|
||||
"\n",
|
||||
"Load configuration and verify prerequisites.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import require_env\n",
|
||||
"\n",
|
||||
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
|
||||
"config = require_env(*required_vars)\n",
|
||||
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"\n",
|
||||
"print(f\"Namespace: {namespace}\")\n",
|
||||
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. What This Service Does for LangSmith\n",
|
||||
"\n",
|
||||
"Blob Storage is LangSmith's **large payload storage**. It handles:\n",
|
||||
"\n",
|
||||
"- **Large trace payloads and artifacts:**\n",
|
||||
"  - Workers upload large payloads and artifacts to the bucket\n",
|
||||
"  - Upload failures surface as artifact storage errors in worker logs\n",
|
||||
"\n",
|
||||
"- **Offloading large data from ClickHouse:**\n",
|
||||
"  - Keeps large payloads out of ClickHouse inserts and merges\n",
|
||||
"  - Limits ClickHouse storage pressure and query latency under load\n",
|
||||
"\n",
|
||||
"- **Durable storage for trace data**\n",
|
||||
"\n",
|
||||
"**Why it matters:**\n",
|
||||
"- Without Blob Storage, traces with large payloads fail to store properly\n",
|
||||
"- Large payloads degrade ClickHouse performance\n",
|
||||
"- ClickHouse pressure and latency increase under load\n",
|
||||
"\n",
|
||||
"**How LangSmith connects:**\n",
|
||||
"- Credentials and endpoint configuration are stored in Kubernetes Secrets\n",
|
||||
"- Worker pods use them to write artifacts to Blob Storage\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Expected Symptoms When Blob Storage Fails\n",
|
||||
"\n",
|
||||
"**What you'll see:**\n",
|
||||
"\n",
|
||||
"1. **Large payload traces degrade ClickHouse:**\n",
|
||||
" - ClickHouse performance degrades under load\n",
|
||||
" - Insert operations slow down\n",
|
||||
" - Query performance suffers\n",
|
||||
" - Storage pressure increases\n",
|
||||
"\n",
|
||||
"2. **Warnings/errors in logs about artifact storage:**\n",
|
||||
" - Worker logs show artifact upload failures\n",
|
||||
" - Bucket access errors\n",
|
||||
" - Credential errors\n",
|
||||
" - \"No such bucket\" or \"Access Denied\" errors\n",
|
||||
"\n",
|
||||
"3. **Increased ClickHouse pressure:**\n",
|
||||
" - ClickHouse latency increases\n",
|
||||
" - Merge operations backlog\n",
|
||||
" - Storage usage spikes\n",
|
||||
" - Query timeouts\n",
|
||||
"\n",
|
||||
"4. **Log patterns:**\n",
|
||||
" - Artifact storage errors in worker logs\n",
|
||||
" - S3/blob storage connection errors\n",
|
||||
" - Bucket access denied errors\n",
|
||||
" - Credential errors\n",
|
||||
" - Configuration errors\n",
|
||||
"\n",
|
||||
"**Timeline:**\n",
|
||||
"- Symptoms appear gradually (under load)\n",
|
||||
"- ClickHouse performance degrades over time\n",
|
||||
"- Large traces fail or are rejected\n",
|
||||
"- Full failure if blob storage completely unavailable\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Failure Injection Options\n",
|
||||
"\n",
|
||||
"**Choose ONE level to practice with. Level 1 is subtle, Level 2 is more obvious.**\n",
|
||||
"\n",
|
||||
"### Level 1: Subtle Failure (Recommended for first run)\n",
|
||||
"\n",
|
||||
"**Option A: Wrong Blob Storage Password**\n",
|
||||
"- Modify the Blob Storage password in the Kubernetes Secret\n",
|
||||
"- Symptoms: Authentication failures, connection refused, intermittent failures\n",
|
||||
"\n",
|
||||
"**Option B: Block Egress to Blob Storage Endpoint**\n",
|
||||
"- Apply NetworkPolicy blocking egress to Blob Storage (if NetworkPolicy supported)\n",
|
||||
"- Symptoms: Connection timeout, no route to host, intermittent failures\n",
|
||||
"\n",
|
||||
"### Level 2: Obvious Failure\n",
|
||||
"\n",
|
||||
"**Option C: Wrong Blob Storage Host/Endpoint**\n",
|
||||
"- Point connection string to non-existent host\n",
|
||||
"- Symptoms: Connection timeout, DNS resolution failures, immediate failures\n",
|
||||
"\n",
|
||||
"**⚠️ Safety:** All injections are reversible. We'll save the original secret before modifying it.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Do the Drill - Step 1: Confirm Baseline\n",
|
||||
"\n",
|
||||
"**Before injecting any failure, verify your baseline is healthy.**\n",
|
||||
"\n",
|
||||
"💡 **If you haven't run `01_diagnostics_baseline.ipynb` yet, do that first!**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\"### Quick Baseline Check\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" healthy = sum(1 for p in pods.get(\"items\", [])\n",
|
||||
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
|
||||
" total = len(pods.get(\"items\", []))\n",
|
||||
" print(f\"Pods: {healthy}/{total} running\")\n",
|
||||
" \n",
|
||||
" if healthy == total and total > 0:\n",
|
||||
" ok(\"Baseline looks healthy\")\n",
|
||||
" else:\n",
|
||||
" warn(\"Some pods are not running - check baseline first\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not check pod status\")\n",
|
||||
"\n",
|
||||
"# Check for Blob Storage secret\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"blob_secrets = []\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" for secret in secrets.get(\"items\", []):\n",
|
||||
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
"        if \"blob\" in name.lower():\n",
|
||||
" blob_secrets.append(name)\n",
|
||||
"\n",
|
||||
"if blob_secrets:\n",
|
||||
" ok(f\"Found {len(blob_secrets)} Blob Storage-related secret(s)\")\n",
|
||||
" for secret_name in blob_secrets:\n",
|
||||
" print(f\" - {secret_name}\")\n",
|
||||
"else:\n",
|
||||
" warn(\"No Blob Storage secrets found\")\n",
|
||||
" print(\" 💡 Blob Storage connection may be configured differently\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Do the Drill - Step 2: Apply Failure Injection\n",
|
||||
"\n",
|
||||
"**⚠️ WARNING: This will modify your LangSmith deployment. Make sure you're in a test environment!**\n",
|
||||
"\n",
|
||||
"Choose your failure injection method below. We'll use **Option A (Wrong Password)** as the default example.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# FAILURE INJECTION: Wrong Blob Storage Password\n",
|
||||
"# This cell modifies the Blob Storage password secret to an invalid value\n",
|
||||
"\n",
|
||||
"import base64\n",
|
||||
"import yaml\n",
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Find Blob Storage secret (look for common names)\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secrets\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"blob_secret_name = None\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" secrets = json.loads(result.stdout)\n",
|
||||
" for secret in secrets.get(\"items\", []):\n",
|
||||
" name = secret.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
"        # Common pattern: secret name contains \"blob\"\n",
|
||||
"        if \"blob\" in name.lower():\n",
|
||||
" # Check if it has password-related keys\n",
|
||||
" data = secret.get(\"data\", {})\n",
|
||||
"            if any(key in data for key in [\"password\", \"BLOB_PASSWORD\", \"blob-password\"]):\n",
|
||||
" blob_secret_name = name\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not blob_secret_name:\n",
|
||||
" raise RuntimeError(\"❌ Could not find Blob Storage secret. Check your deployment configuration.\")\n",
|
||||
"\n",
|
||||
"print(f\"Found Blob Storage secret: {blob_secret_name}\")\n",
|
||||
"\n",
|
||||
"# Get current secret\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"yaml\"],\n",
|
||||
" check=True,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
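||||
"# Parse the secret and strip server-managed metadata; a stale resourceVersion\n",
|
||||
"# in the backup can make a later `kubectl apply` fail with a conflict\n",
|
||||
"secret_data = yaml.safe_load(result.stdout)\n",
|
||||
"for field in (\"resourceVersion\", \"uid\", \"creationTimestamp\", \"managedFields\"):\n",
|
||||
"    secret_data.get(\"metadata\", {}).pop(field, None)\n",
|
||||
"\n",
|
||||
"# Save original secret for restoration\n",
|
||||
"backup_file = artifacts_dir / \"module-4\" / f\"blob-secret-backup-{datetime.now().strftime('%Y%m%d-%H%M%S')}.yaml\"\n",
|
||||
"backup_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
||||
"with open(backup_file, \"w\") as f:\n",
|
||||
"    yaml.dump(secret_data, f)\n",
|
||||
"\n",
|
||||
"ok(f\"Backed up original secret to: {backup_file.name}\")\n",
|
||||
"\n",
|
||||
"# Modify the password in the parsed secret\n",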
|
||||
"if \"data\" not in secret_data:\n",
|
||||
" raise RuntimeError(\"Secret has no data section\")\n",
|
||||
"\n",
|
||||
"# Find password key (could be password, BLOB_PASSWORD, blob-password, etc.)\n",
|
||||
"password_key = None\n",
|
||||
"for key in [\"password\", \"BLOB_PASSWORD\", \"blob-password\"]:\n",
|
||||
" if key in secret_data[\"data\"]:\n",
|
||||
" password_key = key\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not password_key:\n",
|
||||
" raise RuntimeError(\"Could not find password key in secret\")\n",
|
||||
"\n",
|
||||
"# Set invalid password\n",
|
||||
"invalid_password = \"INVALID_PASSWORD_12345\"\n",
|
||||
"invalid_password_b64 = base64.b64encode(invalid_password.encode()).decode()\n",
|
||||
"\n",
|
||||
"# Modify secret\n",
|
||||
"secret_data[\"data\"][password_key] = invalid_password_b64\n",
|
||||
"\n",
|
||||
"# Save modified secret to temp file\n",
|
||||
"temp_secret_file = artifacts_dir / \"module-4\" / \"blob-secret-modified.yaml\"\n",
|
||||
"with open(temp_secret_file, \"w\") as f:\n",
|
||||
" yaml.dump(secret_data, f)\n",
|
||||
"\n",
|
||||
"print(f\"\\n⚠️ READY TO APPLY FAILURE INJECTION\")\n",
|
||||
"print(f\" This will set an invalid password in secret: {blob_secret_name}\")\n",
|
||||
"print(f\" Modified secret saved to: {temp_secret_file.name}\")\n",
|
||||
"print(f\"\\n To apply, uncomment and run the next cell.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# UNCOMMENT TO APPLY FAILURE INJECTION\n",
|
||||
"# \n",
|
||||
"# result = run(\n",
|
||||
"# [\"kubectl\", \"apply\", \"-f\", str(temp_secret_file)],\n",
|
||||
"# check=True,\n",
|
||||
"# stream=True\n",
|
||||
"# )\n",
|
||||
"# \n",
|
||||
"# ok(\"Failure injection applied - Blob Storage password is now invalid\")\n",
|
||||
"# print(\"\\n💡 Pods will need to restart to pick up the new secret.\")\n",
|
||||
"# print(\" This may take 1-2 minutes. Watch for pod restarts:\")\n",
|
||||
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
|
||||
"# \n",
|
||||
"# # Wait a moment for changes to propagate\n",
|
||||
"# import time\n",
|
||||
"# print(\"\\nWaiting 30 seconds for changes to propagate...\")\n",
|
||||
"# time.sleep(30)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 8. Do the Drill - Step 3: Observe Symptoms\n",
|
||||
"\n",
|
||||
"**Now that the failure is injected, observe how it manifests.**\n",
|
||||
"\n",
|
||||
"Check:\n",
|
||||
"1. Worker pod logs for artifact upload, bucket access, and credential errors\n",
|
||||
"2. Warning events in the namespace\n",
|
||||
"3. Worker retry patterns\n",
|
||||
"4. ClickHouse latency under load\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"# Create incident directory for diagnostics\n",
|
||||
"incident_dir = artifacts_dir / \"module-4\" / f\"blob-failure-{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n",
|
||||
"incident_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"### Collecting Failure Diagnostics\\n\")\n",
|
||||
"print(f\"Saving to: {incident_dir}\\n\")\n",
|
||||
"\n",
|
||||
"# 1. Check pod status\n",
|
||||
"print(\"1. Checking pod status...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"wide\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(incident_dir / \"pods-status.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" print(result.stdout)\n",
|
||||
" \n",
|
||||
" # Check for restarts\n",
|
||||
" lines = result.stdout.split(\"\\n\")\n",
|
||||
" restarts = [l for l in lines if \"RESTARTS\" in l or (l and not l.startswith(\"NAME\"))]\n",
|
||||
" if restarts:\n",
|
||||
" print(\"\\n Pod restart counts:\")\n",
|
||||
" for line in restarts[1:]: # Skip header\n",
|
||||
" if line.strip():\n",
|
||||
" parts = line.split()\n",
|
||||
" if len(parts) > 3:\n",
|
||||
" print(f\" {parts[0]}: {parts[3]} restarts\")\n",
|
||||
"\n",
|
||||
"# 2. Check recent events\n",
|
||||
"print(\"\\n2. Checking recent events...\")\n",
|
||||
"result = run(\n",
|
||||
"    [\"kubectl\", \"get\", \"events\", \"-n\", namespace, \"--sort-by=.lastTimestamp\", \"--field-selector=type!=Normal\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" with open(incident_dir / \"events.txt\", \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" if result.stdout.strip():\n",
|
||||
" print(\" Recent warning/error events:\")\n",
|
||||
" for line in result.stdout.split(\"\\n\")[-5:]:\n",
|
||||
" if line.strip():\n",
|
||||
" print(f\" {line}\")\n",
|
||||
"\n",
|
||||
"# 3. Check worker pod logs for Blob Storage errors\n",
|
||||
"print(\"\\n3. Checking worker pod logs for Blob Storage errors...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
|
||||
"if worker_pod:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=50\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" logs_file = incident_dir / f\"worker-pod-{worker_pod}-logs.txt\"\n",
|
||||
" with open(logs_file, \"w\") as f:\n",
|
||||
" f.write(result.stdout)\n",
|
||||
" \n",
|
||||
" # Look for Blob Storage-related errors\n",
|
||||
"        error_keywords = [\"blob\", \"s3\", \"bucket\", \"access denied\", \"credential\", \"connection\", \"timeout\", \"refused\", \"authentication\", \"retry\"]\n",
|
||||
" error_lines = [l for l in result.stdout.split(\"\\n\") \n",
|
||||
" if any(kw in l.lower() for kw in error_keywords)]\n",
|
||||
" \n",
|
||||
" if error_lines:\n",
|
||||
" print(\" Found Blob Storage-related errors:\")\n",
|
||||
" for line in error_lines[-5:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" print(\" No obvious Blob Storage errors in recent logs\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not find worker pod\")\n",
|
||||
"\n",
|
||||
"ok(f\"Diagnostics saved to: {incident_dir}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 9. Do the Drill - Step 4: Run Canonical Diagnostics Script\n",
|
||||
"\n",
|
||||
"**This is critical - Support will ask for this bundle.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"\n",
|
||||
"print(\"### Running Canonical Diagnostics Script\\n\")\n",
|
||||
"\n",
|
||||
"script_url = \"https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh\"\n",
|
||||
"script_path = incident_dir / \"get_k8s_debugging_info.sh\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" urllib.request.urlretrieve(script_url, script_path)\n",
|
||||
" script_path.chmod(0o755)\n",
|
||||
" \n",
|
||||
" print(f\"Running diagnostics script for namespace: {namespace}\")\n",
|
||||
" result = run(\n",
|
||||
" [str(script_path), namespace],\n",
|
||||
" check=False,\n",
|
||||
" stream=True\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" ok(\"Diagnostics script completed\")\n",
|
||||
" \n",
|
||||
" # Find and move tarball\n",
|
||||
" for file in incident_dir.parent.iterdir():\n",
|
||||
" if file.name.startswith(\"langsmith-debug-\") and file.suffix == \".tar.gz\":\n",
|
||||
" target_path = incident_dir / file.name\n",
|
||||
" file.rename(target_path)\n",
|
||||
" ok(f\"Diagnostics bundle: {target_path.name}\")\n",
|
||||
" break\n",
|
||||
" else:\n",
|
||||
" warn(\"Diagnostics script had errors (check output above)\")\n",
|
||||
" \n",
|
||||
"except Exception as e:\n",
|
||||
" warn(f\"Could not run diagnostics script: {e}\")\n",
|
||||
" print(\" 💡 You can run it manually:\")\n",
|
||||
" print(f\" curl -O {script_url}\")\n",
|
||||
" print(f\" chmod +x get_k8s_debugging_info.sh\")\n",
|
||||
" print(f\" ./get_k8s_debugging_info.sh {namespace}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 10. Do the Drill - Step 5: Guided Triage\n",
|
||||
"\n",
|
||||
"**Where to look first for Blob Storage issues:**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Guided Triage Steps\\n\")\n",
|
||||
"\n",
|
||||
"print(\"1. Check worker pod logs for Blob Storage connection errors:\")\n",
|
||||
"print(f\" kubectl logs -n {namespace} <worker-pod-name> | grep -i 'blob\\\\|blob\\\\|connection'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"2. Verify secret exists and has correct keys:\")\n",
|
||||
"print(f\" kubectl get secret {blob_secret_name} -n {namespace} -o yaml\")\n",
|
||||
"print(\" (Don't print the actual values - they're base64 encoded)\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"3. Check for worker pod restarts (indicates connection failures):\")\n",
|
||||
"print(f\" kubectl get pods -n {namespace} -l app=langsmith-worker\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"4. Test Blob Storage connectivity from a pod (if possible):\")\n",
|
||||
"print(\" kubectl run -it --rm debug --image=blob:7 --restart=Never -- \\\\\")\n",
|
||||
"print(\" blob-cli -h <blob-host> -p <port> -a <password> ping\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"print(\"5. Check events for connection/authentication errors:\")\n",
|
||||
"print(f\" kubectl get events -n {namespace} --sort-by='.lastTimestamp' | grep -i 'error\\\\|fail'\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"# Check what we can automatically\n",
|
||||
"print(\"\\n### Automatic Checks\\n\")\n",
|
||||
"\n",
|
||||
"# Check secret still exists\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"secret\", blob_secret_name, \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(f\"Secret '{blob_secret_name}' still exists\")\n",
|
||||
" secret_data = json.loads(result.stdout)\n",
|
||||
" keys = list(secret_data.get(\"data\", {}).keys())\n",
|
||||
" print(f\" Secret keys: {', '.join(keys)}\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Secret '{blob_secret_name}' not found!\")\n",
|
||||
"\n",
|
||||
"# Check for pods with Blob Storage connection env vars\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" blob_related_pods = []\n",
|
||||
" for pod in pods.get(\"items\", []):\n",
|
||||
" name = pod.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" containers = pod.get(\"spec\", {}).get(\"containers\", [])\n",
|
||||
" for container in containers:\n",
|
||||
" env = container.get(\"env\", [])\n",
|
||||
" blob_env = [e for e in env if any(kw in e.get(\"name\", \"\").upper() \n",
|
||||
" for kw in [\"REDIS\", \"BLOB\"])]\n",
|
||||
" if blob_env:\n",
|
||||
" blob_related_pods.append(name)\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" if blob_related_pods:\n",
|
||||
" print(f\"\\n Pods with Blob Storage environment variables:\")\n",
|
||||
" for pod_name in set(blob_related_pods):\n",
|
||||
" print(f\" - {pod_name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 11. Do the Drill - Step 6: Remediation\n",
|
||||
"\n",
|
||||
"**Restore the original secret to fix the issue.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# REMEDIATION: Restore original secret\n",
|
||||
"# UNCOMMENT TO RESTORE\n",
|
||||
"\n",
|
||||
"# if backup_file.exists():\n",
|
||||
"# print(f\"Restoring original secret from: {backup_file.name}\")\n",
|
||||
"# result = run(\n",
|
||||
"# [\"kubectl\", \"apply\", \"-f\", str(backup_file)],\n",
|
||||
"# check=True,\n",
|
||||
"# stream=True\n",
|
||||
"# )\n",
|
||||
"# \n",
|
||||
"# ok(\"Original secret restored\")\n",
|
||||
"# print(\"\\n💡 Pods will restart to pick up the correct secret.\")\n",
|
||||
"# print(\" This may take 1-2 minutes. Monitor pod status:\")\n",
|
||||
"# print(f\" kubectl get pods -n {namespace} -w\")\n",
|
||||
"# \n",
|
||||
"# import time\n",
|
||||
"# print(\"\\nWaiting 60 seconds for pods to restart...\")\n",
|
||||
"# time.sleep(60)\n",
|
||||
"# else:\n",
|
||||
"# warn(f\"Backup file not found: {backup_file}\")\n",
|
||||
"# print(\" 💡 You may need to manually restore the secret\")\n",
|
||||
"\n",
|
||||
"print(\"⚠️ To restore, uncomment the code above and run this cell.\")\n",
|
||||
"if 'backup_file' in locals() and backup_file:\n",
|
||||
" print(f\" Backup file: {backup_file.name}\")\n",
|
||||
"else:\n",
|
||||
" print(\" 💡 If you modified Helm values, restore them manually\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 12. Do the Drill - Step 7: Confirm Recovery\n",
|
||||
"\n",
|
||||
"**Verify that everything is working again.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Verifying Recovery\\n\")\n",
|
||||
"\n",
|
||||
"# Check pod status\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" pods = json.loads(result.stdout)\n",
|
||||
" running = sum(1 for p in pods.get(\"items\", [])\n",
|
||||
" if p.get(\"status\", {}).get(\"phase\") == \"Running\")\n",
|
||||
" total = len(pods.get(\"items\", []))\n",
|
||||
" \n",
|
||||
" if running == total and total > 0:\n",
|
||||
" ok(f\"All {total} pod(s) are running\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Only {running}/{total} pod(s) running\")\n",
|
||||
" print(\" 💡 Wait a bit longer for pods to fully recover\")\n",
|
||||
"\n",
|
||||
"# Check for recent errors in worker logs\n",
|
||||
"print(\"\\nChecking for recent errors in worker logs...\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"pods\", \"-n\", namespace, \"-l\", \"app=langsmith-worker\", \"-o\", \"jsonpath='{.items[0].metadata.name}'\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"worker_pod = result.stdout.strip().strip(\"'\\\"\")\n",
|
||||
"if worker_pod:\n",
|
||||
" result = run(\n",
|
||||
" [\"kubectl\", \"logs\", \"-n\", namespace, worker_pod, \"--tail=20\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" if result.returncode == 0:\n",
|
||||
" error_keywords = [\"error\", \"fail\", \"blob\", \"blob\", \"connection\"]\n",
|
||||
" recent_errors = [l for l in result.stdout.split(\"\\n\") \n",
|
||||
" if any(kw in l.lower() for kw in error_keywords)]\n",
|
||||
" \n",
|
||||
" if recent_errors:\n",
|
||||
" warn(\"Still seeing some errors in logs:\")\n",
|
||||
" for line in recent_errors[-3:]:\n",
|
||||
" print(f\" {line}\")\n",
|
||||
" else:\n",
|
||||
" ok(\"No recent errors in worker logs\")\n",
|
||||
"\n",
|
||||
"ok(\"Recovery verification complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 13. What Support Will Ask For\n",
|
||||
"\n",
|
||||
"**When escalating a Blob Storage issue, Support will need:**\n",
|
||||
"\n",
|
||||
"1. **Diagnostics bundle** (canonical script output) ✅ Collected above\n",
|
||||
"2. **Blob Storage connection details:**\n",
|
||||
" - Host/endpoint (redacted)\n",
|
||||
" - Port\n",
|
||||
" - Password (redacted)\n",
|
||||
" - Whether using SSL/TLS\n",
|
||||
"3. **Error messages from logs:**\n",
|
||||
" - Full error text (not just \"connection failed\")\n",
|
||||
" - Timestamps of first occurrence\n",
|
||||
" - Retry patterns\n",
|
||||
"4. **Recent changes:**\n",
|
||||
" - Secret rotations\n",
|
||||
" - Network policy changes\n",
|
||||
" - Blob Storage configuration changes\n",
|
||||
"5. **Queue status (if accessible):**\n",
|
||||
" - Queue length\n",
|
||||
" - Worker processing rate\n",
|
||||
" - Backlog growth rate\n",
|
||||
"6. **Blob Storage health (if accessible):**\n",
|
||||
" - Blob Storage version\n",
|
||||
" - Memory usage\n",
|
||||
" - Connection count\n",
|
||||
" - Slow queries\n",
|
||||
"\n",
|
||||
"**Evidence collected in this lab:**\n",
|
||||
"- ✅ Diagnostics bundle\n",
|
||||
"- ✅ Worker pod logs with Blob Storage errors\n",
|
||||
"- ✅ Events showing failures\n",
|
||||
"- ✅ Secret configuration (structure, not values)\n",
|
||||
"\n",
|
||||
"**Additional evidence to gather (if escalating):**\n",
|
||||
"- Blob Storage endpoint connectivity test\n",
|
||||
"- Queue metrics (if available)\n",
|
||||
"- Blob Storage logs (if accessible via cloud provider)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 14. Lessons Learned\n",
|
||||
"\n",
|
||||
"**Key takeaways from this lab:**\n",
|
||||
"\n",
|
||||
"1. **Blob Storage failures can be intermittent** - Connection pool retries may mask issues\n",
|
||||
"2. **Worker logs are critical** - Blob Storage errors appear in worker pod logs\n",
|
||||
"3. **Queue backlog is a symptom** - Jobs pile up when workers can't connect\n",
|
||||
"4. **Secrets matter** - Wrong credentials cause authentication failures\n",
|
||||
"5. **Baseline is critical** - You need \"before\" to compare to \"after\"\n",
|
||||
"\n",
|
||||
"**Common mistakes to avoid:**\n",
|
||||
"- ❌ Ignoring intermittent failures (they indicate connection issues)\n",
|
||||
"- ❌ Not checking worker logs (API logs may not show Blob Storage errors)\n",
|
||||
"- ❌ Not monitoring queue length (backlog indicates processing delays)\n",
|
||||
"- ❌ Not testing Blob Storage connectivity independently\n",
|
||||
"\n",
|
||||
"**Next steps:**\n",
|
||||
"- Practice with other failure injection methods (Level 2)\n",
|
||||
"- Try the ClickHouse or Blob Storage failure labs\n",
|
||||
"- Review the [First 10 Minutes Checklist](../shared/incident_first_10_minutes.md)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -0,0 +1,37 @@
|
||||
# Module 4: Troubleshooting & Incident Response
|
||||
|
||||
This directory contains notebooks for Module 4 of the LangSmith Self-Hosted Operator workshop.
|
||||
|
||||
## Notebooks
|
||||
|
||||
### Setup & Baseline
|
||||
- **`../shared/00_setup_or_resume_environment.ipynb`** - Validates environment is ready (shared across modules 2, 3, 4)
|
||||
- **`01_diagnostics_baseline.ipynb`** - Captures baseline diagnostics (run this first!)
|
||||
|
||||
### Failure Labs
|
||||
- **`10_failure_lab_postgres.ipynb`** - PostgreSQL connectivity failure debugging
|
||||
- **`20_failure_lab_redis.ipynb`** - Redis connectivity failure debugging
|
||||
- **`30_failure_lab_clickhouse.ipynb`** - ClickHouse connectivity failure debugging
|
||||
- **`40_failure_lab_blob_storage.ipynb`** - Blob storage configuration failure debugging
|
||||
|
||||
### Advanced
|
||||
- **`90_full_incident_drill.ipynb`** - Complete incident simulation (optional)
|
||||
|
||||
## Workflow
|
||||
|
||||
1. Run `../shared/00_setup_or_resume_environment.ipynb` to verify your environment
|
||||
2. Run `01_diagnostics_baseline.ipynb` to capture baseline
|
||||
3. Run failure labs in order (10, 20, 30, 40) or pick specific ones
|
||||
4. Optionally run `90_full_incident_drill.ipynb` for complete practice
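
Steps 2 and 3 work as a pair: the baseline is only useful if something is later compared against it. The sketch below shows one way such a comparison could look, assuming both captures are the plain-text output of `kubectl get pods` (the same column layout the failure labs split on). The function names `parse_restarts` and `new_restarts` and the sample pod names are hypothetical and are not part of the workshop's shared helpers.

```python
def parse_restarts(kubectl_output: str) -> dict[str, int]:
    """Map pod name -> restart count from plain `kubectl get pods` text."""
    restarts = {}
    for line in kubectl_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        # Columns: NAME READY STATUS RESTARTS AGE. RESTARTS may be followed
        # by "(30s ago)", which lands in later fields and is ignored here.
        if len(parts) > 3 and parts[3].isdigit():
            restarts[parts[0]] = int(parts[3])
    return restarts


def new_restarts(before: str, after: str) -> dict[str, int]:
    """Return pods whose restart count grew since the baseline capture."""
    base = parse_restarts(before)
    current = parse_restarts(after)
    return {
        pod: count - base.get(pod, 0)
        for pod, count in current.items()
        if count > base.get(pod, 0)
    }


# Made-up sample captures, for illustration only.
BEFORE = """NAME                    READY   STATUS    RESTARTS   AGE
langsmith-backend-abc   1/1     Running   0          2d
langsmith-worker-xyz    1/1     Running   1          2d
"""

AFTER = """NAME                    READY   STATUS             RESTARTS      AGE
langsmith-backend-abc   1/1     Running            0             2d
langsmith-worker-xyz    0/1     CrashLoopBackOff   4 (30s ago)   2d
"""

print(new_restarts(BEFORE, AFTER))
```

With these sample captures only the worker pod is reported, with three restarts since the baseline. Pods that appear only in the later capture are counted from zero.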
|
||||
|
||||
## Important Notes
|
||||
|
||||
- **Always run baseline first** - You need "before" to compare to "after"
|
||||
- **Failure injections are reversible** - All labs include remediation steps
|
||||
- **Don't skip diagnostics collection** - Support will ask for the canonical bundle
|
||||
- **Practice in test environments only** - These labs modify your deployment
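
The last note is enforced in the shared setup notebook through the `MODULE4_SAFE_ENVIRONMENT` flag, which must be `true`, `yes`, or `1` before the Module 4 labs proceed. A minimal sketch of the same guard as a reusable function, which could sit at the top of a custom lab notebook, follows. The function names `module4_is_acknowledged` and `require_test_environment` are hypothetical; only the flag name and the accepted values come from the setup notebook.

```python
import os

# Values the setup notebook accepts for MODULE4_SAFE_ENVIRONMENT.
TRUTHY = {"true", "yes", "1"}


def module4_is_acknowledged(env=None) -> bool:
    """Return True when the operator has marked the environment as safe."""
    env = os.environ if env is None else env
    return env.get("MODULE4_SAFE_ENVIRONMENT", "").lower() in TRUTHY


def require_test_environment(env=None) -> None:
    """Raise before any failure injection if the flag is missing."""
    if not module4_is_acknowledged(env):
        raise RuntimeError(
            "MODULE4_SAFE_ENVIRONMENT not set. "
            "This is required for Module 4 failure labs."
        )


# Example with an explicit mapping instead of the real process environment.
print(module4_is_acknowledged({"MODULE4_SAFE_ENVIRONMENT": "TRUE"}))
print(module4_is_acknowledged({}))
```

Passing the environment as a parameter keeps the check testable without touching `os.environ`. Calling `require_test_environment()` with no argument reads the real environment.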
|
||||
|
||||
## Documentation
|
||||
|
||||
See `docs/modules/module-4.md` for complete module documentation.
|
||||
|
||||
@@ -0,0 +1,614 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Setup or Resume Environment\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This notebook helps you prepare for workshop modules 2 through 4 (Identity & Auth, Production Operations, or Troubleshooting). It validates that your LangSmith environment is running and accessible, or guides you to deploy it using Module 1.\n",
|
||||
"\n",
|
||||
"### About This Notebook\n",
|
||||
"This notebook is **READ-ONLY** and safe to run. It performs validation checks only and does not modify any resources. \n",
|
||||
"\n",
|
||||
"### Module-Specific Notes\n",
|
||||
"- **Modules 2 and 3** are read-only validation notebooks, perfect for understanding your current configuration\n",
|
||||
"- **Module 4** includes hands-on failure labs that intentionally modify secrets to teach troubleshooting—these require a test environment\n",
|
||||
"- Module-specific guidance is provided below to help you understand what to expect\n",
|
||||
"\n",
|
||||
"### Prerequisites\n",
|
||||
"- Module 1 notebooks available (for deployment if needed)\n",
|
||||
"- `kubectl` configured (if environment exists)\n",
|
||||
"\n",
|
||||
"### What This Notebook Does\n",
|
||||
"1. Checks if LangSmith is already deployed\n",
|
||||
"2. If not, provides links to Module 1 deployment notebooks\n",
|
||||
"3. If yes, validates the environment is healthy and reachable\n",
|
||||
"4. **Verifies you're in the correct environment** (shows account/region)\n",
|
||||
"5. Shows module-specific safety warnings\n",
|
||||
"\n",
|
||||
"**Estimated time:** 10-15 minutes\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Bootstrap environment\n",
|
||||
"import sys\n",
|
||||
"import os\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so we can import shared as a package\n",
|
||||
"possible_paths = [\n",
|
||||
" Path.cwd().parent, # If cwd is a module directory, go up one level to notebooks\n",
|
||||
" Path.cwd(), # If cwd is already notebooks\n",
|
||||
" Path.cwd() / \"notebooks\", # If cwd is workspace root\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"notebooks_path = None\n",
|
||||
"for path in possible_paths:\n",
|
||||
" if path and (path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" notebooks_path = path\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not notebooks_path:\n",
|
||||
" notebooks_path = Path.cwd() / \"notebooks\"\n",
|
||||
" if not (notebooks_path / \"shared\" / \"_bootstrap.py\").exists():\n",
|
||||
" raise RuntimeError(f\"Could not find notebooks/shared directory. Current dir: {Path.cwd()}\")\n",
|
||||
"\n",
|
||||
"# Add notebooks directory to path so 'shared' can be imported as a package\n",
|
||||
"if str(notebooks_path) not in sys.path:\n",
|
||||
" sys.path.insert(0, str(notebooks_path))\n",
|
||||
"\n",
|
||||
"from shared._bootstrap import bootstrap\n",
|
||||
"\n",
|
||||
"# Run bootstrap\n",
|
||||
"bootstrap_info = bootstrap()\n",
|
||||
"artifacts_dir = Path(bootstrap_info['artifacts_dir'])\n",
|
||||
"print(f\"\\nArtifacts directory: {artifacts_dir}\")\n",
|
||||
"\n",
|
||||
"# Detect which module is using this notebook\n",
|
||||
"# Check current working directory or environment variable\n",
|
||||
"current_module = None\n",
|
||||
"cwd_str = str(Path.cwd())\n",
|
||||
"if \"module-2\" in cwd_str:\n",
|
||||
" current_module = \"2\"\n",
|
||||
"elif \"module-3\" in cwd_str:\n",
|
||||
" current_module = \"3\"\n",
|
||||
"elif \"module-4\" in cwd_str:\n",
|
||||
" current_module = \"4\"\n",
|
||||
"else:\n",
|
||||
" # Try environment variable\n",
|
||||
" current_module = os.environ.get(\"CURRENT_MODULE\", \"\")\n",
|
||||
" if not current_module:\n",
|
||||
" # Default: assume generic use\n",
|
||||
" current_module = None\n",
|
||||
"\n",
|
||||
"print(f\"\\nDetected module context: {current_module if current_module else 'Generic'}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Environment Safety Verification\n",
|
||||
"\n",
|
||||
"**Before proceeding, verify you're working with the correct environment.**\n",
|
||||
"\n",
|
||||
"**Module-specific notes:**\n",
|
||||
"- **Module 2 (Identity & Auth):** Read-only validation - safe for production\n",
|
||||
"- **Module 3 (Production Operations):** Read-only validation - safe for production\n",
|
||||
"- **Module 4 (Troubleshooting):** Includes failure labs that modify secrets - **TEST environment ONLY**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Environment Safety Check: Verify environment and show module-specific warnings\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region, get_identity\n",
|
||||
"from shared._validation import ok, warn, fail\n",
|
||||
"\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"identity = get_identity()\n",
|
||||
"\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"⚠️ ENVIRONMENT SAFETY CHECK\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"\n",
|
||||
"# Show environment details prominently\n",
|
||||
"provider_display = provider.upper()\n",
|
||||
"print(f\"\\n### Current Environment Configuration\")\n",
|
||||
"print(f\"Cloud Provider: {provider_display}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"\n",
|
||||
"if provider == \"aws\":\n",
|
||||
" account_id = identity.get('Account', 'N/A')\n",
|
||||
" user_arn = identity.get('Arn', 'N/A')\n",
|
||||
" print(f\"Account ID: {account_id}\")\n",
|
||||
" print(f\"User ARN: {user_arn}\")\n",
|
||||
"elif provider == \"azure\":\n",
|
||||
" subscription_id = identity.get(\"SubscriptionId\") or identity.get(\"Account\", \"N/A\")\n",
|
||||
" subscription_name = identity.get(\"SubscriptionName\", \"N/A\")\n",
|
||||
" print(f\"Subscription ID: {subscription_id}\")\n",
|
||||
" print(f\"Subscription Name: {subscription_name}\")\n",
|
||||
"\n",
|
||||
"# Show all relevant environment variables\n",
|
||||
"print(f\"\\n### Environment Variables (for verification)\")\n",
|
||||
"print(f\"NAMESPACE: {os.environ.get('NAMESPACE', 'NOT SET')}\")\n",
|
||||
"print(f\"CLUSTER_NAME: {os.environ.get('CLUSTER_NAME', 'NOT SET')}\")\n",
|
||||
"print(f\"HELM_RELEASE: {os.environ.get('HELM_RELEASE', 'langsmith')}\")\n",
|
||||
"print(f\"LANGSMITH_DOMAIN: {os.environ.get('LANGSMITH_DOMAIN', 'NOT SET')}\")\n",
|
||||
"\n",
|
||||
"# Module-specific safety checks\n",
|
||||
"if current_module == \"4\":\n",
|
||||
" # Module 4: Failure labs require TEST environment\n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" print(\"⚠️ CRITICAL: Module 4 Failure Labs Will Modify Your Environment\")\n",
|
||||
" print(\"=\" * 70)\n",
|
||||
" print(\"\\nThe failure labs in Module 4 will:\")\n",
|
||||
" print(\" ❌ Modify Kubernetes secrets (break passwords/credentials)\")\n",
|
||||
" print(\" ❌ Cause service disruptions (API failures, login failures)\")\n",
|
||||
" print(\" ❌ Require remediation to restore functionality\")\n",
|
||||
" print(\"\\nThis is INTENTIONAL for learning troubleshooting, but:\")\n",
|
||||
" print(\" ⚠️ ONLY run in TEST/NON-PRODUCTION environments\")\n",
|
||||
" print(\" ⚠️ DO NOT run against production systems\")\n",
|
||||
" print(\" ⚠️ Ensure you can restore the environment after labs\")\n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" \n",
|
||||
" # Require explicit confirmation for Module 4\n",
|
||||
" print(\"\\n### Environment Verification Required for Module 4\")\n",
|
||||
" print(\"\\nPlease confirm:\")\n",
|
||||
" print(\" 1. ✅ This is a TEST/NON-PRODUCTION environment\")\n",
|
||||
" print(\" 2. ✅ You understand failure labs will modify secrets\")\n",
|
||||
" print(\" 3. ✅ You have a way to restore the environment (backup/teardown)\")\n",
|
||||
" print(\" 4. ✅ You will NOT run these labs against production\")\n",
|
||||
" \n",
|
||||
" # Check if user has explicitly acknowledged\n",
|
||||
" module4_safe = os.environ.get(\"MODULE4_SAFE_ENVIRONMENT\", \"\").lower()\n",
|
||||
" if module4_safe in [\"true\", \"yes\", \"1\"]:\n",
|
||||
" ok(\"MODULE4_SAFE_ENVIRONMENT flag is set - proceeding\")\n",
|
||||
" print(\"\\n✅ Safety check passed - environment marked as safe for Module 4\")\n",
|
||||
" else:\n",
|
||||
" fail(\"MODULE4_SAFE_ENVIRONMENT flag is NOT set\")\n",
|
||||
" print(\"\\n❌ SAFETY CHECK FAILED\")\n",
|
||||
" print(\"\\nTo proceed with Module 4 failure labs, you MUST:\")\n",
|
||||
" print(\" 1. Verify this is a TEST/NON-PRODUCTION environment\")\n",
|
||||
" print(\" 2. Set MODULE4_SAFE_ENVIRONMENT=true in your .env file\")\n",
|
||||
" print(\" 3. Re-run this cell to confirm\")\n",
|
||||
" print(\"\\nThis flag is REQUIRED to prevent accidental execution in production.\")\n",
|
||||
" raise RuntimeError(\"MODULE4_SAFE_ENVIRONMENT not set. This is required for Module 4 failure labs.\")\n",
|
||||
" \n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" print(\"✅ Environment verified as safe for Module 4 failure labs\")\n",
|
||||
" print(\"=\" * 70)\n",
|
||||
"elif current_module in [\"2\", \"3\"]:\n",
|
||||
" # Modules 2 and 3: Read-only, safe for production\n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" print(f\"✅ Module {current_module} is READ-ONLY\")\n",
|
||||
" print(\"=\" * 70)\n",
|
||||
" if current_module == \"2\":\n",
|
||||
" print(\"\\nModule 2 (Identity & Auth) notebooks:\")\n",
|
||||
" print(\" ✅ Perform read-only validation checks\")\n",
|
||||
" print(\" ✅ Do NOT modify any infrastructure or secrets\")\n",
|
||||
" print(\" ✅ Safe to run against production environments\")\n",
|
||||
" elif current_module == \"3\":\n",
|
||||
" print(\"\\nModule 3 (Production Operations) notebooks:\")\n",
|
||||
" print(\" ✅ Perform read-only validation and signal checks\")\n",
|
||||
" print(\" ✅ Do NOT modify any infrastructure or resources\")\n",
|
||||
" print(\" ✅ Safe to run against production environments\")\n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" ok(\"Environment check complete - safe to proceed with read-only validation\")\n",
|
||||
"else:\n",
|
||||
" # Generic use - show general warning\n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" print(\"⚠️ MODULE CONTEXT NOT DETECTED\")\n",
|
||||
" print(\"=\" * 70)\n",
|
||||
" print(\"\\nThis notebook is used by multiple modules:\")\n",
|
||||
" print(\" - Module 2: Read-only validation (safe for production)\")\n",
|
||||
" print(\" - Module 3: Read-only validation (safe for production)\")\n",
|
||||
" print(\" - Module 4: Failure labs (TEST environment ONLY)\")\n",
|
||||
" print(\"\\n💡 If using Module 4, ensure MODULE4_SAFE_ENVIRONMENT=true is set\")\n",
|
||||
" print(\"\\n\" + \"=\" * 70)\n",
|
||||
" ok(\"Environment check complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Configuration\n",
|
||||
"\n",
|
||||
"Load and validate configuration from environment variables.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from shared._validation import require_env, ok, warn\n",
|
||||
"from shared._cloud_helpers import get_cloud_provider, get_region\n",
|
||||
"\n",
|
||||
"# Required configuration\n",
|
||||
"required_vars = [\"NAMESPACE\", \"CLUSTER_NAME\"]\n",
|
||||
"\n",
|
||||
"print(\"### Loading Configuration\\n\")\n",
|
||||
"\n",
|
||||
"config = {}\n",
|
||||
"missing = []\n",
|
||||
"\n",
|
||||
"for var in required_vars:\n",
|
||||
" value = os.environ.get(var, \"\").strip()\n",
|
||||
" if not value:\n",
|
||||
" missing.append(var)\n",
|
||||
" config[var] = value\n",
|
||||
"\n",
|
||||
"if missing:\n",
|
||||
" raise RuntimeError(f\"❌ Missing required environment variables: {', '.join(missing)}\\n\"\n",
|
||||
" f\"💡 Copy env-samples/workshop.env.example to your .env file and fill in values\")\n",
|
||||
"\n",
|
||||
"# Optional but recommended\n",
|
||||
"config[\"HELM_RELEASE\"] = os.environ.get(\"HELM_RELEASE\", \"langsmith\")\n",
|
||||
"config[\"LANGSMITH_DOMAIN\"] = os.environ.get(\"LANGSMITH_DOMAIN\", \"\")\n",
|
||||
"\n",
|
||||
"# Show cloud provider info\n",
|
||||
"provider = get_cloud_provider()\n",
|
||||
"region = get_region()\n",
|
||||
"\n",
|
||||
"print(f\"Cloud Provider: {provider.upper()}\")\n",
|
||||
"print(f\"Region: {region}\")\n",
|
||||
"print(f\"Namespace: {config['NAMESPACE']}\")\n",
|
||||
"print(f\"Cluster: {config['CLUSTER_NAME']}\")\n",
|
||||
"print(f\"Helm Release: {config['HELM_RELEASE']}\")\n",
|
||||
"\n",
|
||||
"if config[\"LANGSMITH_DOMAIN\"]:\n",
|
||||
" print(f\"LangSmith Domain: {config['LANGSMITH_DOMAIN']}\")\n",
|
||||
"\n",
|
||||
"ok(\"Configuration loaded\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. Check if Environment Exists\n",
|
||||
"\n",
|
||||
"We'll check if LangSmith is already deployed. If not, we'll provide instructions to deploy using Module 1.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from shared._shell import run\n",
|
||||
"from shared._cloud_helpers import cluster_exists, configure_kubectl, get_kubernetes_service_name\n",
|
||||
"\n",
|
||||
"namespace = config[\"NAMESPACE\"]\n",
|
||||
"cluster_name = config[\"CLUSTER_NAME\"]\n",
|
||||
"k8s_service = get_kubernetes_service_name()\n",
|
||||
"\n",
|
||||
"print(f\"### Checking {k8s_service} Cluster\\n\")\n",
|
||||
"\n",
|
||||
"# Check if cluster exists\n",
|
||||
"if cluster_exists(cluster_name):\n",
|
||||
" ok(f\"Cluster '{cluster_name}' exists\")\n",
|
||||
" \n",
|
||||
" # Configure kubectl\n",
|
||||
" print(f\"\\n### Configuring kubectl\\n\")\n",
|
||||
" try:\n",
|
||||
" configure_kubectl(cluster_name, region)\n",
|
||||
" ok(\"kubectl configured\")\n",
|
||||
" except Exception as e:\n",
|
||||
" warn(f\"Could not configure kubectl: {e}\")\n",
|
||||
" print(\"💡 Make sure you have proper cloud provider credentials\")\n",
|
||||
" raise\n",
|
||||
"else:\n",
|
||||
" warn(f\"Cluster '{cluster_name}' not found\")\n",
|
||||
" print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
|
||||
" print(\" See the 'Deploy Environment' section below.\")\n",
|
||||
" raise RuntimeError(\"Cluster not found. Deploy using Module 1 first.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Verify Namespace and Helm Release\n",
|
||||
"\n",
|
||||
"Check that the LangSmith namespace exists and Helm release is installed.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"\n",
|
||||
"helm_release = config[\"HELM_RELEASE\"]\n",
|
||||
"\n",
|
||||
"print(\"### Checking Namespace\\n\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"namespace\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ok(f\"Namespace '{namespace}' exists\")\n",
|
||||
"else:\n",
|
||||
" warn(f\"Namespace '{namespace}' not found\")\n",
|
||||
" print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
|
||||
" print(\" See the 'Deploy Environment' section below.\")\n",
|
||||
" raise RuntimeError(\"Namespace not found. Deploy using Module 1 first.\")\n",
|
||||
"\n",
|
||||
"print(\"\\n### Checking Helm Release\\n\")\n",
|
||||
"result = run(\n",
|
||||
" [\"helm\", \"list\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" releases = json.loads(result.stdout)\n",
|
||||
" release_found = any(r.get(\"name\") == helm_release for r in releases)\n",
|
||||
" \n",
|
||||
" if release_found:\n",
|
||||
" ok(f\"Helm release '{helm_release}' found\")\n",
|
||||
" # Get release info\n",
|
||||
" result = run(\n",
|
||||
" [\"helm\", \"status\", helm_release, \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
" )\n",
|
||||
" if result.returncode == 0:\n",
|
||||
" release_info = json.loads(result.stdout)\n",
|
||||
" print(f\" Status: {release_info.get('info', {}).get('status', 'unknown')}\")\n",
|
||||
" print(f\" Chart: {release_info.get('chart', {}).get('metadata', {}).get('name', 'unknown')}\")\n",
|
||||
" print(f\" Version: {release_info.get('chart', {}).get('metadata', {}).get('version', 'unknown')}\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Helm release '{helm_release}' not found in namespace '{namespace}'\")\n",
|
||||
" print(\"\\n💡 You need to deploy LangSmith first using Module 1 notebooks.\")\n",
|
||||
" print(\" See the 'Deploy Environment' section below.\")\n",
|
||||
" raise RuntimeError(\"Helm release not found. Deploy using Module 1 first.\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not list Helm releases\")\n",
|
||||
" print(\"💡 Make sure Helm is installed and kubectl is configured correctly\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Verify Ingress Endpoint\n",
|
||||
"\n",
|
||||
"Check that the LangSmith ingress is configured and reachable.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"from urllib.parse import urlparse\n",
|
||||
"\n",
|
||||
"print(\"### Checking Ingress\\n\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"ingress\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"ingress_found = False\n",
|
||||
"ingress_host = None\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" ingresses = json.loads(result.stdout)\n",
|
||||
" for ingress in ingresses.get(\"items\", []):\n",
|
||||
" rules = ingress.get(\"spec\", {}).get(\"rules\", [])\n",
|
||||
" for rule in rules:\n",
|
||||
" host = rule.get(\"host\", \"\")\n",
|
||||
" if host:\n",
|
||||
" ingress_found = True\n",
|
||||
" ingress_host = host\n",
|
||||
" print(f\" Found ingress with host: {host}\")\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
"if not ingress_found:\n",
|
||||
" warn(\"No ingress found\")\n",
|
||||
" print(\"💡 Ingress may still be provisioning. Check Module 1 validation notebook.\")\n",
|
||||
"else:\n",
|
||||
" ok(f\"Ingress configured with host: {ingress_host}\")\n",
|
||||
" \n",
|
||||
" # Try to reach the endpoint\n",
|
||||
" if config[\"LANGSMITH_DOMAIN\"]:\n",
|
||||
" test_url = f\"https://{config['LANGSMITH_DOMAIN']}\"\n",
|
||||
" elif ingress_host:\n",
|
||||
" test_url = f\"https://{ingress_host}\"\n",
|
||||
" else:\n",
|
||||
" test_url = None\n",
|
||||
" \n",
|
||||
" if test_url:\n",
|
||||
" print(f\"\\n### Testing Endpoint Reachability\\n\")\n",
|
||||
" print(f\"Testing: {test_url}\")\n",
|
||||
" try:\n",
|
||||
" # Allow redirects, don't verify SSL (may be self-signed)\n",
|
||||
" response = requests.get(test_url, allow_redirects=True, verify=False, timeout=10)\n",
|
||||
" if response.status_code in [200, 302, 401, 403]:\n",
|
||||
" ok(f\"Endpoint is reachable (HTTP {response.status_code})\")\n",
|
||||
" else:\n",
|
||||
" warn(f\"Endpoint returned unexpected status: {response.status_code}\")\n",
|
||||
" except requests.exceptions.SSLError:\n",
|
||||
" # SSL error is OK if using self-signed certs\n",
|
||||
" warn(\"SSL verification failed (may be self-signed certificate)\")\n",
|
||||
" print(\"💡 This is OK for testing. In production, use proper TLS certificates.\")\n",
|
||||
" except requests.exceptions.RequestException as e:\n",
|
||||
" warn(f\"Could not reach endpoint: {e}\")\n",
|
||||
" print(\"💡 Ingress may still be provisioning. Wait a few minutes and try again.\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No domain configured for testing\")\n",
|
||||
" print(\"💡 Set LANGSMITH_DOMAIN in your .env file to test endpoint reachability\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Quick Health Check\n",
|
||||
"\n",
|
||||
"Verify that key deployments are running.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"### Checking Key Deployments\\n\")\n",
|
||||
"result = run(\n",
|
||||
" [\"kubectl\", \"get\", \"deployments\", \"-n\", namespace, \"-o\", \"json\"],\n",
|
||||
" check=False,\n",
|
||||
" stream=False\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if result.returncode == 0:\n",
|
||||
" deployments = json.loads(result.stdout)\n",
|
||||
" deployment_items = deployments.get(\"items\", [])\n",
|
||||
" \n",
|
||||
" if deployment_items:\n",
|
||||
" ok(f\"Found {len(deployment_items)} deployment(s)\")\n",
|
||||
" print(\"\\nDeployment Status:\")\n",
|
||||
" for deployment in deployment_items:\n",
|
||||
" name = deployment.get(\"metadata\", {}).get(\"name\", \"\")\n",
|
||||
" spec_replicas = deployment.get(\"spec\", {}).get(\"replicas\", 0)\n",
|
||||
" status_replicas = deployment.get(\"status\", {}).get(\"replicas\", 0)\n",
|
||||
" ready_replicas = deployment.get(\"status\", {}).get(\"readyReplicas\", 0)\n",
|
||||
" available_replicas = deployment.get(\"status\", {}).get(\"availableReplicas\", 0)\n",
|
||||
" \n",
|
||||
" status_icon = \"✅\" if ready_replicas == spec_replicas and available_replicas == spec_replicas else \"⚠️\"\n",
|
||||
" print(f\" {status_icon} {name}: {ready_replicas}/{spec_replicas} ready, {available_replicas}/{spec_replicas} available\")\n",
|
||||
" else:\n",
|
||||
" warn(\"No deployments found\")\n",
|
||||
" print(\"💡 LangSmith may not be fully deployed. Check Module 1 validation notebook.\")\n",
|
||||
"else:\n",
|
||||
" warn(\"Could not list deployments\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ✅ Environment Ready\n",
|
||||
"\n",
|
||||
"Your LangSmith environment is running and accessible. You're ready to proceed with your module.\n",
|
||||
"\n",
|
||||
"**Next Steps (Module-Specific):**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Show module-specific next steps\n",
|
||||
"print(\"### Module-Specific Next Steps\\n\")\n",
|
||||
"\n",
|
||||
"if current_module == \"2\":\n",
|
||||
" print(\"**Module 2: Identity & Auth**\\n\")\n",
|
||||
" print(\"1. Run `01_sso_oidc_validation.ipynb` to validate OIDC SSO configuration\")\n",
|
||||
" print(\"2. (Optional) Run `02_sso_saml_validation.ipynb` if using SAML\")\n",
|
||||
" print(\"\\n💡 These notebooks are read-only and safe for production use.\")\n",
|
||||
"elif current_module == \"3\":\n",
|
||||
" print(\"**Module 3: Production Operations & Scaling**\\n\")\n",
|
||||
" print(\"1. Run `01_ops_sanity_checks.ipynb` to validate production readiness\")\n",
|
||||
" print(\"2. Review production readiness checklist: `docs/shared/production_readiness_checklist.md`\")\n",
|
||||
" print(\"3. Review signals and thresholds: `docs/shared/ops_signals_and_thresholds.md`\")\n",
|
||||
" print(\"\\n💡 This notebook is read-only and safe for production use.\")\n",
|
||||
"elif current_module == \"4\":\n",
|
||||
" print(\"**Module 4: Troubleshooting & Incident Response**\\n\")\n",
|
||||
" print(\"1. Run `01_diagnostics_baseline.ipynb` to capture a baseline snapshot\")\n",
|
||||
" print(\"2. Proceed with failure labs (10, 20, 30, 40)\")\n",
|
||||
" print(\"3. Optionally run `90_full_incident_drill.ipynb` for a complete incident simulation\")\n",
|
||||
" print(\"\\n⚠️ REMINDER: Module 4 failure labs modify secrets and cause disruptions.\")\n",
|
||||
" print(\" Only run in TEST/NON-PRODUCTION environments.\")\n",
|
||||
"else:\n",
|
||||
" print(\"**Generic Use**\\n\")\n",
|
||||
" print(\"This notebook can be used by:\")\n",
|
||||
" print(\" - Module 2: Identity & Auth validation (read-only)\")\n",
|
||||
" print(\" - Module 3: Production Operations checks (read-only)\")\n",
|
||||
" print(\" - Module 4: Troubleshooting failure labs (modifies environment)\")\n",
|
||||
" print(\"\\n💡 Navigate to the appropriate module directory and run this notebook from there.\")\n",
|
||||
"\n",
|
||||
"print(\"\\n\" + \"=\" * 70)\n",
|
||||
"print(\"📝 Important Reminder\")\n",
|
||||
"print(\"=\" * 70)\n",
|
||||
"print(\"\\n**When finished with workshop modules, run Module 1's `99_teardown.ipynb`\")\n",
|
||||
"print(\"to delete the environment and avoid ongoing cloud costs.**\")\n",
|
||||
"print(\"\\nThe teardown notebook will:\")\n",
|
||||
"print(\" - Remove Helm release\")\n",
|
||||
"print(\" - Destroy Terraform-managed infrastructure (Kubernetes cluster, database, cache, blob storage, etc.)\")\n",
|
||||
"print(\" - Clean up any remaining resources\")\n",
|
||||
"print(\"\\n**Location:** `../module-1/99_teardown.ipynb`\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🚀 Deploy Environment (If Not Already Deployed)\n",
|
||||
"\n",
|
||||
"If your environment is not running, follow these steps to deploy LangSmith using Module 1:\n",
|
||||
"\n",
|
||||
"### Step 1: Preflight Checks\n",
|
||||
"Run `../module-1/01_preflight.ipynb` to validate your environment.\n",
|
||||
"\n",
|
||||
"### Step 2: Provision Infrastructure\n",
|
||||
"Run `../module-1/02_terraform_apply.ipynb` to deploy cloud infrastructure (Kubernetes cluster, database, cache, blob storage).\n",
|
||||
"\n",
|
||||
"### Step 3: Install LangSmith\n",
|
||||
"Run `../module-1/03_helm_install_langsmith.ipynb` to install LangSmith using Helm.\n",
|
||||
"\n",
|
||||
"### Step 4: Validate Deployment\n",
|
||||
"Run `../module-1/04_validate_ingress_and_ui.ipynb` to verify everything is working.\n",
|
||||
"\n",
|
||||
"### Step 5: Return Here\n",
|
||||
"Once deployment is complete, return to this notebook and re-run the cells above to verify your environment is ready.\n",
|
||||
"\n",
|
||||
"---\n",
|
||||
"\n",
|
||||
"**Note:** If you encounter errors during deployment, refer to Module 1 documentation and troubleshooting guides.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": ".venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python",
|
||||
"version": "3.9.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,5 +1,6 @@
|
||||
from __future__ import annotations
|
||||
import os
|
||||
from datetime import date
|
||||
|
||||
def ok(msg: str) -> None:
|
||||
print(f"✅ {msg}")
|
||||
@@ -18,6 +19,9 @@ def require_env(*keys: str) -> dict:
|
||||
if not v:
|
||||
missing.append(k)
|
||||
cfg[k] = v
|
||||
if k == 'CLUSTER_NAME':
|
||||
# Prefix the cluster name with "langsmith-workshop-<YYYYMMDD>-" (the date part changes daily)
|
||||
cfg[k] = f"langsmith-workshop-{date.today().strftime('%Y%m%d')}-{v}"
|
||||
if missing:
|
||||
fail(f"Missing required environment variables: {', '.join(missing)}")
|
||||
return cfg
|
||||
|
||||
@@ -0,0 +1,11 @@
|
||||
# Test artifacts
|
||||
artifacts/
|
||||
*.pyc
|
||||
__pycache__/
|
||||
.pytest_cache/
|
||||
.coverage
|
||||
htmlcov/
|
||||
|
||||
# Notebook execution outputs
|
||||
*.ipynb_checkpoints
|
||||
|
||||
@@ -0,0 +1,127 @@
|
||||
# Tests for LangSmith Self-Hosted Workshops
|
||||
|
||||
This directory contains tests for validating notebook execution and syntax.
|
||||
|
||||
## Test Structure
|
||||
|
||||
- `conftest.py`: Pytest configuration and fixtures
|
||||
- `test_notebook_execution.py`: Notebook execution tests
|
||||
- `requirements.txt`: Test dependencies
|
||||
- `artifacts/`: Directory for test artifacts (created automatically)
|
||||
|
||||
## Running Tests Locally
|
||||
|
||||
### Prerequisites
|
||||
|
||||
```bash
|
||||
# Install test dependencies
|
||||
pip install -r tests/requirements.txt
|
||||
|
||||
# Install system dependencies (if needed)
|
||||
# macOS: brew install jq
|
||||
# Ubuntu: sudo apt-get install jq
|
||||
```
|
||||
|
||||
### Run All Tests
|
||||
|
||||
```bash
|
||||
# Run syntax tests only (fast, no infrastructure required)
|
||||
CI_SKIP_EXECUTION=true pytest tests/ -v
|
||||
|
||||
# Run full execution tests (requires infrastructure)
|
||||
pytest tests/ -v
|
||||
```
|
||||
|
||||
### Run Specific Test Suites
|
||||
|
||||
```bash
|
||||
# Test Module 1 notebooks
|
||||
pytest tests/test_notebook_execution.py::TestModule1Notebooks -v
|
||||
|
||||
# Test Module 2 notebooks
|
||||
pytest tests/test_notebook_execution.py::TestModule2Notebooks -v
|
||||
```
|
||||
|
||||
### Run Individual Notebook Tests
|
||||
|
||||
```bash
|
||||
# Test specific notebook syntax
|
||||
pytest tests/test_notebook_execution.py::TestModule1Notebooks::test_module1_notebook_syntax -v
|
||||
```
|
||||
|
||||
## CI/CD Integration
|
||||
|
||||
Tests run automatically on:
|
||||
- Pull requests to `main`/`master`
|
||||
- Pushes to `main`/`master`
|
||||
- Manual workflow dispatch
|
||||
|
||||
### GitHub Actions Workflow
|
||||
|
||||
The `.github/workflows/test-notebooks.yml` workflow:
|
||||
|
||||
1. **Test Notebook Syntax**: Validates JSON structure and code cells
|
||||
2. **Test Module 1 Preflight**: Validates preflight notebook structure
|
||||
3. **Test Module 2 Syntax**: Validates auth validation notebooks
|
||||
4. **Lint Python Code**: Runs flake8 and black checks
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The workflow uses test environment variables. For full execution tests, set:
|
||||
|
||||
```yaml
|
||||
# In GitHub Actions secrets/variables
|
||||
AWS_ACCESS_KEY_ID
|
||||
AWS_SECRET_ACCESS_KEY
|
||||
AWS_REGION
|
||||
CLUSTER_NAME
|
||||
NAMESPACE
|
||||
# ... etc
|
||||
```
|
||||
|
||||
## Test Strategy
|
||||
|
||||
### Syntax Tests (Always Run)
|
||||
|
||||
- Validate notebook JSON structure
|
||||
- Check for code cells
|
||||
- Verify imports can be resolved
|
||||
- No infrastructure required
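
The syntax checks listed above could go one step further and confirm that each code cell parses as Python. The following is a minimal sketch, not the repository's implementation: `find_cell_syntax_errors` is a hypothetical helper, and it assumes that any line starting with `%` or `!` is an IPython magic or shell escape that can safely be swapped for `pass`.

```python
import ast
import json
from pathlib import Path


def find_cell_syntax_errors(notebook_path: Path) -> list[tuple[int, str]]:
    """Return (cell_index, message) for each code cell that fails to parse."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    errors = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)
        cleaned = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")):
                # Assumption: this is an IPython magic or shell escape.
                # Replace it with `pass`, keeping the indentation intact.
                indent = line[: len(line) - len(stripped)]
                cleaned.append(indent + "pass")
            else:
                cleaned.append(line)
        try:
            ast.parse("\n".join(cleaned))
        except SyntaxError as exc:
            errors.append((index, f"line {exc.lineno}: {exc.msg}"))
    return errors
```

An empty return value means every code cell parsed. A test could assert `find_cell_syntax_errors(path) == []` alongside the existing structure checks.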
|
||||
|
||||
### Execution Tests (Conditional)
|
||||
|
||||
- Full notebook execution
|
||||
- Requires actual infrastructure (cluster, IdP, etc.)
|
||||
- Skipped in CI by default (`CI_SKIP_EXECUTION=true`)
|
||||
- Can be enabled for integration testing environments
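
As a rough illustration of the gate, the execution tests in `test_notebook_execution.py` compare `CI_SKIP_EXECUTION` against the exact string `"true"`. The helper below is a hypothetical stand-in for that condition (the name `execution_skipped` is not part of the repository), written so the behaviour can be checked in isolation.

```python
import os
from typing import Mapping, Optional


def execution_skipped(env: Optional[Mapping[str, str]] = None) -> bool:
    """Mirror the skipif condition: only the exact string "true" skips."""
    env = os.environ if env is None else env
    return env.get("CI_SKIP_EXECUTION") == "true"
```

Under this condition, values such as `True` or `1` do not skip execution. Because `skipif` arguments are evaluated when the test module is imported, the variable needs to be set before pytest starts.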
|
||||
|
||||
## Adding New Tests
|
||||
|
||||
1. Add notebook to appropriate test class in `test_notebook_execution.py`
|
||||
2. Update the `@pytest.mark.parametrize` decorator with the notebook name
|
||||
3. Add any required environment variables to `conftest.py`
|
||||
4. Update GitHub Actions workflow if needed
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Import Errors
|
||||
|
||||
If tests fail with import errors:
|
||||
- Ensure `notebooks/shared/` is in Python path
|
||||
- Check that `conftest.py` is setting up paths correctly
|
||||
- Verify all required packages are in `requirements.txt`
|
||||
|
||||
### Timeout Errors
|
||||
|
||||
If notebook execution times out:
|
||||
- Increase timeout in `execute_notebook()` function
|
||||
- Check for infinite loops or long-running operations
|
||||
- Consider mocking external API calls
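
Two limits are in play when a notebook times out: nbconvert applies `--ExecutePreprocessor.timeout` to each cell, while the `timeout` argument of `subprocess.run` bounds the whole run. The sketch below shows one way to build the command so the per-cell limit can be raised together with the overall one. `build_nbconvert_command` and `cell_timeout` are hypothetical names, not part of the repository.

```python
import sys
from pathlib import Path


def build_nbconvert_command(notebook_path: Path, cell_timeout: int = 600) -> list[str]:
    """Build an nbconvert command whose per-cell timeout is configurable."""
    return [
        sys.executable, "-m", "jupyter", "nbconvert",
        "--to", "notebook", "--execute", "--inplace",
        # Per-cell limit in seconds; the overall limit is set on subprocess.run.
        f"--ExecutePreprocessor.timeout={cell_timeout}",
        "--ExecutePreprocessor.kernel_name=python3",
        str(notebook_path),
    ]
```

For a long-running notebook such as teardown, the same value could be passed both here and as the `timeout` of `subprocess.run`, so that a single slow cell is not cut off early.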
|
||||
|
||||
### Environment Variable Issues
|
||||
|
||||
If tests fail due to missing env vars:
|
||||
- Check `conftest.py` for default values
|
||||
- Verify GitHub Actions workflow sets required variables
|
||||
- Add variables to test fixtures if needed
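
A quick way to see which variables are absent before running the suite is sketched below. `missing_env_vars` is a hypothetical helper, and the list of required names is whatever your notebooks need (an assumption on the caller's side).

```python
import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

For example, `missing_env_vars(["NAMESPACE", "CLUSTER_NAME", "HELM_RELEASE"])` returns an empty list when all three are set in the current environment.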
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
# Tests for LangSmith Self-Hosted Workshops notebooks
|
||||
|
||||
@@ -0,0 +1,28 @@
|
||||
"""
|
||||
Pytest configuration and fixtures for notebook testing.
|
||||
"""
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Add notebooks directory to path
|
||||
repo_root = Path(__file__).parent.parent
|
||||
notebooks_dir = repo_root / "notebooks"
|
||||
if str(notebooks_dir) not in sys.path:
|
||||
sys.path.insert(0, str(notebooks_dir))
|
||||
|
||||
# Set test environment variables
|
||||
os.environ.setdefault("NAMESPACE", "langsmith-test")
|
||||
os.environ.setdefault("CLUSTER_NAME", "test-cluster")
|
||||
os.environ.setdefault("HELM_RELEASE", "langsmith")
|
||||
os.environ.setdefault("ARTIFACTS_DIR", str(repo_root / "tests" / "artifacts"))
|
||||
|
||||
# Cloud provider defaults (can be overridden by GitHub Actions)
|
||||
os.environ.setdefault("CLOUD_PROVIDER", "aws")
|
||||
os.environ.setdefault("AWS_REGION", "us-west-2")
|
||||
os.environ.setdefault("AZURE_LOCATION", "eastus")
|
||||
|
||||
# Create artifacts directory
|
||||
artifacts_dir = Path(os.environ["ARTIFACTS_DIR"])
|
||||
artifacts_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
@@ -0,0 +1,11 @@
|
||||
# Test dependencies for notebook execution
|
||||
pytest>=7.0.0
|
||||
jupyter>=1.0.0
|
||||
nbconvert>=6.0.0
|
||||
ipykernel>=6.0.0
|
||||
|
||||
# Notebook dependencies (should match what notebooks need)
|
||||
python-dotenv>=1.0.0
|
||||
pyyaml>=6.0
|
||||
requests>=2.28.0
|
||||
|
||||
@@ -0,0 +1,283 @@
|
||||
"""
|
||||
Test notebook execution using nbconvert.
|
||||
|
||||
This module executes notebooks and validates they complete without errors.
|
||||
"""
|
||||
import json
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
from pathlib import Path
|
||||
import pytest
|
||||
|
||||
# Repository root
|
||||
REPO_ROOT = Path(__file__).parent.parent
|
||||
NOTEBOOKS_DIR = REPO_ROOT / "notebooks"
|
||||
|
||||
|
||||
def execute_notebook(notebook_path: Path, timeout: int = 600) -> tuple[bool, str]:
|
||||
"""
|
||||
Execute a Jupyter notebook using nbconvert.
|
||||
|
||||
Args:
|
||||
notebook_path: Path to the notebook file
|
||||
timeout: Maximum execution time in seconds
|
||||
|
||||
Returns:
|
||||
Tuple of (success: bool, output: str)
|
||||
"""
|
||||
try:
|
||||
# Use nbconvert to execute the notebook
|
||||
result = subprocess.run(
|
||||
[
|
||||
sys.executable,
|
||||
"-m",
|
||||
"jupyter",
|
||||
"nbconvert",
|
||||
"--to",
|
||||
"notebook",
|
||||
"--execute",
|
||||
"--inplace",
|
||||
f"--ExecutePreprocessor.timeout={timeout}",  # per-cell limit follows the caller's timeout instead of a fixed 600 s
|
||||
"--ExecutePreprocessor.kernel_name=python3",
|
||||
str(notebook_path),
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=timeout,
|
||||
cwd=str(notebook_path.parent),
|
||||
)
|
||||
|
||||
if result.returncode == 0:
|
||||
return True, result.stdout
|
||||
else:
|
||||
error_msg = f"STDOUT:\n{result.stdout}\n\nSTDERR:\n{result.stderr}"
|
||||
return False, error_msg
|
||||
|
||||
except subprocess.TimeoutExpired:
|
||||
return False, f"Notebook execution timed out after {timeout} seconds"
|
||||
except Exception as e:
|
||||
return False, f"Error executing notebook: {str(e)}"
|
||||
|
||||
|
||||
def get_notebook_cells(notebook_path: Path) -> list:
|
||||
"""Get all code cells from a notebook."""
|
||||
with open(notebook_path, "r") as f:
|
||||
nb = json.load(f)
|
||||
return [cell for cell in nb.get("cells", []) if cell.get("cell_type") == "code"]
|
||||
|
||||
|
||||
class TestNotebookExecution:
|
||||
"""Base class for notebook execution tests."""
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def setup_test_env(self, monkeypatch):
|
||||
"""Set up test environment variables."""
|
||||
# Set minimal required env vars for testing
|
||||
test_env = {
|
||||
"NAMESPACE": "langsmith-test",
|
||||
"CLUSTER_NAME": "test-cluster",
|
||||
"HELM_RELEASE": "langsmith",
|
||||
"ARTIFACTS_DIR": str(REPO_ROOT / "tests" / "artifacts"),
|
||||
"CLOUD_PROVIDER": os.environ.get("CLOUD_PROVIDER", "aws"),
|
||||
"AWS_REGION": os.environ.get("AWS_REGION", "us-west-2"),
|
||||
"AZURE_LOCATION": os.environ.get("AZURE_LOCATION", "eastus"),
|
||||
# Mock values for testing (will fail actual operations but allow syntax checks)
|
||||
"LANGSMITH_DOMAIN": "test.langsmith.example.com",
|
||||
"OIDC_ISSUER": "https://test-idp.example.com/oauth2/default",
|
||||
"OIDC_CLIENT_ID": "test-client-id",
|
||||
"OIDC_CLIENT_SECRET": "test-client-secret",
|
||||
"OIDC_REDIRECT_URI": "https://test.langsmith.example.com/auth/callback",
|
||||
}
|
||||
|
||||
for key, value in test_env.items():
|
||||
monkeypatch.setenv(key, value)
|
||||
|
||||
def _validate_notebook_syntax(self, notebook_path: Path):
|
||||
"""Helper method to validate notebook has valid JSON structure and code cells."""
|
||||
assert notebook_path.exists(), f"Notebook not found: {notebook_path}"
|
||||
|
||||
with open(notebook_path, "r") as f:
|
||||
nb = json.load(f)
|
||||
|
||||
assert "cells" in nb, "Notebook missing cells"
|
||||
assert len(nb["cells"]) > 0, "Notebook has no cells"
|
||||
|
||||
code_cells = [c for c in nb["cells"] if c.get("cell_type") == "code"]
|
||||
assert len(code_cells) > 0, "Notebook has no code cells"
|
||||
|
||||
|
||||
# Module 1 tests
|
||||
class TestModule1Notebooks(TestNotebookExecution):
|
||||
"""Test Module 1 notebooks."""
|
||||
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"01_preflight.ipynb",
|
||||
"99_teardown.ipynb", # Always test syntax, even if execution is skipped
|
||||
# Note: Skip terraform/helm/validation notebooks in CI as they require actual infrastructure
|
||||
# "02_terraform_apply.ipynb",
|
||||
# "03_helm_install_langsmith.ipynb",
|
||||
# "04_validate_ingress_and_ui.ipynb",
|
||||
])
|
||||
def test_module1_notebook_syntax(self, notebook):
|
||||
"""Test Module 1 notebook syntax."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-1" / notebook
|
||||
self._validate_notebook_syntax(notebook_path)
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("CI_SKIP_EXECUTION") == "true",
|
||||
reason="Skipping execution in CI (requires infrastructure)"
|
||||
)
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"01_preflight.ipynb",
|
||||
])
|
||||
def test_module1_notebook_execution(self, notebook):
|
||||
"""Test Module 1 notebook execution (only if infrastructure available)."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-1" / notebook
|
||||
success, output = execute_notebook(notebook_path, timeout=300)
|
||||
assert success, f"Notebook execution failed:\n{output}"
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("CI_SKIP_EXECUTION") == "true",
|
||||
reason="Skipping execution in CI (requires infrastructure)"
|
||||
)
|
||||
def test_module1_teardown_execution(self):
|
||||
"""
|
||||
Test Module 1 teardown notebook execution.
|
||||
|
||||
This test runs when CI_SKIP_EXECUTION is not true, ensuring that
|
||||
resources created during execution tests are properly cleaned up.
|
||||
|
||||
IMPORTANT: This test should run AFTER other execution tests to ensure
|
||||
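proper cleanup; note that pytest's default ordering runs it before the Module 2-4 classes defined below. It destroys infrastructure only once the notebook's destroy cells are uncommented.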
|
||||
|
||||
Note: The teardown notebook has commented-out code sections that must be
|
||||
uncommented to actually destroy resources. This test validates the notebook
|
||||
structure and execution flow, but actual resource destruction requires
|
||||
manual uncommenting in the notebook itself.
|
||||
"""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-1" / "99_teardown.ipynb"
|
||||
# Teardown may take longer, especially for Terraform destroy
|
||||
# Using 30 minutes timeout to allow for full infrastructure teardown
|
||||
success, output = execute_notebook(notebook_path, timeout=1800) # 30 minutes
|
||||
assert success, f"Teardown notebook execution failed:\n{output}"
|
||||
|
||||
|
||||
# Module 2 tests
|
||||
class TestModule2Notebooks(TestNotebookExecution):
|
||||
"""Test Module 2 notebooks."""
|
||||
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"01_sso_oidc_validation.ipynb",
|
||||
"02_sso_saml_validation.ipynb",
|
||||
])
|
||||
def test_module2_notebook_syntax(self, notebook):
|
||||
"""Test Module 2 notebook syntax."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-2" / notebook
|
||||
self._validate_notebook_syntax(notebook_path)
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("CI_SKIP_EXECUTION") == "true",
|
||||
reason="Skipping execution in CI (requires infrastructure)"
|
||||
)
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"01_sso_oidc_validation.ipynb",
|
||||
"02_sso_saml_validation.ipynb",
|
||||
])
|
||||
def test_module2_notebook_execution(self, notebook):
|
||||
"""Test Module 2 notebook execution (only if infrastructure available)."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-2" / notebook
|
||||
success, output = execute_notebook(notebook_path, timeout=300)
|
||||
assert success, f"Notebook execution failed:\n{output}"
|
||||
|
||||
|
||||
# Module 3 tests
|
||||
class TestModule3Notebooks(TestNotebookExecution):
|
||||
"""Test Module 3 notebooks."""
|
||||
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"01_ops_sanity_checks.ipynb",
|
||||
])
|
||||
def test_module3_notebook_syntax(self, notebook):
|
||||
"""Test Module 3 notebook syntax."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-3" / notebook
|
||||
self._validate_notebook_syntax(notebook_path)
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("CI_SKIP_EXECUTION") == "true",
|
||||
reason="Skipping execution in CI (requires infrastructure)"
|
||||
)
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"01_ops_sanity_checks.ipynb",
|
||||
])
|
||||
def test_module3_notebook_execution(self, notebook):
|
||||
"""Test Module 3 notebook execution (only if infrastructure available)."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-3" / notebook
|
||||
# Ops sanity checks may take longer due to resource usage checks
|
||||
success, output = execute_notebook(notebook_path, timeout=600)
|
||||
assert success, f"Notebook execution failed:\n{output}"
|
||||
|
||||
|
||||
# Module 4 tests
|
||||
class TestModule4Notebooks(TestNotebookExecution):
|
||||
"""Test Module 4 notebooks."""
|
||||
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"00_setup_or_resume_environment.ipynb",
|
||||
"01_diagnostics_baseline.ipynb",
|
||||
"10_failure_lab_postgres.ipynb",
|
||||
"20_failure_lab_redis.ipynb",
|
||||
"30_failure_lab_clickhouse.ipynb",
|
||||
"40_failure_lab_blob_storage.ipynb",
|
||||
])
|
||||
def test_module4_notebook_syntax(self, notebook):
|
||||
"""Test Module 4 notebook syntax."""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
|
||||
self._validate_notebook_syntax(notebook_path)
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("CI_SKIP_EXECUTION") == "true",
|
||||
reason="Skipping execution in CI (requires infrastructure)"
|
||||
)
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"00_setup_or_resume_environment.ipynb",
|
||||
"01_diagnostics_baseline.ipynb",
|
||||
])
|
||||
def test_module4_notebook_execution(self, notebook):
|
||||
"""
|
||||
Test Module 4 notebook execution (only if infrastructure available).
|
||||
|
||||
Tests setup and baseline notebooks which are read-only validation.
|
||||
Failure labs are not executed here; they run in test_module4_failure_lab_execution, which must only target test environments.
|
||||
"""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
|
||||
# Setup and baseline checks may take longer due to diagnostics collection
|
||||
success, output = execute_notebook(notebook_path, timeout=600)
|
||||
assert success, f"Notebook execution failed:\n{output}"
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("CI_SKIP_EXECUTION") == "true",
|
||||
reason="Skipping execution in CI (requires infrastructure and failure injection)"
|
||||
)
|
||||
@pytest.mark.parametrize("notebook", [
|
||||
"10_failure_lab_postgres.ipynb",
|
||||
"20_failure_lab_redis.ipynb",
|
||||
"30_failure_lab_clickhouse.ipynb",
|
||||
"40_failure_lab_blob_storage.ipynb",
|
||||
])
|
||||
def test_module4_failure_lab_execution(self, notebook):
|
||||
"""
|
||||
Test Module 4 failure lab notebook execution (only if infrastructure available).
|
||||
|
||||
WARNING: These notebooks inject failures by modifying secrets and configurations.
|
||||
They should only be run in test environments, not production.
|
||||
|
||||
These tests validate that failure injection and remediation workflows function
|
||||
correctly. The notebooks include safety mechanisms (commented-out injection code)
|
||||
but should still be used with caution.
|
||||
"""
|
||||
notebook_path = NOTEBOOKS_DIR / "module-4" / notebook
|
||||
# Failure labs may take longer due to failure injection, observation, and remediation
|
||||
success, output = execute_notebook(notebook_path, timeout=900) # 15 minutes
|
||||
assert success, f"Notebook execution failed:\n{output}"
|
||||
|
||||