Implement deterministic resource naming using email-based identifiers
and configure automatic output cell stripping for notebooks.
Changes:
- Add interactive workshop identifier setup in 01_preflight.ipynb
  - Prompts student for email address
  - Hashes email (MD5, 6 chars) for privacy and determinism
  - Creates identifier: -workshop-YYYYMMDD-<hash>
  - Saves to artifacts/workshop_identifier.json
- Update 02_terraform_apply.ipynb to use identifier
  - Loads identifier from artifacts file
  - Passes identifier to Terraform via -var identifier=...
  - Removes incorrect cluster_name variable (not a Terraform input)
  - Computes expected cluster name for validation
- Update 01_preflight.ipynb cluster check
  - Uses identifier to compute cluster name
  - Falls back to CLUSTER_NAME env var if identifier not set
- Configure nbstripout for automatic output cell stripping
  - Add .gitattributes with *.ipynb filter=nbstripout
  - Ensures output cells are never committed
Benefits:
- No additional environment variables required
- Deterministic identifiers (same email = same identifier)
- Idempotent deployments (safe to re-run)
- Unique per student
- Automatic output cell management
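A minimal sketch of the identifier scheme described above, not the notebook's actual code. The `prefix` parameter stands in for whatever precedes `-workshop-` in the real identifier, and the email normalization and the `identifier` JSON key are assumptions made for illustration.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def make_identifier(email: str, day: date, prefix: str = "") -> str:
    """Build '<prefix>-workshop-YYYYMMDD-<hash>' from an email address."""
    # Assumption: normalize so "A@x.com " and "a@x.com" map to the same hash.
    normalized = email.strip().lower()
    # First 6 hex characters of the MD5 digest, as the commit describes.
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()[:6]
    return f"{prefix}-workshop-{day:%Y%m%d}-{digest}"


def save_identifier(identifier: str, path: Path) -> None:
    """Persist the identifier so later notebooks can reload it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: the file holds a single "identifier" key.
    path.write_text(json.dumps({"identifier": identifier}, indent=2))


def load_identifier(path: Path) -> str:
    """Read back the identifier written by save_identifier()."""
    return json.loads(path.read_text())["identifier"]
```

Because the date is part of the name, the same email only maps to the same identifier for the same date; reloading the saved file is what would keep re-runs stable on later days.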
The environment setup notebook is now shared across modules 2, 3, and 4,
reducing duplication and providing a consistent entry point for all
workshop modules.
Changes:
- Moved `00_setup_or_resume_environment.ipynb` from module-4/ to shared/
- Added automatic module detection based on working directory
- Implemented module-specific safety checks:
  - Modules 2 & 3: Read-only validation (safe for production)
  - Module 4: Failure labs require MODULE4_SAFE_ENVIRONMENT flag
- Updated all Module 4 notebook references to point to shared location
- Revised tone to be more supportive and teacher-like (less directive)
- Added module-aware "Next Steps" section with appropriate guidance
The notebook now adapts its behavior and messaging based on which module
is using it, providing appropriate safety warnings and next steps for
each context.
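One way directory-based module detection could work is sketched below. It assumes the working directory path contains a `module-N` component; the function names and the wording of the notes are hypothetical, not taken from the notebook.

```python
import re
from pathlib import Path
from typing import Optional


def detect_module(cwd: Path) -> Optional[int]:
    """Return N if any path component is 'module-N', else None."""
    # Check the deepest components first so nested paths resolve sensibly.
    for part in reversed(cwd.parts):
        match = re.fullmatch(r"module-(\d+)", part)
        if match:
            return int(match.group(1))
    return None


def safety_note(module: Optional[int]) -> str:
    """Pick a module-appropriate message (wording is illustrative only)."""
    if module in (2, 3):
        return "Read-only validation."
    if module == 4:
        return "Failure labs require MODULE4_SAFE_ENVIRONMENT."
    return "Module not detected; no module-specific checks applied."
```

In a notebook this might be called as `safety_note(detect_module(Path.cwd()))`.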
Add mandatory MODULE4_SAFE_ENVIRONMENT flag requirement and comprehensive
environment verification to all Module 4 failure lab notebooks. Failure
labs now prominently display cloud account/region details, require explicit
safety flag, and keep failure injection code commented out by default.
This prevents accidental execution in production environments while still
allowing hands-on troubleshooting practice in test environments.
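The flag requirement could be enforced with a guard like the following sketch. The accepted value (`true`) and the exception name are assumptions; the commit only states that the flag must be set explicitly.

```python
import os
from typing import Mapping


class UnsafeEnvironmentError(RuntimeError):
    """Raised when a failure lab is started without the safety flag."""


def require_safe_environment(env: Mapping[str, str] = os.environ) -> None:
    """Refuse to continue unless MODULE4_SAFE_ENVIRONMENT is set."""
    # Assumption: only the literal value "true" (any case) counts as opt-in.
    value = env.get("MODULE4_SAFE_ENVIRONMENT", "").strip().lower()
    if value != "true":
        raise UnsafeEnvironmentError(
            "Set MODULE4_SAFE_ENVIRONMENT=true only in a disposable test "
            "environment before running failure labs."
        )
```

Calling the guard in the first code cell would stop the notebook before any failure injection cell can run.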
This commit adds comprehensive test coverage for Module 3 (Production
Operations & Scaling) and Module 4 (Troubleshooting & Incident Response)
notebooks, following the same pattern established for Modules 1 and 2.
Test Infrastructure:
- tests/test_notebook_execution.py:
  - Added TestModule3Notebooks class with syntax and execution tests
    for 01_ops_sanity_checks.ipynb
  - Added TestModule4Notebooks class with:
    - Syntax tests for all 6 Module 4 notebooks (00, 01, 10, 20, 30, 40)
    - Execution tests for setup/baseline notebooks (00, 01) - read-only
    - Execution tests for failure labs (10, 20, 30, 40) - with warnings
      about failure injection and secret modification
CI/CD Integration:
- .github/workflows/test-notebooks.yml:
  - Added Module 3 and Module 4 syntax tests to CI pipeline
  - All syntax tests now run automatically on PRs and pushes
Test Features:
- Respects CI_SKIP_EXECUTION environment variable (same as Module 1)
- Uses environment variables for configuration (cloud provider, region, etc.)
- Appropriate timeouts: 600s for ops checks, 900s for failure labs
- Safety warnings for failure lab execution tests
- Syntax tests always run (including in CI)
- Execution tests skip in CI when CI_SKIP_EXECUTION=true
Module 4 Test Structure:
- Syntax tests: All 6 notebooks validated for structure
- Setup/Baseline execution: Read-only validation notebooks (00, 01)
- Failure lab execution: Separate test method with warnings about
secret modification and failure injection (10, 20, 30, 40)
This ensures all workshop notebooks are validated for syntax correctness
and can be execution-tested when infrastructure is available, maintaining
consistency with existing test patterns.
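The skip and timeout behavior listed under Test Features might be expressed roughly as below. This is a sketch, not the code in tests/test_notebook_execution.py; the helper and dictionary names are hypothetical, while the variable name and timeout values come from the notes above.

```python
import os
from typing import Mapping

# Timeouts from the commit message, in seconds.
TIMEOUTS_SECONDS = {"ops_checks": 600, "failure_lab": 900}


def execution_skipped(env: Mapping[str, str] = os.environ) -> bool:
    """True when execution tests should be skipped (CI_SKIP_EXECUTION=true)."""
    return env.get("CI_SKIP_EXECUTION", "").strip().lower() == "true"
```

With pytest, an execution test could then be decorated with `pytest.mark.skipif(execution_skipped(), reason="CI_SKIP_EXECUTION=true")`, while syntax tests stay undecorated so they always run.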
This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop,
focused on teaching operators how to diagnose issues under pressure, collect
evidence, and resolve incidents efficiently.
Documentation:
- docs/modules/module-4.md: Complete module documentation covering incident
reality, common failure modes, diagnostics collection, debugging methodology,
and working with Support
- docs/shared/incident_first_10_minutes.md: Quick reference checklist for
critical initial incident response steps
- docs/shared/support_escalation_template.md: Copy-paste template for
escalating issues to LangChain Support with all necessary information
Notebooks:
- notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment
validation and setup for Module 4 labs
- notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first"
discipline with comprehensive cluster state capture and canonical diagnostics
script execution
- notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL
connectivity failure lab with failure injection, diagnostics, and remediation
- notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue
failure lab focusing on intermittent ingestion and worker backlog issues
- notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace
storage failure lab covering missing traces and insert errors
- notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage
configuration failure lab demonstrating ClickHouse pressure from large payloads
- notebooks/module-4/README.md: Module overview and notebook descriptions
Key Features:
- All failure labs follow consistent structure: baseline → inject → observe →
collect → triage → remediate → recover
- Cloud-agnostic implementation using shared cloud helpers
- Safe-by-default failure injections with backup/restore mechanisms
- Integration with canonical LangChain diagnostics script
- Secrets-safe: no credentials printed, all values redacted
- Comprehensive Support escalation guidance for each service
Each failure lab includes:
- Service role and importance explanation
- Expected symptoms documentation
- Multiple failure injection options (subtle vs. obvious)
- Guided triage steps with automatic checks
- Support escalation requirements
- Lessons learned and common mistakes
This module completes the core operator workshop curriculum, enabling
operators to confidently troubleshoot and respond to incidents in production
LangSmith deployments.
Add complete Module 3 implementation for production operations, scaling
strategies, and day-2 operations. This module enables operators to run
LangSmith reliably under real production load and respond effectively
when things go wrong.
Components added:
- docs/modules/module-3.md
* Production mental model (distributed system, scaling domains)
* Scaling model (what scales vs what doesn't, failure patterns)
* Service sizing baselines (PostgreSQL, Redis, ClickHouse)
* Blob storage requirements (required for production)
* Autoscaling strategy (HPA and KEDA)
* Observability and early warning signals
* Backups, DR, and failure domains
* Sidecars and service mesh (Istio) section
- docs/shared/production_readiness_checklist.md
* Comprehensive production readiness validation checklist
* Infrastructure, data stores, application configuration
* Observability, security, backup/DR sections
* Service mesh configuration (if applicable)
- docs/shared/ops_signals_and_thresholds.md
* Critical and warning signal definitions
* Threshold definitions with measurement methods
* Log patterns to monitor
* Escalation evidence requirements
* Quick reference table
- docs/shared/sidecars_and_service_mesh.md
* Istio sidecar injection guidance
* Namespace and per-workload injection
* Operational implications (logging, health probes, egress)
* Sample labels and annotations
* Troubleshooting guide
- notebooks/module-3/01_ops_sanity_checks.ipynb
* Read-only validation notebook (15 cells)
* Current state snapshot
* Early warning signal checks
* Storage/durability validation
* Sidecar detection and guidance
Key features:
- Cloud-agnostic (uses shared/_cloud_helpers.py)
- Read-only validation (safe to run)
- Deterministic checks with specific thresholds
- Operator-focused (real commands and evidence)
- Opinionated baselines and recommendations
- Time-bounded (executable in about 2 hours)
All notebooks follow existing patterns and integrate with shared helper
modules for consistency.
Add comprehensive Module 1 documentation following the same structure
as Modules 2 and 3. This provides a complete reference guide for the
deployment and baseline validation module.
- docs/modules/module-1.md
* Complete module documentation (606 lines)
* Architecture baseline (opinionated, cloud-agnostic)
* Workshop flow covering all 5 notebooks
* Common pitfalls and solutions
* Service sizing baselines
* Blob storage requirements
* Terraform and Helm best practices
* Validation checklist and troubleshooting
The documentation covers:
- Environment readiness & preflight
- Terraform infrastructure provisioning
- Helm application installation
- Validation & go/no-go checklist
- Teardown & cleanup procedures
This completes the module documentation suite, providing consistent
reference material for all three modules.
Include 99_teardown.ipynb in the test suite to ensure resources are
cleaned up after execution tests. It runs when CI_SKIP_EXECUTION is
not true, with a 30-minute timeout for Terraform destroy operations.
Add comprehensive testing infrastructure to validate notebook syntax and
execution using pytest and GitHub Actions. This enables automated
validation of notebooks on every PR to ensure they remain structurally
valid and executable.
Components added:
- tests/ directory
  * conftest.py: Pytest configuration with test environment setup
  * test_notebook_execution.py: Notebook syntax and execution tests
  * requirements.txt: Test dependencies (pytest, jupyter, nbconvert)
  * README.md: Test documentation and usage guide
  * .gitignore: Test artifacts exclusion
- .github/workflows/ directory
  * test-notebooks.yml: Main GitHub Actions workflow
    - Syntax validation for all notebooks
    - Module 1 preflight testing
    - Module 2 syntax validation
    - Python code linting
  * README.md: Workflow documentation
Test strategy:
- Syntax tests: Always run (validate JSON structure, code cells)
- Execution tests: Optional (requires infrastructure, skipped in CI)
- Modular design: Separate test classes per module
- Parametrized tests: Easy to add new notebooks
GitHub Actions workflow:
- Triggers on PRs, pushes to main, and manual dispatch
- Multiple parallel jobs for faster feedback
- Environment variables configured for test execution
- Artifact uploads for debugging
Fixes:
- Renamed base class helper method to prevent pytest discovery
- Fixed fixture dependency issues in test structure
Usage:
- Local: `CI_SKIP_EXECUTION=true pytest tests/ -v`
- CI: Automatically runs on PRs to validate notebook syntax
- Extensible: Easy to add tests for new modules
This ensures notebooks remain valid and executable as the codebase evolves.
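A syntax test of the kind described under "Test strategy" (valid JSON structure, compilable code cells) could look like the sketch below. It uses only the standard library and assumes the nbformat v4 layout (a top-level `cells` list with `cell_type` and `source`); the function name is hypothetical and the real tests may rely on nbconvert or nbformat instead.

```python
import json
from typing import List


def notebook_problems(raw: str) -> List[str]:
    """Return a list of structural or syntax problems found in a notebook."""
    try:
        nb = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    cells = nb.get("cells") if isinstance(nb, dict) else None
    if not isinstance(cells, list):
        return ["missing 'cells' list"]

    problems = []
    for index, cell in enumerate(cells):
        if not isinstance(cell, dict) or cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not plain Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            compile("\n".join(lines), f"<cell {index}>", "exec")
        except SyntaxError as exc:
            problems.append(f"cell {index}: {exc.msg}")
    return problems
```

A parametrized test would read each `.ipynb` file and assert that `notebook_problems(path.read_text())` is empty. Dropping magic lines is a simplification and can misreport cells whose only indented statement is a magic.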
- Delete AWS-specific preflight notebook (replaced by cloud-agnostic version)
- Update README and teardown notebook to reference `01_preflight.ipynb`
The cloud-agnostic version supports both AWS and Azure deployments.
- Clarified that the workshop is designed to work with Jupyter notebooks
- Added instructions for running a local Jupyter server or using Google Colab
Implement unified cloud abstraction architecture to support AWS and Azure
(and enable future GCP support) across all Module 1 notebooks.
**New Files:**
- notebooks/shared/_cloud_helpers.py: Cloud-agnostic abstraction layer
that routes to provider-specific implementations with auto-detection
- notebooks/shared/_azure_helpers.py: Azure-specific helper functions
(AKS, Azure Storage, identity, etc.)
- notebooks/module-1/01_preflight.ipynb: Cloud-agnostic preflight notebook
(replaces AWS-specific version)
**Updated Files:**
- notebooks/shared/_bootstrap.py: Cloud-aware tool checks and identity
validation (supports AWS and Azure)
- notebooks/shared/_aws_helpers.py: Added configure_kubectl_eks() and
verify_s3_access() for consistency with Azure helpers
- notebooks/module-1/01_aws_preflight.ipynb → 01_preflight.ipynb:
Renamed and refactored to be cloud-agnostic
- notebooks/module-1/02_terraform_apply.ipynb: Uses cloud helpers for
cluster verification and kubectl configuration
- notebooks/module-1/03_helm_install_langsmith.ipynb: Cloud-aware region
variable handling and kubectl configuration
- notebooks/module-1/04_validate_ingress_and_ui.ipynb: Uses cloud helpers
for kubectl configuration
- notebooks/module-1/99_teardown.ipynb: Cloud-agnostic teardown with
dynamic service name resolution
**Key Features:**
- Auto-detection of cloud provider from environment variables
- Explicit control via CLOUD_PROVIDER environment variable
- Backward compatible (defaults to AWS if not specified)
- Single codebase for all cloud providers
- Extensible architecture for adding GCP or other providers
**Cloud Provider Support:**
- AWS: Full support (existing functionality preserved)
- Azure: Full support (new)
- GCP: Architecture ready (helpers can be added)
This unified approach reduces maintenance burden and enables feature
parity across cloud providers while maintaining a single source of truth
for notebook logic.
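The provider selection rules listed under Key Features (explicit CLOUD_PROVIDER, auto-detection, AWS as the default) can be sketched as follows. This is not the code in _cloud_helpers.py: the function name is hypothetical, and the `AZURE_` prefix used as a detection hint is an assumption, since the commit does not say which variables are inspected.

```python
import os
from typing import Mapping

SUPPORTED_PROVIDERS = ("aws", "azure")


def detect_cloud_provider(env: Mapping[str, str] = os.environ) -> str:
    """Resolve the cloud provider from environment variables."""
    # 1. Explicit control always wins.
    explicit = env.get("CLOUD_PROVIDER", "").strip().lower()
    if explicit:
        if explicit not in SUPPORTED_PROVIDERS:
            raise ValueError(f"Unsupported CLOUD_PROVIDER: {explicit!r}")
        return explicit
    # 2. Auto-detection. Assumption: any AZURE_* variable implies Azure.
    if any(name.startswith("AZURE_") for name in env):
        return "azure"
    # 3. Backward-compatible default.
    return "aws"
```

Adding another provider would then mean extending `SUPPORTED_PROVIDERS` and supplying a matching helper module.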