18 Commits

Author SHA1 Message Date
Cory Waddingham 800ce7eaa6 Merge pull request #1 from langchain-ai/cwaddingham/refinements
Refinements to the initial creation
2026-01-06 16:46:31 -08:00
Cory Waddingham 62b8abef6c feat: Add email-based workshop identifier and configure nbstripout
Implement deterministic resource naming using email-based identifiers
and configure automatic output cell stripping for notebooks.

Changes:
- Add interactive workshop identifier setup in 01_preflight.ipynb
  - Prompts student for email address
  - Hashes email (MD5, 6 chars) for privacy and determinism
  - Creates identifier: -workshop-YYYYMMDD-<hash>
  - Saves to artifacts/workshop_identifier.json

- Update 02_terraform_apply.ipynb to use identifier
  - Loads identifier from artifacts file
  - Passes identifier to Terraform via -var identifier=...
  - Removes incorrect cluster_name variable (not a Terraform input)
  - Computes expected cluster name for validation

- Update 01_preflight.ipynb cluster check
  - Uses identifier to compute cluster name
  - Falls back to CLUSTER_NAME env var if identifier not set

- Configure nbstripout for automatic output cell stripping
  - Add .gitattributes with *.ipynb filter=nbstripout
  - Ensures output cells are never committed

Benefits:
- No additional environment variables required
- Deterministic identifiers (same email = same identifier)
- Idempotent deployments (safe to re-run)
- Unique per student
- Automatic output cell management
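
The identifier scheme described in this commit can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `workshop_identifier`, the `prefix` parameter (the message shows nothing before `-workshop-`, so it defaults to empty), and the normalization of the email before hashing are all assumptions.

```python
import hashlib
from datetime import date


def workshop_identifier(email: str, day: date, prefix: str = "") -> str:
    """Build '<prefix>-workshop-YYYYMMDD-<hash>' from an email address."""
    # Assumption: normalize the email so casing and stray spaces
    # do not change the resulting hash.
    normalized = email.strip().lower()
    # MD5 serves only as a short, stable fingerprint here, not as a
    # security measure; the first 6 hex characters are kept.
    short_hash = hashlib.md5(normalized.encode("utf-8")).hexdigest()[:6]
    return f"{prefix}-workshop-{day:%Y%m%d}-{short_hash}"
```

Calling the function twice with the same email and date yields the same string, which is what makes re-running a deployment safe.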
2026-01-06 09:27:55 -08:00
Cory Waddingham 34fa0e6874 Move setup notebook to shared and make it generic for modules 2, 3, and 4
The environment setup notebook is now shared across modules 2, 3, and 4,
reducing duplication and providing a consistent entry point for all
workshop modules.

Changes:
- Moved `00_setup_or_resume_environment.ipynb` from module-4/ to shared/
- Added automatic module detection based on working directory
- Implemented module-specific safety checks:
  - Modules 2 & 3: Read-only validation (safe for production)
  - Module 4: Failure labs require MODULE4_SAFE_ENVIRONMENT flag
- Updated all Module 4 notebook references to point to shared location
- Revised tone to be more supportive and teacher-like (less directive)
- Added module-aware "Next Steps" section with appropriate guidance

The notebook now adapts its behavior and messaging based on which module
is using it, providing appropriate safety warnings and next steps for
each context.
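
As a rough sketch of the module detection and safety gating described above: the helper names `detect_module` and `failure_labs_allowed` are hypothetical, and treating the flag as set only when its value is `true` is an assumption, since the commit message does not say which value is expected.

```python
import re
from pathlib import PurePath
from typing import Mapping, Optional


def detect_module(cwd: PurePath) -> Optional[int]:
    """Return N if some path component is named 'module-N', else None."""
    for part in reversed(cwd.parts):
        match = re.fullmatch(r"module-(\d+)", part)
        if match:
            return int(match.group(1))
    return None


def failure_labs_allowed(module: Optional[int], env: Mapping[str, str]) -> bool:
    """Allow failure injection only in Module 4 and only with the flag set."""
    # Assumption: the flag counts as set when its value is "true".
    flag = env.get("MODULE4_SAFE_ENVIRONMENT", "").strip().lower()
    return module == 4 and flag == "true"
```

In a notebook, `detect_module(PurePath.cwd())` would not work because `PurePath` has no `cwd()`; passing `pathlib.Path.cwd()` does, since `Path` is a `PurePath` subclass.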
2026-01-02 16:43:59 -08:00
Cory Waddingham 951b61c37c Add safety checks to Module 4 failure labs
Add mandatory MODULE4_SAFE_ENVIRONMENT flag requirement and comprehensive
environment verification to all Module 4 failure lab notebooks. Failure
labs now prominently display cloud account/region details, require explicit
safety flag, and keep failure injection code commented out by default.

This prevents accidental execution in production environments while still
allowing hands-on troubleshooting practice in test environments.
2026-01-02 11:27:28 -08:00
Cory Waddingham b0b424eabd Add tests for Module 3 and Module 4 notebooks
This commit adds comprehensive test coverage for Module 3 (Production
Operations & Scaling) and Module 4 (Troubleshooting & Incident Response)
notebooks, following the same pattern established for Modules 1 and 2.

Test Infrastructure:
- tests/test_notebook_execution.py:
  - Added TestModule3Notebooks class with syntax and execution tests
    for 01_ops_sanity_checks.ipynb
  - Added TestModule4Notebooks class with:
    - Syntax tests for all 6 Module 4 notebooks (00, 01, 10, 20, 30, 40)
    - Execution tests for setup/baseline notebooks (00, 01) - read-only
    - Execution tests for failure labs (10, 20, 30, 40) - with warnings
      about failure injection and secret modification

CI/CD Integration:
- .github/workflows/test-notebooks.yml:
  - Added Module 3 and Module 4 syntax tests to CI pipeline
  - All syntax tests now run automatically on PRs and pushes

Test Features:
- Respects CI_SKIP_EXECUTION environment variable (same as Module 1)
- Uses environment variables for configuration (cloud provider, region, etc.)
- Appropriate timeouts: 600s for ops checks, 900s for failure labs
- Safety warnings for failure lab execution tests
- Syntax tests always run (including in CI)
- Execution tests skip in CI when CI_SKIP_EXECUTION=true

Module 4 Test Structure:
- Syntax tests: All 6 notebooks validated for structure
- Setup/Baseline execution: Read-only validation notebooks (00, 01)
- Failure lab execution: Separate test method with warnings about
  secret modification and failure injection (10, 20, 30, 40)

This ensures all workshop notebooks are validated for syntax correctness
and can be execution-tested when infrastructure is available, maintaining
consistency with existing test patterns.
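
The skip behavior and timeouts listed above could be expressed along these lines. This is a sketch under assumptions: the names `TIMEOUTS` and `execution_skipped` are hypothetical and are not taken from `tests/test_notebook_execution.py`.

```python
import os
from typing import Mapping

# Timeouts in seconds, as stated in the commit message.
TIMEOUTS = {"ops_checks": 600, "failure_labs": 900}


def execution_skipped(env: Mapping[str, str] = os.environ) -> bool:
    """True when execution tests should be skipped (CI_SKIP_EXECUTION=true)."""
    return env.get("CI_SKIP_EXECUTION", "").strip().lower() == "true"
```

A helper like this would typically feed a `pytest.mark.skipif(execution_skipped(), reason="...")` decorator on the execution tests, leaving the syntax tests undecorated so they always run.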
2026-01-02 10:55:02 -08:00
Cory Waddingham f9c22ad3ea Add Module 4: Troubleshooting & Incident Response
This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop,
focused on teaching operators how to diagnose issues under pressure, collect
evidence, and resolve incidents efficiently.

Documentation:
- docs/modules/module-4.md: Complete module documentation covering incident
  reality, common failure modes, diagnostics collection, debugging methodology,
  and working with Support
- docs/shared/incident_first_10_minutes.md: Quick reference checklist for
  critical initial incident response steps
- docs/shared/support_escalation_template.md: Copy-paste template for
  escalating issues to LangChain Support with all necessary information

Notebooks:
- notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment
  validation and setup for Module 4 labs
- notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first"
  discipline with comprehensive cluster state capture and canonical diagnostics
  script execution
- notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL
  connectivity failure lab with failure injection, diagnostics, and remediation
- notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue
  failure lab focusing on intermittent ingestion and worker backlog issues
- notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace
  storage failure lab covering missing traces and insert errors
- notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage
  configuration failure lab demonstrating ClickHouse pressure from large payloads
- notebooks/module-4/README.md: Module overview and notebook descriptions

Key Features:
- All failure labs follow consistent structure: baseline → inject → observe →
  collect → triage → remediate → recover
- Cloud-agnostic implementation using shared cloud helpers
- Safe-by-default failure injections with backup/restore mechanisms
- Integration with canonical LangChain diagnostics script
- Secrets-safe: no credentials printed, all values redacted
- Comprehensive Support escalation guidance for each service

Each failure lab includes:
- Service role and importance explanation
- Expected symptoms documentation
- Multiple failure injection options (subtle vs. obvious)
- Guided triage steps with automatic checks
- Support escalation requirements
- Lessons learned and common mistakes

This module completes the core operator workshop curriculum, enabling
operators to confidently troubleshoot and respond to incidents in production
LangSmith deployments.
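
The "secrets-safe" property mentioned under Key Features might look something like the sketch below. The function name `redact`, the marker list, and the `<redacted>` placeholder are illustrative assumptions, not the notebooks' actual implementation.

```python
from typing import Dict, Mapping

# Assumption: a variable is treated as sensitive if its name contains
# one of these substrings.
SENSITIVE_MARKERS = ("PASSWORD", "SECRET", "TOKEN", "KEY")


def redact(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env that is safe to print in notebook output."""
    return {
        name: "<redacted>"
        if any(marker in name.upper() for marker in SENSITIVE_MARKERS)
        else value
        for name, value in env.items()
    }
```

Matching on names rather than values is deliberately coarse: it may hide a harmless value, which is the safer failure mode for output that ends up in a support bundle.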
2026-01-02 10:47:33 -08:00
Cory Waddingham af1e7a840c feat: Add Module 3 - Production Operations & Scaling
Add complete Module 3 implementation for production operations, scaling
strategies, and day-2 operations. This module enables operators to run
LangSmith reliably under real production load and respond effectively
when things go wrong.

Components added:

- docs/modules/module-3.md
  * Production mental model (distributed system, scaling domains)
  * Scaling model (what scales vs what doesn't, failure patterns)
  * Service sizing baselines (PostgreSQL, Redis, ClickHouse)
  * Blob storage requirements (required for production)
  * Autoscaling strategy (HPA and KEDA)
  * Observability and early warning signals
  * Backups, DR, and failure domains
  * Sidecars and service mesh (Istio) section

- docs/shared/production_readiness_checklist.md
  * Comprehensive production readiness validation checklist
  * Infrastructure, data stores, application configuration
  * Observability, security, backup/DR sections
  * Service mesh configuration (if applicable)

- docs/shared/ops_signals_and_thresholds.md
  * Critical and warning signal definitions
  * Threshold definitions with measurement methods
  * Log patterns to monitor
  * Escalation evidence requirements
  * Quick reference table

- docs/shared/sidecars_and_service_mesh.md
  * Istio sidecar injection guidance
  * Namespace and per-workload injection
  * Operational implications (logging, health probes, egress)
  * Sample labels and annotations
  * Troubleshooting guide

- notebooks/module-3/01_ops_sanity_checks.ipynb
  * Read-only validation notebook (15 cells)
  * Current state snapshot
  * Early warning signal checks
  * Storage/durability validation
  * Sidecar detection and guidance

Key features:
- Cloud-agnostic (uses shared/_cloud_helpers.py)
- Read-only validation (safe to run)
- Deterministic checks with specific thresholds
- Operator-focused (real commands and evidence)
- Opinionated baselines and recommendations
- Time-bounded (executable in about 2 hours)

All notebooks follow existing patterns and integrate with shared helper
modules for consistency.
2026-01-02 09:44:15 -08:00
Cory Waddingham 163b810bb3 docs: Add Module 1 documentation
Add comprehensive Module 1 documentation following the same structure
as Modules 2 and 3. This provides a complete reference guide for the
deployment and baseline validation module.

- docs/modules/module-1.md
  * Complete module documentation (606 lines)
  * Architecture baseline (opinionated, cloud-agnostic)
  * Workshop flow covering all 5 notebooks
  * Common pitfalls and solutions
  * Service sizing baselines
  * Blob storage requirements
  * Terraform and Helm best practices
  * Validation checklist and troubleshooting

The documentation covers:
- Environment readiness & preflight
- Terraform infrastructure provisioning
- Helm application installation
- Validation & go/no-go checklist
- Teardown & cleanup procedures

This completes the module documentation suite, providing consistent
reference material for all three modules.
2026-01-02 09:43:58 -08:00
Cory Waddingham a6ee27a060 test: Add teardown notebook execution test
Include 99_teardown.ipynb in test suite to ensure resources are
cleaned up after execution tests. Runs when CI_SKIP_EXECUTION is
not true, with 30min timeout for Terraform destroy operations.
2026-01-02 09:24:42 -08:00
Cory Waddingham ea6131b394 feat: Add CI/CD testing infrastructure for notebooks
Add comprehensive testing infrastructure to validate notebook syntax and
execution using pytest and GitHub Actions. This enables automated
validation of notebooks on every PR to ensure they remain structurally
valid and executable.

Components added:

- tests/ directory
  * conftest.py: Pytest configuration with test environment setup
  * test_notebook_execution.py: Notebook syntax and execution tests
  * requirements.txt: Test dependencies (pytest, jupyter, nbconvert)
  * README.md: Test documentation and usage guide
  * .gitignore: Test artifacts exclusion

- .github/workflows/ directory
  * test-notebooks.yml: Main GitHub Actions workflow
    - Syntax validation for all notebooks
    - Module 1 preflight testing
    - Module 2 syntax validation
    - Python code linting
  * README.md: Workflow documentation

Test strategy:
- Syntax tests: Always run (validate JSON structure, code cells)
- Execution tests: Optional (requires infrastructure, skipped in CI)
- Modular design: Separate test classes per module
- Parametrized tests: Easy to add new notebooks

GitHub Actions workflow:
- Triggers on PRs, pushes to main, and manual dispatch
- Multiple parallel jobs for faster feedback
- Environment variables configured for test execution
- Artifact uploads for debugging

Fixes:
- Renamed base class helper method to prevent pytest discovery
- Fixed fixture dependency issues in test structure

Usage:
- Local: `CI_SKIP_EXECUTION=true pytest tests/ -v`
- CI: Automatically runs on PRs to validate notebook syntax
- Extensible: Easy to add tests for new modules

This ensures notebooks remain valid and executable as the codebase evolves.
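
A syntax test of the kind described under "Test strategy" (validate JSON structure, then the code cells) can be sketched as follows. The function name `notebook_syntax_errors` is hypothetical, and replacing IPython magics and shell escapes with `pass` is an assumption about how such lines might be handled; cell magics such as `%%bash` are not covered by this sketch.

```python
import ast
import json
from typing import List


def notebook_syntax_errors(notebook_json: str) -> List[str]:
    """Return one message per code cell that fails to parse as Python."""
    notebook = json.loads(notebook_json)  # raises ValueError on invalid JSON
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        cleaned = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")):
                # Magics and shell escapes are not plain Python;
                # swap them for `pass` at the same indentation.
                line = line[: len(line) - len(stripped)] + "pass"
            cleaned.append(line)
        try:
            ast.parse("\n".join(cleaned))
        except SyntaxError as exc:
            errors.append(f"cell {index}: {exc.msg}")
    return errors
```

Because nothing is executed, a check like this needs no cluster or cloud credentials, which is why it can run on every PR.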
2026-01-02 09:19:36 -08:00
Cory Waddingham bc26fb1f93 feat: Add Module 2 - Identity & Authentication (SSO validation)
Add complete Module 2 implementation for validating OIDC and SAML SSO
configurations in LangSmith self-hosted deployments. This module assumes
Module 1 is complete and provides comprehensive validation, troubleshooting,
and documentation for authentication setup.

Components added:

- docs/modules/module-2.md
  * Complete module documentation with auth flow diagrams
  * OIDC (preferred) and SAML (fallback) configuration guides
  * Role mapping, security callouts, and common pitfalls
  * Workshop flow covering auth model, configuration, and validation

- notebooks/module-2/01_sso_oidc_validation.ipynb
  * Primary OIDC validation notebook (19 cells)
  * Environment-driven configuration with secret redaction
  * Validates issuer URL, redirect URI exactness, claims mapping
  * Deployment verification, failure drills (opt-in), support bundle collection

- notebooks/module-2/02_sso_saml_validation.ipynb
  * Optional SAML validation notebook (17 cells)
  * Metadata URL/file validation and XML parsing
  * Entity ID, SSO endpoints, certificate extraction
  * Common failure signature detection

- docs/shared/auth_validation_checklist.md
  * Operator-friendly validation checklist
  * Preconditions, configuration inputs, role mapping
  * Login validation for admin and standard users
  * Session management and audit evidence

- docs/shared/auth_troubleshooting.md
  * Comprehensive troubleshooting playbook
  * Triage tree for common failures (login loop, 403, missing attributes, etc.)
  * Evidence gathering commands and support bundle script
  * Quick reference for OIDC/SAML issues

- env-samples/oidc.env.example
  * Extended with OIDC and SAML configuration variables
  * Required and optional variables with documentation
  * Comments and guidance for IdP team coordination

Key features:
- Cloud-agnostic (uses shared/_cloud_helpers.py)
- Secrets-safe (all sensitive values redacted)
- Operator-focused (deterministic validation, no IdP tutorials)
- Time-bounded (executable in about 2 hours)
- Opinionated (OIDC preferred, SAML fallback, local auth discouraged)

All notebooks follow existing patterns from Module 1 and integrate with
shared helper modules for consistency.
2026-01-02 09:12:35 -08:00
Cory Waddingham 2681a531a8 Remove deprecated 01_aws_preflight.ipynb and update references
- Delete AWS-specific preflight notebook (replaced by cloud-agnostic version)
- Update README and teardown notebook to reference `01_preflight.ipynb`

The cloud-agnostic version supports both AWS and Azure deployments.
2025-12-30 09:25:22 -08:00
Cory Waddingham ecd436bcb2 Enhance validation notebook: add license checks, external services, and cloud-agnostic support
- Add license key validation (secret check + log scanning)
- Add external services connectivity tests (PostgreSQL, Redis, blob storage)
- Make ingress/UI checks cloud-agnostic (AWS ALB + Azure Application Gateway)
- Add optional basic functional test for trace submission
- Update checklist with new validation items

Aligns with comprehensive validation guide while maintaining baseline focus.
Supports both AWS and Azure deployments.
2025-12-29 13:39:36 -08:00
Cory Waddingham 4a4c11aaf9 docs: Provide suggestions on running Jupyter notebooks
- Clarified that the workshop is designed to work with Jupyter notebooks
- Added instructions for running a local Jupyter server or using Google Colab
2025-12-29 13:38:02 -08:00
Cory Waddingham dd93069314 docs: Clarify minimal values file guidance in Helm install notebook
Rephrase the important note about starting with minimal values files
to be more direct and actionable.
2025-12-29 13:22:45 -08:00
Cory Waddingham 354e12953d feat: Add cloud abstraction layer for multi-cloud support
Implement unified cloud abstraction architecture to support AWS and Azure
(and enable future GCP support) across all Module 1 notebooks.

**New Files:**
- notebooks/shared/_cloud_helpers.py: Cloud-agnostic abstraction layer
  that routes to provider-specific implementations with auto-detection
- notebooks/shared/_azure_helpers.py: Azure-specific helper functions
  (AKS, Azure Storage, identity, etc.)
- notebooks/module-1/01_preflight.ipynb: Cloud-agnostic preflight notebook
  (replaces AWS-specific version)

**Updated Files:**
- notebooks/shared/_bootstrap.py: Cloud-aware tool checks and identity
  validation (supports AWS and Azure)
- notebooks/shared/_aws_helpers.py: Added configure_kubectl_eks() and
  verify_s3_access() for consistency with Azure helpers
- notebooks/module-1/01_aws_preflight.ipynb → 01_preflight.ipynb:
  Renamed and refactored to be cloud-agnostic
- notebooks/module-1/02_terraform_apply.ipynb: Uses cloud helpers for
  cluster verification and kubectl configuration
- notebooks/module-1/03_helm_install_langsmith.ipynb: Cloud-aware region
  variable handling and kubectl configuration
- notebooks/module-1/04_validate_ingress_and_ui.ipynb: Uses cloud helpers
  for kubectl configuration
- notebooks/module-1/99_teardown.ipynb: Cloud-agnostic teardown with
  dynamic service name resolution

**Key Features:**
- Auto-detection of cloud provider from environment variables
- Explicit control via CLOUD_PROVIDER environment variable
- Backward compatible (defaults to AWS if not specified)
- Single codebase for all cloud providers
- Extensible architecture for adding GCP or other providers

**Cloud Provider Support:**
- AWS: Full support (existing functionality preserved)
- Azure: Full support (new)
- GCP: Architecture ready (helpers can be added)

This unified approach reduces maintenance burden and enables feature
parity across cloud providers while maintaining a single source of truth
for notebook logic.
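
The provider selection rules under "Key Features" (explicit `CLOUD_PROVIDER`, otherwise auto-detection, otherwise AWS) could be sketched like this. The function name and the use of `AZURE_SUBSCRIPTION_ID` as the detection hint are assumptions; the commit message does not say which variables `_cloud_helpers.py` inspects.

```python
import os
from typing import Mapping

SUPPORTED_PROVIDERS = ("aws", "azure")


def detect_cloud_provider(env: Mapping[str, str] = os.environ) -> str:
    """Pick a provider: explicit setting first, then hints, then AWS."""
    explicit = env.get("CLOUD_PROVIDER", "").strip().lower()
    if explicit:
        if explicit not in SUPPORTED_PROVIDERS:
            raise ValueError(f"Unsupported CLOUD_PROVIDER: {explicit!r}")
        return explicit
    # Assumption: an Azure subscription variable hints at an Azure deployment.
    if env.get("AZURE_SUBSCRIPTION_ID"):
        return "azure"
    # Backward-compatible default, as stated in the commit message.
    return "aws"
```

Keeping the default at AWS means notebooks written before the abstraction layer keep working without any new environment variables.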
2025-12-29 13:19:36 -08:00
Cory Waddingham e546d9da92 Added README.md file to give an overview of the workshop and its contents.
2025-12-26 14:37:37 -08:00
Cory Waddingham 3a190f1c19 feat: Add Module 1 notebooks and shared infrastructure for LangSmith self-hosted workshops
This commit introduces the foundational infrastructure for running LangSmith
self-hosted deployment workshops using Jupyter notebooks.

- Add `notebooks/shared/_bootstrap.py`: Centralized bootstrap logic that:
  - Loads environment variables from `.env` or `workshop.env` files
  - Validates required tools (aws, terraform, helm, kubectl, jq)
  - Prints AWS identity and region information
  - Creates artifacts directory for notebook outputs
  - Automatically installs required Python packages (python-dotenv, pyyaml, requests)

- Add `notebooks/shared/_shell.py`: Shell command execution utilities with:
  - Homebrew path resolution for macOS (fixes PATH issues for subprocess calls)
  - AWS_PROFILE handling
  - Streaming and non-streaming command execution

- Add `notebooks/shared/_validation.py`: Validation helpers for environment
  variables and configuration

- Add `notebooks/shared/_aws_helpers.py`: AWS-specific helper functions
- Add `notebooks/shared/_k8s_helpers.py`: Kubernetes helper functions

Create complete set of Module 1 notebooks following the workshop curriculum:

- `01_aws_preflight.ipynb`: Pre-deployment environment validation
  - Tool validation
  - AWS credentials and region checks
  - Cluster capacity expectations
  - Storage prerequisites (EBS CSI, StorageClasses)
  - S3 blob storage verification
  - Terraform and Helm repository path validation

- `02_terraform_apply.ipynb`: Infrastructure provisioning
  - Terraform module discovery and validation
  - Version pinning verification
  - Remote state configuration
  - Terraform initialization
  - Plan creation with environment variable support
  - Infrastructure application (commented by default)
  - Output capture for Helm deployment

- `03_helm_install_langsmith.ipynb`: LangSmith installation
  - Helm chart discovery and validation
  - Chart version pinning
  - Terraform outputs loading
  - Values file management
  - Kubernetes secrets creation
  - Template rendering before install
  - Helm installation (commented by default)

- `04_validate_ingress_and_ui.ipynb`: Deployment validation
  - Pod readiness checks
  - PVC binding verification
  - Ingress provisioning
  - Endpoint reachability
  - UI availability
  - Diagnostic artifact collection

- `99_teardown.ipynb`: Cleanup procedures
  - Helm uninstall
  - Kubernetes resource cleanup
  - Terraform destroy
  - Verification steps

- Add `.gitignore`: Comprehensive ignore patterns for Python, Jupyter,
  environment files, artifacts, and infrastructure tool outputs

- Add `env-samples/workshop.env.example`: Template environment file with:
  - Workshop configuration variables
  - AWS settings
  - Terraform and Helm repository paths
  - PostgreSQL credentials (POSTGRES_USERNAME, POSTGRES_PASSWORD)
  - Helm configuration

- Add additional example env files for AWS, OIDC, and Module 3

- Environment variable expansion: Supports `$VAR` and `${VAR}` syntax in paths
  (e.g., `$TERRAFORM_REPO_DIR/aws/langsmith`)

- Robust path resolution: Handles different Jupyter working directories and
  automatically finds the notebooks/shared directory

- Error handling: Clear error messages with actionable instructions when
  required tools, directories, or environment variables are missing

- Terraform variable passing: Automatically reads POSTGRES_USERNAME and
  POSTGRES_PASSWORD from environment and passes them to Terraform commands

- Clone instructions: Helpful guidance when Terraform or Helm repositories
  are not found

- Artifact management: Centralized artifacts directory for saving outputs,
  plans, and diagnostic information

All notebooks follow best practices:
- Use official repositories (no forking)
- Pin versions for reproducibility
- Plan before applying
- Render templates before installing
- Validate before proceeding

This establishes a solid foundation for the workshop series, ensuring
participants start from a supported baseline configuration.
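
The `$VAR` and `${VAR}` expansion mentioned in this commit, combined with its "clear error messages" goal, might be sketched as below. The function name `expand_repo_path` is hypothetical, and failing on any leftover `$` is an assumption about the desired behavior rather than a description of `_bootstrap.py`.

```python
import os


def expand_repo_path(raw: str) -> str:
    """Expand $VAR and ${VAR} references and fail loudly if any remain."""
    expanded = os.path.expandvars(raw)
    if "$" in expanded:
        # os.path.expandvars leaves unknown variables untouched, so a
        # remaining "$" means a required variable is not set.
        raise ValueError(
            f"Unresolved environment variable in path: {expanded}. "
            "Set it in .env or workshop.env and re-run."
        )
    return expanded
```

With `TERRAFORM_REPO_DIR` set, both `$TERRAFORM_REPO_DIR/aws/langsmith` and `${TERRAFORM_REPO_DIR}/aws/langsmith` resolve to the same path.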
2025-12-26 14:31:37 -08:00