langsmith-self-hosted-workshops

mirror of https://github.com/langchain-ai/langsmith-self-hosted-workshops.git synced 2026-07-01 20:44:14 -04:00

Author	SHA1	Message	Date
Cory Waddingham	800ce7eaa6	Merge pull request #1 from langchain-ai/cwaddingham/refinements Refinements to the initial creation	2026-01-06 16:46:31 -08:00
Cory Waddingham	62b8abef6c	feat: Add email-based workshop identifier and configure nbstripout Implement deterministic resource naming using email-based identifiers and configure automatic output cell stripping for notebooks. Changes: - Add interactive workshop identifier setup in 01_preflight.ipynb - Prompts student for email address - Hashes email (MD5, 6 chars) for privacy and determinism - Creates identifier: -workshop-YYYYMMDD-<hash> - Saves to artifacts/workshop_identifier.json - Update 02_terraform_apply.ipynb to use identifier - Loads identifier from artifacts file - Passes identifier to Terraform via -var identifier=... - Removes incorrect cluster_name variable (not a Terraform input) - Computes expected cluster name for validation - Update 01_preflight.ipynb cluster check - Uses identifier to compute cluster name - Falls back to CLUSTER_NAME env var if identifier not set - Configure nbstripout for automatic output cell stripping - Add .gitattributes with *.ipynb filter=nbstripout - Ensures output cells are never committed Benefits: - No additional environment variables required - Deterministic identifiers (same email = same identifier) - Idempotent deployments (safe to re-run) - Unique per student - Automatic output cell management	2026-01-06 09:27:55 -08:00
Cory Waddingham	34fa0e6874	Move setup notebook to shared and make it generic for modules 2, 3, and 4 The environment setup notebook is now shared across modules 2, 3, and 4, reducing duplication and providing a consistent entry point for all workshop modules. Changes: - Moved `00_setup_or_resume_environment.ipynb` from module-4/ to shared/ - Added automatic module detection based on working directory - Implemented module-specific safety checks: - Modules 2 & 3: Read-only validation (safe for production) - Module 4: Failure labs require MODULE4_SAFE_ENVIRONMENT flag - Updated all Module 4 notebook references to point to shared location - Revised tone to be more supportive and teacher-like (less directive) - Added module-aware "Next Steps" section with appropriate guidance The notebook now adapts its behavior and messaging based on which module is using it, providing appropriate safety warnings and next steps for each context.	2026-01-02 16:43:59 -08:00
Cory Waddingham	951b61c37c	Add safety checks to Module 4 failure labs Add mandatory MODULE4_SAFE_ENVIRONMENT flag requirement and comprehensive environment verification to all Module 4 failure lab notebooks. Failure labs now prominently display cloud account/region details, require explicit safety flag, and keep failure injection code commented out by default. This prevents accidental execution in production environments while still allowing hands-on troubleshooting practice in test environments.	2026-01-02 11:27:28 -08:00
Cory Waddingham	b0b424eabd	Add tests for Module 3 and Module 4 notebooks This commit adds comprehensive test coverage for Module 3 (Production Operations & Scaling) and Module 4 (Troubleshooting & Incident Response) notebooks, following the same pattern established for Modules 1 and 2. Test Infrastructure: - tests/test_notebook_execution.py: - Added TestModule3Notebooks class with syntax and execution tests for 01_ops_sanity_checks.ipynb - Added TestModule4Notebooks class with: - Syntax tests for all 6 Module 4 notebooks (00, 01, 10, 20, 30, 40) - Execution tests for setup/baseline notebooks (00, 01) - read-only - Execution tests for failure labs (10, 20, 30, 40) - with warnings about failure injection and secret modification CI/CD Integration: - .github/workflows/test-notebooks.yml: - Added Module 3 and Module 4 syntax tests to CI pipeline - All syntax tests now run automatically on PRs and pushes Test Features: - Respects CI_SKIP_EXECUTION environment variable (same as Module 1) - Uses environment variables for configuration (cloud provider, region, etc.) - Appropriate timeouts: 600s for ops checks, 900s for failure labs - Safety warnings for failure lab execution tests - Syntax tests always run (including in CI) - Execution tests skip in CI when CI_SKIP_EXECUTION=true Module 4 Test Structure: - Syntax tests: All 6 notebooks validated for structure - Setup/Baseline execution: Read-only validation notebooks (00, 01) - Failure lab execution: Separate test method with warnings about secret modification and failure injection (10, 20, 30, 40) This ensures all workshop notebooks are validated for syntax correctness and can be execution-tested when infrastructure is available, maintaining consistency with existing test patterns.	2026-01-02 10:55:02 -08:00
Cory Waddingham	f9c22ad3ea	Add Module 4: Troubleshooting & Incident Response This commit adds Module 4 of the LangSmith Self-Hosted Operator workshop, focused on teaching operators how to diagnose issues under pressure, collect evidence, and resolve incidents efficiently. Documentation: - docs/modules/module-4.md: Complete module documentation covering incident reality, common failure modes, diagnostics collection, debugging methodology, and working with Support - docs/shared/incident_first_10_minutes.md: Quick reference checklist for critical initial incident response steps - docs/shared/support_escalation_template.md: Copy-paste template for escalating issues to LangChain Support with all necessary information Notebooks: - notebooks/module-4/00_setup_or_resume_environment.ipynb: Environment validation and setup for Module 4 labs - notebooks/module-4/01_diagnostics_baseline.ipynb: Teaches "baseline first" discipline with comprehensive cluster state capture and canonical diagnostics script execution - notebooks/module-4/10_failure_lab_postgres.ipynb: Hands-on PostgreSQL connectivity failure lab with failure injection, diagnostics, and remediation - notebooks/module-4/20_failure_lab_redis.ipynb: Hands-on Redis cache/queue failure lab focusing on intermittent ingestion and worker backlog issues - notebooks/module-4/30_failure_lab_clickhouse.ipynb: Hands-on ClickHouse trace storage failure lab covering missing traces and insert errors - notebooks/module-4/40_failure_lab_blob_storage.ipynb: Hands-on blob storage configuration failure lab demonstrating ClickHouse pressure from large payloads - notebooks/module-4/README.md: Module overview and notebook descriptions Key Features: - All failure labs follow consistent structure: baseline → inject → observe → collect → triage → remediate → recover - Cloud-agnostic implementation using shared cloud helpers - Safe-by-default failure injections with backup/restore mechanisms - Integration with canonical LangChain diagnostics script - Secrets-safe: no credentials printed, all values redacted - Comprehensive Support escalation guidance for each service Each failure lab includes: - Service role and importance explanation - Expected symptoms documentation - Multiple failure injection options (subtle vs. obvious) - Guided triage steps with automatic checks - Support escalation requirements - Lessons learned and common mistakes This module completes the core operator workshop curriculum, enabling operators to confidently troubleshoot and respond to incidents in production LangSmith deployments.	2026-01-02 10:47:33 -08:00
Cory Waddingham	af1e7a840c	feat: Add Module 3 - Production Operations & Scaling Add complete Module 3 implementation for production operations, scaling strategies, and day-2 operations. This module enables operators to run LangSmith reliably under real production load and respond effectively when things go wrong. Components added: - docs/modules/module-3.md * Production mental model (distributed system, scaling domains) * Scaling model (what scales vs what doesn't, failure patterns) * Service sizing baselines (PostgreSQL, Redis, ClickHouse) * Blob storage requirements (required for production) * Autoscaling strategy (HPA and KEDA) * Observability and early warning signals * Backups, DR, and failure domains * Sidecars and service mesh (Istio) section - docs/shared/production_readiness_checklist.md * Comprehensive production readiness validation checklist * Infrastructure, data stores, application configuration * Observability, security, backup/DR sections * Service mesh configuration (if applicable) - docs/shared/ops_signals_and_thresholds.md * Critical and warning signal definitions * Threshold definitions with measurement methods * Log patterns to monitor * Escalation evidence requirements * Quick reference table - docs/shared/sidecars_and_service_mesh.md * Istio sidecar injection guidance * Namespace and per-workload injection * Operational implications (logging, health probes, egress) * Sample labels and annotations * Troubleshooting guide - notebooks/module-3/01_ops_sanity_checks.ipynb * Read-only validation notebook (15 cells) * Current state snapshot * Early warning signal checks * Storage/durability validation * Sidecar detection and guidance Key features: - Cloud-agnostic (uses shared/_cloud_helpers.py) - Read-only validation (safe to run) - Deterministic checks with specific thresholds - Operator-focused (real commands and evidence) - Opinionated baselines and recommendations - Time-bounded (~2 hours executable) All notebooks follow existing patterns and integrate with shared helper modules for consistency.	2026-01-02 09:44:15 -08:00
Cory Waddingham	163b810bb3	docs: Add Module 1 documentation Add comprehensive Module 1 documentation following the same structure as Modules 2 and 3. This provides a complete reference guide for the deployment and baseline validation module. - docs/modules/module-1.md * Complete module documentation (606 lines) * Architecture baseline (opinionated, cloud-agnostic) * Workshop flow covering all 5 notebooks * Common pitfalls and solutions * Service sizing baselines * Blob storage requirements * Terraform and Helm best practices * Validation checklist and troubleshooting The documentation covers: - Environment readiness & preflight - Terraform infrastructure provisioning - Helm application installation - Validation & go/no-go checklist - Teardown & cleanup procedures This completes the module documentation suite, providing consistent reference material for all three modules.	2026-01-02 09:43:58 -08:00
Cory Waddingham	a6ee27a060	test: Add teardown notebook execution test Include 99_teardown.ipynb in test suite to ensure resources are cleaned up after execution tests. Runs when CI_SKIP_EXECUTION is not true, with 30min timeout for Terraform destroy operations.	2026-01-02 09:24:42 -08:00
Cory Waddingham	ea6131b394	feat: Add CI/CD testing infrastructure for notebooks Add comprehensive testing infrastructure to validate notebook syntax and execution using pytest and GitHub Actions. This enables automated validation of notebooks on every PR to ensure they remain structurally valid and executable. Components added: - tests/ directory * conftest.py: Pytest configuration with test environment setup * test_notebook_execution.py: Notebook syntax and execution tests * requirements.txt: Test dependencies (pytest, jupyter, nbconvert) * README.md: Test documentation and usage guide * .gitignore: Test artifacts exclusion - .github/workflows/ directory * test-notebooks.yml: Main GitHub Actions workflow - Syntax validation for all notebooks - Module 1 preflight testing - Module 2 syntax validation - Python code linting * README.md: Workflow documentation Test strategy: - Syntax tests: Always run (validate JSON structure, code cells) - Execution tests: Optional (requires infrastructure, skipped in CI) - Modular design: Separate test classes per module - Parametrized tests: Easy to add new notebooks GitHub Actions workflow: - Triggers on PRs, pushes to main, and manual dispatch - Multiple parallel jobs for faster feedback - Environment variables configured for test execution - Artifact uploads for debugging Fixes: - Renamed base class helper method to prevent pytest discovery - Fixed fixture dependency issues in test structure Usage: - Local: `CI_SKIP_EXECUTION=true pytest tests/ -v` - CI: Automatically runs on PRs to validate notebook syntax - Extensible: Easy to add tests for new modules This ensures notebooks remain valid and executable as the codebase evolves.	2026-01-02 09:19:36 -08:00
Cory Waddingham	bc26fb1f93	feat: Add Module 2 - Identity & Authentication (SSO validation) Add complete Module 2 implementation for validating OIDC and SAML SSO configurations in LangSmith self-hosted deployments. This module assumes Module 1 is complete and provides comprehensive validation, troubleshooting, and documentation for authentication setup. Components added: - docs/modules/module-2.md * Complete module documentation with auth flow diagrams * OIDC (preferred) and SAML (fallback) configuration guides * Role mapping, security callouts, and common pitfalls * Workshop flow covering auth model, configuration, and validation - notebooks/module-2/01_sso_oidc_validation.ipynb * Primary OIDC validation notebook (19 cells) * Environment-driven configuration with secret redaction * Validates issuer URL, redirect URI exactness, claims mapping * Deployment verification, failure drills (opt-in), support bundle collection - notebooks/module-2/02_sso_saml_validation.ipynb * Optional SAML validation notebook (17 cells) * Metadata URL/file validation and XML parsing * Entity ID, SSO endpoints, certificate extraction * Common failure signature detection - docs/shared/auth_validation_checklist.md * Operator-friendly validation checklist * Preconditions, configuration inputs, role mapping * Login validation for admin and standard users * Session management and audit evidence - docs/shared/auth_troubleshooting.md * Comprehensive troubleshooting playbook * Triage tree for common failures (login loop, 403, missing attributes, etc.) * Evidence gathering commands and support bundle script * Quick reference for OIDC/SAML issues - env-samples/oidc.env.example * Extended with OIDC and SAML configuration variables * Required and optional variables with documentation * Comments and guidance for IdP team coordination Key features: - Cloud-agnostic (uses shared/_cloud_helpers.py) - Secrets-safe (all sensitive values redacted) - Operator-focused (deterministic validation, no IdP tutorials) - Time-bounded (~2 hours executable) - Opinionated (OIDC preferred, SAML fallback, local auth discouraged) All notebooks follow existing patterns from Module 1 and integrate with shared helper modules for consistency.	2026-01-02 09:12:35 -08:00
Cory Waddingham	2681a531a8	Remove deprecated 01_aws_preflight.ipynb and update references - Delete AWS-specific preflight notebook (replaced by cloud-agnostic version) - Update README and teardown notebook to reference `01_preflight.ipynb` The cloud-agnostic version supports both AWS and Azure deployments.	2025-12-30 09:25:22 -08:00
Cory Waddingham	ecd436bcb2	Enhance validation notebook: add license checks, external services, and cloud-agnostic support - Add license key validation (secret check + log scanning) - Add external services connectivity tests (PostgreSQL, Redis, blob storage) - Make ingress/UI checks cloud-agnostic (AWS ALB + Azure Application Gateway) - Add optional basic functional test for trace submission - Update checklist with new validation items Aligns with comprehensive validation guide while maintaining baseline focus. Supports both AWS and Azure deployments.	2025-12-29 13:39:36 -08:00
Cory Waddingham	4a4c11aaf9	docs: Provide suggestions on running Jupyter notebooks - Clarified that the workshop is designed to work with Jupyter notebooks - Provided options for instructions on running a local Jupyter server or using Google Colab	2025-12-29 13:38:02 -08:00
Cory Waddingham	dd93069314	docs: Clarify minimal values file guidance in Helm install notebook Rephrase the important note about starting with minimal values files to be more direct and actionable.	2025-12-29 13:22:45 -08:00
Cory Waddingham	354e12953d	feat: Add cloud abstraction layer for multi-cloud support Implement unified cloud abstraction architecture to support AWS and Azure (and enable future GCP support) across all Module 1 notebooks. New Files: - notebooks/shared/_cloud_helpers.py: Cloud-agnostic abstraction layer that routes to provider-specific implementations with auto-detection - notebooks/shared/_azure_helpers.py: Azure-specific helper functions (AKS, Azure Storage, identity, etc.) - notebooks/module-1/01_preflight.ipynb: Cloud-agnostic preflight notebook (replaces AWS-specific version) Updated Files: - notebooks/shared/_bootstrap.py: Cloud-aware tool checks and identity validation (supports AWS and Azure) - notebooks/shared/_aws_helpers.py: Added configure_kubectl_eks() and verify_s3_access() for consistency with Azure helpers - notebooks/module-1/01_aws_preflight.ipynb → 01_preflight.ipynb: Renamed and refactored to be cloud-agnostic - notebooks/module-1/02_terraform_apply.ipynb: Uses cloud helpers for cluster verification and kubectl configuration - notebooks/module-1/03_helm_install_langsmith.ipynb: Cloud-aware region variable handling and kubectl configuration - notebooks/module-1/04_validate_ingress_and_ui.ipynb: Uses cloud helpers for kubectl configuration - notebooks/module-1/99_teardown.ipynb: Cloud-agnostic teardown with dynamic service name resolution Key Features: - Auto-detection of cloud provider from environment variables - Explicit control via CLOUD_PROVIDER environment variable - Backward compatible (defaults to AWS if not specified) - Single codebase for all cloud providers - Extensible architecture for adding GCP or other providers Cloud Provider Support: - AWS: Full support (existing functionality preserved) - Azure: Full support (new) - GCP: Architecture ready (helpers can be added) This unified approach reduces maintenance burden and enables feature parity across cloud providers while maintaining a single source of truth for notebook logic.	2025-12-29 13:19:36 -08:00
Cory Waddingham	e546d9da92	Added README.md file to give overview of the workshop and its contents.	2025-12-26 14:37:37 -08:00
Cory Waddingham	3a190f1c19	feat: Add Module 1 notebooks and shared infrastructure for LangSmith self-hosted workshops This commit introduces the foundational infrastructure for running LangSmith self-hosted deployment workshops using Jupyter notebooks. - Add `notebooks/shared/_bootstrap.py`: Centralized bootstrap logic that: - Loads environment variables from `.env` or `workshop.env` files - Validates required tools (aws, terraform, helm, kubectl, jq) - Prints AWS identity and region information - Creates artifacts directory for notebook outputs - Automatically installs required Python packages (python-dotenv, pyyaml, requests) - Add `notebooks/shared/_shell.py`: Shell command execution utilities with: - Homebrew path resolution for macOS (fixes PATH issues for subprocess calls) - AWS_PROFILE handling - Streaming and non-streaming command execution - Add `notebooks/shared/_validation.py`: Validation helpers for environment variables and configuration - Add `notebooks/shared/_aws_helpers.py`: AWS-specific helper functions - Add `notebooks/shared/_k8s_helpers.py`: Kubernetes helper functions Create complete set of Module 1 notebooks following the workshop curriculum: - `01_aws_preflight.ipynb`: Pre-deployment environment validation - Tool validation - AWS credentials and region checks - Cluster capacity expectations - Storage prerequisites (EBS CSI, StorageClasses) - S3 blob storage verification - Terraform and Helm repository path validation - `02_terraform_apply.ipynb`: Infrastructure provisioning - Terraform module discovery and validation - Version pinning verification - Remote state configuration - Terraform initialization - Plan creation with environment variable support - Infrastructure application (commented by default) - Output capture for Helm deployment - `03_helm_install_langsmith.ipynb`: LangSmith installation - Helm chart discovery and validation - Chart version pinning - Terraform outputs loading - Values file management - Kubernetes secrets creation - Template rendering before install - Helm installation (commented by default) - `04_validate_ingress_and_ui.ipynb`: Deployment validation - Pod readiness checks - PVC binding verification - Ingress provisioning - Endpoint reachability - UI availability - Diagnostic artifact collection - `99_teardown.ipynb`: Cleanup procedures - Helm uninstall - Kubernetes resource cleanup - Terraform destroy - Verification steps - Add `.gitignore`: Comprehensive ignore patterns for Python, Jupyter, environment files, artifacts, and infrastructure tool outputs - Add `env-samples/workshop.env.example`: Template environment file with: - Workshop configuration variables - AWS settings - Terraform and Helm repository paths - PostgreSQL credentials (POSTGRES_USERNAME, POSTGRES_PASSWORD) - Helm configuration - Add additional example env files for AWS, OIDC, and Module 3 - Environment variable expansion: Supports `$VAR` and `${VAR}` syntax in paths (e.g., `$TERRAFORM_REPO_DIR/aws/langsmith`) - Robust path resolution: Handles different Jupyter working directories and automatically finds the notebooks/shared directory - Error handling: Clear error messages with actionable instructions when required tools, directories, or environment variables are missing - Terraform variable passing: Automatically reads POSTGRES_USERNAME and POSTGRES_PASSWORD from environment and passes them to Terraform commands - Clone instructions: Helpful guidance when Terraform or Helm repositories are not found - Artifact management: Centralized artifacts directory for saving outputs, plans, and diagnostic information All notebooks follow best practices: - Use official repositories (no forking) - Pin versions for reproducibility - Plan before applying - Render templates before installing - Validate before proceeding This establishes a solid foundation for the workshop series, ensuring participants start from a supported baseline configuration.	2025-12-26 14:31:37 -08:00
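
Commit 62b8abef6c describes an email-based workshop identifier: the email is hashed with MD5, truncated to 6 characters, combined with a date, and saved to artifacts/workshop_identifier.json. The following is a minimal sketch of that idea, not the notebook's actual code. The function names, the `prefix` parameter (the log shows the identifier as `-workshop-YYYYMMDD-<hash>` with whatever precedes it left out), the email normalization, and the JSON key `identifier` are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def make_identifier(email: str, day: date, prefix: str = "") -> str:
    """Build '<prefix>-workshop-YYYYMMDD-<hash>' from an email address.

    `prefix` is hypothetical: the commit message does not show what precedes
    '-workshop-'. Lower-casing and stripping the email is also an assumption.
    """
    normalized = email.strip().lower()
    # First 6 hex characters of the MD5 digest, as the commit message describes.
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()[:6]
    return f"{prefix}-workshop-{day:%Y%m%d}-{digest}"


def save_identifier(identifier: str, artifacts_dir: Path) -> Path:
    """Write the identifier to <artifacts_dir>/workshop_identifier.json."""
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    path = artifacts_dir / "workshop_identifier.json"
    # The key name "identifier" is an assumed schema, not taken from the repo.
    path.write_text(json.dumps({"identifier": identifier}, indent=2))
    return path
```

Because the hash depends only on the email, re-running the setup with the same address and date yields the same identifier, which is what makes repeated deployments safe to re-run.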

18 Commits
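
Commits ea6131b394 and b0b424eabd describe a two-tier test strategy: syntax tests that always run (validating JSON structure and code cells) and execution tests that are skipped when CI_SKIP_EXECUTION=true. The sketch below shows one way such checks could look in plain Python. It is not the contents of tests/test_notebook_execution.py; the function names, the case-insensitive comparison, and the handling of IPython magics are assumptions.

```python
import ast
import json
import os


def execution_skipped(env=os.environ) -> bool:
    """True when CI_SKIP_EXECUTION is 'true' (case-insensitive is an assumption)."""
    return env.get("CI_SKIP_EXECUTION", "").strip().lower() == "true"


def _neutralize(line: str) -> str:
    """Replace IPython magics and shell escapes with 'pass', keeping indentation."""
    stripped = line.lstrip()
    if stripped.startswith(("%", "!")):
        return line[: len(line) - len(stripped)] + "pass"
    return line


def check_notebook_syntax(notebook_json: str) -> int:
    """Parse a notebook and syntax-check each code cell.

    Returns the number of code cells checked. Raises json.JSONDecodeError,
    KeyError, or SyntaxError when the notebook is structurally invalid.
    """
    nb = json.loads(notebook_json)
    checked = 0
    for cell in nb["cells"]:
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # Notebook sources are stored as a list of lines or a single string.
        text = "".join(source) if isinstance(source, list) else source
        cleaned = "\n".join(_neutralize(ln) for ln in text.splitlines())
        ast.parse(cleaned)  # syntax check only, nothing is executed
        checked += 1
    return checked
```

A test runner could call `check_notebook_syntax` unconditionally and guard anything that touches real infrastructure behind `execution_skipped()`, mirroring the local usage shown in the log (`CI_SKIP_EXECUTION=true pytest tests/ -v`).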