Update documentation to reflect infrastructure team decisions with clear,
opinionated language and explicit production vs non-production boundaries.
Changes:
- Blob storage (S3) is now REQUIRED for production (previously only recommended)
- EBS CSI Driver requirement added for ClickHouse persistence on EKS
- HPA is required; KEDA positioned as optional P1/advanced autoscaling
- Added rationale sections explaining why these requirements exist
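For illustration only, the production boundaries listed above could be checked
as in the sketch below. The DeploymentConfig class and its field names are
hypothetical and are not part of the documentation or of any LangSmith tooling;
only the three requirements themselves come from the changes listed here.

```python
from dataclasses import dataclass


@dataclass
class DeploymentConfig:
    """Hypothetical deployment summary; field names are illustrative only."""
    environment: str                 # "production" or any non-production label
    blob_storage_enabled: bool       # S3 blob storage configured
    ebs_csi_driver_installed: bool   # EBS CSI Driver present on the EKS cluster
    hpa_enabled: bool                # Horizontal Pod Autoscaler configured
    keda_enabled: bool = False       # optional; never treated as a violation


def production_violations(cfg: DeploymentConfig) -> list[str]:
    """Return the production requirements that cfg does not meet."""
    if cfg.environment != "production":
        # The requirements apply to production only.
        return []
    violations = []
    if not cfg.blob_storage_enabled:
        violations.append("blob storage (S3) is required for production")
    if not cfg.ebs_csi_driver_installed:
        violations.append(
            "EBS CSI Driver is required for ClickHouse persistence on EKS"
        )
    if not cfg.hpa_enabled:
        violations.append("HPA is required")
    return violations
```

A non-production config returns an empty list regardless of its settings, which
mirrors the explicit production vs non-production boundary.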
README.md:
- Updated architecture diagram to show S3 as required for production
- Added storage requirements (EBS CSI Driver) in Compute section
- Added autoscaling requirements (HPA required, KEDA optional)
- Changed blob storage section to 'Required for Production'
- Added Section 7.5: Production Requirements Summary with rationale
PROD_CHECKLIST.md:
- Changed Section 4 header from 'STRONGLY RECOMMENDED' to 'REQUIRED FOR PRODUCTION'
- Added Non-Production Guidance subsection for dev/eval scenarios
- Added Kubernetes Storage / EBS CSI section under ClickHouse
- Added new Section 5: Autoscaling (HPA REQUIRED; KEDA OPTIONAL)
- Updated final sign-off to reference production requirements
All changes use direct, operational language with minimal duplication.
Add a comprehensive table of contents to the README that organizes
content by deployment phase and provides clear guidance on:
- What to do at each stage
- When to do it (timing in the deployment process)
- Where to find detailed information (section links and related docs)
The TOC is organized into four sections:
- Getting Started: Initial review and pre-deployment steps
- Architecture Reference: Design and planning decisions
- Operations & Troubleshooting: Post-deployment guidance
- Reference Information: Context and exclusions
This improves discoverability: the TOC shows the sequence of activities
and where to find specific information.
Improve phrasing, organization, and clarity across PREFLIGHT.md,
README.md, and WALKTHROUGH.md to make the deployment process
easier to follow and understand.
Changes include:
- Clarify ClickHouse deployment options with production vs dev
distinctions
- Add more context and explanations throughout WALKTHROUGH.md
- Improve section organization and formatting
- Add troubleshooting guidance and prerequisites
- Enhance step-by-step instructions with better structure
- Specify "AWS reference architecture" in README for clarity
These changes preserve technical accuracy while making the
documentation easier to follow for engineers deploying
LangSmith Self-Hosted on AWS.
Update reference architecture documentation to align with current LangSmith
scaling guidance and recent production incidents. Establish PROD_CHECKLIST.md
as the authoritative source for production requirements.
Major changes:
Production Requirements (README.md, PROD_CHECKLIST.md):
- ClickHouse: Require a minimum of 3 replicas for production (baseline), with
a ≤5 guardrail. Single-node is explicitly not supported for production.
- Blob Storage: Elevate from optional to strong production recommendation
with explicit workload triggers (10+ tenants, query concurrency >100,
latency >2s, ingestion delay >60s).
- Add read vs write path mental model section explaining Backend → Redis →
Queue → ClickHouse (write) vs Backend → ClickHouse (read).
- Clarify that query concurrency and disk I/O are leading indicators, not
CPU/memory.
- Add operational guidance: CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0 usage
and failure mode awareness ("traces created but not visible").
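As a rough sketch of the thresholds above, the replica baseline and the blob
storage triggers could be expressed as follows. The function and parameter
names are hypothetical. Two readings are assumptions rather than statements
from the docs: that the ≤5 guardrail is an upper bound on replica count, and
that any single trigger is enough to recommend blob storage.

```python
def clickhouse_replicas_ok(replicas: int) -> bool:
    """Production baseline: at least 3 replicas.

    Assumption: the "<=5 guardrail" is read as an upper bound on replicas.
    """
    return 3 <= replicas <= 5


def blob_storage_triggers(
    tenants: int,
    query_concurrency: int,
    latency_s: float,
    ingestion_delay_s: float,
) -> list[str]:
    """Return the workload triggers that are currently met.

    Thresholds come from the change description; treating each trigger
    independently (any one is enough) is an assumption of this sketch.
    """
    met = []
    if tenants >= 10:
        met.append("10+ tenants")
    if query_concurrency > 100:
        met.append("query concurrency >100")
    if latency_s > 2:
        met.append("latency >2s")
    if ingestion_delay_s > 60:
        met.append("ingestion delay >60s")
    return met
```

Values exactly at a strict threshold (for example a latency of exactly 2s) do
not count as met, since those thresholds are written as strictly greater than.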
Documentation Structure (PROD_CHECKLIST.md):
- Reorganize sections 1-3 into logical component order: Redis, PostgreSQL,
ClickHouse, Blob Storage (matching system data flow).
- Add brief context descriptions for each component.
- Improve tone to be guidance-oriented rather than directive.
Cross-References:
- Add references from README.md, PREFLIGHT.md, WALKTHROUGH.md, and
TROUBLESHOOTING.md to PROD_CHECKLIST.md for production requirements.
- Add reference from PROD_CHECKLIST.md to README.md for read/write path
mental model explanation.
- Eliminate duplicate advice by establishing clear source of truth.
Fixes:
- Resolve contradiction: Standardize single-node ClickHouse wording on
"not supported" (previously mixed with "not recommended") across all files.
- Ensure consistent production requirements across all documentation.
This establishes a clear documentation hierarchy where PROD_CHECKLIST.md serves
as the authoritative source for production requirements, while other documents
reference it rather than duplicating information.
Update TROUBLESHOOTING.md to reference the official LangSmith
Kubernetes debugging script from the Helm repository instead of
maintaining a local copy.
Changes:
- Update TROUBLESHOOTING.md to use get_k8s_debugging_info.sh from
langchain-ai/helm repository
- Remove scripts/capture-diagnostics.sh (no longer needed)
- Add download instructions and script details
This ensures consistency across all LangSmith deployments and reduces
maintenance burden by having a single source of truth for diagnostic
capture functionality.