11 Commits

Author SHA1 Message Date
Cory Waddingham b0bbaf28da Merge pull request #5 from langchain-ai/incorporate-infra-feedback
docs: incorporate infra team feedback
2025-12-19 10:53:07 -08:00
Cory Waddingham e086485ba6 docs: incorporate infra team feedback - make production requirements explicit
Update documentation to reflect infrastructure team decisions with clear,
opinionated language and explicit production vs non-production boundaries.

Changes:
- Blob storage (S3) is now REQUIRED for production (previously only recommended)
- EBS CSI Driver requirement added for ClickHouse persistence on EKS
- HPA is required; KEDA positioned as optional P1/advanced autoscaling
- Added rationale sections explaining why these requirements exist
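
As an illustration only, the requirements listed above could be expressed as a small check. This is a minimal sketch, not part of the repository: `DeploymentPlan`, its field names, and `missing_requirements` are hypothetical, and it assumes all three requirements apply only to production plans.

```python
from dataclasses import dataclass


@dataclass
class DeploymentPlan:
    """Hypothetical summary of a planned deployment (field names are illustrative)."""
    production: bool
    blob_storage_s3: bool       # S3 blob storage configured
    ebs_csi_driver: bool        # EBS CSI Driver installed for ClickHouse persistence on EKS
    hpa_enabled: bool           # Horizontal Pod Autoscaler configured
    keda_enabled: bool = False  # optional P1/advanced autoscaling, never required here


def missing_requirements(plan: DeploymentPlan) -> list[str]:
    """Return the production requirements that the plan does not meet.

    Assumption: non-production (dev/eval) plans are not checked at all.
    """
    if not plan.production:
        return []
    missing = []
    if not plan.blob_storage_s3:
        missing.append("blob storage (S3)")
    if not plan.ebs_csi_driver:
        missing.append("EBS CSI Driver")
    if not plan.hpa_enabled:
        missing.append("HPA")
    # KEDA is optional, so its absence is never reported.
    return missing


if __name__ == "__main__":
    plan = DeploymentPlan(production=True, blob_storage_s3=False,
                          ebs_csi_driver=True, hpa_enabled=True)
    print(missing_requirements(plan))
```

KEDA is carried as a field but never checked, mirroring its optional status in the list above.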

README.md:
- Updated architecture diagram to show S3 as required for production
- Added storage requirements (EBS CSI Driver) in Compute section
- Added autoscaling requirements (HPA required, KEDA optional)
- Changed blob storage section to 'Required for Production'
- Added Section 7.5: Production Requirements Summary with rationale

PROD_CHECKLIST.md:
- Changed Section 4 header from 'STRONGLY RECOMMENDED' to 'REQUIRED FOR PRODUCTION'
- Added Non-Production Guidance subsection for dev/eval scenarios
- Added Kubernetes Storage / EBS CSI section under ClickHouse
- Added new Section 5: Autoscaling (HPA REQUIRED; KEDA OPTIONAL)
- Updated final sign-off to reference production requirements

All changes use direct, operational language with minimal duplication.
2025-12-19 10:51:50 -08:00
Cory Waddingham 53b6db5e68 Merge pull request #4 from langchain-ai/adjust-phrasing
Adjust phrasing
2025-12-19 09:47:00 -08:00
Cory Waddingham d7e1fad992 docs: add table of contents to README with navigation guide
Add a comprehensive table of contents to the README that organizes
content by deployment phase and provides clear guidance on:
- What to do at each stage
- When to do it (timing in the deployment process)
- Where to find detailed information (section links and related docs)

The TOC is organized into four sections:
- Getting Started: Initial review and pre-deployment steps
- Architecture Reference: Design and planning decisions
- Operations & Troubleshooting: Post-deployment guidance
- Reference Information: Context and exclusions

This improves discoverability and helps users navigate the documentation
more effectively by clearly indicating the sequence of activities and
where to find specific information.
2025-12-19 09:45:52 -08:00
Cory Waddingham 8dd8b66977 docs: improve clarity and structure across documentation
Improve phrasing, organization, and clarity across PREFLIGHT.md,
README.md, and WALKTHROUGH.md to make the deployment process
easier to follow and understand.

Changes include:
- Clarify ClickHouse deployment options with production vs dev
  distinctions
- Add more context and explanations throughout WALKTHROUGH.md
- Improve section organization and formatting
- Add troubleshooting guidance and prerequisites
- Enhance step-by-step instructions with better structure
- Specify "AWS reference architecture" in README for clarity

These changes maintain technical accuracy while making the
documentation more accessible and easier to follow for engineers
deploying LangSmith Self-Hosted on AWS.
2025-12-18 13:54:43 -08:00
Cory Waddingham 2856cc07d7 Merge pull request #2 from langchain-ai/update-troubleshooting
Update troubleshooting
2025-12-16 11:23:48 -08:00
Cory Waddingham a1a46c0b7d docs: align production requirements and establish clear documentation hierarchy
Update reference architecture documentation to align with current LangSmith
scaling guidance and recent production incidents. Establish PROD_CHECKLIST.md
as the authoritative source for production requirements.

Major changes:

Production Requirements (README.md, PROD_CHECKLIST.md):
- ClickHouse: Require a minimum of 3 replicas for production (baseline), with
  a guardrail of at most 5. Single-node is explicitly not supported for
  production.
- Blob Storage: Elevate from optional to a strong production recommendation
  with explicit workload triggers (10+ tenants, query concurrency >100,
  latency >2s, ingestion delay >60s).
- Add read vs write path mental model section explaining Backend → Redis →
  Queue → ClickHouse (write) vs Backend → ClickHouse (read).
- Clarify that query concurrency and disk I/O are leading indicators, not
  CPU/memory.
- Add operational guidance: CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0 usage
  and failure mode awareness ("traces created but not visible").
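
The thresholds in the list above can be sketched as a simple check. This is a hedged illustration, not code from the repository: `WorkloadSnapshot`, its field names, and both functions are hypothetical. It assumes that any single trigger is enough to recommend blob storage and that "latency" means an observed query latency in seconds, neither of which is stated in this commit message.

```python
from dataclasses import dataclass

# Values taken from the commit message; the constant names are illustrative.
MIN_CLICKHOUSE_REPLICAS = 3
MAX_CLICKHOUSE_REPLICAS = 5


@dataclass
class WorkloadSnapshot:
    """Hypothetical point-in-time view of a deployment's workload."""
    tenants: int
    query_concurrency: int
    latency_s: float          # assumed to be query latency, in seconds
    ingestion_delay_s: float  # delay before ingested traces land, in seconds


def blob_storage_triggers(w: WorkloadSnapshot) -> list[str]:
    """Return the names of the workload triggers that the snapshot exceeds."""
    triggers = []
    if w.tenants >= 10:            # "10+ tenants"
        triggers.append("tenants")
    if w.query_concurrency > 100:  # "query concurrency >100"
        triggers.append("query_concurrency")
    if w.latency_s > 2:            # "latency >2s"
        triggers.append("latency")
    if w.ingestion_delay_s > 60:   # "ingestion delay >60s"
        triggers.append("ingestion_delay")
    return triggers


def replica_count_ok(replicas: int) -> bool:
    """True when the replica count is within the production baseline and guardrail."""
    return MIN_CLICKHOUSE_REPLICAS <= replicas <= MAX_CLICKHOUSE_REPLICAS


if __name__ == "__main__":
    snapshot = WorkloadSnapshot(tenants=12, query_concurrency=40,
                                latency_s=0.8, ingestion_delay_s=75.0)
    print(blob_storage_triggers(snapshot))
    print(replica_count_ok(1))
```

Returning the list of tripped triggers, rather than a single boolean, keeps the reason for the recommendation visible to whoever reads the result.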

Documentation Structure (PROD_CHECKLIST.md):
- Reorganize sections 1-3 into logical component order: Redis, PostgreSQL,
  ClickHouse, Blob Storage (matching system data flow).
- Add brief context descriptions for each component.
- Improve tone to be guidance-oriented rather than directive.

Cross-References:
- Add references from README.md, PREFLIGHT.md, WALKTHROUGH.md, and
  TROUBLESHOOTING.md to PROD_CHECKLIST.md for production requirements.
- Add reference from PROD_CHECKLIST.md to README.md for read/write path
  mental model explanation.
- Eliminate duplicate advice by establishing clear source of truth.

Fixes:
- Resolve contradiction: Align single-node ClickHouse wording ("not supported"
  vs "not recommended") across all files.
- Ensure consistent production requirements across all documentation.

This establishes a clear documentation hierarchy where PROD_CHECKLIST.md serves
as the authoritative source for production requirements, while other documents
reference it rather than duplicating information.
2025-12-16 11:15:01 -08:00
Cory Waddingham d0cc9316d7 Updated README.md to include specific guidance around scaling ClickHouse.
- Included minimum read replica count for a production deployment.
- Added advice regarding blob storage for traces.
2025-12-15 16:59:08 -08:00
Cory Waddingham 292b395f6a Merge pull request #1 from langchain-ai/update-troubleshooting
Replace local diagnostics script with Helm repo script
2025-12-15 13:44:10 -08:00
Cory Waddingham 334e6da773 Replace local diagnostics script with Helm repo script
Update TROUBLESHOOTING.md to reference the official LangSmith
Kubernetes debugging script from the Helm repository instead of
maintaining a local copy.

Changes:
- Update TROUBLESHOOTING.md to use get_k8s_debugging_info.sh from
  langchain-ai/helm repository
- Remove scripts/capture-diagnostics.sh (no longer needed)
- Add download instructions and script details

This ensures consistency across all LangSmith deployments and reduces
maintenance burden by having a single source of truth for diagnostic
capture functionality.
2025-12-15 13:43:39 -08:00
Cory Waddingham ca682685a7 Initial commit: LangSmith Self-Hosted AWS Reference Architecture (P0)
This commit establishes the P0 reference architecture documentation and
supporting tooling for deploying LangSmith Self-Hosted on AWS.

Documentation:
- README.md: Reference architecture overview with embedded request flow diagram
- PREFLIGHT.md: Comprehensive preflight checklist with automated script integration
- WALKTHROUGH.md: Step-by-step deployment walkthrough
- INGRESS.md: ALB-only ingress configuration guide
- TROUBLESHOOTING.md: Evidence-based troubleshooting guide with diagnostic automation

Tooling:
- scripts/preflight.sh: Automated AWS permission and prerequisite validation
- scripts/capture-diagnostics.sh: Automated diagnostic information capture for troubleshooting

All documentation follows a professional, educational, and encouraging tone
designed to support platform/infrastructure/MLOps engineers through the
deployment process.

Key features:
- Opinionated, supportable deployment path
- AWS + EKS + Terraform + Helm stack
- Automated validation and diagnostic tools
- Clear separation of P0 (baseline) vs P1+ (advanced) features
2025-12-15 12:10:50 -08:00