Cory Waddingham a1a46c0b7d docs: align production requirements and establish clear documentation hierarchy
Update reference architecture documentation to align with current LangSmith
scaling guidance and recent production incidents. Establish PROD_CHECKLIST.md
as the authoritative source for production requirements.

Major changes:

Production Requirements (README.md, PROD_CHECKLIST.md):
- ClickHouse: Require 3 replicas minimum for production (baseline), with ≤5
  guardrail. Single-node explicitly not supported for production.
- Blob Storage: Elevate from optional to strong production recommendation
  with explicit workload triggers (10+ tenants, query concurrency >100,
  latency >2s, ingestion delay >60s).
- Add read vs write path mental model section explaining Backend → Redis →
  Queue → ClickHouse (write) vs Backend → ClickHouse (read).
- Clarify that query concurrency and disk I/O are leading indicators, not
  CPU/memory.
- Add operational guidance: CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0 usage
  and failure mode awareness ("traces created but not visible").

Documentation Structure (PROD_CHECKLIST.md):
- Reorganize sections 1-3 into logical component order: Redis, PostgreSQL,
  ClickHouse, Blob Storage (matching system data flow).
- Add brief context descriptions for each component.
- Improve tone to be guidance-oriented rather than directive.

Cross-References:
- Add references from README.md, PREFLIGHT.md, WALKTHROUGH.md, and
  TROUBLESHOOTING.md to PROD_CHECKLIST.md for production requirements.
- Add reference from PROD_CHECKLIST.md to README.md for read/write path
  mental model explanation.
- Eliminate duplicate advice by establishing clear source of truth.

Fixes:
- Resolve contradiction: Align single-node ClickHouse wording ("not supported"
  vs "not recommended") across all files.
- Ensure consistent production requirements across all documentation.

This establishes a clear documentation hierarchy where PROD_CHECKLIST.md serves
as the authoritative source for production requirements, while other documents
reference it rather than duplicating information.
2025-12-16 11:15:01 -08:00

LangSmith Self-Hosted on AWS — Reference Architecture (P0)

Status: P0 Enablement Baseline
Audience: Platform / Infra / MLOps Engineers
Goal: Provide a single, opinionated, supportable path to deploying and operating LangSmith Self-Hosted (SH) on AWS with minimal support intervention.

This document defines the reference architecture LangChain Enablement stands behind.
Alternative approaches may work, but are out of scope for P0 enablement and future certification.


1. What This Architecture Is (and Is Not)

This is:

  • A production-capable baseline deployment
  • Opinionated by design
  • Built on AWS + EKS + Terraform + Helm
  • Designed to surface real operator responsibilities early
  • The foundation for future labs and certification

This is not:

  • A performance benchmark
  • A multi-region or HA architecture
  • A guide for custom service meshes or bespoke gateways
  • A promise of security guarantees

2. Deployment Mode

P0 Default: Full Self-Hosted

  • Control plane and data plane both run in the customer AWS account
  • Customer is responsible for:
    • Network exposure
    • Authentication
    • Data persistence
    • Upgrades and backups

Hybrid (SaaS control plane + SH data plane) is valid but out of scope for P0 enablement.


3. High-Level Architecture

Request flow (top to bottom):

Request Flow Diagram

Users / CI / SDKs
→ Route53
→ Application Load Balancer (ALB) + WAF
→ Kubernetes Ingress (EKS)
→ LangSmith application services

Persistent dependencies:

  • PostgreSQL — metadata (projects, orgs, users)
  • Redis — cache and job queues
  • ClickHouse — traces and analytics
  • S3 — large artifacts and payload storage

Flow Summary

  • Traffic enters via Route53 → ALB, fronted by AWS WAF or equivalent rate limiting (see Sections 4 and 10).
  • ALB forwards to Kubernetes ingress inside EKS.
  • LangSmith application services run in EKS.
  • Persistent state is handled by:
    • PostgreSQL (metadata)
    • Redis (cache / queues)
    • ClickHouse (traces & analytics)
    • S3 (large artifacts and payloads)

This diagram represents the minimum supported topology for the P0 reference architecture.


4. Network & Ingress

VPC

  • Single VPC
  • Public subnets: ALB only
  • Private subnets:
    • EKS worker nodes
    • Data services (RDS, Redis, ClickHouse if in-cluster)

Ingress

  • Application Load Balancer (ALB)
  • AWS WAF strongly recommended
  • TLS termination at ALB (end-to-end TLS recommended)
  • Optionally:
    • Internal ALB + VPN / PrivateLink for non-public access

Egress

  • Outbound HTTPS access to required LangChain endpoints (if applicable)
  • Restrict all other egress per organizational policy

5. Compute: Kubernetes (EKS)

Cluster

  • Amazon EKS
  • Managed node groups
  • Cluster Autoscaler enabled
  • Metrics Server enabled

Baseline Capacity

  • Minimum cluster capacity:
    • 16 vCPU / 64 GB RAM available
  • This includes LangSmith services + system overhead

For detailed production capacity requirements, see PROD_CHECKLIST.md.
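
As a rough illustration of the baseline above, the sketch below sums allocatable capacity across worker nodes and compares it with the 16 vCPU / 64 GB RAM minimum. The `Node` type, the node names, and the per-node figures are hypothetical; in practice the numbers would come from whatever inventory or cluster query you already use.

```python
from __future__ import annotations

from dataclasses import dataclass

# Baseline from this section: 16 vCPU / 64 GB RAM available cluster-wide.
MIN_VCPU = 16
MIN_RAM_GB = 64


@dataclass
class Node:
    """Hypothetical record of one worker node's allocatable capacity."""
    name: str
    vcpu: float
    ram_gb: float


def meets_baseline(nodes: list[Node]) -> tuple[bool, float, float]:
    """Return (ok, total_vcpu, total_ram_gb) for the given nodes."""
    total_vcpu = sum(n.vcpu for n in nodes)
    total_ram = sum(n.ram_gb for n in nodes)
    ok = total_vcpu >= MIN_VCPU and total_ram >= MIN_RAM_GB
    return ok, total_vcpu, total_ram


if __name__ == "__main__":
    # Two nodes whose allocatable capacity falls just short of the baseline
    # once system overhead is subtracted (illustrative figures only).
    nodes = [Node("node-a", 7.5, 30), Node("node-b", 7.5, 30)]
    print(meets_baseline(nodes))
```

Using allocatable rather than nominal instance capacity is a deliberate choice here, since the baseline is stated as capacity that must be available after system overhead.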


6. Data Stores

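LangSmith SH relies on three core data stores: PostgreSQL, Redis, and ClickHouse. Blob storage (S3) complements ClickHouse by holding large artifacts and payloads.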

PostgreSQL (Metadata)

  • AWS RDS PostgreSQL or Aurora PostgreSQL
  • PostgreSQL 14+
  • Single AZ for P0 (HA is P1)
  • Automated backups enabled

Redis (Cache / Queues)

  • AWS ElastiCache (Redis OSS)
  • Single node acceptable for P0
  • Persistence optional but recommended

ClickHouse (Traces & Analytics)

ClickHouse is memory-, I/O-, and concurrency-intensive. Proper sizing and topology are mandatory for production stability.

For detailed production requirements, see PROD_CHECKLIST.md.

Production Requirements (P0 Baseline)

Topology

  • Production requires a replicated ClickHouse cluster
  • Baseline: 3 ClickHouse replicas (minimum for production)
  • Single-node ClickHouse is not supported for production workloads
  • Read and write concurrency must be able to scale independently
  • Guardrail: Clusters typically should remain ≤5 replicas

Compute

  • 8 vCPU
  • 32 GB RAM

Storage

  • SSD-backed persistent storage
  • ~7000 IOPS
  • ~1000 MiB/s throughput

⚠️ Query concurrency and disk I/O are the leading indicators of ClickHouse saturation; CPU and memory are not. Monitor both to catch bottlenecks before they affect system health.
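
The production requirements above can be expressed as a small pre-deployment check. This is a minimal sketch, not an official validator: the `ClickHouseSpec` type and its field names are hypothetical, and treating the approximate storage figures (~7000 IOPS, ~1000 MiB/s) and the compute figures as hard per-node floors is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClickHouseSpec:
    """Hypothetical description of a planned ClickHouse deployment."""
    replicas: int
    vcpu: int
    ram_gb: int
    iops: int
    throughput_mib_s: int


def check_production_baseline(spec: ClickHouseSpec) -> list[str]:
    """Return a list of findings; an empty list means the baseline is met."""
    findings = []
    if spec.replicas < 3:
        findings.append("fewer than 3 replicas: below the production minimum")
    elif spec.replicas > 5:
        findings.append("more than 5 replicas: exceeds the typical guardrail")
    if spec.vcpu < 8:
        findings.append("less than 8 vCPU")
    if spec.ram_gb < 32:
        findings.append("less than 32 GB RAM")
    # Assumption: the approximate storage figures are treated as minimums.
    if spec.iops < 7000:
        findings.append("storage below ~7000 IOPS")
    if spec.throughput_mib_s < 1000:
        findings.append("storage below ~1000 MiB/s throughput")
    return findings


if __name__ == "__main__":
    dev_like = ClickHouseSpec(replicas=1, vcpu=4, ram_gb=16,
                              iops=3000, throughput_mib_s=125)
    for finding in check_production_baseline(dev_like):
        print("-", finding)
```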


Suitable for Dev-Only

  • 4 vCPU / 16 GB RAM
  • Single ClickHouse node
  • Non-production proof-of-concept only

Blob Storage (Strong Production Recommendation)

Blob storage is strongly recommended for production and should be enabled once deployments exceed ~10 active tenants OR hit any of the workload-based triggers below.

For complete blob storage requirements and workload triggers, see PROD_CHECKLIST.md.

Workload Triggers

Enable blob storage if any of the following are observed or expected:

  • Peak concurrent ClickHouse queries consistently > 100 (or spikes > 200)
  • P95 query latency > 2 seconds for trace or run retrieval queries
  • P95 ingestion delay (received_at → inserted_at) > 60 seconds
  • Any tenant producing large or verbose traces (e.g., large tool outputs, attachments, or deeply nested spans)

⚠️ Without blob storage, large payloads increase part counts, merge pressure, and read amplification, leading to concurrency collapse and delayed trace visibility.

⚠️ Blob storage lifecycle policies must align with ClickHouse TTL settings to prevent data inconsistencies and ensure proper cleanup.
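
The tenant threshold and workload triggers above amount to an "any of" rule, which the following sketch makes explicit. The `Workload` type and its field names are hypothetical; the thresholds are the ones listed in this section, and how each metric is collected is left to your monitoring stack.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of observed or expected workload metrics."""
    active_tenants: int
    sustained_concurrent_queries: int  # consistent peak concurrency
    spike_concurrent_queries: int      # short-lived peak concurrency
    p95_query_latency_s: float         # trace or run retrieval queries
    p95_ingestion_delay_s: float       # received_at -> inserted_at
    has_large_traces: bool             # large tool outputs, attachments, deep nesting


def blob_storage_triggers(w: Workload) -> list[str]:
    """Return the triggers that fired; any non-empty result means enable blob storage."""
    fired = []
    if w.active_tenants > 10:
        fired.append("more than ~10 active tenants")
    if w.sustained_concurrent_queries > 100 or w.spike_concurrent_queries > 200:
        fired.append("peak concurrent queries > 100 (or spikes > 200)")
    if w.p95_query_latency_s > 2:
        fired.append("P95 query latency > 2 seconds")
    if w.p95_ingestion_delay_s > 60:
        fired.append("P95 ingestion delay > 60 seconds")
    if w.has_large_traces:
        fired.append("tenant producing large or verbose traces")
    return fired


if __name__ == "__main__":
    w = Workload(active_tenants=4, sustained_concurrent_queries=40,
                 spike_concurrent_queries=220, p95_query_latency_s=0.8,
                 p95_ingestion_delay_s=12.0, has_large_traces=False)
    print(blob_storage_triggers(w))
```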


Scaling Guidance (P1)

ClickHouse Scaling

  • Start from the 8 vCPU / 32 GB RAM baseline
  • Query concurrency and disk I/O are leading indicators, not CPU/memory
  • Scale ClickHouse to 16 vCPU / 64 GB RAM and/or add replicas when:
    • Trace ingestion volume grows
    • Concurrent query count increases
    • Query latency trends upward
    • Insert lag begins to drift

⚠️ Scaling ClickHouse without blob storage has diminishing returns at higher write and concurrency levels.

Redis Sizing

  • For high-write workloads, ensure Redis has sufficient memory and network bandwidth
  • Monitor Redis memory usage, connection counts, and queue depths
  • External Redis (AWS ElastiCache) is required for production workloads with significant write volume
  • Single-node Redis is acceptable for P0 baseline, but consider replication for production workloads

6.5. Read vs Write Path Mental Model

Understanding the separation between read and write paths is critical for effective scaling and troubleshooting.

Write Path

Backend → Redis → Queue → ClickHouse

  • Traces are received by the backend service
  • Data flows through Redis (caching/queuing)
  • Queue workers process and insert into ClickHouse
  • This path handles ingestion and write concurrency

Read Path

Backend → ClickHouse

  • User queries and trace retrieval go directly from backend to ClickHouse
  • This path handles query concurrency and read performance

⚠️ Scaling the wrong layer can worsen an outage. For example, adding queue workers without scaling ClickHouse increases write pressure on an already saturated database. Before scaling, always identify whether the bottleneck is in the write path (queue/workers) or the read path (ClickHouse query capacity).
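
The ordering of that decision can be sketched as a tiny function. This is an illustrative model only: the two boolean inputs are hypothetical signals you would derive from your own metrics (for instance, ClickHouse query concurrency and disk I/O for the first, queue depth trend for the second), and the `Action` names are not part of any LangSmith API.

```python
from enum import Enum


class Action(Enum):
    SCALE_CLICKHOUSE = "scale or relieve ClickHouse before touching workers"
    ADD_QUEUE_WORKERS = "add queue workers"
    NO_ACTION = "no scaling indicated"


def recommend(clickhouse_saturated: bool, queue_backlog_growing: bool) -> Action:
    """Pick the layer to scale, checking ClickHouse first.

    ClickHouse sits at the end of the write path and is the whole read path,
    so if it is saturated, adding workers only pushes more writes at it.
    """
    if clickhouse_saturated:
        return Action.SCALE_CLICKHOUSE
    if queue_backlog_growing:
        return Action.ADD_QUEUE_WORKERS
    return Action.NO_ACTION


if __name__ == "__main__":
    # Backlog is growing, but ClickHouse is the real bottleneck.
    print(recommend(clickhouse_saturated=True, queue_backlog_growing=True).value)
```

Checking ClickHouse saturation first is the design choice that matters: a growing queue backlog alone does not tell you whether workers or the database are the limiting factor.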


7. Object Storage

  • Store large trace artifacts and payloads
  • Reduces DB size and blast radius
  • Improves security posture for sensitive inputs/outputs

Access Pattern

  • Use IAM Roles for Service Accounts (IRSA) where possible
  • No static credentials in Helm values

8. Secrets & Identity

Secrets

  • AWS Secrets Manager (preferred)
  • Inject into Kubernetes via:
    • External Secrets
    • CSI driver
    • Secure environment injection

Identity & Auth

  • LangSmith authentication must be configured explicitly
  • Supported patterns include:
    • Token-based authentication
    • OIDC / SSO (at least one concrete example recommended for enablement)

For P0 enablement, select one authentication pattern to focus on. Additional patterns may be explored in future enablement tracks.


8.5. Operational Guidance

For detailed operational guidance, including ingestion configuration and failure modes, see PROD_CHECKLIST.md.

Ingestion Configuration

CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0 can be used as an optional ingest lever to reduce write latency. However, this setting does not fix underlying ClickHouse saturation. If ClickHouse is saturated, this may mask symptoms temporarily but will not resolve root causes.

Failure Modes

Ingestion backpressure most often manifests as "traces created but not visible". This occurs when:

  • ClickHouse write capacity is exceeded
  • Queue workers cannot keep up with ingestion volume
  • ClickHouse merge operations are backlogged
  • Disk I/O is saturated

When investigating trace visibility issues, check:

  1. ClickHouse query concurrency and disk I/O metrics
  2. Queue depth and worker processing rates
  3. ClickHouse merge operations and part counts
  4. Ingestion delay metrics (received_at → inserted_at)
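
For the last check, the sketch below computes a P95 ingestion delay from pairs of `received_at` and `inserted_at` timestamps and compares it with the 60-second trigger mentioned in Section 6. It assumes you have already exported those timestamp pairs from your own data; the function name and the nearest-rank percentile method are choices made for this example.

```python
from __future__ import annotations

from datetime import datetime, timedelta

INGESTION_DELAY_THRESHOLD_S = 60  # P95 trigger from the blob storage section


def p95_ingestion_delay_s(pairs: list[tuple[datetime, datetime]]) -> float:
    """Nearest-rank P95 of (inserted_at - received_at), in seconds."""
    if not pairs:
        raise ValueError("no samples")
    delays = sorted((inserted - received).total_seconds()
                    for received, inserted in pairs)
    # Nearest-rank: smallest rank r with r >= 0.95 * n, using integer math.
    rank = (95 * len(delays) + 99) // 100
    return delays[rank - 1]


if __name__ == "__main__":
    base = datetime(2025, 1, 1, 12, 0, 0)
    # Twenty synthetic samples with delays of 5, 10, ..., 100 seconds.
    samples = [(base, base + timedelta(seconds=5 * i)) for i in range(1, 21)]
    p95 = p95_ingestion_delay_s(samples)
    print(p95, p95 > INGESTION_DELAY_THRESHOLD_S)
```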

9. Observability (Platform-Level)

Minimum required:

  • Application logs accessible via CloudWatch
  • Kubernetes events visible
  • Health endpoints monitored

Optional (P1):

  • Prometheus / OpenTelemetry exporters
  • Alerting on:
    • Pod restarts
    • DB connectivity
    • Ingestion failures

10. Security Baseline (Non-Negotiable)

This reference architecture requires essential security controls as a baseline.

MUST

  • TLS enabled
  • No plaintext secrets
  • Least-privilege IAM
  • Network isolation (private subnets for data services)
  • WAF or equivalent rate limiting at ingress

SHOULD

  • Private access only (VPN / PrivateLink)
  • Auth required for all UI and API access
  • Regular patching and upgrades

Explicit Disclaimer

This reference architecture does not guarantee security.
Customers are responsible for reviewing and approving deployments with their security teams.


11. What This Architecture Explicitly Excludes

These are out of scope for P0 enablement:

  • Multi-region active/active
  • Custom gateways or service meshes
  • HA ClickHouse topologies beyond the replicated production baseline in Section 6
  • Custom scaling policies beyond autoscaler defaults
  • Performance benchmarking beyond sanity checks

These may appear in P1/P2 enablement or certification tracks.


12. Why This Exists

This reference architecture exists to:

  • Reduce installation failures and complexity
  • Provide support teams with a shared baseline
  • Create a clear, well-documented enablement path
  • Serve as the foundation for:
    • Hands-on labs
    • Operator certification
    • Support playbooks

Challenges encountered during implementation usually point to configuration that needs more attention, not to system defects.


13. Next Artifacts (Planned)

  • Preflight checklist
  • Deployment walkthrough
  • Known sharp edges
  • Failure-mode diagnostics
  • Operator mental model

These resources build on top of this foundation, providing additional guidance and support as you progress.
