docs: align production requirements and establish clear documentation hierarchy

Update reference architecture documentation to align with current LangSmith scaling guidance and recent production incidents. Establish PROD_CHECKLIST.md as the authoritative source for production requirements.

Major changes:

Production Requirements (README.md, PROD_CHECKLIST.md):
- ClickHouse: Require 3 replicas minimum for production (baseline), with ≤5 guardrail. Single-node explicitly not supported for production.
- Blob Storage: Elevate from optional to strong production recommendation with explicit workload triggers (10+ tenants, query concurrency >100, latency >2s, ingestion delay >60s).
- Add read vs write path mental model section explaining Backend → Redis → Queue → ClickHouse (write) vs Backend → ClickHouse (read).
- Clarify that query concurrency and disk I/O are leading indicators, not CPU/memory.
- Add operational guidance: CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0 usage and failure mode awareness ("traces created but not visible").

Documentation Structure (PROD_CHECKLIST.md):
- Reorganize sections 1-3 into logical component order: Redis, PostgreSQL, ClickHouse, Blob Storage (matching system data flow).
- Add brief context descriptions for each component.
- Improve tone to be guidance-oriented rather than directive.

Cross-References:
- Add references from README.md, PREFLIGHT.md, WALKTHROUGH.md, and TROUBLESHOOTING.md to PROD_CHECKLIST.md for production requirements.
- Add reference from PROD_CHECKLIST.md to README.md for read/write path mental model explanation.
- Eliminate duplicate advice by establishing clear source of truth.

Fixes:
- Resolve contradiction: Align single-node ClickHouse wording ("not supported" vs "not recommended") across all files.
- Ensure consistent production requirements across all documentation.

This establishes a clear documentation hierarchy where PROD_CHECKLIST.md serves as the authoritative source for production requirements, while other documents reference it rather than duplicating information.
2026-07-01 20:04:39 -04:00 · 2025-12-16 11:15:01 -08:00
parent d0cc9316d7
commit a1a46c0b7d
5 changed files with 258 additions and 18 deletions
@@ -158,6 +158,8 @@ The script provides clear success/failure indicators for each permission check,
  - ClickHouse
  - System overhead

+> **For detailed production capacity and resource requirements, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md).**
+
 ### Required Add-ons
 - [ ] Metrics Server enabled
 - [ ] Cluster Autoscaler enabled
@@ -188,10 +190,14 @@ The script provides clear success/failure indicators for each permission check,
  - [ ] PersistentVolume provisioner available
 - [ ] You understand ClickHouse is **not stateless**

+> **For production ClickHouse topology requirements (3 replicas minimum), see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#3-clickhouse-traces--analytics-required).**
+
 ---

 ## 6. Object Storage (Strongly Recommended)

+> **For blob storage requirements and workload triggers, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#4-blob-storage-strongly-recommended).**
+
 ### S3
 - [ ] S3 bucket planned for LangSmith artifacts
 - [ ] Bucket region matches deployment region
@@ -0,0 +1,168 @@
+# Self-Hosted LangSmith — Production Readiness Checklist
+
+This checklist is intended for **production self-hosted deployments** of LangSmith.  
+If any item below is unmet, the deployment should be considered **at risk**.
+
+---
+
+## 1. Redis (Cache & Job Queues)
+
+Redis is used for caching and job queues in the write path (Backend → Redis → Queue → ClickHouse). It buffers trace ingestion and manages queue processing, making it critical for handling write volume and preventing backpressure.
+
+- [ ] Redis memory sized for expected write volume
+- [ ] External Redis (e.g., AWS ElastiCache) used for production workloads with significant write volume
+- [ ] Redis is not shared with unrelated services
+- [ ] Memory usage, connection counts, and queue depths monitored
+- [ ] Eviction and memory pressure metrics monitored
+
+---
+
+## 2. PostgreSQL (Metadata)
+
+PostgreSQL stores LangSmith control-plane data (orgs, projects, users, API keys, config). It is **latency-sensitive** and **connection-bound**, not throughput-heavy. PostgreSQL outages commonly surface as authentication failures, project load errors, or global 500s—even when trace ingestion appears healthy.
+
+### Architecture & Sizing
+- [ ] Production uses a **managed or externally hosted PostgreSQL**
+- [ ] Single-node embedded Postgres avoided for production (consider managed services for better reliability)
+- [ ] Adequate IOPS and low-latency storage provisioned
+- [ ] Automated backups enabled and tested
+
+### Connections & Limits
+- [ ] `max_connections` sized for expected backend and worker concurrency
+- [ ] Connection pooling in place (e.g., PgBouncer or equivalent)
+- [ ] Backend connection reuse verified (no per-request connections)
+- [ ] Monitoring enabled for:
+  - Active connections
+  - Connection saturation
+  - Slow queries
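
As a rough illustration of sizing `max_connections`, the arithmetic below assumes every backend and worker pod holds a fixed-size connection pool and that a few connections are reserved for admin and monitoring use. The function name, the pool model, and the default headroom are assumptions made for this sketch, not values from LangSmith.

```python
def required_connections(backend_pods: int, worker_pods: int,
                         pool_size_per_pod: int, reserved: int = 10) -> int:
    """Estimate the PostgreSQL connections needed at full concurrency.

    Assumes (hypothetically) one fixed-size pool per pod plus a small
    reserve for admin and monitoring sessions.
    """
    if min(backend_pods, worker_pods, pool_size_per_pod, reserved) < 0:
        raise ValueError("inputs must be non-negative")
    return (backend_pods + worker_pods) * pool_size_per_pod + reserved
```

If the estimate exceeds what the database can comfortably serve, a pooler such as PgBouncer in front of PostgreSQL is the usual way to close the gap.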
+
+### Operational Readiness
+- [ ] Disk usage monitored with alerting (>70%)
+- [ ] Vacuum / autovacuum enabled and healthy
+- [ ] Schema migrations tested in non-production before rollout
+- [ ] Recovery procedure documented and rehearsed
+
+---
+
+## 3. ClickHouse (Traces & Analytics) (REQUIRED)
+
+LangSmith uses ClickHouse as the primary storage engine for **traces** and **feedback**. It stores run data and all feedback data, so every production deployment depends on it. Proper ClickHouse architecture and configuration are critical for system stability and performance.
+
+### Topology
+- [ ] ClickHouse is deployed as a **replicated cluster**
+- [ ] **Minimum 3 replicas** configured (baseline for production; single-node is not supported for production workloads)
+- [ ] Total replicas ≤ 5 (guardrail: higher counts require careful coordination)
+- [ ] Read and write concurrency can scale independently
+- [ ] ClickHouse user permissions and row policies verified
+- [ ] Migrations completed cleanly (no dirty schema state)
+
+### Resource Sizing
+- [ ] ≥ 8 vCPU / 32 GB RAM per node (baseline)
+- [ ] SSD-backed persistent storage
+- [ ] ~7000 IOPS and ~1000 MiB/s throughput
+- [ ] Disk expansion supported (PVC allowVolumeExpansion where applicable)
+- [ ] Disk usage monitored and alerting configured (>70%)
+- [ ] Query concurrency and disk I/O metrics monitored (leading indicators, not just CPU/memory)
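
The topology and sizing items above can be expressed as a small pre-deployment check. This is a minimal sketch: the thresholds come from this checklist, while the `ClickHouseCluster` type and `baseline_findings` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClickHouseCluster:
    """Hypothetical summary of a planned or running cluster."""
    replicas: int
    vcpu_per_node: int
    ram_gb_per_node: int
    disk_used_pct: float


def baseline_findings(cluster: ClickHouseCluster) -> List[str]:
    """Return the checklist items this cluster does not meet."""
    findings = []
    if cluster.replicas < 3:
        findings.append("fewer than 3 replicas (below production minimum)")
    elif cluster.replicas > 5:
        findings.append("more than 5 replicas (above guardrail)")
    if cluster.vcpu_per_node < 8 or cluster.ram_gb_per_node < 32:
        findings.append("node below 8 vCPU / 32 GB RAM baseline")
    if cluster.disk_used_pct > 70:
        findings.append("disk usage above 70% alert threshold")
    return findings
```

An empty result only means the static baseline is met; query concurrency and disk I/O still need live monitoring.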
+
+---
+
+## 4. Blob Storage (STRONGLY RECOMMENDED)
+
+Blob storage (e.g., S3 or GCS) stores large trace artifacts and payloads, reducing ClickHouse size and improving performance. Without blob storage, large trace payloads are stored inline in ClickHouse, increasing part counts, merge pressure, and read amplification—often leading to delayed or missing traces under load.
+
+Blob storage is **strongly recommended for production** and should be enabled if **any** of the following are true:
+
+- [ ] More than ~10 active tenants
+- [ ] Peak concurrent ClickHouse queries > 100 (or spikes > 200)
+- [ ] P95 query latency > 2s for trace or run retrieval
+- [ ] P95 ingestion delay (`received_at → inserted_at`) > 60s
+- [ ] One or more tenants produce large or verbose traces
+
+If blob storage is enabled:
+
+- [ ] Blob storage connectivity validated from cluster
+- [ ] Blob lifecycle policies aligned with ClickHouse TTL settings
+- [ ] Object storage throughput and request limits verified
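
The "any of the following" rule for the workload triggers in this section can be sketched as a single evaluation function. The thresholds are the ones listed above; the `WorkloadSnapshot` type and its field names are assumptions for this example, and the metrics would have to come from your own monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkloadSnapshot:
    """Hypothetical container for observed or expected workload metrics."""
    active_tenants: int
    peak_concurrent_queries: int
    p95_query_latency_s: float
    p95_ingestion_delay_s: float
    has_large_traces: bool


def blob_storage_triggers(w: WorkloadSnapshot) -> List[str]:
    """Return the names of all triggers that fire; any hit means enable blob storage."""
    checks = {
        "more than ~10 active tenants": w.active_tenants > 10,
        "peak concurrent queries > 100": w.peak_concurrent_queries > 100,
        "P95 query latency > 2s": w.p95_query_latency_s > 2.0,
        "P95 ingestion delay > 60s": w.p95_ingestion_delay_s > 60.0,
        "large or verbose traces": w.has_large_traces,
    }
    return [name for name, fired in checks.items() if fired]
```

Returning the list of fired triggers, rather than a single boolean, keeps the reason for the recommendation visible in review notes.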
+
+
+---
+
+## 5. Scaling Mental Model (UNDERSTOOD)
+
+> **For detailed explanation of read vs write paths, see [`README.md`](./README.md#65-read-vs-write-path-mental-model).**
+
+- [ ] Team understands **write path**:
+  - Backend → Redis → Queue → ClickHouse
+- [ ] Team understands **read path**:
+  - Backend → ClickHouse
+- [ ] Scaling actions target the correct bottleneck (write path vs read path)
+- [ ] Team validates ClickHouse capacity before scaling queue workers
+
+> Scaling the wrong layer (e.g., adding workers without scaling ClickHouse) can worsen outages.
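
The ordering implied by this section (validate ClickHouse capacity first, then touch queue workers) can be sketched as a tiny decision function. The two boolean inputs and the returned strings are hypothetical; how "saturated" and "growing" are measured is left to your own monitoring.

```python
def next_scaling_step(queue_depth_growing: bool, clickhouse_saturated: bool) -> str:
    """Pick a scaling action without adding write pressure to a saturated database."""
    if clickhouse_saturated:
        # Adding workers here would push more inserts into ClickHouse.
        return "scale ClickHouse before adding queue workers"
    if queue_depth_growing:
        # Assumption: ClickHouse has headroom, so workers are the bottleneck.
        return "add queue workers"
    return "no scaling action needed"
```

The saturation check comes first on purpose: a growing queue combined with a saturated ClickHouse is a database capacity problem, not a worker shortage.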
+
+---
+
+## 6. Networking & Proxies
+
+- [ ] Frontend / ingress `maxBodySize` supports expected trace payload sizes
+- [ ] Reverse proxy timeouts reviewed for large reads
+- [ ] Network ACLs allow access to blob storage endpoints
+- [ ] Internal service-to-service latency validated
+
+---
+
+## 7. Operational Safeguards
+
+- [ ] Monitoring in place for:
+  - Concurrent ClickHouse queries (leading indicator)
+  - Query duration (P95 / P99)
+  - Ingestion delay (received_at → inserted_at)
+  - Disk usage and I/O (leading indicator)
+  - Redis memory and queue depth
+  - ClickHouse merge operations and part counts
+- [ ] Alerts configured for sustained ingestion delay or query saturation
+- [ ] Usage limits configured (or planned) for high-volume tenants
+
+---
+
+## 8. Optional Performance Levers (NOT FIXES)
+
+- [ ] `CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0` evaluated as an optional ingest lever to reduce write latency
+- [ ] Team understands this setting does not fix underlying ClickHouse saturation
+- [ ] Timeouts reviewed but not used to mask slow queries
+
+> These settings can improve behavior under load but **do not resolve underlying ClickHouse saturation**. At best they mask symptoms temporarily; the root cause remains.
+
+---
+
+## 9. Diagnostics & Support Readiness
+
+- [ ] Log collection procedures documented (Kubernetes / Docker)
+- [ ] Ability to collect ClickHouse system tables (`system.query_log`, parts, merges)
+- [ ] Browser HAR capture process documented for UI errors
+- [ ] Backup and restore procedures tested (ClickHouse, Redis, Postgres)
+
+---
+
+## 10. Known Failure Mode Awareness
+
+- [ ] Team understands that failures often present as:
+  - "Traces created but not visible"
+  - Large delays before traces appear
+  - 500 errors during UI or API access
+- [ ] Team recognizes these symptoms usually indicate **ingestion backpressure or ClickHouse saturation**, not missing data
+- [ ] Team has a troubleshooting process for trace visibility issues:
+  - Check ClickHouse query concurrency and disk I/O metrics
+  - Review queue depth and worker processing rates
+  - Examine ClickHouse merge operations and part counts
+  - Monitor ingestion delay metrics (`received_at → inserted_at`)
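
For the ingestion delay metric, a P95 over `received_at → inserted_at` pairs could be computed as sketched below. The nearest-rank percentile method and the function names are choices made for this example; how the timestamp pairs are exported from your deployment is not covered here.

```python
from datetime import datetime
from typing import Iterable, Tuple


def p95_ingestion_delay_s(pairs: Iterable[Tuple[datetime, datetime]]) -> float:
    """Nearest-rank P95 of (inserted_at - received_at), in seconds."""
    delays = sorted((inserted - received).total_seconds()
                    for received, inserted in pairs)
    if not delays:
        raise ValueError("no samples")
    # Integer ceiling of 0.95 * n gives the 1-based nearest-rank index.
    rank = (95 * len(delays) + 99) // 100
    return delays[rank - 1]


def ingestion_delay_alert(pairs: Iterable[Tuple[datetime, datetime]],
                          threshold_s: float = 60.0) -> bool:
    """True when P95 ingestion delay exceeds the threshold (60s in this checklist)."""
    return p95_ingestion_delay_s(pairs) > threshold_s
```

A sustained alert from this check points to ingestion backpressure rather than missing data, matching the failure modes listed above.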
+
+---
+
+## Final Sign-off
+
+- [ ] Architecture reviewed against this checklist
+- [ ] All REQUIRED items satisfied
+- [ ] STRONGLY RECOMMENDED items acknowledged or explicitly accepted as risk
+
+> If multiple items above are unchecked, production incidents are more likely under moderate to high load. This checklist serves as guidance to help identify and address potential risks before they impact production.
@@ -110,6 +110,8 @@ This diagram represents the **minimum supported topology** for the P0 reference
  - **16 vCPU / 64 GB RAM** available
 - This includes LangSmith services + system overhead

+> **For detailed production capacity requirements, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md).**
+
 ---

 ## 6. Data Stores
@@ -131,12 +133,16 @@ LangSmith SH relies on three core data stores.

 ClickHouse is **memory-, I/O-, and concurrency-intensive**. Proper sizing and topology are mandatory for production stability.

+> **For detailed production requirements, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#3-clickhouse-traces--analytics-required).**
+
 ### Production Requirements (P0 – Baseline)

 **Topology**
- **Minimum of 2 ClickHouse read nodes (replicas) is required for production**
+- **Production requires a replicated ClickHouse cluster**
+- **Baseline: 3 ClickHouse replicas** (minimum for production)
 - Single-node ClickHouse is **not supported for production workloads**
 - Read and write concurrency must be able to scale independently
+- **Guardrail:** Clusters should typically stay at ≤5 replicas

 **Compute**
 - 8 vCPU
@@ -147,7 +153,7 @@ ClickHouse is **memory-, I/O-, and concurrency-intensive**. Proper sizing and to
 - ~7000 IOPS
 - ~1000 MiB/s throughput

-> ⚠️ CPU and memory alone are not sufficient indicators of health. Query concurrency and disk I/O are often the first bottlenecks.
+> ⚠️ **Query concurrency and disk I/O are leading indicators**, not CPU/memory. Monitor these metrics to identify bottlenecks before they impact system health.

 ---

@@ -159,12 +165,11 @@ ClickHouse is **memory-, I/O-, and concurrency-intensive**. Proper sizing and to

 ---

-### Blob Storage (Strongly Advised)
+### Blob Storage (Strong Production Recommendation)

-Blob storage is **strongly advised** for any production deployment that meets **either** of the following conditions:
+Blob storage is **strongly recommended for production** and should be enabled once deployments exceed **~10 active tenants** OR hit any of the workload-based triggers below.

- **More than ~10 active tenants**
- **Any of the workload triggers below**
+> **For complete blob storage requirements and workload triggers, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#4-blob-storage-strongly-recommended).**

 #### Workload Triggers
 Enable blob storage if **any** of the following are observed or expected:
@@ -172,22 +177,54 @@ Enable blob storage if **any** of the following are observed or expected:
 - Peak concurrent ClickHouse queries consistently **> 100** (or spikes > 200)
 - P95 query latency **> 2 seconds** for trace or run retrieval queries
 - P95 ingestion delay (`received_at → inserted_at`) **> 60 seconds**
- One or more tenants producing **large or verbose traces** (e.g., large tool outputs, attachments, or deeply nested spans)
+- Any tenant producing **large or verbose traces** (e.g., large tool outputs, attachments, or deeply nested spans)

-> Without blob storage, large trace payloads are stored inline in ClickHouse. This increases part counts, merge pressure, and read amplification, which can lead to query concurrency collapse and severe trace visibility delays.
+> ⚠️ **Without blob storage**, large payloads increase part counts, merge pressure, and read amplification, leading to concurrency collapse and delayed trace visibility.
+
+> ⚠️ **Blob storage lifecycle policies must align with ClickHouse TTL settings** to prevent data inconsistencies and ensure proper cleanup.

 ---

 ### Scaling Guidance (P1)

-Scale ClickHouse to **16 vCPU / 64 GB RAM** and/or additional replicas when:
+**ClickHouse Scaling**
+- Keep the existing baseline sizing (8 vCPU / 32 GB RAM per node) until the scaling signals listed here appear
+- **Query concurrency and disk I/O are leading indicators**, not CPU/memory
+- Scale ClickHouse to **16 vCPU / 64 GB RAM** and/or additional replicas when:
+  - Trace ingestion volume grows
+  - Concurrent query count increases
+  - Query latency trends upward
+  - Insert lag begins to drift

- Trace ingestion volume grows
- Concurrent query count increases
- Query latency trends upward
- Insert lag begins to drift
+> ⚠️ **Scaling ClickHouse without blob storage has diminishing returns** at higher write and concurrency levels.

-> Scaling ClickHouse without blob storage has diminishing returns at higher write and concurrency levels.
+**Redis Sizing**
+- For high-write workloads, ensure Redis has sufficient memory and network bandwidth
+- Monitor Redis memory usage, connection counts, and queue depths
+- External Redis (e.g., AWS ElastiCache) is required for production workloads with significant write volume
+- Single-node Redis is acceptable for P0 baseline, but consider replication for production workloads
+
+---
+
+## 6.5. Read vs Write Path Mental Model
+
+Understanding the separation between read and write paths is critical for effective scaling and troubleshooting.
+
+### Write Path
+**Backend → Redis → Queue → ClickHouse**
+
+- Traces are received by the backend service
+- Data flows through Redis (caching/queuing)
+- Queue workers process and insert into ClickHouse
+- This path handles ingestion and write concurrency
+
+### Read Path
+**Backend → ClickHouse**
+
+- User queries and trace retrieval go directly from backend to ClickHouse
+- This path handles query concurrency and read performance
+
+> ⚠️ **Scaling the wrong layer can worsen outages.** For example, adding queue workers without scaling ClickHouse will increase write pressure on an already saturated database, making the problem worse. Always identify whether the bottleneck is in the write path (queue/workers) or read path (ClickHouse query capacity) before scaling.

 ---

@@ -223,6 +260,31 @@ Scale ClickHouse to **16 vCPU / 64 GB RAM** and/or additional replicas when:

 ---

+## 8.5. Operational Guidance
+
+> **For detailed operational guidance including ingestion configuration and failure modes, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#8-optional-performance-levers-not-fixes) and [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#10-known-failure-mode-awareness).**
+
+### Ingestion Configuration
+
+`CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0` can be used as an optional ingest lever to reduce write latency. However, **this setting does not fix underlying ClickHouse saturation**: if ClickHouse is saturated, it may mask symptoms temporarily but will not resolve root causes.
+
+### Failure Modes
+
+Ingestion backpressure commonly manifests as **"traces created but not visible"**. This occurs when:
+
+- ClickHouse write capacity is exceeded
+- Queue workers cannot keep up with ingestion volume
+- ClickHouse merge operations are backlogged
+- Disk I/O is saturated
+
+When investigating trace visibility issues, check:
+1. ClickHouse query concurrency and disk I/O metrics
+2. Queue depth and worker processing rates
+3. ClickHouse merge operations and part counts
+4. Ingestion delay metrics (`received_at → inserted_at`)
+
+---
+
 ## 9. Observability (Platform-Level)

 Minimum required:
@@ -248,7 +248,7 @@ Capturing this information ensures you address the actual root cause rather than
 - `kubectl top pod <clickhouse-pod> -n langsmith`

 **Fixes**
- Move to **8 vCPU / 32GB RAM** baseline
+- Move to **8 vCPU / 32GB RAM** baseline (see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#3-clickhouse-traces--analytics-required) for production requirements)
 - Increase memory limits/requests
 - Reduce concurrent ingest/query load

@@ -272,7 +272,7 @@ Capturing this information ensures you address the actual root cause rather than
 - Review ClickHouse logs for merge pressure / IO wait

 **Fixes**
- Use SSD-backed storage with sufficient IOPS/throughput
+- Use SSD-backed storage with sufficient IOPS/throughput (see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#3-clickhouse-traces--analytics-required) for storage requirements)
 - Increase volume size
 - Move ClickHouse to a dedicated node group / better instance type

@@ -321,7 +321,7 @@ Capturing this information ensures you address the actual root cause rather than
 - Increase node group size
 - Use larger instance types
 - Remove/adjust taints and affinities
- Ensure ClickHouse has a node that can fit **8/32 allocatable**
+- Ensure ClickHouse has a node that can fit **8/32 allocatable** (see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md#3-clickhouse-traces--analytics-required) for production requirements)

 ---

@@ -391,7 +391,7 @@ If you open a ticket, include:
 - External dependencies:
  - Postgres type/version (RDS/Aurora, PG version)
  - Redis type/version
-  - ClickHouse model (external vs in-cluster) + sizing
+  - ClickHouse model (external vs in-cluster) + sizing (see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md) for production requirements)
 - ALB target health status and error reason

 Providing this information upfront enables faster resolution. If diagnostics are incomplete, the first step will be to collect the necessary diagnostic data.
@@ -62,6 +62,8 @@ Provision (at minimum):
 - **ClickHouse capacity** if in-cluster:
  - One node with **8 vCPU / 32GB RAM** allocatable

+> **For detailed production capacity and resource requirements, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md).**
+
 ### 2.3 Terraform Verification Gates (Stop if any fail)
 - [ ] `aws eks describe-cluster` shows `ACTIVE`
 - [ ] Worker nodes in private subnets can reach the internet (NAT)
@@ -157,6 +159,8 @@ Your Helm values must define:
 - S3 artifact storage (strongly recommended)
 - Ingress configuration (ALB + TLS)

+> **For production requirements for each component, see [`PROD_CHECKLIST.md`](./PROD_CHECKLIST.md).**
+
 ### 5.3 Install/Upgrade
 - Install the chart into the `langsmith` namespace.
 - Use `helm upgrade --install` (idempotent).