langchain-ai/langsmith-self-hosted-reference-aws

mirror of https://github.com/langchain-ai/langsmith-self-hosted-reference-aws.git synced 2026-07-01 20:04:39 -04:00

Cory Waddingham a1a46c0b7d docs: align production requirements and establish clear documentation hierarchy

Update reference architecture documentation to align with current LangSmith
scaling guidance and recent production incidents. Establish PROD_CHECKLIST.md
as the authoritative source for production requirements.

Major changes:

Production Requirements (README.md, PROD_CHECKLIST.md):
- ClickHouse: Require 3 replicas minimum for production (baseline), with ≤5
  guardrail. Single-node explicitly not supported for production.
- Blob Storage: Elevate from optional to strong production recommendation
  with explicit workload triggers (10+ tenants, query concurrency >100,
  latency >2s, ingestion delay >60s).
- Add read vs write path mental model section explaining Backend → Redis →
  Queue → ClickHouse (write) vs Backend → ClickHouse (read).
- Clarify that query concurrency and disk I/O are leading indicators, not
  CPU/memory.
- Add operational guidance: CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT=0 usage
  and failure mode awareness ("traces created but not visible").

Documentation Structure (PROD_CHECKLIST.md):
- Reorganize sections 1-3 into logical component order: Redis, PostgreSQL,
  ClickHouse, Blob Storage (matching system data flow).
- Add brief context descriptions for each component.
- Improve tone to be guidance-oriented rather than directive.

Cross-References:
- Add references from README.md, PREFLIGHT.md, WALKTHROUGH.md, and
  TROUBLESHOOTING.md to PROD_CHECKLIST.md for production requirements.
- Add reference from PROD_CHECKLIST.md to README.md for read/write path
  mental model explanation.
- Eliminate duplicate advice by establishing clear source of truth.

Fixes:
- Resolve contradiction: Align single-node ClickHouse wording ("not supported"
  vs "not recommended") across all files.
- Ensure consistent production requirements across all documentation.

This establishes a clear documentation hierarchy where PROD_CHECKLIST.md serves
as the authoritative source for production requirements, while other documents
reference it rather than duplicating information.

2025-12-16 11:15:01 -08:00

LangSmith Self-Hosted on AWS — Troubleshooting Guide (P0)

Purpose: Fast triage for the P0 reference deployment.
Style: Symptom → likely cause → exact checks → common fix.

This guide focuses on actionable, evidence-based troubleshooting. Every item maps to an observable signal and a deterministic check.

0. First Rule of Triage: Gather Evidence First

Before changing anything, capture essential diagnostic information. The easiest way to do this is to run the diagnostic capture script described below.

Quick Start: Automated Diagnostics Capture

Use the official LangSmith Kubernetes debugging script to automatically capture all required diagnostic information. This script is maintained in the LangChain Helm repository.

Download and run the script:

# Download the script
curl -O https://raw.githubusercontent.com/langchain-ai/helm/main/charts/langsmith/scripts/get_k8s_debugging_info.sh
chmod +x get_k8s_debugging_info.sh

# Run it with your namespace
./get_k8s_debugging_info.sh --namespace langsmith

What the script captures:

Summary of all Kubernetes resources (kubectl get all)
Detailed YAML for all resources
Kubernetes events (sorted by timestamp)
Resource usage for all pods and containers
Container logs for all pods (last 24 hours)
Previous container logs for restarted containers
All output compressed to a zip or tar.gz file

Script details:

Source: get_k8s_debugging_info.sh
Output location: /tmp/langchain-debugging-<timestamp>/
Output format: Compressed archive (.zip preferred, falls back to .tar.gz)
Required argument: --namespace <namespace>

The script creates a timestamped directory with all diagnostic information and compresses it into a single archive file, making it easy to share with support teams or analyze later.

Manual Capture (Alternative)

If you prefer to capture diagnostics manually, ensure you collect:

kubectl get pods -n langsmith -o wide
kubectl describe pod <POD> -n langsmith (for each pod)
kubectl logs <POD> -n langsmith --tail=200 (for each pod)
kubectl get events -n langsmith --sort-by=.lastTimestamp | tail -50
Ingress/ALB status:
- kubectl get ingress -n langsmith (or your ingress resource type)
- kubectl describe ingress <INGRESS> -n langsmith
If AWS-managed:
- ALB target group health (healthy/unhealthy + reason)

Capturing this information ensures you address the actual root cause rather than symptoms, making troubleshooting more efficient and effective.
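The manual checklist above can be wrapped in a small helper so every capture looks the same. This is a minimal sketch, not part of the repository: the function name `capture_diagnostics` and the output file layout are assumptions, and it covers only the `kubectl` items (ALB target group health still has to be checked on the AWS side).

```shell
# Sketch: collect the manual diagnostics listed above into one directory.
# capture_diagnostics is a hypothetical helper name, not part of the repo.
capture_diagnostics() {
  ns="${1:-langsmith}"
  out="${2:-./langsmith-diag-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out" || return 1

  kubectl get pods -n "$ns" -o wide > "$out/pods.txt" 2>&1
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
  kubectl get ingress -n "$ns" > "$out/ingress.txt" 2>&1
  kubectl describe ingress -n "$ns" >> "$out/ingress.txt" 2>&1

  # Per-pod describe output and the last 200 log lines of every container.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    name="${pod#pod/}"
    kubectl describe "$pod" -n "$ns" > "$out/describe-$name.txt" 2>&1
    kubectl logs "$pod" -n "$ns" --tail=200 --all-containers > "$out/logs-$name.txt" 2>&1
  done

  # Print the directory so the caller can archive or attach it.
  echo "$out"
}
```

Calling `capture_diagnostics langsmith` before making any change gives you a snapshot to compare against afterwards.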

1. The Deployment “Works” But UI Is Not Reachable

Symptom

DNS resolves but browser times out
Browser shows 502/503
ALB exists but shows no healthy targets

Likely Causes

Ingress misconfigured
Service port mismatch
Pod readiness failing (so targets never become healthy)
Security group / NACL blocks

Checks

kubectl get ingress -n langsmith -o yaml
kubectl get svc -n langsmith
kubectl describe svc <SERVICE> -n langsmith
kubectl get endpoints -n langsmith
kubectl get pods -n langsmith
Inspect readiness:
- kubectl describe pod <POD> -n langsmith | sed -n '/Readiness/,/Conditions/p'

Fixes (Common)

Ensure ingress points to the correct service + port
Ensure service selectors match pod labels
Fix readiness probe failures before touching ALB
Confirm the ALB security group allows inbound 443 and the node security group allows traffic from the ALB to the target ports
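A quick way to confirm that a Service actually selects Ready pods is to look at its Endpoints object. The sketch below assumes the Service has a same-named Endpoints object (the Kubernetes default); `service_ready_ips` is a hypothetical helper name.

```shell
# Sketch: print the ready endpoint IPs behind a Service, or fail if there are none.
# An empty result usually means the selector matches no Ready pods.
service_ready_ips() {
  svc="$1"
  ns="${2:-langsmith}"
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -z "$ips" ]; then
    echo "no ready endpoints for service $svc in namespace $ns" >&2
    return 1
  fi
  echo "$ips"
}
```

If this fails for the Service your ingress points at, fix selectors or readiness first; the ALB cannot have healthy targets until it succeeds.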

2. Pods CrashLoopBackOff Immediately

Symptom

Pods oscillate between CrashLoopBackOff and Running
Logs show immediate exit

Likely Causes

Missing or invalid secrets
DB/Redis/ClickHouse connection failure
Misconfigured required env vars

Checks

kubectl logs <POD> -n langsmith --previous --tail=200
kubectl describe pod <POD> -n langsmith (look for env var injection and secret refs)
Confirm secrets exist:
- kubectl get secret -n langsmith
Confirm external connectivity from inside cluster:
- Launch a temporary debug pod and test TCP connectivity to DB hosts/ports
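The temporary debug pod check can be sketched as a one-shot `kubectl run`. This assumes the `busybox:1.36` image can be pulled in your cluster and that its `nc` build supports `-z` and `-w`; the helper name `tcp_check_from_cluster` and the pod name are illustrative only.

```shell
# Sketch: test TCP reachability of HOST:PORT from inside the cluster.
# Assumes the busybox image is pullable and its nc supports -z and -w.
tcp_check_from_cluster() {
  host="$1"
  port="$2"
  ns="${3:-langsmith}"
  # --rm deletes the pod afterwards; --restart=Never makes it a one-shot pod.
  kubectl run "netcheck-$$" -n "$ns" --rm -i --restart=Never \
    --image=busybox:1.36 -- nc -z -w 5 "$host" "$port"
}
```

Run it once per dependency, passing the hostname and port from your Helm values. A hang until the 5-second timeout usually points at security groups or routing rather than at the application.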

Fixes (Common)

Correct secret names/keys referenced in Helm values
Verify DB hostnames and ports (RDS endpoints, Redis endpoints)
Fix network policy / security groups if connections time out

3. Everything Is Running, But “First Successful Trace” Fails

Symptom

UI loads
SDK calls fail (401/403/404) or traces never appear
Client sees timeouts or 5xx

Likely Causes

Wrong endpoint (LANGSMITH_ENDPOINT) or wrong path
Auth mismatch (token vs SSO)
Ingestion path failing due to ClickHouse or Redis issues
ALB health checks pass, but the application returns errors on ingest

Checks

From client machine:
- Confirm endpoint resolves and responds (TLS + HTTP status)
In cluster logs:
- Search logs of the API/ingestion service for auth or write errors
Check ClickHouse health:
- Look for write failures, memory pressure, disk pressure
Check Redis:
- Look for connection errors or queue backlog signals (if exposed)
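For the client-side check, the HTTP status alone narrows the cause considerably. The following sketch maps status codes to the likely causes listed in this section. The URL to probe is supplied by the caller, since the correct path depends on your deployment, and both function names are hypothetical.

```shell
# Sketch: map an HTTP status seen by the client to the likely causes above.
classify_status() {
  case "$1" in
    2??) echo "ok" ;;
    401|403) echo "auth: check the token or auth method and its permissions" ;;
    404) echo "path: check the LANGSMITH_ENDPOINT base URL and path" ;;
    5??) echo "server: check API/ingestion logs, ClickHouse, and Redis" ;;
    000) echo "network: no HTTP response (DNS, TLS, or timeout)" ;;
    *) echo "other: inspect the response body and service logs" ;;
  esac
}

# Probe a caller-supplied URL and classify the result.
# curl reports 000 when it receives no HTTP response at all.
probe_endpoint() {
  code="$(curl -sS -o /dev/null -m 10 -w '%{http_code}' "$1" 2>/dev/null)"
  echo "$code $(classify_status "$code")"
}
```

Running `probe_endpoint` from the same machine as the failing SDK client keeps DNS and TLS behavior comparable.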

Fixes (Common)

Ensure client is using the correct base URL and auth method
Regenerate token / verify permissions
Fix ClickHouse sizing or disk throughput issues if writes fail
Fix Redis connectivity if queues are used for ingest

4. ALB Exists But Targets Are “Unhealthy”

Symptom

ALB target group shows all targets unhealthy
UI returns 503 even though pods are running

Likely Causes

Readiness probe failing
Target group health check path/port mismatch
Service isn’t exposing the expected port
Pods are running but not listening

Checks

kubectl describe pod <POD> -n langsmith (readiness probe results)
kubectl get svc -n langsmith -o yaml
Confirm the container port aligns with service targetPort
Confirm health check path matches what the service actually serves
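The port alignment check can be approximated by comparing the Service's `targetPort` values with the ports the pod declares. This is a sketch under assumptions: `unmatched_target_ports` is a hypothetical helper, named target ports are skipped, and a container can listen on a port it never declares, so treat a mismatch as a hint rather than proof.

```shell
# Sketch: report numeric Service targetPorts that no declared container port matches.
# Named targetPorts (non-numeric) are skipped; this sketch cannot resolve them.
#
# Inputs are two space-separated lists, for example from:
#   kubectl get svc SERVICE -n langsmith -o jsonpath='{.spec.ports[*].targetPort}'
#   kubectl get pod POD -n langsmith -o jsonpath='{.spec.containers[*].ports[*].containerPort}'
unmatched_target_ports() {
  target_ports="$1"
  container_ports="$2"
  for tp in $target_ports; do
    case "$tp" in *[!0-9]*) continue ;; esac
    found=no
    for cp in $container_ports; do
      [ "$tp" = "$cp" ] && found=yes
    done
    [ "$found" = yes ] || echo "$tp"
  done
}
```

Empty output means every numeric target port has a matching declared container port.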

Fixes (Common)

Correct ingress annotations / health check settings
Fix readiness probe configuration or dependencies causing readiness to fail
Align service ports with actual container ports

5. DB Connectivity Failures (PostgreSQL)

Symptom

App logs show:
- authentication failures
- connection refused
- timeout
- “could not translate host name”
App won’t start or fails on request

Likely Causes

Wrong credentials
Security group blocks traffic from EKS to RDS
RDS not in the right subnets or routing broken
DNS/resolution issues inside cluster

Checks

Validate the RDS endpoint and port
Confirm security groups allow inbound from EKS node group / pod CIDR (depending on setup)
Test connectivity from a debug pod:
- DNS resolution
- TCP connect to <rds-endpoint>:5432
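The log symptoms above map fairly directly to the likely causes, so a first pass over the application logs can be automated. The sketch below is illustrative only: `classify_pg_errors` is a hypothetical helper, and the match patterns are common PostgreSQL client messages whose exact wording may differ in your driver.

```shell
# Sketch: read app log lines on stdin and print likely PostgreSQL causes with counts.
# The patterns are common client error messages; exact wording is an assumption.
classify_pg_errors() {
  awk '
    /authentication failed/ { c["credentials: check the secret values"]++ }
    /could not translate host name/ { c["dns: check the RDS endpoint name and cluster DNS"]++ }
    /[Cc]onnection refused/ { c["refused: check the endpoint and port (5432)"]++ }
    /timed out|timeout/ { c["timeout: check security groups and subnet routing"]++ }
    END { for (k in c) print c[k], k }
  ' | sort -rn
}
```

Pipe pod logs into it, for example the output of `kubectl logs` with `--tail=200` for a failing pod. The most frequent category is usually the one to check first.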

Fixes (Common)

Correct the credentials stored in secrets
Fix security group rules
Ensure private subnets have proper routing and NAT where required
Ensure RDS is reachable from EKS VPC/subnets

6. Redis Connectivity Failures

Symptom

Logs show Redis connection errors/timeouts
Background jobs stall (if used)
Ingestion or async tasks fail

Likely Causes

Wrong endpoint/port
Security group blocks traffic from EKS to ElastiCache
Auth mismatch (if Redis auth enabled)

Checks

Confirm ElastiCache endpoint and port
Test TCP connectivity from debug pod
Check whether Redis auth is enabled and whether Helm values match
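A protocol-level check goes one step beyond a TCP connect because it also surfaces auth and TLS mismatches. The sketch below assumes the `redis:7` image can be pulled in your cluster; `redis_ping_from_cluster` and the pod name are hypothetical.

```shell
# Sketch: run redis-cli PING from a temporary pod.
# Assumes the redis:7 image is pullable. Pass "tls" as the third argument
# when in-transit encryption is enabled on the Redis endpoint.
redis_ping_from_cluster() {
  host="$1"
  port="${2:-6379}"
  tls="${3:-}"
  ns="${4:-langsmith}"
  tls_flag=""
  [ "$tls" = "tls" ] && tls_flag="--tls"
  # $tls_flag is intentionally unquoted so it disappears when empty.
  kubectl run "redischeck-$$" -n "$ns" --rm -i --restart=Never \
    --image=redis:7 -- redis-cli -h "$host" -p "$port" $tls_flag ping
}
```

A `PONG` reply means the endpoint is reachable and accepts unauthenticated commands. A `NOAUTH` error means Redis auth is enabled, so the Helm values must supply matching credentials.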

Fixes (Common)

Fix endpoint in values
Fix security group rules
Align auth config

7. ClickHouse Problems (Most Common Real Root Cause)

7.1 ClickHouse OOM / Memory Pressure

Symptom

ClickHouse pod restarts
OOMKilled events
Trace writes fail or become slow

Likely Cause

ClickHouse undersized (4 vCPU / 16 GB RAM used for a real workload)
Memory limits too tight
Query pressure

Checks

kubectl describe pod <clickhouse-pod> -n langsmith (look for OOMKilled)
kubectl logs <clickhouse-pod> -n langsmith --tail=200
kubectl top pod <clickhouse-pod> -n langsmith
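Instead of scanning `describe` output by eye, the last termination reason can be read directly from the pod status. A minimal sketch, with `was_oom_killed` as a hypothetical helper name:

```shell
# Sketch: succeed if any container in the pod was last terminated with OOMKilled.
was_oom_killed() {
  pod="$1"
  ns="${2:-langsmith}"
  reasons="$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')"
  # Pad with spaces so only a whole-word match counts.
  case " $reasons " in
    *" OOMKilled "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

The status only records the most recent termination per container, so combine it with the restart count and events when judging how often OOM kills occur.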

Fixes

Move to 8 vCPU / 32GB RAM baseline (see PROD_CHECKLIST.md for production requirements)
Increase memory limits/requests
Reduce concurrent ingest/query load

7.2 ClickHouse Disk / IO Throughput Issues

Symptom

Latency spikes
Writes time out
ClickHouse logs mention slow merges / IO waits

Likely Cause

Slow storage class
Inadequate IOPS/throughput
Disk nearing capacity

Checks

Confirm PV storage class and performance characteristics
Check disk usage in ClickHouse pod
Review ClickHouse logs for merge pressure / IO wait
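Disk usage inside the ClickHouse pod can be reduced to a single number for alerting or quick comparison. This sketch parses POSIX-format `df -P` output. The data path `/var/lib/clickhouse` mentioned in the comment is the ClickHouse default and may differ in your chart, and the 80% default threshold is an arbitrary assumption.

```shell
# Sketch: print the Use% (as a bare number) from `df -P PATH` output on stdin.
# Example input source (data path is the ClickHouse default; adjust if needed):
#   kubectl exec CLICKHOUSE_POD -n langsmith -- df -P /var/lib/clickhouse
df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage read from stdin is at or above a threshold (default 80).
disk_pressure() {
  pct="$(df_use_pct)"
  [ "${pct:-0}" -ge "${1:-80}" ]
}
```

Usage percentage says nothing about IOPS or throughput, so it complements rather than replaces the storage class check.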

Fixes

Use SSD-backed storage with sufficient IOPS/throughput (see PROD_CHECKLIST.md for storage requirements)
Increase volume size
Move ClickHouse to a dedicated node group / better instance type

7.3 ClickHouse Not Persistent (Data Loss Risk)

Symptom

ClickHouse redeploy loses data
Traces disappear after restart

Likely Cause

No persistent volume attached
StatefulSet misconfigured

Checks

Confirm PVC exists and is bound:
- kubectl get pvc -n langsmith
Confirm ClickHouse uses that PVC
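Both persistence checks can be scripted. The sketch below lists any claim that is not `Bound` and prints the volume claim templates of a StatefulSet. The helper names are hypothetical, and the StatefulSet name must come from your own deployment.

```shell
# Sketch: list PVCs that are not Bound. Prints nothing when all are Bound.
unbound_pvcs() {
  ns="${1:-langsmith}"
  kubectl get pvc -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    awk '$2 != "Bound" { print $1, $2 }'
}

# Print the volumeClaimTemplates of a StatefulSet (empty output: none defined).
statefulset_claims() {
  kubectl get statefulset "$1" -n "${2:-langsmith}" \
    -o jsonpath='{.spec.volumeClaimTemplates[*].metadata.name}'
}
```

Empty output from `statefulset_claims` for the ClickHouse StatefulSet suggests its data is not on a persistent volume managed by that StatefulSet.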

Fixes

Attach PVC and ensure StatefulSet mounts it
Do not treat ClickHouse as stateless

8. Kubernetes Scheduling Issues

Symptom

Pods stuck in Pending
Events show “insufficient cpu/memory”
ClickHouse never schedules

Likely Causes

Cluster too small
Node group instance types too small
Taints/affinity constraints prevent scheduling

Checks

kubectl describe pod <POD> -n langsmith (look at scheduling events)
kubectl get nodes -o wide
Check taints:
- kubectl describe node <NODE> | sed -n '/Taints/,/Conditions/p'
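To see whether any node can fit a large pod, compare allocatable CPU across nodes. This is a sketch with hypothetical helper names. It handles only whole-core and millicore quantities such as `8` and `7910m`, and it ignores memory, which needs the same kind of comparison.

```shell
# Sketch: convert a Kubernetes CPU quantity ("8", "7910m") to millicores.
# Fractional values such as "0.5" are not handled.
cpu_to_millis() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *) echo "$(( $1 * 1000 ))" ;;
  esac
}

# List nodes whose allocatable CPU is at least MIN millicores.
nodes_with_cpu() {
  min="$1"
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu |
    while read -r name cpu; do
      if [ "$(cpu_to_millis "$cpu")" -ge "$min" ]; then
        echo "$name $cpu"
      fi
    done
}
```

Allocatable capacity is lower than the instance's nominal size because of system reservations, so an 8 vCPU instance typically reports less than 8000m and will not appear in `nodes_with_cpu 8000`.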

Fixes

Increase node group size
Use larger instance types
Remove/adjust taints and affinities
Ensure ClickHouse has a node with at least 8 vCPU / 32 GB RAM allocatable (see PROD_CHECKLIST.md for production requirements)

9. TLS / Certificate Issues

Symptom

Browser warnings
Client SDK fails TLS handshake
Mixed content or redirect loops

Likely Causes

Wrong ACM cert attached
Wrong DNS name on cert
HTTP/HTTPS mismatch

Checks

Confirm ALB listener is HTTPS
Confirm cert CN/SAN includes your DNS name
Confirm DNS record points to the correct ALB
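Certificate name mismatches are often wildcard misunderstandings: a wildcard covers exactly one label. The sketch below checks a hostname against a single certificate name. `san_matches` is a hypothetical helper, and it assumes both inputs are already lowercase.

```shell
# Sketch: does HOST match a certificate name PATTERN?
# Supports exact names and a single left-most wildcard label ("*.example.com").
san_matches() {
  host="$1"
  pattern="$2"
  case "$pattern" in
    \*.*)
      suffix="${pattern#\*}"        # e.g. ".example.com"
      label="${host%"$suffix"}"     # what the wildcard would have to cover
      [ "$label" != "$host" ] && [ -n "$label" ] &&
        case "$label" in *.*) false ;; *) true ;; esac
      ;;
    *) [ "$host" = "$pattern" ] ;;
  esac
}
```

The names to test against can be read from the live listener with `openssl s_client -connect HOST:443 -servername HOST` piped into `openssl x509 -noout -subject -enddate -ext subjectAltName`; the `-ext` option requires OpenSSL 1.1.1 or later.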

Fixes

Attach correct cert
Fix DNS record
Enforce HTTPS redirects intentionally (not accidentally)

10. “It Worked Yesterday” Failures (The Dangerous Ones)

Symptom

Random 5xx
Slow UI
Traces intermittently missing

Likely Causes

Resource pressure (CPU throttling / memory pressure)
ClickHouse disk pressure or merge backlog
Redis saturation
Node churn / autoscaling issues

Checks

kubectl top pods -n langsmith
Pod restarts:
- kubectl get pods -n langsmith --sort-by='.status.containerStatuses[0].restartCount'
Node events and scaling activity
DB metrics (RDS CPU/connections; Redis CPU/memory; ClickHouse memory/disk)
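Sorting by the first container's restart count can miss restarts in multi-container pods. As an alternative sketch, the RESTARTS column of the default `kubectl get pods` output can be filtered directly; `pods_with_restarts` is a hypothetical helper name, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Sketch: from `kubectl get pods --no-headers` output on stdin, print pods whose
# restart count is at least MIN (default 1), highest first.
# RESTARTS may look like "7 (5m ago)"; only the leading number is used.
pods_with_restarts() {
  awk -v min="${1:-1}" '$4 + 0 >= min { print $4 + 0, $1 }' | sort -rn
}
```

Running it periodically and comparing the output shows which pods keep restarting, which helps separate a one-off crash from ongoing resource pressure.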

Fixes

Add capacity (scale nodes)
Increase ClickHouse resources or improve disk class
Increase Redis tier if saturated
Tune autoscaler limits (don’t let it starve the cluster)

11. What to Include in a Support Request (If You Must Escalate)

If you open a ticket, include:

Reference path confirmation:
- “Deployed via reference architecture + Terraform + Helm”
- repo SHAs / chart versions
Current cluster state:
- kubectl get pods -n langsmith -o wide
- relevant describe output
- last 200 lines of logs from failing pods
External dependencies:
- Postgres type/version (RDS/Aurora, PG version)
- Redis type/version
- ClickHouse model (external vs in-cluster) + sizing (see PROD_CHECKLIST.md for production requirements)
ALB target health status and error reason

Providing this information upfront enables faster resolution. If diagnostics are incomplete, the first step will be to collect the necessary diagnostic data.

12. Add to This Guide (How)

Only add entries that:

Came from a real failure
Include a deterministic check
Include a fix that is repeatable
