langchain-ai/langsmith-self-hosted-reference-aws

mirror of https://github.com/langchain-ai/langsmith-self-hosted-reference-aws.git synced 2026-07-01 20:04:39 -04:00

Cory Waddingham 8dd8b66977 docs: improve clarity and structure across documentation

Improve phrasing, organization, and clarity across PREFLIGHT.md,
README.md, and WALKTHROUGH.md to make the deployment process
easier to follow and understand.

Changes include:
- Clarify ClickHouse deployment options with production vs dev
  distinctions
- Add more context and explanations throughout WALKTHROUGH.md
- Improve section organization and formatting
- Add troubleshooting guidance and prerequisites
- Enhance step-by-step instructions with better structure
- Specify "AWS reference architecture" in README for clarity

These changes maintain the technical accuracy while making the
documentation more accessible and easier to follow for engineers
deploying LangSmith Self-Hosted on AWS.

2025-12-18 13:54:43 -08:00

LangSmith Self-Hosted on AWS — Deployment Walkthrough (P0)

Goal: Get from zero → running LangSmith Self-Hosted → first successful trace → basic health validation.
Prerequisite: Complete the PREFLIGHT.md checklist before starting. This ensures your environment is ready and helps prevent common deployment issues.

This walkthrough is a step-by-step path to deploying LangSmith Self-Hosted. Follow the sections in order: each one validates what the next depends on, so skipping ahead makes failures harder to isolate.

0. Decisions to Make Before Starting

Before you begin deploying with Terraform, decide on the following:

AWS Region: us-west-2 (example — pick one and stick to it)
Environment name: dev / staging / prod (do not share resources across envs)
DNS name: langsmith.<your-domain>
Exposure model: Public (ALB) or Private-only (VPN/PrivateLink)
Auth model: Token-based (P0) or OIDC/SSO (P1 unless already standard internally)
Data store model:
- Postgres: RDS/Aurora (recommended)
- Redis: ElastiCache (recommended)
- ClickHouse: Externally managed (preferred) or in-cluster (allowed)

Tip: Document these decisions in a deploy/ENV.md file so you can reference them throughout the deployment process.
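As a minimal sketch of that tip, the function below writes the decisions to deploy/ENV.md. The environment variable names (`ENV_NAME`, `EXPOSURE`, and so on) and their defaults are placeholders invented for this example, not names used by the Terraform or Helm repositories.

```shell
# Sketch: record deployment decisions in <dir>/ENV.md.
# Variable names and defaults are hypothetical; replace them with your own.
# Usage: write_env_doc [output-dir]
write_env_doc() {
  local out_dir="${1:-deploy}"
  mkdir -p "$out_dir"
  cat > "$out_dir/ENV.md" <<EOF
# Deployment decisions

- AWS Region: ${AWS_REGION:-us-west-2}
- Environment: ${ENV_NAME:-dev}
- DNS name: ${DNS_NAME:-langsmith.example.com}
- Exposure model: ${EXPOSURE:-public-alb}
- Auth model: ${AUTH_MODEL:-token}
- Postgres: ${POSTGRES_MODEL:-rds}
- Redis: ${REDIS_MODEL:-elasticache}
- ClickHouse: ${CLICKHOUSE_MODEL:-external}
EOF
}
```

Calling `ENV_NAME=staging write_env_doc` produces deploy/ENV.md with the staging environment recorded and the remaining defaults filled in.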

1. Clone Repos and Pin Versions

To ensure a reproducible deployment, use specific versions of the Terraform and Helm repositories rather than always using the latest code.

Clone the required repositories:
- https://github.com/langchain-ai/terraform
- https://github.com/langchain-ai/helm
Record the specific versions you're using:
- Terraform repo commit SHA
- Helm repo commit SHA or chart version
Avoid using floating/latest versions to ensure you can reproduce your deployment later.

Why this matters: Using pinned versions ensures you can recreate your exact deployment configuration later, which is essential for troubleshooting, upgrades, and disaster recovery.
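One way to pin and record versions is sketched below. The file name deploy/PINNED_VERSIONS.txt and the function names are hypothetical; only standard `git` subcommands are used.

```shell
# Sketch: check out a repo at a known commit and record the SHA.
# PINS_FILE is a hypothetical location chosen for this example.
PINS_FILE="${PINS_FILE:-deploy/PINNED_VERSIONS.txt}"

# Usage: clone_pinned <repo-url> <target-dir> <commit-sha>
clone_pinned() {
  local url="$1" dir="$2" sha="$3"
  git clone "$url" "$dir" &&
    git -C "$dir" checkout --detach "$sha"
}

# Usage: record_pin <target-dir>   (appends "<dir> <sha>" to PINS_FILE)
record_pin() {
  local dir="$1" sha
  sha="$(git -C "$dir" rev-parse HEAD)" || return 1
  mkdir -p "$(dirname "$PINS_FILE")"
  printf '%s %s\n' "$dir" "$sha" >> "$PINS_FILE"
}
```

For example, `clone_pinned https://github.com/langchain-ai/terraform terraform <sha>` followed by `record_pin terraform` leaves a line in the pins file that you can commit alongside your environment notes.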

2. Terraform: Provision AWS Infrastructure

2.1 Configure Terraform State

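Use an S3 backend with a DynamoDB lock table (recommended).
Give each environment its own state (a separate bucket or state key) so that dev, staging, and prod never share a state file.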
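A per-environment state layout could be generated as in the sketch below. It assumes your Terraform configuration declares an empty `backend "s3" {}` block and that the bucket and lock table already exist; the key layout `langsmith/<env>/terraform.tfstate` is an illustrative choice, not something the Terraform repository prescribes.

```shell
# Sketch: write a partial backend config for Terraform's S3 backend.
# Bucket, table, and key layout are assumptions for this example.
# Usage: write_backend_config <env> <region> <state-bucket> <lock-table> <out-file>
write_backend_config() {
  local env="$1" region="$2" bucket="$3" table="$4" out="$5"
  cat > "$out" <<EOF
bucket         = "${bucket}"
key            = "langsmith/${env}/terraform.tfstate"
region         = "${region}"
dynamodb_table = "${table}"
encrypt        = true
EOF
}
```

After `write_backend_config dev us-west-2 my-tf-state my-tf-locks backend-dev.hcl`, run `terraform init -backend-config=backend-dev.hcl` so each environment initializes against its own state key.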

2.2 Apply Infrastructure

Provision (at minimum):

VPC + subnets (public for ALB, private for nodes/data)
- Use a /16 VPC CIDR block (the largest AWS allows) to ensure sufficient IP addresses for all nodes and pods
EKS cluster + managed node groups
RDS Postgres (14+)
ElastiCache Redis
S3 bucket for artifacts
Security groups and IAM roles/policies
(Optional) Route53 hosted zone / record scaffolding

Hard requirement: Ensure the EKS node groups provide at least:

16 vCPU / 64GB RAM allocatable capacity total
ClickHouse capacity if deploying in-cluster:
- Production: Capacity for 3 replicas, each with 8 vCPU / 32GB RAM allocatable (single-node ClickHouse is not supported for production)
- Dev-only: Single node with 8 vCPU / 32GB RAM allocatable (non-production proof-of-concept only)

For detailed production capacity and resource requirements, including ClickHouse topology requirements, see PROD_CHECKLIST.md.

2.3 Verify Infrastructure Before Proceeding

Before moving to the next step, verify that your infrastructure is correctly provisioned:

aws eks describe-cluster reports the cluster status as ACTIVE
Worker nodes in private subnets can reach the internet (NAT)
RDS reachable from EKS subnets/security groups
Redis reachable from EKS subnets/security groups
S3 bucket exists and IAM access path is defined (IRSA preferred)
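The first check in this list can be scripted roughly as follows. The function names are hypothetical; the `aws eks describe-cluster` call with `--query` and `--output text` is standard AWS CLI usage.

```shell
# Sketch: fail fast unless the EKS cluster reports ACTIVE.
# Usage: cluster_status <cluster-name> <region>   (prints e.g. ACTIVE)
cluster_status() {
  aws eks describe-cluster --name "$1" --region "$2" \
    --query 'cluster.status' --output text
}

# Usage: require_active <cluster-name> <region>   (non-zero exit unless ACTIVE)
require_active() {
  local status
  status="$(cluster_status "$1" "$2")" || return 1
  if [ "$status" = "ACTIVE" ]; then
    echo "cluster $1 is ACTIVE"
  else
    echo "cluster $1 is $status, not ACTIVE" >&2
    return 1
  fi
}
```

The reachability checks for RDS and Redis depend on your security group layout and are easiest to confirm from a pod inside the cluster, so they are not covered by this sketch.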

3. Kubernetes: Connect and Validate the Cluster

3.1 Connect to the Cluster

Update kubeconfig:
- aws eks update-kubeconfig --region <REGION> --name <CLUSTER_NAME>
Confirm:
- kubectl get nodes
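Once kubectl can reach the cluster, the 16 vCPU / 64GB allocatable requirement from section 2.2 can be checked with a sketch like the one below. It sums the allocatable CPU and memory that kubectl reports per node; the function names are hypothetical, and memory is totalled in GiB, which only approximates the GB figure in the requirement.

```shell
# Sketch: total allocatable capacity across nodes.
# Prints "<cpu> <memory>" per node, e.g. "7910m 31391944Ki".
node_allocatable() {
  kubectl get nodes --no-headers \
    -o custom-columns='CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'
}

# Reads "<cpu> <memory>" lines on stdin; prints "<total vCPU> <total GiB>".
sum_allocatable() {
  awk '
    function cpu(v) {
      if (v ~ /m$/) return substr(v, 1, length(v) - 1) / 1000
      return v + 0
    }
    function gib(v) {
      if (v ~ /Ki$/) return substr(v, 1, length(v) - 2) / 1048576
      if (v ~ /Mi$/) return substr(v, 1, length(v) - 2) / 1024
      if (v ~ /Gi$/) return substr(v, 1, length(v) - 2) + 0
      return v / 1073741824   # plain bytes
    }
    NF >= 2 { c += cpu($1); m += gib($2) }
    END { printf "%.2f %.2f\n", c, m }
  '
}

# Usage: meets_minimum <min-vcpu> <min-gib>   (reads sum_allocatable output on stdin)
meets_minimum() {
  awk -v c="$1" -v m="$2" '{ exit !($1 >= c && $2 >= m) }'
}
```

A typical invocation is `node_allocatable | sum_allocatable | meets_minimum 16 64 && echo "capacity OK"`. Add the ClickHouse capacity on top of these thresholds if you run it in-cluster.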

3.2 Install/Validate Required Add-ons

You must have:

Metrics Server
Cluster Autoscaler

Verification:

kubectl top nodes returns metrics
Autoscaler is running and has permissions

3.3 Create a Namespace

Create a dedicated namespace, e.g.:

langsmith

3.4 Validate Ingress Before Installing LangSmith

Important: Complete this validation before installing LangSmith with Helm. This step helps isolate any ingress configuration issues from application-level problems, making troubleshooting much easier if something goes wrong.

Many deployment issues that appear to be LangSmith problems are actually related to ingress, controller, or subnet-tagging configuration.

3.4.1 Deploy a Test Application

Deploy a minimal HTTP echo service (or any simple web service) into a test namespace (or the langsmith namespace). This will serve as a test target for your ingress.

Verify the test app is running:

kubectl get pods shows the pod in Running state
kubectl get endpoints shows the service has endpoints
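A test target can be stood up with plain kubectl commands, as in this sketch. It uses the public `nginx` image as the "simple web service"; the namespace `ingress-test` and the name `echo-test` are hypothetical choices for this example.

```shell
# Sketch: deploy a throwaway web service to act as the ingress test target.
# Usage: deploy_test_app [namespace]
deploy_test_app() {
  local ns="${1:-ingress-test}"
  # Create the namespace idempotently, then the deployment and its service.
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f - &&
    kubectl -n "$ns" create deployment echo-test --image=nginx --port=80 &&
    kubectl -n "$ns" expose deployment echo-test --port=80 --target-port=80 &&
    kubectl -n "$ns" rollout status deployment/echo-test --timeout=120s &&
    kubectl -n "$ns" get endpoints echo-test
}
```

The final command should list at least one pod IP; an empty endpoints list means the service selector does not match the pod.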

3.4.2 Create a Test Ingress

Create an Ingress resource pointing at your test service. This will trigger the AWS Load Balancer Controller to provision an ALB.

Verify everything works:

An ALB is created in AWS
A target group is created and associated with the ALB
Targets become healthy in the target group
You can successfully access the endpoint over HTTPS and receive a response
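The test Ingress might look like the manifest rendered by the sketch below. It assumes the AWS Load Balancer Controller is installed with an IngressClass named `alb`, that you have an ACM certificate ARN for the test hostname, and that the backing service listens on port 80. The default service name `echo-test` is hypothetical. For a private-only exposure model, `internal` would replace `internet-facing`.

```shell
# Sketch: render a test Ingress for the AWS Load Balancer Controller.
# Usage: render_test_ingress <host> <acm-certificate-arn> [service-name]
render_test_ingress() {
  local host="$1" cert_arn="$2" svc="${3:-echo-test}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${svc}
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: ${cert_arn}
spec:
  ingressClassName: alb
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${svc}
                port:
                  number: 80
EOF
}
```

Pipe the output to `kubectl apply -n <namespace> -f -`, then watch `kubectl get ingress -n <namespace>` until an ALB address appears.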

3.4.3 Troubleshooting Ingress Issues

If the test ingress fails, do not proceed to installing LangSmith until this is resolved. Fixing ingress issues after LangSmith is installed makes troubleshooting more difficult.

Check these areas first:

Kubernetes events on the Ingress resource: kubectl describe ingress <ingress-name>
AWS Load Balancer Controller logs: kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
ALB target group health status in the AWS console (look for specific error reasons)

Note: This reference architecture requires AWS ALB for ingress. If you're using a different ingress controller, you'll need to adapt the configuration accordingly.

4. Prepare Dependencies and Secrets

4.1 Collect Required Connection Info

You need:

Postgres host/port/db/user/password
Redis host/port (and auth if enabled)
ClickHouse endpoint/user/password (or in-cluster config)
S3 bucket name and region

4.2 Store Secrets Securely

Critical: Never commit passwords, API keys, or other secrets to version control.

Recommended approach: Use AWS Secrets Manager with External Secrets Operator to automatically sync secrets into Kubernetes.

Minimum requirement:

Keep all secrets out of your git repository
Use a secrets management solution (AWS Secrets Manager, HashiCorp Vault, etc.)
Inject secrets into Kubernetes securely using External Secrets, CSI driver, or secure environment variable injection

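Security reminder: A secret that reaches git history stays recoverable even after it is deleted from the working tree. Treat any committed secret as compromised and rotate it, and always use a proper secrets management solution.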
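If External Secrets Operator is not yet in place, a one-off sync from AWS Secrets Manager can be sketched as below. This is a stopgap illustration rather than the recommended approach: the function name and arguments are hypothetical, and it assumes the Secrets Manager entry holds a single plain string.

```shell
# Sketch: copy one Secrets Manager value into a Kubernetes Secret.
# Usage: sync_secret <secretsmanager-id> <k8s-secret-name> <key> [namespace]
sync_secret() {
  local secret_id="$1" name="$2" key="$3" ns="${4:-langsmith}" value
  value="$(aws secretsmanager get-secret-value --secret-id "$secret_id" \
    --query SecretString --output text)" || return 1
  # The value travels over stdin, so it is not passed as a command-line argument.
  printf '%s' "$value" |
    kubectl -n "$ns" create secret generic "$name" \
      --from-file="$key=/dev/stdin" --dry-run=client -o yaml |
    kubectl -n "$ns" apply -f -
}
```

Unlike External Secrets, this does not pick up rotations automatically; it must be re-run whenever the source secret changes.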

5. Helm: Install LangSmith

5.1 Choose the Values Strategy

You should have:

values.yaml (non-secret config)
secrets.yaml OR external secrets (secret values only, not committed)

5.2 Configure Required Values

Your Helm values must define:

External Postgres connection
External Redis connection
ClickHouse configuration (external or in-cluster)
S3 artifact storage (strongly recommended)
Ingress configuration (ALB + TLS)

For production requirements for each component, see PROD_CHECKLIST.md.

5.3 Install/Upgrade

Install the chart into the langsmith namespace.
Use helm upgrade --install (idempotent).
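The install step could be wrapped as in the sketch below. The chart repository URL and the chart name `langchain/langsmith` are assumptions based on the public LangChain Helm repository; if you cloned the Helm repository in step 1, point the command at the local chart path at your pinned commit instead. The release name `langsmith` and the timeout are illustrative.

```shell
# Sketch: idempotent install or upgrade of the LangSmith chart at a pinned version.
# Repo URL and chart name are assumptions; verify them against your pinned source.
# Usage: install_langsmith <chart-version> [values-file] [namespace]
install_langsmith() {
  local version="$1" values="${2:-values.yaml}" ns="${3:-langsmith}"
  helm repo add langchain https://langchain-ai.github.io/helm/ --force-update &&
    helm repo update &&
    helm upgrade --install langsmith langchain/langsmith \
      --namespace "$ns" --values "$values" --version "$version" \
      --wait --timeout 15m
}
```

Running the same command twice with unchanged values should report a new revision without changing any resources, which is a quick way to confirm idempotency.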

5.4 Verify Helm Installation

After installing LangSmith, verify that everything is running correctly:

All pods in langsmith namespace reach Running or expected steady state
No CrashLoopBackOff
Services have endpoints
Ingress is created and gets an ALB hostname/address

Verification commands:

kubectl get pods -n langsmith - Check all pods are running
kubectl describe pod <pod-name> -n langsmith - Inspect any pods that aren't running
kubectl get svc -n langsmith - Verify services are created
kubectl get ingress -n langsmith - Confirm ingress resource exists and has an ALB address
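To spot problem pods quickly, the output of the first command can be filtered as in this sketch. The function name is hypothetical; it treats any pod that is neither Completed nor fully ready and Running as needing attention.

```shell
# Sketch: list pods that are not in a healthy steady state.
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
not_ready_pods() {
  awk '
    { split($2, ready, "/") }
    $3 == "Completed" { next }
    $3 != "Running" || ready[1] != ready[2] { print $1, $2, $3 }
  '
}
```

Use it as `kubectl get pods -n langsmith --no-headers | not_ready_pods`; empty output means every pod is either Running with all containers ready or Completed.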

6. Ingress + DNS: Make It Reachable

6.1 TLS

Ensure the ALB listener is HTTPS
Ensure cert is valid (ACM recommended)

6.2 DNS

Create a Route53 record:
- langsmith.<domain> → ALB DNS name
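The record can be created from the command line as sketched below. This version uses a CNAME for simplicity; an alias A record is a common alternative for ALBs but also needs the load balancer's canonical hosted zone ID, which is omitted here. The function names and the 300-second TTL are illustrative.

```shell
# Sketch: build a Route53 change batch that points a name at the ALB.
# Usage: render_change_batch <record-name> <alb-dns-name>
render_change_batch() {
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}

# Usage: upsert_record <hosted-zone-id> <record-name> <alb-dns-name>
upsert_record() {
  aws route53 change-resource-record-sets --hosted-zone-id "$1" \
    --change-batch "$(render_change_batch "$2" "$3")"
}
```

The ALB DNS name is the address shown by `kubectl get ingress -n langsmith`.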

6.3 Verify Reachability

You can load the LangSmith UI at https://langsmith.<domain>
Auth behaves as intended (token login or SSO)

7. Send Your First Trace

A deployment isn't complete until you can successfully send and view traces. This step validates that the entire ingestion pipeline is working correctly.

7.1 Create an API Key / Token (if applicable)

Create the token per your configured auth model.
Store it securely.

7.2 Send a Minimal Trace

From a laptop or CI runner with egress to the endpoint:

Configure LANGSMITH_ENDPOINT
Configure auth (LANGSMITH_API_KEY or equivalent)
Run a minimal trace-producing script (LangChain example or direct API).
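For the direct API route, a minimal sketch with curl follows. It assumes the LangSmith REST API accepts a run via `POST <endpoint>/runs` with an `x-api-key` header, and that `LANGSMITH_ENDPOINT` for a self-hosted install includes the API path (commonly `https://langsmith.<domain>/api/v1`). Confirm both against the API documentation for your pinned version. The run name, payload contents, and function names are invented for this example, and `uuidgen` must be available on the machine.

```shell
# Sketch: send one completed run to LangSmith over HTTP.
# Usage: build_run_payload <run-id> <start-time> <end-time>
build_run_payload() {
  cat <<EOF
{
  "id": "$1",
  "name": "walkthrough-smoke-test",
  "run_type": "chain",
  "inputs": { "text": "hello" },
  "outputs": { "text": "world" },
  "start_time": "$2",
  "end_time": "$3"
}
EOF
}

# Usage: send_trace   (requires LANGSMITH_ENDPOINT and LANGSMITH_API_KEY)
send_trace() {
  local now run_id
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  run_id="$(uuidgen | tr 'A-Z' 'a-z')"
  build_run_payload "$run_id" "$now" "$now" |
    curl --fail --silent --show-error -X POST "${LANGSMITH_ENDPOINT%/}/runs" \
      -H "x-api-key: ${LANGSMITH_API_KEY}" \
      -H "Content-Type: application/json" \
      --data @-
}
```

Because `--fail` makes curl exit non-zero on HTTP errors, `send_trace && echo sent` gives a quick pass/fail signal before you look for the run in the UI.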

7.3 Verify Trace Ingestion

Check that your trace was successfully ingested:

The trace appears in the LangSmith UI
The trace includes at least one run/span with data
No ingestion errors appear in the application logs

If traces don't appear: Don't proceed to operational tasks yet. Fix the ingestion pipeline first. Common issues include:

ClickHouse connectivity problems
Redis queue issues
Authentication/authorization errors
Network connectivity between services

See TROUBLESHOOTING.md for detailed troubleshooting steps.

8. Basic Health Validation (P0 Ops Readiness)

8.1 What “Healthy” Means (Minimum)

UI loads reliably
API responds
DB connections stable
No sustained error logs
ClickHouse writes succeed
Redis queues not stuck

8.2 Validate Logs

Check:

LangSmith app logs for errors
ClickHouse logs for disk/memory pressure
Ingress/ALB logs (4xx/5xx spikes)

8.3 Validate Resource Pressure

kubectl top pods -n langsmith
Look for:
- OOMKills
- CPU throttling
- Persistent volume saturation
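`kubectl top` shows current usage but not past OOM kills. One way to surface them is to read each container's last termination reason, as in this sketch (function names are hypothetical):

```shell
# Sketch: find pods whose containers were last terminated by the OOM killer.
# Prints "<pod>\t<reasons>" for every pod in the namespace.
# Usage: last_termination_reasons [namespace]
last_termination_reasons() {
  kubectl get pods -n "${1:-langsmith}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Reads the output above on stdin; prints only pod names with an OOMKilled container.
oom_killed_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Run `last_termination_reasons langsmith | oom_killed_pods`; any pod listed needs a higher memory limit or a look at what drove its usage.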

9. Backup & Restore Planning

Before considering your deployment production-ready, ensure you have a backup and restore strategy:

RDS backups: Confirm automated backups are enabled and test that you can restore from them
ClickHouse persistence: Verify your ClickHouse data is stored on persistent volumes and understand how to restore it
S3 bucket lifecycle: Confirm your S3 bucket has appropriate lifecycle policies and versioning configured

Important: You don't need to perform a full restore test immediately, but you should document the restore procedure and understand how long it would take to recover from a failure.
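Two of these checks can be scripted with the AWS CLI, as sketched below. The function names are hypothetical, and the RDS check assumes a single RDS instance rather than an Aurora cluster, where the retention period is set at the cluster level. A retention period of 0 means automated backups are disabled.

```shell
# Sketch: confirm RDS automated backups and S3 versioning.
# Usage: rds_backup_retention_days <db-instance-identifier>
rds_backup_retention_days() {
  aws rds describe-db-instances --db-instance-identifier "$1" \
    --query 'DBInstances[0].BackupRetentionPeriod' --output text
}

# Usage: backups_enabled <db-instance-identifier>   (non-zero exit if disabled)
backups_enabled() {
  local days
  days="$(rds_backup_retention_days "$1")" || return 1
  [ "$days" -gt 0 ]
}

# Usage: bucket_versioning_status <bucket>   (prints Enabled, Suspended, or None)
bucket_versioning_status() {
  aws s3api get-bucket-versioning --bucket "$1" --query 'Status' --output text
}
```

Neither check proves that a restore works; they only confirm that there is something to restore from.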

10. Common Failure Points (Fast Triage)

If deployment fails, the usual culprits are:

Networking / Security Groups
- EKS can’t reach Postgres/Redis/ClickHouse
ClickHouse undersized or slow disk
- OOM, high latency, ingestion failures
Ingress misconfiguration
- ALB created but no healthy targets
Auth mismatch
- UI loads but API calls fail
Secrets handling
- Bad credentials injected, pods loop

When something breaks, capture:

kubectl describe
pod logs
DB connection test results
ALB target health

This data becomes your failure-mode catalog later.
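The Kubernetes side of that capture can be automated with a sketch like the one below. The function name, directory layout, and 500-line log tail are arbitrary choices for this example. Database connection tests and ALB target health still need to be collected separately.

```shell
# Sketch: snapshot pod state, events, and logs for a namespace.
# Usage: collect_diagnostics [namespace] [output-dir]   (prints the output dir)
collect_diagnostics() {
  local ns="${1:-langsmith}" out="${2:-diagnostics-$(date -u +%Y%m%dT%H%M%SZ)}" pod
  mkdir -p "$out/logs"
  kubectl get pods -n "$ns" -o wide > "$out/pods.txt"
  kubectl describe pods -n "$ns" > "$out/describe-pods.txt"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt"
  # One log file per pod, covering every container in that pod.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs -n "$ns" "$pod" --all-containers --tail=500 \
      > "$out/logs/${pod#pod/}.log" 2>&1
  done
  echo "$out"
}
```

Review the collected files for credentials before sharing them, since application logs can contain connection strings.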

11. Deployment Complete Checklist

Your deployment is complete when all of the following are true:

Terraform applied cleanly and is reproducible
Helm install is idempotent (upgrade --install works)
UI reachable via HTTPS on your chosen DNS
First successful trace appears in the UI
Basic health checks are green (no crash loops, stable DB connectivity)

If any item isn't checked, continue working through the walkthrough or consult TROUBLESHOOTING.md to resolve the issue.

Appendix: Notes for Your First Deployment

As you go through this walkthrough for the first time, consider keeping notes on:

Steps where you needed to pause and look up additional information
Decisions you had to make that weren't clearly documented
Any issues you encountered and how you resolved them
Configuration choices you made and why

These notes will be valuable for:

Troubleshooting future issues
Onboarding other team members
Planning upgrades or changes
Understanding your specific deployment configuration
