Initial commit: LangSmith Self-Hosted AWS Reference Architecture (P0)

This commit establishes the P0 reference architecture documentation and
supporting tooling for deploying LangSmith Self-Hosted on AWS.

Documentation:
- README.md: Reference architecture overview with embedded request flow diagram
- PREFLIGHT.md: Comprehensive preflight checklist with automated script integration
- WALKTHROUGH.md: Step-by-step deployment walkthrough
- INGRESS.md: ALB-only ingress configuration guide
- TROUBLESHOOTING.md: Evidence-based troubleshooting guide with diagnostic automation

Tooling:
- scripts/preflight.sh: Automated AWS permission and prerequisite validation
- scripts/capture-diagnostics.sh: Automated diagnostic information capture for troubleshooting

All documentation follows a professional, educational, and encouraging tone
designed to support platform/infrastructure/MLOps engineers through the
deployment process.

Key features:
- Opinionated, supportable deployment path
- AWS + EKS + Terraform + Helm stack
- Automated validation and diagnostic tools
- Clear separation of P0 (baseline) vs P1+ (advanced) features
Cory Waddingham
2025-12-15 12:10:50 -08:00
commit ca682685a7
10 changed files with 2437 additions and 0 deletions
@@ -0,0 +1,2 @@
.env
.venv
@@ -0,0 +1,192 @@
# Ingress for LangSmith Self-Hosted on AWS (P0) — ALB Only
**P0 Reference Requirement:** Use **AWS Application Load Balancer (ALB)** via the **AWS Load Balancer Controller**.
This requirement is intentionally opinionated. Ingress configuration is a common source of deployment challenges due to the many valid options available. The reference architecture standardizes on ALB to provide a clear, well-tested path.
If you are not using ALB, you are operating **outside the P0 reference path**.
---
## Supported Ingress (P0)
### ✅ Supported
- **AWS Load Balancer Controller** + **ALB**
- TLS termination using **ACM**
- DNS via **Route53** (or equivalent, but Route53 is assumed for P0 examples)
- Optional but strongly recommended:
  - **AWS WAF** attached to ALB
  - Private-only exposure (internal ALB + VPN/PrivateLink)
### ❌ Explicitly Out of Scope (P0)
- NGINX Ingress Controller
- Traefik
- Istio / service mesh gateways
- API Gateway “fronting” Kubernetes as a substitute for ingress
- CloudFront as a substitute for ingress (can be layered later, but not P0)
- Custom gateways / reverse proxies
These may work. We do not support them in the reference enablement path.
---
## Why We Require ALB
- The ALB path is the **lowest-friction**, most reproducible option for AWS customers.
- It provides a standardized approach that avoids controller complexity and configuration variations.
- It aligns with what most platform teams already deploy and secure.
- It makes debugging straightforward: ALB target health metrics and Kubernetes events provide clear diagnostic information.
This requirement exists to reduce:
- install failures
- support escalations
- time-to-first-trace delays
---
## Required Components
You must have the following working before you install LangSmith:
1. **EKS cluster** running and reachable with `kubectl`
2. **AWS Load Balancer Controller** installed and healthy
3. **IAM permissions** for the controller (IRSA strongly recommended)
4. **Subnet tagging** correct for ALB discovery
5. **ACM certificate** for your DNS name
6. **Route53 record** (or other DNS) pointing to the created ALB
If any of these are missing, Helm installation may succeed, but the product will be unreachable.
---
## Preflight Checks (Ingress-Specific)
### Controller Health
- [ ] The AWS Load Balancer Controller pods are running
- [ ] No CrashLoopBackOff
- [ ] Controller has permission to create:
  - ALBs
  - Target groups
  - Listeners
  - Security group rules
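The pod-health items above can be checked from the command line. The following is a minimal sketch, assuming the controller was installed as a Deployment named `aws-load-balancer-controller` in the `kube-system` namespace (a common default, but not confirmed by this guide); the function name and the two override variables are illustrative. It does not verify the AWS-side IAM permissions, which only show up when the controller tries to create resources.

```bash
# Assumption: Deployment name and namespace below match your install.
# Override them through the environment if they differ.
LBC_NAMESPACE="${LBC_NAMESPACE:-kube-system}"
LBC_DEPLOYMENT="${LBC_DEPLOYMENT:-aws-load-balancer-controller}"

# Succeeds only when at least one replica is desired and all desired
# replicas report ready.
lbc_is_healthy() {
  local desired ready
  desired="$(kubectl get deployment "$LBC_DEPLOYMENT" -n "$LBC_NAMESPACE" \
    -o jsonpath='{.spec.replicas}')" || return 1
  ready="$(kubectl get deployment "$LBC_DEPLOYMENT" -n "$LBC_NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}')" || return 1
  [ "${desired:-0}" -gt 0 ] && [ "${ready:-0}" -eq "$desired" ]
}
```

A typical call would be `lbc_is_healthy && echo "controller ready"`. If it fails, `kubectl get pods -n kube-system` shows whether a pod is in CrashLoopBackOff.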
### Subnet Tagging (Common Failure)
- [ ] Subnets are tagged so the controller can discover them for ALB creation
- [ ] You know which subnets are intended for:
  - a public-facing ALB
  - an internal-only ALB (if private)
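One way to confirm the tagging is to list the subnets that carry the discovery tags. This sketch assumes the standard AWS Load Balancer Controller tag keys, `kubernetes.io/role/elb` for public subnets and `kubernetes.io/role/internal-elb` for private ones; the function name and the VPC ID are placeholders.

```bash
# Print the IDs of subnets in a VPC that carry the given discovery tag key.
# Filtering on tag-key matches the tag regardless of its value.
subnets_with_role_tag() {
  local vpc_id="$1" role_tag="$2"
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=${vpc_id}" "Name=tag-key,Values=${role_tag}" \
    --query 'Subnets[].SubnetId' --output text
}
```

For example, `subnets_with_role_tag vpc-0123456789abcdef0 kubernetes.io/role/elb` should return at least two subnet IDs in different Availability Zones for a public ALB. An empty result means the controller has nothing to discover.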
### TLS
- [ ] ACM cert exists in the **same region** as the ALB
- [ ] Cert covers the intended DNS name (`langsmith.<domain>`)
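The certificate checks can be sketched as follows. The helper names are hypothetical, and the lookup only inspects each certificate's primary domain name as returned by `aws acm list-certificates`; names that appear only as subject alternative names would need a follow-up `aws acm describe-certificate` call.

```bash
# Succeeds when cert_name covers host, either exactly or as a
# single-label wildcard (*.example.com covers langsmith.example.com).
cert_name_covers() {
  local host="$1" cert_name="$2"
  [ "$host" = "$cert_name" ] && return 0
  case "$cert_name" in
    '*.'*) [ "$host" != "${host#*.}" ] && [ "${host#*.}" = "${cert_name#\*.}" ] ;;
    *) return 1 ;;
  esac
}

# Print the primary names of ISSUED certificates in the region that cover host.
find_cert_for_host() {
  local host="$1" region="$2" name
  aws acm list-certificates --region "$region" --certificate-statuses ISSUED \
    --query 'CertificateSummaryList[].DomainName' --output text |
    tr '\t' '\n' |
    while IFS= read -r name; do
      if cert_name_covers "$host" "$name"; then
        printf '%s\n' "$name"
      fi
    done
}
```

Pass the region the ALB will be created in, for example `find_cert_for_host langsmith.example.com us-west-2`. No output means no issued certificate in that region has a matching primary name.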
### DNS
- [ ] You can create DNS records for the LangSmith hostname
---
## Mandatory Validation Step: Prove ALB Ingress Works Before LangSmith
Complete this validation **before** installing LangSmith. This step helps isolate ingress configuration issues from application-level problems, making troubleshooting more efficient.
### Step 1: Deploy a tiny test service
Pick one lightweight HTTP echo service (example shown conceptually):
- Create a deployment + service that listens on HTTP (port 80)
- Confirm:
  - `kubectl get pods` shows it running
  - `kubectl get endpoints` lists at least one address for the service
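As one possible shape for this step, the sketch below deploys a stock web server as the test workload. The name `alb-test` and the image `nginx:stable` are illustrative choices, not part of the reference architecture; any small container that answers HTTP on port 80 would do.

```bash
# Create a throwaway Deployment and Service, wait for the rollout,
# then show the Service endpoints.
deploy_test_service() {
  local ns="${1:-default}"
  kubectl create deployment alb-test --image=nginx:stable --port=80 -n "$ns" &&
    kubectl expose deployment alb-test --port=80 --target-port=80 -n "$ns" &&
    kubectl rollout status deployment/alb-test -n "$ns" --timeout=120s &&
    kubectl get endpoints alb-test -n "$ns"
}
```

Running `deploy_test_service default` should end with an endpoints listing that contains at least one pod IP. Remove the test resources afterwards with `kubectl delete deployment,service alb-test`.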
### Step 2: Create a test Ingress that provisions an ALB
Create an Ingress resource targeting the test service.
What must happen:
- An ALB is created
- A target group is created
- Targets become **healthy**
- You can curl the endpoint and receive a response
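A test Ingress for this step might look like the sketch below. It assumes an IngressClass named `alb` exists, that pods receive VPC-routable IPs (needed for `target-type: ip`), and that a Service named `alb-test` listens on port 80; the function name, hostname, and certificate ARN are placeholders to replace with your own values.

```bash
# Print an Ingress manifest that asks the AWS Load Balancer Controller
# for an internet-facing HTTPS ALB in front of the alb-test Service.
render_test_ingress() {
  local host="$1" cert_arn="$2"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alb-test
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: ${cert_arn}
spec:
  ingressClassName: alb
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: alb-test
                port:
                  number: 80
EOF
}
```

Apply it with `render_test_ingress test.example.com "$CERT_ARN" | kubectl apply -f -`. Once `kubectl get ingress alb-test` shows an address, point a DNS record at it and run `curl -I https://test.example.com/`. For a private deployment, change the scheme annotation to `internal`.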
### Step 3: If the test Ingress fails, stop
Do not proceed to LangSmith until:
- ALB provisioning works
- target health becomes green
- HTTPS works with your cert
---
## Common Failure Modes (and Where to Look First)
### ALB never gets created
**Likely causes**
- Controller not installed
- Missing IAM permissions
- Subnet discovery fails
**Look at**
- Kubernetes events on the Ingress
- Controller logs
- AWS console: whether any ALB attempt exists
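The first two checks can be gathered in one pass. This sketch assumes the controller runs as `deployment/aws-load-balancer-controller` in `kube-system`; the function name is illustrative.

```bash
# Collect the Kubernetes-side evidence for an Ingress that never got an ALB.
ingress_provisioning_evidence() {
  local ingress="$1" ns="${2:-default}"
  # Events attached to the Ingress usually carry the controller's error text.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Ingress,involvedObject.name=${ingress}" \
    --sort-by=.lastTimestamp
  # Recent controller log lines that mention this Ingress by name.
  kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=200 |
    grep -F -- "$ingress" ||
    echo "no recent controller log lines mention ${ingress}"
}
```

Call it as `ingress_provisioning_evidence alb-test default`. Permission errors and subnet discovery failures typically appear in one of these two outputs.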
---
### ALB created but targets unhealthy
**Likely causes**
- Wrong service port / targetPort
- Pods not ready
- Health check path mismatch
- Security group blocks ALB-to-target traffic
**Look at**
- ALB target group health reason
- `kubectl describe ingress ...`
- `kubectl describe svc ...`
- Pod readiness probe status
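The target group health reason is also available from the AWS CLI. In this sketch the function name is hypothetical and the target group ARN is whatever the controller created for your Ingress (visible in the EC2 console or via `aws elbv2 describe-target-groups`).

```bash
# List every target that is not healthy, with its state and reason code.
unhealthy_targets() {
  local tg_arn="$1"
  aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}
```

An empty result means all registered targets are healthy. Reason codes such as `Target.Timeout` or `Target.ResponseCodeMismatch` point toward security groups or the health check path, respectively.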
---
### HTTPS broken / cert issues
**Likely causes**
- Wrong ACM cert
- Cert in wrong region
- DNS mismatch
**Look at**
- ALB listener config
- ACM cert validity and SANs
- DNS record points to the right ALB
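To see which certificate the ALB actually serves, a sketch along these lines can help. It assumes OpenSSL 1.1.1 or newer (for the `-ext` option); the function name is illustrative.

```bash
# Print subject, expiry date, and SANs of the certificate served for a host.
show_served_cert() {
  local host="$1"
  # -servername sets SNI so the ALB presents the certificate for this host.
  openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -subject -enddate -ext subjectAltName
}
```

Compare the output of `show_served_cert langsmith.example.com` with the certificate you expect from ACM. If the names do not match, check the listener's certificate and whether the DNS record resolves to the intended ALB.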
---
## Security Recommendations (P0 Baseline)
Minimum expected posture for P0:
- HTTPS only (no plaintext)
- WAF or equivalent rate limiting at the edge
- Prefer private exposure for enterprise deployments
- Least privilege IAM for the controller and application
- No public DB endpoints
---
## What to Document When You Deviate (Off-Reference)
If a customer insists on non-ALB ingress, require them to capture:
- ingress controller type/version
- config manifests
- load balancer / gateway config
- health check settings
- network policies / SG rules
Note: this configuration is **not supported by P0 enablement**.
---
## Done Criteria (Ingress)
Ingress is “done” when:
- [ ] AWS Load Balancer Controller is healthy
- [ ] A test Ingress provisions an ALB successfully
- [ ] Targets are healthy
- [ ] HTTPS works with your DNS name
Only then install LangSmith.
@@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -0,0 +1,290 @@
# LangSmith Self-Hosted — Preflight Checklist (P0)
**Purpose:**
Ensure the environment is ready *before* running Terraform or Helm.
Most deployment challenges can be prevented by completing these checks upfront, rather than discovering issues during installation.
If a preflight check fails, **address it before proceeding**. This ensures a smoother deployment experience.
---
## Automated Preflight Checks
You can use the provided preflight script to automatically verify AWS permissions and prerequisites before proceeding with manual checks.
### Quick Start
Run the automated preflight script:
```bash
./scripts/preflight.sh
```
### What the Script Does
The script performs **read-only** permission checks to verify you have the necessary AWS permissions for deploying LangSmith Self-Hosted. By default, it:
- Verifies AWS credentials are configured
- Tests permissions for required AWS services:
  - **EC2** (VPCs, subnets, availability zones)
  - **EKS** (cluster management)
  - **IAM** (role creation and management)
  - **RDS** (PostgreSQL/Aurora)
  - **ElastiCache** (Redis)
  - **Application Load Balancer** (ALB/ELB)
  - **ACM** (TLS certificates)
  - **Route53** (DNS management)
  - **WAFv2** (optional, for production)
- Checks for sandbox account restrictions
- Validates region configuration
**Note:** The script is read-only by default and does not create or modify any resources.
### Command-Line Options
```bash
./scripts/preflight.sh [OPTIONS]
```
**Options:**
- `-s, --skip_resource_tests, --skip_checks`
  Skip resource creation tests (only run read-only permission checks)
- `-y, --yes`
  Non-interactive mode (skip confirmation prompts). Automatically enabled in CI environments.
- `--create-test-resources`
  Create temporary test resources (VPC, subnet, security group, IAM role) to verify write permissions. Resources are automatically cleaned up on exit. Use this to fully validate your permissions.
- `--domain <domain>`
  Check for ACM certificate and Route53 hosted zone matching the specified domain (e.g., `langsmith.example.com`). The script will check for exact matches and wildcard certificates.
**Examples:**
```bash
# Basic permission check (read-only)
./scripts/preflight.sh
# Check permissions and verify ACM certificate exists
./scripts/preflight.sh --domain langsmith.example.com
# Full permission test including resource creation
./scripts/preflight.sh --create-test-resources
# Non-interactive mode (useful for CI/CD)
./scripts/preflight.sh --yes --domain langsmith.example.com
```
### When to Use the Script
- **Before starting deployment:** Run the script to verify all AWS permissions are in place
- **Troubleshooting permission issues:** Use `--create-test-resources` to test write permissions
- **CI/CD pipelines:** Use `--yes` flag for automated checks
- **Certificate validation:** Use `--domain` to verify ACM certificates and Route53 zones exist
The script provides clear success/failure indicators for each permission check, making it easy to identify and resolve permission issues before deployment.
---
## 1. Account & Access
### AWS Account
- [ ] You have **full access** to an AWS account (not a sandbox with hidden SCPs)
- [ ] You can create:
  - VPCs
  - EKS clusters
  - ALBs
  - IAM roles and policies
  - RDS / ElastiCache
  - EBS volumes
- [ ] No org-level policy blocks required services
### Credentials
- [ ] AWS credentials configured locally (`aws sts get-caller-identity` works)
- [ ] Region selected and consistent across Terraform and Helm
- [ ] You understand **who pays for this** (this will not be free)
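The credential and region items can be confirmed together. This is a sketch with a hypothetical function name; it assumes the region comes from `AWS_REGION`, then `AWS_DEFAULT_REGION`, then the active profile's configured region.

```bash
# Print the account and region that subsequent AWS calls will use,
# or fail if credentials or region are missing.
aws_preflight_identity() {
  local account region
  account="$(aws sts get-caller-identity --query Account --output text)" || return 1
  region="$(aws configure get region)" || region=""
  region="${AWS_REGION:-${AWS_DEFAULT_REGION:-$region}}"
  if [ -z "$region" ]; then
    echo "no AWS region configured" >&2
    return 1
  fi
  printf 'account=%s region=%s\n' "$account" "$region"
}
```

Check that the printed region matches the one used in your Terraform variables and Helm values before continuing.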
---
## 2. Terraform Readiness
### Tooling
- [ ] Terraform installed (supported version)
- [ ] `kubectl` installed
- [ ] `helm` installed
- [ ] AWS CLI (`aws`) installed
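A quick way to confirm the tooling is on the `PATH` is sketched below. The function name is illustrative, and the check does not validate versions.

```bash
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools terraform kubectl helm aws` prints nothing when all four are installed. Note that the AWS CLI binary is named `aws`.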
### State Management
- [ ] Terraform state backend chosen (S3 + DynamoDB recommended)
- [ ] State bucket exists or can be created
- [ ] You are not sharing state with another environment
### Assumptions (Explicit)
- [ ] You are deploying **one environment** (no shared dev/prod infra)
- [ ] You are okay with Terraform creating networking resources
- [ ] You will not “hot-edit” AWS resources Terraform owns
---
## 3. Network & DNS
### VPC
- [ ] A dedicated VPC will exist for LangSmith
- [ ] At least:
  - 2 public subnets (ALB)
  - 2 private subnets (EKS + data)
- [ ] NAT Gateway available for private subnet egress
### DNS
- [ ] A Route53 hosted zone exists (or you control DNS externally)
- [ ] You can create DNS records for the LangSmith endpoint
- [ ] You know whether this will be:
  - [ ] Publicly accessible
  - [ ] Private-only (VPN / PrivateLink)
---
## 4. Kubernetes (EKS) Expectations
### Cluster
- [ ] EKS will be used (not self-managed k8s)
- [ ] You accept managed node groups
- [ ] You are not using custom admission controllers that block installs
### Capacity (Hard Requirement)
- [ ] Minimum **16 vCPU / 64 GB RAM** allocatable cluster capacity
- [ ] Nodes are sized to allow:
  - LangSmith services
  - ClickHouse
  - System overhead
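On an existing cluster, the CPU side of the capacity requirement can be estimated by summing allocatable CPU across nodes. This sketch uses hypothetical helper names and assumes node CPU quantities are reported either as whole cores (`4`) or millicores (`3920m`); memory can be summed in a similar way from `.status.allocatable.memory`.

```bash
# Convert a Kubernetes CPU quantity ("4" or "3920m") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  printf '%s\n' "$(( $1 * 1000 ))" ;;
  esac
}

# Sum allocatable CPU over all nodes and print the total in millicores.
total_allocatable_millicores() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' |
    {
      total=0
      while IFS= read -r cpu; do
        if [ -n "$cpu" ]; then
          total=$(( total + $(cpu_to_millicores "$cpu") ))
        fi
      done
      printf '%s\n' "$total"
    }
}
```

The 16 vCPU minimum corresponds to a result of at least `16000`. With autoscaling, the figure reflects only the nodes currently running.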
### Required Add-ons
- [ ] Metrics Server enabled
- [ ] Cluster Autoscaler enabled
- [ ] You can install CRDs
---
## 5. Data Stores
### PostgreSQL
- [ ] PostgreSQL **14+**
- [ ] Managed service (RDS/Aurora) preferred
- [ ] Automated backups enabled
- [ ] Network access from EKS confirmed
### Redis
- [ ] Redis OSS **5+**
- [ ] Managed (ElastiCache) or in-cluster
- [ ] Network access from EKS confirmed
### ClickHouse (Critical)
- [ ] Deployment model chosen:
  - [ ] Externally managed
  - [ ] In-cluster (StatefulSet)
- [ ] If in-cluster:
  - [ ] Node with **8 vCPU / 32 GB RAM** available
  - [ ] SSD-backed storage
  - [ ] PersistentVolume provisioner available
- [ ] You understand ClickHouse is **not stateless**
---
## 6. Object Storage (Strongly Recommended)
### S3
- [ ] S3 bucket planned for LangSmith artifacts
- [ ] Bucket region matches deployment region
- [ ] IAM access model chosen:
  - [ ] IRSA (preferred)
  - [ ] Explicit credentials (discouraged)
---
## 7. Secrets Management
- [ ] Secrets **will not** be committed to git
- [ ] Secrets backend chosen:
  - [ ] AWS Secrets Manager
  - [ ] External Secrets
  - [ ] CSI driver
- [ ] Rotation strategy understood (even if manual)
---
## 8. Auth & Access Model
- [ ] Auth strategy selected:
  - [ ] Token-based
  - [ ] OIDC / SSO
- [ ] You know **who can access LangSmith**
- [ ] You know **how access is revoked**
- [ ] You are not assuming “security by obscurity”
> Pick one auth model for initial enablement. Others are out of scope.
---
## 9. Ingress (P0 Hard Gate) — ALB Only
Ingress configuration is a critical component that requires careful attention. For the P0 reference deployment, ingress is **not optional** and there are **no alternative controllers**.
**P0 Requirement:** AWS ALB via **AWS Load Balancer Controller**.
If you are using NGINX/Traefik/Istio/API Gateway/etc., you are operating **outside the reference path**.
### Controller & Permissions
- [ ] AWS Load Balancer Controller is installed in the cluster
- [ ] Controller pods are healthy (no CrashLoopBackOff)
- [ ] Controller IAM permissions are in place (IRSA strongly preferred)
### Subnet Discovery (Common Failure)
- [ ] Public subnets are correctly tagged for ALB discovery (public ALB)
- [ ] Private subnets are correctly tagged if you plan an internal ALB
- [ ] You know which subnets ALBs will land in
### TLS & DNS
- [ ] ACM certificate exists for `langsmith.<domain>` (same region as ALB)
- [ ] You control DNS and can create records for the endpoint
### Mandatory Proof (Stop if not true)
- [ ] You have successfully provisioned a **test ALB** from Kubernetes Ingress
  - ALB created
  - target group created
  - targets become healthy
  - HTTPS works on your DNS name
If you cannot prove that ALB ingress works, fix the ingress configuration **before** installing LangSmith.
---
## 10. Operational Expectations (Read This)
Before proceeding, confirm you accept:
- [ ] You are responsible for upgrades
- [ ] You are responsible for backups
- [ ] You are responsible for incident response
- [ ] Support will assume this reference architecture when debugging
If any of these are unacceptable, **review your requirements** before proceeding, as these responsibilities are fundamental to operating a self-hosted deployment.
---
## 11. Preflight Outcome
- [ ] All required checks passed
  → You may proceed to **Terraform deployment**.
- [ ] One or more checks failed
  → Address them **before** continuing. Proceeding without resolving these issues will likely result in deployment challenges.
---
## Why This Checklist Exists
Every unchecked box above corresponds to common issues that have caused:
- Support escalations
- Deployment delays
- Production incidents
Completing preflight checks thoroughly significantly increases your chances of a successful deployment. While passing preflight does not guarantee success, **failing to address these checks almost guarantees challenges**.
@@ -0,0 +1,262 @@
# LangSmith Self-Hosted on AWS — Reference Architecture (P0)
**Status:** P0 Enablement Baseline
**Audience:** Platform / Infra / MLOps Engineers
**Goal:** Provide a single, opinionated, supportable path to deploying and operating LangSmith Self-Hosted (SH) on AWS with minimal support intervention.
This document defines the **reference architecture LangChain Enablement stands behind**.
Alternative approaches may work, but are **out of scope for P0 enablement and future certification**.
---
## 1. What This Architecture Is (and Is Not)
### This *is*:
- A production-capable **baseline deployment**
- Opinionated by design
- Built on **AWS + EKS + Terraform + Helm**
- Designed to surface real operator responsibilities early
- The foundation for future labs and certification
### This is *not*:
- A performance benchmark
- A multi-region or HA architecture
- A guide for custom service meshes or bespoke gateways
- A promise of security guarantees
---
## 2. Deployment Mode
**P0 Default: Full Self-Hosted**
- Control plane and data plane both run in the customer AWS account
- Customer is responsible for:
  - Network exposure
  - Authentication
  - Data persistence
  - Upgrades and backups
> Hybrid (SaaS control plane + SH data plane) is valid but **out of scope for P0 enablement**.
---
## 3. High-Level Architecture
Request flow (top to bottom):
![Request Flow Diagram](diagrams/RequestFlow.png)
Users / CI / SDKs
→ Route53
→ Application Load Balancer (ALB) + WAF
→ Kubernetes Ingress (EKS)
→ LangSmith application services
Persistent dependencies:
- PostgreSQL — metadata (projects, orgs, users)
- Redis — cache and job queues
- ClickHouse — traces and analytics
- S3 — large artifacts and payload storage
**Flow Summary**
- Traffic enters via **Route53 → ALB** (with optional WAF).
- ALB forwards to **Kubernetes ingress** inside EKS.
- LangSmith application services run in EKS.
- Persistent state is handled by:
  - **PostgreSQL** (metadata)
  - **Redis** (cache / queues)
  - **ClickHouse** (traces & analytics)
  - **S3** (large artifacts and payloads)
This diagram represents the **minimum supported topology** for the P0 reference architecture.
---
## 4. Network & Ingress
### VPC
- Single VPC
- **Public subnets**: ALB only
- **Private subnets**:
  - EKS worker nodes
  - Data services (RDS, Redis, ClickHouse if in-cluster)
### Ingress
- **Application Load Balancer (ALB)**
- **AWS WAF strongly recommended**
- TLS termination at ALB (end-to-end TLS recommended)
- Optionally:
  - Internal ALB + VPN / PrivateLink for non-public access
### Egress
- Outbound HTTPS access to required LangChain endpoints (if applicable)
- Restrict egress access per organizational policy requirements
---
## 5. Compute: Kubernetes (EKS)
### Cluster
- **Amazon EKS**
- Managed node groups
- Cluster Autoscaler enabled
- Metrics Server enabled
### Baseline Capacity
- Minimum cluster capacity:
  - **16 vCPU / 64 GB RAM** available
  - This includes LangSmith services + system overhead
---
## 6. Data Stores
LangSmith SH relies on three core data stores.
### PostgreSQL (Metadata)
- **AWS RDS PostgreSQL or Aurora PostgreSQL**
- PostgreSQL **14+**
- Single AZ for P0 (HA is P1)
- Automated backups enabled
### Redis (Cache / Queues)
- **AWS ElastiCache (Redis OSS)**
- Single node acceptable for P0
- Persistence optional but recommended
### ClickHouse (Traces & Analytics)
ClickHouse is **memory- and I/O-intensive**. Proper sizing is critical for optimal performance and stability.
#### P0 Reference Sizing (Production Baseline)
- **8 vCPU**
- **32 GB RAM**
- **SSD-backed persistent storage**
  - ~7000 IOPS
  - ~1000 MiB/s throughput
#### Allowed but Dev-Only
- **4 vCPU / 16 GB RAM**
- Non-production proof-of-concept only
#### Scaling Guidance (P1)
- Scale to **16 vCPU / 64 GB RAM** when:
  - Trace ingestion grows
  - Query latency increases
  - Memory pressure appears
> Strong recommendation: use externally managed ClickHouse where possible.
> In-cluster ClickHouse is supported for P0 and works well with proper operational practices.
---
## 7. Object Storage
### S3 (Strongly Recommended)
- Store large trace artifacts and payloads
- Reduces DB size and blast radius
- Improves security posture for sensitive inputs/outputs
### Access Pattern
- Use **IAM Roles for Service Accounts (IRSA)** where possible
- No static credentials in Helm values
---
## 8. Secrets & Identity
### Secrets
- **AWS Secrets Manager** (preferred)
- Inject into Kubernetes via:
  - External Secrets
  - CSI driver
  - Secure environment injection
### Identity & Auth
- LangSmith authentication must be configured explicitly
- Supported patterns include:
  - Token-based authentication
  - OIDC / SSO (at least one concrete example recommended for enablement)
> For P0 enablement, select **one authentication pattern** to focus on. Additional patterns may be explored in future enablement tracks.
---
## 9. Observability (Platform-Level)
Minimum required:
- Application logs accessible via CloudWatch
- Kubernetes events visible
- Health endpoints monitored
Optional (P1):
- Prometheus / OpenTelemetry exporters
- Alerting on:
  - Pod restarts
  - DB connectivity
  - Ingestion failures
---
## 10. Security Baseline (Non-Negotiable)
This reference architecture requires **essential security controls** as a baseline.
### MUST
- TLS enabled
- No plaintext secrets
- Least-privilege IAM
- Network isolation (private subnets for data services)
- WAF or equivalent rate limiting at ingress
### SHOULD
- Private access only (VPN / PrivateLink)
- Auth required for all UI and API access
- Regular patching and upgrades
### Explicit Disclaimer
> This reference architecture does **not** guarantee security.
> Customers are responsible for reviewing and approving deployments with their security teams.
---
## 11. What This Architecture Explicitly Excludes
These are **out of scope for P0 enablement**:
- Multi-region active/active
- Custom gateways or service meshes
- HA ClickHouse clusters
- Custom scaling policies beyond autoscaler defaults
- Performance benchmarking beyond sanity checks
These may appear in P1/P2 enablement or certification tracks.
---
## 12. Why This Exists
This reference architecture exists to:
- Reduce installation failures and complexity
- Provide support teams with a shared baseline
- Create a clear, well-documented enablement path
- Serve as the foundation for:
  - Hands-on labs
  - Operator certification
  - Support playbooks
Challenges during implementation usually point to configuration that needs attention rather than to defects in the system.
---
## 13. Next Artifacts (Planned)
- Preflight checklist
- Deployment walkthrough
- Known sharp edges
- Failure-mode diagnostics
- Operator mental model
These resources build **on top of this foundation**, providing additional guidance and support as you progress.
+405
@@ -0,0 +1,405 @@
# LangSmith Self-Hosted on AWS — Troubleshooting Guide (P0)
**Purpose:** Fast triage for the P0 reference deployment.
**Style:** Symptom → likely cause → exact checks → common fix.
This guide focuses on actionable, evidence-based troubleshooting. Every item maps to an observable signal and a deterministic check.
---
## 0. First Rule of Triage: Gather Evidence First
Before changing anything, capture essential diagnostic information. The easiest way to do this is using the provided diagnostic capture script.
### Quick Start: Automated Diagnostics Capture
Run the diagnostic script to automatically capture all required information:
```bash
./scripts/capture-diagnostics.sh
```
The script captures:
- Pod list and detailed descriptions for all pods
- Logs from all pods (current and previous if restarted)
- Kubernetes events
- Ingress resources and detailed configurations
- Service and endpoint information
- Node information and resource usage
- ALB target group health (if AWS CLI is configured)
**Configuration via environment variables:**
- `NAMESPACE` - Kubernetes namespace (default: `langsmith`)
- `LOG_TAIL` - Number of log lines to capture per pod (default: `200`)
- `EVENTS_TAIL` - Number of events to capture (default: `50`)
- `OUTPUT_DIR` - Directory for diagnostic output (default: `./diagnostics`)
- `AWS_REGION` - AWS region for ALB queries (default: `us-west-2`)
**Example with custom configuration:**
```bash
NAMESPACE=langsmith-prod LOG_TAIL=500 OUTPUT_DIR=./prod-diagnostics ./scripts/capture-diagnostics.sh
```
The script creates a timestamped directory with all diagnostic information and a summary file. All output is saved for later analysis.
### Manual Capture (Alternative)
If you prefer to capture diagnostics manually, ensure you collect:
- `kubectl get pods -n langsmith -o wide`
- `kubectl describe pod <POD> -n langsmith` (for each pod)
- `kubectl logs <POD> -n langsmith --tail=200` (for each pod)
- `kubectl get events -n langsmith --sort-by=.lastTimestamp | tail -50`
- Ingress/ALB status:
  - `kubectl get ingress -n langsmith` (or your ingress resource type)
  - `kubectl describe ingress <INGRESS> -n langsmith`
- If AWS-managed:
  - ALB target group health (healthy/unhealthy + reason)
Capturing this information ensures you address the actual root cause rather than symptoms, making troubleshooting more efficient and effective.
---
## 1. The Deployment “Works” But UI Is Not Reachable
### Symptom
- DNS resolves but browser times out
- Browser shows `502/503`
- ALB exists but shows no healthy targets
### Likely Causes
- Ingress misconfigured
- Service port mismatch
- Pod readiness failing (so targets never become healthy)
- Security group / NACL blocks
### Checks
- `kubectl get ingress -n langsmith -o yaml`
- `kubectl get svc -n langsmith`
- `kubectl describe svc <SERVICE> -n langsmith`
- `kubectl get endpoints -n langsmith`
- `kubectl get pods -n langsmith`
- Inspect readiness:
  - `kubectl describe pod <POD> -n langsmith | sed -n '/Readiness/,/Conditions/p'`
### Fixes (Common)
- Ensure ingress points to the correct service + port
- Ensure service selectors match pod labels
- Fix readiness probe failures before touching ALB
- Confirm ALB security group allows inbound 443 and node security group allows target traffic
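
The selector and readiness checks above can be collapsed into one question: does the Service have any ready endpoint addresses? The following is a minimal sketch, assuming `kubectl` access. The helper name `svc_has_endpoints` is hypothetical.

```bash
#!/usr/bin/env bash
# Report whether a Service has ready endpoint addresses.
# No addresses usually means a selector mismatch or failing readiness probes.
svc_has_endpoints() {
  local svc="$1" ns="${2:-langsmith}" addrs
  addrs=$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$addrs" ]; then
    echo "ready: $addrs"
  else
    echo "no ready endpoints for $svc"
    return 1
  fi
}
```

If this reports no ready endpoints, the ALB cannot have healthy targets either, so fix the Service or the pods first.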
---
## 2. Pods CrashLoopBackOff Immediately
### Symptom
- Pods oscillate between `CrashLoopBackOff` and `Running`
- Logs show immediate exit
### Likely Causes
- Missing or invalid secrets
- DB/Redis/ClickHouse connection failure
- Misconfigured required env vars
### Checks
- `kubectl logs <POD> -n langsmith --previous --tail=200`
- `kubectl describe pod <POD> -n langsmith` (look for env var injection and secret refs)
- Confirm secrets exist:
  - `kubectl get secret -n langsmith`
- Confirm external connectivity from inside cluster:
  - Launch a temporary debug pod and test TCP connectivity to DB hosts/ports
### Fixes (Common)
- Correct secret names/keys referenced in Helm values
- Verify DB hostnames and ports (RDS endpoints, Redis endpoints)
- Fix network policy / security groups if connections time out
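
The temporary debug pod mentioned in the checks might look like the sketch below. It assumes the public `busybox:1.36` image is pullable from your cluster and that its `nc` supports `-z` and `-w`. The helper name `tcp_check` and the pod name are hypothetical.

```bash
#!/usr/bin/env bash
# Test TCP reachability from inside the cluster using a throwaway pod.
# Assumption: busybox is pullable and its nc build supports -z and -w.
tcp_check() {
  local host="$1" port="$2" ns="${3:-langsmith}"
  kubectl run "net-debug-$$" -n "$ns" --rm -i --restart=Never \
    --image=busybox:1.36 --command -- nc -z -w 5 "$host" "$port"
}
```

Usage: `tcp_check <rds-endpoint> 5432`. A timeout generally points at security groups or network policy, while an immediate refusal suggests nothing is listening on that port.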
---
## 3. Everything Is Running, But “First Successful Trace” Fails
### Symptom
- UI loads
- SDK calls fail (401/403/404) or traces never appear
- Client sees timeouts or 5xx
### Likely Causes
- Wrong endpoint (`LANGSMITH_ENDPOINT`) or wrong path
- Auth mismatch (token vs SSO)
- Ingestion path failing due to ClickHouse or Redis issues
- ALB health is fine but app errors on ingest
### Checks
- From client machine:
  - Confirm endpoint resolves and responds (TLS + HTTP status)
- In cluster logs:
  - Search logs of the API/ingestion service for auth or write errors
- Check ClickHouse health:
  - Look for write failures, memory pressure, disk pressure
- Check Redis:
  - Look for connection errors or queue backlog signals (if exposed)
### Fixes (Common)
- Ensure client is using the correct base URL and auth method
- Regenerate token / verify permissions
- Fix ClickHouse sizing or disk throughput issues if writes fail
- Fix Redis connectivity if queues are used for ingest
---
## 4. ALB Exists But Targets Are “Unhealthy”
### Symptom
- ALB target group shows all targets unhealthy
- UI returns `503` even though pods are running
### Likely Causes
- Readiness probe failing
- Target group health check path/port mismatch
- Service isn't exposing the expected port
- Pods are running but not listening
### Checks
- `kubectl describe pod <POD> -n langsmith` (readiness probe results)
- `kubectl get svc -n langsmith -o yaml`
- Confirm the container port aligns with service targetPort
- Confirm health check path matches what the service actually serves
### Fixes (Common)
- Correct ingress annotations / health check settings
- Fix readiness probe configuration or dependencies causing readiness to fail
- Align service ports with actual container ports
---
## 5. DB Connectivity Failures (PostgreSQL)
### Symptom
- App logs show:
  - authentication failures
  - connection refused
  - timeout
  - “could not translate host name”
- App won't start or fails on request
### Likely Causes
- Wrong credentials
- Security group blocks EKS to RDS
- RDS not in the right subnets or routing broken
- DNS/resolution issues inside cluster
### Checks
- Validate the RDS endpoint and port
- Confirm security groups allow inbound from EKS node group / pod CIDR (depending on setup)
- Test connectivity from a debug pod:
  - DNS resolution
  - TCP connect to `<rds-endpoint>:5432`
### Fixes (Common)
- Correct creds in secrets
- Fix SG rules
- Ensure private subnets have proper routing and NAT where required
- Ensure RDS is reachable from EKS VPC/subnets
---
## 6. Redis Connectivity Failures
### Symptom
- Logs show Redis connection errors/timeouts
- Background jobs stall (if used)
- Ingestion or async tasks fail
### Likely Causes
- Wrong endpoint/port
- Security group blocks EKS to ElastiCache
- Auth mismatch (if Redis auth enabled)
### Checks
- Confirm ElastiCache endpoint and port
- Test TCP connectivity from debug pod
- Check whether Redis auth is enabled and whether Helm values match
### Fixes (Common)
- Fix endpoint in values
- Fix security group rules
- Align auth config
---
## 7. ClickHouse Problems (Most Common Real Root Cause)
### 7.1 ClickHouse OOM / Memory Pressure
**Symptom**
- ClickHouse pod restarts
- OOMKilled events
- Trace writes fail or become slow
**Likely Cause**
- ClickHouse undersized (4 vCPU / 16GB RAM used for a real workload)
- Memory limits too tight
- Query pressure
**Checks**
- `kubectl describe pod <clickhouse-pod> -n langsmith` (look for OOMKilled)
- `kubectl logs <clickhouse-pod> -n langsmith --tail=200`
- `kubectl top pod <clickhouse-pod> -n langsmith`
**Fixes**
- Move to **8 vCPU / 32GB RAM** baseline
- Increase memory limits/requests
- Reduce concurrent ingest/query load
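
To confirm an OOM kill without reading the full `describe` output, the last termination reason can be read directly from the pod status. This is a sketch assuming `kubectl` access. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print the reason each container in a pod last terminated (for example OOMKilled).
last_termination_reasons() {
  local pod="$1" ns="${2:-langsmith}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Return success if any container in the pod was OOM-killed on its last restart.
was_oom_killed() {
  last_termination_reasons "$@" | grep -q 'OOMKilled'
}
```

Usage: `was_oom_killed <clickhouse-pod> && echo "raise memory limits"`.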
---
### 7.2 ClickHouse Disk / IO Throughput Issues
**Symptom**
- Latency spikes
- Writes time out
- ClickHouse logs mention slow merges / IO waits
**Likely Cause**
- Slow storage class
- Inadequate IOPS/throughput
- Disk nearing capacity
**Checks**
- Confirm PV storage class and performance characteristics
- Check disk usage in ClickHouse pod
- Review ClickHouse logs for merge pressure / IO wait
**Fixes**
- Use SSD-backed storage with sufficient IOPS/throughput
- Increase volume size
- Move ClickHouse to a dedicated node group / better instance type
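
Disk usage inside the ClickHouse pod can be checked with `df`. The sketch below assumes data lives under `/var/lib/clickhouse` (the ClickHouse default; your chart may mount it elsewhere) and that the container image ships a POSIX `df`. The helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Print the used-space percentage of a path inside a pod.
# Assumption: the data path is /var/lib/clickhouse unless overridden.
pod_disk_used_pct() {
  local pod="$1" path="${2:-/var/lib/clickhouse}" ns="${3:-langsmith}"
  kubectl exec "$pod" -n "$ns" -- df -P "$path" \
    | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}
```

Usage: `pod_disk_used_pct <clickhouse-pod>`. Values that stay high leave little headroom for merges, which need temporary extra space.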
---
### 7.3 ClickHouse Not Persistent (Data Loss Risk)
**Symptom**
- ClickHouse redeploy loses data
- Traces disappear after restart
**Likely Cause**
- No persistent volume attached
- StatefulSet misconfigured
**Checks**
- Confirm PVC exists and is bound:
  - `kubectl get pvc -n langsmith`
- Confirm ClickHouse uses that PVC
**Fixes**
- Attach PVC and ensure StatefulSet mounts it
- Do not treat ClickHouse as stateless
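
A quick way to spot persistence problems is to list any PVC that is not `Bound`. This is a minimal sketch assuming `kubectl` access. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# List PVCs in the namespace that are not in the Bound phase.
unbound_pvcs() {
  local ns="${1:-langsmith}"
  kubectl get pvc -n "$ns" --no-headers \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase \
    | awk '$2 != "Bound" { print $1 }'
}
```

Empty output means every claim is bound. It does not prove that ClickHouse mounts the claim, so still check the StatefulSet volume mounts.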
---
## 8. Kubernetes Scheduling Issues
### Symptom
- Pods stuck in `Pending`
- Events show “insufficient cpu/memory”
- ClickHouse never schedules
### Likely Causes
- Cluster too small
- Node group instance types too small
- Taints/affinity constraints prevent scheduling
### Checks
- `kubectl describe pod <POD> -n langsmith` (look at scheduling events)
- `kubectl get nodes -o wide`
- Check taints:
  - `kubectl describe node <NODE> | sed -n '/Taints/,/Conditions/p'`
### Fixes
- Increase node group size
- Use larger instance types
- Remove/adjust taints and affinities
- Ensure ClickHouse has a node that can fit **8 vCPU / 32GB RAM allocatable**
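
Allocatable capacity is reported in Kubernetes quantity notation (`7910m`, `32425940Ki`), which is awkward to compare against a vCPU/GB target by eye. The helpers below are a sketch for normalizing those values. They assume whole-number or millicore CPU values and memory in Ki, Mi, or Gi, and all three function names are hypothetical.

```bash
#!/usr/bin/env bash
# Convert a Kubernetes CPU quantity ("8" or "7910m") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity in Ki, Mi, or Gi to whole GiB (rounded down).
mem_to_gib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1048576 )) ;;
    *Mi) echo $(( ${1%Mi} / 1024 )) ;;
    *Gi) echo "${1%Gi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# List allocatable CPU and memory per node.
node_allocatable() {
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
}
```

Keep in mind that allocatable is lower than the instance's nominal size because the kubelet reserves resources, so a nominal 8 vCPU / 32 GiB instance will report somewhat less than that.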
---
## 9. TLS / Certificate Issues
### Symptom
- Browser warnings
- Client SDK fails TLS handshake
- Mixed content or redirect loops
### Likely Causes
- Wrong ACM cert attached
- Wrong DNS name on cert
- HTTP/HTTPS mismatch
### Checks
- Confirm ALB listener is HTTPS
- Confirm cert CN/SAN includes your DNS name
- Confirm DNS record points to the correct ALB
### Fixes
- Attach correct cert
- Fix DNS record
- Enforce HTTPS redirects intentionally (not accidentally)
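
The certificate checks above can be done from any machine that reaches the endpoint. The sketch below assumes OpenSSL 1.1.1 or newer (needed for `-ext`). The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Print subject, expiry, and SANs of the certificate served for a hostname.
# Assumption: OpenSSL 1.1.1+ is installed locally.
cert_info() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "${host}:${port}" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -enddate -ext subjectAltName
}
```

Usage: `cert_info langsmith.<your-domain>`. If your DNS name is missing from the SAN list, the wrong ACM certificate is most likely attached to the listener.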
---
## 10. “It Worked Yesterday” Failures (The Dangerous Ones)
### Symptom
- Random 5xx
- Slow UI
- Traces intermittently missing
### Likely Causes
- Resource pressure (CPU throttling / memory pressure)
- ClickHouse disk pressure or merge backlog
- Redis saturation
- Node churn / autoscaling issues
### Checks
- `kubectl top pods -n langsmith`
- Pod restarts:
  - `kubectl get pods -n langsmith --sort-by='.status.containerStatuses[0].restartCount'`
- Node events and scaling activity
- DB metrics (RDS CPU/connections; Redis CPU/memory; ClickHouse memory/disk)
### Fixes
- Add capacity (scale nodes)
- Increase ClickHouse resources or improve disk class
- Increase Redis tier if saturated
- Tune autoscaler limits (don't let it starve the cluster)
---
## 11. What to Include in a Support Request (If You Must Escalate)
If you open a ticket, include:
- Reference path confirmation:
  - “Deployed via reference architecture + terraform + helm”
  - repo SHAs / chart versions
- Current cluster state:
  - `kubectl get pods -n langsmith -o wide`
  - relevant `describe` output
  - last 200 lines of logs from failing pods
- External dependencies:
  - Postgres type/version (RDS/Aurora, PG version)
  - Redis type/version
  - ClickHouse model (external vs in-cluster) + sizing
- ALB target health status and error reason
Providing this information upfront enables faster resolution. If diagnostics are incomplete, the first step will be to collect the necessary diagnostic data.
---
## 12. Add to This Guide (How)
Only add entries that:
- Came from a real failure
- Include a deterministic check
- Include a fix that is repeatable
+303
@@ -0,0 +1,303 @@
# LangSmith Self-Hosted on AWS — Deployment Walkthrough (P0)
**Goal:** Get from zero → running LangSmith SH → first successful trace → basic health validation.
**Assumption:** You passed [`PREFLIGHT.md`](./PREFLIGHT.md). If not, stop and do that first.
This walkthrough is intentionally opinionated and linear. Following it step-by-step ensures you stay on the reference path and can receive full support.
---
## 0. Inputs You Must Decide Up Front
Pick these *before* you touch Terraform:
- **AWS Region:** `us-west-2` (example — pick one and stick to it)
- **Environment name:** `dev` / `staging` / `prod` (do not share resources across envs)
- **DNS name:** `langsmith.<your-domain>`
- **Exposure model:** Public (ALB) or Private-only (VPN/PrivateLink)
- **Auth model:** Token-based (P0) or OIDC/SSO (P1 unless already standard internally)
- **Data store model:**
  - Postgres: RDS/Aurora (recommended)
  - Redis: ElastiCache (recommended)
  - ClickHouse: Externally managed (preferred) or in-cluster (allowed)
Write these in a `deploy/ENV.md` file for your own sanity.
---
## 1. Clone Repos and Pin Versions
You are building an enablement path. That means **pinning** matters.
- Clone:
  - `https://github.com/langchain-ai/terraform`
  - `https://github.com/langchain-ai/helm`
- Record:
  - Terraform repo commit SHA
  - Helm repo commit SHA or chart version
- Do not “float” versions for the reference deployment.
> Reproducibility is essential for effective enablement. If you cannot reproduce a deployment later, the enablement process has not been fully captured.
---
## 2. Terraform: Provision AWS Infrastructure
### 2.1 Configure Terraform State
- Use S3 backend + DynamoDB lock (recommended).
- Ensure state is **unique per environment**.
### 2.2 Apply Infrastructure
Provision (at minimum):
- VPC + subnets (public for ALB, private for nodes/data)
- EKS cluster + managed node groups
- RDS Postgres (14+)
- ElastiCache Redis
- S3 bucket for artifacts
- Security groups and IAM roles/policies
- (Optional) Route53 hosted zone / record scaffolding
**Hard requirement:** Ensure the EKS node groups provide at least:
- **16 vCPU / 64GB RAM** allocatable capacity total
- **ClickHouse capacity** if in-cluster:
  - One node with **8 vCPU / 32GB RAM** allocatable
### 2.3 Terraform Verification Gates (Stop if any fail)
- [ ] `aws eks describe-cluster` shows `ACTIVE`
- [ ] Worker nodes in private subnets can reach the internet (NAT)
- [ ] RDS reachable from EKS subnets/security groups
- [ ] Redis reachable from EKS subnets/security groups
- [ ] S3 bucket exists and IAM access path is defined (IRSA preferred)
---
## 3. Kubernetes: Connect and Validate the Cluster
### 3.1 Connect to the Cluster
- Update kubeconfig:
  - `aws eks update-kubeconfig --region <REGION> --name <CLUSTER_NAME>`
- Confirm:
  - `kubectl get nodes`
### 3.2 Install/Validate Required Add-ons
You must have:
- Metrics Server
- Cluster Autoscaler
Verification:
- `kubectl top nodes` returns metrics
- Autoscaler is running and has permissions
### 3.3 Create a Namespace
Create a dedicated namespace, e.g.:
- `langsmith`
### 3.4 Ingress Gate — Prove ALB Works Before Installing LangSmith
Complete this validation **before** Helm-installing LangSmith. Many deployment issues initially attributed to LangSmith are actually ingress, controller, or subnet-tagging configuration problems.
#### 3.4.1 Deploy a tiny test app
Deploy any minimal HTTP echo service into a test namespace (or the `langsmith` namespace). Confirm:
- `kubectl get pods` shows it running
- `kubectl get svc` shows endpoints
#### 3.4.2 Create a test Ingress that provisions an ALB
Create an Ingress pointing at the test service.
Your success criteria are binary:
- [ ] An **ALB** is created
- [ ] A target group is created
- [ ] Targets become **healthy**
- [ ] You can hit the endpoint and get a response over **HTTPS**
#### 3.4.3 If this fails, stop
Do not proceed to LangSmith until this gate passes.
When it fails, the first places to look are:
- Kubernetes events on the Ingress
- AWS Load Balancer Controller logs
- ALB target group health reasons in the AWS console
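
Two of the gate criteria, "an ALB is created" and "you get a response over HTTPS", can be checked from the command line. The following is a sketch assuming `kubectl` and `curl` are available. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print the ALB hostname assigned to an Ingress (empty until provisioned).
ingress_hostname() {
  local ing="$1" ns="${2:-langsmith}"
  kubectl get ingress "$ing" -n "$ns" \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# Print the HTTP status code returned over HTTPS by a hostname.
https_status() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/"
}
```

Pass `https_status` the DNS name that appears on your certificate rather than the raw ALB hostname, otherwise certificate validation will fail even when the ALB itself is healthy.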
> If you are not using ALB for ingress, you are operating outside the P0 reference path.
---
## 4. Prepare Dependencies and Secrets
### 4.1 Collect Required Connection Info
You need:
- Postgres host/port/db/user/password
- Redis host/port (and auth if enabled)
- ClickHouse endpoint/user/password (or in-cluster config)
- S3 bucket name and region
### 4.2 Store Secrets (Do Not Put in Git)
Preferred: AWS Secrets Manager + External Secrets integration.
At minimum for P0 enablement:
- Keep secrets out of repo
- Inject into Kubernetes securely (ExternalSecrets/CSI/secure env)
**Stop condition:** Never commit passwords or secrets into `values.yaml` or version control. Use a secrets management solution instead.
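
A lightweight guard against this stop condition is to scan a values file for keys that look like inline credentials before committing. The sketch below is a heuristic only: the key pattern is illustrative, not exhaustive, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Flag lines in a values file that look like inline credentials.
# Heuristic: keys ending in password/passwd/token/apikey with a non-empty value.
scan_for_inline_secrets() {
  local file="$1"
  if grep -Ein '(password|passwd|token|api[_-]?key):[[:space:]]*[^[:space:]#]' "$file"; then
    echo "possible inline secret in $file" >&2
    return 1
  fi
}
```

Usage: `scan_for_inline_secrets values.yaml || exit 1`, for example from a pre-commit hook. Treat a clean result as a sanity check rather than proof that the file is safe.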
---
## 5. Helm: Install LangSmith
### 5.1 Choose the Values Strategy
You should have:
- `values.yaml` (non-secret config)
- `secrets.yaml` OR external secrets (secret values only, not committed)
### 5.2 Configure Required Values
Your Helm values must define:
- External Postgres connection
- External Redis connection
- ClickHouse configuration (external or in-cluster)
- S3 artifact storage (strongly recommended)
- Ingress configuration (ALB + TLS)
### 5.3 Install/Upgrade
- Install the chart into the `langsmith` namespace.
- Use `helm upgrade --install` (idempotent).
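
An idempotent, version-pinned install might be wrapped as follows. This is a sketch: the release name `langsmith`, the function name, and the 15-minute timeout are assumptions, and the chart reference and version are placeholders you supply from the SHAs recorded in step 1.

```bash
#!/usr/bin/env bash
# Install or upgrade a release idempotently from a pinned chart version.
# Arguments: chart reference, chart version, values file, optional namespace.
install_langsmith() {
  local chart="$1" version="$2" values="$3" ns="${4:-langsmith}"
  helm upgrade --install langsmith "$chart" \
    --namespace "$ns" --version "$version" \
    -f "$values" --wait --timeout 15m
}
```

Running the same command twice with unchanged inputs should produce no effective change, which is a simple way to confirm the install is idempotent.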
### 5.4 Helm Verification Gates (Stop if any fail)
- [ ] All pods in `langsmith` namespace reach `Running` or expected steady state
- [ ] No CrashLoopBackOff
- [ ] Services have endpoints
- [ ] Ingress is created and gets an ALB hostname/address
Commands you should run (conceptually):
- `kubectl get pods -n langsmith`
- `kubectl describe pod <...> -n langsmith`
- `kubectl get svc -n langsmith`
- `kubectl get ingress -n langsmith` (or equivalent ingress resource)
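
The first two gates can be reduced to a single filter over the pod list. This is a minimal sketch assuming `kubectl` access. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# List pods that are neither Running nor Completed, with their status.
pods_not_ready() {
  local ns="${1:-langsmith}"
  kubectl get pods -n "$ns" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Empty output means the gate passes on pod phase. A pod can still be `Running` while failing readiness, so also check the READY column.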
---
## 6. Ingress + DNS: Make It Reachable
### 6.1 TLS
- Ensure the ALB listener is HTTPS
- Ensure cert is valid (ACM recommended)
### 6.2 DNS
- Create a Route53 record:
  - `langsmith.<domain>` → ALB DNS name
### 6.3 Reachability Gate
- [ ] You can load the LangSmith UI at `https://langsmith.<domain>`
- [ ] Auth behaves as intended (token login or SSO)
---
## 7. “First Successful Trace” (The Real Success Condition)
A deployment is not “done” until traces flow.
### 7.1 Create an API Key / Token (if applicable)
- Create the token per your configured auth model.
- Store it securely.
### 7.2 Send a Minimal Trace
From a laptop or CI runner with egress to the endpoint:
- Configure `LANGSMITH_ENDPOINT`
- Configure auth (`LANGSMITH_API_KEY` or equivalent)
- Run a minimal trace-producing script (LangChain example or direct API).
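
Before running the trace script, it helps to fail fast if the client configuration is incomplete. The helper below is a sketch that relies on Bash indirect expansion. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Fail early if any named environment variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `require_env LANGSMITH_ENDPOINT LANGSMITH_API_KEY || exit 1`.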
### 7.3 Trace Gate (Stop if fails)
- [ ] A trace appears in the LangSmith UI
- [ ] Trace includes at least one run/span
- [ ] No ingestion errors in logs
If this fails, do not proceed to operational tasks. Fix ingestion first to ensure the system is functioning correctly.
---
## 8. Basic Health Validation (P0 Ops Readiness)
### 8.1 What “Healthy” Means (Minimum)
- UI loads reliably
- API responds
- DB connections stable
- No sustained error logs
- ClickHouse writes succeed
- Redis queues not stuck
### 8.2 Validate Logs
Check:
- LangSmith app logs for errors
- ClickHouse logs for disk/memory pressure
- Ingress/ALB logs (4xx/5xx spikes)
### 8.3 Validate Resource Pressure
- `kubectl top pods -n langsmith`
- Look for:
  - OOMKills
  - CPU throttling
  - Persistent volume saturation
---
## 9. Backup & Restore (P0 Expectations)
For P0 enablement, you must at least:
- Confirm RDS backups are enabled
- Confirm ClickHouse persistence strategy is defined
- Confirm S3 bucket lifecycle/versioning policy is intentional
You do not need to execute a restore yet, but you must document how it would be done.
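
The first and third confirmations can be read straight from the AWS CLI. This is a sketch assuming a plain RDS instance. Aurora manages backups at the cluster level, so it would need a cluster-level query instead. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print the automated backup retention (days) for an RDS instance; 0 means disabled.
rds_backup_retention_days() {
  aws rds describe-db-instances --db-instance-identifier "$1" --region "$2" \
    --query 'DBInstances[0].BackupRetentionPeriod' --output text
}

# Print the versioning status of an S3 bucket (Enabled, Suspended, or None if never set).
s3_versioning_status() {
  aws s3api get-bucket-versioning --bucket "$1" --query 'Status' --output text
}
```

Record both values alongside your restore notes so the backup posture is documented, not just assumed.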
---
## 10. Common Failure Points (Fast Triage)
If deployment fails, the usual culprits are:
1. **Networking / Security Groups**
   - EKS can't reach Postgres/Redis/ClickHouse
2. **ClickHouse undersized or slow disk**
   - OOM, high latency, ingestion failures
3. **Ingress misconfiguration**
   - ALB created but no healthy targets
4. **Auth mismatch**
   - UI loads but API calls fail
5. **Secrets handling**
   - Bad credentials injected, pods loop
When something breaks, capture:
- `kubectl describe`
- pod logs
- DB connection test results
- ALB target health
This data becomes your failure-mode catalog later.
---
## 11. “Done” Definition (P0)
You are done only when:
- [ ] Terraform applied cleanly and is reproducible
- [ ] Helm install is idempotent (`upgrade --install` works)
- [ ] UI reachable via HTTPS on your chosen DNS
- [ ] First successful trace appears in the UI
- [ ] Basic health checks are green (no crash loops, stable DB connectivity)
If any box isn't checked, continue working through the checklist until all items are complete to ensure a fully functional reference deployment.
---
## Appendix: What to Capture During Your First Real Deployment
As you run this the first time, log:
- Where you hesitated
- What you had to guess
- What you looked up
- What failed and how you fixed it
Those are the inputs for:
- `TROUBLESHOOTING.md`
- “Top failure modes”
- Future certification labs
Binary file not shown.


+268
@@ -0,0 +1,268 @@
#!/usr/bin/env bash
# LangSmith Self-Hosted Diagnostics Capture Script
# This script captures essential diagnostic information for troubleshooting
# LangSmith Self-Hosted deployments on AWS/EKS.
set -euo pipefail
# Configuration via environment variables (with defaults)
NAMESPACE="${NAMESPACE:-langsmith}"
LOG_TAIL="${LOG_TAIL:-200}"
EVENTS_TAIL="${EVENTS_TAIL:-50}"
OUTPUT_DIR="${OUTPUT_DIR:-./diagnostics}"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT_PATH="${OUTPUT_DIR}/${TIMESTAMP}"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Create output directory
mkdir -p "${OUTPUT_PATH}"
echo -e "${GREEN}Capturing diagnostics for namespace: ${NAMESPACE}${NC}"
echo -e "${GREEN}Output directory: ${OUTPUT_PATH}${NC}"
echo ""
# Function to run command and save output
capture_output() {
local description="$1"
local command="$2"
local filename="$3"
echo -e "${YELLOW}Capturing: ${description}${NC}"
if eval "${command}" > "${OUTPUT_PATH}/${filename}" 2>&1; then
echo -e "${GREEN} ✓ Saved to ${filename}${NC}"
else
echo -e "${RED} ✗ Failed to capture ${description}${NC}"
fi
echo ""
}
# Check if kubectl is available
if ! command -v kubectl &> /dev/null; then
echo -e "${RED}Error: kubectl is not installed or not in PATH${NC}"
exit 1
fi
# Check if namespace exists
if ! kubectl get namespace "${NAMESPACE}" &> /dev/null; then
echo -e "${RED}Error: Namespace '${NAMESPACE}' does not exist${NC}"
exit 1
fi
# Capture pod list
capture_output \
"Pod list (wide format)" \
"kubectl get pods -n ${NAMESPACE} -o wide" \
"pods-wide.txt"
# Get list of pods
PODS=$(kubectl get pods -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [ -z "${PODS}" ]; then
echo -e "${YELLOW}No pods found in namespace ${NAMESPACE}${NC}"
echo ""
else
# Capture describe and logs for each pod
for POD in ${PODS}; do
echo -e "${YELLOW}Processing pod: ${POD}${NC}"
# Capture pod description
capture_output \
"Pod description: ${POD}" \
"kubectl describe pod ${POD} -n ${NAMESPACE}" \
"pod-${POD}-describe.txt"
# Capture pod logs
capture_output \
"Pod logs: ${POD} (last ${LOG_TAIL} lines)" \
"kubectl logs ${POD} -n ${NAMESPACE} --tail=${LOG_TAIL}" \
"pod-${POD}-logs.txt"
# Capture previous logs if pod has restarted
if kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].restartCount}' 2>/dev/null | grep -q '[1-9]'; then
capture_output \
"Previous pod logs: ${POD} (last ${LOG_TAIL} lines)" \
"kubectl logs ${POD} -n ${NAMESPACE} --previous --tail=${LOG_TAIL}" \
"pod-${POD}-logs-previous.txt"
fi
done
fi
# Capture events
capture_output \
"Kubernetes events (last ${EVENTS_TAIL} events)" \
"kubectl get events -n ${NAMESPACE} --sort-by=.lastTimestamp | tail -${EVENTS_TAIL}" \
"events.txt"
# Capture ingress status
capture_output \
"Ingress resources" \
"kubectl get ingress -n ${NAMESPACE} -o wide" \
"ingress-list.txt"
# Capture detailed ingress information
INGRESS_RESOURCES=$(kubectl get ingress -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [ -n "${INGRESS_RESOURCES}" ]; then
for INGRESS in ${INGRESS_RESOURCES}; do
capture_output \
"Ingress details: ${INGRESS}" \
"kubectl describe ingress ${INGRESS} -n ${NAMESPACE}" \
"ingress-${INGRESS}-describe.txt"
capture_output \
"Ingress YAML: ${INGRESS}" \
"kubectl get ingress ${INGRESS} -n ${NAMESPACE} -o yaml" \
"ingress-${INGRESS}.yaml"
done
fi
# Capture service status
capture_output \
"Service list" \
"kubectl get svc -n ${NAMESPACE} -o wide" \
"services-list.txt"
# Capture endpoints
capture_output \
"Endpoints" \
"kubectl get endpoints -n ${NAMESPACE}" \
"endpoints.txt"
# Capture service details
SERVICES=$(kubectl get svc -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [ -n "${SERVICES}" ]; then
for SVC in ${SERVICES}; do
capture_output \
"Service details: ${SVC}" \
"kubectl describe svc ${SVC} -n ${NAMESPACE}" \
"svc-${SVC}-describe.txt"
done
fi
# Capture node information
capture_output \
"Node list" \
"kubectl get nodes -o wide" \
"nodes-wide.txt"
# Capture resource usage (if metrics-server is available)
if kubectl top nodes &> /dev/null; then
capture_output \
"Node resource usage" \
"kubectl top nodes" \
"nodes-top.txt"
if [ -n "${PODS}" ]; then
capture_output \
"Pod resource usage" \
"kubectl top pods -n ${NAMESPACE}" \
"pods-top.txt"
fi
else
echo -e "${YELLOW}Metrics Server not available, skipping resource usage metrics${NC}"
echo ""
fi
# Capture PVC information
capture_output \
"Persistent Volume Claims" \
"kubectl get pvc -n ${NAMESPACE}" \
"pvc-list.txt"
# Capture StatefulSets and Deployments
capture_output \
"StatefulSets" \
"kubectl get statefulsets -n ${NAMESPACE} -o wide" \
"statefulsets.txt"
capture_output \
"Deployments" \
"kubectl get deployments -n ${NAMESPACE} -o wide" \
"deployments.txt"
# AWS-specific: ALB target group health (if AWS CLI is available and configured)
if command -v aws &> /dev/null; then
echo -e "${YELLOW}Attempting to capture ALB target group health information...${NC}"
# Try to get ALB information from ingress annotations
if [ -n "${INGRESS_RESOURCES}" ]; then
for INGRESS in ${INGRESS_RESOURCES}; do
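# Resolve the ALB ARN from the Ingress status hostname; the controller does not
# publish the ARN as an Ingress annotation, so look it up by DNS name instead.
ALB_DNS=$(kubectl get ingress "${INGRESS}" -n "${NAMESPACE}" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null || echo "")
ALB_ARN=""
if [ -n "${ALB_DNS}" ]; then
ALB_ARN=$(aws elbv2 describe-load-balancers --region "${AWS_REGION:-us-west-2}" --query "LoadBalancers[?DNSName=='${ALB_DNS}'].LoadBalancerArn" --output text 2>/dev/null || echo "")
fi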
if [ -n "${ALB_ARN}" ]; then
# Extract ALB name from ARN or use ARN directly
echo "ALB ARN: ${ALB_ARN}" > "${OUTPUT_PATH}/alb-${INGRESS}-info.txt"
# Get target groups for this ALB
if aws elbv2 describe-target-groups --load-balancer-arn "${ALB_ARN}" --region "${AWS_REGION:-us-west-2}" &> /dev/null; then
capture_output \
"ALB target groups: ${INGRESS}" \
"aws elbv2 describe-target-groups --load-balancer-arn ${ALB_ARN} --region ${AWS_REGION:-us-west-2}" \
"alb-${INGRESS}-target-groups.json"
# Get target health for each target group
TARGET_GROUPS=$(aws elbv2 describe-target-groups --load-balancer-arn "${ALB_ARN}" --region "${AWS_REGION:-us-west-2}" --query 'TargetGroups[*].TargetGroupArn' --output text 2>/dev/null || echo "")
if [ -n "${TARGET_GROUPS}" ]; then
for TG_ARN in ${TARGET_GROUPS}; do
capture_output \
"Target group health: ${TG_ARN}" \
"aws elbv2 describe-target-health --target-group-arn ${TG_ARN} --region ${AWS_REGION:-us-west-2}" \
"alb-${INGRESS}-target-health-$(basename ${TG_ARN}).json"
done
fi
fi
fi
done
fi
echo ""
else
echo -e "${YELLOW}AWS CLI not available, skipping ALB target group health capture${NC}"
echo -e "${YELLOW}To capture ALB information, install AWS CLI and configure credentials${NC}"
echo ""
fi
# Create summary file
SUMMARY_FILE="${OUTPUT_PATH}/summary.txt"
{
echo "LangSmith Self-Hosted Diagnostics Summary"
echo "========================================"
echo "Timestamp: ${TIMESTAMP}"
echo "Namespace: ${NAMESPACE}"
echo "Output Directory: ${OUTPUT_PATH}"
echo ""
echo "Configuration:"
echo " LOG_TAIL: ${LOG_TAIL}"
echo " EVENTS_TAIL: ${EVENTS_TAIL}"
echo ""
echo "Captured Information:"
echo " - Pod list and descriptions"
echo " - Pod logs (current and previous if restarted)"
echo " - Kubernetes events"
echo " - Ingress resources and details"
echo " - Services and endpoints"
echo " - Node information"
echo " - Resource usage (if metrics-server available)"
echo " - Persistent Volume Claims"
echo " - StatefulSets and Deployments"
if command -v aws &> /dev/null; then
echo " - ALB target group health (if available)"
fi
echo ""
echo "Files captured:"
find "${OUTPUT_PATH}" -type f \( -name "*.txt" -o -name "*.yaml" -o -name "*.json" \) | sort | sed 's|^| |'
} > "${SUMMARY_FILE}"
echo -e "${GREEN}✓ Diagnostics capture complete!${NC}"
echo -e "${GREEN}Summary saved to: ${SUMMARY_FILE}${NC}"
echo ""
echo "To view the summary:"
echo " cat ${SUMMARY_FILE}"
echo ""
echo "All diagnostic files are in: ${OUTPUT_PATH}"
+514
@@ -0,0 +1,514 @@
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Deny pattern regex (broadened to catch all AWS denial patterns)
DENY_RE='AccessDenied|AccessDeniedException|UnauthorizedOperation|not authorized|NotAuthorized|is not authorized'
# Parse command line arguments
SKIP_RESOURCE_TESTS=false
NON_INTERACTIVE=false
CREATE_TEST_RESOURCES=false
ACM_DOMAIN=""
while [[ $# -gt 0 ]]; do
case $1 in
-s|--skip_resource_tests|--skip_checks)
SKIP_RESOURCE_TESTS=true
shift
;;
-y|--yes)
NON_INTERACTIVE=true
shift
;;
--create-test-resources)
CREATE_TEST_RESOURCES=true
shift
;;
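--domain)
# Guard against a missing value, which would trip "set -u" on "$2"
if [[ $# -lt 2 ]]; then
printf "Error: --domain requires a value\n" >&2
exit 1
fi
ACM_DOMAIN="$2"
shift 2
;;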
*)
printf "Unknown option: %s\n" "$1"
printf "Usage: %s [-s|--skip_resource_tests] [-y|--yes] [--create-test-resources] [--domain <domain>]\n" "$0"
exit 1
;;
esac
done
# Check for CI environment
if [ "${CI:-false}" = "true" ]; then
NON_INTERACTIVE=true
fi
# Function to print colored output
info() {
printf "${BLUE}[INFO]${NC} %s\n" "$1"
}
success() {
printf "${GREEN}[SUCCESS]${NC} %s\n" "$1"
}
warning() {
printf "${YELLOW}[WARNING]${NC} %s\n" "$1"
}
error() {
printf "${RED}[ERROR]${NC} %s\n" "$1"
}
# Function to check for access denied patterns
check_denied() {
local output="$1"
if echo "$output" | grep -Eqi "$DENY_RE"; then
return 0 # Access denied found
fi
return 1 # No access denied
}
# Check if AWS CLI is installed
if ! command -v aws &> /dev/null; then
error "AWS CLI is not installed. Please install it first."
exit 1
fi
# Safety banner
printf "\n"
info "=== LangSmith AWS Preflight Check ==="
info "Default mode: READ-ONLY (no resource creation)"
info "Use --create-test-resources to test resource creation"
info "No modifications to existing resources will be made."
info "Temporary test resources may be created only with --create-test-resources."
printf "\n"
# Check AWS credentials
info "Checking AWS credentials..."
if ! aws sts get-caller-identity &> /dev/null; then
error "Not logged into AWS. Please run 'aws configure' or set AWS credentials."
exit 1
fi
# Get AWS account info
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
USER_ARN=$(aws sts get-caller-identity --query Arn --output text)
# Better region handling
REGION=$(aws configure get region 2>/dev/null || true)
REGION=${REGION:-${AWS_DEFAULT_REGION:-us-west-2}}
info "AWS Account ID: $ACCOUNT_ID"
info "User ARN: $USER_ARN"
info "Current region: $REGION"
# Confirm region (non-interactive mode skips this)
if [ "$NON_INTERACTIVE" = false ]; then
printf "\n"
read -p "Is the region '$REGION' correct? (y/n): " -n 1 -r
printf "\n"
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
error "Please set the correct region using 'aws configure set region <region>' or export AWS_DEFAULT_REGION"
exit 1
fi
else
info "Non-interactive mode: using region '$REGION'"
fi
# Check for sandbox account indicators (warning only, no prompt)
info "Checking for sandbox account restrictions..."
if [[ "$ACCOUNT_ID" =~ ^[0-9]{12}$ ]]; then
# Check if account has restrictions (common sandbox patterns)
ALIASES=$(aws iam list-account-aliases --query 'AccountAliases' --output text 2>/dev/null || echo "")
if echo "$ALIASES" | grep -Eqi "sandbox|test|dev"; then
warning "Account alias suggests this might be a sandbox/test account: $ALIASES"
warning "Please verify this account is not restricted by SCPs or other policies"
fi
else
warning "Account ID format is unusual. Please verify this is not a restricted account."
fi
# Function to cleanup resources on exit (with retry logic)
cleanup() {
info "Cleaning up test resources..."
# Cleanup in correct order: SG -> Subnet -> VPC -> IAM role
# Retry logic for eventual consistency
# Delete security group (with retry)
if [ -n "${TEST_SG_ID:-}" ]; then
for i in {1..3}; do
if aws ec2 delete-security-group --group-id "$TEST_SG_ID" --region "$REGION" 2>/dev/null; then
success "Security group deleted: $TEST_SG_ID"
break
fi
if [ $i -lt 3 ]; then
sleep 2
fi
done
fi
# Delete subnet (with retry)
if [ -n "${TEST_SUBNET_ID:-}" ]; then
for i in {1..3}; do
if aws ec2 delete-subnet --subnet-id "$TEST_SUBNET_ID" --region "$REGION" 2>/dev/null; then
success "Subnet deleted: $TEST_SUBNET_ID"
break
fi
if [ $i -lt 3 ]; then
sleep 2
fi
done
fi
# Delete VPC (with retry)
if [ -n "${TEST_VPC_ID:-}" ]; then
for i in {1..3}; do
if aws ec2 delete-vpc --vpc-id "$TEST_VPC_ID" --region "$REGION" 2>/dev/null; then
success "VPC deleted: $TEST_VPC_ID"
break
fi
if [ $i -lt 3 ]; then
sleep 2
fi
done
fi
# Delete IAM role (IAM is global, no --region, with retry)
# NOTE: If you attach policies to the role, you must detach them before deleting
if [ -n "${TEST_ROLE_NAME:-}" ]; then
for i in {1..3}; do
if aws iam delete-role --role-name "$TEST_ROLE_NAME" 2>/dev/null; then
success "IAM role deleted: $TEST_ROLE_NAME"
break
fi
if [ $i -lt 3 ]; then
sleep 2
fi
done
fi
}
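# Sketch (not wired into cleanup above): the four delete loops share one shape,
# which a hypothetical helper could capture. example_retry is an illustrative
# name; it runs a command up to N times, sleeping between attempts, and returns
# non-zero if every attempt fails so the caller can warn about leftovers.
example_retry() {
local attempts="$1" delay="$2" n=1
shift 2
while [ "$n" -le "$attempts" ]; do
if "$@" 2>/dev/null; then
return 0
fi
if [ "$n" -lt "$attempts" ]; then
sleep "$delay"
fi
n=$((n + 1))
done
return 1
}
# Usage: example_retry 3 2 aws ec2 delete-vpc --vpc-id "$TEST_VPC_ID" --region "$REGION" \
#        || warning "Could not delete VPC $TEST_VPC_ID; remove it manually"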
# Read-only permission checks (always run)
info "Running read-only permission checks..."
# Test EC2 permissions (needed for EKS and ALB controller)
info "Testing EC2 permissions (VPC, subnets, availability zones)..."
EC2_VPC_OUTPUT=$(aws ec2 describe-vpcs --region "$REGION" --max-items 1 2>&1 || true)
EC2_SUBNET_OUTPUT=$(aws ec2 describe-subnets --region "$REGION" --max-items 1 2>&1 || true)
EC2_AZ_OUTPUT=$(aws ec2 describe-availability-zones --region "$REGION" 2>&1 || true)
if check_denied "$EC2_VPC_OUTPUT" || check_denied "$EC2_SUBNET_OUTPUT" || check_denied "$EC2_AZ_OUTPUT"; then
error "Failed EC2 permission check. Check IAM permissions for ec2:DescribeVpcs, ec2:DescribeSubnets, ec2:DescribeAvailabilityZones"
exit 1
fi
success "EC2 permissions verified"
# Test EKS permissions by describing a cluster that cannot exist:
# a ResourceNotFoundException response means the call itself was authorized
info "Testing EKS permissions..."
EKS_OUTPUT=$(aws eks describe-cluster --name "preflight-nonexistent-$(date +%s)" --region "$REGION" 2>&1 || true)
if check_denied "$EKS_OUTPUT"; then
error "Failed EKS permission check. Check IAM permissions for eks:*"
exit 1
elif echo "$EKS_OUTPUT" | grep -q "ResourceNotFoundException"; then
success "EKS permissions verified"
else
# Fall back to list-clusters; trust only a zero exit status, since unrelated errors also produce output
if EKS_LIST_OUTPUT=$(aws eks list-clusters --region "$REGION" 2>&1); then
success "EKS permissions verified"
elif check_denied "$EKS_LIST_OUTPUT"; then
error "Failed EKS permission check. Check IAM permissions for eks:*"
exit 1
else
warning "EKS permission check inconclusive, but continuing..."
fi
fi
# Add warning about EKS prerequisites
warning "Note: Passing EKS checks does not guarantee you can create EKS/nodegroups."
warning "Common failures occur at iam:PassRole, ec2:* permissions, and service quotas."
# Test IAM read access (iam:ListRoles); this does not verify iam:PassRole
info "Testing IAM permissions..."
IAM_OUTPUT=$(aws iam list-roles --max-items 1 2>&1 || true)
if check_denied "$IAM_OUTPUT"; then
error "Failed IAM permission check. Check IAM permissions for iam:ListRoles (iam:PassRole is not tested here)"
exit 1
else
success "IAM permissions verified"
fi
# Test RDS permissions
info "Testing RDS permissions..."
RDS_OUTPUT=$(aws rds describe-db-instances --region "$REGION" 2>&1 || true)
if check_denied "$RDS_OUTPUT"; then
error "Failed RDS permission check. Check IAM permissions for rds:*"
exit 1
else
success "RDS permissions verified"
fi
# Test ElastiCache permissions
info "Testing ElastiCache permissions..."
CACHE_OUTPUT=$(aws elasticache describe-cache-clusters --region "$REGION" 2>&1 || true)
if check_denied "$CACHE_OUTPUT"; then
error "Failed ElastiCache permission check. Check IAM permissions for elasticache:*"
exit 1
else
success "ElastiCache permissions verified"
fi
# Test ALB/ELB permissions
info "Testing Application Load Balancer permissions..."
ALB_OUTPUT=$(aws elbv2 describe-load-balancers --region "$REGION" 2>&1 || true)
if check_denied "$ALB_OUTPUT"; then
error "Failed ALB permission check. Check IAM permissions for elasticloadbalancing:*"
exit 1
else
success "ALB permissions verified"
fi
# Test ACM permissions (for TLS certificates)
info "Testing ACM (Certificate Manager) permissions..."
ACM_OUTPUT=$(aws acm list-certificates --region "$REGION" 2>&1 || true)
if check_denied "$ACM_OUTPUT"; then
error "Failed ACM permission check. Check IAM permissions for acm:*"
exit 1
else
success "ACM permissions verified"
if [ -z "$ACM_DOMAIN" ]; then
warning "ACM check passed (does not confirm a cert exists for your chosen domain)"
else
# Check if certificate exists for the domain
info "Checking for ACM certificate matching domain: $ACM_DOMAIN"
# Derive the parent domain by dropping the first label
# (e.g., "example.com" from "langsmith.example.com").
# Heuristic: assumes the host sits exactly one level below the zone,
# so an apex such as "example.com" is reduced to "com"
if echo "$ACM_DOMAIN" | grep -q '\.'; then
ZONE_APEX=$(echo "$ACM_DOMAIN" | sed -E 's/^[^.]*\.(.+)$/\1/')
else
# Single-label name: nothing to strip
ZONE_APEX="$ACM_DOMAIN"
fi
# First try exact match
CERT_ARN=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[?DomainName=='$ACM_DOMAIN'].CertificateArn" --output text 2>/dev/null || echo "")
# If no exact match, try wildcard for zone apex (e.g., *.example.com)
if [ -z "$CERT_ARN" ] && [ "$ZONE_APEX" != "$ACM_DOMAIN" ]; then
WILDCARD_DOMAIN="*.$ZONE_APEX"
CERT_ARN=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[?DomainName=='$WILDCARD_DOMAIN'].CertificateArn" --output text 2>/dev/null || echo "")
fi
# If still no match, check SANs by describing each cert (limited check)
if [ -z "$CERT_ARN" ]; then
ALL_CERTS=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[*].CertificateArn" --output text 2>/dev/null || echo "")
for cert_arn in $ALL_CERTS; do
CERT_DETAILS=$(aws acm describe-certificate --certificate-arn "$cert_arn" --region "$REGION" --query "Certificate.{Domain:DomainName,SANs:SubjectAlternativeNames}" --output json 2>/dev/null || echo "{}")
# Fixed-string match (-F): as a regex, "*" and "." would also match unrelated names
if echo "$CERT_DETAILS" | grep -qF "\"$ACM_DOMAIN\"" || echo "$CERT_DETAILS" | grep -qF "\"*.$ZONE_APEX\""; then
CERT_ARN="$cert_arn"
break
fi
done
fi
if [ -n "$CERT_ARN" ]; then
success "Found ACM certificate for domain: $CERT_ARN"
else
warning "No ACM certificate found for domain '$ACM_DOMAIN' in region '$REGION'"
warning "You may need to request a certificate before deploying"
fi
fi
fi
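# Sketch (not called above): ACM wildcard names cover exactly one extra label,
# so "*.example.com" covers "langsmith.example.com" but neither "example.com"
# nor "a.b.example.com". The hypothetical helper below applies that rule to a
# single certificate name; example_cert_name_covers is an illustrative name.
example_cert_name_covers() {
local cert_name="$1" host="$2"
case "$cert_name" in
"*."*)
# Compare the host's parent domain with the wildcard's base domain,
# and require that the host actually has a label to strip
[ "${host#*.}" = "${cert_name#\*.}" ] && [ "$host" != "${host#*.}" ]
;;
*)
[ "$cert_name" = "$host" ]
;;
esac
}
# Usage: example_cert_name_covers "*.example.com" "$ACM_DOMAIN" && info "wildcard covers $ACM_DOMAIN"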
# Test Route53 permissions (for DNS/ingress) - Route53 is global, no --region
info "Testing Route53 permissions..."
R53_OUTPUT=$(aws route53 list-hosted-zones 2>&1 || true)
if check_denied "$R53_OUTPUT"; then
error "Failed Route53 permission check. Check IAM permissions for route53:*"
exit 1
else
success "Route53 permissions verified"
# Check if hosted zones exist
ZONE_COUNT=$(aws route53 list-hosted-zones --query "HostedZones | length(@)" --output text 2>/dev/null || echo "0")
if [ "$ZONE_COUNT" = "0" ] || [ -z "$ZONE_COUNT" ]; then
warning "No Route53 hosted zones found."
warning "If you intend to use Route53 for ingress, create/identify the hosted zone first."
else
info "Found $ZONE_COUNT Route53 hosted zone(s)"
# If domain provided, check for matching hosted zone
if [ -n "$ACM_DOMAIN" ]; then
# Walk up the domain one label at a time until a hosted zone matches,
# so apex domains and hosts several levels below the zone are handled too
# (Route53 zone names end with a dot)
MATCHING_ZONE=""
ZONE_APEX="$ACM_DOMAIN"
while [ -z "$MATCHING_ZONE" ] && echo "$ZONE_APEX" | grep -q '\.'; do
ZONE_NAME="${ZONE_APEX}."
MATCHING_ZONE=$(aws route53 list-hosted-zones --query "HostedZones[?Name=='$ZONE_NAME'].Id" --output text 2>/dev/null || echo "")
ZONE_APEX="${ZONE_APEX#*.}"
done
if [ -n "$MATCHING_ZONE" ]; then
success "Found Route53 hosted zone for domain: $ZONE_NAME"
else
warning "No Route53 hosted zone found for domain '$ACM_DOMAIN' or any of its parent domains"
fi
fi
fi
fi
# Test WAFv2 permissions (optional, for WAF support)
info "Testing WAFv2 permissions (optional)..."
WAF_OUTPUT=$(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" 2>&1 || true)
if check_denied "$WAF_OUTPUT"; then
warning "WAFv2 permission check failed (optional, but recommended for production)"
else
WAF_COUNT=$(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" --query "WebACLs | length(@)" --output text 2>/dev/null || echo "0")
if [ "$WAF_COUNT" = "0" ] || [ -z "$WAF_COUNT" ]; then
success "WAFv2 accessible (no web ACLs found)"
else
success "WAFv2 permissions verified (found $WAF_COUNT web ACL(s))"
fi
fi
# Resource creation tests run only when --create-test-resources is set and --skip_resource_tests is not
if [ "$SKIP_RESOURCE_TESTS" = true ]; then
info "Skipping resource creation tests (--skip_resource_tests flag provided)"
success "Preflight checks complete (resource tests skipped)"
exit 0
fi
if [ "$CREATE_TEST_RESOURCES" = false ]; then
info "Skipping resource creation tests (use --create-test-resources to enable)"
info "Read-only checks passed. You can proceed with deployment."
success "Preflight checks complete!"
exit 0
fi
# Set trap only when we're actually creating resources
trap cleanup EXIT
# Confirmation prompt before creating resources (unless --yes is set)
if [ "$NON_INTERACTIVE" = false ]; then
printf "\n"
warning "This will create temporary test resources:"
warning " - VPC, Subnet, Security Group (isolated, will be deleted)"
warning " - IAM Role (will be deleted)"
warning " - No modifications to existing resources"
printf "\n"
read -p "Continue with resource creation tests? (y/n): " -n 1 -r
printf "\n"
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
info "Resource creation tests cancelled by user"
exit 0
fi
fi
# Resource creation tests (only run if --create-test-resources is set)
info "Running resource creation tests (--create-test-resources mode)..."
# Generate a safer CIDR block (10.254.x.x range, avoid 0, less likely to conflict)
# Retry logic for VPC creation in case of CIDR conflicts
VPC_CREATED=false
for attempt in {1..3}; do
RANDOM_SUFFIX=$(( (RANDOM % 250) + 1 )) # Range 1-250, avoids 0
TEST_CIDR="10.254.${RANDOM_SUFFIX}.0/28"
info "Attempting VPC creation with CIDR $TEST_CIDR (attempt $attempt/3)..."
VPC_OUTPUT=$(aws ec2 create-vpc \
--cidr-block "$TEST_CIDR" \
--region "$REGION" \
--query 'Vpc.VpcId' \
--output text 2>&1) || {
if echo "$VPC_OUTPUT" | grep -Eqi "InvalidVpc\.Range|overlap|conflict"; then
if [ $attempt -lt 3 ]; then
warning "VPC creation failed (org policy or CIDR validation), trying different CIDR..."
continue
else
error "Failed to create VPC after 3 attempts (org policy or CIDR validation). Check IAM permissions for ec2:CreateVpc"
exit 1
fi
else
error "Failed to create VPC. Check IAM permissions for ec2:CreateVpc"
exit 1
fi
}
TEST_VPC_ID="$VPC_OUTPUT"
VPC_CREATED=true
success "VPC created: $TEST_VPC_ID"
break
done
if [ "$VPC_CREATED" = false ]; then
error "Failed to create VPC after all retry attempts"
exit 1
fi
# Test subnet creation (reuse VPC CIDR since it's a /28)
info "Testing subnet creation..."
AZ=$(aws ec2 describe-availability-zones --region "$REGION" --query 'AvailabilityZones[0].ZoneName' --output text)
TEST_SUBNET_ID=$(aws ec2 create-subnet \
--vpc-id "$TEST_VPC_ID" \
--cidr-block "$TEST_CIDR" \
--availability-zone "$AZ" \
--region "$REGION" \
--query 'Subnet.SubnetId' \
--output text 2>/dev/null) || {
error "Failed to create subnet. Check IAM permissions for ec2:CreateSubnet"
exit 1
}
success "Subnet created: $TEST_SUBNET_ID"
# Test security group creation
info "Testing security group creation..."
TEST_SG_ID=$(aws ec2 create-security-group \
--group-name "preflight-test-sg-$(date +%s)" \
--description "Preflight test security group" \
--vpc-id "$TEST_VPC_ID" \
--region "$REGION" \
--query 'GroupId' \
--output text 2>/dev/null) || {
error "Failed to create security group. Check IAM permissions for ec2:CreateSecurityGroup"
exit 1
}
success "Security group created: $TEST_SG_ID"
# Test IAM role creation (IAM is global, no --region)
info "Testing IAM role creation..."
TEST_ROLE_NAME="preflight-test-role-$(date +%s)"
TRUST_POLICY='{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}'
if aws iam create-role \
--role-name "$TEST_ROLE_NAME" \
--assume-role-policy-document "$TRUST_POLICY" \
--output text > /dev/null 2>&1; then
success "IAM role created: $TEST_ROLE_NAME"
else
error "Failed to create IAM role. Check IAM permissions for iam:CreateRole"
exit 1
fi
# Cleanup happens automatically via trap
info "All test resources will be cleaned up on exit..."
success "Preflight checks complete! All permissions verified."
info "You are ready to deploy LangSmith infrastructure."