Initial commit: LangSmith Self-Hosted AWS Reference Architecture (P0)

This commit establishes the P0 reference architecture documentation and supporting tooling for deploying LangSmith Self-Hosted on AWS.

Documentation:
- README.md: Reference architecture overview with embedded request flow diagram
- PREFLIGHT.md: Comprehensive preflight checklist with automated script integration
- WALKTHROUGH.md: Step-by-step deployment walkthrough
- INGRESS.md: ALB-only ingress configuration guide
- TROUBLESHOOTING.md: Evidence-based troubleshooting guide with diagnostic automation

Tooling:
- scripts/preflight.sh: Automated AWS permission and prerequisite validation
- scripts/capture-diagnostics.sh: Automated diagnostic information capture for troubleshooting

All documentation follows a professional, educational, and encouraging tone designed to support platform/infrastructure/MLOps engineers through the deployment process.

Key features:
- Opinionated, supportable deployment path
- AWS + EKS + Terraform + Helm stack
- Automated validation and diagnostic tools
- Clear separation of P0 (baseline) vs P1+ (advanced) features
2026-07-01 20:04:39 -04:00 · 2025-12-15 12:10:50 -08:00
commit ca682685a7
10 changed files with 2437 additions and 0 deletions
@@ -0,0 +1,2 @@
+.env
+.venv
@@ -0,0 +1,192 @@
+# Ingress for LangSmith Self-Hosted on AWS (P0) — ALB Only
+
+**P0 Reference Requirement:** Use **AWS Application Load Balancer (ALB)** via the **AWS Load Balancer Controller**.
+
+This requirement is intentionally opinionated. Ingress configuration is a common source of deployment challenges due to the many valid options available. The reference architecture standardizes on ALB to provide a clear, well-tested path.
+
+If you are not using ALB, you are operating **outside the P0 reference path**.
+
+---
+
+## Supported Ingress (P0)
+
+### ✅ Supported
+- **AWS Load Balancer Controller** + **ALB**
+- TLS termination using **ACM**
+- DNS via **Route53** (or equivalent, but Route53 is assumed for P0 examples)
+- Optional but strongly recommended:
+  - **AWS WAF** attached to ALB
+  - Private-only exposure (internal ALB + VPN/PrivateLink)
+
+### ❌ Explicitly Out of Scope (P0)
+- NGINX Ingress Controller
+- Traefik
+- Istio / service mesh gateways
+- API Gateway “fronting” Kubernetes as a substitute for ingress
+- CloudFront as a substitute for ingress (can be layered later, but not P0)
+- Custom gateways / reverse proxies
+
+These may work. We do not support them in the reference enablement path.
+
+---
+
+## Why We Require ALB
+
+- The ALB path is the **lowest-friction**, most reproducible option for AWS customers.
+- It provides a standardized approach that avoids controller complexity and configuration variations.
+- It aligns with what most platform teams already deploy and secure.
+- It makes debugging straightforward: ALB target health metrics and Kubernetes events provide clear diagnostic information.
+
+This requirement exists to reduce:
+- install failures
+- support escalations
+- time-to-first-trace delays
+
+---
+
+## Required Components
+
+You must have the following working before you install LangSmith:
+
+1. **EKS cluster** running and reachable with `kubectl`
+2. **AWS Load Balancer Controller** installed and healthy
+3. **IAM permissions** for the controller (IRSA strongly recommended)
+4. **Subnet tagging** correct for ALB discovery
+5. **ACM certificate** for your DNS name
+6. **Route53 record** (or other DNS) pointing to the created ALB
+
+If any of these are missing, Helm installation may succeed, but the product will be unreachable.
+
+---
+
+## Preflight Checks (Ingress-Specific)
+
+### Controller Health
+- [ ] The AWS Load Balancer Controller pods are running
+- [ ] No CrashLoopBackOff
+- [ ] Controller has permission to create:
+  - ALBs
+  - Target groups
+  - Listeners
+  - Security group rules
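+
+A minimal sketch of the health checks above, assuming the controller was installed with the upstream Helm chart defaults (a Deployment named `aws-load-balancer-controller` in the `kube-system` namespace). Both names are assumptions; adjust them if your install differs.
+
+```bash
+# Assumed defaults: namespace kube-system, Deployment aws-load-balancer-controller.
+kubectl -n kube-system get deployment aws-load-balancer-controller
+
+# Pods should be Running, with no CrashLoopBackOff and a stable restart count.
+kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-load-balancer-controller
+
+# Scan recent controller logs for permission errors such as AccessDenied.
+kubectl -n kube-system logs deployment/aws-load-balancer-controller --tail=50
+```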
+
+### Subnet Tagging (Common Failure)
+- [ ] Subnets are tagged so the controller can discover them for ALB creation
+- [ ] You know which subnets will host:
+  - the public-facing ALB
+  - the internal-only ALB (if private)
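+
+The AWS Load Balancer Controller auto-discovers subnets by tag: `kubernetes.io/role/elb` on public subnets and `kubernetes.io/role/internal-elb` on private subnets, commonly set to `1`. The sketch below lists the subnets carrying each tag; `VPC_ID` is a placeholder you must replace.
+
+```bash
+VPC_ID="vpc-0123456789abcdef0"  # placeholder: replace with your VPC ID
+
+# Subnets eligible for an internet-facing ALB
+aws ec2 describe-subnets \
+  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:kubernetes.io/role/elb,Values=1" \
+  --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table
+
+# Subnets eligible for an internal ALB
+aws ec2 describe-subnets \
+  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:kubernetes.io/role/internal-elb,Values=1" \
+  --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table
+```
+
+An empty table for the scheme you intend to use usually means the tags are missing. An ALB needs subnets in at least two Availability Zones.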
+
+### TLS
+- [ ] ACM cert exists in the **same region** as the ALB
+- [ ] Cert covers the intended DNS name (`langsmith.<domain>`)
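+
+One way to confirm both TLS items with the AWS CLI is sketched below. `AWS_REGION` and `CERT_ARN` are placeholders, not values defined by this guide.
+
+```bash
+AWS_REGION="us-east-1"               # placeholder: the region where the ALB will be created
+CERT_ARN="<your-certificate-arn>"    # placeholder: copy from the list output below
+
+# List certificates in the ALB's region. ACM certificates are regional.
+aws acm list-certificates --region "${AWS_REGION}" \
+  --query 'CertificateSummaryList[].[DomainName,CertificateArn]' --output table
+
+# Confirm the certificate is issued and that its names cover langsmith.<domain>.
+aws acm describe-certificate --region "${AWS_REGION}" --certificate-arn "${CERT_ARN}" \
+  --query 'Certificate.[Status,SubjectAlternativeNames]'
+```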
+
+### DNS
+- [ ] You can create DNS records for the LangSmith hostname
+
+---
+
+## Mandatory Validation Step: Prove ALB Ingress Works Before LangSmith
+
+Complete this validation **before** installing LangSmith. This step helps isolate ingress configuration issues from application-level problems, making troubleshooting more efficient.
+
+### Step 1: Deploy a tiny test service
+Pick one lightweight HTTP echo service (example shown conceptually):
+
+- Create a deployment + service that listens on HTTP (port 80)
+- Confirm:
+  - `kubectl get pods` shows it running
+  - `kubectl get endpoints` shows the service has endpoints
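+
+As one possible version of this step, the sketch below uses the public `nginx` image as a stand-in for an echo service. The name `alb-echo` is hypothetical and used only for this test.
+
+```bash
+# Hypothetical test workload; any small HTTP server listening on port 80 works.
+kubectl create deployment alb-echo --image=nginx --port=80
+kubectl expose deployment alb-echo --port=80 --target-port=80
+
+# Wait for the pod to become ready, then confirm the Service has endpoints.
+kubectl rollout status deployment/alb-echo
+kubectl get pods -l app=alb-echo
+kubectl get endpoints alb-echo
+```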
+
+### Step 2: Create a test Ingress that provisions an ALB
+Create an Ingress resource targeting the test service.
+
+What must happen:
+- An ALB is created
+- A target group is created
+- Targets become **healthy**
+- You can curl the endpoint and receive a response
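+
+A hedged sketch of a test Ingress follows. It assumes an IngressClass named `alb` exists (the controller's Helm chart creates one by default), that the cluster uses the Amazon VPC CNI so `target-type: ip` is valid, and that the backend is a Service named `alb-echo` on port 80 (a hypothetical name).
+
+```bash
+kubectl apply -f - <<'EOF'
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: alb-echo
+  annotations:
+    alb.ingress.kubernetes.io/scheme: internet-facing
+    alb.ingress.kubernetes.io/target-type: ip
+spec:
+  ingressClassName: alb
+  rules:
+    - http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: alb-echo
+                port:
+                  number: 80
+EOF
+
+# ADDRESS shows the ALB DNS name once provisioning completes (this can take a few minutes).
+kubectl get ingress alb-echo
+
+# Request the ALB directly and print the HTTP status code.
+ALB_HOST="$(kubectl get ingress alb-echo -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
+curl -sS -o /dev/null -w '%{http_code}\n' "http://${ALB_HOST}/"
+```
+
+This version exposes plain HTTP only. To test HTTPS, add the `alb.ingress.kubernetes.io/certificate-arn` and `alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'` annotations and point a DNS record at the ALB.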
+
+### Step 3: If the test Ingress fails, stop
+Do not proceed to LangSmith until:
+- ALB provisioning works
+- targets report **healthy** in the target group
+- HTTPS works with your cert
+
+---
+
+## Common Failure Modes (and Where to Look First)
+
+### ALB never gets created
+**Likely causes**
+- Controller not installed
+- Missing IAM permissions
+- Subnet discovery fails
+
+**Look at**
+- Kubernetes events on the Ingress
+- Controller logs
+- AWS console: whether any ALB attempt exists
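+
+These three places can be checked from the command line roughly as follows. `INGRESS_NAME` is a placeholder, and the controller namespace and Deployment name assume the Helm chart defaults.
+
+```bash
+INGRESS_NAME="alb-echo"  # placeholder: the Ingress that failed to produce an ALB
+
+# Events at the bottom of the output usually name the failing step.
+kubectl describe ingress "${INGRESS_NAME}"
+
+# Controller logs (assumed default namespace and Deployment name).
+kubectl -n kube-system logs deployment/aws-load-balancer-controller --tail=100
+
+# Check whether any load balancer was created in the current region.
+aws elbv2 describe-load-balancers \
+  --query 'LoadBalancers[].[LoadBalancerName,State.Code,DNSName]' --output table
+```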
+
+---
+
+### ALB created but targets unhealthy
+**Likely causes**
+- Wrong service port / targetPort
+- Pods not ready
+- Health check path mismatch
+- Security group blocks ALB-to-target traffic
+
+**Look at**
+- ALB target group health reason
+- `kubectl describe ingress ...`
+- `kubectl describe svc ...`
+- Pod readiness probe status
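+
+The target group health reason is also available from the AWS CLI. In this sketch, `TG_ARN` is a placeholder to fill in from the first command's output.
+
+```bash
+# Find the target group the controller created.
+aws elbv2 describe-target-groups \
+  --query 'TargetGroups[].[TargetGroupName,TargetGroupArn]' --output table
+
+TG_ARN="<target-group-arn>"  # placeholder: copy from the table above
+
+# State, Reason, and Description explain why a target is unhealthy.
+aws elbv2 describe-target-health --target-group-arn "${TG_ARN}" \
+  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
+  --output table
+```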
+
+---
+
+### HTTPS broken / cert issues
+**Likely causes**
+- Wrong ACM cert
+- Cert in wrong region
+- DNS mismatch
+
+**Look at**
+- ALB listener config
+- ACM cert validity and SANs
+- DNS record points to the right ALB
+
+---
+
+## Security Recommendations (P0 Baseline)
+
+Minimum expected posture for P0:
+- HTTPS only (no plaintext)
+- WAF or equivalent rate limiting at the edge
+- Prefer private exposure for enterprise deployments
+- Least privilege IAM for the controller and application
+- No public DB endpoints
+
+---
+
+## What to Document When You Deviate (Off-Reference)
+
+If a customer insists on non-ALB ingress, require them to capture:
+- ingress controller type/version
+- config manifests
+- load balancer / gateway config
+- health check settings
+- network policies / SG rules
+
+Note: this configuration is **not supported by P0 enablement**.
+
+---
+
+## Done Criteria (Ingress)
+
+Ingress is “done” when:
+- [ ] AWS Load Balancer Controller is healthy
+- [ ] A test Ingress provisions an ALB successfully
+- [ ] Targets are healthy
+- [ ] HTTPS works with your DNS name
+
+Only then install LangSmith.
@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
@@ -0,0 +1,290 @@
+# LangSmith Self-Hosted — Preflight Checklist (P0)
+
+**Purpose:**  
+Ensure the environment is ready *before* running Terraform or Helm.  
+Most deployment challenges can be prevented by completing these checks upfront, rather than discovering issues during installation.
+
+If a preflight check fails, **address it before proceeding**. This ensures a smoother deployment experience.
+
+---
+
+## Automated Preflight Checks
+
+You can use the provided preflight script to automatically verify AWS permissions and prerequisites before proceeding with manual checks.
+
+### Quick Start
+
+Run the automated preflight script:
+
+```bash
+./scripts/preflight.sh
+```
+
+### What the Script Does
+
+The script performs **read-only** permission checks to verify you have the necessary AWS permissions for deploying LangSmith Self-Hosted. By default, it:
+
+- Verifies AWS credentials are configured
+- Tests permissions for required AWS services:
+  - **EC2** (VPCs, subnets, availability zones)
+  - **EKS** (cluster management)
+  - **IAM** (role creation and management)
+  - **RDS** (PostgreSQL/Aurora)
+  - **ElastiCache** (Redis)
+  - **Application Load Balancer** (ALB/ELB)
+  - **ACM** (TLS certificates)
+  - **Route53** (DNS management)
+  - **WAFv2** (optional, for production)
+- Checks for sandbox account restrictions
+- Validates region configuration
+
+**Note:** The script is read-only by default and does not create or modify any resources.
+
+### Command-Line Options
+
+```bash
+./scripts/preflight.sh [OPTIONS]
+```
+
+**Options:**
+- `-s, --skip_resource_tests, --skip_checks`  
+  Skip resource creation tests (only run read-only permission checks)
+
+- `-y, --yes`  
+  Non-interactive mode (skip confirmation prompts). Automatically enabled in CI environments.
+
+- `--create-test-resources`  
+  Create temporary test resources (VPC, subnet, security group, IAM role) to verify write permissions. Resources are automatically cleaned up on exit. Use this to fully validate your permissions.
+
+- `--domain <domain>`  
+  Check for ACM certificate and Route53 hosted zone matching the specified domain (e.g., `langsmith.example.com`). The script will check for exact matches and wildcard certificates.
+
+**Examples:**
+
+```bash
+# Basic permission check (read-only)
+./scripts/preflight.sh
+
+# Check permissions and verify ACM certificate exists
+./scripts/preflight.sh --domain langsmith.example.com
+
+# Full permission test including resource creation
+./scripts/preflight.sh --create-test-resources
+
+# Non-interactive mode (useful for CI/CD)
+./scripts/preflight.sh --yes --domain langsmith.example.com
+```
+
+### When to Use the Script
+
+- **Before starting deployment:** Run the script to verify all AWS permissions are in place
+- **Troubleshooting permission issues:** Use `--create-test-resources` to test write permissions
+- **CI/CD pipelines:** Use `--yes` flag for automated checks
+- **Certificate validation:** Use `--domain` to verify ACM certificates and Route53 zones exist
+
+The script provides clear success/failure indicators for each permission check, making it easy to identify and resolve permission issues before deployment.
+
+---
+
+## 1. Account & Access
+
+### AWS Account
+- [ ] You have **full access** to an AWS account (not a sandbox with hidden SCPs)
+- [ ] You can create:
+  - VPCs
+  - EKS clusters
+  - ALBs
+  - IAM roles and policies
+  - RDS / ElastiCache
+  - EBS volumes
+- [ ] No org-level policy blocks required services
+
+### Credentials
+- [ ] AWS credentials configured locally (`aws sts get-caller-identity` works)
+- [ ] Region selected and consistent across Terraform and Helm
+- [ ] You understand **who pays for this** (this will not be free)
+
+---
+
+## 2. Terraform Readiness
+
+### Tooling
+- [ ] Terraform installed (supported version)
+- [ ] `kubectl` installed
+- [ ] `helm` installed
+- [ ] `awscli` installed
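+
+A quick way to confirm the tooling is installed and on your `PATH` is sketched below. This checklist does not state minimum versions, so compare the output against the versions your deployment requires.
+
+```bash
+terraform version
+kubectl version --client
+helm version --short
+aws --version
+```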
+
+### State Management
+- [ ] Terraform state backend chosen (S3 + DynamoDB recommended)
+- [ ] State bucket exists or can be created
+- [ ] You are not sharing state with another environment
+
+### Assumptions (Explicit)
+- [ ] You are deploying **one environment** (no shared dev/prod infra)
+- [ ] You are okay with Terraform creating networking resources
+- [ ] You will not “hot-edit” AWS resources Terraform owns
+
+---
+
+## 3. Network & DNS
+
+### VPC
+- [ ] A dedicated VPC will exist for LangSmith
+- [ ] At least:
+  - 2 public subnets (ALB)
+  - 2 private subnets (EKS + data)
+- [ ] NAT Gateway available for private subnet egress
+
+### DNS
+- [ ] A Route53 hosted zone exists (or you control DNS externally)
+- [ ] You can create DNS records for the LangSmith endpoint
+- [ ] You know whether this will be:
+  - [ ] Publicly accessible
+  - [ ] Private-only (VPN / PrivateLink)
+
+---
+
+## 4. Kubernetes (EKS) Expectations
+
+### Cluster
+- [ ] EKS will be used (not self-managed k8s)
+- [ ] You accept managed node groups
+- [ ] You are not using custom admission controllers that block installs
+
+### Capacity (Hard Requirement)
+- [ ] Minimum **16 vCPU / 64 GB RAM** allocatable cluster capacity
+- [ ] Nodes are sized to allow:
+  - LangSmith services
+  - ClickHouse
+  - System overhead
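+
+On an existing cluster, allocatable capacity per node can be listed as sketched below. Sum the columns across nodes and compare the totals against the minimum above.
+
+```bash
+# Allocatable (not total) CPU and memory per node.
+kubectl get nodes \
+  -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
+```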
+
+### Required Add-ons
+- [ ] Metrics Server enabled
+- [ ] Cluster Autoscaler enabled
+- [ ] You can install CRDs
+
+---
+
+## 5. Data Stores
+
+### PostgreSQL
+- [ ] PostgreSQL **14+**
+- [ ] Managed service (RDS/Aurora) preferred
+- [ ] Automated backups enabled
+- [ ] Network access from EKS confirmed
+
+### Redis
+- [ ] Redis OSS **5+**
+- [ ] Managed (ElastiCache) or in-cluster
+- [ ] Network access from EKS confirmed
+
+### ClickHouse (Critical)
+- [ ] Deployment model chosen:
+  - [ ] Externally managed
+  - [ ] In-cluster (StatefulSet)
+- [ ] If in-cluster:
+  - [ ] Node with **8 vCPU / 32 GB RAM** available
+  - [ ] SSD-backed storage
+  - [ ] PersistentVolume provisioner available
+- [ ] You understand ClickHouse is **not stateless**
+
+---
+
+## 6. Object Storage (Strongly Recommended)
+
+### S3
+- [ ] S3 bucket planned for LangSmith artifacts
+- [ ] Bucket region matches deployment region
+- [ ] IAM access model chosen:
+  - [ ] IRSA (preferred)
+  - [ ] Explicit credentials (discouraged)
+
+---
+
+## 7. Secrets Management
+
+- [ ] Secrets **will not** be committed to git
+- [ ] Secrets backend chosen:
+  - [ ] AWS Secrets Manager
+  - [ ] External Secrets
+  - [ ] CSI driver
+- [ ] Rotation strategy understood (even if manual)
+
+---
+
+## 8. Auth & Access Model
+
+- [ ] Auth strategy selected:
+  - [ ] Token-based
+  - [ ] OIDC / SSO
+- [ ] You know **who can access LangSmith**
+- [ ] You know **how access is revoked**
+- [ ] You are not assuming “security by obscurity”
+
+> Pick one auth model for initial enablement. Others are out of scope.
+
+---
+
+## 9. Ingress (P0 Hard Gate) — ALB Only
+
+Ingress configuration is a critical component that requires careful attention. For the P0 reference deployment, ingress is **not optional** and there are **no alternative controllers**.
+
+**P0 Requirement:** AWS ALB via **AWS Load Balancer Controller**.  
+If you are using NGINX/Traefik/Istio/API Gateway/etc., you are operating **outside the reference path**.
+
+### Controller & Permissions
+- [ ] AWS Load Balancer Controller is installed in the cluster
+- [ ] Controller pods are healthy (no CrashLoopBackOff)
+- [ ] Controller IAM permissions are in place (IRSA strongly preferred)
+
+### Subnet Discovery (Common Failure)
+- [ ] Public subnets are correctly tagged for ALB discovery (public ALB)
+- [ ] Private subnets are correctly tagged if you plan an internal ALB
+- [ ] You know which subnets ALBs will land in
+
+### TLS & DNS
+- [ ] ACM certificate exists for `langsmith.<domain>` (same region as ALB)
+- [ ] You control DNS and can create records for the endpoint
+
+### Mandatory Proof (Stop if not true)
+- [ ] You have successfully provisioned a **test ALB** from Kubernetes Ingress
+  - ALB created
+  - target group created
+  - targets become healthy
+  - HTTPS works on your DNS name
+
+If you cannot prove ALB ingress works **before** LangSmith, resolve the ingress configuration before proceeding with the LangSmith installation.
+
+---
+
+## 10. Operational Expectations (Read This)
+
+Before proceeding, confirm you accept:
+
+- [ ] You are responsible for upgrades
+- [ ] You are responsible for backups
+- [ ] You are responsible for incident response
+- [ ] Support will assume this reference architecture when debugging
+
+If any of these are unacceptable, **review your requirements** before proceeding, as these responsibilities are fundamental to operating a self-hosted deployment.
+
+---
+
+## 11. Preflight Outcome
+
+- [ ] All required checks passed  
+→ You may proceed to **Terraform deployment**.
+
+- [ ] One or more checks failed  
+→ Address them **before** continuing. Proceeding without resolving these issues will likely result in deployment challenges.
+
+---
+
+## Why This Checklist Exists
+
+Every unchecked box above corresponds to common issues that have caused:
+- Support escalations
+- Deployment delays
+- Production incidents
+
+Completing preflight checks thoroughly significantly increases your chances of a successful deployment. While passing preflight does not guarantee success, **failing to address these checks almost guarantees challenges**.
@@ -0,0 +1,262 @@
+# LangSmith Self-Hosted on AWS — Reference Architecture (P0)
+
+**Status:** P0 Enablement Baseline  
+**Audience:** Platform / Infra / MLOps Engineers  
+**Goal:** Provide a single, opinionated, supportable path to deploying and operating LangSmith Self-Hosted (SH) on AWS with minimal support intervention.
+
+This document defines the **reference architecture LangChain Enablement stands behind**.  
+Alternative approaches may work, but are **out of scope for P0 enablement and future certification**.
+
+---
+
+## 1. What This Architecture Is (and Is Not)
+
+### This *is*:
+- A production-capable **baseline deployment**
+- Opinionated by design
+- Built on **AWS + EKS + Terraform + Helm**
+- Designed to surface real operator responsibilities early
+- The foundation for future labs and certification
+
+### This is *not*:
+- A performance benchmark
+- A multi-region or HA architecture
+- A guide for custom service meshes or bespoke gateways
+- A promise of security guarantees
+
+---
+
+## 2. Deployment Mode
+
+**P0 Default: Full Self-Hosted**
+
+- Control plane and data plane both run in the customer AWS account
+- Customer is responsible for:
+  - Network exposure
+  - Authentication
+  - Data persistence
+  - Upgrades and backups
+
+> Hybrid (SaaS control plane + SH data plane) is valid but **out of scope for P0 enablement**.
+
+---
+
+## 3. High-Level Architecture
+
+Request flow (top to bottom):
+
+![Request Flow Diagram](diagrams/RequestFlow.png)
+
+Users / CI / SDKs  
+→ Route53  
+→ Application Load Balancer (ALB) + WAF  
+→ Kubernetes Ingress (EKS)  
+→ LangSmith application services  
+
+Persistent dependencies:
+
+- PostgreSQL — metadata (projects, orgs, users)
+- Redis — cache and job queues
+- ClickHouse — traces and analytics
+- S3 — large artifacts and payload storage
+
+
+**Flow Summary**
+- Traffic enters via **Route53 → ALB** (with optional WAF).
+- The ALB, configured from the **Kubernetes Ingress** resource, forwards traffic to services inside EKS.
+- LangSmith application services run in EKS.
+- Persistent state is handled by:
+  - **PostgreSQL** (metadata)
+  - **Redis** (cache / queues)
+  - **ClickHouse** (traces & analytics)
+  - **S3** (large artifacts and payloads)
+
+This diagram represents the **minimum supported topology** for the P0 reference architecture.
+
+---
+
+## 4. Network & Ingress
+
+### VPC
+- Single VPC
+- **Public subnets**: ALB only
+- **Private subnets**:
+  - EKS worker nodes
+  - Data services (RDS, Redis, ClickHouse if in-cluster)
+
+### Ingress
+- **Application Load Balancer (ALB)**
+- **AWS WAF strongly recommended**
+- TLS termination at ALB (end-to-end TLS recommended)
+- Optionally:
+  - Internal ALB + VPN / PrivateLink for non-public access
+
+### Egress
+- Outbound HTTPS access to required LangChain endpoints (if applicable)
+- Restrict egress access per organizational policy requirements
+
+---
+
+## 5. Compute: Kubernetes (EKS)
+
+### Cluster
+- **Amazon EKS**
+- Managed node groups
+- Cluster Autoscaler enabled
+- Metrics Server enabled
+
+### Baseline Capacity
+- Minimum cluster capacity:
+  - **16 vCPU / 64 GB RAM** available
+- This includes LangSmith services + system overhead
+
+---
+
+## 6. Data Stores
+
+LangSmith SH relies on three core data stores.
+
+### PostgreSQL (Metadata)
+- **AWS RDS PostgreSQL or Aurora PostgreSQL**
+- PostgreSQL **14+**
+- Single AZ for P0 (HA is P1)
+- Automated backups enabled
+
+### Redis (Cache / Queues)
+- **AWS ElastiCache (Redis OSS)**
+- Single node acceptable for P0
+- Persistence optional but recommended
+
+### ClickHouse (Traces & Analytics)
+
+ClickHouse is **memory- and I/O-intensive**. Proper sizing is critical for optimal performance and stability.
+
+#### P0 Reference Sizing (Production Baseline)
+- **8 vCPU**
+- **32 GB RAM**
+- **SSD-backed persistent storage**
+  - ~7000 IOPS
+  - ~1000 MiB/s throughput
+
+#### Allowed but Dev-Only
+- **4 vCPU / 16 GB RAM**
+- Non-production proof-of-concept only
+
+#### Scaling Guidance (P1)
+- Scale to **16 vCPU / 64 GB RAM** when:
+  - Trace ingestion grows
+  - Query latency increases
+  - Memory pressure appears
+
+> Strong recommendation: use externally managed ClickHouse where possible.  
+> In-cluster ClickHouse is supported for P0 and works well with proper operational practices.
+
+---
+
+## 7. Object Storage
+
+### S3 (Strongly Recommended)
+- Store large trace artifacts and payloads
+- Reduces DB size and blast radius
+- Improves security posture for sensitive inputs/outputs
+
+### Access Pattern
+- Use **IAM Roles for Service Accounts (IRSA)** where possible
+- No static credentials in Helm values
+
+---
+
+## 8. Secrets & Identity
+
+### Secrets
+- **AWS Secrets Manager** (preferred)
+- Inject into Kubernetes via:
+  - External Secrets
+  - CSI driver
+  - Secure environment injection
+
+### Identity & Auth
+- LangSmith authentication must be configured explicitly
+- Supported patterns include:
+  - Token-based authentication
+  - OIDC / SSO (at least one concrete example recommended for enablement)
+
+> For P0 enablement, select **one authentication pattern** to focus on. Additional patterns may be explored in future enablement tracks.
+
+---
+
+## 9. Observability (Platform-Level)
+
+Minimum required:
+- Application logs accessible via CloudWatch
+- Kubernetes events visible
+- Health endpoints monitored
+
+Optional (P1):
+- Prometheus / OpenTelemetry exporters
+- Alerting on:
+  - Pod restarts
+  - DB connectivity
+  - Ingestion failures
+
+---
+
+## 10. Security Baseline (Non-Negotiable)
+
+This reference architecture requires **essential security controls** as a baseline.
+
+### MUST
+- TLS enabled
+- No plaintext secrets
+- Least-privilege IAM
+- Network isolation (private subnets for data services)
+- WAF or equivalent rate limiting at ingress
+
+### SHOULD
+- Private access only (VPN / PrivateLink)
+- Auth required for all UI and API access
+- Regular patching and upgrades
+
+### Explicit Disclaimer
+> This reference architecture does **not** guarantee security.  
+> Customers are responsible for reviewing and approving deployments with their security teams.
+
+---
+
+## 11. What This Architecture Explicitly Excludes
+
+These are **out of scope for P0 enablement**:
+- Multi-region active/active
+- Custom gateways or service meshes
+- HA ClickHouse clusters
+- Custom scaling policies beyond autoscaler defaults
+- Performance benchmarking beyond sanity checks
+
+These may appear in P1/P2 enablement or certification tracks.
+
+---
+
+## 12. Why This Exists
+
+This reference architecture exists to:
+- Reduce installation failures and complexity
+- Provide support teams with a shared baseline
+- Create a clear, well-documented enablement path
+- Serve as the foundation for:
+  - Hands-on labs
+  - Operator certification
+  - Support playbooks
+
+If you hit problems during implementation, they usually point to configuration that needs attention rather than to defects in LangSmith itself.
+
+---
+
+## 13. Next Artifacts (Planned)
+
+- Preflight checklist
+- Deployment walkthrough
+- Known sharp edges
+- Failure-mode diagnostics
+- Operator mental model
+
+These resources build **on top of this foundation**, providing additional guidance and support as you progress.
@@ -0,0 +1,405 @@
+# LangSmith Self-Hosted on AWS — Troubleshooting Guide (P0)
+
+**Purpose:** Fast triage for the P0 reference deployment.  
+**Style:** Symptom → likely cause → exact checks → common fix.
+
+This guide focuses on actionable, evidence-based troubleshooting. Every item maps to an observable signal and a deterministic check.
+
+---
+
+## 0. First Rule of Triage: Gather Evidence First
+
+Before changing anything, capture essential diagnostic information. The easiest way to do this is using the provided diagnostic capture script.
+
+### Quick Start: Automated Diagnostics Capture
+
+Run the diagnostic script to automatically capture all required information:
+
+```bash
+./scripts/capture-diagnostics.sh
+```
+
+The script captures:
+- Pod list and detailed descriptions for all pods
+- Logs from all pods (current and previous if restarted)
+- Kubernetes events
+- Ingress resources and detailed configurations
+- Service and endpoint information
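+- Node information and resource usage (resource usage only if Metrics Server is available)
+- PersistentVolumeClaims, StatefulSets, and Deployments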
+- ALB target group health (if AWS CLI is configured)
+
+**Configuration via environment variables:**
+- `NAMESPACE` - Kubernetes namespace (default: `langsmith`)
+- `LOG_TAIL` - Number of log lines to capture per pod (default: `200`)
+- `EVENTS_TAIL` - Number of events to capture (default: `50`)
+- `OUTPUT_DIR` - Directory for diagnostic output (default: `./diagnostics`)
+- `AWS_REGION` - AWS region for ALB queries (default: `us-west-2`)
+
+**Example with custom configuration:**
+```bash
+NAMESPACE=langsmith-prod LOG_TAIL=500 OUTPUT_DIR=./prod-diagnostics ./scripts/capture-diagnostics.sh
+```
+
+The script writes everything to a timestamped directory under `OUTPUT_DIR`, together with a `summary.txt` file that lists what was captured.
+
+### Manual Capture (Alternative)
+
+If you prefer to capture diagnostics manually, ensure you collect:
+
+- `kubectl get pods -n langsmith -o wide`
+- `kubectl describe pod <POD> -n langsmith` (for each pod)
+- `kubectl logs <POD> -n langsmith --tail=200` (for each pod)
+- `kubectl get events -n langsmith --sort-by=.lastTimestamp | tail -50`
+- Ingress/ALB status:
+  - `kubectl get ingress -n langsmith` (or your ingress resource type)
+  - `kubectl describe ingress <INGRESS> -n langsmith`
+- If AWS-managed:
+  - ALB target group health (healthy/unhealthy + reason)
+
+Capturing this information before you change anything lets you trace the failure to its root cause instead of chasing symptoms.
+
+---
+
+## 1. The Deployment “Works” But UI Is Not Reachable
+
+### Symptom
+- DNS resolves but browser times out
+- Browser shows `502/503`
+- ALB exists but shows no healthy targets
+
+### Likely Causes
+- Ingress misconfigured
+- Service port mismatch
+- Pod readiness failing (so targets never become healthy)
+- Security group / NACL blocks
+
+### Checks
+- `kubectl get ingress -n langsmith -o yaml`
+- `kubectl get svc -n langsmith`
+- `kubectl describe svc <SERVICE> -n langsmith`
+- `kubectl get endpoints -n langsmith`
+- `kubectl get pods -n langsmith`
+- Inspect readiness:
+  - `kubectl describe pod <POD> -n langsmith | sed -n '/Readiness/,/Conditions/p'`
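
A service with no ready endpoints is the usual reason an ALB has nothing healthy to route to. The filter below is a small sketch that flags such services. `services_without_endpoints` is a hypothetical helper name, and the sample service names in the test data are illustrative only. It expects tab-separated `<service>\t<ips>` lines, as produced by the JSONPath query in the comment.

```shell
#!/usr/bin/env bash
# Print the name of every service whose endpoint list is empty.
# Input: one line per service, "<name><TAB><space-separated pod IPs>".
services_without_endpoints() {
  awk -F '\t' '$2 == "" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get endpoints -n langsmith \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.subsets[*].addresses[*].ip}{"\n"}{end}' \
#     | services_without_endpoints
```

Any name this prints is a service whose selector matches no ready pod, which narrows the problem to labels or readiness rather than the ALB.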
+
+### Fixes (Common)
+- Ensure ingress points to the correct service + port
+- Ensure service selectors match pod labels
+- Fix readiness probe failures before touching ALB
+- Confirm ALB security group allows inbound 443 and node security group allows target traffic
+
+---
+
+## 2. Pods CrashLoopBackOff Immediately
+
+### Symptom
+- Pods oscillate between `CrashLoopBackOff` and `Running`
+- Logs show immediate exit
+
+### Likely Causes
+- Missing or invalid secrets
+- DB/Redis/ClickHouse connection failure
+- Misconfigured required env vars
+
+### Checks
+- `kubectl logs <POD> -n langsmith --previous --tail=200`
+- `kubectl describe pod <POD> -n langsmith` (look for env var injection and secret refs)
+- Confirm secrets exist:
+  - `kubectl get secret -n langsmith`
+- Confirm external connectivity from inside cluster:
+  - Launch a temporary debug pod and test TCP connectivity to DB hosts/ports
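
The TCP test from a debug pod can be sketched as follows. This assumes the debug image provides `bash` and `timeout`; the pod name `net-debug`, the image placeholder, and the `tcp_probe` helper are illustrative and not part of the reference tooling.

```shell
#!/usr/bin/env bash
# Start a throwaway pod first (image is a placeholder; any image with bash works):
#   kubectl run net-debug -n langsmith --rm -it --restart=Never --image=<IMAGE_WITH_BASH> -- bash
# Then define and run this helper inside that pod.

# Attempt a TCP connection using bash's built-in /dev/tcp, with a 5 second limit.
tcp_probe() {
  probe_host="$1"
  probe_port="$2"
  if timeout 5 bash -c "exec 3<>/dev/tcp/${probe_host}/${probe_port}" 2>/dev/null; then
    echo "reachable ${probe_host}:${probe_port}"
  else
    echo "unreachable ${probe_host}:${probe_port}"
    return 1
  fi
}

# Example: tcp_probe <rds-endpoint> 5432
```

A probe that hangs until the timeout usually suggests a security group or NACL drop, while an immediate failure more often means nothing is listening on that port or the name did not resolve.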
+
+### Fixes (Common)
+- Correct secret names/keys referenced in Helm values
+- Verify DB hostnames and ports (RDS endpoints, Redis endpoints)
+- Fix network policy / security groups if connections time out
+
+---
+
+## 3. Everything Is Running, But “First Successful Trace” Fails
+
+### Symptom
+- UI loads
+- SDK calls fail (401/403/404) or traces never appear
+- Client sees timeouts or 5xx
+
+### Likely Causes
+- Wrong endpoint (`LANGSMITH_ENDPOINT`) or wrong path
+- Auth mismatch (token vs SSO)
+- Ingestion path failing due to ClickHouse or Redis issues
+- ALB health is fine but app errors on ingest
+
+### Checks
+- From client machine:
+  - Confirm endpoint resolves and responds (TLS + HTTP status)
+- In cluster logs:
+  - Search logs of the API/ingestion service for auth or write errors
+- Check ClickHouse health:
+  - Look for write failures, memory pressure, disk pressure
+- Check Redis:
+  - Look for connection errors or queue backlog signals (if exposed)
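
For the client-side check, the HTTP status alone already separates most of the causes listed above. The sketch below maps a status code to the area worth checking first. `classify_status` is a hypothetical helper, and the mapping is a rule of thumb, not documented LangSmith behavior.

```shell
#!/usr/bin/env bash
# Map an HTTP status code to the most likely problem area.
classify_status() {
  case "$1" in
    2??)     echo "ok" ;;
    401|403) echo "auth" ;;      # wrong or missing API key, or auth mode mismatch
    404)     echo "path" ;;      # wrong base URL or path in LANGSMITH_ENDPOINT
    5??)     echo "server" ;;    # check API/ingestion, ClickHouse and Redis logs
    000)     echo "network" ;;   # curl got no response: DNS, TLS or connectivity
    *)       echo "other" ;;
  esac
}

# Usage from the client machine:
#   code=$(curl -sS -o /dev/null -w '%{http_code}' "${LANGSMITH_ENDPOINT}" || true)
#   classify_status "$code"
```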
+
+### Fixes (Common)
+- Ensure client is using the correct base URL and auth method
+- Regenerate token / verify permissions
+- Fix ClickHouse sizing or disk throughput issues if writes fail
+- Fix Redis connectivity if queues are used for ingest
+
+---
+
+## 4. ALB Exists But Targets Are “Unhealthy”
+
+### Symptom
+- ALB target group shows all targets unhealthy
+- UI returns `503` even though pods are running
+
+### Likely Causes
+- Readiness probe failing
+- Target group health check path/port mismatch
+- Service isn’t exposing the expected port
+- Pods are running but not listening
+
+### Checks
+- `kubectl describe pod <POD> -n langsmith` (readiness probe results)
+- `kubectl get svc -n langsmith -o yaml`
+- Confirm the container port aligns with service targetPort
+- Confirm health check path matches what the service actually serves
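
The target group itself reports why each target is unhealthy, which is often quicker than inferring it from pod state. A minimal sketch, assuming the AWS CLI is configured and you already know the target group ARN; `unhealthy_targets` is a hypothetical helper that filters the tab-separated text output of the query shown in the comment.

```shell
#!/usr/bin/env bash
# Print every target that is not healthy, with its state and reason code.
# Input columns (tab-separated): target id, port, state, reason.
unhealthy_targets() {
  awk -F '\t' '$3 != "healthy" { printf "%s:%s %s (%s)\n", $1, $2, $3, $4 }'
}

# Usage:
#   aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN> \
#     --query 'TargetHealthDescriptions[].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]' \
#     --output text | unhealthy_targets
```

Reason codes such as `Target.FailedHealthChecks`, `Target.Timeout`, and `Target.ResponseCodeMismatch` point respectively at the probe failing, the target not answering in time, and the health check path returning an unexpected status.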
+
+### Fixes (Common)
+- Correct ingress annotations / health check settings
+- Fix readiness probe configuration or dependencies causing readiness to fail
+- Align service ports with actual container ports
+
+---
+
+## 5. DB Connectivity Failures (PostgreSQL)
+
+### Symptom
+- App logs show:
+  - authentication failures
+  - connection refused
+  - timeout
+  - “could not translate host name”
+- App won’t start or fails on request
+
+### Likely Causes
+- Wrong credentials
+- Security group blocks EKS to RDS
+- RDS not in the right subnets or routing broken
+- DNS/resolution issues inside cluster
+
+### Checks
+- Validate the RDS endpoint and port
+- Confirm security groups allow inbound from EKS node group / pod CIDR (depending on setup)
+- Test connectivity from a debug pod:
+  - DNS resolution
+  - TCP connect to `<rds-endpoint>:5432`
+
+### Fixes (Common)
+- Correct creds in secrets
+- Fix SG rules
+- Ensure private subnets have proper routing and NAT where required
+- Ensure RDS is reachable from EKS VPC/subnets
+
+---
+
+## 6. Redis Connectivity Failures
+
+### Symptom
+- Logs show Redis connection errors/timeouts
+- Background jobs stall (if used)
+- Ingestion or async tasks fail
+
+### Likely Causes
+- Wrong endpoint/port
+- Security group blocks EKS to ElastiCache
+- Auth mismatch (if Redis auth enabled)
+
+### Checks
+- Confirm ElastiCache endpoint and port
+- Test TCP connectivity from debug pod
+- Check whether Redis auth is enabled and whether Helm values match
+
+### Fixes (Common)
+- Fix endpoint in values
+- Fix security group rules
+- Align auth config
+
+---
+
+## 7. ClickHouse Problems (Most Common Real Root Cause)
+
+### 7.1 ClickHouse OOM / Memory Pressure
+
+**Symptom**
+- ClickHouse pod restarts
+- OOMKilled events
+- Trace writes fail or become slow
+
+**Likely Cause**
+- ClickHouse undersized (the 4 vCPU / 16 GB dev-only sizing used for a real workload)
+- Memory limits too tight
+- Query pressure
+
+**Checks**
+- `kubectl describe pod <clickhouse-pod> -n langsmith` (look for OOMKilled)
+- `kubectl logs <clickhouse-pod> -n langsmith --tail=200`
+- `kubectl top pod <clickhouse-pod> -n langsmith`
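
Rather than reading each `describe` output by eye, the last termination reason can be pulled for every pod at once. The sketch below assumes tab-separated `<pod>\t<reasons>` lines from the JSONPath query in the comment; `oom_killed_pods` is a hypothetical helper and the pod names in the test data are illustrative.

```shell
#!/usr/bin/env bash
# Print pods that have at least one container whose last termination was an OOM kill.
# Input: one line per pod, "<pod name><TAB><last termination reasons>".
oom_killed_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get pods -n langsmith \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```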
+
+**Fixes**
+- Move to **8 vCPU / 32GB RAM** baseline
+- Increase memory limits/requests
+- Reduce concurrent ingest/query load
+
+---
+
+### 7.2 ClickHouse Disk / IO Throughput Issues
+
+**Symptom**
+- Latency spikes
+- Writes time out
+- ClickHouse logs mention slow merges / IO waits
+
+**Likely Cause**
+- Slow storage class
+- Inadequate IOPS/throughput
+- Disk nearing capacity
+
+**Checks**
+- Confirm PV storage class and performance characteristics
+- Check disk usage in ClickHouse pod
+- Review ClickHouse logs for merge pressure / IO wait
+
+**Fixes**
+- Use SSD-backed storage with sufficient IOPS/throughput
+- Increase volume size
+- Move ClickHouse to a dedicated node group / better instance type
+
+---
+
+### 7.3 ClickHouse Not Persistent (Data Loss Risk)
+
+**Symptom**
+- ClickHouse redeploy loses data
+- Traces disappear after restart
+
+**Likely Cause**
+- No persistent volume attached
+- StatefulSet misconfigured
+
+**Checks**
+- Confirm PVC exists and is bound:
+  - `kubectl get pvc -n langsmith`
+- Confirm ClickHouse uses that PVC
+
+**Fixes**
+- Attach PVC and ensure StatefulSet mounts it
+- Do not treat ClickHouse as stateless
+
+---
+
+## 8. Kubernetes Scheduling Issues
+
+### Symptom
+- Pods stuck in `Pending`
+- Events show “insufficient cpu/memory”
+- ClickHouse never schedules
+
+### Likely Causes
+- Cluster too small
+- Node group instance types too small
+- Taints/affinity constraints prevent scheduling
+
+### Checks
+- `kubectl describe pod <POD> -n langsmith` (look at scheduling events)
+- `kubectl get nodes -o wide`
+- Check taints:
+  - `kubectl describe node <NODE> | sed -n '/Taints/,/Conditions/p'`
+
+### Fixes
+- Increase node group size
+- Use larger instance types
+- Remove/adjust taints and affinities
+- Ensure ClickHouse has a node that can fit **8 vCPU / 32 GB allocatable**
+
+---
+
+## 9. TLS / Certificate Issues
+
+### Symptom
+- Browser warnings
+- Client SDK fails TLS handshake
+- Mixed content or redirect loops
+
+### Likely Causes
+- Wrong ACM cert attached
+- Wrong DNS name on cert
+- HTTP/HTTPS mismatch
+
+### Checks
+- Confirm ALB listener is HTTPS
+- Confirm cert CN/SAN includes your DNS name
+- Confirm DNS record points to the correct ALB
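
The CN/SAN check can be scripted once the certificate's Subject Alternative Name line has been read. The sketch below is a simplified matcher under stated assumptions: `san_covers` is a hypothetical helper, it only understands `DNS:` entries, and it treats a wildcard as covering exactly one left-most label.

```shell
#!/usr/bin/env bash
# Return 0 if the hostname is covered by a comma-separated SAN list such as
# "DNS:example.com, DNS:*.example.com". The body runs in a subshell so that
# the IFS and globbing changes do not leak into the caller.
san_covers() (
  host="$1"
  set -f        # no glob expansion while splitting entries that contain "*"
  IFS=','
  for entry in $2; do
    name=$(printf '%s' "$entry" | sed -e 's/^ *DNS://' -e 's/ *$//')
    if [ "$name" = "$host" ]; then exit 0; fi
    case "$name" in
      '*.'*)
        suffix=$(printf '%s' "$name" | cut -c3-)   # drop the leading "*."
        rest="${host#*.}"                          # host without its first label
        if [ "$rest" != "$host" ] && [ "$rest" = "$suffix" ]; then exit 0; fi ;;
    esac
  done
  exit 1
)

# Reading the SAN list from the live endpoint (needs OpenSSL 1.1.1 or newer):
#   openssl s_client -connect <HOST>:443 -servername <HOST> </dev/null 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName
```

A wildcard such as `*.example.com` covers `langsmith.example.com` but not `a.b.example.com` or the bare `example.com`, which is a common source of browser warnings.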
+
+### Fixes
+- Attach correct cert
+- Fix DNS record
+- Enforce HTTPS redirects intentionally (not accidentally)
+
+---
+
+## 10. “It Worked Yesterday” Failures (The Dangerous Ones)
+
+### Symptom
+- Random 5xx
+- Slow UI
+- Traces intermittently missing
+
+### Likely Causes
+- Resource pressure (CPU throttling / memory pressure)
+- ClickHouse disk pressure or merge backlog
+- Redis saturation
+- Node churn / autoscaling issues
+
+### Checks
+- `kubectl top pods -n langsmith`
+- Pod restarts:
+  - `kubectl get pods -n langsmith --sort-by='.status.containerStatuses[0].restartCount'` (quoted so the shell does not treat the brackets as a glob pattern)
+- Node events and scaling activity
+- DB metrics (RDS CPU/connections; Redis CPU/memory; ClickHouse memory/disk)
+
+### Fixes
+- Add capacity (scale nodes)
+- Increase ClickHouse resources or improve disk class
+- Increase Redis tier if saturated
+- Tune autoscaler limits (don’t let it starve the cluster)
+
+---
+
+## 11. What to Include in a Support Request (If You Must Escalate)
+
+If you open a ticket, include:
+
+- Reference path confirmation:
+  - “Deployed via reference architecture + terraform + helm”
+  - repo SHAs / chart versions
+- Current cluster state:
+  - `kubectl get pods -n langsmith -o wide`
+  - relevant `describe` output
+  - last 200 lines of logs from failing pods
+- External dependencies:
+  - Postgres type/version (RDS/Aurora, PG version)
+  - Redis type/version
+  - ClickHouse model (external vs in-cluster) + sizing
+- ALB target health status and error reason
+
+Providing this information upfront enables faster resolution. If diagnostics are incomplete, the first step will be to collect the necessary diagnostic data.
+
+---
+
+## 12. How to Add to This Guide
+
+Only add entries that:
+- Came from a real failure
+- Include a deterministic check
+- Include a fix that is repeatable
@@ -0,0 +1,303 @@
+# LangSmith Self-Hosted on AWS — Deployment Walkthrough (P0)
+
+**Goal:** Get from zero → running LangSmith SH → first successful trace → basic health validation.  
+**Assumption:** You passed [`PREFLIGHT.md`](./PREFLIGHT.md). If not, stop and do that first.
+
+This walkthrough is intentionally opinionated and linear. Following it step-by-step ensures you stay on the reference path and can receive full support.
+
+---
+
+## 0. Inputs You Must Decide Up Front
+
+Pick these *before* you touch Terraform:
+
+- **AWS Region:** `us-west-2` (example — pick one and stick to it)
+- **Environment name:** `dev` / `staging` / `prod` (do not share resources across envs)
+- **DNS name:** `langsmith.<your-domain>`
+- **Exposure model:** Public (ALB) or Private-only (VPN/PrivateLink)
+- **Auth model:** Token-based (P0) or OIDC/SSO (P1 unless already standard internally)
+- **Data store model:**
+  - Postgres: RDS/Aurora (recommended)
+  - Redis: ElastiCache (recommended)
+  - ClickHouse: Externally managed (preferred) or in-cluster (allowed)
+
+Write these in a `deploy/ENV.md` file for your own sanity.
+
+---
+
+## 1. Clone Repos and Pin Versions
+
+You are building an enablement path. That means **pinning** matters.
+
+- Clone:
+  - `https://github.com/langchain-ai/terraform`
+  - `https://github.com/langchain-ai/helm`
+- Record:
+  - Terraform repo commit SHA
+  - Helm repo commit SHA or chart version
+- Do not “float” versions for the reference deployment.
+
+> Reproducibility is essential for effective enablement. If you cannot reproduce a deployment later, the enablement process has not been fully captured.
+
+---
+
+## 2. Terraform: Provision AWS Infrastructure
+
+### 2.1 Configure Terraform State
+- Use S3 backend + DynamoDB lock (recommended).
+- Ensure state is **unique per environment**.
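
One way to keep state unique per environment is to derive the backend key from the environment name and pass it at `terraform init` time. This is a sketch under assumptions: the `langsmith/<env>/terraform.tfstate` key layout and the `state_key` helper are hypothetical, and the Terraform configuration is assumed to declare a partial `backend "s3" {}` block.

```shell
#!/usr/bin/env bash
# Build a per-environment state key, refusing environment names outside the agreed set.
state_key() {
  case "${1:-}" in
    dev|staging|prod) printf 'langsmith/%s/terraform.tfstate\n' "$1" ;;
    *) echo "unknown environment: ${1:-<empty>}" >&2; return 1 ;;
  esac
}

# Usage (bucket, region and lock table are placeholders):
#   terraform init \
#     -backend-config="bucket=<STATE_BUCKET>" \
#     -backend-config="key=$(state_key dev)" \
#     -backend-config="region=<REGION>" \
#     -backend-config="dynamodb_table=<LOCK_TABLE>"
```

Rejecting unknown names up front avoids silently creating a new state file from a typo in the environment name.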
+
+### 2.2 Apply Infrastructure
+Provision (at minimum):
+- VPC + subnets (public for ALB, private for nodes/data)
+- EKS cluster + managed node groups
+- RDS Postgres (14+)
+- ElastiCache Redis
+- S3 bucket for artifacts
+- Security groups and IAM roles/policies
+- (Optional) Route53 hosted zone / record scaffolding
+
+**Hard requirement:** Ensure the EKS node groups provide at least:
+- **16 vCPU / 64GB RAM** allocatable capacity total
+- **ClickHouse capacity** if in-cluster:
+  - One node with **8 vCPU / 32GB RAM** allocatable
+
+### 2.3 Terraform Verification Gates (Stop if any fail)
+- [ ] `aws eks describe-cluster` shows `ACTIVE`
+- [ ] Worker nodes in private subnets can reach the internet (NAT)
+- [ ] RDS reachable from EKS subnets/security groups
+- [ ] Redis reachable from EKS subnets/security groups
+- [ ] S3 bucket exists and IAM access path is defined (IRSA preferred)
+
+---
+
+## 3. Kubernetes: Connect and Validate the Cluster
+
+### 3.1 Connect to the Cluster
+- Update kubeconfig:
+  - `aws eks update-kubeconfig --region <REGION> --name <CLUSTER_NAME>`
+- Confirm:
+  - `kubectl get nodes`
+
+### 3.2 Install/Validate Required Add-ons
+You must have:
+- Metrics Server
+- Cluster Autoscaler
+
+Verification:
+- `kubectl top nodes` returns metrics
+- Autoscaler is running and has permissions
+
+### 3.3 Create a Namespace
+Create a dedicated namespace, e.g.:
+- `langsmith`
+
+### 3.4 Ingress Gate: Prove ALB Works Before Installing LangSmith
+
+Complete this validation **before** Helm-installing LangSmith. Many deployment issues initially attributed to LangSmith are actually ingress, controller, or subnet-tagging configuration problems.
+
+#### 3.4.1 Deploy a tiny test app
+Deploy any minimal HTTP echo service into a test namespace (or the `langsmith` namespace). Confirm:
+- `kubectl get pods` shows it running
+- `kubectl get endpoints` shows at least one address for the service
+
+#### 3.4.2 Create a test Ingress that provisions an ALB
+Create an Ingress pointing at the test service.
+
+Your success criteria are binary:
+- [ ] An **ALB** is created
+- [ ] A target group is created
+- [ ] Targets become **healthy**
+- [ ] You can hit the endpoint and get a response over **HTTPS**
+
+#### 3.4.3 If this fails, stop
+Do not proceed to LangSmith until this gate passes.
+
+When it fails, the first places to look are:
+- Kubernetes events on the Ingress
+- AWS Load Balancer Controller logs
+- ALB target group health reasons in the AWS console
+
+> If you are not using ALB for ingress, you are operating outside the P0 reference path.
+
+---
+
+## 4. Prepare Dependencies and Secrets
+
+### 4.1 Collect Required Connection Info
+You need:
+- Postgres host/port/db/user/password
+- Redis host/port (and auth if enabled)
+- ClickHouse endpoint/user/password (or in-cluster config)
+- S3 bucket name and region
+
+### 4.2 Store Secrets (Do Not Put in Git)
+Preferred: AWS Secrets Manager + External Secrets integration.
+
+At minimum for P0 enablement:
+- Keep secrets out of repo
+- Inject into Kubernetes securely (ExternalSecrets/CSI/secure env)
+
+**Stop condition:** Never commit passwords or secrets into `values.yaml` or version control. Use a secrets management solution instead.
+
+---
+
+## 5. Helm: Install LangSmith
+
+### 5.1 Choose the Values Strategy
+You should have:
+- `values.yaml` (non-secret config)
+- `secrets.yaml` OR external secrets (secret values only, not committed)
+
+### 5.2 Configure Required Values
+Your Helm values must define:
+- External Postgres connection
+- External Redis connection
+- ClickHouse configuration (external or in-cluster)
+- S3 artifact storage (strongly recommended)
+- Ingress configuration (ALB + TLS)
+
+### 5.3 Install/Upgrade
+- Install the chart into the `langsmith` namespace.
+- Use `helm upgrade --install` (idempotent).
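
The install step might look like the wrapper below, which refuses to run without a pinned chart version. Treat the specifics as assumptions: the release name `langsmith`, the chart reference `langchain/langsmith` (which presumes the LangChain Helm repository was added under the alias `langchain`), the 15 minute timeout, and the `install_langsmith` helper are illustrative and not taken from this walkthrough.

```shell
#!/usr/bin/env bash
# Install or upgrade the release idempotently, always with an explicit chart version.
install_langsmith() {
  chart_version="${1:-}"
  if [ -z "$chart_version" ]; then
    echo "refusing to install without a pinned chart version" >&2
    return 1
  fi
  # Chart reference and release name are assumptions; adjust to your pinned source.
  helm upgrade --install langsmith langchain/langsmith \
    --namespace langsmith \
    --version "$chart_version" \
    -f values.yaml \
    --wait --timeout 15m
}

# Example: install_langsmith <PINNED_CHART_VERSION>
```

Running the same command twice with the same version and values should leave the release unchanged, which is a cheap way to confirm the install really is idempotent.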
+
+### 5.4 Helm Verification Gates (Stop if any fail)
+- [ ] All pods in `langsmith` namespace reach `Running` or expected steady state
+- [ ] No CrashLoopBackOff
+- [ ] Services have endpoints
+- [ ] Ingress is created and gets an ALB hostname/address
+
+Commands you should run (conceptually):
+- `kubectl get pods -n langsmith`
+- `kubectl describe pod <...> -n langsmith`
+- `kubectl get svc -n langsmith`
+- `kubectl get ingress -n langsmith` (or equivalent ingress resource)
+
+---
+
+## 6. Ingress + DNS: Make It Reachable
+
+### 6.1 TLS
+- Ensure the ALB listener is HTTPS
+- Ensure cert is valid (ACM recommended)
+
+### 6.2 DNS
+- Create a Route53 record:
+  - `langsmith.<domain>` → ALB DNS name
+
+### 6.3 Reachability Gate
+- [ ] You can load the LangSmith UI at `https://langsmith.<domain>`
+- [ ] Auth behaves as intended (token login or SSO)
+
+---
+
+## 7. “First Successful Trace” (The Real Success Condition)
+
+A deployment is not “done” until traces flow.
+
+### 7.1 Create an API Key / Token (if applicable)
+- Create the token per your configured auth model.
+- Store it securely.
+
+### 7.2 Send a Minimal Trace
+From a laptop or CI runner with egress to the endpoint:
+- Configure `LANGSMITH_ENDPOINT`
+- Configure auth (`LANGSMITH_API_KEY` or equivalent)
+- Run a minimal trace-producing script (LangChain example or direct API).
+
+### 7.3 Trace Gate (Stop if fails)
+- [ ] A trace appears in the LangSmith UI
+- [ ] Trace includes at least one run/span
+- [ ] No ingestion errors in logs
+
+If this fails, do not proceed to operational tasks. Fix ingestion first to ensure the system is functioning correctly.
+
+---
+
+## 8. Basic Health Validation (P0 Ops Readiness)
+
+### 8.1 What “Healthy” Means (Minimum)
+- UI loads reliably
+- API responds
+- DB connections stable
+- No sustained error logs
+- ClickHouse writes succeed
+- Redis queues not stuck
+
+### 8.2 Validate Logs
+Check:
+- LangSmith app logs for errors
+- ClickHouse logs for disk/memory pressure
+- Ingress/ALB logs (4xx/5xx spikes)
+
+### 8.3 Validate Resource Pressure
+- `kubectl top pods -n langsmith`
+- Look for:
+  - OOMKills
+  - CPU throttling
+  - Persistent volume saturation
+
+---
+
+## 9. Backup & Restore (P0 Expectations)
+
+For P0 enablement, you must at least:
+- Confirm RDS backups are enabled
+- Confirm ClickHouse persistence strategy is defined
+- Confirm S3 bucket lifecycle/versioning policy is intentional
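
The RDS backup check can be scripted from the AWS CLI, since a backup retention period of `0` means automated backups are disabled on an RDS instance. A minimal sketch: `instances_without_backups` is a hypothetical helper and the instance names in the test data are made up. For Aurora, retention is set on the cluster, so `describe-db-clusters` would be the place to look instead.

```shell
#!/usr/bin/env bash
# Print RDS instances whose automated backups are disabled (retention period 0).
# Input columns (tab-separated): instance identifier, backup retention in days.
instances_without_backups() {
  awk -F '\t' '$2 == 0 { print $1 }'
}

# Usage:
#   aws rds describe-db-instances --region <REGION> \
#     --query 'DBInstances[].[DBInstanceIdentifier,BackupRetentionPeriod]' \
#     --output text | instances_without_backups
```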
+
+You do not need to execute a restore yet, but you must document how it would be done.
+
+---
+
+## 10. Common Failure Points (Fast Triage)
+
+If deployment fails, the usual culprits are:
+
+1. **Networking / Security Groups**
+   - EKS can’t reach Postgres/Redis/ClickHouse
+2. **ClickHouse undersized or slow disk**
+   - OOM, high latency, ingestion failures
+3. **Ingress misconfiguration**
+   - ALB created but no healthy targets
+4. **Auth mismatch**
+   - UI loads but API calls fail
+5. **Secrets handling**
+   - Bad credentials injected, pods loop
+
+When something breaks, capture:
+- `kubectl describe`
+- pod logs
+- DB connection test results
+- ALB target health
+
+This data becomes your failure-mode catalog later.
+
+---
+
+## 11. “Done” Definition (P0)
+
+You are done only when:
+
+- [ ] Terraform applied cleanly and is reproducible
+- [ ] Helm install is idempotent (`upgrade --install` works)
+- [ ] UI reachable via HTTPS on your chosen DNS
+- [ ] First successful trace appears in the UI
+- [ ] Basic health checks are green (no crash loops, stable DB connectivity)
+
+If any box is unchecked, the deployment is not done. Keep working through the list until every item passes.
+
+---
+
+## Appendix: What to Capture During Your First Real Deployment
+
+As you run this the first time, log:
+- Where you hesitated
+- What you had to guess
+- What you looked up
+- What failed and how you fixed it
+
+Those are the inputs for:
+- `TROUBLESHOOTING.md`
+- “Top failure modes”
+- Future certification labs
@@ -0,0 +1,268 @@
+#!/usr/bin/env bash
+
+# LangSmith Self-Hosted Diagnostics Capture Script
+# This script captures essential diagnostic information for troubleshooting
+# LangSmith Self-Hosted deployments on AWS/EKS.
+
+set -euo pipefail
+
+# Configuration via environment variables (with defaults)
+NAMESPACE="${NAMESPACE:-langsmith}"
+LOG_TAIL="${LOG_TAIL:-200}"
+EVENTS_TAIL="${EVENTS_TAIL:-50}"
+OUTPUT_DIR="${OUTPUT_DIR:-./diagnostics}"
+TIMESTAMP=$(date +%Y%m%d-%H%M%S)
+OUTPUT_PATH="${OUTPUT_DIR}/${TIMESTAMP}"
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+# Create output directory
+mkdir -p "${OUTPUT_PATH}"
+
+echo -e "${GREEN}Capturing diagnostics for namespace: ${NAMESPACE}${NC}"
+echo -e "${GREEN}Output directory: ${OUTPUT_PATH}${NC}"
+echo ""
+
+# Function to run command and save output
+capture_output() {
+    local description="$1"
+    local command="$2"
+    local filename="$3"
+    
+    echo -e "${YELLOW}Capturing: ${description}${NC}"
+    if eval "${command}" > "${OUTPUT_PATH}/${filename}" 2>&1; then
+        echo -e "${GREEN}  ✓ Saved to ${filename}${NC}"
+    else
+        echo -e "${RED}  ✗ Failed to capture ${description}${NC}"
+    fi
+    echo ""
+}
+
+# Check if kubectl is available
+if ! command -v kubectl &> /dev/null; then
+    echo -e "${RED}Error: kubectl is not installed or not in PATH${NC}"
+    exit 1
+fi
+
+# Check if namespace exists
+if ! kubectl get namespace "${NAMESPACE}" &> /dev/null; then
+    echo -e "${RED}Error: Namespace '${NAMESPACE}' does not exist${NC}"
+    exit 1
+fi
+
+# Capture pod list
+capture_output \
+    "Pod list (wide format)" \
+    "kubectl get pods -n ${NAMESPACE} -o wide" \
+    "pods-wide.txt"
+
+# Get list of pods
+PODS=$(kubectl get pods -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
+
+if [ -z "${PODS}" ]; then
+    echo -e "${YELLOW}No pods found in namespace ${NAMESPACE}${NC}"
+    echo ""
+else
+    # Capture describe and logs for each pod
+    for POD in ${PODS}; do
+        echo -e "${YELLOW}Processing pod: ${POD}${NC}"
+        
+        # Capture pod description
+        capture_output \
+            "Pod description: ${POD}" \
+            "kubectl describe pod ${POD} -n ${NAMESPACE}" \
+            "pod-${POD}-describe.txt"
+        
+        # Capture pod logs
+        capture_output \
+            "Pod logs: ${POD} (last ${LOG_TAIL} lines)" \
+            "kubectl logs ${POD} -n ${NAMESPACE} --tail=${LOG_TAIL}" \
+            "pod-${POD}-logs.txt"
+        
+        # Capture previous logs if pod has restarted
+        if kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].restartCount}' 2>/dev/null | grep -q '[1-9]'; then
+            capture_output \
+                "Previous pod logs: ${POD} (last ${LOG_TAIL} lines)" \
+                "kubectl logs ${POD} -n ${NAMESPACE} --previous --tail=${LOG_TAIL}" \
+                "pod-${POD}-logs-previous.txt"
+        fi
+    done
+fi
+
+# Capture events
+capture_output \
+    "Kubernetes events (last ${EVENTS_TAIL} events)" \
+    "kubectl get events -n ${NAMESPACE} --sort-by=.lastTimestamp | tail -${EVENTS_TAIL}" \
+    "events.txt"
+
+# Capture ingress status
+capture_output \
+    "Ingress resources" \
+    "kubectl get ingress -n ${NAMESPACE} -o wide" \
+    "ingress-list.txt"
+
+# Capture detailed ingress information
+INGRESS_RESOURCES=$(kubectl get ingress -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
+if [ -n "${INGRESS_RESOURCES}" ]; then
+    for INGRESS in ${INGRESS_RESOURCES}; do
+        capture_output \
+            "Ingress details: ${INGRESS}" \
+            "kubectl describe ingress ${INGRESS} -n ${NAMESPACE}" \
+            "ingress-${INGRESS}-describe.txt"
+        
+        capture_output \
+            "Ingress YAML: ${INGRESS}" \
+            "kubectl get ingress ${INGRESS} -n ${NAMESPACE} -o yaml" \
+            "ingress-${INGRESS}.yaml"
+    done
+fi
+
+# Capture service status
+capture_output \
+    "Service list" \
+    "kubectl get svc -n ${NAMESPACE} -o wide" \
+    "services-list.txt"
+
+# Capture endpoints
+capture_output \
+    "Endpoints" \
+    "kubectl get endpoints -n ${NAMESPACE}" \
+    "endpoints.txt"
+
+# Capture service details
+SERVICES=$(kubectl get svc -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
+if [ -n "${SERVICES}" ]; then
+    for SVC in ${SERVICES}; do
+        capture_output \
+            "Service details: ${SVC}" \
+            "kubectl describe svc ${SVC} -n ${NAMESPACE}" \
+            "svc-${SVC}-describe.txt"
+    done
+fi
+
+# Capture node information
+capture_output \
+    "Node list" \
+    "kubectl get nodes -o wide" \
+    "nodes-wide.txt"
+
+# Capture resource usage (if metrics-server is available)
+if kubectl top nodes &> /dev/null; then
+    capture_output \
+        "Node resource usage" \
+        "kubectl top nodes" \
+        "nodes-top.txt"
+    
+    if [ -n "${PODS}" ]; then
+        capture_output \
+            "Pod resource usage" \
+            "kubectl top pods -n ${NAMESPACE}" \
+            "pods-top.txt"
+    fi
+else
+    echo -e "${YELLOW}Metrics Server not available, skipping resource usage metrics${NC}"
+    echo ""
+fi
+
+# Capture PVC information
+capture_output \
+    "Persistent Volume Claims" \
+    "kubectl get pvc -n ${NAMESPACE}" \
+    "pvc-list.txt"
+
+# Capture StatefulSets and Deployments
+capture_output \
+    "StatefulSets" \
+    "kubectl get statefulsets -n ${NAMESPACE} -o wide" \
+    "statefulsets.txt"
+
+capture_output \
+    "Deployments" \
+    "kubectl get deployments -n ${NAMESPACE} -o wide" \
+    "deployments.txt"
+
+# AWS-specific: ALB target group health (if AWS CLI is available and configured)
+if command -v aws &> /dev/null; then
+    echo -e "${YELLOW}Attempting to capture ALB target group health information...${NC}"
+    
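+    # Resolve each ALB from the hostname in the Ingress status. The AWS Load Balancer
+    # Controller publishes the ALB DNS name there; it does not store the ARN in an annotation.
+    if [ -n "${INGRESS_RESOURCES}" ]; then
+        for INGRESS in ${INGRESS_RESOURCES}; do
+            ALB_DNS=$(kubectl get ingress "${INGRESS}" -n "${NAMESPACE}" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null || echo "")
+            ALB_ARN=""
+            if [ -n "${ALB_DNS}" ]; then
+                ALB_ARN=$(aws elbv2 describe-load-balancers --region "${AWS_REGION:-us-west-2}" --query "LoadBalancers[?DNSName=='${ALB_DNS}'].LoadBalancerArn" --output text 2>/dev/null || echo "")
+            fi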
+            
+            if [ -n "${ALB_ARN}" ]; then
+                # Extract ALB name from ARN or use ARN directly
+                echo "ALB ARN: ${ALB_ARN}" > "${OUTPUT_PATH}/alb-${INGRESS}-info.txt"
+                
+                # Get target groups for this ALB
+                if aws elbv2 describe-target-groups --load-balancer-arn "${ALB_ARN}" --region "${AWS_REGION:-us-west-2}" &> /dev/null; then
+                    capture_output \
+                        "ALB target groups: ${INGRESS}" \
+                        "aws elbv2 describe-target-groups --load-balancer-arn ${ALB_ARN} --region ${AWS_REGION:-us-west-2}" \
+                        "alb-${INGRESS}-target-groups.json"
+                    
+                    # Get target health for each target group
+                    TARGET_GROUPS=$(aws elbv2 describe-target-groups --load-balancer-arn "${ALB_ARN}" --region "${AWS_REGION:-us-west-2}" --query 'TargetGroups[*].TargetGroupArn' --output text 2>/dev/null || echo "")
+                    if [ -n "${TARGET_GROUPS}" ]; then
+                        for TG_ARN in ${TARGET_GROUPS}; do
+                            capture_output \
+                                "Target group health: ${TG_ARN}" \
+                                "aws elbv2 describe-target-health --target-group-arn ${TG_ARN} --region ${AWS_REGION:-us-west-2}" \
+                                "alb-${INGRESS}-target-health-$(basename ${TG_ARN}).json"
+                        done
+                    fi
+                fi
+            fi
+        done
+    fi
+    
+    echo ""
+else
+    echo -e "${YELLOW}AWS CLI not available, skipping ALB target group health capture${NC}"
+    echo -e "${YELLOW}To capture ALB information, install AWS CLI and configure credentials${NC}"
+    echo ""
+fi
+
+# Create summary file
+SUMMARY_FILE="${OUTPUT_PATH}/summary.txt"
+{
+    echo "LangSmith Self-Hosted Diagnostics Summary"
+    echo "========================================"
+    echo "Timestamp: ${TIMESTAMP}"
+    echo "Namespace: ${NAMESPACE}"
+    echo "Output Directory: ${OUTPUT_PATH}"
+    echo ""
+    echo "Configuration:"
+    echo "  LOG_TAIL: ${LOG_TAIL}"
+    echo "  EVENTS_TAIL: ${EVENTS_TAIL}"
+    echo ""
+    echo "Captured Information:"
+    echo "  - Pod list and descriptions"
+    echo "  - Pod logs (current and previous if restarted)"
+    echo "  - Kubernetes events"
+    echo "  - Ingress resources and details"
+    echo "  - Services and endpoints"
+    echo "  - Node information"
+    echo "  - Resource usage (if metrics-server available)"
+    echo "  - Persistent Volume Claims"
+    echo "  - StatefulSets and Deployments"
+    if command -v aws &> /dev/null; then
+        echo "  - ALB target group health (if available)"
+    fi
+    echo ""
+    echo "Files captured:"
+    find "${OUTPUT_PATH}" -type f \( -name "*.txt" -o -name "*.yaml" -o -name "*.json" \) | sort | sed 's|^|  |'
+} > "${SUMMARY_FILE}"
+
+echo -e "${GREEN}✓ Diagnostics capture complete!${NC}"
+echo -e "${GREEN}Summary saved to: ${SUMMARY_FILE}${NC}"
+echo ""
+echo "To view the summary:"
+echo "  cat ${SUMMARY_FILE}"
+echo ""
+echo "All diagnostic files are in: ${OUTPUT_PATH}"
+
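Note on the ALB block in capture-diagnostics.sh: it reads the load balancer ARN from an ingress annotation, and that lookup may come back empty on clusters where the controller does not set such an annotation. A minimal sketch of an alternative lookup follows, assuming the ingress status carries the ALB DNS name (the usual way an ingress controller publishes a provisioned load balancer). `resolve_alb_arn` is a hypothetical helper, not part of the script; it relies only on `kubectl get ingress -o jsonpath` and `aws elbv2 describe-load-balancers`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of capture-diagnostics.sh).
# Assumption: .status.loadBalancer.ingress[0].hostname holds the ALB DNS name.
resolve_alb_arn() {
    local ingress="$1" namespace="$2" region="$3"
    local hostname arn

    # Read the DNS name the controller published on the ingress status
    hostname=$(kubectl get ingress "$ingress" -n "$namespace" \
        -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null || true)
    [ -n "$hostname" ] || return 1

    # Find the load balancer whose DNSName matches that hostname
    arn=$(aws elbv2 describe-load-balancers --region "$region" \
        --query "LoadBalancers[?DNSName=='${hostname}'].LoadBalancerArn" \
        --output text 2>/dev/null || true)
    if [ -z "$arn" ] || [ "$arn" = "None" ]; then
        return 1
    fi
    printf '%s\n' "$arn"
}
```

A caller could write `ALB_ARN=$(resolve_alb_arn "${INGRESS}" "${NAMESPACE}" "${AWS_REGION:-us-west-2}" || echo "")` and keep the rest of the target group capture unchanged.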
@@ -0,0 +1,514 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Deny pattern regex (covers the common AWS access-denied messages; matched case-insensitively)
+DENY_RE='AccessDenied|AccessDeniedException|UnauthorizedOperation|not authorized|NotAuthorized|is not authorized'
+
+# Parse command line arguments
+SKIP_RESOURCE_TESTS=false
+NON_INTERACTIVE=false
+CREATE_TEST_RESOURCES=false
+ACM_DOMAIN=""
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        -s|--skip_resource_tests|--skip_checks)
+            SKIP_RESOURCE_TESTS=true
+            shift
+            ;;
+        -y|--yes)
+            NON_INTERACTIVE=true
+            shift
+            ;;
+        --create-test-resources)
+            CREATE_TEST_RESOURCES=true
+            shift
+            ;;
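+        --domain)
+            # Fail with a clear message instead of an unbound-variable error under `set -u`
+            if [[ $# -lt 2 ]]; then
+                printf "Option --domain requires a value\n"
+                exit 1
+            fi
+            ACM_DOMAIN="$2"
+            shift 2
+            ;;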
+        *)
+            printf "Unknown option: %s\n" "$1"
+            printf "Usage: %s [-s|--skip_resource_tests] [-y|--yes] [--create-test-resources] [--domain <domain>]\n" "$0"
+            exit 1
+            ;;
+    esac
+done
+
+# Check for CI environment
+if [ "${CI:-false}" = "true" ]; then
+    NON_INTERACTIVE=true
+fi
+
+# Function to print colored output
+info() {
+    printf "${BLUE}[INFO]${NC} %s\n" "$1"
+}
+
+success() {
+    printf "${GREEN}[SUCCESS]${NC} %s\n" "$1"
+}
+
+warning() {
+    printf "${YELLOW}[WARNING]${NC} %s\n" "$1"
+}
+
+error() {
+    printf "${RED}[ERROR]${NC} %s\n" "$1"
+}
+
+# Return 0 if the given AWS CLI output contains an access-denied pattern, 1 otherwise
+check_denied() {
+    local output="$1"
+    # A here-string avoids a pipeline, so `pipefail` cannot mask a match on large outputs
+    grep -Eqi "$DENY_RE" <<< "$output"
+}
+
+# Check if AWS CLI is installed
+if ! command -v aws &> /dev/null; then
+    error "AWS CLI is not installed. Please install it first."
+    exit 1
+fi
+
+# Safety banner
+printf "\n"
+info "=== LangSmith AWS Preflight Check ==="
+info "Default mode: READ-ONLY (no resource creation)"
+info "Use --create-test-resources to test resource creation"
+info "No modifications to existing resources will be made."
+info "Temporary test resources may be created only with --create-test-resources."
+printf "\n"
+
+# Check AWS credentials
+info "Checking AWS credentials..."
+if ! aws sts get-caller-identity &> /dev/null; then
+    error "Not logged into AWS. Please run 'aws configure' or set AWS credentials."
+    exit 1
+fi
+
+# Get AWS account info
+ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
+USER_ARN=$(aws sts get-caller-identity --query Arn --output text)
+
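+# Resolve region: environment variables first (matching AWS CLI precedence), then the configured profile
+REGION=${AWS_REGION:-${AWS_DEFAULT_REGION:-}}
+if [ -z "$REGION" ]; then
+    REGION=$(aws configure get region 2>/dev/null || true)
+fi
+REGION=${REGION:-us-west-2}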
+
+info "AWS Account ID: $ACCOUNT_ID"
+info "User ARN: $USER_ARN"
+info "Current region: $REGION"
+
+# Confirm region (non-interactive mode skips this)
+if [ "$NON_INTERACTIVE" = false ]; then
+    printf "\n"
+    read -p "Is the region '$REGION' correct? (y/n): " -n 1 -r
+    printf "\n"
+    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
+        error "Please set the correct region using 'aws configure set region <region>' or export AWS_DEFAULT_REGION"
+        exit 1
+    fi
+else
+    info "Non-interactive mode: using region '$REGION'"
+fi
+
+# Check for sandbox account indicators (warning only, no prompt)
+info "Checking for sandbox account restrictions..."
+if [[ "$ACCOUNT_ID" =~ ^[0-9]{12}$ ]]; then
+    # Check if account has restrictions (common sandbox patterns)
+    ALIASES=$(aws iam list-account-aliases --query 'AccountAliases' --output text 2>/dev/null || echo "")
+    if echo "$ALIASES" | grep -Eqi "sandbox|test|dev"; then
+        warning "Account alias suggests this might be a sandbox/test account: $ALIASES"
+        warning "Please verify this account is not restricted by SCPs or other policies"
+    fi
+else
+    warning "Account ID format is unusual. Please verify this is not a restricted account."
+fi
+
+# Function to cleanup resources on exit (with retry logic)
+cleanup() {
+    info "Cleaning up test resources..."
+    
+    # Cleanup in correct order: SG -> Subnet -> VPC -> IAM role
+    # Retry logic for eventual consistency
+    
+    # Delete security group (with retry)
+    if [ -n "${TEST_SG_ID:-}" ]; then
+        for i in {1..3}; do
+            if aws ec2 delete-security-group --group-id "$TEST_SG_ID" --region "$REGION" 2>/dev/null; then
+                success "Security group deleted: $TEST_SG_ID"
+                break
+            fi
+            if [ $i -lt 3 ]; then
+                sleep 2
+            fi
+        done
+    fi
+    
+    # Delete subnet (with retry)
+    if [ -n "${TEST_SUBNET_ID:-}" ]; then
+        for i in {1..3}; do
+            if aws ec2 delete-subnet --subnet-id "$TEST_SUBNET_ID" --region "$REGION" 2>/dev/null; then
+                success "Subnet deleted: $TEST_SUBNET_ID"
+                break
+            fi
+            if [ $i -lt 3 ]; then
+                sleep 2
+            fi
+        done
+    fi
+    
+    # Delete VPC (with retry)
+    if [ -n "${TEST_VPC_ID:-}" ]; then
+        for i in {1..3}; do
+            if aws ec2 delete-vpc --vpc-id "$TEST_VPC_ID" --region "$REGION" 2>/dev/null; then
+                success "VPC deleted: $TEST_VPC_ID"
+                break
+            fi
+            if [ $i -lt 3 ]; then
+                sleep 2
+            fi
+        done
+    fi
+    
+    # Delete IAM role (IAM is global, no --region, with retry)
+    # NOTE: If you attach policies to the role, you must detach them before deleting
+    if [ -n "${TEST_ROLE_NAME:-}" ]; then
+        for i in {1..3}; do
+            if aws iam delete-role --role-name "$TEST_ROLE_NAME" 2>/dev/null; then
+                success "IAM role deleted: $TEST_ROLE_NAME"
+                break
+            fi
+            if [ $i -lt 3 ]; then
+                sleep 2
+            fi
+        done
+    fi
+}
+
+# Read-only permission checks (always run)
+info "Running read-only permission checks..."
+
+# Test EC2 permissions (needed for EKS and ALB controller)
+info "Testing EC2 permissions (VPC, subnets, availability zones)..."
+EC2_VPC_OUTPUT=$(aws ec2 describe-vpcs --region "$REGION" --max-items 1 2>&1 || true)
+EC2_SUBNET_OUTPUT=$(aws ec2 describe-subnets --region "$REGION" --max-items 1 2>&1 || true)
+EC2_AZ_OUTPUT=$(aws ec2 describe-availability-zones --region "$REGION" 2>&1 || true)
+
+if check_denied "$EC2_VPC_OUTPUT" || check_denied "$EC2_SUBNET_OUTPUT" || check_denied "$EC2_AZ_OUTPUT"; then
+    error "Failed EC2 permission check. Check IAM permissions for ec2:DescribeVpcs, ec2:DescribeSubnets, ec2:DescribeAvailabilityZones"
+    exit 1
+fi
+success "EC2 permissions verified"
+
+# Test EKS permissions: describing a nonexistent cluster should return ResourceNotFoundException when eks:DescribeCluster is allowed
+info "Testing EKS permissions..."
+EKS_OUTPUT=$(aws eks describe-cluster --name "preflight-nonexistent-$(date +%s)" --region "$REGION" 2>&1 || true)
+if check_denied "$EKS_OUTPUT"; then
+    error "Failed EKS permission check. Check IAM permissions for eks:*"
+    exit 1
+elif echo "$EKS_OUTPUT" | grep -q "ResourceNotFoundException"; then
+    success "EKS permissions verified"
+else
+    # Try list-clusters as alternative check
+    EKS_LIST_OUTPUT=$(aws eks list-clusters --region "$REGION" 2>&1 || true)
+    if check_denied "$EKS_LIST_OUTPUT"; then
+        error "Failed EKS permission check. Check IAM permissions for eks:*"
+        exit 1
+    elif [ -n "$EKS_LIST_OUTPUT" ]; then
+        success "EKS permissions verified"
+    else
+        warning "EKS permission check inconclusive, but continuing..."
+    fi
+fi
+
+# Add warning about EKS prerequisites
+warning "Note: Passing EKS checks does not guarantee you can create EKS/nodegroups."
+warning "Common failures occur at iam:PassRole, ec2:* permissions, and service quotas."
+
+# Test IAM permissions (read-only check; this does not validate iam:PassRole itself)
+info "Testing IAM permissions..."
+IAM_OUTPUT=$(aws iam list-roles --max-items 1 2>&1 || true)
+if check_denied "$IAM_OUTPUT"; then
+    error "Failed IAM permission check. Check IAM permissions for iam:ListRoles"
+    exit 1
+else
+    success "IAM permissions verified"
+fi
+
+# Test RDS permissions
+info "Testing RDS permissions..."
+RDS_OUTPUT=$(aws rds describe-db-instances --region "$REGION" 2>&1 || true)
+if check_denied "$RDS_OUTPUT"; then
+    error "Failed RDS permission check. Check IAM permissions for rds:*"
+    exit 1
+else
+    success "RDS permissions verified"
+fi
+
+# Test ElastiCache permissions
+info "Testing ElastiCache permissions..."
+CACHE_OUTPUT=$(aws elasticache describe-cache-clusters --region "$REGION" 2>&1 || true)
+if check_denied "$CACHE_OUTPUT"; then
+    error "Failed ElastiCache permission check. Check IAM permissions for elasticache:*"
+    exit 1
+else
+    success "ElastiCache permissions verified"
+fi
+
+# Test ALB/ELB permissions
+info "Testing Application Load Balancer permissions..."
+ALB_OUTPUT=$(aws elbv2 describe-load-balancers --region "$REGION" 2>&1 || true)
+if check_denied "$ALB_OUTPUT"; then
+    error "Failed ALB permission check. Check IAM permissions for elasticloadbalancing:*"
+    exit 1
+else
+    success "ALB permissions verified"
+fi
+
+# Test ACM permissions (for TLS certificates)
+info "Testing ACM (Certificate Manager) permissions..."
+ACM_OUTPUT=$(aws acm list-certificates --region "$REGION" 2>&1 || true)
+if check_denied "$ACM_OUTPUT"; then
+    error "Failed ACM permission check. Check IAM permissions for acm:*"
+    exit 1
+else
+    success "ACM permissions verified"
+    if [ -z "$ACM_DOMAIN" ]; then
+        warning "ACM check passed (does not confirm a cert exists for your chosen domain)"
+    else
+        # Check if certificate exists for the domain
+        info "Checking for ACM certificate matching domain: $ACM_DOMAIN"
+        
+        # Derive the parent domain by dropping the first label (e.g., "example.com" from "langsmith.example.com")
+        # This equals the zone apex only for single-level subdomains
+        if echo "$ACM_DOMAIN" | grep -q '\.'; then
+            ZONE_APEX=$(echo "$ACM_DOMAIN" | sed -E 's/^[^.]*\.(.+)$/\1/')
+        else
+            # No dot in the value: use it as-is
+            ZONE_APEX="$ACM_DOMAIN"
+        fi
+        
+        # First try exact match
+        CERT_ARN=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[?DomainName=='$ACM_DOMAIN'].CertificateArn" --output text 2>/dev/null || echo "")
+        
+        # If no exact match, try wildcard for zone apex (e.g., *.example.com)
+        if [ -z "$CERT_ARN" ] && [ "$ZONE_APEX" != "$ACM_DOMAIN" ]; then
+            WILDCARD_DOMAIN="*.$ZONE_APEX"
+            CERT_ARN=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[?DomainName=='$WILDCARD_DOMAIN'].CertificateArn" --output text 2>/dev/null || echo "")
+        fi
+        
+        # If still no match, check SANs by describing each cert (limited check)
+        if [ -z "$CERT_ARN" ]; then
+            ALL_CERTS=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[*].CertificateArn" --output text 2>/dev/null || echo "")
+            for cert_arn in $ALL_CERTS; do
+                CERT_DETAILS=$(aws acm describe-certificate --certificate-arn "$cert_arn" --region "$REGION" --query "Certificate.{Domain:DomainName,SANs:SubjectAlternativeNames}" --output json 2>/dev/null || echo "{}")
+                if echo "$CERT_DETAILS" | grep -Fq "\"$ACM_DOMAIN\"" || echo "$CERT_DETAILS" | grep -Fq "\"*.$ZONE_APEX\""; then
+                    CERT_ARN="$cert_arn"
+                    break
+                fi
+            done
+        fi
+        
+        if [ -n "$CERT_ARN" ]; then
+            success "Found ACM certificate for domain: $CERT_ARN"
+        else
+            warning "No ACM certificate found for domain '$ACM_DOMAIN' in region '$REGION'"
+            warning "You may need to request a certificate before deploying"
+        fi
+    fi
+fi
+
+# Test Route53 permissions (for DNS/ingress) - Route53 is global, no --region
+info "Testing Route53 permissions..."
+R53_OUTPUT=$(aws route53 list-hosted-zones 2>&1 || true)
+if check_denied "$R53_OUTPUT"; then
+    error "Failed Route53 permission check. Check IAM permissions for route53:*"
+    exit 1
+else
+    success "Route53 permissions verified"
+    # Check if hosted zones exist
+    ZONE_COUNT=$(aws route53 list-hosted-zones --query "HostedZones | length(@)" --output text 2>/dev/null || echo "0")
+    if [ "$ZONE_COUNT" = "0" ] || [ -z "$ZONE_COUNT" ]; then
+        warning "No Route53 hosted zones found."
+        warning "If you intend to use Route53 for ingress, create/identify the hosted zone first."
+    else
+        info "Found $ZONE_COUNT Route53 hosted zone(s)"
+        
+        # If domain provided, check for matching hosted zone
+        if [ -n "$ACM_DOMAIN" ]; then
+            # Extract zone apex (same logic as ACM check)
+            if echo "$ACM_DOMAIN" | grep -q '\.'; then
+                ZONE_APEX=$(echo "$ACM_DOMAIN" | sed -E 's/^[^.]*\.(.+)$/\1/')
+            else
+                ZONE_APEX="$ACM_DOMAIN"
+            fi
+            # Route53 zone names end with a dot
+            ZONE_NAME="${ZONE_APEX}."
+            MATCHING_ZONE=$(aws route53 list-hosted-zones --query "HostedZones[?Name=='$ZONE_NAME'].Id" --output text 2>/dev/null || echo "")
+            if [ -n "$MATCHING_ZONE" ]; then
+                success "Found Route53 hosted zone for domain: $ZONE_NAME"
+            else
+                warning "No Route53 hosted zone found matching domain '$ACM_DOMAIN' (checked for zone: $ZONE_NAME)"
+            fi
+        fi
+    fi
+fi
+
+# Test WAFv2 permissions (optional, for WAF support)
+info "Testing WAFv2 permissions (optional)..."
+WAF_OUTPUT=$(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" 2>&1 || true)
+if check_denied "$WAF_OUTPUT"; then
+    warning "WAFv2 permission check failed (optional, but recommended for production)"
+else
+    WAF_COUNT=$(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" --query "WebACLs | length(@)" --output text 2>/dev/null || echo "0")
+    if [ "$WAF_COUNT" = "0" ] || [ -z "$WAF_COUNT" ]; then
+        success "WAFv2 accessible (no web ACLs found)"
+    else
+        success "WAFv2 permissions verified (found $WAF_COUNT web ACL(s))"
+    fi
+fi
+
+# Resource creation tests (run only when --create-test-resources is set and --skip_resource_tests is not)
+if [ "$SKIP_RESOURCE_TESTS" = true ]; then
+    info "Skipping resource creation tests (--skip_resource_tests flag provided)"
+    success "Preflight checks complete (resource tests skipped)"
+    exit 0
+fi
+
+if [ "$CREATE_TEST_RESOURCES" = false ]; then
+    info "Skipping resource creation tests (use --create-test-resources to enable)"
+    info "Read-only checks passed. You can proceed with deployment."
+    success "Preflight checks complete!"
+    exit 0
+fi
+
+# Set trap only when we're actually creating resources
+trap cleanup EXIT
+
+# Confirmation prompt before creating resources (unless --yes is set)
+if [ "$NON_INTERACTIVE" = false ]; then
+    printf "\n"
+    warning "This will create temporary test resources:"
+    warning "  - VPC, Subnet, Security Group (isolated, will be deleted)"
+    warning "  - IAM Role (will be deleted)"
+    warning "  - No modifications to existing resources"
+    printf "\n"
+    read -p "Continue with resource creation tests? (y/n): " -n 1 -r
+    printf "\n"
+    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
+        info "Resource creation tests cancelled by user"
+        exit 0
+    fi
+fi
+
+# Resource creation tests (only run if --create-test-resources is set)
+info "Running resource creation tests (--create-test-resources mode)..."
+
+# Generate a safer CIDR block (10.254.x.x range, avoid 0, less likely to conflict)
+# Retry logic for VPC creation in case of CIDR conflicts
+VPC_CREATED=false
+for attempt in {1..3}; do
+    RANDOM_SUFFIX=$(( (RANDOM % 250) + 1 ))  # Range 1-250, avoids 0
+    TEST_CIDR="10.254.${RANDOM_SUFFIX}.0/28"
+    
+    info "Attempting VPC creation with CIDR $TEST_CIDR (attempt $attempt/3)..."
+    VPC_OUTPUT=$(aws ec2 create-vpc \
+        --cidr-block "$TEST_CIDR" \
+        --region "$REGION" \
+        --query 'Vpc.VpcId' \
+        --output text 2>&1) || {
+        if echo "$VPC_OUTPUT" | grep -qi "InvalidVpc.Range\|overlap\|conflict"; then
+            if [ $attempt -lt 3 ]; then
+                warning "VPC creation failed (org policy or CIDR validation), trying different CIDR..."
+                continue
+            else
+                error "Failed to create VPC after 3 attempts (org policy or CIDR validation). Check IAM permissions for ec2:CreateVpc"
+                exit 1
+            fi
+        else
+            error "Failed to create VPC. Check IAM permissions for ec2:CreateVpc"
+            error "AWS CLI output: $VPC_OUTPUT"
+            exit 1
+        fi
+    }
+    
+    TEST_VPC_ID="$VPC_OUTPUT"
+    VPC_CREATED=true
+    success "VPC created: $TEST_VPC_ID"
+    break
+done
+
+if [ "$VPC_CREATED" = false ]; then
+    error "Failed to create VPC after all retry attempts"
+    exit 1
+fi
+
+# Test subnet creation (reuse VPC CIDR since it's a /28)
+info "Testing subnet creation..."
+AZ=$(aws ec2 describe-availability-zones --region "$REGION" --query 'AvailabilityZones[0].ZoneName' --output text)
+TEST_SUBNET_ID=$(aws ec2 create-subnet \
+    --vpc-id "$TEST_VPC_ID" \
+    --cidr-block "$TEST_CIDR" \
+    --availability-zone "$AZ" \
+    --region "$REGION" \
+    --query 'Subnet.SubnetId' \
+    --output text 2>/dev/null) || {
+    error "Failed to create subnet. Check IAM permissions for ec2:CreateSubnet"
+    exit 1
+}
+success "Subnet created: $TEST_SUBNET_ID"
+
+# Test security group creation
+info "Testing security group creation..."
+TEST_SG_ID=$(aws ec2 create-security-group \
+    --group-name "preflight-test-sg-$(date +%s)" \
+    --description "Preflight test security group" \
+    --vpc-id "$TEST_VPC_ID" \
+    --region "$REGION" \
+    --query 'GroupId' \
+    --output text 2>/dev/null) || {
+    error "Failed to create security group. Check IAM permissions for ec2:CreateSecurityGroup"
+    exit 1
+}
+success "Security group created: $TEST_SG_ID"
+
+# Test IAM role creation (IAM is global, no --region)
+info "Testing IAM role creation..."
+TEST_ROLE_NAME="preflight-test-role-$(date +%s)"
+TRUST_POLICY='{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Service": "ec2.amazonaws.com"
+      },
+      "Action": "sts:AssumeRole"
+    }
+  ]
+}'
+if aws iam create-role \
+    --role-name "$TEST_ROLE_NAME" \
+    --assume-role-policy-document "$TRUST_POLICY" \
+    --output text > /dev/null 2>&1; then
+    success "IAM role created: $TEST_ROLE_NAME"
+else
+    error "Failed to create IAM role. Check IAM permissions for iam:CreateRole"
+    exit 1
+fi
+
+# Cleanup happens automatically via trap
+info "All test resources will be cleaned up on exit..."
+
+success "Preflight checks complete! All permissions verified."
+info "You are ready to deploy LangSmith infrastructure."
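Note on the ACM check in preflight.sh: the SAN fallback searches the certificate JSON for the domain and a wildcard string. A stricter name comparison could be sketched as below, assuming lowercase names and that a wildcard such as `*.example.com` covers exactly one leftmost label (so neither `example.com` nor `a.b.example.com`). `cert_name_covers` is a hypothetical helper, not part of the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of preflight.sh).
# Returns 0 when a certificate name (exact or wildcard) covers the given domain.
# Assumption: both arguments are already lowercase.
cert_name_covers() {
    local cert_name="$1" domain="$2"

    # Exact match
    if [ "$cert_name" = "$domain" ]; then
        return 0
    fi

    case "$cert_name" in
        '*.'*)
            # Strip the leading "*." and compare against the domain minus its first label
            local suffix="${cert_name#\*.}"
            local first_label="${domain%%.*}"
            local rest="${domain#*.}"
            [ -n "$first_label" ] && [ "$rest" != "$domain" ] && [ "$rest" = "$suffix" ]
            return
            ;;
    esac
    return 1
}
```

Each name from a certificate's `DomainName` and `SubjectAlternativeNames` could be passed as the first argument, with `$ACM_DOMAIN` as the second, stopping at the first name that returns 0.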