mirror of
https://github.com/langchain-ai/langsmith-self-hosted-reference-aws.git
synced 2026-07-01 20:04:39 -04:00
Initial commit: LangSmith Self-Hosted AWS Reference Architecture (P0)
This commit establishes the P0 reference architecture documentation and supporting tooling for deploying LangSmith Self-Hosted on AWS.

Documentation:
- README.md: Reference architecture overview with embedded request flow diagram
- PREFLIGHT.md: Comprehensive preflight checklist with automated script integration
- WALKTHROUGH.md: Step-by-step deployment walkthrough
- INGRESS.md: ALB-only ingress configuration guide
- TROUBLESHOOTING.md: Evidence-based troubleshooting guide with diagnostic automation

Tooling:
- scripts/preflight.sh: Automated AWS permission and prerequisite validation
- scripts/capture-diagnostics.sh: Automated diagnostic information capture for troubleshooting

All documentation follows a professional, educational, and encouraging tone designed to support platform/infrastructure/MLOps engineers through the deployment process.

Key features:
- Opinionated, supportable deployment path
- AWS + EKS + Terraform + Helm stack
- Automated validation and diagnostic tools
- Clear separation of P0 (baseline) vs P1+ (advanced) features
@@ -0,0 +1,2 @@
.env
.venv
@@ -0,0 +1,192 @@
# Ingress for LangSmith Self-Hosted on AWS (P0): ALB Only

**P0 Reference Requirement:** Use **AWS Application Load Balancer (ALB)** via the **AWS Load Balancer Controller**.

This requirement is intentionally opinionated. Ingress configuration is a common source of deployment challenges due to the many valid options available. The reference architecture standardizes on ALB to provide a clear, well-tested path.

If you are not using ALB, you are operating **outside the P0 reference path**.

---

## Supported Ingress (P0)

### ✅ Supported
- **AWS Load Balancer Controller** + **ALB**
- TLS termination using **ACM**
- DNS via **Route53** (or equivalent, but Route53 is assumed for P0 examples)
- Optional but strongly recommended:
  - **AWS WAF** attached to ALB
  - Private-only exposure (internal ALB + VPN/PrivateLink)

### ❌ Explicitly Out of Scope (P0)
- NGINX Ingress Controller
- Traefik
- Istio / service mesh gateways
- API Gateway “fronting” Kubernetes as a substitute for ingress
- CloudFront as a substitute for ingress (can be layered later, but not P0)
- Custom gateways / reverse proxies

These may work. We do not support them in the reference enablement path.

---

## Why We Require ALB

- The ALB path is the **lowest-friction**, most reproducible option for AWS customers.
- It provides a standardized approach that avoids controller complexity and configuration variations.
- It aligns with what most platform teams already deploy and secure.
- It makes debugging straightforward: ALB target health metrics and Kubernetes events provide clear diagnostic information.

This requirement exists to reduce:
- install failures
- support escalations
- time-to-first-trace delays

---

## Required Components

You must have the following working before you install LangSmith:

1. **EKS cluster** running and reachable with `kubectl`
2. **AWS Load Balancer Controller** installed and healthy
3. **IAM permissions** for the controller (IRSA strongly recommended)
4. **Subnet tagging** correct for ALB discovery
5. **ACM certificate** for your DNS name
6. **Route53 record** (or other DNS) pointing to the created ALB

If any of these are missing, Helm installation may succeed, but the product will be unreachable.

---

## Preflight Checks (Ingress-Specific)

### Controller Health
- [ ] The AWS Load Balancer Controller pods are running
- [ ] No CrashLoopBackOff
- [ ] Controller has permission to create:
  - ALBs
  - Target groups
  - Listeners
  - Security group rules
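
The pod-level health checks above can be scripted. The following is a minimal sketch and is not part of the repository's tooling: `unhealthy_pods` is a hypothetical helper that parses `kubectl get pods --no-headers` output, and the `kube-system` namespace and the `app.kubernetes.io/name=aws-load-balancer-controller` label are assumptions based on a typical Helm installation of the controller. Adjust both if your install differs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints every pod that is not
# Running or not fully ready. Returns 1 if any such pod is found, else 0.
unhealthy_pods() {
  awk '
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) {
        print $1 ": " $3 " (" $2 ")"
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage against a live cluster (assumed namespace and label, not executed here):
#   kubectl get pods -n kube-system \
#     -l app.kubernetes.io/name=aws-load-balancer-controller --no-headers \
#     | unhealthy_pods
```

A non-zero exit status means at least one controller pod needs attention, for example one stuck in `CrashLoopBackOff`. This sketch does not verify IAM permissions; those only surface when the controller tries to create AWS resources.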

### Subnet Tagging (Common Failure)
- [ ] Subnets are tagged so the controller can discover them for ALB creation
- [ ] You know which subnets will host:
  - the public-facing ALB
  - the internal-only ALB (if private)
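
As an illustration of the tagging check, the sketch below maps an ALB scheme to the subnet tag key that the AWS Load Balancer Controller uses for auto-discovery (`kubernetes.io/role/elb` for internet-facing, `kubernetes.io/role/internal-elb` for internal) and builds a filter for `aws ec2 describe-subnets`. The function names and the `VPC_ID` variable are hypothetical and exist only for this example.

```bash
#!/usr/bin/env bash
# Prints the subnet tag key used for ALB auto-discovery for a given scheme.
subnet_role_tag() {
  case "$1" in
    internet-facing) echo "kubernetes.io/role/elb" ;;
    internal)        echo "kubernetes.io/role/internal-elb" ;;
    *) echo "unknown scheme: $1" >&2; return 1 ;;
  esac
}

# Builds a --filters value that matches subnets carrying the tag key,
# whatever the tag value is.
subnet_filter() {
  local tag
  tag="$(subnet_role_tag "$1")" || return 1
  echo "Name=tag-key,Values=${tag}"
}

# Usage against a live account (VPC_ID is a placeholder, not executed here):
#   aws ec2 describe-subnets \
#     --filters "$(subnet_filter internet-facing)" "Name=vpc-id,Values=${VPC_ID}" \
#     --query 'Subnets[].SubnetId' --output text
```

An empty result for the scheme you plan to use is a strong hint that ALB creation will fail at the subnet discovery step.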

### TLS
- [ ] ACM cert exists in the **same region** as the ALB
- [ ] Cert covers the intended DNS name (`langsmith.<domain>`)
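
To make "covers the intended DNS name" concrete, here is a small sketch of the matching rule: a certificate name covers a host if it is identical, or if it is a wildcard that replaces exactly one leftmost label. `cert_covers` is a hypothetical helper written for this example; it does not query ACM itself.

```bash
#!/usr/bin/env bash
# Returns 0 if certificate name $1 covers host $2, otherwise 1.
cert_covers() {
  local cert_name="$1" host="$2"
  if [ "$cert_name" = "$host" ]; then
    return 0
  fi
  case "$cert_name" in
    '*.'*)
      # A wildcard matches exactly one leftmost label: *.example.com covers
      # langsmith.example.com, but not a.b.example.com or example.com.
      local suffix="${cert_name#\*}"   # e.g. ".example.com"
      local label="${host%%.*}"        # leftmost label of the host
      [ -n "$label" ] && [ "$host" = "${label}${suffix}" ]
      return
      ;;
  esac
  return 1
}

# Usage with names listed from ACM in the ALB's region (not executed here):
#   aws acm list-certificates --region "$AWS_REGION" \
#     --query 'CertificateSummaryList[].DomainName' --output text
```

The listing above only returns each certificate's primary domain name; subject alternative names need a separate look at the individual certificate.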

### DNS
- [ ] You can create DNS records for the LangSmith hostname

---

## Mandatory Validation Step: Prove ALB Ingress Works Before LangSmith

Complete this validation **before** installing LangSmith. This step helps isolate ingress configuration issues from application-level problems, making troubleshooting more efficient.

### Step 1: Deploy a tiny test service
Pick one lightweight HTTP echo service (example shown conceptually):

- Create a deployment + service that listens on HTTP (port 80)
- Confirm:
  - `kubectl get pods` shows it running
  - `kubectl get endpoints` shows the service has endpoints
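
One possible shape for the test service is sketched below. The `render_test_app` function, the `alb-test` name, and the `nginx:stable` image are illustrative choices, not something the reference architecture prescribes; any small HTTP server listening on port 80 would do.

```bash
#!/usr/bin/env bash
# Prints a Deployment and a ClusterIP Service for a throwaway HTTP test app.
# The name and image are assumptions made for this example.
render_test_app() {
  local name="${1:-alb-test}"
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      containers:
        - name: web
          image: nginx:stable
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: ${name}
spec:
  selector:
    app: ${name}
  ports:
    - port: 80
      targetPort: 80
EOF
}

# Usage against a live cluster (not executed here):
#   render_test_app | kubectl apply -f -
#   kubectl get pods -l app=alb-test
#   kubectl get endpoints alb-test
```

Rendering the manifest to stdout first makes it easy to review or save before applying it.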

### Step 2: Create a test Ingress that provisions an ALB
Create an Ingress resource targeting the test service.

What must happen:
- An ALB is created
- A target group is created
- Targets become **healthy**
- You can curl the endpoint and receive a response
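
A test Ingress for this step might look like the sketch below. It assumes an IngressClass named `alb` exists (commonly created when the controller is installed via Helm), uses `ip` target mode, and exposes an internet-facing HTTPS listener; `render_test_ingress` and its arguments are hypothetical names for this example. Switch the scheme to `internal` for a private ALB.

```bash
#!/usr/bin/env bash
# Prints an Ingress that asks the AWS Load Balancer Controller for an ALB.
# Arguments: <service/ingress name> <hostname> <ACM certificate ARN>
render_test_ingress() {
  local name="$1" host="$2" cert_arn="$3"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${name}
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: ${cert_arn}
spec:
  ingressClassName: alb
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${name}
                port:
                  number: 80
EOF
}

# Usage against a live cluster (CERT_ARN is a placeholder, not executed here):
#   render_test_ingress alb-test test.example.com "$CERT_ARN" | kubectl apply -f -
#   kubectl get ingress alb-test        # the ADDRESS column shows the ALB DNS name
#   kubectl describe ingress alb-test   # events show provisioning errors
```

Once the ADDRESS column is populated and DNS points at it, `curl https://<hostname>/` should return the test page.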

### Step 3: If the test Ingress fails, stop
Do not proceed to LangSmith until:
- ALB provisioning works
- targets report healthy
- HTTPS works with your cert

---

## Common Failure Modes (and Where to Look First)

### ALB never gets created
**Likely causes**
- Controller not installed
- Missing IAM permissions
- Subnet discovery fails

**Look at**
- Kubernetes events on the Ingress
- Controller logs
- AWS console: whether any ALB attempt exists

---

### ALB created but targets unhealthy
**Likely causes**
- Wrong service port / targetPort
- Pods not ready
- Health check path mismatch
- Security group blocks ALB-to-target traffic

**Look at**
- ALB target group health reason
- `kubectl describe ingress ...`
- `kubectl describe svc ...`
- Pod readiness probe status

---

### HTTPS broken / cert issues
**Likely causes**
- Wrong ACM cert
- Cert in wrong region
- DNS mismatch

**Look at**
- ALB listener config
- ACM cert validity and SANs
- DNS record points to the right ALB

---

## Security Recommendations (P0 Baseline)

Minimum expected posture for P0:
- HTTPS only (no plaintext)
- WAF or equivalent rate limiting at the edge
- Prefer private exposure for enterprise deployments
- Least privilege IAM for the controller and application
- No public DB endpoints

---

## What to Document When You Deviate (Off-Reference)

If a customer insists on non-ALB ingress, require them to capture:
- ingress controller type/version
- config manifests
- load balancer / gateway config
- health check settings
- network policies / SG rules

Note: this configuration is **not supported by P0 enablement**.

---

## Done Criteria (Ingress)

Ingress is “done” when:
- [ ] AWS Load Balancer Controller is healthy
- [ ] A test Ingress provisions an ALB successfully
- [ ] Targets are healthy
- [ ] HTTPS works with your DNS name

Only then install LangSmith.
@@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -0,0 +1,290 @@
# LangSmith Self-Hosted: Preflight Checklist (P0)

**Purpose:**
Ensure the environment is ready *before* running Terraform or Helm.
Most deployment challenges can be prevented by completing these checks upfront, rather than discovering issues during installation.

If a preflight check fails, **address it before proceeding**. This ensures a smoother deployment experience.

---

## Automated Preflight Checks

You can use the provided preflight script to automatically verify AWS permissions and prerequisites before proceeding with manual checks.

### Quick Start

Run the automated preflight script:

```bash
./scripts/preflight.sh
```

### What the Script Does

The script performs **read-only** permission checks to verify you have the necessary AWS permissions for deploying LangSmith Self-Hosted. By default, it:

- Verifies AWS credentials are configured
- Tests permissions for required AWS services:
  - **EC2** (VPCs, subnets, availability zones)
  - **EKS** (cluster management)
  - **IAM** (role creation and management)
  - **RDS** (PostgreSQL/Aurora)
  - **ElastiCache** (Redis)
  - **Application Load Balancer** (ALB/ELB)
  - **ACM** (TLS certificates)
  - **Route53** (DNS management)
  - **WAFv2** (optional, for production)
- Checks for sandbox account restrictions
- Validates region configuration

**Note:** The script is read-only by default and does not create or modify any resources unless you pass `--create-test-resources`.

### Command-Line Options

```bash
./scripts/preflight.sh [OPTIONS]
```

**Options:**
- `-s, --skip_resource_tests, --skip_checks`
  Skip resource creation tests (only run read-only permission checks)

- `-y, --yes`
  Non-interactive mode (skip confirmation prompts). Automatically enabled in CI environments.

- `--create-test-resources`
  Create temporary test resources (VPC, subnet, security group, IAM role) to verify write permissions. Resources are automatically cleaned up on exit. Use this to fully validate your permissions.

- `--domain <domain>`
  Check for ACM certificate and Route53 hosted zone matching the specified domain (e.g., `langsmith.example.com`). The script will check for exact matches and wildcard certificates.

**Examples:**

```bash
# Basic permission check (read-only)
./scripts/preflight.sh

# Check permissions and verify ACM certificate exists
./scripts/preflight.sh --domain langsmith.example.com

# Full permission test including resource creation
./scripts/preflight.sh --create-test-resources

# Non-interactive mode (useful for CI/CD)
./scripts/preflight.sh --yes --domain langsmith.example.com
```

### When to Use the Script

- **Before starting deployment:** Run the script to verify all AWS permissions are in place
- **Troubleshooting permission issues:** Use `--create-test-resources` to test write permissions
- **CI/CD pipelines:** Use `--yes` flag for automated checks
- **Certificate validation:** Use `--domain` to verify ACM certificates and Route53 zones exist

The script provides clear success/failure indicators for each permission check, making it easy to identify and resolve permission issues before deployment.

---

## 1. Account & Access

### AWS Account
- [ ] You have **full access** to an AWS account (not a sandbox with hidden SCPs)
- [ ] You can create:
  - VPCs
  - EKS clusters
  - ALBs
  - IAM roles and policies
  - RDS / ElastiCache
  - EBS volumes
- [ ] No org-level policy blocks required services

### Credentials
- [ ] AWS credentials configured locally (`aws sts get-caller-identity` works)
- [ ] Region selected and consistent across Terraform and Helm
- [ ] You understand **who pays for this** (this will not be free)

---

## 2. Terraform Readiness

### Tooling
- [ ] Terraform installed (supported version)
- [ ] `kubectl` installed
- [ ] `helm` installed
- [ ] AWS CLI (`aws`) installed

### State Management
- [ ] Terraform state backend chosen (S3 with DynamoDB state locking recommended)
- [ ] State bucket exists or can be created
- [ ] You are not sharing state with another environment

### Assumptions (Explicit)
- [ ] You are deploying **one environment** (no shared dev/prod infra)
- [ ] You are okay with Terraform creating networking resources
- [ ] You will not “hot-edit” AWS resources Terraform owns

---

## 3. Network & DNS

### VPC
- [ ] A dedicated VPC will exist for LangSmith
- [ ] At least:
  - 2 public subnets in different Availability Zones (ALB)
  - 2 private subnets (EKS + data)
- [ ] NAT Gateway available for private subnet egress

### DNS
- [ ] A Route53 hosted zone exists (or you control DNS externally)
- [ ] You can create DNS records for the LangSmith endpoint
- [ ] You know whether this will be:
  - [ ] Publicly accessible
  - [ ] Private-only (VPN / PrivateLink)

---

## 4. Kubernetes (EKS) Expectations

### Cluster
- [ ] EKS will be used (not self-managed k8s)
- [ ] You accept managed node groups
- [ ] You are not using custom admission controllers that block installs

### Capacity (Hard Requirement)
- [ ] Minimum **16 vCPU / 64 GB RAM** allocatable cluster capacity
- [ ] Nodes are sized to allow:
  - LangSmith services
  - ClickHouse
  - System overhead
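
To check the capacity requirement against a running cluster, the sketch below sums allocatable CPU and memory across nodes. `meets_capacity` is a hypothetical helper; it assumes CPU is reported either as whole cores or millicores (`7910m`) and memory in `Ki`, `Mi`, `Gi`, or plain bytes, and it compares memory in GiB, which is slightly stricter than the decimal GB stated in the checklist.

```bash
#!/usr/bin/env bash
# Reads "<cpu> <memory>" pairs (one node per line) on stdin, prints the total
# allocatable vCPU and GiB, and returns 0 only if both totals meet the minimums.
# Arguments: [min vCPU, default 16] [min GiB, default 64]
meets_capacity() {
  local min_cpu="${1:-16}" min_gib="${2:-64}"
  awk -v min_cpu="$min_cpu" -v min_gib="$min_gib" '
    {
      cpu = $1; mem = $2
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu = cpu / 1000 }
      if      (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mem = mem / 1024 / 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem / 1024 }
      else if (mem ~ /Gi$/) { sub(/Gi$/, "", mem) }
      else                  { mem = mem / 1024 / 1024 / 1024 }  # plain bytes
      total_cpu += cpu; total_gib += mem
    }
    END {
      printf "allocatable: %.1f vCPU, %.1f GiB\n", total_cpu, total_gib
      exit ((total_cpu >= min_cpu && total_gib >= min_gib) ? 0 : 1)
    }
  '
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{" "}{.status.allocatable.memory}{"\n"}{end}' \
#     | meets_capacity 16 64
```

Allocatable capacity is always lower than the instance's nominal size, so a node group that adds up to exactly 16 vCPU / 64 GB on paper will typically fail this check.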

### Required Add-ons
- [ ] Metrics Server enabled
- [ ] Cluster Autoscaler enabled
- [ ] You can install CRDs

---

## 5. Data Stores

### PostgreSQL
- [ ] PostgreSQL **14+**
- [ ] Managed service (RDS/Aurora) preferred
- [ ] Automated backups enabled
- [ ] Network access from EKS confirmed

### Redis
- [ ] Redis OSS **5+**
- [ ] Managed (ElastiCache) or in-cluster
- [ ] Network access from EKS confirmed

### ClickHouse (Critical)
- [ ] Deployment model chosen:
  - [ ] Externally managed
  - [ ] In-cluster (StatefulSet)
- [ ] If in-cluster:
  - [ ] Node with **8 vCPU / 32 GB RAM** available
  - [ ] SSD-backed storage
  - [ ] PersistentVolume provisioner available
- [ ] You understand ClickHouse is **not stateless**

---

## 6. Object Storage (Strongly Recommended)

### S3
- [ ] S3 bucket planned for LangSmith artifacts
- [ ] Bucket region matches deployment region
- [ ] IAM access model chosen:
  - [ ] IRSA (preferred)
  - [ ] Explicit credentials (discouraged)

---

## 7. Secrets Management

- [ ] Secrets **will not** be committed to git
- [ ] Secrets backend chosen:
  - [ ] AWS Secrets Manager
  - [ ] External Secrets
  - [ ] CSI driver
- [ ] Rotation strategy understood (even if manual)

---

## 8. Auth & Access Model

- [ ] Auth strategy selected:
  - [ ] Token-based
  - [ ] OIDC / SSO
- [ ] You know **who can access LangSmith**
- [ ] You know **how access is revoked**
- [ ] You are not assuming “security by obscurity”

> Pick one auth model for initial enablement. Others are out of scope.

---

## 9. Ingress (P0 Hard Gate): ALB Only

Ingress configuration is a critical component that requires careful attention. For the P0 reference deployment, ingress is **not optional** and there are **no alternative controllers**.

**P0 Requirement:** AWS ALB via **AWS Load Balancer Controller**.
If you are using NGINX/Traefik/Istio/API Gateway/etc., you are operating **outside the reference path**.

### Controller & Permissions
- [ ] AWS Load Balancer Controller is installed in the cluster
- [ ] Controller pods are healthy (no CrashLoopBackOff)
- [ ] Controller IAM permissions are in place (IRSA strongly preferred)

### Subnet Discovery (Common Failure)
- [ ] Public subnets are correctly tagged for ALB discovery (public ALB)
- [ ] Private subnets are correctly tagged if you plan an internal ALB
- [ ] You know which subnets ALBs will land in

### TLS & DNS
- [ ] ACM certificate exists for `langsmith.<domain>` (same region as ALB)
- [ ] You control DNS and can create records for the endpoint

### Mandatory Proof (Stop if not true)
- [ ] You have successfully provisioned a **test ALB** from Kubernetes Ingress
  - ALB created
  - target group created
  - targets become healthy
  - HTTPS works on your DNS name

If you cannot prove ALB ingress works **before** LangSmith, resolve the ingress configuration before proceeding with the LangSmith installation.

---

## 10. Operational Expectations (Read This)

Before proceeding, confirm you accept:

- [ ] You are responsible for upgrades
- [ ] You are responsible for backups
- [ ] You are responsible for incident response
- [ ] Support will assume this reference architecture when debugging

If any of these are unacceptable, **review your requirements** before proceeding, as these responsibilities are fundamental to operating a self-hosted deployment.

---

## 11. Preflight Outcome

- [ ] All required checks passed
  → You may proceed to **Terraform deployment**.

- [ ] One or more checks failed
  → Address them **before** continuing. Proceeding without resolving these issues will likely result in deployment challenges.

---

## Why This Checklist Exists

Every unchecked box above corresponds to a common issue that has caused:
- Support escalations
- Deployment delays
- Production incidents

Completing the preflight checks thoroughly makes a successful deployment much more likely. While passing preflight does not guarantee success, **failing to address these checks almost guarantees challenges**.
@@ -0,0 +1,262 @@
# LangSmith Self-Hosted on AWS: Reference Architecture (P0)

**Status:** P0 Enablement Baseline
**Audience:** Platform / Infra / MLOps Engineers
**Goal:** Provide a single, opinionated, supportable path to deploying and operating LangSmith Self-Hosted (SH) on AWS with minimal support intervention.

This document defines the **reference architecture LangChain Enablement stands behind**.
Alternative approaches may work, but are **out of scope for P0 enablement and future certification**.

---

## 1. What This Architecture Is (and Is Not)

### This *is*:
- A production-capable **baseline deployment**
- Opinionated by design
- Built on **AWS + EKS + Terraform + Helm**
- Designed to surface real operator responsibilities early
- The foundation for future labs and certification

### This is *not*:
- A performance benchmark
- A multi-region or HA architecture
- A guide for custom service meshes or bespoke gateways
- A promise of security guarantees

---

## 2. Deployment Mode

**P0 Default: Full Self-Hosted**

- Control plane and data plane both run in the customer AWS account
- Customer is responsible for:
  - Network exposure
  - Authentication
  - Data persistence
  - Upgrades and backups

> Hybrid (SaaS control plane + SH data plane) is valid but **out of scope for P0 enablement**.

---
|
||||
|
||||
## 3. High-Level Architecture
|
||||
|
||||
Request flow (top to bottom):
|
||||
|
||||

|
||||
|
||||
Users / CI / SDKs
|
||||
→ Route53
|
||||
→ Application Load Balancer (ALB) + WAF
|
||||
→ Kubernetes Ingress (EKS)
|
||||
→ LangSmith application services
|
||||
|
||||
Persistent dependencies:
|
||||
|
||||
- PostgreSQL — metadata (projects, orgs, users)
|
||||
- Redis — cache and job queues
|
||||
- ClickHouse — traces and analytics
|
||||
- S3 — large artifacts and payload storage
|
||||
|
||||
|
||||
**Flow Summary**
|
||||
- Traffic enters via **Route53 → ALB** (with optional WAF).
|
||||
- ALB forwards to **Kubernetes ingress** inside EKS.
|
||||
- LangSmith application services run in EKS.
|
||||
- Persistent state is handled by:
|
||||
- **PostgreSQL** (metadata)
|
||||
- **Redis** (cache / queues)
|
||||
- **ClickHouse** (traces & analytics)
|
||||
- **S3** (large artifacts and payloads)
|
||||
|
||||
This diagram represents the **minimum supported topology** for the P0 reference architecture.
|
||||
|
||||
---
|
||||
|
||||
## 4. Network & Ingress
|
||||
|
||||
### VPC
|
||||
- Single VPC
|
||||
- **Public subnets**: ALB only
|
||||
- **Private subnets**:
|
||||
- EKS worker nodes
|
||||
- Data services (RDS, Redis, ClickHouse if in-cluster)
|
||||
|
||||
### Ingress
|
||||
- **Application Load Balancer (ALB)**
|
||||
- **AWS WAF strongly recommended**
|
||||
- TLS termination at ALB (end-to-end TLS recommended)
|
||||
- Optionally:
|
||||
- Internal ALB + VPN / PrivateLink for non-public access
|
||||
|
||||
### Egress
|
||||
- Outbound HTTPS access to required LangChain endpoints (if applicable)
|
||||
- Restrict egress access per organizational policy requirements
|
||||
|
||||
---
|
||||
|
||||
## 5. Compute: Kubernetes (EKS)
|
||||
|
||||
### Cluster
|
||||
- **Amazon EKS**
|
||||
- Managed node groups
|
||||
- Cluster Autoscaler enabled
|
||||
- Metrics Server enabled
|
||||
|
||||
### Baseline Capacity
|
||||
- Minimum cluster capacity:
|
||||
- **16 vCPU / 64 GB RAM** available
|
||||
- This includes LangSmith services + system overhead
|
||||
|
||||
---
|
||||
|
||||
## 6. Data Stores
|
||||
|
||||
LangSmith SH relies on three core data stores.
|
||||
|
||||
### PostgreSQL (Metadata)
|
||||
- **AWS RDS PostgreSQL or Aurora PostgreSQL**
|
||||
- PostgreSQL **14+**
|
||||
- Single AZ for P0 (HA is P1)
|
||||
- Automated backups enabled
|
||||
|
||||
### Redis (Cache / Queues)
|
||||
- **AWS ElastiCache (Redis OSS)**
|
||||
- Single node acceptable for P0
|
||||
- Persistence optional but recommended
|
||||
|
||||
### ClickHouse (Traces & Analytics)
|
||||
|
||||
ClickHouse is **memory- and I/O-intensive**. Proper sizing is critical for optimal performance and stability.
|
||||
|
||||
#### P0 Reference Sizing (Production Baseline)
|
||||
- **8 vCPU**
|
||||
- **32 GB RAM**
|
||||
- **SSD-backed persistent storage**
|
||||
- ~7000 IOPS
|
||||
- ~1000 MiB/s throughput
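
On EKS, one way to express this storage target is a gp3 StorageClass served by the AWS EBS CSI driver. The sketch below assumes that driver is installed in the cluster; the class name `clickhouse-gp3` and the function name are illustrative and do not come from the LangSmith chart. gp3 volumes start at a baseline of 3,000 IOPS and 125 MiB/s, so the higher figures have to be requested explicitly.

```bash
# Sketch: print a gp3 StorageClass sized to the P0 ClickHouse storage targets.
# Assumptions: the AWS EBS CSI driver (provisioner ebs.csi.aws.com) is installed,
# and "clickhouse-gp3" is a hypothetical class name.
render_clickhouse_storageclass() {
  local name="${1:-clickhouse-gp3}"
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${name}
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "7000"        # provisioned IOPS
  throughput: "1000"  # provisioned throughput in MiB/s
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
}
```

Possible usage: `render_clickhouse_storageclass | kubectl apply -f -`, then reference the class from the ClickHouse persistent volume claim. Check the current gp3 limits for your region before relying on these numbers.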
|
||||
|
||||
#### Allowed but Dev-Only
|
||||
- **4 vCPU / 16 GB RAM**
|
||||
- Non-production proof-of-concept only
|
||||
|
||||
#### Scaling Guidance (P1)
|
||||
- Scale to **16 vCPU / 64 GB RAM** when:
|
||||
- Trace ingestion grows
|
||||
- Query latency increases
|
||||
- Memory pressure appears
|
||||
|
||||
> Strong recommendation: use externally managed ClickHouse where possible.
|
||||
> In-cluster ClickHouse is supported for P0 and works well with proper operational practices.
|
||||
|
||||
---
|
||||
|
||||
## 7. Object Storage
|
||||
|
||||
### S3 (Strongly Recommended)
|
||||
- Store large trace artifacts and payloads
|
||||
- Reduces DB size and blast radius
|
||||
- Improves security posture for sensitive inputs/outputs
|
||||
|
||||
### Access Pattern
|
||||
- Use **IAM Roles for Service Accounts (IRSA)** where possible
|
||||
- No static credentials in Helm values
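
With IRSA, a Kubernetes service account is linked to an IAM role through the `eks.amazonaws.com/role-arn` annotation, and pods using that service account receive temporary credentials instead of static keys. The sketch below only renders such a service account; the function name, the default account name `langsmith-s3`, and the role ARN are placeholders, and how the LangSmith chart wires its own service accounts is not covered here.

```bash
# Sketch: render a ServiceAccount annotated for IRSA.
# Assumptions: the cluster has an IAM OIDC provider, and the role's trust policy
# already allows this namespace/service account. All names are hypothetical.
render_irsa_service_account() {
  local role_arn="$1" name="${2:-langsmith-s3}" ns="${3:-langsmith}"
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${name}
  namespace: ${ns}
  annotations:
    eks.amazonaws.com/role-arn: ${role_arn}
EOF
}
```

Possible usage: `render_irsa_service_account "<ROLE_ARN>" | kubectl apply -f -`.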
|
||||
|
||||
---
|
||||
|
||||
## 8. Secrets & Identity
|
||||
|
||||
### Secrets
|
||||
- **AWS Secrets Manager** (preferred)
|
||||
- Inject into Kubernetes via:
|
||||
- External Secrets
|
||||
- CSI driver
|
||||
- Secure environment injection
|
||||
|
||||
### Identity & Auth
|
||||
- LangSmith authentication must be configured explicitly
|
||||
- Supported patterns include:
|
||||
- Token-based authentication
|
||||
- OIDC / SSO (at least one concrete example recommended for enablement)
|
||||
|
||||
> For P0 enablement, select **one authentication pattern** to focus on. Additional patterns may be explored in future enablement tracks.
|
||||
|
||||
---
|
||||
|
||||
## 9. Observability (Platform-Level)
|
||||
|
||||
Minimum required:
|
||||
- Application logs accessible via CloudWatch
|
||||
- Kubernetes events visible
|
||||
- Health endpoints monitored
|
||||
|
||||
Optional (P1):
|
||||
- Prometheus / OpenTelemetry exporters
|
||||
- Alerting on:
|
||||
- Pod restarts
|
||||
- DB connectivity
|
||||
- Ingestion failures
|
||||
|
||||
---
|
||||
|
||||
## 10. Security Baseline (Non-Negotiable)
|
||||
|
||||
This reference architecture requires **essential security controls** as a baseline.
|
||||
|
||||
### MUST
|
||||
- TLS enabled
|
||||
- No plaintext secrets
|
||||
- Least-privilege IAM
|
||||
- Network isolation (private subnets for data services)
|
||||
- WAF or equivalent rate limiting at ingress
|
||||
|
||||
### SHOULD
|
||||
- Private access only (VPN / PrivateLink)
|
||||
- Auth required for all UI and API access
|
||||
- Regular patching and upgrades
|
||||
|
||||
### Explicit Disclaimer
|
||||
> This reference architecture does **not** guarantee security.
|
||||
> Customers are responsible for reviewing and approving deployments with their security teams.
|
||||
|
||||
---
|
||||
|
||||
## 11. What This Architecture Explicitly Excludes
|
||||
|
||||
These are **out of scope for P0 enablement**:
|
||||
- Multi-region active/active
|
||||
- Custom gateways or service meshes
|
||||
- HA ClickHouse clusters
|
||||
- Custom scaling policies beyond autoscaler defaults
|
||||
- Performance benchmarking beyond sanity checks
|
||||
|
||||
These may appear in P1/P2 enablement or certification tracks.
|
||||
|
||||
---
|
||||
|
||||
## 12. Why This Exists
|
||||
|
||||
This reference architecture exists to:
|
||||
- Reduce installation failures and complexity
|
||||
- Provide support teams with a shared baseline
|
||||
- Create a clear, well-documented enablement path
|
||||
- Serve as the foundation for:
|
||||
- Hands-on labs
|
||||
- Operator certification
|
||||
- Support playbooks
|
||||
|
||||
If you run into problems during implementation, they usually point to configuration that needs more attention rather than to system defects.
|
||||
|
||||
---
|
||||
|
||||
## 13. Next Artifacts (Planned)
|
||||
|
||||
- Preflight checklist
|
||||
- Deployment walkthrough
|
||||
- Known sharp edges
|
||||
- Failure-mode diagnostics
|
||||
- Operator mental model
|
||||
|
||||
These resources build **on top of this foundation**, providing additional guidance and support as you progress.
|
||||
@@ -0,0 +1,405 @@
|
||||
# LangSmith Self-Hosted on AWS — Troubleshooting Guide (P0)
|
||||
|
||||
**Purpose:** Fast triage for the P0 reference deployment.
|
||||
**Style:** Symptom → likely cause → exact checks → common fix.
|
||||
|
||||
This guide focuses on actionable, evidence-based troubleshooting. Every item maps to an observable signal and a deterministic check.
|
||||
|
||||
---
|
||||
|
||||
## 0. First Rule of Triage: Gather Evidence First
|
||||
|
||||
Before changing anything, capture essential diagnostic information. The easiest way to do this is using the provided diagnostic capture script.
|
||||
|
||||
### Quick Start: Automated Diagnostics Capture
|
||||
|
||||
Run the diagnostic script to automatically capture all required information:
|
||||
|
||||
```bash
./scripts/capture-diagnostics.sh
```
|
||||
|
||||
The script captures:
|
||||
- Pod list and detailed descriptions for all pods
|
||||
- Logs from all pods (current and previous if restarted)
|
||||
- Kubernetes events
|
||||
- Ingress resources and detailed configurations
|
||||
- Service and endpoint information
|
||||
- Node information and resource usage
|
||||
- ALB target group health (if AWS CLI is configured)
|
||||
|
||||
**Configuration via environment variables:**
|
||||
- `NAMESPACE` - Kubernetes namespace (default: `langsmith`)
|
||||
- `LOG_TAIL` - Number of log lines to capture per pod (default: `200`)
|
||||
- `EVENTS_TAIL` - Number of events to capture (default: `50`)
|
||||
- `OUTPUT_DIR` - Directory for diagnostic output (default: `./diagnostics`)
|
||||
- `AWS_REGION` - AWS region for ALB queries (default: `us-west-2`)
|
||||
|
||||
**Example with custom configuration:**
|
||||
```bash
NAMESPACE=langsmith-prod LOG_TAIL=500 OUTPUT_DIR=./prod-diagnostics ./scripts/capture-diagnostics.sh
```
|
||||
|
||||
The script creates a timestamped directory with all diagnostic information and a summary file. All output is saved for later analysis.
|
||||
|
||||
### Manual Capture (Alternative)
|
||||
|
||||
If you prefer to capture diagnostics manually, ensure you collect:
|
||||
|
||||
- `kubectl get pods -n langsmith -o wide`
|
||||
- `kubectl describe pod <POD> -n langsmith` (for each pod)
|
||||
- `kubectl logs <POD> -n langsmith --tail=200` (for each pod)
|
||||
- `kubectl get events -n langsmith --sort-by=.lastTimestamp | tail -50`
|
||||
- Ingress/ALB status:
|
||||
- `kubectl get ingress -n langsmith` (or your ingress resource type)
|
||||
- `kubectl describe ingress <INGRESS> -n langsmith`
|
||||
- If AWS-managed:
|
||||
- ALB target group health (healthy/unhealthy + reason)
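
For the ALB target group item above, the AWS CLI can report per-target state and reason. This is a minimal sketch, assuming the AWS CLI is configured and you already know the target group ARN (for example from `aws elbv2 describe-target-groups`); the function name is illustrative.

```bash
# Sketch: print target id, port, health state, and reason for one target group.
# Assumption: the caller supplies the target group ARN; region defaults to us-west-2.
alb_target_health() {
  local tg_arn="$1" region="${2:-us-west-2}"
  aws elbv2 describe-target-health \
    --target-group-arn "$tg_arn" \
    --region "$region" \
    --query 'TargetHealthDescriptions[].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}
```

Possible usage: `alb_target_health "<TARGET_GROUP_ARN>" us-west-2 > alb-target-health.txt`, saved alongside the other captured output.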
|
||||
|
||||
Capturing this information ensures you address the actual root cause rather than symptoms, making troubleshooting more efficient and effective.
|
||||
|
||||
---
|
||||
|
||||
## 1. The Deployment “Works” But UI Is Not Reachable
|
||||
|
||||
### Symptom
|
||||
- DNS resolves but browser times out
|
||||
- Browser shows `502/503`
|
||||
- ALB exists but shows no healthy targets
|
||||
|
||||
### Likely Causes
|
||||
- Ingress misconfigured
|
||||
- Service port mismatch
|
||||
- Pod readiness failing (so targets never become healthy)
|
||||
- Security group / NACL blocks
|
||||
|
||||
### Checks
|
||||
- `kubectl get ingress -n langsmith -o yaml`
|
||||
- `kubectl get svc -n langsmith`
|
||||
- `kubectl describe svc <SERVICE> -n langsmith`
|
||||
- `kubectl get endpoints -n langsmith`
|
||||
- `kubectl get pods -n langsmith`
|
||||
- Inspect readiness:
|
||||
- `kubectl describe pod <POD> -n langsmith | sed -n '/Readiness/,/Conditions/p'`
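
A quick way to combine the service and endpoint checks above is to list services whose Endpoints object has no ready addresses, which usually means a selector mismatch or failing readiness. The sketch below is one possible helper; the function name is illustrative, and it relies on `kubectl` printing `<none>` for empty custom columns.

```bash
# Sketch: list services in a namespace that currently have no ready endpoint addresses.
services_without_endpoints() {
  local ns="${1:-langsmith}"
  kubectl get endpoints -n "$ns" --no-headers \
    -o custom-columns='NAME:.metadata.name,ADDRESSES:.subsets[*].addresses[*].ip' \
    | awk '$2 == "<none>" { print $1 }'
}
```

Any name it prints is a service the ALB cannot route to yet; compare that service's selector with the labels on the pods you expect behind it.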
|
||||
|
||||
### Fixes (Common)
|
||||
- Ensure ingress points to the correct service + port
|
||||
- Ensure service selectors match pod labels
|
||||
- Fix readiness probe failures before touching ALB
|
||||
- Confirm ALB security group allows inbound 443 and node security group allows target traffic
|
||||
|
||||
---
|
||||
|
||||
## 2. Pods CrashLoopBackOff Immediately
|
||||
|
||||
### Symptom
|
||||
- Pods oscillate between `CrashLoopBackOff` and `Running`
|
||||
- Logs show immediate exit
|
||||
|
||||
### Likely Causes
|
||||
- Missing or invalid secrets
|
||||
- DB/Redis/ClickHouse connection failure
|
||||
- Misconfigured required env vars
|
||||
|
||||
### Checks
|
||||
- `kubectl logs <POD> -n langsmith --previous --tail=200`
|
||||
- `kubectl describe pod <POD> -n langsmith` (look for env var injection and secret refs)
|
||||
- Confirm secrets exist:
|
||||
- `kubectl get secret -n langsmith`
|
||||
- Confirm external connectivity from inside cluster:
|
||||
- Launch a temporary debug pod and test TCP connectivity to DB hosts/ports
|
||||
|
||||
### Fixes (Common)
|
||||
- Correct secret names/keys referenced in Helm values
|
||||
- Verify DB hostnames and ports (RDS endpoints, Redis endpoints)
|
||||
- Fix network policy / security groups if connections time out
|
||||
|
||||
---
|
||||
|
||||
## 3. Everything Is Running, But “First Successful Trace” Fails
|
||||
|
||||
### Symptom
|
||||
- UI loads
|
||||
- SDK calls fail (401/403/404) or traces never appear
|
||||
- Client sees timeouts or 5xx
|
||||
|
||||
### Likely Causes
|
||||
- Wrong endpoint (`LANGSMITH_ENDPOINT`) or wrong path
|
||||
- Auth mismatch (token vs SSO)
|
||||
- Ingestion path failing due to ClickHouse or Redis issues
|
||||
- ALB health is fine but app errors on ingest
|
||||
|
||||
### Checks
|
||||
- From client machine:
|
||||
- Confirm endpoint resolves and responds (TLS + HTTP status)
|
||||
- In cluster logs:
|
||||
- Search logs of the API/ingestion service for auth or write errors
|
||||
- Check ClickHouse health:
|
||||
- Look for write failures, memory pressure, disk pressure
|
||||
- Check Redis:
|
||||
- Look for connection errors or queue backlog signals (if exposed)
|
||||
|
||||
### Fixes (Common)
|
||||
- Ensure client is using the correct base URL and auth method
|
||||
- Regenerate token / verify permissions
|
||||
- Fix ClickHouse sizing or disk throughput issues if writes fail
|
||||
- Fix Redis connectivity if queues are used for ingest
|
||||
|
||||
---
|
||||
|
||||
## 4. ALB Exists But Targets Are “Unhealthy”
|
||||
|
||||
### Symptom
|
||||
- ALB target group shows all targets unhealthy
|
||||
- UI returns `503` even though pods are running
|
||||
|
||||
### Likely Causes
|
||||
- Readiness probe failing
|
||||
- Target group health check path/port mismatch
|
||||
- Service isn’t exposing the expected port
|
||||
- Pods are running but not listening
|
||||
|
||||
### Checks
|
||||
- `kubectl describe pod <POD> -n langsmith` (readiness probe results)
|
||||
- `kubectl get svc -n langsmith -o yaml`
|
||||
- Confirm the container port aligns with service targetPort
|
||||
- Confirm health check path matches what the service actually serves
|
||||
|
||||
### Fixes (Common)
|
||||
- Correct ingress annotations / health check settings
|
||||
- Fix readiness probe configuration or dependencies causing readiness to fail
|
||||
- Align service ports with actual container ports
|
||||
|
||||
---
|
||||
|
||||
## 5. DB Connectivity Failures (PostgreSQL)
|
||||
|
||||
### Symptom
|
||||
- App logs show:
|
||||
- authentication failures
|
||||
- connection refused
|
||||
- timeout
|
||||
- “could not translate host name”
|
||||
- App won’t start or fails on request
|
||||
|
||||
### Likely Causes
|
||||
- Wrong credentials
|
||||
- Security group blocks EKS to RDS
|
||||
- RDS not in the right subnets or routing broken
|
||||
- DNS/resolution issues inside cluster
|
||||
|
||||
### Checks
|
||||
- Validate the RDS endpoint and port
|
||||
- Confirm security groups allow inbound from EKS node group / pod CIDR (depending on setup)
|
||||
- Test connectivity from a debug pod:
|
||||
- DNS resolution
|
||||
- TCP connect to `<rds-endpoint>:5432`
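
The debug-pod test above can be scripted. The sketch below starts a throwaway pod, resolves the hostname, and attempts a TCP connection. It assumes the `busybox:1.36` image is pullable from your cluster and that its `nslookup` and `nc -z` applets behave as expected; the function name and pod name are illustrative.

```bash
# Sketch: check DNS resolution and TCP reachability from inside the cluster.
# Assumptions: busybox:1.36 is pullable; its nc supports -z (probe only) and -w (timeout).
check_tcp_from_cluster() {
  local host="$1" port="$2" ns="${3:-langsmith}"
  kubectl run "tcp-check-$$" -n "$ns" --rm -i --restart=Never \
    --image=busybox:1.36 --command -- \
    sh -c "nslookup '$host' && nc -z -w 5 '$host' '$port' && echo TCP_OK"
}
```

Possible usage: `check_tcp_from_cluster "<rds-endpoint>" 5432`. A DNS failure points at cluster DNS or a wrong hostname, while a timeout after successful resolution usually points at security groups or routing.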
|
||||
|
||||
### Fixes (Common)
|
||||
- Correct creds in secrets
|
||||
- Fix SG rules
|
||||
- Ensure private subnets have proper routing and NAT where required
|
||||
- Ensure RDS is reachable from EKS VPC/subnets
|
||||
|
||||
---
|
||||
|
||||
## 6. Redis Connectivity Failures
|
||||
|
||||
### Symptom
|
||||
- Logs show Redis connection errors/timeouts
|
||||
- Background jobs stall (if used)
|
||||
- Ingestion or async tasks fail
|
||||
|
||||
### Likely Causes
|
||||
- Wrong endpoint/port
|
||||
- Security group blocks EKS to ElastiCache
|
||||
- Auth mismatch (if Redis auth enabled)
|
||||
|
||||
### Checks
|
||||
- Confirm ElastiCache endpoint and port
|
||||
- Test TCP connectivity from debug pod
|
||||
- Check whether Redis auth is enabled and whether Helm values match
|
||||
|
||||
### Fixes (Common)
|
||||
- Fix endpoint in values
|
||||
- Fix security group rules
|
||||
- Align auth config
|
||||
|
||||
---
|
||||
|
||||
## 7. ClickHouse Problems (Most Common Real Root Cause)
|
||||
|
||||
### 7.1 ClickHouse OOM / Memory Pressure
|
||||
|
||||
**Symptom**
|
||||
- ClickHouse pod restarts
|
||||
- OOMKilled events
|
||||
- Trace writes fail or become slow
|
||||
|
||||
**Likely Cause**
|
||||
- ClickHouse undersized (the dev-only 4 vCPU / 16 GB sizing used for a real workload)
|
||||
- Memory limits too tight
|
||||
- Query pressure
|
||||
|
||||
**Checks**
|
||||
- `kubectl describe pod <clickhouse-pod> -n langsmith` (look for OOMKilled)
|
||||
- `kubectl logs <clickhouse-pod> -n langsmith --tail=200`
|
||||
- `kubectl top pod <clickhouse-pod> -n langsmith`
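
Instead of scanning `describe` output by eye, the last termination reason of each container can be read directly from the pod status. A minimal sketch, with illustrative function names:

```bash
# Sketch: print "<container>\t<last termination reason>" for each container in a pod.
last_termination_reasons() {
  local pod="$1" ns="${2:-langsmith}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Succeeds (exit 0) if any container in the pod was last terminated by the OOM killer.
was_oom_killed() {
  last_termination_reasons "$@" | grep -q 'OOMKilled'
}
```

Possible usage: `was_oom_killed <clickhouse-pod> && echo "ClickHouse was OOMKilled"`.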
|
||||
|
||||
**Fixes**
|
||||
- Move to **8 vCPU / 32GB RAM** baseline
|
||||
- Increase memory limits/requests
|
||||
- Reduce concurrent ingest/query load
|
||||
|
||||
---
|
||||
|
||||
### 7.2 ClickHouse Disk / IO Throughput Issues
|
||||
|
||||
**Symptom**
|
||||
- Latency spikes
|
||||
- Writes time out
|
||||
- ClickHouse logs mention slow merges / IO waits
|
||||
|
||||
**Likely Cause**
|
||||
- Slow storage class
|
||||
- Inadequate IOPS/throughput
|
||||
- Disk nearing capacity
|
||||
|
||||
**Checks**
|
||||
- Confirm PV storage class and performance characteristics
|
||||
- Check disk usage in ClickHouse pod
|
||||
- Review ClickHouse logs for merge pressure / IO wait
|
||||
|
||||
**Fixes**
|
||||
- Use SSD-backed storage with sufficient IOPS/throughput
|
||||
- Increase volume size
|
||||
- Move ClickHouse to a dedicated node group / better instance type
|
||||
|
||||
---
|
||||
|
||||
### 7.3 ClickHouse Not Persistent (Data Loss Risk)
|
||||
|
||||
**Symptom**
|
||||
- ClickHouse redeploy loses data
|
||||
- Traces disappear after restart
|
||||
|
||||
**Likely Cause**
|
||||
- No persistent volume attached
|
||||
- StatefulSet misconfigured
|
||||
|
||||
**Checks**
|
||||
- Confirm PVC exists and is bound:
|
||||
- `kubectl get pvc -n langsmith`
|
||||
- Confirm ClickHouse uses that PVC
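
The PVC check above can be reduced to listing any claim that is not in the `Bound` state. A small sketch, with an illustrative function name:

```bash
# Sketch: list PVCs in a namespace whose STATUS column is not "Bound".
unbound_pvcs() {
  local ns="${1:-langsmith}"
  kubectl get pvc -n "$ns" --no-headers 2>/dev/null \
    | awk '$2 != "Bound" { print $1 " (" $2 ")" }'
}
```

Empty output means every claim that exists is bound. It does not prove a claim exists for ClickHouse, so still confirm that `kubectl get pvc -n langsmith` shows one and that the ClickHouse pod mounts it.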
|
||||
|
||||
**Fixes**
|
||||
- Attach PVC and ensure StatefulSet mounts it
|
||||
- Do not treat ClickHouse as stateless
|
||||
|
||||
---
|
||||
|
||||
## 8. Kubernetes Scheduling Issues
|
||||
|
||||
### Symptom
|
||||
- Pods stuck in `Pending`
|
||||
- Events show “insufficient cpu/memory”
|
||||
- ClickHouse never schedules
|
||||
|
||||
### Likely Causes
|
||||
- Cluster too small
|
||||
- Node group instance types too small
|
||||
- Taints/affinity constraints prevent scheduling
|
||||
|
||||
### Checks
|
||||
- `kubectl describe pod <POD> -n langsmith` (look at scheduling events)
|
||||
- `kubectl get nodes -o wide`
|
||||
- Check taints:
|
||||
- `kubectl describe node <NODE> | sed -n '/Taints/,/Conditions/p'`
|
||||
|
||||
### Fixes
|
||||
- Increase node group size
|
||||
- Use larger instance types
|
||||
- Remove/adjust taints and affinities
|
||||
- Ensure ClickHouse has a node that can fit **8 vCPU / 32 GB allocatable**
|
||||
|
||||
---
|
||||
|
||||
## 9. TLS / Certificate Issues
|
||||
|
||||
### Symptom
|
||||
- Browser warnings
|
||||
- Client SDK fails TLS handshake
|
||||
- Mixed content or redirect loops
|
||||
|
||||
### Likely Causes
|
||||
- Wrong ACM cert attached
|
||||
- Wrong DNS name on cert
|
||||
- HTTP/HTTPS mismatch
|
||||
|
||||
### Checks
|
||||
- Confirm ALB listener is HTTPS
|
||||
- Confirm cert CN/SAN includes your DNS name
|
||||
- Confirm DNS record points to the correct ALB
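
The certificate checks above can be run from any client machine with OpenSSL. This sketch prints the subject, expiry date, and subject alternative names of the certificate actually served for a hostname. It assumes OpenSSL 1.1.1 or newer (for the `-ext` option); the function name is illustrative.

```bash
# Sketch: show subject, expiry, and SANs of the certificate served at host:port.
# Assumption: OpenSSL >= 1.1.1, which supports "x509 -ext".
show_cert_names() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "${host}:${port}" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -enddate -ext subjectAltName
}
```

Possible usage: `show_cert_names langsmith.<your-domain>`. If the DNS name you use is missing from the SAN list, the wrong certificate is attached to the listener.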
|
||||
|
||||
### Fixes
|
||||
- Attach correct cert
|
||||
- Fix DNS record
|
||||
- Configure HTTP-to-HTTPS redirects deliberately and in one place, so that overlapping redirect rules cannot cause loops
|
||||
|
||||
---
|
||||
|
||||
## 10. “It Worked Yesterday” Failures (The Dangerous Ones)
|
||||
|
||||
### Symptom
|
||||
- Random 5xx
|
||||
- Slow UI
|
||||
- Traces intermittently missing
|
||||
|
||||
### Likely Causes
|
||||
- Resource pressure (CPU throttling / memory pressure)
|
||||
- ClickHouse disk pressure or merge backlog
|
||||
- Redis saturation
|
||||
- Node churn / autoscaling issues
|
||||
|
||||
### Checks
|
||||
- `kubectl top pods -n langsmith`
|
||||
- Pod restarts:
|
||||
- `kubectl get pods -n langsmith --sort-by='.status.containerStatuses[0].restartCount'`
|
||||
- Node events and scaling activity
|
||||
- DB metrics (RDS CPU/connections; Redis CPU/memory; ClickHouse memory/disk)
|
||||
|
||||
### Fixes
|
||||
- Add capacity (scale nodes)
|
||||
- Increase ClickHouse resources or improve disk class
|
||||
- Increase Redis tier if saturated
|
||||
- Tune autoscaler limits (don’t let it starve the cluster)
|
||||
|
||||
---
|
||||
|
||||
## 11. What to Include in a Support Request (If You Must Escalate)
|
||||
|
||||
If you open a ticket, include:
|
||||
|
||||
- Reference path confirmation:
|
||||
- “Deployed via reference architecture + terraform + helm”
|
||||
- repo SHAs / chart versions
|
||||
- Current cluster state:
|
||||
- `kubectl get pods -n langsmith -o wide`
|
||||
- relevant `describe` output
|
||||
- last 200 lines of logs from failing pods
|
||||
- External dependencies:
|
||||
- Postgres type/version (RDS/Aurora, PG version)
|
||||
- Redis type/version
|
||||
- ClickHouse model (external vs in-cluster) + sizing
|
||||
- ALB target health status and error reason
|
||||
|
||||
Providing this information upfront enables faster resolution. If diagnostics are incomplete, the first step will be to collect the necessary diagnostic data.
|
||||
|
||||
---
|
||||
|
||||
## 12. Add to This Guide (How)
|
||||
|
||||
Only add entries that:
|
||||
- Came from a real failure
|
||||
- Include a deterministic check
|
||||
- Include a fix that is repeatable
|
||||
+303
@@ -0,0 +1,303 @@
|
||||
# LangSmith Self-Hosted on AWS — Deployment Walkthrough (P0)
|
||||
|
||||
**Goal:** Get from zero → running LangSmith SH → first successful trace → basic health validation.
|
||||
**Assumption:** You passed [`PREFLIGHT.md`](./PREFLIGHT.md). If not, stop and do that first.
|
||||
|
||||
This walkthrough is intentionally opinionated and linear. Following it step-by-step ensures you stay on the reference path and can receive full support.
|
||||
|
||||
---
|
||||
|
||||
## 0. Inputs You Must Decide Up Front
|
||||
|
||||
Pick these *before* you touch Terraform:
|
||||
|
||||
- **AWS Region:** `us-west-2` (example — pick one and stick to it)
|
||||
- **Environment name:** `dev` / `staging` / `prod` (do not share resources across envs)
|
||||
- **DNS name:** `langsmith.<your-domain>`
|
||||
- **Exposure model:** Public (ALB) or Private-only (VPN/PrivateLink)
|
||||
- **Auth model:** Token-based (P0) or OIDC/SSO (P1 unless already standard internally)
|
||||
- **Data store model:**
|
||||
- Postgres: RDS/Aurora (recommended)
|
||||
- Redis: ElastiCache (recommended)
|
||||
- ClickHouse: Externally managed (preferred) or in-cluster (allowed)
|
||||
|
||||
Write these in a `deploy/ENV.md` file for your own sanity.
|
||||
|
||||
---
|
||||
|
||||
## 1. Clone Repos and Pin Versions
|
||||
|
||||
You are building an enablement path. That means **pinning** matters.
|
||||
|
||||
- Clone:
|
||||
- `https://github.com/langchain-ai/terraform`
|
||||
- `https://github.com/langchain-ai/helm`
|
||||
- Record:
|
||||
- Terraform repo commit SHA
|
||||
- Helm repo commit SHA or chart version
|
||||
- Do not “float” versions for the reference deployment.
|
||||
|
||||
> Reproducibility is essential for effective enablement. If you cannot reproduce a deployment later, the enablement process has not been fully captured.
|
||||
|
||||
---
|
||||
|
||||
## 2. Terraform: Provision AWS Infrastructure
|
||||
|
||||
### 2.1 Configure Terraform State
|
||||
- Use S3 backend + DynamoDB lock (recommended).
|
||||
- Ensure state is **unique per environment**.
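
One way to keep state unique per environment is to derive the state key from the environment name. The sketch below renders a backend block for the S3 + DynamoDB approach named above; the function name, the key layout `langsmith/<env>/terraform.tfstate`, and the bucket and table names are assumptions for illustration. Recent Terraform releases also offer S3-native locking, which is not shown here.

```bash
# Sketch: render a Terraform S3 backend block with a per-environment state key.
# Assumptions: the bucket and DynamoDB lock table already exist; names are hypothetical.
render_backend_tf() {
  local env="$1" bucket="$2" table="$3" region="${4:-us-west-2}"
  cat <<EOF
terraform {
  backend "s3" {
    bucket         = "${bucket}"
    key            = "langsmith/${env}/terraform.tfstate"
    region         = "${region}"
    dynamodb_table = "${table}"
    encrypt        = true
  }
}
EOF
}
```

Possible usage: `render_backend_tf dev my-tf-state my-tf-locks > backend.tf`, followed by `terraform init`.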
|
||||
|
||||
### 2.2 Apply Infrastructure
|
||||
Provision (at minimum):
|
||||
- VPC + subnets (public for ALB, private for nodes/data)
|
||||
- EKS cluster + managed node groups
|
||||
- RDS Postgres (14+)
|
||||
- ElastiCache Redis
|
||||
- S3 bucket for artifacts
|
||||
- Security groups and IAM roles/policies
|
||||
- (Optional) Route53 hosted zone / record scaffolding
|
||||
|
||||
**Hard requirement:** Ensure the EKS node groups provide at least:
|
||||
- **16 vCPU / 64GB RAM** allocatable capacity total
|
||||
- **ClickHouse capacity** if in-cluster:
|
||||
- One node with **8 vCPU / 32GB RAM** allocatable
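
Allocatable capacity is what the scheduler can actually use, and it is lower than the nominal instance size. Once the cluster is reachable with `kubectl`, the total can be summed as in the sketch below. The function name is illustrative, and the parsing assumes CPU is reported as whole cores or millicores and memory with a `Ki`, `Mi`, or `Gi` suffix or as plain bytes.

```bash
# Sketch: sum allocatable CPU and memory across all nodes.
# Assumption: cpu is "4" or "3920m"; memory is "...Ki", "...Mi", "...Gi", or plain bytes.
cluster_allocatable() {
  kubectl get nodes --no-headers \
    -o custom-columns='CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory' \
    | awk '
      {
        cpu = $1; mem = $2
        if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu = cpu / 1000 }
        if (mem ~ /Ki$/)      { sub(/Ki$/, "", mem); mem = mem / 1024 / 1024 }
        else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem / 1024 }
        else if (mem ~ /Gi$/) { sub(/Gi$/, "", mem) }
        else                  { mem = mem / 1024 / 1024 / 1024 }
        total_cpu += cpu; total_mem += mem
      }
      END { printf "%.1f vCPU / %.1f GiB allocatable\n", total_cpu, total_mem }'
}
```

Compare the result with the 16 vCPU / 64 GB requirement, and check individual nodes separately for the ClickHouse requirement, since a total can be met by nodes that are each too small.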
|
||||
|
||||
### 2.3 Terraform Verification Gates (Stop if any fail)
|
||||
- [ ] `aws eks describe-cluster` shows `ACTIVE`
|
||||
- [ ] Worker nodes in private subnets can reach the internet (NAT)
|
||||
- [ ] RDS reachable from EKS subnets/security groups
|
||||
- [ ] Redis reachable from EKS subnets/security groups
|
||||
- [ ] S3 bucket exists and IAM access path is defined (IRSA preferred)
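
The first gate can be checked non-interactively with a JMESPath query on the cluster status. A minimal sketch, assuming the AWS CLI is configured for the target account; the function names are illustrative.

```bash
# Sketch: print the EKS cluster status (for example ACTIVE or CREATING).
eks_cluster_status() {
  local cluster="$1" region="${2:-us-west-2}"
  aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.status' --output text
}

# Returns non-zero unless the cluster status is ACTIVE.
require_active_cluster() {
  local status
  status="$(eks_cluster_status "$@")" || return 1
  if [ "$status" != "ACTIVE" ]; then
    echo "cluster status is ${status}, expected ACTIVE" >&2
    return 1
  fi
}
```

Possible usage: `require_active_cluster <CLUSTER_NAME> <REGION> || exit 1` at the top of any script that goes on to use `kubectl`.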
|
||||
|
||||
---
|
||||
|
||||
## 3. Kubernetes: Connect and Validate the Cluster
|
||||
|
||||
### 3.1 Connect to the Cluster
|
||||
- Update kubeconfig:
|
||||
- `aws eks update-kubeconfig --region <REGION> --name <CLUSTER_NAME>`
|
||||
- Confirm:
|
||||
- `kubectl get nodes`
|
||||
|
||||
### 3.2 Install/Validate Required Add-ons
|
||||
You must have:
|
||||
- Metrics Server
|
||||
- Cluster Autoscaler
|
||||
|
||||
Verification:
|
||||
- `kubectl top nodes` returns metrics
|
||||
- Autoscaler is running and has permissions
|
||||
|
||||
### 3.3 Create a Namespace
|
||||
Create a dedicated namespace, e.g.:
|
||||
- `langsmith`
|
||||
|
||||
## 3.4 Ingress Gate — Prove ALB Works Before Installing LangSmith
|
||||
|
||||
Complete this validation **before** Helm-installing LangSmith. Many deployment issues initially attributed to LangSmith are actually ingress, controller, or subnet-tagging configuration problems.
|
||||
|
||||
### 3.4.1 Deploy a tiny test app
|
||||
Deploy any minimal HTTP echo service into a test namespace (or the `langsmith` namespace). Confirm:
|
||||
- `kubectl get pods` shows it running
|
||||
- `kubectl get svc` shows endpoints
|
||||
|
||||
### 3.4.2 Create a test Ingress that provisions an ALB
|
||||
Create an Ingress pointing at the test service.
|
||||
|
||||
Your success criteria are binary:
|
||||
- [ ] An **ALB** is created
|
||||
- [ ] A target group is created
|
||||
- [ ] Targets become **healthy**
|
||||
- [ ] You can hit the endpoint and get a response over **HTTPS**
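
A test Ingress for this gate might look like the sketch below. It assumes the AWS Load Balancer Controller is installed with an `alb` IngressClass, that an ACM certificate already exists, and that the test service listens on port 80. The function name, the Ingress name `alb-gate-test`, the service name `echo`, and the namespace `ingress-test` are placeholders. Use `internal` instead of `internet-facing` for a private-only exposure model.

```bash
# Sketch: render a test Ingress that should cause the controller to provision an HTTPS ALB.
# Assumptions: IngressClass "alb" exists; the service exposes port 80; names are hypothetical.
render_test_ingress() {
  local host="$1" cert_arn="$2" svc="${3:-echo}" ns="${4:-ingress-test}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alb-gate-test
  namespace: ${ns}
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: ${cert_arn}
spec:
  ingressClassName: alb
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${svc}
                port:
                  number: 80
EOF
}
```

Possible usage: `render_test_ingress test.<your-domain> "<ACM_CERT_ARN>" | kubectl apply -f -`, then watch `kubectl get ingress -n ingress-test` until an ALB hostname appears in the address column.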
|
||||
|
||||
### 3.4.3 If this fails, stop
|
||||
Do not proceed to LangSmith until this gate passes.
|
||||
|
||||
When it fails, the first places to look are:
|
||||
- Kubernetes events on the Ingress
|
||||
- AWS Load Balancer Controller logs
|
||||
- ALB target group health reasons in the AWS console
|
||||
|
||||
> If you are not using ALB for ingress, you are operating outside the P0 reference path.
|
||||
|
||||
---
|
||||
|
||||
## 4. Prepare Dependencies and Secrets
|
||||
|
||||
### 4.1 Collect Required Connection Info
|
||||
You need:
|
||||
- Postgres host/port/db/user/password
|
||||
- Redis host/port (and auth if enabled)
|
||||
- ClickHouse endpoint/user/password (or in-cluster config)
|
||||
- S3 bucket name and region
|
||||
|
||||
### 4.2 Store Secrets (Do Not Put in Git)
|
||||
Preferred: AWS Secrets Manager + External Secrets integration.
|
||||
|
||||
At minimum for P0 enablement:
|
||||
- Keep secrets out of repo
|
||||
- Inject into Kubernetes securely (ExternalSecrets/CSI/secure env)
|
||||
|
||||
**Stop condition:** Never commit passwords or secrets into `values.yaml` or version control. Use a secrets management solution instead.
|
||||
|
||||
---
|
||||
|
||||
## 5. Helm: Install LangSmith
|
||||
|
||||
### 5.1 Choose the Values Strategy
|
||||
You should have:
|
||||
- `values.yaml` (non-secret config)
|
||||
- `secrets.yaml` OR external secrets (secret values only, not committed)
|
||||
|
||||
### 5.2 Configure Required Values
|
||||
Your Helm values must define:
|
||||
- External Postgres connection
|
||||
- External Redis connection
|
||||
- ClickHouse configuration (external or in-cluster)
|
||||
- S3 artifact storage (strongly recommended)
|
||||
- Ingress configuration (ALB + TLS)
|
||||
|
||||
### 5.3 Install/Upgrade
|
||||
- Install the chart into the `langsmith` namespace.
|
||||
- Use `helm upgrade --install` (idempotent).
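
A pinned, repeatable install could be wrapped as in the sketch below. The release name `langsmith`, the function name, and the `values.yaml` path are assumptions; pass the chart reference and the chart version you recorded in step 1. Note that `--version` applies to charts pulled from a repository, not to a local chart directory.

```bash
# Sketch: idempotent, version-pinned install of the chart into the target namespace.
# Assumptions: release name "langsmith"; non-secret config lives in ./values.yaml.
install_langsmith() {
  local chart="$1" version="$2" ns="${3:-langsmith}"
  helm upgrade --install langsmith "$chart" \
    --namespace "$ns" \
    --version "$version" \
    -f values.yaml \
    --wait --timeout 15m
}
```

Possible usage: `install_langsmith <CHART_REF> <CHART_VERSION>`. Running the same command twice with unchanged inputs should leave the release unchanged, which is the idempotency the reference path relies on.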
|
||||
|
||||
### 5.4 Helm Verification Gates (Stop if any fail)
|
||||
- [ ] All pods in `langsmith` namespace reach `Running` or expected steady state
|
||||
- [ ] No CrashLoopBackOff
|
||||
- [ ] Services have endpoints
|
||||
- [ ] Ingress is created and gets an ALB hostname/address
|
||||
|
||||
Commands you should run (conceptually):
|
||||
- `kubectl get pods -n langsmith`
|
||||
- `kubectl describe pod <...> -n langsmith`
|
||||
- `kubectl get svc -n langsmith`
|
||||
- `kubectl get ingress -n langsmith` (or equivalent ingress resource)
|
||||
|
||||
---
|
||||
|
||||
## 6. Ingress + DNS: Make It Reachable
|
||||
|
||||
### 6.1 TLS
|
||||
- Ensure the ALB listener is HTTPS
|
||||
- Ensure cert is valid (ACM recommended)
|
||||
|
||||
### 6.2 DNS
|
||||
- Create a Route53 record:
|
||||
- `langsmith.<domain>` → ALB DNS name
|
||||
|
||||
### 6.3 Reachability Gate
|
||||
- [ ] You can load the LangSmith UI at `https://langsmith.<domain>`
|
||||
- [ ] Auth behaves as intended (token login or SSO)
|
||||
|
||||
---
|
||||
|
||||
## 7. “First Successful Trace” (The Real Success Condition)
|
||||
|
||||
A deployment is not “done” until traces flow.
|
||||
|
||||
### 7.1 Create an API Key / Token (if applicable)
|
||||
- Create the token per your configured auth model.
|
||||
- Store it securely.
|
||||
|
||||
### 7.2 Send a Minimal Trace
|
||||
From a laptop or CI runner with egress to the endpoint:
|
||||
- Configure `LANGSMITH_ENDPOINT`
|
||||
- Configure auth (`LANGSMITH_API_KEY` or equivalent)
|
||||
- Run a minimal trace-producing script (LangChain example or direct API).
|
||||
|
||||
### 7.3 Trace Gate (Stop if fails)
|
||||
- [ ] A trace appears in the LangSmith UI
|
||||
- [ ] Trace includes at least one run/span
|
||||
- [ ] No ingestion errors in logs
|
||||
|
||||
If this fails, do not proceed to operational tasks. Fix ingestion first to ensure the system is functioning correctly.
|
||||
|
||||
---
|
||||
|
||||
## 8. Basic Health Validation (P0 Ops Readiness)
|
||||
|
||||
### 8.1 What “Healthy” Means (Minimum)
|
||||
- UI loads reliably
|
||||
- API responds
|
||||
- DB connections stable
|
||||
- No sustained error logs
|
||||
- ClickHouse writes succeed
|
||||
- Redis queues not stuck
|
||||
|
||||
### 8.2 Validate Logs
|
||||
Check:
|
||||
- LangSmith app logs for errors
|
||||
- ClickHouse logs for disk/memory pressure
|
||||
- Ingress/ALB logs (4xx/5xx spikes)
|
||||
|
||||
### 8.3 Validate Resource Pressure
|
||||
- `kubectl top pods -n langsmith`
|
||||
- Look for:
|
||||
- OOMKills
|
||||
- CPU throttling
|
||||
- Persistent volume saturation
|
||||
|
||||
---
|
||||
|
||||
## 9. Backup & Restore (P0 Expectations)
|
||||
|
||||
For P0 enablement, you must at least:
|
||||
- Confirm RDS backups are enabled
|
||||
- Confirm ClickHouse persistence strategy is defined
|
||||
- Confirm S3 bucket lifecycle/versioning policy is intentional
|
||||
|
||||
You do not need to execute a restore yet, but you must document how it would be done.
|
||||
|
||||
---
|
||||
|
||||
## 10. Common Failure Points (Fast Triage)
|
||||
|
||||
If deployment fails, the usual culprits are:
|
||||
|
||||
1. **Networking / Security Groups**
|
||||
- EKS can’t reach Postgres/Redis/ClickHouse
|
||||
2. **ClickHouse undersized or slow disk**
|
||||
- OOM, high latency, ingestion failures
|
||||
3. **Ingress misconfiguration**
|
||||
- ALB created but no healthy targets
|
||||
4. **Auth mismatch**
|
||||
- UI loads but API calls fail
|
||||
5. **Secrets handling**
|
||||
- Bad credentials injected, pods loop
|
||||
|
||||
When something breaks, capture:
|
||||
- `kubectl describe`
|
||||
- pod logs
|
||||
- DB connection test results
|
||||
- ALB target health
|
||||
|
||||
This data becomes your failure-mode catalog later.

---

## 11. “Done” Definition (P0)

You are done only when:

- [ ] Terraform applied cleanly and is reproducible
- [ ] Helm install is idempotent (`upgrade --install` works)
- [ ] UI reachable via HTTPS at your chosen DNS name
- [ ] First successful trace appears in the UI
- [ ] Basic health checks are green (no crash loops, stable DB connectivity)

If any box is unchecked, the deployment is not done. Keep working through the checklist until every item passes.
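
The "reproducible" item can be tested by re-running a plan after apply: `terraform plan -detailed-exitcode` exits `0` when nothing would change, `2` when changes are pending, and `1` on error. The sketch below only interprets that exit status; `describe_plan_status` is a hypothetical helper and the messages are illustrative.

```bash
# Hypothetical helper: translate the exit status of
# `terraform plan -detailed-exitcode` into a checklist verdict.
describe_plan_status() {
    case "$1" in
        0) echo "reproducible: no changes pending" ;;
        2) echo "drift: plan shows pending changes" ;;
        *) echo "error: plan failed" ;;
    esac
}

# Usage (run from the Terraform working directory, without `set -e`,
# because exit status 2 is an expected outcome here):
#   terraform plan -detailed-exitcode -input=false > /dev/null
#   describe_plan_status "$?"
```

A similar spirit applies to the Helm item: running the same `helm upgrade --install` command twice with unchanged values should succeed both times.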

---

## Appendix: What to Capture During Your First Real Deployment

As you run this the first time, log:
- Where you hesitated
- What you had to guess
- What you looked up
- What failed and how you fixed it

Those are the inputs for:
- `TROUBLESHOOTING.md`
- “Top failure modes”
- Future certification labs
Executable
+268
@@ -0,0 +1,268 @@
#!/usr/bin/env bash

# LangSmith Self-Hosted Diagnostics Capture Script
# This script captures essential diagnostic information for troubleshooting
# LangSmith Self-Hosted deployments on AWS/EKS.

set -euo pipefail

# Configuration via environment variables (with defaults)
NAMESPACE="${NAMESPACE:-langsmith}"
LOG_TAIL="${LOG_TAIL:-200}"
EVENTS_TAIL="${EVENTS_TAIL:-50}"
OUTPUT_DIR="${OUTPUT_DIR:-./diagnostics}"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT_PATH="${OUTPUT_DIR}/${TIMESTAMP}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Create output directory
mkdir -p "${OUTPUT_PATH}"

echo -e "${GREEN}Capturing diagnostics for namespace: ${NAMESPACE}${NC}"
echo -e "${GREEN}Output directory: ${OUTPUT_PATH}${NC}"
echo ""

# Function to run command and save output
capture_output() {
    local description="$1"
    local command="$2"
    local filename="$3"

    echo -e "${YELLOW}Capturing: ${description}${NC}"
    if eval "${command}" > "${OUTPUT_PATH}/${filename}" 2>&1; then
        echo -e "${GREEN} ✓ Saved to ${filename}${NC}"
    else
        echo -e "${RED} ✗ Failed to capture ${description}${NC}"
    fi
    echo ""
}

# Check if kubectl is available
if ! command -v kubectl &> /dev/null; then
    echo -e "${RED}Error: kubectl is not installed or not in PATH${NC}"
    exit 1
fi

# Check if namespace exists
if ! kubectl get namespace "${NAMESPACE}" &> /dev/null; then
    echo -e "${RED}Error: Namespace '${NAMESPACE}' does not exist${NC}"
    exit 1
fi

# Capture pod list
capture_output \
    "Pod list (wide format)" \
    "kubectl get pods -n ${NAMESPACE} -o wide" \
    "pods-wide.txt"

# Get list of pods
PODS=$(kubectl get pods -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")

if [ -z "${PODS}" ]; then
    echo -e "${YELLOW}No pods found in namespace ${NAMESPACE}${NC}"
    echo ""
else
    # Capture describe and logs for each pod
    for POD in ${PODS}; do
        echo -e "${YELLOW}Processing pod: ${POD}${NC}"

        # Capture pod description
        capture_output \
            "Pod description: ${POD}" \
            "kubectl describe pod ${POD} -n ${NAMESPACE}" \
            "pod-${POD}-describe.txt"

        # Capture pod logs
        capture_output \
            "Pod logs: ${POD} (last ${LOG_TAIL} lines)" \
            "kubectl logs ${POD} -n ${NAMESPACE} --tail=${LOG_TAIL}" \
            "pod-${POD}-logs.txt"

        # Capture previous logs if pod has restarted
        if kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].restartCount}' 2>/dev/null | grep -q '[1-9]'; then
            capture_output \
                "Previous pod logs: ${POD} (last ${LOG_TAIL} lines)" \
                "kubectl logs ${POD} -n ${NAMESPACE} --previous --tail=${LOG_TAIL}" \
                "pod-${POD}-logs-previous.txt"
        fi
    done
fi

# Capture events (tail -n is the portable form of the obsolete `tail -N`)
capture_output \
    "Kubernetes events (last ${EVENTS_TAIL} events)" \
    "kubectl get events -n ${NAMESPACE} --sort-by=.lastTimestamp | tail -n ${EVENTS_TAIL}" \
    "events.txt"

# Capture ingress status
capture_output \
    "Ingress resources" \
    "kubectl get ingress -n ${NAMESPACE} -o wide" \
    "ingress-list.txt"

# Capture detailed ingress information
INGRESS_RESOURCES=$(kubectl get ingress -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [ -n "${INGRESS_RESOURCES}" ]; then
    for INGRESS in ${INGRESS_RESOURCES}; do
        capture_output \
            "Ingress details: ${INGRESS}" \
            "kubectl describe ingress ${INGRESS} -n ${NAMESPACE}" \
            "ingress-${INGRESS}-describe.txt"

        capture_output \
            "Ingress YAML: ${INGRESS}" \
            "kubectl get ingress ${INGRESS} -n ${NAMESPACE} -o yaml" \
            "ingress-${INGRESS}.yaml"
    done
fi

# Capture service status
capture_output \
    "Service list" \
    "kubectl get svc -n ${NAMESPACE} -o wide" \
    "services-list.txt"

# Capture endpoints
capture_output \
    "Endpoints" \
    "kubectl get endpoints -n ${NAMESPACE}" \
    "endpoints.txt"

# Capture service details
SERVICES=$(kubectl get svc -n "${NAMESPACE}" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [ -n "${SERVICES}" ]; then
    for SVC in ${SERVICES}; do
        capture_output \
            "Service details: ${SVC}" \
            "kubectl describe svc ${SVC} -n ${NAMESPACE}" \
            "svc-${SVC}-describe.txt"
    done
fi

# Capture node information
capture_output \
    "Node list" \
    "kubectl get nodes -o wide" \
    "nodes-wide.txt"

# Capture resource usage (if metrics-server is available)
if kubectl top nodes &> /dev/null; then
    capture_output \
        "Node resource usage" \
        "kubectl top nodes" \
        "nodes-top.txt"

    if [ -n "${PODS}" ]; then
        capture_output \
            "Pod resource usage" \
            "kubectl top pods -n ${NAMESPACE}" \
            "pods-top.txt"
    fi
else
    echo -e "${YELLOW}Metrics Server not available, skipping resource usage metrics${NC}"
    echo ""
fi

# Capture PVC information
capture_output \
    "Persistent Volume Claims" \
    "kubectl get pvc -n ${NAMESPACE}" \
    "pvc-list.txt"

# Capture StatefulSets and Deployments
capture_output \
    "StatefulSets" \
    "kubectl get statefulsets -n ${NAMESPACE} -o wide" \
    "statefulsets.txt"

capture_output \
    "Deployments" \
    "kubectl get deployments -n ${NAMESPACE} -o wide" \
    "deployments.txt"

# AWS-specific: ALB target group health (if AWS CLI is available and configured)
if command -v aws &> /dev/null; then
    echo -e "${YELLOW}Attempting to capture ALB target group health information...${NC}"

    # Try to get ALB information from ingress annotations
    if [ -n "${INGRESS_RESOURCES}" ]; then
        for INGRESS in ${INGRESS_RESOURCES}; do
            ALB_ARN=$(kubectl get ingress "${INGRESS}" -n "${NAMESPACE}" -o jsonpath='{.metadata.annotations.alb\.ingress\.kubernetes\.io/load-balancer-id}' 2>/dev/null || echo "")

            # Fallback, assuming the annotation above may be absent: resolve the
            # ARN from the ALB DNS name published in the ingress status.
            if [ -z "${ALB_ARN}" ]; then
                ALB_HOST=$(kubectl get ingress "${INGRESS}" -n "${NAMESPACE}" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null || echo "")
                if [ -n "${ALB_HOST}" ]; then
                    ALB_ARN=$(aws elbv2 describe-load-balancers --region "${AWS_REGION:-us-west-2}" --query "LoadBalancers[?DNSName=='${ALB_HOST}'].LoadBalancerArn" --output text 2>/dev/null || echo "")
                fi
            fi

            if [ -n "${ALB_ARN}" ]; then
                # Record the ALB ARN alongside the other captures
                echo "ALB ARN: ${ALB_ARN}" > "${OUTPUT_PATH}/alb-${INGRESS}-info.txt"

                # Get target groups for this ALB
                if aws elbv2 describe-target-groups --load-balancer-arn "${ALB_ARN}" --region "${AWS_REGION:-us-west-2}" &> /dev/null; then
                    capture_output \
                        "ALB target groups: ${INGRESS}" \
                        "aws elbv2 describe-target-groups --load-balancer-arn ${ALB_ARN} --region ${AWS_REGION:-us-west-2}" \
                        "alb-${INGRESS}-target-groups.json"

                    # Get target health for each target group
                    TARGET_GROUPS=$(aws elbv2 describe-target-groups --load-balancer-arn "${ALB_ARN}" --region "${AWS_REGION:-us-west-2}" --query 'TargetGroups[*].TargetGroupArn' --output text 2>/dev/null || echo "")
                    if [ -n "${TARGET_GROUPS}" ]; then
                        for TG_ARN in ${TARGET_GROUPS}; do
                            capture_output \
                                "Target group health: ${TG_ARN}" \
                                "aws elbv2 describe-target-health --target-group-arn ${TG_ARN} --region ${AWS_REGION:-us-west-2}" \
                                "alb-${INGRESS}-target-health-$(basename "${TG_ARN}").json"
                        done
                    fi
                fi
            fi
        done
    fi

    echo ""
else
    echo -e "${YELLOW}AWS CLI not available, skipping ALB target group health capture${NC}"
    echo -e "${YELLOW}To capture ALB information, install AWS CLI and configure credentials${NC}"
    echo ""
fi

# Create summary file
SUMMARY_FILE="${OUTPUT_PATH}/summary.txt"
{
    echo "LangSmith Self-Hosted Diagnostics Summary"
    echo "========================================"
    echo "Timestamp: ${TIMESTAMP}"
    echo "Namespace: ${NAMESPACE}"
    echo "Output Directory: ${OUTPUT_PATH}"
    echo ""
    echo "Configuration:"
    echo " LOG_TAIL: ${LOG_TAIL}"
    echo " EVENTS_TAIL: ${EVENTS_TAIL}"
    echo ""
    echo "Captured Information:"
    echo " - Pod list and descriptions"
    echo " - Pod logs (current and previous if restarted)"
    echo " - Kubernetes events"
    echo " - Ingress resources and details"
    echo " - Services and endpoints"
    echo " - Node information"
    echo " - Resource usage (if metrics-server available)"
    echo " - Persistent Volume Claims"
    echo " - StatefulSets and Deployments"
    if command -v aws &> /dev/null; then
        echo " - ALB target group health (if available)"
    fi
    echo ""
    echo "Files captured:"
    # Group the -name tests so that -type f applies to all three patterns
    find "${OUTPUT_PATH}" -type f \( -name "*.txt" -o -name "*.yaml" -o -name "*.json" \) | sort | sed 's|^| |'
} > "${SUMMARY_FILE}"

echo -e "${GREEN}✓ Diagnostics capture complete!${NC}"
echo -e "${GREEN}Summary saved to: ${SUMMARY_FILE}${NC}"
echo ""
echo "To view the summary:"
echo " cat ${SUMMARY_FILE}"
echo ""
echo "All diagnostic files are in: ${OUTPUT_PATH}"

Executable
+514
@@ -0,0 +1,514 @@
#!/usr/bin/env bash

# LangSmith AWS preflight check: validates AWS permissions and prerequisites
# before deployment.

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Deny pattern regex (broadened to catch common AWS denial patterns)
DENY_RE='AccessDenied|AccessDeniedException|UnauthorizedOperation|not authorized|NotAuthorized|is not authorized'

# Parse command line arguments
SKIP_RESOURCE_TESTS=false
NON_INTERACTIVE=false
CREATE_TEST_RESOURCES=false
ACM_DOMAIN=""

while [[ $# -gt 0 ]]; do
    case $1 in
        -s|--skip_resource_tests|--skip_checks)
            SKIP_RESOURCE_TESTS=true
            shift
            ;;
        -y|--yes)
            NON_INTERACTIVE=true
            shift
            ;;
        --create-test-resources)
            CREATE_TEST_RESOURCES=true
            shift
            ;;
        --domain)
            # Fail with a clear message if the value is missing (set -u would
            # otherwise abort with a bare "unbound variable" error)
            ACM_DOMAIN="${2:?--domain requires a value}"
            shift 2
            ;;
        *)
            printf "Unknown option: %s\n" "$1"
            printf "Usage: %s [-s|--skip_resource_tests] [-y|--yes] [--create-test-resources] [--domain <domain>]\n" "$0"
            exit 1
            ;;
    esac
done

# Check for CI environment
if [ "${CI:-false}" = "true" ]; then
    NON_INTERACTIVE=true
fi

# Function to print colored output
info() {
    printf "${BLUE}[INFO]${NC} %s\n" "$1"
}

success() {
    printf "${GREEN}[SUCCESS]${NC} %s\n" "$1"
}

warning() {
    printf "${YELLOW}[WARNING]${NC} %s\n" "$1"
}

error() {
    printf "${RED}[ERROR]${NC} %s\n" "$1"
}

# Function to check for access denied patterns
check_denied() {
    local output="$1"
    if echo "$output" | grep -Eqi "$DENY_RE"; then
        return 0 # Access denied found
    fi
    return 1 # No access denied
}

# Check if AWS CLI is installed
if ! command -v aws &> /dev/null; then
    error "AWS CLI is not installed. Please install it first."
    exit 1
fi

# Safety banner
printf "\n"
info "=== LangSmith AWS Preflight Check ==="
info "Default mode: READ-ONLY (no resource creation)"
info "Use --create-test-resources to test resource creation"
info "No modifications to existing resources will be made."
info "Temporary test resources may be created only with --create-test-resources."
printf "\n"

# Check AWS credentials
info "Checking AWS credentials..."
if ! aws sts get-caller-identity &> /dev/null; then
    error "Not logged into AWS. Please run 'aws configure' or set AWS credentials."
    exit 1
fi

# Get AWS account info
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
USER_ARN=$(aws sts get-caller-identity --query Arn --output text)

# Region resolution: environment variables take precedence over the CLI
# profile setting, so an exported region is honored; us-west-2 is the fallback
REGION=${AWS_REGION:-${AWS_DEFAULT_REGION:-}}
if [ -z "$REGION" ]; then
    REGION=$(aws configure get region 2>/dev/null || true)
fi
REGION=${REGION:-us-west-2}

info "AWS Account ID: $ACCOUNT_ID"
info "User ARN: $USER_ARN"
info "Current region: $REGION"

# Confirm region (non-interactive mode skips this)
if [ "$NON_INTERACTIVE" = false ]; then
    printf "\n"
    read -p "Is the region '$REGION' correct? (y/n): " -n 1 -r
    printf "\n"
    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
        error "Please set the correct region using 'aws configure set region <region>' or export AWS_DEFAULT_REGION"
        exit 1
    fi
else
    info "Non-interactive mode: using region '$REGION'"
fi

# Check for sandbox account indicators (warning only, no prompt)
info "Checking for sandbox account restrictions..."
if [[ "$ACCOUNT_ID" =~ ^[0-9]{12}$ ]]; then
    # Check if account has restrictions (common sandbox patterns)
    ALIASES=$(aws iam list-account-aliases --query 'AccountAliases' --output text 2>/dev/null || echo "")
    if echo "$ALIASES" | grep -qi "sandbox\|test\|dev"; then
        warning "Account alias suggests this might be a sandbox/test account: $ALIASES"
        warning "Please verify this account is not restricted by SCPs or other policies"
    fi
else
    warning "Account ID format is unusual. Please verify this is not a restricted account."
fi

# Function to cleanup resources on exit (with retry logic)
cleanup() {
    info "Cleaning up test resources..."

    # Cleanup in correct order: SG -> Subnet -> VPC -> IAM role
    # Retry logic for eventual consistency

    # Delete security group (with retry)
    if [ -n "${TEST_SG_ID:-}" ]; then
        for i in {1..3}; do
            if aws ec2 delete-security-group --group-id "$TEST_SG_ID" --region "$REGION" 2>/dev/null; then
                success "Security group deleted: $TEST_SG_ID"
                break
            fi
            if [ $i -lt 3 ]; then
                sleep 2
            fi
        done
    fi

    # Delete subnet (with retry)
    if [ -n "${TEST_SUBNET_ID:-}" ]; then
        for i in {1..3}; do
            if aws ec2 delete-subnet --subnet-id "$TEST_SUBNET_ID" --region "$REGION" 2>/dev/null; then
                success "Subnet deleted: $TEST_SUBNET_ID"
                break
            fi
            if [ $i -lt 3 ]; then
                sleep 2
            fi
        done
    fi

    # Delete VPC (with retry)
    if [ -n "${TEST_VPC_ID:-}" ]; then
        for i in {1..3}; do
            if aws ec2 delete-vpc --vpc-id "$TEST_VPC_ID" --region "$REGION" 2>/dev/null; then
                success "VPC deleted: $TEST_VPC_ID"
                break
            fi
            if [ $i -lt 3 ]; then
                sleep 2
            fi
        done
    fi

    # Delete IAM role (IAM is global, no --region, with retry)
    # NOTE: If you attach policies to the role, you must detach them before deleting
    if [ -n "${TEST_ROLE_NAME:-}" ]; then
        for i in {1..3}; do
            if aws iam delete-role --role-name "$TEST_ROLE_NAME" 2>/dev/null; then
                success "IAM role deleted: $TEST_ROLE_NAME"
                break
            fi
            if [ $i -lt 3 ]; then
                sleep 2
            fi
        done
    fi
}

# Read-only permission checks (always run)
info "Running read-only permission checks..."

# Test EC2 permissions (needed for EKS and ALB controller)
info "Testing EC2 permissions (VPC, subnets, availability zones)..."
EC2_VPC_OUTPUT=$(aws ec2 describe-vpcs --region "$REGION" --max-items 1 2>&1 || true)
EC2_SUBNET_OUTPUT=$(aws ec2 describe-subnets --region "$REGION" --max-items 1 2>&1 || true)
EC2_AZ_OUTPUT=$(aws ec2 describe-availability-zones --region "$REGION" 2>&1 || true)

if check_denied "$EC2_VPC_OUTPUT" || check_denied "$EC2_SUBNET_OUTPUT" || check_denied "$EC2_AZ_OUTPUT"; then
    error "Failed EC2 permission check. Check IAM permissions for ec2:DescribeVpcs, ec2:DescribeSubnets, ec2:DescribeAvailabilityZones"
    exit 1
fi
success "EC2 permissions verified"

# Test EKS permissions by describing a cluster that does not exist:
# ResourceNotFoundException means the call itself was authorized
info "Testing EKS permissions..."
EKS_OUTPUT=$(aws eks describe-cluster --name "preflight-nonexistent-$(date +%s)" --region "$REGION" 2>&1 || true)
if check_denied "$EKS_OUTPUT"; then
    error "Failed EKS permission check. Check IAM permissions for eks:*"
    exit 1
elif echo "$EKS_OUTPUT" | grep -q "ResourceNotFoundException"; then
    success "EKS permissions verified"
else
    # Try list-clusters as alternative check
    EKS_LIST_OUTPUT=$(aws eks list-clusters --region "$REGION" 2>&1 || true)
    if check_denied "$EKS_LIST_OUTPUT"; then
        error "Failed EKS permission check. Check IAM permissions for eks:*"
        exit 1
    elif [ -n "$EKS_LIST_OUTPUT" ]; then
        success "EKS permissions verified"
    else
        warning "EKS permission check inconclusive, but continuing..."
    fi
fi

# Add warning about EKS prerequisites
warning "Note: Passing EKS checks does not guarantee you can create EKS/nodegroups."
warning "Common failures occur at iam:PassRole, ec2:* permissions, and service quotas."

# Test IAM permissions (read-only check, needed for PassRole)
info "Testing IAM permissions..."
IAM_OUTPUT=$(aws iam list-roles --max-items 1 2>&1 || true)
if check_denied "$IAM_OUTPUT"; then
    error "Failed IAM permission check. Check IAM permissions for iam:ListRoles (needed for iam:PassRole)"
    exit 1
else
    success "IAM permissions verified"
fi

# Test RDS permissions
info "Testing RDS permissions..."
RDS_OUTPUT=$(aws rds describe-db-instances --region "$REGION" 2>&1 || true)
if check_denied "$RDS_OUTPUT"; then
    error "Failed RDS permission check. Check IAM permissions for rds:*"
    exit 1
else
    success "RDS permissions verified"
fi

# Test ElastiCache permissions
info "Testing ElastiCache permissions..."
CACHE_OUTPUT=$(aws elasticache describe-cache-clusters --region "$REGION" 2>&1 || true)
if check_denied "$CACHE_OUTPUT"; then
    error "Failed ElastiCache permission check. Check IAM permissions for elasticache:*"
    exit 1
else
    success "ElastiCache permissions verified"
fi

# Test ALB/ELB permissions
info "Testing Application Load Balancer permissions..."
ALB_OUTPUT=$(aws elbv2 describe-load-balancers --region "$REGION" 2>&1 || true)
if check_denied "$ALB_OUTPUT"; then
    error "Failed ALB permission check. Check IAM permissions for elasticloadbalancing:*"
    exit 1
else
    success "ALB permissions verified"
fi

# Test ACM permissions (for TLS certificates)
info "Testing ACM (Certificate Manager) permissions..."
ACM_OUTPUT=$(aws acm list-certificates --region "$REGION" 2>&1 || true)
if check_denied "$ACM_OUTPUT"; then
    error "Failed ACM permission check. Check IAM permissions for acm:*"
    exit 1
else
    success "ACM permissions verified"
    if [ -z "$ACM_DOMAIN" ]; then
        warning "ACM check passed (does not confirm a cert exists for your chosen domain)"
    else
        # Check if certificate exists for the domain
        info "Checking for ACM certificate matching domain: $ACM_DOMAIN"

        # Derive the parent domain by dropping the first label
        # (e.g., "example.com" from "langsmith.example.com").
        # Limitation: this equals the zone apex only for single-level subdomains;
        # passing an apex such as "example.com" reduces it to "com".
        if echo "$ACM_DOMAIN" | grep -q '\.'; then
            ZONE_APEX=$(echo "$ACM_DOMAIN" | sed -E 's/^[^.]*\.(.+)$/\1/')
        else
            # Single-label name (no dot): use it as-is
            ZONE_APEX="$ACM_DOMAIN"
        fi

        # First try exact match
        CERT_ARN=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[?DomainName=='$ACM_DOMAIN'].CertificateArn" --output text 2>/dev/null || echo "")

        # If no exact match, try wildcard for zone apex (e.g., *.example.com)
        if [ -z "$CERT_ARN" ] && [ "$ZONE_APEX" != "$ACM_DOMAIN" ]; then
            WILDCARD_DOMAIN="*.$ZONE_APEX"
            CERT_ARN=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[?DomainName=='$WILDCARD_DOMAIN'].CertificateArn" --output text 2>/dev/null || echo "")
        fi

        # If still no match, check SANs by describing each cert (limited check)
        if [ -z "$CERT_ARN" ]; then
            ALL_CERTS=$(aws acm list-certificates --region "$REGION" --query "CertificateSummaryList[*].CertificateArn" --output text 2>/dev/null || echo "")
            for cert_arn in $ALL_CERTS; do
                CERT_DETAILS=$(aws acm describe-certificate --certificate-arn "$cert_arn" --region "$REGION" --query "Certificate.{Domain:DomainName,SANs:SubjectAlternativeNames}" --output json 2>/dev/null || echo "{}")
                # grep -F matches literally, so "*" and "." in the domain are not treated as regex
                if echo "$CERT_DETAILS" | grep -qF "\"$ACM_DOMAIN\"" || echo "$CERT_DETAILS" | grep -qF "\"*.$ZONE_APEX\""; then
                    CERT_ARN="$cert_arn"
                    break
                fi
            done
        fi

        if [ -n "$CERT_ARN" ]; then
            success "Found ACM certificate for domain: $CERT_ARN"
        else
            warning "No ACM certificate found for domain '$ACM_DOMAIN' in region '$REGION'"
            warning "You may need to request a certificate before deploying"
        fi
    fi
fi

# Test Route53 permissions (for DNS/ingress) - Route53 is global, no --region
info "Testing Route53 permissions..."
R53_OUTPUT=$(aws route53 list-hosted-zones 2>&1 || true)
if check_denied "$R53_OUTPUT"; then
    error "Failed Route53 permission check. Check IAM permissions for route53:*"
    exit 1
else
    success "Route53 permissions verified"
    # Check if hosted zones exist
    ZONE_COUNT=$(aws route53 list-hosted-zones --query "HostedZones | length(@)" --output text 2>/dev/null || echo "0")
    if [ "$ZONE_COUNT" = "0" ] || [ -z "$ZONE_COUNT" ]; then
        warning "No Route53 hosted zones found."
        warning "If you intend to use Route53 for ingress, create/identify the hosted zone first."
    else
        info "Found $ZONE_COUNT Route53 hosted zone(s)"

        # If domain provided, check for matching hosted zone
        if [ -n "$ACM_DOMAIN" ]; then
            # Extract zone apex (same logic as ACM check)
            if echo "$ACM_DOMAIN" | grep -q '\.'; then
                ZONE_APEX=$(echo "$ACM_DOMAIN" | sed -E 's/^[^.]*\.(.+)$/\1/')
            else
                ZONE_APEX="$ACM_DOMAIN"
            fi
            # Route53 zone names end with a dot
            ZONE_NAME="${ZONE_APEX}."
            MATCHING_ZONE=$(aws route53 list-hosted-zones --query "HostedZones[?Name=='$ZONE_NAME'].Id" --output text 2>/dev/null || echo "")
            if [ -n "$MATCHING_ZONE" ]; then
                success "Found Route53 hosted zone for domain: $ZONE_NAME"
            else
                warning "No Route53 hosted zone found matching domain '$ACM_DOMAIN' (checked for zone: $ZONE_NAME)"
            fi
        fi
    fi
fi

# Test WAFv2 permissions (optional, for WAF support)
info "Testing WAFv2 permissions (optional)..."
WAF_OUTPUT=$(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" 2>&1 || true)
if check_denied "$WAF_OUTPUT"; then
    warning "WAFv2 permission check failed (optional, but recommended for production)"
else
    WAF_COUNT=$(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" --query "WebACLs | length(@)" --output text 2>/dev/null || echo "0")
    if [ "$WAF_COUNT" = "0" ] || [ -z "$WAF_COUNT" ]; then
        success "WAFv2 accessible (no web ACLs found)"
    else
        success "WAFv2 permissions verified (found $WAF_COUNT web ACL(s))"
    fi
fi

# Resource creation tests run only when --create-test-resources is set
# and --skip_resource_tests is not
if [ "$SKIP_RESOURCE_TESTS" = true ]; then
    info "Skipping resource creation tests (--skip_resource_tests flag provided)"
    success "Preflight checks complete (resource tests skipped)"
    exit 0
fi

if [ "$CREATE_TEST_RESOURCES" = false ]; then
    info "Skipping resource creation tests (use --create-test-resources to enable)"
    info "Read-only checks passed. You can proceed with deployment."
    success "Preflight checks complete!"
    exit 0
fi

# Set trap only when we're actually creating resources
trap cleanup EXIT

# Confirmation prompt before creating resources (unless --yes is set)
if [ "$NON_INTERACTIVE" = false ]; then
    printf "\n"
    warning "This will create temporary test resources:"
    warning " - VPC, Subnet, Security Group (isolated, will be deleted)"
    warning " - IAM Role (will be deleted)"
    warning " - No modifications to existing resources"
    printf "\n"
    read -p "Continue with resource creation tests? (y/n): " -n 1 -r
    printf "\n"
    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
        info "Resource creation tests cancelled by user"
        exit 0
    fi
fi

# Resource creation tests (only run if --create-test-resources is set)
info "Running resource creation tests (--create-test-resources mode)..."

# Generate a safer CIDR block (10.254.x.x range, avoid 0, less likely to conflict)
# Retry logic for VPC creation in case of CIDR conflicts
VPC_CREATED=false
for attempt in {1..3}; do
    RANDOM_SUFFIX=$(( (RANDOM % 250) + 1 )) # Range 1-250, avoids 0
    TEST_CIDR="10.254.${RANDOM_SUFFIX}.0/28"

    info "Attempting VPC creation with CIDR $TEST_CIDR (attempt $attempt/3)..."
    VPC_OUTPUT=$(aws ec2 create-vpc \
        --cidr-block "$TEST_CIDR" \
        --region "$REGION" \
        --query 'Vpc.VpcId' \
        --output text 2>&1) || {
        if echo "$VPC_OUTPUT" | grep -qi "InvalidVpc.Range\|overlap\|conflict"; then
            if [ $attempt -lt 3 ]; then
                warning "VPC creation failed (org policy or CIDR validation), trying different CIDR..."
                continue
            else
                error "Failed to create VPC after 3 attempts (org policy or CIDR validation). Check IAM permissions for ec2:CreateVpc"
                exit 1
            fi
        else
            error "Failed to create VPC. Check IAM permissions for ec2:CreateVpc"
            exit 1
        fi
    }

    TEST_VPC_ID="$VPC_OUTPUT"
    VPC_CREATED=true
    success "VPC created: $TEST_VPC_ID"
    break
done

if [ "$VPC_CREATED" = false ]; then
    error "Failed to create VPC after all retry attempts"
    exit 1
fi

# Test subnet creation (reuse VPC CIDR since it's a /28)
info "Testing subnet creation..."
AZ=$(aws ec2 describe-availability-zones --region "$REGION" --query 'AvailabilityZones[0].ZoneName' --output text)
TEST_SUBNET_ID=$(aws ec2 create-subnet \
    --vpc-id "$TEST_VPC_ID" \
    --cidr-block "$TEST_CIDR" \
    --availability-zone "$AZ" \
    --region "$REGION" \
    --query 'Subnet.SubnetId' \
    --output text 2>/dev/null) || {
    error "Failed to create subnet. Check IAM permissions for ec2:CreateSubnet"
    exit 1
}
success "Subnet created: $TEST_SUBNET_ID"

# Test security group creation
info "Testing security group creation..."
TEST_SG_ID=$(aws ec2 create-security-group \
    --group-name "preflight-test-sg-$(date +%s)" \
    --description "Preflight test security group" \
    --vpc-id "$TEST_VPC_ID" \
    --region "$REGION" \
    --query 'GroupId' \
    --output text 2>/dev/null) || {
    error "Failed to create security group. Check IAM permissions for ec2:CreateSecurityGroup"
    exit 1
}
success "Security group created: $TEST_SG_ID"

# Test IAM role creation (IAM is global, no --region)
info "Testing IAM role creation..."
TEST_ROLE_NAME="preflight-test-role-$(date +%s)"
TRUST_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}'
if aws iam create-role \
    --role-name "$TEST_ROLE_NAME" \
    --assume-role-policy-document "$TRUST_POLICY" \
    --output text > /dev/null 2>&1; then
    success "IAM role created: $TEST_ROLE_NAME"
else
    error "Failed to create IAM role. Check IAM permissions for iam:CreateRole"
    exit 1
fi

# Cleanup happens automatically via trap
info "All test resources will be cleaned up on exit..."

success "Preflight checks complete! All permissions verified."
info "You are ready to deploy LangSmith infrastructure."