Initial commit: LangSmith code evaluator skill

This commit is contained in:
sean
2026-02-12 19:12:27 -08:00
commit 5b46c12f62
3 changed files with 486 additions and 0 deletions
+52
View File
@@ -0,0 +1,52 @@
# LCA Skills
A collection of agent skills for LangChain Academy and LangSmith workflows.
## Available Skills
### 🔍 langsmith-code-eval
Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance.
**Use when:** Building custom evaluators to test agent behavior, tool usage, and response quality in LangSmith.
**Features:**
- 9-step collaborative workflow
- Automatic trace structure inspection
- Category-based evaluation patterns
- Complete code generation for evaluators and experiment runners
## Installation
Install all skills:
```bash
npx skills add langchain-ai/lca-skills
```
Install specific skill:
```bash
npx skills add langchain-ai/lca-skills/tree/main/skills/langsmith-code-eval
```
For Claude Code:
```bash
npx skills add langchain-ai/lca-skills -a claude-code
```
## Skills Included
| Skill | Description | Documentation |
|-------|-------------|---------------|
| `langsmith-code-eval` | Create LangSmith code evaluators | [SKILL.md](skills/langsmith-code-eval/SKILL.md) |
## Contributing
Have a skill to add? Open a pull request with your skill in the `skills/` directory.
## License
Apache 2.0
---
Built for [LangChain Academy](https://academy.langchain.com)
+175
View File
@@ -0,0 +1,175 @@
---
name: langsmith-code-eval
description: Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance through inspection, evaluation logic, and testing.
---
# LangSmith Code Evaluator Creation
Create code-based evaluators for LangSmith-traced agents through a 9-step collaborative process.
## Workflow
### Step 1: Locate the Agent
Ask: "Where is your agent file located?"
### Step 2: Understand the Agent
Read the agent file. Identify:
- Main entry point function
- Tools/functions it calls
- Return format (string? dict with messages?)
### Step 3: Check for Traces
Ask: "Do you have recent traces in LangSmith?"
- If yes: Get project name
- If no: Ask to run agent once to generate a trace
### Step 4: Inspect Trace Structure
Run: `python scripts/inspect_trace.py PROJECT_NAME`
This shows where data lives:
- Tool calls in `run.outputs["messages"]`?
- Tool calls in `run.child_runs`?
- What's in inputs/outputs?
Use the returned structure dict programmatically:
```python
from inspect_trace import inspect_trace_structure
structure = inspect_trace_structure("project-name")
if "extract_from_messages" in structure["recommendations"]:
# Tool calls are in run.outputs["messages"]
```
### Step 5: Clarify Evaluation Goals
Ask: "What behavior do you want to test for?"
- If stated: Confirm understanding
- If unclear: Ask clarifying questions
- Understand: Pass vs fail criteria? Different categories? Metadata?
### Step 6: Create the Evaluator
Write `eval_[name].py` using this signature:
```python
from langsmith.schemas import Run, Example
def evaluate_[name](run: Run, example: Example) -> dict:
"""Evaluate [specific behavior]."""
# Extract data (based on Step 4)
messages = run.outputs.get("messages", [])
category = example.metadata.get("category") if example.metadata else None
# Evaluation logic (based on Step 5)
# ...
return {
"key": "evaluator_name",
"score": 1 or 0, # 1 = pass, 0 = fail
"comment": "Specific feedback explaining the score"
}
```
**Extract tool calls from messages:**
```python
for msg in messages:
if msg.get("role") == "assistant" and msg.get("tool_calls"):
for tc in msg["tool_calls"]:
tool_name = tc["function"]["name"]
args = json.loads(tc["function"]["arguments"])
```
**Category-based evaluation:**
```python
category = example.metadata.get("category", "unknown")
if category == "stock":
score = 1 if made_db_call else 0
elif category == "weather":
score = 1 if not made_db_call else 0
```
### Step 7: Create/Update Experiment Runner
Check if `run_experiment_with_eval.py` exists. If not, create:
```python
import asyncio
from langsmith import aevaluate
from [agent_module] import [agent_function]
from eval_[name] import evaluate_[name]
from dotenv import load_dotenv
load_dotenv()
async def agent_wrapper(inputs: dict) -> dict:
result = await [agent_function](inputs["question"])
return result
async def main():
results = await aevaluate(
agent_wrapper,
data="DATASET_NAME",
evaluators=[evaluate_[name]],
experiment_prefix="eval-test",
max_concurrency=5,
)
print(f"Results: {results}")
if __name__ == "__main__":
asyncio.run(main())
```
### Step 8: Configure Dataset
Ask: "What's your dataset name?"
Ask: "Please update the dataset name in the experiment runner"
Wait for confirmation.
### Step 9: Run the Evaluation
Execute: `uv run python run_experiment_with_eval.py`
Show the LangSmith URL when printed.
## Key Patterns
**Extracting from messages** (most reliable):
```python
messages = run.outputs.get("messages", [])
for msg in messages:
if msg.get("role") == "assistant" and msg.get("tool_calls"):
# Tool calls are here
```
**Extracting from child_runs** (if messages not available):
```python
def traverse_runs(run):
if run.name == "tool_name":
# Found it
if hasattr(run, 'child_runs') and run.child_runs:
for child in run.child_runs:
traverse_runs(child)
```
**Using metadata:**
```python
category = example.metadata.get("category") if example.metadata else None
```
## Troubleshooting
**Can't find tool calls**: Re-run `inspect_trace.py` to see actual structure
**child_runs empty**: Agent should return messages in outputs
**Same score always**: Debug evaluation logic with print statements
**Dataset not found**: Verify name in LangSmith UI
## Reference
**Documentation:**
- [Code Evaluator SDK](https://docs.langchain.com/langsmith/code-evaluator-sdk) - Writing evaluators
- [Evaluate LLM Applications](https://docs.langchain.com/langsmith/evaluate-llm-application) - Running experiments
**Important:** Extensive LangSmith documentation exists. If unsure about trace structure, SDK usage, or evaluation patterns, **search documentation** rather than assuming.
**See parent project for complete example:**
- `agent_v4.py` - Returns messages in outputs
- `eval_tool_call_schema.py` - Tool call + schema discovery evaluator
- `run_experiment_with_code_eval.py` - Experiment runner
@@ -0,0 +1,259 @@
"""
Trace Structure Inspector for LangSmith
Use this to understand the structure of your agent's traces before building an evaluator.
"""
from langsmith import Client
from typing import Optional
import json
def inspect_trace_structure(
project_name: str,
run_id: Optional[str] = None,
show_sample_data: bool = True
) -> dict:
"""
Inspect the structure of a LangSmith trace to understand where data lives.
Args:
project_name: The LangSmith project name
run_id: Optional specific run ID to inspect. If None, fetches most recent.
show_sample_data: Whether to show sample values from the trace
Returns:
dict with structure information that can be used programmatically
"""
client = Client()
# Fetch the run
if run_id:
run = client.read_run(run_id)
else:
runs = list(client.list_runs(
project_name=project_name,
is_root=True,
limit=1
))
if not runs:
raise ValueError(f"No runs found in project '{project_name}'")
run = client.read_run(runs[0].id)
print("=" * 80)
print("TRACE STRUCTURE ANALYSIS")
print("=" * 80)
print(f"\nProject: {project_name}")
print(f"Run ID: {run.id}")
print(f"Run Name: {run.name}")
print(f"Run Type: {run.run_type}")
# Analyze structure
structure = {
"run_id": str(run.id),
"run_name": run.name,
"run_type": run.run_type,
"has_inputs": bool(run.inputs),
"has_outputs": bool(run.outputs),
"has_child_run_ids": bool(hasattr(run, 'child_run_ids') and run.child_run_ids),
"inputs": {},
"outputs": {},
"child_runs_info": [],
"metadata": run.metadata if hasattr(run, 'metadata') and run.metadata else None
}
# Analyze inputs
print("\n" + "=" * 80)
print("INPUTS")
print("=" * 80)
if run.inputs:
print(f"\nKeys in run.inputs: {list(run.inputs.keys())}")
structure["inputs"]["keys"] = list(run.inputs.keys())
for key, value in run.inputs.items():
value_type = type(value).__name__
structure["inputs"][key] = {"type": value_type}
if show_sample_data:
if isinstance(value, (str, int, float, bool)):
sample = str(value)[:100]
print(f" {key} ({value_type}): {sample}{'...' if len(str(value)) > 100 else ''}")
elif isinstance(value, list):
print(f" {key} ({value_type}): List with {len(value)} items")
if value and len(value) > 0:
print(f" First item type: {type(value[0]).__name__}")
structure["inputs"][key]["list_item_type"] = type(value[0]).__name__
elif isinstance(value, dict):
print(f" {key} ({value_type}): Dict with keys: {list(value.keys())}")
structure["inputs"][key]["dict_keys"] = list(value.keys())
else:
print(f" {key} ({value_type})")
else:
print("No inputs found")
# Analyze outputs
print("\n" + "=" * 80)
print("OUTPUTS")
print("=" * 80)
if run.outputs:
print(f"\nKeys in run.outputs: {list(run.outputs.keys())}")
structure["outputs"]["keys"] = list(run.outputs.keys())
for key, value in run.outputs.items():
value_type = type(value).__name__
structure["outputs"][key] = {"type": value_type}
if show_sample_data:
if isinstance(value, (str, int, float, bool)):
sample = str(value)[:100]
print(f" {key} ({value_type}): {sample}{'...' if len(str(value)) > 100 else ''}")
elif isinstance(value, list):
print(f" {key} ({value_type}): List with {len(value)} items")
if value and len(value) > 0:
print(f" First item type: {type(value[0]).__name__}")
structure["outputs"][key]["list_item_type"] = type(value[0]).__name__
# Special handling for messages array
if key == "messages" and isinstance(value[0], dict):
print(f" Looks like a messages array!")
print(f" Message roles found: {set(m.get('role') for m in value if isinstance(m, dict))}")
structure["outputs"][key]["is_messages_array"] = True
structure["outputs"][key]["message_roles"] = list(set(m.get('role') for m in value if isinstance(m, dict)))
# Check for tool calls in messages
has_tool_calls = any(
m.get('role') == 'assistant' and m.get('tool_calls')
for m in value if isinstance(m, dict)
)
if has_tool_calls:
print(f" ✓ Contains tool calls in assistant messages!")
structure["outputs"][key]["has_tool_calls"] = True
# Extract tool names
tool_names = set()
for m in value:
if isinstance(m, dict) and m.get('role') == 'assistant' and m.get('tool_calls'):
for tc in m.get('tool_calls', []):
if isinstance(tc, dict):
tool_names.add(tc.get('function', {}).get('name'))
print(f" Tools called: {tool_names}")
structure["outputs"][key]["tool_names"] = list(tool_names)
elif isinstance(value, dict):
print(f" {key} ({value_type}): Dict with keys: {list(value.keys())}")
structure["outputs"][key]["dict_keys"] = list(value.keys())
else:
print(f" {key} ({value_type})")
else:
print("No outputs found")
# Analyze child runs
print("\n" + "=" * 80)
print("CHILD RUNS")
print("=" * 80)
if hasattr(run, 'child_run_ids') and run.child_run_ids:
print(f"\n✓ Has {len(run.child_run_ids)} child run IDs")
structure["num_child_runs"] = len(run.child_run_ids)
# Fetch a few child runs to see structure
print("\nFetching child runs to inspect structure...")
for i, child_id in enumerate(run.child_run_ids[:3]): # Just first 3
child_run = client.read_run(child_id)
child_info = {
"name": child_run.name,
"type": child_run.run_type,
"has_inputs": bool(child_run.inputs),
"has_outputs": bool(child_run.outputs),
}
print(f"\n Child Run {i+1}:")
print(f" Name: {child_run.name}")
print(f" Type: {child_run.run_type}")
if child_run.inputs:
print(f" Input keys: {list(child_run.inputs.keys())}")
child_info["input_keys"] = list(child_run.inputs.keys())
# Show sample for tool calls
if "query" in child_run.inputs:
print(f" Query: {child_run.inputs['query'][:80]}...")
if child_run.outputs:
print(f" Output keys: {list(child_run.outputs.keys())}")
child_info["output_keys"] = list(child_run.outputs.keys())
structure["child_runs_info"].append(child_info)
if len(run.child_run_ids) > 3:
print(f"\n ... and {len(run.child_run_ids) - 3} more child runs")
else:
print("\n✗ No child run IDs found")
structure["num_child_runs"] = 0
# Metadata
if structure["metadata"]:
print("\n" + "=" * 80)
print("METADATA")
print("=" * 80)
print(f"\nMetadata keys: {list(structure['metadata'].keys())}")
# Summary and recommendations
print("\n" + "=" * 80)
print("RECOMMENDATIONS FOR EVALUATOR")
print("=" * 80)
recommendations = []
# Check if messages are in outputs
if (structure["outputs"].get("keys") and "messages" in structure["outputs"]["keys"] and
structure["outputs"].get("messages", {}).get("is_messages_array")):
print("\n✓ Agent returns messages in outputs")
print(" Recommendation: Extract tool calls from run.outputs['messages']")
print(" This is the most reliable approach.")
recommendations.append("extract_from_messages")
if structure["outputs"]["messages"].get("has_tool_calls"):
print(f"\n✓ Tool calls found in messages")
print(f" Tools: {structure['outputs']['messages'].get('tool_names')}")
else:
print("\n✗ Agent does not return messages in outputs")
if structure["num_child_runs"] > 0:
print(" Recommendation: Extract tool calls from run.child_runs")
print(" Note: This requires traversing the child run tree")
recommendations.append("extract_from_child_runs")
else:
print(" Warning: No obvious place to find tool calls")
print(" Consider updating agent to return messages in outputs")
structure["recommendations"] = recommendations
# Return structure for programmatic use
print("\n" + "=" * 80)
return structure
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("Usage: python inspect_trace.py <project_name> [run_id]")
print("\nExample:")
print(" python inspect_trace.py my-langsmith-project")
print(" python inspect_trace.py my-langsmith-project 019c546c-2ce6-7853-8ac5-939a88d7c4a4")
sys.exit(1)
project_name = sys.argv[1]
run_id = sys.argv[2] if len(sys.argv) > 2 else None
structure = inspect_trace_structure(project_name, run_id)
print("\n" + "=" * 80)
print("Structure data saved for programmatic use")
print("=" * 80)
print("\nYou can import this function and use the returned dict:")
print(" from inspect_trace import inspect_trace_structure")
print(" structure = inspect_trace_structure('your-project')")
print(" if 'extract_from_messages' in structure['recommendations']:")
print(" # Extract from run.outputs['messages']")