mirror of
https://github.com/langchain-ai/lca-skills.git
synced 2026-07-01 11:30:46 -04:00
Initial commit: LangSmith code evaluator skill
This commit is contained in:
@@ -0,0 +1,52 @@
|
||||
# LCA Skills
|
||||
|
||||
A collection of agent skills for LangChain Academy and LangSmith workflows.
|
||||
|
||||
## Available Skills
|
||||
|
||||
### 🔍 langsmith-code-eval
|
||||
|
||||
Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance.
|
||||
|
||||
**Use when:** Building custom evaluators to test agent behavior, tool usage, and response quality in LangSmith.
|
||||
|
||||
**Features:**
|
||||
- 9-step collaborative workflow
|
||||
- Automatic trace structure inspection
|
||||
- Category-based evaluation patterns
|
||||
- Complete code generation for evaluators and experiment runners
|
||||
|
||||
## Installation
|
||||
|
||||
Install all skills:
|
||||
```bash
|
||||
npx skills add langchain-ai/lca-skills
|
||||
```
|
||||
|
||||
Install specific skill:
|
||||
```bash
|
||||
npx skills add langchain-ai/lca-skills/tree/main/skills/langsmith-code-eval
|
||||
```
|
||||
|
||||
For Claude Code:
|
||||
```bash
|
||||
npx skills add langchain-ai/lca-skills -a claude-code
|
||||
```
|
||||
|
||||
## Skills Included
|
||||
|
||||
| Skill | Description | Documentation |
|
||||
|-------|-------------|---------------|
|
||||
| `langsmith-code-eval` | Create LangSmith code evaluators | [SKILL.md](skills/langsmith-code-eval/SKILL.md) |
|
||||
|
||||
## Contributing
|
||||
|
||||
Have a skill to add? Open a pull request with your skill in the `skills/` directory.
|
||||
|
||||
## License
|
||||
|
||||
Apache 2.0
|
||||
|
||||
---
|
||||
|
||||
Built for [LangChain Academy](https://academy.langchain.com)
|
||||
@@ -0,0 +1,175 @@
|
||||
---
|
||||
name: langsmith-code-eval
|
||||
description: Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance through inspection, evaluation logic, and testing.
|
||||
---
|
||||
|
||||
# LangSmith Code Evaluator Creation
|
||||
|
||||
Create code-based evaluators for LangSmith-traced agents through a 9-step collaborative process.
|
||||
|
||||
## Workflow
|
||||
|
||||
### Step 1: Locate the Agent
|
||||
Ask: "Where is your agent file located?"
|
||||
|
||||
### Step 2: Understand the Agent
|
||||
Read the agent file. Identify:
|
||||
- Main entry point function
|
||||
- Tools/functions it calls
|
||||
- Return format (string? dict with messages?)
|
||||
|
||||
### Step 3: Check for Traces
|
||||
Ask: "Do you have recent traces in LangSmith?"
|
||||
- If yes: Get project name
|
||||
- If no: Ask to run agent once to generate a trace
|
||||
|
||||
### Step 4: Inspect Trace Structure
|
||||
Run: `python scripts/inspect_trace.py PROJECT_NAME`
|
||||
|
||||
This shows where data lives:
|
||||
- Tool calls in `run.outputs["messages"]`?
|
||||
- Tool calls in `run.child_runs`?
|
||||
- What's in inputs/outputs?
|
||||
|
||||
Use the returned structure dict programmatically:
|
||||
```python
|
||||
from inspect_trace import inspect_trace_structure
|
||||
|
||||
structure = inspect_trace_structure("project-name")
|
||||
if "extract_from_messages" in structure["recommendations"]:
|
||||
# Tool calls are in run.outputs["messages"]
|
||||
```
|
||||
|
||||
### Step 5: Clarify Evaluation Goals
|
||||
Ask: "What behavior do you want to test for?"
|
||||
- If stated: Confirm understanding
|
||||
- If unclear: Ask clarifying questions
|
||||
- Understand: Pass vs fail criteria? Different categories? Metadata?
|
||||
|
||||
### Step 6: Create the Evaluator
|
||||
Write `eval_[name].py` using this signature:
|
||||
|
||||
```python
|
||||
from langsmith.schemas import Run, Example
|
||||
|
||||
def evaluate_[name](run: Run, example: Example) -> dict:
|
||||
"""Evaluate [specific behavior]."""
|
||||
|
||||
# Extract data (based on Step 4)
|
||||
messages = run.outputs.get("messages", [])
|
||||
category = example.metadata.get("category") if example.metadata else None
|
||||
|
||||
# Evaluation logic (based on Step 5)
|
||||
# ...
|
||||
|
||||
return {
|
||||
"key": "evaluator_name",
|
||||
"score": 1 or 0, # 1 = pass, 0 = fail
|
||||
"comment": "Specific feedback explaining the score"
|
||||
}
|
||||
```
|
||||
|
||||
**Extract tool calls from messages:**
|
||||
```python
|
||||
for msg in messages:
|
||||
if msg.get("role") == "assistant" and msg.get("tool_calls"):
|
||||
for tc in msg["tool_calls"]:
|
||||
tool_name = tc["function"]["name"]
|
||||
args = json.loads(tc["function"]["arguments"])
|
||||
```
|
||||
|
||||
**Category-based evaluation:**
|
||||
```python
|
||||
category = example.metadata.get("category", "unknown")
|
||||
if category == "stock":
|
||||
score = 1 if made_db_call else 0
|
||||
elif category == "weather":
|
||||
score = 1 if not made_db_call else 0
|
||||
```
|
||||
|
||||
### Step 7: Create/Update Experiment Runner
|
||||
Check if `run_experiment_with_eval.py` exists. If not, create:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from langsmith import aevaluate
|
||||
from [agent_module] import [agent_function]
|
||||
from eval_[name] import evaluate_[name]
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
async def agent_wrapper(inputs: dict) -> dict:
|
||||
result = await [agent_function](inputs["question"])
|
||||
return result
|
||||
|
||||
async def main():
|
||||
results = await aevaluate(
|
||||
agent_wrapper,
|
||||
data="DATASET_NAME",
|
||||
evaluators=[evaluate_[name]],
|
||||
experiment_prefix="eval-test",
|
||||
max_concurrency=5,
|
||||
)
|
||||
print(f"Results: {results}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Step 8: Configure Dataset
|
||||
Ask: "What's your dataset name?"
|
||||
Ask: "Please update the dataset name in the experiment runner"
|
||||
Wait for confirmation.
|
||||
|
||||
### Step 9: Run the Evaluation
|
||||
Execute: `uv run python run_experiment_with_eval.py`
|
||||
Show the LangSmith URL when printed.
|
||||
|
||||
## Key Patterns
|
||||
|
||||
**Extracting from messages** (most reliable):
|
||||
```python
|
||||
messages = run.outputs.get("messages", [])
|
||||
for msg in messages:
|
||||
if msg.get("role") == "assistant" and msg.get("tool_calls"):
|
||||
# Tool calls are here
|
||||
```
|
||||
|
||||
**Extracting from child_runs** (if messages not available):
|
||||
```python
|
||||
def traverse_runs(run):
|
||||
if run.name == "tool_name":
|
||||
# Found it
|
||||
if hasattr(run, 'child_runs') and run.child_runs:
|
||||
for child in run.child_runs:
|
||||
traverse_runs(child)
|
||||
```
|
||||
|
||||
**Using metadata:**
|
||||
```python
|
||||
category = example.metadata.get("category") if example.metadata else None
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Can't find tool calls**: Re-run `inspect_trace.py` to see actual structure
|
||||
|
||||
**child_runs empty**: Agent should return messages in outputs
|
||||
|
||||
**Same score always**: Debug evaluation logic with print statements
|
||||
|
||||
**Dataset not found**: Verify name in LangSmith UI
|
||||
|
||||
## Reference
|
||||
|
||||
**Documentation:**
|
||||
- [Code Evaluator SDK](https://docs.langchain.com/langsmith/code-evaluator-sdk) - Writing evaluators
|
||||
- [Evaluate LLM Applications](https://docs.langchain.com/langsmith/evaluate-llm-application) - Running experiments
|
||||
|
||||
**Important:** Extensive LangSmith documentation exists. If unsure about trace structure, SDK usage, or evaluation patterns, **search documentation** rather than assuming.
|
||||
|
||||
**See parent project for complete example:**
|
||||
- `agent_v4.py` - Returns messages in outputs
|
||||
- `eval_tool_call_schema.py` - Tool call + schema discovery evaluator
|
||||
- `run_experiment_with_code_eval.py` - Experiment runner
|
||||
@@ -0,0 +1,259 @@
|
||||
"""
|
||||
Trace Structure Inspector for LangSmith
|
||||
|
||||
Use this to understand the structure of your agent's traces before building an evaluator.
|
||||
"""
|
||||
|
||||
from langsmith import Client
|
||||
from typing import Optional
|
||||
import json
|
||||
|
||||
|
||||
def inspect_trace_structure(
|
||||
project_name: str,
|
||||
run_id: Optional[str] = None,
|
||||
show_sample_data: bool = True
|
||||
) -> dict:
|
||||
"""
|
||||
Inspect the structure of a LangSmith trace to understand where data lives.
|
||||
|
||||
Args:
|
||||
project_name: The LangSmith project name
|
||||
run_id: Optional specific run ID to inspect. If None, fetches most recent.
|
||||
show_sample_data: Whether to show sample values from the trace
|
||||
|
||||
Returns:
|
||||
dict with structure information that can be used programmatically
|
||||
"""
|
||||
client = Client()
|
||||
|
||||
# Fetch the run
|
||||
if run_id:
|
||||
run = client.read_run(run_id)
|
||||
else:
|
||||
runs = list(client.list_runs(
|
||||
project_name=project_name,
|
||||
is_root=True,
|
||||
limit=1
|
||||
))
|
||||
if not runs:
|
||||
raise ValueError(f"No runs found in project '{project_name}'")
|
||||
run = client.read_run(runs[0].id)
|
||||
|
||||
print("=" * 80)
|
||||
print("TRACE STRUCTURE ANALYSIS")
|
||||
print("=" * 80)
|
||||
print(f"\nProject: {project_name}")
|
||||
print(f"Run ID: {run.id}")
|
||||
print(f"Run Name: {run.name}")
|
||||
print(f"Run Type: {run.run_type}")
|
||||
|
||||
# Analyze structure
|
||||
structure = {
|
||||
"run_id": str(run.id),
|
||||
"run_name": run.name,
|
||||
"run_type": run.run_type,
|
||||
"has_inputs": bool(run.inputs),
|
||||
"has_outputs": bool(run.outputs),
|
||||
"has_child_run_ids": bool(hasattr(run, 'child_run_ids') and run.child_run_ids),
|
||||
"inputs": {},
|
||||
"outputs": {},
|
||||
"child_runs_info": [],
|
||||
"metadata": run.metadata if hasattr(run, 'metadata') and run.metadata else None
|
||||
}
|
||||
|
||||
# Analyze inputs
|
||||
print("\n" + "=" * 80)
|
||||
print("INPUTS")
|
||||
print("=" * 80)
|
||||
if run.inputs:
|
||||
print(f"\nKeys in run.inputs: {list(run.inputs.keys())}")
|
||||
structure["inputs"]["keys"] = list(run.inputs.keys())
|
||||
|
||||
for key, value in run.inputs.items():
|
||||
value_type = type(value).__name__
|
||||
structure["inputs"][key] = {"type": value_type}
|
||||
|
||||
if show_sample_data:
|
||||
if isinstance(value, (str, int, float, bool)):
|
||||
sample = str(value)[:100]
|
||||
print(f" {key} ({value_type}): {sample}{'...' if len(str(value)) > 100 else ''}")
|
||||
elif isinstance(value, list):
|
||||
print(f" {key} ({value_type}): List with {len(value)} items")
|
||||
if value and len(value) > 0:
|
||||
print(f" First item type: {type(value[0]).__name__}")
|
||||
structure["inputs"][key]["list_item_type"] = type(value[0]).__name__
|
||||
elif isinstance(value, dict):
|
||||
print(f" {key} ({value_type}): Dict with keys: {list(value.keys())}")
|
||||
structure["inputs"][key]["dict_keys"] = list(value.keys())
|
||||
else:
|
||||
print(f" {key} ({value_type})")
|
||||
else:
|
||||
print("No inputs found")
|
||||
|
||||
# Analyze outputs
|
||||
print("\n" + "=" * 80)
|
||||
print("OUTPUTS")
|
||||
print("=" * 80)
|
||||
if run.outputs:
|
||||
print(f"\nKeys in run.outputs: {list(run.outputs.keys())}")
|
||||
structure["outputs"]["keys"] = list(run.outputs.keys())
|
||||
|
||||
for key, value in run.outputs.items():
|
||||
value_type = type(value).__name__
|
||||
structure["outputs"][key] = {"type": value_type}
|
||||
|
||||
if show_sample_data:
|
||||
if isinstance(value, (str, int, float, bool)):
|
||||
sample = str(value)[:100]
|
||||
print(f" {key} ({value_type}): {sample}{'...' if len(str(value)) > 100 else ''}")
|
||||
elif isinstance(value, list):
|
||||
print(f" {key} ({value_type}): List with {len(value)} items")
|
||||
if value and len(value) > 0:
|
||||
print(f" First item type: {type(value[0]).__name__}")
|
||||
structure["outputs"][key]["list_item_type"] = type(value[0]).__name__
|
||||
|
||||
# Special handling for messages array
|
||||
if key == "messages" and isinstance(value[0], dict):
|
||||
print(f" Looks like a messages array!")
|
||||
print(f" Message roles found: {set(m.get('role') for m in value if isinstance(m, dict))}")
|
||||
structure["outputs"][key]["is_messages_array"] = True
|
||||
structure["outputs"][key]["message_roles"] = list(set(m.get('role') for m in value if isinstance(m, dict)))
|
||||
|
||||
# Check for tool calls in messages
|
||||
has_tool_calls = any(
|
||||
m.get('role') == 'assistant' and m.get('tool_calls')
|
||||
for m in value if isinstance(m, dict)
|
||||
)
|
||||
if has_tool_calls:
|
||||
print(f" ✓ Contains tool calls in assistant messages!")
|
||||
structure["outputs"][key]["has_tool_calls"] = True
|
||||
|
||||
# Extract tool names
|
||||
tool_names = set()
|
||||
for m in value:
|
||||
if isinstance(m, dict) and m.get('role') == 'assistant' and m.get('tool_calls'):
|
||||
for tc in m.get('tool_calls', []):
|
||||
if isinstance(tc, dict):
|
||||
tool_names.add(tc.get('function', {}).get('name'))
|
||||
print(f" Tools called: {tool_names}")
|
||||
structure["outputs"][key]["tool_names"] = list(tool_names)
|
||||
|
||||
elif isinstance(value, dict):
|
||||
print(f" {key} ({value_type}): Dict with keys: {list(value.keys())}")
|
||||
structure["outputs"][key]["dict_keys"] = list(value.keys())
|
||||
else:
|
||||
print(f" {key} ({value_type})")
|
||||
else:
|
||||
print("No outputs found")
|
||||
|
||||
# Analyze child runs
|
||||
print("\n" + "=" * 80)
|
||||
print("CHILD RUNS")
|
||||
print("=" * 80)
|
||||
|
||||
if hasattr(run, 'child_run_ids') and run.child_run_ids:
|
||||
print(f"\n✓ Has {len(run.child_run_ids)} child run IDs")
|
||||
structure["num_child_runs"] = len(run.child_run_ids)
|
||||
|
||||
# Fetch a few child runs to see structure
|
||||
print("\nFetching child runs to inspect structure...")
|
||||
for i, child_id in enumerate(run.child_run_ids[:3]): # Just first 3
|
||||
child_run = client.read_run(child_id)
|
||||
child_info = {
|
||||
"name": child_run.name,
|
||||
"type": child_run.run_type,
|
||||
"has_inputs": bool(child_run.inputs),
|
||||
"has_outputs": bool(child_run.outputs),
|
||||
}
|
||||
|
||||
print(f"\n Child Run {i+1}:")
|
||||
print(f" Name: {child_run.name}")
|
||||
print(f" Type: {child_run.run_type}")
|
||||
|
||||
if child_run.inputs:
|
||||
print(f" Input keys: {list(child_run.inputs.keys())}")
|
||||
child_info["input_keys"] = list(child_run.inputs.keys())
|
||||
|
||||
# Show sample for tool calls
|
||||
if "query" in child_run.inputs:
|
||||
print(f" Query: {child_run.inputs['query'][:80]}...")
|
||||
|
||||
if child_run.outputs:
|
||||
print(f" Output keys: {list(child_run.outputs.keys())}")
|
||||
child_info["output_keys"] = list(child_run.outputs.keys())
|
||||
|
||||
structure["child_runs_info"].append(child_info)
|
||||
|
||||
if len(run.child_run_ids) > 3:
|
||||
print(f"\n ... and {len(run.child_run_ids) - 3} more child runs")
|
||||
|
||||
else:
|
||||
print("\n✗ No child run IDs found")
|
||||
structure["num_child_runs"] = 0
|
||||
|
||||
# Metadata
|
||||
if structure["metadata"]:
|
||||
print("\n" + "=" * 80)
|
||||
print("METADATA")
|
||||
print("=" * 80)
|
||||
print(f"\nMetadata keys: {list(structure['metadata'].keys())}")
|
||||
|
||||
# Summary and recommendations
|
||||
print("\n" + "=" * 80)
|
||||
print("RECOMMENDATIONS FOR EVALUATOR")
|
||||
print("=" * 80)
|
||||
|
||||
recommendations = []
|
||||
|
||||
# Check if messages are in outputs
|
||||
if (structure["outputs"].get("keys") and "messages" in structure["outputs"]["keys"] and
|
||||
structure["outputs"].get("messages", {}).get("is_messages_array")):
|
||||
print("\n✓ Agent returns messages in outputs")
|
||||
print(" Recommendation: Extract tool calls from run.outputs['messages']")
|
||||
print(" This is the most reliable approach.")
|
||||
recommendations.append("extract_from_messages")
|
||||
|
||||
if structure["outputs"]["messages"].get("has_tool_calls"):
|
||||
print(f"\n✓ Tool calls found in messages")
|
||||
print(f" Tools: {structure['outputs']['messages'].get('tool_names')}")
|
||||
else:
|
||||
print("\n✗ Agent does not return messages in outputs")
|
||||
if structure["num_child_runs"] > 0:
|
||||
print(" Recommendation: Extract tool calls from run.child_runs")
|
||||
print(" Note: This requires traversing the child run tree")
|
||||
recommendations.append("extract_from_child_runs")
|
||||
else:
|
||||
print(" Warning: No obvious place to find tool calls")
|
||||
print(" Consider updating agent to return messages in outputs")
|
||||
|
||||
structure["recommendations"] = recommendations
|
||||
|
||||
# Return structure for programmatic use
|
||||
print("\n" + "=" * 80)
|
||||
return structure
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
|
||||
if len(sys.argv) < 2:
|
||||
print("Usage: python inspect_trace.py <project_name> [run_id]")
|
||||
print("\nExample:")
|
||||
print(" python inspect_trace.py my-langsmith-project")
|
||||
print(" python inspect_trace.py my-langsmith-project 019c546c-2ce6-7853-8ac5-939a88d7c4a4")
|
||||
sys.exit(1)
|
||||
|
||||
project_name = sys.argv[1]
|
||||
run_id = sys.argv[2] if len(sys.argv) > 2 else None
|
||||
|
||||
structure = inspect_trace_structure(project_name, run_id)
|
||||
|
||||
print("\n" + "=" * 80)
|
||||
print("Structure data saved for programmatic use")
|
||||
print("=" * 80)
|
||||
print("\nYou can import this function and use the returned dict:")
|
||||
print(" from inspect_trace import inspect_trace_structure")
|
||||
print(" structure = inspect_trace_structure('your-project')")
|
||||
print(" if 'extract_from_messages' in structure['recommendations']:")
|
||||
print(" # Extract from run.outputs['messages']")
|
||||
Reference in New Issue
Block a user