Initial commit: LangSmith code evaluator skill

2026-07-01 11:30:46 -04:00 · 2026-02-12 19:12:27 -08:00
commit 5b46c12f62
3 changed files with 486 additions and 0 deletions
@@ -0,0 +1,52 @@
+# LCA Skills
+
+A collection of agent skills for LangChain Academy and LangSmith workflows.
+
+## Available Skills
+
+### 🔍 langsmith-code-eval
+
+Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance.
+
+**Use when:** Building custom evaluators to test agent behavior, tool usage, and response quality in LangSmith.
+
+**Features:**
+- 9-step collaborative workflow
+- Automatic trace structure inspection
+- Category-based evaluation patterns
+- Complete code generation for evaluators and experiment runners
+
+## Installation
+
+Install all skills:
+```bash
+npx skills add langchain-ai/lca-skills
+```
+
+Install specific skill:
+```bash
+npx skills add langchain-ai/lca-skills/tree/main/skills/langsmith-code-eval
+```
+
+For Claude Code:
+```bash
+npx skills add langchain-ai/lca-skills -a claude-code
+```
+
+## Skills Included
+
+| Skill | Description | Documentation |
+|-------|-------------|---------------|
+| `langsmith-code-eval` | Create LangSmith code evaluators | [SKILL.md](skills/langsmith-code-eval/SKILL.md) |
+
+## Contributing
+
+Have a skill to add? Open a pull request with your skill in the `skills/` directory.
+
+## License
+
+Apache 2.0
+
+---
+
+Built for [LangChain Academy](https://academy.langchain.com)
@@ -0,0 +1,175 @@
+---
+name: langsmith-code-eval
+description: Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance through inspection, evaluation logic, and testing.
+---
+
+# LangSmith Code Evaluator Creation
+
+Create code-based evaluators for LangSmith-traced agents through a 9-step collaborative process.
+
+## Workflow
+
+### Step 1: Locate the Agent
+Ask: "Where is your agent file located?"
+
+### Step 2: Understand the Agent
+Read the agent file. Identify:
+- Main entry point function
+- Tools/functions it calls
+- Return format (string? dict with messages?)
+
+### Step 3: Check for Traces
+Ask: "Do you have recent traces in LangSmith?"
+- If yes: Get project name
+- If no: Ask to run agent once to generate a trace
+
+### Step 4: Inspect Trace Structure
+Run: `python scripts/inspect_trace.py PROJECT_NAME`
+
+This shows where data lives:
+- Tool calls in `run.outputs["messages"]`?
+- Tool calls in `run.child_runs`?
+- What's in inputs/outputs?
+
+Use the returned structure dict programmatically:
+```python
+from inspect_trace import inspect_trace_structure
+
+structure = inspect_trace_structure("project-name")
+if "extract_from_messages" in structure["recommendations"]:
+    # Tool calls are in run.outputs["messages"]
+```
+
+### Step 5: Clarify Evaluation Goals
+Ask: "What behavior do you want to test for?"
+- If stated: Confirm understanding
+- If unclear: Ask clarifying questions
+- Understand: Pass vs fail criteria? Different categories? Metadata?
+
+### Step 6: Create the Evaluator
+Write `eval_[name].py` using this signature:
+
+```python
+from langsmith.schemas import Run, Example
+
+def evaluate_[name](run: Run, example: Example) -> dict:
+    """Evaluate [specific behavior]."""
+
+    # Extract data (based on Step 4)
+    messages = run.outputs.get("messages", [])
+    category = example.metadata.get("category") if example.metadata else None
+
+    # Evaluation logic (based on Step 5)
+    # ...
+
+    return {
+        "key": "evaluator_name",
+        "score": 1 or 0,  # 1 = pass, 0 = fail
+        "comment": "Specific feedback explaining the score"
+    }
+```
+
+**Extract tool calls from messages:**
+```python
+for msg in messages:
+    if msg.get("role") == "assistant" and msg.get("tool_calls"):
+        for tc in msg["tool_calls"]:
+            tool_name = tc["function"]["name"]
+            args = json.loads(tc["function"]["arguments"])
+```
+
+**Category-based evaluation:**
+```python
+category = example.metadata.get("category", "unknown")
+if category == "stock":
+    score = 1 if made_db_call else 0
+elif category == "weather":
+    score = 1 if not made_db_call else 0
+```
+
+### Step 7: Create/Update Experiment Runner
+Check if `run_experiment_with_eval.py` exists. If not, create:
+
+```python
+import asyncio
+from langsmith import aevaluate
+from [agent_module] import [agent_function]
+from eval_[name] import evaluate_[name]
+from dotenv import load_dotenv
+
+load_dotenv()
+
+async def agent_wrapper(inputs: dict) -> dict:
+    result = await [agent_function](inputs["question"])
+    return result
+
+async def main():
+    results = await aevaluate(
+        agent_wrapper,
+        data="DATASET_NAME",
+        evaluators=[evaluate_[name]],
+        experiment_prefix="eval-test",
+        max_concurrency=5,
+    )
+    print(f"Results: {results}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Step 8: Configure Dataset
+Ask: "What's your dataset name?"
+Ask: "Please update the dataset name in the experiment runner"
+Wait for confirmation.
+
+### Step 9: Run the Evaluation
+Execute: `uv run python run_experiment_with_eval.py`
+Show the LangSmith URL when printed.
+
+## Key Patterns
+
+**Extracting from messages** (most reliable):
+```python
+messages = run.outputs.get("messages", [])
+for msg in messages:
+    if msg.get("role") == "assistant" and msg.get("tool_calls"):
+        # Tool calls are here
+```
+
+**Extracting from child_runs** (if messages not available):
+```python
+def traverse_runs(run):
+    if run.name == "tool_name":
+        # Found it
+    if hasattr(run, 'child_runs') and run.child_runs:
+        for child in run.child_runs:
+            traverse_runs(child)
+```
+
+**Using metadata:**
+```python
+category = example.metadata.get("category") if example.metadata else None
+```
+
+## Troubleshooting
+
+**Can't find tool calls**: Re-run `inspect_trace.py` to see actual structure
+
+**child_runs empty**: Agent should return messages in outputs
+
+**Same score always**: Debug evaluation logic with print statements
+
+**Dataset not found**: Verify name in LangSmith UI
+
+## Reference
+
+**Documentation:**
+- [Code Evaluator SDK](https://docs.langchain.com/langsmith/code-evaluator-sdk) - Writing evaluators
+- [Evaluate LLM Applications](https://docs.langchain.com/langsmith/evaluate-llm-application) - Running experiments
+
+**Important:** Extensive LangSmith documentation exists. If unsure about trace structure, SDK usage, or evaluation patterns, **search documentation** rather than assuming.
+
+**See parent project for complete example:**
+- `agent_v4.py` - Returns messages in outputs
+- `eval_tool_call_schema.py` - Tool call + schema discovery evaluator
+- `run_experiment_with_code_eval.py` - Experiment runner
@@ -0,0 +1,259 @@
+"""
+Trace Structure Inspector for LangSmith
+
+Use this to understand the structure of your agent's traces before building an evaluator.
+"""
+
+from langsmith import Client
+from typing import Optional
+import json
+
+
+def inspect_trace_structure(
+    project_name: str,
+    run_id: Optional[str] = None,
+    show_sample_data: bool = True
+) -> dict:
+    """
+    Inspect the structure of a LangSmith trace to understand where data lives.
+
+    Args:
+        project_name: The LangSmith project name
+        run_id: Optional specific run ID to inspect. If None, fetches most recent.
+        show_sample_data: Whether to show sample values from the trace
+
+    Returns:
+        dict with structure information that can be used programmatically
+    """
+    client = Client()
+
+    # Fetch the run
+    if run_id:
+        run = client.read_run(run_id)
+    else:
+        runs = list(client.list_runs(
+            project_name=project_name,
+            is_root=True,
+            limit=1
+        ))
+        if not runs:
+            raise ValueError(f"No runs found in project '{project_name}'")
+        run = client.read_run(runs[0].id)
+
+    print("=" * 80)
+    print("TRACE STRUCTURE ANALYSIS")
+    print("=" * 80)
+    print(f"\nProject: {project_name}")
+    print(f"Run ID: {run.id}")
+    print(f"Run Name: {run.name}")
+    print(f"Run Type: {run.run_type}")
+
+    # Analyze structure
+    structure = {
+        "run_id": str(run.id),
+        "run_name": run.name,
+        "run_type": run.run_type,
+        "has_inputs": bool(run.inputs),
+        "has_outputs": bool(run.outputs),
+        "has_child_run_ids": bool(hasattr(run, 'child_run_ids') and run.child_run_ids),
+        "inputs": {},
+        "outputs": {},
+        "child_runs_info": [],
+        "metadata": run.metadata if hasattr(run, 'metadata') and run.metadata else None
+    }
+
+    # Analyze inputs
+    print("\n" + "=" * 80)
+    print("INPUTS")
+    print("=" * 80)
+    if run.inputs:
+        print(f"\nKeys in run.inputs: {list(run.inputs.keys())}")
+        structure["inputs"]["keys"] = list(run.inputs.keys())
+
+        for key, value in run.inputs.items():
+            value_type = type(value).__name__
+            structure["inputs"][key] = {"type": value_type}
+
+            if show_sample_data:
+                if isinstance(value, (str, int, float, bool)):
+                    sample = str(value)[:100]
+                    print(f"  {key} ({value_type}): {sample}{'...' if len(str(value)) > 100 else ''}")
+                elif isinstance(value, list):
+                    print(f"  {key} ({value_type}): List with {len(value)} items")
+                    if value and len(value) > 0:
+                        print(f"    First item type: {type(value[0]).__name__}")
+                        structure["inputs"][key]["list_item_type"] = type(value[0]).__name__
+                elif isinstance(value, dict):
+                    print(f"  {key} ({value_type}): Dict with keys: {list(value.keys())}")
+                    structure["inputs"][key]["dict_keys"] = list(value.keys())
+                else:
+                    print(f"  {key} ({value_type})")
+    else:
+        print("No inputs found")
+
+    # Analyze outputs
+    print("\n" + "=" * 80)
+    print("OUTPUTS")
+    print("=" * 80)
+    if run.outputs:
+        print(f"\nKeys in run.outputs: {list(run.outputs.keys())}")
+        structure["outputs"]["keys"] = list(run.outputs.keys())
+
+        for key, value in run.outputs.items():
+            value_type = type(value).__name__
+            structure["outputs"][key] = {"type": value_type}
+
+            if show_sample_data:
+                if isinstance(value, (str, int, float, bool)):
+                    sample = str(value)[:100]
+                    print(f"  {key} ({value_type}): {sample}{'...' if len(str(value)) > 100 else ''}")
+                elif isinstance(value, list):
+                    print(f"  {key} ({value_type}): List with {len(value)} items")
+                    if value and len(value) > 0:
+                        print(f"    First item type: {type(value[0]).__name__}")
+                        structure["outputs"][key]["list_item_type"] = type(value[0]).__name__
+
+                        # Special handling for messages array
+                        if key == "messages" and isinstance(value[0], dict):
+                            print(f"    Looks like a messages array!")
+                            print(f"    Message roles found: {set(m.get('role') for m in value if isinstance(m, dict))}")
+                            structure["outputs"][key]["is_messages_array"] = True
+                            structure["outputs"][key]["message_roles"] = list(set(m.get('role') for m in value if isinstance(m, dict)))
+
+                            # Check for tool calls in messages
+                            has_tool_calls = any(
+                                m.get('role') == 'assistant' and m.get('tool_calls')
+                                for m in value if isinstance(m, dict)
+                            )
+                            if has_tool_calls:
+                                print(f"    ✓ Contains tool calls in assistant messages!")
+                                structure["outputs"][key]["has_tool_calls"] = True
+
+                                # Extract tool names
+                                tool_names = set()
+                                for m in value:
+                                    if isinstance(m, dict) and m.get('role') == 'assistant' and m.get('tool_calls'):
+                                        for tc in m.get('tool_calls', []):
+                                            if isinstance(tc, dict):
+                                                tool_names.add(tc.get('function', {}).get('name'))
+                                print(f"    Tools called: {tool_names}")
+                                structure["outputs"][key]["tool_names"] = list(tool_names)
+
+                elif isinstance(value, dict):
+                    print(f"  {key} ({value_type}): Dict with keys: {list(value.keys())}")
+                    structure["outputs"][key]["dict_keys"] = list(value.keys())
+                else:
+                    print(f"  {key} ({value_type})")
+    else:
+        print("No outputs found")
+
+    # Analyze child runs
+    print("\n" + "=" * 80)
+    print("CHILD RUNS")
+    print("=" * 80)
+
+    if hasattr(run, 'child_run_ids') and run.child_run_ids:
+        print(f"\n✓ Has {len(run.child_run_ids)} child run IDs")
+        structure["num_child_runs"] = len(run.child_run_ids)
+
+        # Fetch a few child runs to see structure
+        print("\nFetching child runs to inspect structure...")
+        for i, child_id in enumerate(run.child_run_ids[:3]):  # Just first 3
+            child_run = client.read_run(child_id)
+            child_info = {
+                "name": child_run.name,
+                "type": child_run.run_type,
+                "has_inputs": bool(child_run.inputs),
+                "has_outputs": bool(child_run.outputs),
+            }
+
+            print(f"\n  Child Run {i+1}:")
+            print(f"    Name: {child_run.name}")
+            print(f"    Type: {child_run.run_type}")
+
+            if child_run.inputs:
+                print(f"    Input keys: {list(child_run.inputs.keys())}")
+                child_info["input_keys"] = list(child_run.inputs.keys())
+
+                # Show sample for tool calls
+                if "query" in child_run.inputs:
+                    print(f"    Query: {child_run.inputs['query'][:80]}...")
+
+            if child_run.outputs:
+                print(f"    Output keys: {list(child_run.outputs.keys())}")
+                child_info["output_keys"] = list(child_run.outputs.keys())
+
+            structure["child_runs_info"].append(child_info)
+
+        if len(run.child_run_ids) > 3:
+            print(f"\n  ... and {len(run.child_run_ids) - 3} more child runs")
+
+    else:
+        print("\n✗ No child run IDs found")
+        structure["num_child_runs"] = 0
+
+    # Metadata
+    if structure["metadata"]:
+        print("\n" + "=" * 80)
+        print("METADATA")
+        print("=" * 80)
+        print(f"\nMetadata keys: {list(structure['metadata'].keys())}")
+
+    # Summary and recommendations
+    print("\n" + "=" * 80)
+    print("RECOMMENDATIONS FOR EVALUATOR")
+    print("=" * 80)
+
+    recommendations = []
+
+    # Check if messages are in outputs
+    if (structure["outputs"].get("keys") and "messages" in structure["outputs"]["keys"] and
+        structure["outputs"].get("messages", {}).get("is_messages_array")):
+        print("\n✓ Agent returns messages in outputs")
+        print("  Recommendation: Extract tool calls from run.outputs['messages']")
+        print("  This is the most reliable approach.")
+        recommendations.append("extract_from_messages")
+
+        if structure["outputs"]["messages"].get("has_tool_calls"):
+            print(f"\n✓ Tool calls found in messages")
+            print(f"  Tools: {structure['outputs']['messages'].get('tool_names')}")
+    else:
+        print("\n✗ Agent does not return messages in outputs")
+        if structure["num_child_runs"] > 0:
+            print("  Recommendation: Extract tool calls from run.child_runs")
+            print("  Note: This requires traversing the child run tree")
+            recommendations.append("extract_from_child_runs")
+        else:
+            print("  Warning: No obvious place to find tool calls")
+            print("  Consider updating agent to return messages in outputs")
+
+    structure["recommendations"] = recommendations
+
+    # Return structure for programmatic use
+    print("\n" + "=" * 80)
+    return structure
+
+
+if __name__ == "__main__":
+    import sys
+
+    if len(sys.argv) < 2:
+        print("Usage: python inspect_trace.py <project_name> [run_id]")
+        print("\nExample:")
+        print("  python inspect_trace.py my-langsmith-project")
+        print("  python inspect_trace.py my-langsmith-project 019c546c-2ce6-7853-8ac5-939a88d7c4a4")
+        sys.exit(1)
+
+    project_name = sys.argv[1]
+    run_id = sys.argv[2] if len(sys.argv) > 2 else None
+
+    structure = inspect_trace_structure(project_name, run_id)
+
+    print("\n" + "=" * 80)
+    print("Structure data saved for programmatic use")
+    print("=" * 80)
+    print("\nYou can import this function and use the returned dict:")
+    print("  from inspect_trace import inspect_trace_structure")
+    print("  structure = inspect_trace_structure('your-project')")
+    print("  if 'extract_from_messages' in structure['recommendations']:")
+    print("      # Extract from run.outputs['messages']")