2024-03-19 17:02:31 -07:00


{
"cells": [
{
"cell_type": "markdown",
"id": "1ddb1a3b-eaf7-4755-8bfe-4d9178c7927a",
"metadata": {
"tags": []
},
"source": [
"# Add Metrics to Existing Tests\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langsmith-cookbook/blob/main/testing-examples/evaluate-existing-test-project/evaluate_runs.ipynb)\n",
"\n",
"At times, you may want to apply an evaluator post-hoc. This is useful if you have a new evaluator (or version of an evaluator) and want to add the metrics without re-running your model. \n",
"\n",
"You can do this like so:\n",
"\n",
"```python\n",
"from langsmith.beta import compute_test_metrics\n",
"\n",
"def my_evaluator(run, example):\n",
" score = \"foo\" in run.outputs['output']\n",
" return {\"key\": \"is_foo\", \"score\": score}\n",
"\n",
"# The name of the test you have already run.\n",
"# This is DISTINCT from the dataset name\n",
"test_project = \"test-abc123\"\n",
"compute_test_metrics(test_project, evaluators=[my_evaluator])\n",
"```\n",
"\n",
"Within the `compute_test_metrics` function, we list the runs in the test and apply the provided evaluators to each one.\n",
"\n",
"Below, we will share a quick example."
]
},
{
"cell_type": "markdown",
"id": "9c7e62f7-5f6d-40c7-9efc-e5cd76321fda",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Install the requisite packages, and generate the initial test results. In reality, you will already have a dataset + test results.\n",
"\n",
"This utility function expects `langsmith>=0.1.31`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "03d82d6f-67a3-4a2d-9b86-604bc48b5820",
"metadata": {},
"outputs": [],
"source": [
"# %pip install -U langsmith langchain"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ee6bfe5b-9736-4a8b-85e7-2b749ee747fc",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import uuid\n",
"\n",
"os.environ[\"LANGCHAIN_API_KEY\"] = \"YOUR API KEY\"\n",
"os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
"# Update if you are self-hosted\n",
"os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "be0ff7e9-41f6-463e-943f-f9e77b92cdc0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'puzzled-cloud-96' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cbdb128b-a725-4662-a515-dfe0009cb15c/compare?selectedSessions=28f2c88e-3091-4fcc-bac7-c1dbd8a6a43b\n",
"\n",
"View all tests for Dataset My Example Dataset 512ee7 at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cbdb128b-a725-4662-a515-dfe0009cb15c\n",
"[------------------------------------------------->] 10/10"
]
}
],
"source": [
"from langsmith import Client\n",
"\n",
"client = Client()\n",
"dataset_name = \"My Example Dataset \" + uuid.uuid4().hex[:6]\n",
"\n",
"ds = client.create_dataset(dataset_name=dataset_name)\n",
"client.create_examples(\n",
" inputs=[{\"input\": i} for i in range(10)],\n",
" outputs=[{\"output\": i * (3 % (i + 1))} for i in range(10)],\n",
" dataset_id=ds.id,\n",
")\n",
"\n",
"\n",
"def my_chain(example_input: dict):\n",
" # The input to the llm_or_chain_factory is\n",
" # the example.inputs\n",
" return {\"output\": example_input[\"input\"] * 3}\n",
"\n",
"\n",
"results = client.run_on_dataset(\n",
" dataset_name=dataset_name, llm_or_chain_factory=my_chain\n",
")\n",
"\n",
"test_name = results[\"project_name\"]"
]
},
{
"cell_type": "markdown",
"id": "996a3fb4-ae21-4b18-8ba6-d12c4fa73356",
"metadata": {},
"source": [
"## Add Evaluation Metrics\n",
"\n",
"Now that we have existing test results, we can apply new evaluators to this project using the `compute_test_metrics` utility function."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ae6f9459-51fa-468c-bc65-0b965f5ba628",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/gf/6rnp_mbx5914kx7qmmh7xzmw0000gn/T/ipykernel_80329/988510393.py:14: UserWarning: Function compute_test_metrics is in beta.\n",
" compute_test_metrics(test_name, evaluators=[exact_match])\n"
]
}
],
"source": [
"from langsmith.beta._evals import compute_test_metrics\n",
"from langsmith.schemas import Example, Run\n",
"\n",
"\n",
"def exact_match(run: Run, example: Example):\n",
" # \"output\" is the key we assigned in the create_examples step above\n",
" expected = example.outputs[\"output\"]\n",
" predicted = run.outputs[\"output\"]\n",
" return {\"key\": \"exact_match\", \"score\": predicted == expected}\n",
"\n",
"\n",
"# The name of the test you have already run.\n",
"# This is DISTINCT from the dataset name\n",
"compute_test_metrics(test_name, evaluators=[exact_match])"
]
},
{
"cell_type": "markdown",
"id": "1cdb41ef-3892-4385-8830-c6decfbf8f5c",
"metadata": {},
"source": [
"Now you can check out the test results in the above link.\n",
"\n",
"## Conclusion\n",
"\n",
"Congrats! You've run evals on an existing test. This makes it easy to backfill evaluation results on old test results."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}