Backtesting (#223)

2026-07-01 08:12:02 -04:00 · 2024-04-09 10:46:22 -07:00
parent 657692a778
commit e7065aa285
5 changed files with 33 additions and 23 deletions
@@ -66,7 +66,7 @@ Test and benchmark your LLM systems using methods in these evaluation recipes:

 **Fundamentals**

- [Production Candidate Testing](./testing-examples/prod-candidate-testing/prod-candidate-testing.ipynb): benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
+- [Backtesting](./testing-examples/backtesting/backtesting.ipynb): benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
 - [Adding Metrics to Existing Tests](./testing-examples/evaluate-existing-test-project/evaluate_runs.ipynb): Apply new evaluators to existing test results without re-running your model, using the `compute_test_metrics` utility function. This lets you evaluate "post-hoc" and backfill metrics as you define new evaluators.
 - [Naming Test Projects](./testing-examples/naming-test-projects/naming-test-projects.md): manually name your tests with `run_on_dataset(..., project_name='my-project-name')`
 - [Exporting Tests to CSV](./testing-examples/export-test-to-csv/export-test-to-csv.ipynb): Use the `get_test_results` beta utility to easily export your test results to a CSV file. This allows you to analyze and report on the performance metrics, errors, runtime, inputs, outputs, and other details of your tests outside of the LangSmith platform.
@@ -34,7 +34,7 @@ sidebar_position: 4
 **Fundamentals**

 - [Adding Metrics to Existing Tests](./evaluate-existing-test-project/evaluate_runs.ipynb): Apply new evaluators to existing test results without re-running your model, using the `compute_test_metrics` utility function. This lets you evaluate "post-hoc" and backfill metrics as you define new evaluators.
- [Production Candidate Testing](./prod-candidate-testing/prod-candidate-testing.ipynb): benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
+- [Production Candidate Testing](./backtesting/backtesting.ipynb): benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
 - [Naming Test Projects](./naming-test-projects/naming-test-projects.md): manually name your tests with `run_on_dataset(..., project_name='my-project-name')`
 - [Exporting Tests to CSV](./export-test-to-csv/export-test-to-csv.ipynb): Use the `get_test_results` beta utility to easily export your test results to a CSV file. This allows you to analyze and report on the performance metrics, errors, runtime, inputs, outputs, and other details of your tests outside of the LangSmith platform.
 - [How to download feedback and examples from a test project](./download-feedback-and-examples/download_example.ipynb): goes beyond the utility described above to query and export the predictions, evaluation results, and other information to programmatically add to your reports.
@@ -5,7 +5,7 @@
   "id": "776d4494-515a-4e5c-b146-515d4ecc8981",
   "metadata": {},
   "source": [
-    "# Production Candidate Testing\n",
+    "# Backtesting\n",
    "\n",
    "Deploying your app into production is just one step in a longer journey of continuous improvement. You'll likely want to develop other candidate systems that improve on your production model using improved prompts, LLMs, indexing strategies, and other techniques. While you may have a set of offline datasets already created by this point, it's often useful to compare system performance on more recent production data.\n",
    "\n",
@@ -43,7 +43,7 @@
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
-    "%pip install -U --quiet langchain langsmith langchain_anthropic langchainhub"
+    "%pip install -U --quiet langsmith langchain_anthropic langchainhub langchain"
   ]
  },
  {
@@ -126,7 +126,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 3,
   "id": "c9856567-02dd-42b8-896d-c92f23a822b3",
   "metadata": {},
   "outputs": [
@@ -134,24 +134,22 @@
     "name": "stderr",
     "output_type": "stream",
     "text": [
-      "/var/folders/gf/6rnp_mbx5914kx7qmmh7xzmw0000gn/T/ipykernel_38616/2272910727.py:29: UserWarning: Function convert_runs_to_test is in beta.\n",
+      "/var/folders/gf/6rnp_mbx5914kx7qmmh7xzmw0000gn/T/ipykernel_78795/2746354860.py:27: UserWarning: Function convert_runs_to_test is in beta.\n",
      "  convert_runs_to_test(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
-       "TracerSession(id=UUID('5c17069e-7959-49d6-92e4-5b6f573ae9b7'), start_time=datetime.datetime(2024, 3, 19, 1, 41, 41, 257232), end_time=None, description=None, name='prod-baseline-7a3f20', extra={'metadata': {'which': 'prod-baseline', 'dataset_version': '2024-03-19T01:41:27.787206+00:00'}}, tenant_id=UUID('ebbaf2eb-769b-4505-aca2-d11de10372a4'))"
+       "TracerSession(id=UUID('62afc62c-d831-4a05-97a2-a67db683c67e'), start_time=datetime.datetime(2024, 4, 9, 17, 42, 27, 183712), end_time=None, description=None, name='prod-baseline-90b90f', extra={'metadata': {'which': 'prod-baseline', 'dataset_version': '2024-04-09T17:42:12.001577+00:00'}}, tenant_id=UUID('ebbaf2eb-769b-4505-aca2-d11de10372a4'))"
      ]
     },
-     "execution_count": 5,
+     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "import random\n",
-    "import uuid\n",
    "from datetime import datetime, timedelta, timezone\n",
    "\n",
    "from langsmith import Client\n",
@@ -175,7 +173,7 @@
    ")\n",
    "\n",
    "# Name of the dataset we want to create\n",
-    "dataset_name = f'{project_name}-candidate-testing {start_time.strftime(\"%Y-%m-%d\")}-{end_time.strftime(\"%Y-%m-%d\")}'\n",
+    "dataset_name = f'{project_name}-backtesting {start_time.strftime(\"%Y-%m-%d\")}-{end_time.strftime(\"%Y-%m-%d\")}'\n",
    "# This converts the runs to a dataset + test\n",
    "# It does not actually invoke your model\n",
    "convert_runs_to_test(\n",
@@ -204,7 +202,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 5,
   "id": "c73d53ac-ac50-4cdf-b9cb-b2de7320d6a3",
   "metadata": {
    "scrolled": true
@@ -214,28 +212,41 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "View the evaluation results for project 'timely-shame-82' at:\n",
-      "https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/6c82236b-82cc-408f-839b-fa1d48983932/compare?selectedSessions=003e14f3-995c-41cc-a310-15a7841d3187\n",
+      "View the evaluation results for experiment: 'HaikuBenchmark:2a3311d' at:\n",
+      "https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/79e66af2-db17-4ea1-acb0-efb070340b92/compare?selectedSessions=886b72b8-734c-4431-bf67-9b3e16d41f9c\n",
      "\n",
-      "View all tests for Dataset Tweet Critic-candidate-testing 2024-03-18-2024-03-19 at:\n",
-      "https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/6c82236b-82cc-408f-839b-fa1d48983932\n",
-      "[------------------------------------------------->] 11/11"
+      "\n"
     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "10038b1c8ce04de0aee22baa9bd56ada",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "0it [00:00, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
    }
   ],
   "source": [
-    "from langchain.load import load\n",
+    "from langsmith.evaluation import evaluate\n",
    "\n",
    "\n",
-    "def deserialize_messages(example_input: dict):\n",
+    "def predict(example_input: dict):\n",
    "    # The dataset includes serialized messages that we\n",
    "    # must convert to a format accepted by our system.\n",
-    "    return {\n",
+    "    messages = {\n",
    "        \"messages\": [\n",
    "            (message[\"type\"], message[\"content\"])\n",
    "            for message in example_input[\"messages\"]\n",
    "        ]\n",
    "    }\n",
+    "    return system.invoke(messages)\n",
    "\n",
    "\n",
    "# Use an updated version of the prompt\n",
@@ -243,9 +254,8 @@
    "llm = ChatAnthropic(model=\"claude-3-haiku-20240307\")\n",
    "system = prompt | llm | StrOutputParser()\n",
    "\n",
-    "test_results = client.run_on_dataset(\n",
-    "    llm_or_chain_factory=deserialize_messages | system,\n",
-    "    dataset_name=dataset_name,\n",
+    "test_results = evaluate(\n",
+    "    predict, data=dataset_name, experiment_prefix=\"HaikuBenchmark\", max_concurrency=3\n",
    ")"
   ]
  },
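
Note on the last hunk: the new `predict` wrapper converts each serialized message into a `(type, content)` tuple and then invokes the chain itself, replacing the old `deserialize_messages | system` composition. The following is a minimal standalone sketch of that conversion only. `StubSystem`, `to_chat_input`, and the sample input are hypothetical names introduced for illustration; they are not part of this commit, and the stub stands in for `prompt | llm | StrOutputParser()` so the snippet runs without a model or network access.

```python
def to_chat_input(example_input: dict) -> dict:
    """Convert serialized messages into (type, content) tuples,
    mirroring the conversion done inside `predict` in the diff."""
    return {
        "messages": [
            (message["type"], message["content"])
            for message in example_input["messages"]
        ]
    }


class StubSystem:
    """Hypothetical stand-in for the real chain (assumption, not from the commit)."""

    def invoke(self, inputs: dict) -> str:
        # Echo the content of the last message instead of calling a model.
        return inputs["messages"][-1][1]


system = StubSystem()


def predict(example_input: dict) -> str:
    # Same shape as the notebook's predict: convert, then invoke the system.
    return system.invoke(to_chat_input(example_input))


# Hypothetical dataset example with one serialized message.
sample = {"messages": [{"type": "human", "content": "Critique this tweet"}]}
result = predict(sample)
```

In the notebook, a function of this shape is what gets passed as the first argument to `evaluate(...)`, so it has to accept a single example-input dict and return the system output.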