langsmith-cookbook/testing-examples/backtesting/backtesting.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "776d4494-515a-4e5c-b146-515d4ecc8981",
   "metadata": {},
   "source": [
    "# Backtesting\n",
    "\n",
    "Deploying your app into production is just one step in a longer journey continuous improvement. You'll likely want to develop other candidate systems that improve on your production model using improved prompts, llms, indexing strategies, and other techniques. While you may have a set of offline datasets already created by this point, it's often useful to compare system performance on more recent production data.\n",
    "\n",
    "This notebook shows how to do this in LangSmith.\n",
    "\n",
    "The basic steps are:\n",
    "\n",
    "1. Sample runs to test against from your production tracing project.\n",
    "2. Convert runs to dataset + initial test project.\n",
    "3. Run new system against the dataset to compare.\n",
    "\n",
    "You will then have a new dataset of representative inputs you can you can version and backtest your models against.\n",
    "\n",
    "\n",
    "![Resulting Tests](./img/dataset_page.jpg)\n",
    "\n",
    "**Note:** In most cases, you won't have \"ground truth\" answers in this case, but you can manually compare and label or use reference-free evaluators to score the outputs.(If your application DOES permit capturing ground-truth labels, then we obviously recommend you use those."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce385e04-67b8-4fa5-8024-265cf7303fd6",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Install + set environment variables. This requires `langsmith>=0.1.29` to use the beta utilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ac5a07f-97fe-48e3-9eb9-d5975364d8c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install -U --quiet langsmith langchain_anthropic langchainhub langchain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ce200938-9229-4b95-b711-e1bcb6a2d391",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Set the project name to whichever project you'd like to be testing against\n",
    "project_name = \"Tweet Critic\"\n",
    "os.environ[\"LANGCHAIN_API_KEY\"] = \"YOUR API KEY\"\n",
    "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
    "os.environ[\"ANTHROPIC_API_KEY\"] = \"YOUR ANTHROPIC API KEY\"\n",
    "os.environ[\"LANGCHAIN_PROJECT\"] = project_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "837af068-7ae6-497e-a29d-e677b637fe90",
   "metadata": {},
   "source": [
    "#### (Preliminary) Production Deployment\n",
    "\n",
    "You likely have a project already and can skip this step. \n",
    "\n",
    "We'll simulate one here so no one reading this notebook gets left out. \n",
    "Our example app is a \"tweet critic\" that revises tweets we put out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "863adc9d-b4de-4e29-9e8f-50eaa4f52388",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import hub\n",
    "from langchain_anthropic import ChatAnthropic\n",
    "from langchain_core.messages import HumanMessage\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "\n",
    "prompt = hub.pull(\"wfh/tweet-critic:7e4f539e\")\n",
    "llm = ChatAnthropic(model=\"claude-3-haiku-20240307\")\n",
    "system = prompt | llm | StrOutputParser()\n",
    "\n",
    "\n",
    "inputs = [\n",
    "    \"\"\"RAG From Scratch: Our RAG From Scratch video series covers some important RAG concepts in short, focused videos with code. This is the 10th video and it discusses query routing. Problem: We sometimes have multiple datastores (e.g., different vector DBs, SQL DBs, etc) and prompts to choose from based on a user query. Idea: Logical routing can use an LLM to decide which datastore is more appropriate. Semantic routing embeds the query and prompts, then chooses the best prompt based on similarity. Video: https://youtu.be/pfpIndq7Fi8 Code: https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_10_and_11.ipynb\"\"\",\n",
    "    \"\"\"@Voyage_AI_ Embedding Integration Package Use the same custom embeddings that power Chat LangChain via the new langchain-voyageai package! Voyage AI builds custom embedding models that can improve retrieval quality. ChatLangChain: https://chat.langchain.com Python Docs: https://python.langchain.com/docs/integrations/providers/voyageai\"\"\",\n",
    "    \"\"\"Implementing RAG: How to Write a Graph Retrieval Query in LangChain Our friends at @neo4j have a nice guide on combining LLMs and graph databases. Blog:\"\"\",\n",
    "    \"\"\"Text-to-PowerPoint with LangGraph.js You can now generate PowerPoint presentations from text! @TheGreatBonnie wrote a guide showing how to use LangGraph.js, @tavilyai, and @CopilotKit to build a Next.js app for this. Tutorial: https://dev.to/copilotkit/how-to-build-an-ai-powered-powerpoint-generator-langchain-copilotkit-openai-nextjs-4c76 Repo: https://github.com/TheGreatBonnie/aipoweredpowerpointapp\"\"\",\n",
    "    \"\"\"Build an Answer Engine Using Groq, Mixtral, Langchain, Brave & OpenAI in 10 Min Our friends at @Dev__Digest have a tutorial on building an answer engine over the internet. Code: https://github.com/developersdigest/llm-answer-engine YouTube: https://youtube.com/watch?v=43ZCeBTcsS8&t=96s\"\"\",\n",
    "    \"\"\"Building a RAG Pipeline with LangChain and Amazon Bedrock Amazon Bedrock has great models for building LLM apps. This guide covers how to get started with them to build a RAG pipeline. https://gettingstarted.ai/langchain-bedrock/\"\"\",\n",
    "    \"\"\"SF Meetup on March 27! Join our meetup to hear from LangChain and Pulumi experts and learn about building AI-enabled capabilities. Sign up: https://meetup.com/san-francisco-pulumi-user-group/events/299491923/?utm_campaign=FY2024Q3_Meetup_PUG%20SF&utm_content=286236214&utm_medium=social&utm_source=twitter&hss_channel=tw-837770064870817792\"\"\",\n",
    "    \"\"\"Chat model response metadata @LangChainAI chat model invocations now include metadata like logprobs directly in the output. Upgrade your version of `langchain-core` to try it. PY: https://python.langchain.com/docs/modules/model_io/chat/logprobs JS: https://js.langchain.com/docs/integrations/chat/openai#generation-metadata\"\"\",\n",
    "    \"\"\"Benchmarking Query Analysis in High Cardinality Situations Handling high-cardinality categorical values can be challenging. This blog explores 6 different approaches you can take in these situations. Blog: https://blog.langchain.dev/high-cardinality\"\"\",\n",
    "    \"\"\"Building Google's Dramatron with LangGraph.js & Claude 3 We just released a long YouTube video (1.5 hours!) on building Dramatron using LangGraphJS and @AnthropicAI's Claude 3 \"Haiku\" model. It's a perfect fit for LangGraph.js and Haiku's speed. Check out the tutorial: https://youtube.com/watch?v=alHnQjyn7hg\"\"\",\n",
    "    \"\"\"Document Loading Webinar with @AirbyteHQ Join a webinar on document loading with PyAirbyte and LangChain on 3/14 at 10am PDT. Features our founding engineer @eyfriis and the @aaronsteers and Bindi Pankhudi team. Register: https://airbyte.com/session/airbyte-monthly-ai-demo\"\"\",\n",
    "]\n",
    "\n",
    "_ = system.batch(\n",
    "    [{\"messages\": [HumanMessage(content=content)]} for content in inputs],\n",
    "    {\"max_concurrency\": 3},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5899ef2f-8846-4449-9eb1-a4e5e58db76a",
   "metadata": {},
   "source": [
    "## Convert Prod Runs to Test\n",
    "\n",
    "The first step is to generate a dataset based on the production _inputs_.\n",
    "Then copy over all the traces to serve as a baseline run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c9856567-02dd-42b8-896d-c92f23a822b3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/gf/6rnp_mbx5914kx7qmmh7xzmw0000gn/T/ipykernel_78795/2746354860.py:27: UserWarning: Function convert_runs_to_test is in beta.\n",
      "  convert_runs_to_test(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "TracerSession(id=UUID('62afc62c-d831-4a05-97a2-a67db683c67e'), start_time=datetime.datetime(2024, 4, 9, 17, 42, 27, 183712), end_time=None, description=None, name='prod-baseline-90b90f', extra={'metadata': {'which': 'prod-baseline', 'dataset_version': '2024-04-09T17:42:12.001577+00:00'}}, tenant_id=UUID('ebbaf2eb-769b-4505-aca2-d11de10372a4'))"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datetime import datetime, timedelta, timezone\n",
    "\n",
    "from langsmith import Client\n",
    "from langsmith.beta import convert_runs_to_test\n",
    "\n",
    "# How we are sampling runs to include in our test\n",
    "end_time = datetime.now(tz=timezone.utc)\n",
    "start_time = end_time - timedelta(days=1)\n",
    "run_filter = f'and(gt(start_time, \"{start_time.isoformat()}\"), lt(end_time, \"{end_time.isoformat()}\"))'\n",
    "\n",
    "\n",
    "# Fetch the runs we want to convert to a test\n",
    "client = Client()\n",
    "\n",
    "prod_runs = list(\n",
    "    client.list_runs(\n",
    "        project_name=project_name,\n",
    "        execution_order=1,\n",
    "        filter=run_filter,\n",
    "    )\n",
    ")\n",
    "\n",
    "# Name of the dataset we want to create\n",
    "dataset_name = f'{project_name}-backtesting {start_time.strftime(\"%Y-%m-%d\")}-{end_time.strftime(\"%Y-%m-%d\")}'\n",
    "# This converts the runs to a dataset + test\n",
    "# It does not actually invoke your model\n",
    "convert_runs_to_test(\n",
    "    prod_runs,\n",
    "    # Name of the resulting dataset\n",
    "    dataset_name=dataset_name,\n",
    "    # Whether to include the run outputs as reference/ground truth\n",
    "    include_outputs=False,\n",
    "    # Whether to include the full traces in the resulting test project\n",
    "    # (default is to just include the root run)\n",
    "    load_child_runs=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21ac5e60-1210-4f9e-a8b7-c14d9c3d40ee",
   "metadata": {},
   "source": [
    "## Benchmark new system\n",
    "\n",
    "Now we have the dataset and prod runs saved as a \"test\".\n",
    "\n",
    "Let's run inference on our new system to compare."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c73d53ac-ac50-4cdf-b9cb-b2de7320d6a3",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'HaikuBenchmark:2a3311d' at:\n",
      "https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/79e66af2-db17-4ea1-acb0-efb070340b92/compare?selectedSessions=886b72b8-734c-4431-bf67-9b3e16d41f9c\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "10038b1c8ce04de0aee22baa9bd56ada",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "\n",
    "def predict(example_input: dict):\n",
    "    # The dataset includes serialized messages that we\n",
    "    # must convert to a format accepted by our system.\n",
    "    messages = {\n",
    "        \"messages\": [\n",
    "            (message[\"type\"], message[\"content\"])\n",
    "            for message in example_input[\"messages\"]\n",
    "        ]\n",
    "    }\n",
    "    return system.invoke(messages)\n",
    "\n",
    "\n",
    "# Use an updated version of the prompt\n",
    "prompt = hub.pull(\"wfh/tweet-critic:34c57e4f\")\n",
    "llm = ChatAnthropic(model=\"claude-3-haiku-20240307\")\n",
    "system = prompt | llm | StrOutputParser()\n",
    "\n",
    "test_results = evaluate(\n",
    "    predict, data=dataset_name, experiment_prefix=\"HaikuBenchmark\", max_concurrency=3\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e2e0714-37e3-4fa2-bc8b-0fb2380e7773",
   "metadata": {},
   "source": [
    "## Review runs\n",
    "\n",
    "You can now compare the outputs in the UI.\n",
    "\n",
    "![Comparison page](./img/comparison_view.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d21d8620-34c7-486e-8a36-dfc6d1ea5991",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "Congrats! You've sampled production runs and started benchmarking other systems against them.\n",
    "In this exercise, we chose not to apply any evaluators to simplify things (since we lack ground-truth answers for this task).\n",
    "You can manually review the results in LangSmith and/or apply a reference-free evaluator to the results to generate metrics instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b6fefc3-f959-4213-a4a1-ba6aa853ccc0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}