langsmith-cookbook/testing-examples/using-fixed-sources/using_fixed_sources.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1a7184af-d54f-487d-ad7f-0f3274dc689b",
   "metadata": {},
   "source": [
    "# RAG Evaluation using Fixed Sources\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langsmith-cookbook/blob/main/testing-examples/using-fixed-sources/using_fixed_sources.ipynb)\n",
    "\n",
    "A simple RAG pipeline requires at least two components: a retriever and a response generator. You can evaluate the whole chain end-to-end, as shown in the [QA Correctness](../qa-correctness/) walkthrough. However, for more actionable and fine-grained metrics, it is helpful to evaluate each component in isolation.\n",
    "\n",
    "To evaluate the response generator directly, create a dataset with the user query and retrieved documents as inputs and the expected response as an output.\n",
    "\n",
    "In this walkthrough, you will take this approach to evaluate the response generation component of a RAG pipeline, using both correctness and a custom \"faithfulness\" evaluator to generate multiple metrics. The results will look something like the following:\n",
    "\n",
    "![Custom Evaluator](./img/example_results.png)\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "First, install the required packages and configure your environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1c748b92-e590-408f-bd20-733dc79d643e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U langchain openai anthropic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "78c086f0-f1c4-4a55-a922-c926239de2c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import uuid\n",
    "\n",
    "# Update with your API URL if using a hosted instance of LangSmith.\n",
    "os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\"\n",
    "os.environ[\"LANGCHAIN_API_KEY\"] = \"YOUR API KEY\"  # Update with your API key\n",
    "uid = uuid.uuid4()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "039a0309-48f8-4770-8b34-2b97eb85a247",
   "metadata": {},
   "source": [
    "## 1. Create a dataset\n",
    "\n",
    "Next, create a dataset. The simple dataset below is enough to illustrate ways the response generator may deviate from the desired behavior by relying too much on its pretrained \"knowledge\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "83f83f2e-76d1-4d86-9275-35bd61df014e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A simple example dataset\n",
    "examples = [\n",
    "    {\n",
    "        \"inputs\": {\n",
    "            \"question\": \"What's the company's total revenue for q2 of 2022?\",\n",
    "            \"documents\": [\n",
    "                {\n",
    "                    \"metadata\": {},\n",
    "                    \"page_content\": \"In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars.\",\n",
    "                }\n",
    "            ],\n",
    "        },\n",
    "        \"outputs\": {\n",
    "            \"label\": \"2 trillion dollars\",\n",
    "        },\n",
    "    },\n",
    "    {\n",
    "        \"inputs\": {\n",
    "            \"question\": \"Who is Lebron?\",\n",
    "            \"documents\": [\n",
    "                {\n",
    "                    \"metadata\": {},\n",
    "                    \"page_content\": \"On Thursday, February 16, Lebron James was nominated as President of the United States.\",\n",
    "                }\n",
    "            ],\n",
    "        },\n",
    "        \"outputs\": {\n",
    "            \"label\": \"Lebron James is the President of the USA.\",\n",
    "        },\n",
    "    },\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "adf85a45-e100-4d28-a102-ec1d135f0ba7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith import Client\n",
    "\n",
    "client = Client()\n",
    "\n",
    "dataset_name = f\"Faithfulness Example - {uid}\"\n",
    "dataset = client.create_dataset(dataset_name=dataset_name)\n",
    "client.create_examples(\n",
    "    inputs=[e[\"inputs\"] for e in examples],\n",
    "    outputs=[e[\"outputs\"] for e in examples],\n",
    "    dataset_id=dataset.id,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4aa0264-d24f-495a-b9f6-87ddf97aaeb6",
   "metadata": {},
   "source": [
    "## 2. Define chain\n",
    "\n",
    "Suppose your chain is composed of two main components: a retriever and response synthesizer. Using LangChain runnables, it's easy to separate these two components to evaluate them in isolation.\n",
    "\n",
    "Below is a very simple RAG chain with a placeholder retriever. For our testing, we will evaluate ONLY the response synthesizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6314168f-9530-476f-949b-d49c40db55ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import chat_models, prompts\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.retrievers import BaseRetriever\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
    "\n",
    "class MyRetriever(BaseRetriever):\n",
    "    def _get_relevant_documents(self, query, *, run_manager):\n",
    "        return [Document(page_content=\"Example\")]\n",
    "\n",
    "\n",
    "# This is what we will evaluate\n",
    "response_synthesizer = prompts.ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\", \"Respond using the following documents as context:\\n{documents}\"),\n",
    "        (\"user\", \"{question}\"),\n",
    "    ]\n",
    ") | chat_models.ChatAnthropic(model=\"claude-2\", max_tokens=1000)\n",
    "\n",
    "# Full chain below for illustration\n",
    "chain = {\n",
    "    \"documents\": MyRetriever(),\n",
    "    \"question\": RunnablePassthrough(),\n",
    "} | response_synthesizer"
   ]
  },
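  {
   "cell_type": "markdown",
   "id": "fixed-sources-prompt-preview-note",
   "metadata": {},
   "source": [
    "To check what the response synthesizer sees when the documents are supplied directly, you can render the prompt without calling the model. The cell below is a minimal sketch, assuming `langchain_core` is installed; `preview_prompt` and `sample_docs` are illustrative names that do not appear elsewhere in this walkthrough, and the template is copied from the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fixed-sources-prompt-preview-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "# Illustrative copy of the response synthesizer's prompt (no model attached).\n",
    "preview_prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\", \"Respond using the following documents as context:\\n{documents}\"),\n",
    "        (\"user\", \"{question}\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Fixed documents shaped like the dataset inputs from step 1.\n",
    "sample_docs = [\n",
    "    {\n",
    "        \"metadata\": {},\n",
    "        \"page_content\": \"On Thursday, February 16, Lebron James was nominated as President of the United States.\",\n",
    "    }\n",
    "]\n",
    "\n",
    "# format_messages fills in the template locally; nothing is sent to a model.\n",
    "for message in preview_prompt.format_messages(\n",
    "    documents=sample_docs, question=\"Who is Lebron?\"\n",
    "):\n",
    "    print(f\"{message.type}: {message.content}\")"
   ]
  },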
  {
   "cell_type": "markdown",
   "id": "3e6087fa-432d-4a59-b023-29058b5ec6ea",
   "metadata": {},
   "source": [
    "## 3. Evaluate\n",
    "\n",
    "Below, we will define a custom \"FaithfulnessEvaluator\" that measures how faithful the chain's output prediction is to the reference input documents, given the user's input question.\n",
    "\n",
    "In this case, we will wrap the [Scoring Eval Chain](https://python.langchain.com/docs/guides/productionization/evaluation/string/scoring_eval_chain) and manually select which fields in the run and dataset example to use to represent the prediction, input question, and reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8217e940-e6d0-4f08-bed7-41cda7a35ce8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith.evaluation import RunEvaluator, EvaluationResult\n",
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "\n",
    "class FaithfulnessEvaluator(RunEvaluator):\n",
    "    def __init__(self):\n",
    "        self.evaluator = load_evaluator(\n",
    "            \"labeled_score_string\",\n",
    "            criteria={\n",
    "                \"faithful\": \"How faithful is the submission to the reference context?\"\n",
    "            },\n",
    "            normalize_by=10,\n",
    "        )\n",
    "\n",
    "    def evaluate_run(self, run, example) -> EvaluationResult:\n",
    "        res = self.evaluator.evaluate_strings(\n",
    "            prediction=next(iter(run.outputs.values())),\n",
    "            input=run.inputs[\"question\"],\n",
    "            # We are treating the documents as the reference context in this case.\n",
    "            reference=example.inputs[\"documents\"],\n",
    "        )\n",
    "        return EvaluationResult(key=\"labeled_criteria:faithful\", **res)"
   ]
  },
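  {
   "cell_type": "markdown",
   "id": "fixed-sources-field-selection-note",
   "metadata": {},
   "source": [
    "The field selection inside `evaluate_run` can be tried out locally before running a full evaluation. The cell below is a rough sketch that uses `SimpleNamespace` objects as hypothetical stand-ins for the LangSmith run and example (real objects carry many more fields, and the `\"output\"` key and its value are assumptions). It only shows which values would be passed as the prediction, input, and reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fixed-sources-field-selection-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from types import SimpleNamespace\n",
    "\n",
    "# Hypothetical stand-ins for the objects passed to evaluate_run.\n",
    "fake_example = SimpleNamespace(\n",
    "    inputs={\n",
    "        \"question\": \"Who is Lebron?\",\n",
    "        \"documents\": [\n",
    "            {\n",
    "                \"metadata\": {},\n",
    "                \"page_content\": \"On Thursday, February 16, Lebron James was nominated as President of the United States.\",\n",
    "            }\n",
    "        ],\n",
    "    },\n",
    ")\n",
    "fake_run = SimpleNamespace(\n",
    "    inputs=fake_example.inputs,\n",
    "    # Assumed output key and text, for illustration only.\n",
    "    outputs={\"output\": \"Lebron James was nominated as President.\"},\n",
    ")\n",
    "\n",
    "# The same three selections made in FaithfulnessEvaluator.evaluate_run.\n",
    "selected = {\n",
    "    \"prediction\": next(iter(fake_run.outputs.values())),\n",
    "    \"input\": fake_run.inputs[\"question\"],\n",
    "    \"reference\": fake_example.inputs[\"documents\"],\n",
    "}\n",
    "for name, value in selected.items():\n",
    "    print(f\"{name}: {value}\")"
   ]
  },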
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "70e56fe8-ae02-481d-b6f6-729e535c5e88",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for project 'test-puzzled-texture-92' at:\n",
      "https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/4d35dd98-d797-47ce-ae4b-608e96ddf6bf\n",
      "[------------------------------------------------->] 2/2"
     ]
    }
   ],
   "source": [
    "from langchain.smith import RunEvalConfig\n",
    "\n",
    "eval_config = RunEvalConfig(\n",
    "    evaluators=[\"qa\"],\n",
    "    custom_evaluators=[FaithfulnessEvaluator()],\n",
    "    input_key=\"question\",\n",
    ")\n",
    "results = client.run_on_dataset(\n",
    "    llm_or_chain_factory=response_synthesizer,\n",
    "    dataset_name=dataset_name,\n",
    "    evaluation=eval_config,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31e74b5b-b8e2-433d-a9da-c56539152833",
   "metadata": {},
   "source": [
    "You can review the results in LangSmith to see how the chain fares. The trace for the custom faithfulness evaluator should look something like this:\n",
    "\n",
    "[![](./img/example_score.png)](https://smith.langchain.com/public/9a4e6ee2-f26c-4bcd-a050-04766fbfd350/r)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bca9286b-0463-4f81-a27b-b2e3b16955c2",
   "metadata": {},
   "source": [
    "## Discussion\n",
    "\n",
    "You've now evaluated the response generator for its response correctness and its \"faithfulness\" to the source text by fixing the retrieved document sources in the dataset. This is an effective way to confirm that the response component of your chat bot behaves according to expectations.\n",
    "\n",
    "In setting up the evaluation, you used a custom run evaluator to select which fields in the dataset to use in the evaluation template. Under the hood, this still uses an off-the-shelf [scoring evaluator](https://python.langchain.com/docs/guides/productionization/evaluation/string/scoring_eval_chain).\n",
    "\n",
    "Most of LangChain's open-source evaluators implement the \"[StringEvaluator](https://python.langchain.com/docs/guides/productionization/evaluation/string/)\" interface, meaning they compute a metric based on:\n",
    "\n",
    "- An input string from the dataset example inputs (configurable by the RunEvalConfig's input_key property)\n",
    "- An output prediction string from the evaluated chain's outputs (configurable by the RunEvalConfig's prediction_key property)\n",
    "- (If labels or context are required) a reference string from the example outputs (configurable by the RunEvalConfig's reference_key property)\n",
    "\n",
    "In our case, we wanted to take the context from the example _inputs_ fields. Wrapping the evaluator as a custom `RunEvaluator` is an easy way to get a further level of control in situations when you want to use other fields."
   ]
  }
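  ,
  {
   "cell_type": "markdown",
   "id": "fixed-sources-explicit-keys-note",
   "metadata": {},
   "source": [
    "For the off-the-shelf evaluators, the key mapping described above can be spelled out in the config. The cell below is a small sketch, assuming the dataset from step 1, whose example outputs use the key `label`; `explicit_config` is an illustrative name, and `prediction_key` is left unset because the chain's output key is not shown in this walkthrough."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fixed-sources-explicit-keys-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.smith import RunEvalConfig\n",
    "\n",
    "# Name the example fields the built-in \"qa\" evaluator should read.\n",
    "explicit_config = RunEvalConfig(\n",
    "    evaluators=[\"qa\"],\n",
    "    input_key=\"question\",  # taken from the example inputs\n",
    "    reference_key=\"label\",  # taken from the example outputs\n",
    ")\n",
    "print(explicit_config.input_key, explicit_config.reference_key)"
   ]
  }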
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}