ai-engineer-workshop/notebooks/02_evaluation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c044803a-3fe2-4297-ad71-c93ae2e078f5",
   "metadata": {},
   "source": [
    "# Part 2: Evaluating our LLM application"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afc13cfb-5e8a-401d-bb04-24a5793e69be",
   "metadata": {},
   "source": [
    "So far, we've chosen typical/arbitrary values for the various parts of our RAG application. But if we were to change something, such as our chunking logic, embedding model, LLM, etc. how can we know that we have a better configuration than before. A generative task like this is very difficult to quantitatively assess and so we need to develop creative ways to do so. \n",
    "\n",
    "Because we have many moving parts in our application, we need to perform unit/component and end-to-end evaluation. Component-wise evaluation can involve evaluating our retrieval in isolation (is the best source in our set of retrieved chunks) and evaluating our LLMs response (given the best source, is the LLM able to produce a quality answer). As for end-to-end evaluation, we can assess the quality of the entire system (given all data, what is the quality of the response).\n",
    "\n",
    "<img width=\"1000\" src=\"https://images.ctfassets.net/xjan103pcp94/17UQdsEImsXOOdDlT06bvi/4a9b9e46e157541a1178b6938624176a/llm_evaluations.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b67c46c-7650-484b-a792-7eaf62c0e82e",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "119575d6-b0cc-49f3-a118-46bc3adf8189",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# os.environ[\"OPENAI_API_KEY\"] = ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce9df539-d629-4e8b-9d93-2a2da5cd1b2c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32cedb0a-a7b3-4194-88a9-355122cd8a00",
   "metadata": {},
   "source": [
    "## Golden Context Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31583b8a-bd08-4054-95f6-2ae5256e6a21",
   "metadata": {},
   "source": [
    "In an ideal world, we would have a golden validation dataset: given a set of queries, we would have the correct sources that answer those queries, and optionally the correct answer that should be returned by the LLM.\n",
    "\n",
    "For this example, we have manually collected 177 representative user queries and identified the correct source in the documentation that answer those user queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8495831b-c4f9-4e84-a211-bc37ecf369e2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import json\n",
    "\n",
    "golden_dataset_path = Path(\"../datasets/eval-dataset-v1.jsonl\")\n",
    "\n",
    "with open(golden_dataset_path, \"r\") as f:\n",
    "    data = [json.loads(item) for item in list(f)]\n",
    "    \n",
    "len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b187c2d5-5c4d-4b9a-9e5b-5f37c9e92b36",
   "metadata": {},
   "source": [
    "Our dataset contains 'question' and 'source' pairs. If we have a golden context dataset, it is the best option for evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2300e0fc-3e65-4f43-98b6-3564bc1ccb3e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "data[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6739202-90fa-46e8-bc7d-3c0ed33c8673",
   "metadata": {},
   "source": [
    "## Cold Start"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20636b3e-5211-41c9-a390-0cfa23654c44",
   "metadata": {
    "tags": []
   },
   "source": [
    "We may not always have a prepared dataset of questions and the best source to answer that question readily available. To address this cold start problem, we could use an LLM to look at our documents and generate questions that the specific chunk would answer. This provides us with quality questions and the exact source the answer is in. However, this dataset generation method could be a bit noisy. The generate questions may not always be resembling of what your users may ask and the specific chunk we say is the best source may also have that exact information in other chunks. Nonetheless, this is a great way to start our development process while we collect + manually label a high quality dataset.\n",
    "\n",
    "<img width=\"800\" src=\"https://images.ctfassets.net/xjan103pcp94/3QR9zkjtpgeqK8XKPteTav/76aa9e7743330e7fcf73b07332a7ddf2/image10.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f0192cc-40f7-4fe6-8175-25cedd51d9a2",
   "metadata": {
    "tags": []
   },
   "source": [
    "We need to define a few parameters first.  \n",
    "- Notably, the chunk size determines the size of the text chunk shown to the LLM when generating hypothetical question & answer pairs. This must be set below the context window limitation of the chosen LLM.\n",
    "- We choose a subsample ratio since we just want to construct a small representative subset for the purpose of evaluation and iteration. (We choose an even smaller subset for the purpose of the demonstration here).\n",
    "- We use `gpt-3.5-turbo` since it's fast and cheap. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff6951de-bf96-4e1a-a666-893e62f3eab8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "RAY_DOCS_DIRECTORY = Path(\"docs.ray.io/en/master/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79cd3470-8542-4323-a70d-c3cc9fdb04a8",
   "metadata": {},
   "source": [
    "First, we load in the documents and chunk them to the appropriate sizes, creating LlamaIndex nodes. We already did the data processing in part 1, and have packaged the logic as a utility."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "790da50a-24c8-4d5f-8395-fb225f1c49fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from data import create_nodes\n",
    "\n",
    "# needs to be smaller than context window\n",
    "CHUNK_SIZE = 1024\n",
    "\n",
    "nodes = create_nodes(RAY_DOCS_DIRECTORY, chunk_size=CHUNK_SIZE, chunk_overlap=20).take_all()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5dcc6135-03b1-43b3-8c0a-155b8eff3037",
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes = [node_dict[\"node\"] for node_dict in nodes]\n",
    "id_to_node = {node.node_id: node for node in nodes}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a9f820b-bc94-40a4-8e66-c78d5c2120b9",
   "metadata": {},
   "source": [
    "Now, we subsample the nodes to obtain a representative subset (here we use a very small subset for a fast demonstration)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cada634d-fc1e-4b1a-b1d5-d2266a518090",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import subsample\n",
    "\n",
    "SUBSAMPLE_RATIO = 0.01\n",
    "\n",
    "subsampled_nodes = subsample(nodes, SUBSAMPLE_RATIO)\n",
    "print('Subsampled {} nodes into {} nodes'.format(len(nodes), len(subsampled_nodes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5856201-06db-4292-990c-21b1010e12c0",
   "metadata": {},
   "source": [
    "Now, we use LlamaIndex's built in utility `generate_qa_embedding_pairs` to create synthetic query/context pairs.\n",
    "\n",
    "(We can also use this utility for fine-tuning embeddings, hence the naming. More on this in part 3!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d718383c-74a2-4518-a51b-69a7857c5cde",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.finetuning import generate_qa_embedding_pairs\n",
    "from llama_index.llms import OpenAI\n",
    "\n",
    "llm = OpenAI(model='gpt-3.5-turbo')\n",
    "synthetic_dataset = generate_qa_embedding_pairs(subsampled_nodes, llm=llm, num_questions_per_chunk=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87212c89-9356-45c9-8608-552b350b7e96",
   "metadata": {},
   "source": [
    "Now we will transform the shape of the data a bit to match the format of our labeled data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc5d468a-6a6e-468d-ae9d-95297dd2c860",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "synthetic_data = []\n",
    "for query_id, context_ids in synthetic_dataset.relevant_docs.items():\n",
    "    query = synthetic_dataset.queries[query_id]\n",
    "    golden_context = id_to_node[context_ids[0]].metadata['source']\n",
    "    entry = {\n",
    "        'question': query,\n",
    "        'source': golden_context,\n",
    "    }\n",
    "    synthetic_data.append(entry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "486593e8-f24a-4278-8704-9845a90adbb5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "synthetic_data[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1bd61e90-0b15-4cca-92ec-61b76804bedd",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from utils import write_jsonl\n",
    "\n",
    "write_jsonl(\"../datasets/synthetic-eval-dataset.jsonl\", synthetic_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64373866-d0f7-4704-b03b-5c466fc59afb",
   "metadata": {},
   "source": [
    "Since we already have a dataset with representative user queries and ground truth labels, we will use that for evaluation instead of a synthetically generated dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "330a7e10-18fc-495e-ab4a-da2f3ca00a97",
   "metadata": {},
   "source": [
    "## Evaluating Retrieval"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb6e7e79-5a2a-4247-bc3a-d093d0416761",
   "metadata": {},
   "source": [
    "The first component to evaluate in our RAG application is retrieval. Given a query, is our retriever pulling in the correct context to answer that query? Regardless of how good our LLM is, if it does not have the right context to answer the question, it cannot provide the right answer.\n",
    "\n",
    "We can use our golden context dataset to evaluate retrieval. The simplest approach is that for each query in our dataset, we can test to see if the correct source is included in any of the chunks that are retrieved by our retriever. This measures \"hit rate\".\n",
    "\n",
    "However, simply checking for existence can be misleading if we increase the number of chunks that we retrieve. Therefore, we also want to check the score that our retriever gives for the correct source. A higher score means our retriever is accurately determining the correct context. \n",
    "\n",
    "To summarize, for each query in our evaluation dataset, we will measure the following:\n",
    "1. Is the correct source included in any of the retrived chunks?\n",
    "2. What is the score our retriever gives to the correct source?\n",
    "\n",
    "<img width=\"800\" src=\"../images/retrieval-eval.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60a5958c-74d8-4a11-b46c-a12a8fd5ad99",
   "metadata": {},
   "source": [
    "First, let's a get a retriever over the vector database. We have packaged this as a utility. It is the same as we did in notebook 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b88e0d5f-a0ed-47e9-afcf-0ac882201e57",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from utils import get_retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dad0afdc-2f44-4ca7-b5c0-2c6499f4f35e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "retriever = get_retriever(similarity_top_k=5, embedding_model_name='text-embedding-ada-002')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f3b5438-9930-4857-a9a5-165cbd15c79b",
   "metadata": {},
   "source": [
    "Now let's evaluate our retriever. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb4cd2ae-17ff-4e75-904f-a577499bc8ab",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "results = []\n",
    "\n",
    "for entry in tqdm(data):\n",
    "    query = entry[\"question\"]\n",
    "    expected_source = entry['source']\n",
    "    \n",
    "    retrieved_nodes = retriever.retrieve(query)\n",
    "    retrieved_sources = [node.metadata['source'] for node in retrieved_nodes]\n",
    "    \n",
    "    # If our label does not include a section, then any sections on the page should be considered a hit.\n",
    "    if \"#\" not in expected_source:\n",
    "        retrieved_sources = [source.split(\"#\")[0] for source in retrieved_sources]\n",
    "    \n",
    "    if expected_source in retrieved_sources:\n",
    "        is_hit = True\n",
    "        score = retrieved_nodes[retrieved_sources.index(expected_source)].score\n",
    "    else:\n",
    "        is_hit = False\n",
    "        score = 0.0\n",
    "    \n",
    "    result = {\n",
    "        \"is_hit\": is_hit,\n",
    "        \"score\": score,\n",
    "        \"retrieved\": retrieved_sources,\n",
    "        \"expected\": expected_source,\n",
    "        \"query\": query,\n",
    "    }\n",
    "    results.append(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a12ecc64-efae-4a67-86de-3f3fbb36820b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "results[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d1ddd39-f141-4585-93a4-593daa146f60",
   "metadata": {},
   "source": [
    "Let's see how well our retriever does. It's not great right now, but we now have a solid metric to evaluate our retriever for future optimizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df7d03f6-76bf-414d-a072-2862bdc2c1a8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "total_hits = sum(result[\"is_hit\"] for result in results)\n",
    "hit_percentage = total_hits / len(results)\n",
    "hit_percentage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "069c754b-30d9-432a-98f0-1bcec9d0fe99",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "average_score = sum(result[\"score\"] for result in results) / len(results)\n",
    "average_score"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e852dd79-247f-47d3-8ffa-97f79781a68a",
   "metadata": {},
   "source": [
    "## End-to-end evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6193ec2a-3a68-412b-bbb9-997e57b27edf",
   "metadata": {},
   "source": [
    "While we can evaluate our retriever in isolation, ultimately we want to evaluate our RAG application end-to-end, which includes the final response generated from our LLM.\n",
    "\n",
    "To effectively evaluate our generated responses, we need \"ground truth\" responses. These ground truth responses can be generated by feeding the correct context to a \"golden\" LLM. Then, we can use an LLM to evaluate our generated responses compared to the ground truth responses.\n",
    "\n",
    "<img width=\"700\" src=\"https://images.ctfassets.net/xjan103pcp94/2lhpSUNrMmi7WAHpd3wslR/15facf649e30571e8d806d354f475f0b/image6.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da62c16b-605f-41cd-8894-b7f2e6ae460c",
   "metadata": {},
   "source": [
    "### Choosing a Golden LLM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac0266f2-8a34-411e-a661-4c2fc8cbb5ef",
   "metadata": {},
   "source": [
    "To generate ground truth responses, and then to evaluate the generated responses vs. the ground truth, we need a \"golden\" LLM. But which LLM should we use? We now run into a problem: we need to determine the quality of different LLMs to choose as a \"golden\" LLM, but doing so requires a \"golden\" LLM. \n",
    "\n",
    "Leaderboards on general benchmarks provide a rough indication on which LLMs perform better, we will go with OpenAI's GPT-4 here since it's been shown to be [well aligned with human preferences](https://arxiv.org/pdf/2306.05685.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75423140-d93e-495a-8a3a-f1cb3cbb72db",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def fetch_text_from_source(source: str):\n",
    "    url, anchor = source.split(\"#\") if \"#\" in source else (source, None)\n",
    "    file_path = Path(\"./\", url.split(\"https://\")[-1])\n",
    "    with open(file_path, \"r\", encoding=\"utf-8\") as file:\n",
    "        html_content = file.read()\n",
    "    soup = BeautifulSoup(html_content, \"html.parser\")\n",
    "    if anchor:\n",
    "        target_element = soup.find(id=anchor)\n",
    "        if target_element:\n",
    "            text = target_element.get_text()\n",
    "        else:\n",
    "            return fetch_text_from_source(source=url)\n",
    "    else:\n",
    "        text = soup.get_text()\n",
    "    return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e578a402-99d5-4dbd-804e-e17c24518744",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "example_source = data[0][\"source\"]\n",
    "print(example_source)\n",
    "\n",
    "text = fetch_text_from_source(example_source)\n",
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eed04e2f-ffcd-446d-9c16-4cb9c3f226ca",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "from llama_index import ServiceContext\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.response_synthesizers import get_response_synthesizer\n",
    "from llama_index.schema import TextNode, NodeWithScore\n",
    "\n",
    "def generate_responses(entries, llm):\n",
    "    context_window = llm.metadata.context_window - 500\n",
    "    service_context = ServiceContext.from_defaults(llm=llm, context_window=context_window)\n",
    "    rs = get_response_synthesizer(service_context=service_context)\n",
    "\n",
    "    responses = []\n",
    "    for entry in tqdm(entries):\n",
    "        query = entry[\"question\"]\n",
    "        source = entry[\"source\"]\n",
    "\n",
    "        context = fetch_text_from_source(source)\n",
    "        nodes = [NodeWithScore(node=TextNode(text=context))]\n",
    "\n",
    "        response = rs.synthesize(query, nodes=nodes)\n",
    "        responses.append(response.response)\n",
    "    return responses"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08f5c2a6-a5ac-4b9c-ab45-61d444b97b9b",
   "metadata": {},
   "source": [
    "We can now generate our reference responses. Let's generate 10 reference responses and save them to a file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5981d038-4b6c-4a97-8e05-ccf9121ade7e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "llm = OpenAI(model='gpt-4', temperature=0.0)\n",
    "ten_samples = data[:10]\n",
    "golden_responses = generate_responses(ten_samples, llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fc8506c-0deb-43ae-9936-cd323635cc44",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "reference_dataset = [{\"question\": entry[\"question\"], \"source\": entry[\"source\"], \"response\": response} for entry, response in zip(ten_samples, golden_responses)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23632aab-2be3-49cb-905b-1b26d7bde46a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(\"../datasets/golden-responses.json\", \"w\") as file:\n",
    "    json.dump(reference_dataset, file, indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23dbf6c6-74ea-4802-be16-f96f2982a642",
   "metadata": {},
   "source": [
    "## Evaluating our Query Engine"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ef0af94-f983-464f-8a4b-a91ff72d7f87",
   "metadata": {},
   "source": [
    "Once we have reference responses, we can get our generated responses from our query engine. Then pass both responses to our golden LLM to evaluate the responses from our application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd15459b-a679-4f1b-832c-bb38cd467d0f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(\"../datasets/golden-responses.json\", \"r\") as file:\n",
    "    golden_responses = json.load(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f2a555f-20c0-4716-b4b2-1d9921edfa79",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "golden_responses[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8a53891-cd44-4efb-b9f9-9e3cc7de3a6b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from utils import get_query_engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3adebada-bce2-4600-ac2f-4ba541ce32f1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_engine = get_query_engine(similarity_top_k=5, llm_model_name='gpt-3.5-turbo', embedding_model_name='text-embedding-ada-002')\n",
    "\n",
    "# Store both the original response object and the response string.\n",
    "rag_responses = []\n",
    "rag_response_str = []\n",
    "\n",
    "for entry in tqdm(golden_responses):\n",
    "    query = entry[\"question\"]\n",
    "    response = query_engine.query(query)\n",
    "    rag_responses.append(response)\n",
    "    rag_response_str.append(response.response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a17e76f6-7333-4ea7-9357-3bc125c59cf8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "rag_response_str[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9a8456b-8a78-4e87-8110-3fd163f10864",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.evaluation import CorrectnessEvaluator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef83fc36-e93d-4341-b7a6-d0a24a598872",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "eval_llm = OpenAI(model='gpt-4', temperature=0.0)\n",
    "service_context = ServiceContext.from_defaults(llm=eval_llm)\n",
    "evaluator = CorrectnessEvaluator(service_context=service_context)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90bb5e09-08e1-40e8-9129-0f441bbce72f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "eval_results = []\n",
    "for rag_response, golden_response in tqdm(list(zip(rag_response_str, golden_responses))):\n",
    "    query = golden_response[\"question\"]\n",
    "    golden_answer = golden_response[\"response\"]\n",
    "    generated_answer = rag_response\n",
    "    \n",
    "    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=generated_answer)\n",
    "    eval_results.append(eval_result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91e27497-0ec6-4ae3-bc0a-1c342afb3047",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "[r.score for r in eval_results]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3190240d-7bb1-4079-8bfa-0d6d0ed9166d",
   "metadata": {},
   "source": [
    "Let's save the query, both responses, and the score to a JSON file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8fc98a6d-a27a-4a42-a5c5-4574a336e679",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "scores = [\n",
    "    {\"question\": golden_response[\"question\"],\n",
    "     \"golden_response\": golden_response[\"response\"],\n",
    "     \"generated_response\": eval_result.response,\n",
    "     \"score\": eval_result.score,\n",
    "     \"reasoning\": eval_result.feedback,\n",
    "    }\n",
    "    for eval_result, golden_response in zip(eval_results, golden_responses)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e00e6239-5c44-4c1a-89d5-7549fb0ad98a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(\"eval-scores.json\", \"w\") as file:\n",
    "    json.dump(scores, file, indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "095ef29b-b968-4449-9a50-4234a5b294ab",
   "metadata": {},
   "source": [
    "We can also calculate the average scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6b7d3a9-d2c5-4381-b2d4-9c1aa1d21df1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "average_scores = sum(score[\"score\"] for score in scores) / len(scores)\n",
    "average_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f0229cb-2256-4762-a772-55740e617951",
   "metadata": {},
   "source": [
    "## Evaluation without Golden Responses"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d43b1970-ded0-4ee4-b49e-49447fad1533",
   "metadata": {},
   "source": [
    "Generating reference responses and then using them for evaluation can give us a more accurate assesment on how our query engine is performing. However, this approach can be expensive. We have to make an initial pass through GPT4 to generate the reference response, and then we have to make another pass through GPT4 to evaluate our application's responses against the reference response.\n",
    "\n",
    "We can explore other evaluation metrics to get a better sense on how our query engine is performing, without needing to make multiple passes to GPT4."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1905688d-9a0d-4e2a-81ea-40242f50905a",
   "metadata": {},
   "source": [
    "### Evaluating for faithfulness/relevancy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6662a23a-8388-42cc-b986-0062afee3b48",
   "metadata": {},
   "source": [
    "One metric we can test is relevancy, which does not require generating reference responses. With this approach, we check to see if the generated response is relevant to at least one of the retrieved sources and to the query. This ensures that our LLM is not making up a response, but rather that it is relevant to the question that is being asked, and also that is relevant to at least one of the retrieved context.\n",
    "\n",
    "This does NOT check whether the response is a correct response.\n",
    "\n",
    "This capability is built into LlamaIndex, via the various `Evaluator` modules. We use gpt-4 as the evaluator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c767913f-fc0a-4437-932a-4a5106a5768c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index import ServiceContext\n",
    "\n",
    "def evaluate(queries: list, responses: list, metric: str):\n",
    "    llm = OpenAI(model=\"gpt-4\", temperature=0.0)\n",
    "    service_context = ServiceContext.from_defaults(llm=llm)\n",
    "    \n",
    "    \n",
    "    if metric == 'faithfulness':\n",
    "        evaluator = FaithfulnessEvaluator(service_context=service_context)\n",
    "    elif metric == 'relevancy':\n",
    "        evaluator = RelevancyEvaluator(service_context=service_context)\n",
    "    else:\n",
    "        raise ValueError(\"Unknown metric: \", metrc)\n",
    "\n",
    "    evals = []\n",
    "    for query, response in tqdm(list(zip(queries, responses))):\n",
    "        eval_result = evaluator.evaluate_response(query=query, response=response)\n",
    "        evals.append(eval_result)\n",
    "    \n",
    "    return evals\n",
    "\n",
    "def get_pass_rate(evals):\n",
    "    return len([val.passing for val in evals]) / len(evals)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ecd11db-ced6-433b-844d-03c6d0adc30a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "faithfulness_results = evaluate(queries=[sample[\"question\"] for sample in ten_samples], responses=rag_responses, metric='faithfulness')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed4555fa-5901-49df-8f66-afbe00d27079",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "faithfulness_score = get_pass_rate(faithfulness_results)\n",
    "faithfulness_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84e8a04f-62f6-46cb-80e1-adde0195c1cc",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "relevancy_results = evaluate(queries=[sample[\"question\"] for sample in ten_samples], responses=rag_responses, metric='relevancy')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61b35f60-3989-459c-a26e-12ed35456e8f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "relevancy_score = get_pass_rate(relevancy_results)\n",
    "relevancy_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a535e536-8829-46af-96ad-c3ba832eead4",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}