ai-engineer-workshop/notebooks/01_rag.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "685a91ef-626a-4d76-8f7f-b89bfa6d1d6f",
   "metadata": {},
   "source": [
    "# Part 1: Developing the RAG application"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2ab54b8-5341-42fa-8790-93e71bbc43b5",
   "metadata": {},
   "source": [
    "- GitHub repository: https://github.com/Disiok/ai-engineer-workshop/\n",
    "- Ray documentation: https://docs.ray.io/\n",
    "- LlamaIndex documentation: https://gpt-index.readthedocs.io/en/stable/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "536f1270-5328-416e-90c5-9a8e087ae354",
   "metadata": {},
   "source": [
    "We will start by building our example RAG application: a Q&A app that given a question about Ray, can answer it using the Ray documentation.\n",
    "\n",
    "In this notebook we will learn how to:\n",
    "1. 💻 Develop a retrieval augmented generation (RAG) based LLM application.\n",
    "2. 🚀 Scale the major components (embed, index, etc.) in our application.\n",
    "\n",
    "We will use both [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) and [Ray](https://docs.ray.io/) for developing our LLM application. \n",
    "\n",
    "<img width=\"500\" src=\"https://images.ctfassets.net/xjan103pcp94/4PX0l1ruKqfH17YvUiMFPw/c60a7a665125cb8056bebcc146c23b76/image8.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7aa52945-492f-47ae-aabc-18ad43430f6d",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1f4fa1b-e1a6-402e-8f8a-462b3d02c87d",
   "metadata": {},
   "source": [
    "Let's setup our credentials for Open AI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "e991060f-c95d-46f0-8bb9-7a310fc17ed3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# os.environ[\"OPENAI_API_KEY\"] = ..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55bd9f4f-ba08-4178-a077-a31751ae91b9",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Step 1: Loading and parsing the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8e0d3e2-f390-4023-b24b-6904bdd361c4",
   "metadata": {},
   "source": [
    "To build our RAG application, we first need to load, parse, and embed the data that we want to use for answering our questions. \n",
    "\n",
    "This data processing pipeline has 3 steps:\n",
    "1. First, we will load the latest documentation for Ray\n",
    "2. Then we will parse the documentation to extract out chunks of text\n",
    "3. Finally, we will **embed** each chunk. This creates a vector representation of the provided text snippet. This vector representation allows us to easily determine the similarity between two different text snippets.\n",
    "\n",
    "<img width=\"1000\" src=\"https://images.ctfassets.net/xjan103pcp94/3q5HUANQ4kS0V23cgEP0JF/ef3b62c5bc5c5c11b734fd3b73f6ea28/image3.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8ca95a6-d8c4-47c8-b960-c14094967e28",
   "metadata": {},
   "source": [
    "LlamaIndex provides utlities for loading our data, and also the abstractions for how we represent our data and their relationships.\n",
    "\n",
    "Ray, and in particular the Ray Data library, is used to scale out our data processing pipeline, allowing us to process data in parallel, leveraging the cores and GPUs in our Ray cluster. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b78c1823-ac58-4bc3-a0b7-b94e8a7bac52",
   "metadata": {},
   "source": [
    "### Load data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d310aaa2-9cfc-4a90-bf2b-76ac06f7f68b",
   "metadata": {},
   "source": [
    "The Ray documentation has already been downloaded and is stored in shared storage directory in our Anyscale workspace. We parse the html files in the downloaded documentation, and create a Ray Dataset out of the doc paths."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20d5d0b9-be8b-491c-8879-09532c70dee6",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "%cd ../datasets\n",
    "!unzip -o docs.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "140a2be2-aa55-4223-8c1b-20cc4d5a1f27",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "RAY_DOCS_DIRECTORY = \"../datasets/docs.ray.io/en/master/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd4ae53d-f922-4240-af8b-985d943151fe",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import ray\n",
    "\n",
    "docs_path = Path(RAY_DOCS_DIRECTORY)\n",
    "ds = ray.data.from_items([{\"path\": path} for path in docs_path.rglob(\"*.html\") if not path.is_dir()])\n",
    "print(f\"{ds.count()} documents\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4a94e5f-aa03-4483-b3a7-0a4509769671",
   "metadata": {
    "tags": []
   },
   "source": [
    "Now that we have a dataset of all the paths to the html files, we now need to extract text from these HTML files. We want to do this in a generalized manner so that we can perform this extraction across all of our docs pages. \n",
    "\n",
    "Therefore, we use LlamaIndex's HTMLTagReader to identify the sections in our HTML page and then extract the text in between them. For each section of text, we create a LlamaIndex Document, and also store the source url for that section as part of the metadata for the Document. After extracting all the text, we return a list of LlamaIndex documents.\n",
    "\n",
    "<img width=\"800\" src=\"https://images.ctfassets.net/xjan103pcp94/1eFnKmG5xqPIFtPupZ327X/f6152723e18322b90aaa8be5d2d5a6e4/image5.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2ed7959-77c6-473a-a9c5-22bd4e4875f1",
   "metadata": {},
   "source": [
    "### Parse data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c872e26-615b-4d91-96f5-603d3828c177",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.readers import HTMLTagReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae9a63f7-e9da-4b33-b613-e1bf902d493d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def path_to_uri(path, scheme=\"https://\", domain=\"docs.ray.io\"):\n",
    "    # Converts the file path of a Ray documentation page to the original URL for the documentation.\n",
    "    # Example: /efs/shared_storage/goku/docs.ray.io/en/master/rllib-env.html -> https://docs.ray.io/en/master/rllib/rllib-env.html#environments\n",
    "    return scheme + domain + str(path).split(domain)[-1]\n",
    "\n",
    "def extract_sections(record):\n",
    "    # Given a HTML file path, extract out text from the section tags, and return a LlamaIndex document from each one. \n",
    "    html_file_path = record[\"path\"]\n",
    "    reader = HTMLTagReader(tag=\"section\")\n",
    "    documents = reader.load_data(html_file_path)\n",
    "    \n",
    "    # For each document, store the source URL as part of the metadata.\n",
    "    for document in documents:\n",
    "        document.metadata[\"source\"] = f\"{path_to_uri(document.metadata['file_path'])}#{document.metadata['tag_id']}\"\n",
    "    return [{\"document\": document} for document in documents]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5b096df-94e3-48bb-8f1b-a464b8e9ef0d",
   "metadata": {},
   "source": [
    "Let's try this out on a single example HTML file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef89386e-0203-446e-97dd-243197393eea",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "example_path = Path(RAY_DOCS_DIRECTORY, \"rllib/rllib-env.html\")\n",
    "document = extract_sections({\"path\": example_path})[0][\"document\"]\n",
    "print(document)\n",
    "print(\"\\n\")\n",
    "print(\"Document source: \", document.metadata[\"source\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea63b9b8-1873-4595-a5e4-fb7debc78f32",
   "metadata": {},
   "source": [
    "Now, let's use Ray Data to parallelize this across all of the HTML files. We can stitch together operations on our Ray dataset to map a function over each document. \n",
    "\n",
    "Ray Data is lazy by default, so can first stitch together our entire pipeline, and then trigger execution. This allows Ray Data to fully optimize resource usage for our pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a3dcf6e-a12f-423d-8141-e602798f6c42",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sections_ds = ds.flat_map(extract_sections)\n",
    "sections_ds.schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5322a66f-4209-4e29-bba3-269052329ec8",
   "metadata": {},
   "source": [
    "### Chunk data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b77115cd-734a-4228-b49e-5546739b8694",
   "metadata": {},
   "source": [
    "We now have a list of Documents (with text and source of each section) but we shouldn't directly use this as context to our RAG application just yet. The text lengths of each section are all varied and many are quite large chunks. If were to use these large sections, then we'd be inserting a lot of noisy/unwanted context and because all LLMs have a maximum context length, we wouldn't be able to fit too many relevant contexts. Therefore, we're going to split the text within each section into smaller chunks. Intuitively, smaller chunks will encapsulate single/few concepts and will be less noisy compared to larger chunks. We're going to choose some typical text splitting values (ex. `chunk_size=512`) to create our chunks for now but we'll be experiments with a range of values later.\n",
    "\n",
    "<img src=\"../images/length-distribution.png\" alt=\"Section length distributions\" width=\"1000\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f58b811-153d-4cc0-9f04-a1df48ce82a9",
   "metadata": {},
   "source": [
    "Once again, we will use LlamaIndex's abstractions to chunk each Document into a **Node** with the provided chunk size. And we will use Ray Data to parallelize the chunking computation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7b8eecd-b74a-4d8b-b6af-942917755925",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.node_parser import SimpleNodeParser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52d8bb61-ada8-4f9f-a838-64b4d86614b8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "chunk_size = 512\n",
    "chunk_overlap = 50\n",
    "\n",
    "def chunk_document(document):\n",
    "    node_parser = SimpleNodeParser.from_defaults(\n",
    "        chunk_size=chunk_size,\n",
    "        chunk_overlap=chunk_overlap\n",
    "    )\n",
    "    nodes = node_parser.get_nodes_from_documents([document[\"document\"]])\n",
    "    return [{\"node\": node} for node in nodes]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d2dfc56-d6c4-47ff-8f3a-5a5420ebe9d8",
   "metadata": {},
   "source": [
    "Let's run an example over a single document. The document wil be chunked and will result in 2 nodes, each representing 1 chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddbbc34b-8cb5-4ee4-9d53-cd8d873afb0c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sample_document = sections_ds.take(1)[0]\n",
    "\n",
    "# Nodes\n",
    "nodes = chunk_document(sample_document)\n",
    "\n",
    "print(\"Num chunks: \", len(nodes))\n",
    "print(f\"Example text: {nodes[0]['node'].text}\\n\")\n",
    "print(f\"Example metadata: {nodes[0]['node'].metadata}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0491bab5-bea6-4fbc-9347-0645e0df2e33",
   "metadata": {},
   "source": [
    "Now let's chunk all of our documents, stitching this operation into our Ray Dataset pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "018a79ac-d28e-474c-bbe3-da4cf8e0adaf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy\n",
    "\n",
    "chunks_ds = sections_ds.flat_map(chunk_document, scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=ray.get_runtime_context().get_node_id(), soft=False))\n",
    "chunks_ds.schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0321c4d-6244-4aa7-834f-f3d117e78f76",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Embed data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e5790a1-d153-4aaa-b28f-17f2c501810e",
   "metadata": {},
   "source": [
    "Now that we've created small chunks from our dataset, we need a way to identify the most relevant ones to a given query. A very effective and quick method is to embed our data using a pretrained model and use the same model to embed the query. We can then compute the distance between all of the chunk embeddings and our query embedding to determine the top k chunks. There are many different pretrained models to choose from to embed our data but the most popular ones can be discovered through [HuggingFace's Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leadboard. These models were pretrained on very large text corpus through tasks such as next/masked token prediction that allows them to learn to represent subtokens in N dimensions and capture semantic relationships. We can leverage this to represent our data and make decisions such as the most relevant contexts to use to answer a given query. We're using Langchain's Embedding wrappers ([HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html) and [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html)) to easily load the models and embed our document chunks.\n",
    "\n",
    "**Note**: embeddings aren't the only way to determine the more relevant chunks. We could also use an LLM to decide! However, because LLMs are much larger than these embedding models and have maximum context lengths, it's better to use embeddings to retrieve the top k chunks. And then we could use LLMs on the fewer k chunks to determine the <k chunks to use as the context to answer our query. We could also use reranking (ex. [Cohere Rerank](https://txt.cohere.com/rerank/)) to further identify the most relevant chunks to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93b4fc25-4ac4-4913-b098-f5ea7ee35dd7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from llama_index.embeddings import OpenAIEmbedding, HuggingFaceEmbedding\n",
    "\n",
    "def get_embedding_model(model_name, embed_batch_size=100):\n",
    "    if model_name == \"text-embedding-ada-002\":\n",
    "            return OpenAIEmbedding(\n",
    "                model=model_name,\n",
    "                embed_batch_size=embed_batch_size,\n",
    "                api_key=os.environ[\"OPENAI_API_KEY\"])\n",
    "    else:\n",
    "        return HuggingFaceEmbedding(\n",
    "            model_name=model_name,\n",
    "            embed_batch_size=embed_batch_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05da2395-d40a-41ab-806f-1e7bd0f9048f",
   "metadata": {},
   "source": [
    "Here, we will use a Python **class** instead of a function to encapsulate the embedding logic. Since loading the embedding model is not cheap, we want to load the model just once and re-use the loaded model when embedding each batch of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abe0ee5f-72b4-4439-938f-30b5dacda82a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "class EmbedChunks:\n",
    "    def __init__(self, model_name):\n",
    "        self.embedding_model = get_embedding_model(model_name)\n",
    "    \n",
    "    def __call__(self, node_batch):\n",
    "        # Get the batch of text that we want to embed.\n",
    "        nodes = node_batch[\"node\"]\n",
    "        text = [node.text for node in nodes]\n",
    "        \n",
    "        # Embed the batch of text.\n",
    "        embeddings = self.embedding_model.get_text_embedding_batch(text)\n",
    "        assert len(nodes) == len(embeddings)\n",
    "\n",
    "        # Store the embedding in the LlamaIndex node.\n",
    "        for node, embedding in zip(nodes, embeddings):\n",
    "            node.embedding = embedding\n",
    "        return {\"embedded_nodes\": nodes}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a5c8865-a304-4f56-82d3-42741d67c3c8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Specify the embedding model to use.\n",
    "embedding_model_name = \"text-embedding-ada-002\"\n",
    "\n",
    "# Specify \"text-embedding-ada-002\" for Open AI embeddings.\n",
    "# embedding_model_name = \"thenlper/gte-base\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c900079-807a-49a7-8748-946e1cb68c28",
   "metadata": {},
   "source": [
    "Let's try this out on an example chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5961125-cd55-4e1f-800d-d9383a8ef03b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "example_chunk = chunks_ds.take_batch(1)\n",
    "embedder = EmbedChunks(model_name=embedding_model_name)\n",
    "example_node_with_embedding = embedder(example_chunk)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c707dd7-6101-4a8a-bebe-75b39f34802f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(example_node_with_embedding[\"embedded_nodes\"][0])\n",
    "print(\"\\n\")\n",
    "print(\"Embedding size: \", len(example_node_with_embedding[\"embedded_nodes\"][0].embedding))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfea63ce-179a-44b0-96e7-80ff485fc9df",
   "metadata": {},
   "source": [
    "We're now able to embed our chunks at scale by using the [map_batches](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) operation in our Ray Data pipeline.\n",
    "\n",
    "All we have to do is define the `batch_size` and the compute to use (we're using two workers, each with 1 GPU)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98253c11-214a-43c2-a9b9-7df65611b7f9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from ray.data import ActorPoolStrategy\n",
    "\n",
    "embedded_chunks = chunks_ds.map_batches(\n",
    "    EmbedChunks,\n",
    "    fn_constructor_kwargs={\"model_name\": embedding_model_name},\n",
    "    batch_size=100, \n",
    "    num_gpus=0,\n",
    "    compute=ActorPoolStrategy(size=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c706d2f-bf04-4a6c-8024-c792f8ac00da",
   "metadata": {},
   "source": [
    "### Index data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c354d707-3dfb-48ec-af8a-484814deeccb",
   "metadata": {},
   "source": [
    "Now that we have our embedded chunks, we need to index (store) them somewhere so that we can retrieve them quickly for inference. While there are many popular vector database options, we're going to use [Postgres](https://www.postgresql.org/) for it's simplificty and performance. We'll create a table (`document`) and write the (`text`, `source`, `embedding`) triplets for each embedded chunk we have.\n",
    "\n",
    "<img width=\"700\" src=\"https://images.ctfassets.net/xjan103pcp94/3z1ryYkOtUjj6N1IuavJPf/ae60dc4a10c94e2cc928c38701befb51/image2.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "587030d3-4b28-4cf3-82c4-08bfcc7fa3c9",
   "metadata": {},
   "source": [
    "As the final step in our data pipeline, we will store the embeddings into our Postgres database"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3f5b67c-adfd-4f97-a3f8-3d2891674b56",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Postgres Vector Store"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d26ef0f-14a5-423c-a429-c6d71dfe6e03",
   "metadata": {},
   "source": [
    "Let's setup our Postgres database. The following assume you have docker installed and launched postgres in a local container, i.e. via `docker-compose up -d`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1235c463-29bc-431d-b609-ad3ed06ef61c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Drop existing table if it exists\n",
    "docker exec -u postgres ai-engineer-workshop-postgres-1 psql -d postgres -c \"DROP TABLE IF EXISTS data_document;\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d2cd4e1-5d5b-4ccb-bb4e-5ef6bbc4f70c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.vector_stores import PGVectorStore\n",
    "\n",
    "# First create the table.\n",
    "def get_postgres_store():\n",
    "    return PGVectorStore.from_params(\n",
    "            database=\"postgres\", \n",
    "            user=\"postgres\", \n",
    "            password=\"postgres\", \n",
    "            host=\"localhost\", \n",
    "            table_name=\"document\",\n",
    "            port=\"5432\",\n",
    "            embed_dim=1536,\n",
    "        )\n",
    "\n",
    "store = get_postgres_store()\n",
    "del store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "139ff2a5-fc04-456d-8f6a-f2b408ab80ba",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "class StoreResults:\n",
    "    def __init__(self):\n",
    "        self.vector_store = get_postgres_store()\n",
    "    \n",
    "    def __call__(self, batch):\n",
    "        embedded_nodes = batch[\"embedded_nodes\"]\n",
    "        self.vector_store.add(list(embedded_nodes))\n",
    "        return {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3054ffa-50c3-4e71-9df7-11ad5905bfe0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Store all the embeddings in Postgres, and trigger exection of the Ray Data pipeline.\n",
    "from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy\n",
    "\n",
    "embedded_chunks.map_batches(\n",
    "    StoreResults,\n",
    "    batch_size=128,\n",
    "    num_cpus=1,\n",
    "    compute=ActorPoolStrategy(size=8),\n",
    "    # Since our database is only created on the head node, we need to force the Ray tasks to only executed on the head node.\n",
    "    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=ray.get_runtime_context().get_node_id(), soft=False)\n",
    "    \n",
    ").count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac01c84b-019b-4005-9bb8-4d4366003ef2",
   "metadata": {},
   "source": [
    "Let's check our table to see how many chunks that we have stored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24e45b87-f44a-4cbc-8a75-78a8db2fe20e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "docker exec -u postgres ai-engineer-workshop-postgres-1 psql -c \"SELECT count(*) FROM data_document;\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19b0a6c5-2963-43c4-b002-7b2c48c69842",
   "metadata": {},
   "source": [
    "## Step 2: Retrieval"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16d7b49e-5cda-4542-8286-1e004c59db9f",
   "metadata": {},
   "source": [
    "Now that we have processed, embedded, and stored all of our chunks from the Ray documentation, we can test out the retrieval portion of the application.\n",
    "\n",
    "In the retrieval portion, we want to pull the relevant context for a given query. We do this by embedding the query using the same embedding model we used to embed the chunks, and then check for similarity between the embedded query and all the embedded chunks to pull the most relevant context.\n",
    "\n",
    "<img width=\"1000\" src=\"https://images.ctfassets.net/xjan103pcp94/1hKBrFU2lyR5LLebFyq2ZL/8845c36ff98eb47005338de6ab6dbf50/image14.png\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6749178b-f02a-4c2a-8c62-3b0b8dcce1f6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex, ServiceContext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a3d0823-721f-4813-8619-b5bc7eb15333",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create a connection to our Postgres vector store\n",
    "vector_store = get_postgres_store()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d54980d-2749-44f4-a19d-c6dedf05638a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Use the same embedding model that we used to embed our documents.\n",
    "embedding_model = get_embedding_model(embedding_model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d2f3071-c7b5-407f-a344-093449235a14",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create our retriever.\n",
    "service_context = ServiceContext.from_defaults(embed_model=embedding_model, llm=None)\n",
    "index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)\n",
    "\n",
    "# Fetch the top 5 most relevant chunks.\n",
    "retriever = index.as_retriever(similarity_top_k=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3dceebd-9c40-4315-b940-3026d3190533",
   "metadata": {},
   "source": [
    "Now, let's try a sample query and pull the most relevant context. Looks like the retrieval is working great! From the eye-test, it looks like the chunks are all relevant to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ba1ce48-ba88-456c-8f42-262bb0a3ce40",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "query = \"What is the default batch size for map_batches?\"\n",
    "nodes = retriever.retrieve(query)\n",
    "\n",
    "for node in nodes:\n",
    "    print(node)\n",
    "    print(\"Source: \", node.metadata[\"source\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b06ee9e3-9482-41c3-9edb-852d2376a276",
   "metadata": {},
   "source": [
    "## Step 3: Response generation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f87f064e-50bd-46f4-b8ed-7a367b5dcb8b",
   "metadata": {},
   "source": [
    "With our retrieval working, we can now build the next portion of our LLM application, which is the actual response generation.\n",
    "\n",
    "In this step, we pass in both the query and the relevant contex to an LLM. The LLM synthesizes a response to the query given the context. Without this relevant context that we retreived, the LLM may not have been able to accurately answer our question. And as our data grows, we can just as easily embed and index any new data and be able to retrieve it to answer questions.\n",
    "\n",
    "<img width=\"500\" src=\"https://images.ctfassets.net/xjan103pcp94/38I8en8Tyf0cM4LUhjygoq/739d456c80841b4c28fe80f73ea5856b/image16.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db024edc-1922-4ec4-b90d-31def67a5f5e",
   "metadata": {},
   "source": [
    "Creating an end-to-end query engine becomes very easy with LlamaIndex and Anyscale Endpoints. With Anyscale endpoints, we can use open source LLMs, like Llama2 models, just as easy as Open AI, but more cost effectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4539eb1-b429-4a63-b96c-f8867953e518",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.llms import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97e1041f-ba18-48ac-b4e6-4527973f0159",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Use OpenAI as the LLM to LlamaIndex.\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
    "\n",
    "# Use the same embedding model that we used to embed our documents.\n",
    "embedding_model = get_embedding_model(embedding_model_name)\n",
    "\n",
    "service_context = ServiceContext.from_defaults(embed_model=embedding_model, llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e117dab2-b483-417f-af63-d65e4fe51231",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create our query engine.\n",
    "vector_store = get_postgres_store()\n",
    "index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)\n",
    "query_engine = index.as_query_engine(similarity_top_k=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f923c14d-acb3-4598-84c8-c715a36d131d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get a response to our query.\n",
    "\n",
    "query = \"What is the default batch size for map_batches?\"\n",
    "response = query_engine.query(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cfa881e-a6ad-4d03-97a6-59a70827e4cf",
   "metadata": {},
   "source": [
    "Let's see the response to our query, as well as the retrieved context that we passed to the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f07b97eb-ef73-4fd2-9d80-49802048fe84",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\"Response: \", response.response)\n",
    "print(\"\\n\")\n",
    "source_nodes = response.source_nodes\n",
    "\n",
    "for node in source_nodes:\n",
    "    print(\"Text: \", node.node.text)\n",
    "    print(\"Score: \", node.score)\n",
    "    print(\"Source: \", node.node.metadata[\"source\"])\n",
    "    print(\"\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}