Compare commits

...

9 Commits

Author SHA1 Message Date
Jerry Liu 231fe1fbdc Merge branch 'main' into jerry/add_ppt_demo 2024-03-13 01:08:57 -07:00
Jerry Liu 7960468be4 cr 2024-03-13 00:21:54 -07:00
Jerry Liu b514ec66e7 cr 2024-03-12 23:20:45 -07:00
Jerry Liu 889aee54b8 Merge branch 'jerry/support_more_file_types' into jerry/add_ppt_demo 2024-03-12 23:18:24 -07:00
Jerry Liu ceb6cc8de3 cr 2024-03-12 23:14:58 -07:00
Jerry Liu d78ffdec51 cr 2024-03-12 23:09:20 -07:00
Jerry Liu 972754c2ca cr 2024-03-12 21:16:38 -07:00
Jerry Liu 9ea6a83e20 cr 2024-03-12 20:03:38 -07:00
Jerry Liu d3971c45eb cr 2024-03-12 20:02:06 -07:00
+405
View File
@@ -0,0 +1,405 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e0647976-5597-4899-8678-e9a73c19f18b",
"metadata": {},
"source": [
"# LlamaParse over Powerpoint Files\n",
"\n",
"In this notebook we show you how to build a RAG pipeline over [our talk at PyData Global](https://docs.google.com/presentation/d/1rFQ0hPyYja3HKRdGEgjeDxr0MSE8wiQ2iu4mDtwR6fc/edit?usp=sharing) in 2023.\n",
"\n",
"We use LlamaParse to load in our slides in .pptx format, and use LlamaIndex to build a RAG pipeline over these files.\n",
"\n",
"**NOTE**: LlamaParse is capable of image extraction through JSON mode, in this notebook we stick with text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14cdcfaf-88b4-4489-9910-e362e0ccec53",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import nest_asyncio\n",
"nest_asyncio.apply()\n",
"\n",
"from llama_parse import LlamaParse"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6f5b5841-dd3e-4169-9bd4-6a672b5b34ee",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-\""
]
},
{
"cell_type": "markdown",
"id": "a2a619a1-fdd4-4ff0-85f8-94c125c275eb",
"metadata": {},
"source": [
"## Download Data\n",
"\n",
"First, download the slides from https://docs.google.com/presentation/d/1rFQ0hPyYja3HKRdGEgjeDxr0MSE8wiQ2iu4mDtwR6fc/edit?usp=sharing and export in .pptx format, and put it in the folder that you're running this notebook.\n",
"\n",
"Name the file `pydata_global.pptx`."
]
},
{
"cell_type": "markdown",
"id": "a7e697d9-4463-4be4-908c-0a3e9179a342",
"metadata": {},
"source": [
"## [Basic] Build a RAG Pipeline over Powerpoint Text\n",
"\n",
"In this example, we use LlamaParse in markdown mode to extract out text from the slides, and we build a top-k RAG pipeline over it.\n",
"\n",
"**Notes**: \n",
"- This does not use our `MarkdownElementNodeParser`, which is tailored for documents with tables.\n",
"- This also does not parse out images (we show that in the next section).\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "0dd0f860-8e92-43a7-9443-ad1a4fb9365c",
"metadata": {},
"outputs": [],
"source": [
"parser = LlamaParse(result_type=\"markdown\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "fd932bef-ba82-4449-b7a0-5c2a9b55089f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Started parsing the file under job_id 9c687e37-4239-4c2f-b2a1-2564bfc98473\n"
]
}
],
"source": [
"docs = parser.load_data(\"pydata_global.pptx\")"
]
},
{
"cell_type": "markdown",
"id": "0f41c2bc-02cd-49b5-a98c-f986faa8fffc",
"metadata": {},
"source": [
"Let's take a look at a few slides."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2a73e553-2194-4ac9-9764-0edab0d6fdce",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Building and Productionizing RAG\n",
"\n",
"Jerry Liu, LlamaIndex co-founder/CEO\n",
"---\n",
"|Content|Page Number|\n",
"|---|---|\n",
"|Document Processing| |\n",
"|Tagging & Extraction| |\n",
"|Knowledge Base| |\n",
"|Knowledge Search & QA| |\n",
"|Workflow:| |\n",
"|Read latest messages from user A| |\n",
"|Send email suggesting next-steps| |\n",
"|Document| |\n",
"|Human:| |\n",
"|Agent:| |\n",
"|Topic:| |\n",
"|Summary:| |\n",
"|Author:| |\n",
"|Conversational Agent| |\n",
"|Workflow Automation| |\n",
"---\n",
"Context\n",
"\n",
"- LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data.\n",
"\n",
"Use Cases\n",
"\n",
"- Question-Answering\n",
"- Text Generation\n",
"- Summarization\n",
"- Planning\n",
"\n",
"# LLMs\n",
"---\n",
"|Context|\n",
"|---|\n",
"|How do we best augment LLMs with our own private data?|\n",
"|Raw Files|APIs|\n",
"| |salesforce|?|\n",
"| | |Use Cases|\n",
"| | |Question-Answering|\n",
"| | |Text Generation|\n",
"| | |Summarization|\n",
"|Vector Stores|SQL DBs|\n",
"| | |Planning|\n",
"| |LLMs|\n",
"| |Milvus|\n",
"---\n",
"Paradigms for inserting knowledge\n",
"\n",
"Retrieval Augmentation - Fix pe model, put context into pe prompt\n",
"Before college pe two main pings I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write pen, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters wip strong feelings, which I imagined made pem deep...\n",
"\n",
"Input Prompt\n",
"\n",
"Here is the context:\n",
"\n",
"Before college the two main things...\n",
"\n",
"Given the context, answer the following question:\n",
"\n",
"{query_str} LLM\n",
"---\n",
"Paradigms for inserting knowledge\n",
"\n",
"Fine-tuning - baking knowledge into pe weights of pe network\n",
"Before college pe two main pings I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write pen, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters wip strong feelings, which I imagined made pem deep... LLM RLHF, Adam, SGD, etc.\n",
"---\n",
"## LlamaIndex: A data framework for LLM applications\n",
"\n",
"|Data Ingestion (LlamaHub 🦙)|Data Structures|Queries|\n",
"|---|---|---|\n",
"|Connect your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)|Store and index your data for different use cases. Integrate with different dbs (vector db, graph db, kv db)|Retrieve and query over data. Includes: QA, Summarization, Agents, and more|\n",
"---\n",
"# quickstart py\n",
"\n",
"from Llama_index import VectorStoreIndex, SimpleDirectoryReader\n",
"\n",
"SimpleDirectoryReader( ' data' ) . Load_datal)\n",
"\n",
"documents\n",
"\n",
"VectorStoreIndex.from_documents\n",
"\n",
"indexdocuments_\n",
"\n",
"index.as_query_engine()\n",
"\n",
"query_engine\n",
"\n",
"query_engine.query ( \"What did the authordo growingup?\" )\n",
"\n",
"response\n",
"\n",
"print(str(response) )Codelmage\n",
"---\n",
"NO_CONTENT_HERE\n",
"---\n",
"|Data Ingestion / Parsing|Data Querying|\n",
"|---|---|\n",
"|Chunk| |\n",
"|Chunk| |\n",
"|Doc|Chunk|\n",
"|Chunk|Chunk|\n",
"| |Vector|Chunk|LLM|\n",
"| | |Database|\n",
"|Chunk| |\n",
"| |5 Lines of Code in LlamaIndex!|\n",
"---\n",
"|Current RAG Stack (Data Ingestion/Parsing)|Process:|\n",
"|---|---|\n",
"|● Split up document(s) into even chunks.| |\n",
"|● Each chunk is a piece of raw text.| |\n",
"|Chunk|● Generate embedding for each chunk (e.g. OpenAI embeddings, sentence_transformer)|\n",
"|Chunk|● Store each chunk into a vector database|\n",
"|Doc|Chunk|\n",
"|Chunk|Vector Database|\n",
"|Chunk| |\n",
"---\n",
"|Current RAG Stack (Querying)|\n",
"|---|\n",
"|Process:|\n",
"|● Find top-k most similar chunks from vector database collection|\n",
"|● Plug into LLM response synthesis module|\n",
"|Chunk|Chunk|LLM|\n",
"|Vector|Chunk| |\n",
"|Database|\n",
"---\n",
"|Current RAG Stack (Querying)|\n",
"|---|\n",
"|Process:|\n",
"|● Find top-k most similar chunks from vector database collection|\n",
"|● Plug into LLM response synthesis module|\n",
"|Chunk|LLM|\n",
"|Chunk|\n",
"|Vector|\n",
"|Database|\n",
"|Retrieval|Synthesis|\n",
"---\n",
"|Query|Nodel|Response|Nodez|\n",
"|---|---|---|---|\n",
"|Create and Refine|Intermediate| | |\n",
"| | |Final|Response|\n",
"---\n",
"|Query|Node1|Node2|Node3|Node4|\n",
"|---|---|---|---|---|\n",
"|Tree Summarize| | | | |\n",
"---\n",
"Quickstart\n",
"\n",
"Link to Google Colab\n",
"---\n",
"NO_CONTENT_HERE\n",
"---\n",
"# Challenges with Naive RAG\n",
"\n",
"- Failure Modes\n",
"- Quality-Related (Hallucination, Accuracy)\n",
"- Non-Quality-Related (Latency, Cost, Syncing)\n",
"---\n",
"## Challenges with Naive RAG (Response Quality)\n",
"\n",
"|Bad Retrieval|Low Precision: Not all chunks in retrieved set are relevant|Hallucination + Lost in the Middle Problems|\n",
"|---|---|---|\n",
"| |Low Recall: Now all relevant chunks are retrieved.|Lacks enough context for LLM to synthesize an answer|\n",
"| |Outdated information: The data is redundant or out of date.| |\n",
"---\n",
"## Challenges with Naive RAG (Response Quality)\n",
"\n",
"|Bad Retrieval|Low Precision: Not all chunks in retrieved set are relevant|Hallucination + Lost in the Middle Problems|\n",
"|---|---|---|\n",
"| |Low Recall: Now all relevant chunks are retrieved.|Lacks enough context for LLM to synthesize an answer|\n",
"| |Outdated information: The data is redundant or out of date.| |\n",
"|Bad Response Generation|Hallucination: Model makes up an answer that isnt in the context.| |\n",
"| |Irrelevance: Model makes up an answer that doesnt answer the question.| |\n",
"| |Toxicity/Bias: Model makes up an answer t\n"
]
}
],
"source": [
"print(docs[0].get_content()[:5000])"
]
},
{
"cell_type": "markdown",
"id": "c2fa0a1a-1ed8-4a5a-a0c1-5792fe32634b",
"metadata": {},
"source": [
"## Build a RAG pipeline over these documents\n",
"\n",
"We now use LlamaIndex to build a RAG pipeline over these powerpoint slides."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c779547f-e4f7-4c84-9786-2b6b749827ab",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from llama_index.core import VectorStoreIndex"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "68b3a95e-ce19-4df1-9fdd-e6efb2fc423a",
"metadata": {},
"outputs": [],
"source": [
"index = VectorStoreIndex.from_documents(docs)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a2ae28f6-4b3a-4130-8e65-0921b7678739",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"query_engine = index.as_query_engine()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "232091ee-aa22-4f51-838c-410024acc344",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"response = query_engine.query(\"What are some response quality challenges with naive RAG?\") "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "75f32aa7-c308-4221-af60-779822cfdba1",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Some response quality challenges with naive RAG include issues related to bad retrieval, such as low precision where not all retrieved chunks are relevant, leading to problems like hallucination and being lost in the middle. Additionally, low recall can occur when not all relevant chunks are retrieved, resulting in a lack of sufficient context for the language model to synthesize an answer. Outdated information in the retrieved data can also pose a challenge. On the response generation side, challenges include hallucination where the model generates an answer not present in the context, irrelevance where the answer does not address the question, and toxicity/bias where the answer is harmful or offensive.\n"
]
}
],
"source": [
"print(str(response))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d309d8fb-750a-4393-a1b2-67b14b7c121f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llama_parse",
"language": "python",
"name": "llama_parse"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}