cr

2026-07-01 21:44:37 -04:00 · 2024-03-14 00:47:55 -07:00
1 changed files with 422 additions and 0 deletions
@@ -0,0 +1,422 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "eld1dKaN7P8B"
+   },
+   "source": [
+    "# LlamaParse - Parsing Financial Powerpoints 📊\n",
+    "\n",
+    "In this cookbook we show you how to use LlamaParse to parse a financial powerpoint."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "goB1sV8zu_Xl"
+   },
+   "source": [
+    "## Installation\n",
+    "\n",
+    "Parsing instruction are part of the LlamaParse API. They can be access by directly specifying the parsing_instruction parameter in the API or by using LlamaParse python module (which we will use for this tutorial).\n",
+    "\n",
+    "To install llama-parse, just get it from `pip`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "7Y3_BwQLu-qK",
+    "outputId": "b1129c52-7a70-44cc-ad03-1f8d3a8c794a"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install llama-index\n",
+    "!pip install llama-parse\n",
+    "!pip install torch transformers python-pptx Pillow"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "i-Rg2D_Rvf2i"
+   },
+   "source": [
+    "## API Key\n",
+    "\n",
+    "The use of LlamaParse requires an API key which you can get here: https://cloud.llamaindex.ai/parse"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "af6i2P1vuU-U"
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n",
+    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "p8Eq-aX-wAEo"
+   },
+   "source": [
+    "**NOTE**: Since LlamaParse is natively async, running the sync code in a notebook requires the use of nest_asyncio.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "id": "4OB0BkTqv_0l",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import nest_asyncio\n",
+    "\n",
+    "nest_asyncio.apply()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "dz927ecMyYo_"
+   },
+   "source": [
+    "## Importing the package\n",
+    "\n",
+    "To import llama_parse simply do:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "nSW-6sEwyXwx",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from llama_parse import LlamaParse"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "l_D4YsAHwUSk"
+   },
+   "source": [
+    "## Using LlamaParse to Parse Presentations\n",
+    "\n",
+    "Like Powerpoints, presentations are often hard to extract for RAG. With LlamaParse we can now parse them and unclock their content of presentations for RAG.\n",
+    "\n",
+    "Let's download a financial report from the World Meteorological Association."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "d3qeuiyawT0U",
+    "outputId": "cec0ea0a-be8b-49b6-9376-797c91f63be7",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "! mkdir data; wget \"https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx\" -O data/presentation.pptx"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Gbr8RiHEyF3-"
+   },
+   "source": [
+    "### Parsing the presentation\n",
+    "\n",
+    "Now let's parse it into Markdown with LlamaParse and the default LlamaIndex parser.\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "osocsofoJ42S"
+   },
+   "source": [
+    "#### Llama Index default"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "PTVy5XCNJwW-",
+    "outputId": "d0e2cc4b-1407-45a9-b5e6-d06f91a533b4",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from llama_index.core import SimpleDirectoryReader\n",
+    "\n",
+    "vanilla_documents = SimpleDirectoryReader(\"./data/\").load_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "oucbsciZJwxt"
+   },
+   "source": [
+    "#### Llama Parse"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "3jKnXCuAyQ9_",
+    "outputId": "1f668f17-1e20-46e5-fbab-9a55e4b28891",
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Started parsing the file under job_id 56724c0d-e45a-4e30-ae8c-e416173c608a\n"
+     ]
+    }
+   ],
+   "source": [
+    "llama_parse_documents = LlamaParse(result_type=\"markdown\").load_data(\"./data/presentation.pptx\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's take a look at the parsed output from an example slide (see image below).\n",
+    "\n",
+    "As we can see the table is faithfully extracted!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "ation and mitigation\n",
+      "---\n",
+      "|Item|31 Dec 2022|31 Dec 2021|Change|\n",
+      "|---|---|---|---|\n",
+      "|Payables and accruals|4,685|4,066|619|\n",
+      "|Employee benefits|127,215|84,676|42,539|\n",
+      "|Contributions received in advance|6,975|10,192|(3,217)|\n",
+      "|Unearned revenue from exchange transactions|20|651|(631)|\n",
+      "|Deferred Revenue|71,301|55,737|15,564|\n",
+      "|Borrowings|28,229|29,002|(773)|\n",
+      "|Funds held in trust|30,373|29,014|1,359|\n",
+      "|Provisions|1,706|1,910|(204)|\n",
+      "|Total Liabilities|270,504|215,248|55,256|\n",
+      "---\n",
+      "## Liabilities\n",
+      "\n",
+      "Employee Ben\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(llama_parse_documents[0].get_content()[-2800:-2300])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Compared against the original slide image.\n",
+    "![Demo](demo_ppt_financial_1.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "p4GVOdWzzvYg"
+   },
+   "source": [
+    "## Comparing the two for RAG\n",
+    "\n",
+    "The main difference between LlamaParse and the previous directory reader approach, it that LlamaParse will extract the document in a structured format, allowing better RAG."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "oVcdGus5NDxi"
+   },
+   "source": [
+    "### Query Engine on SimpleDirectoryReader results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {
+    "id": "DqXYsLCWNg9_",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
+    "\n",
+    "vanilla_index = VectorStoreIndex.from_documents(vanilla_documents)\n",
+    "vanilla_query_engine = vanilla_index.as_query_engine()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ZLkHt9l2Nbxx"
+   },
+   "source": [
+    "### Query Engine on LlamaParse Results\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {
+    "id": "ZllaDcfRNLv3",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "llama_parse_index = VectorStoreIndex.from_documents(llama_parse_documents)\n",
+    "llama_parse_query_engine = llama_parse_index.as_query_engine()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "0dY_0_1bNg0X",
+    "tags": []
+   },
+   "source": [
+    "### Liability provision\n",
+    "What was the liability provision as of Dec 31 2021?\n",
+    "\n",
+    "<!-- <img src=\"https://drive.usercontent.google.com/download?id=184jVq0QyspDnmCyRfV0ebmJJxmAOJHba&authuser=0\" /> -->"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "Tmn-qNTEN-cb",
+    "outputId": "a9bffc00-9cfc-43d8-b159-596a6c1aca64",
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The liability provision as of December 31, 2021, included Employee Benefit Liabilities, Contributions received in advance (assessed contributions), and Deferred revenue.\n"
+     ]
+    }
+   ],
+   "source": [
+    "vanilla_response = vanilla_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n",
+    "print(vanilla_response)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "4EZ_uqlROP7R",
+    "outputId": "0645a159-06c6-411e-d1f6-79ea95d32b42",
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The liability provision as of December 31, 2021, was 1,910 CHF.\n"
+     ]
+    }
+   ],
+   "source": [
+    "llama_parse_response = llama_parse_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n",
+    "print(llama_parse_response)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "llama_parse",
+   "language": "python",
+   "name": "llama_parse"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}