Compare commits

...

1 Commits

Author SHA1 Message Date
Jerry Liu 82f8bd5b58 cr 2024-03-14 00:47:55 -07:00
@@ -0,0 +1,422 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "eld1dKaN7P8B"
},
"source": [
"# LlamaParse - Parsing Financial Powerpoints 📊\n",
"\n",
"In this cookbook we show you how to use LlamaParse to parse a financial powerpoint."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "goB1sV8zu_Xl"
},
"source": [
"## Installation\n",
"\n",
"Parsing instruction are part of the LlamaParse API. They can be access by directly specifying the parsing_instruction parameter in the API or by using LlamaParse python module (which we will use for this tutorial).\n",
"\n",
"To install llama-parse, just get it from `pip`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7Y3_BwQLu-qK",
"outputId": "b1129c52-7a70-44cc-ad03-1f8d3a8c794a"
},
"outputs": [],
"source": [
"!pip install llama-index\n",
"!pip install llama-parse\n",
"!pip install torch transformers python-pptx Pillow"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i-Rg2D_Rvf2i"
},
"source": [
"## API Key\n",
"\n",
"The use of LlamaParse requires an API key which you can get here: https://cloud.llamaindex.ai/parse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "af6i2P1vuU-U"
},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p8Eq-aX-wAEo"
},
"source": [
"**NOTE**: Since LlamaParse is natively async, running the sync code in a notebook requires the use of nest_asyncio.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "4OB0BkTqv_0l",
"tags": []
},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dz927ecMyYo_"
},
"source": [
"## Importing the package\n",
"\n",
"To import llama_parse simply do:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nSW-6sEwyXwx",
"tags": []
},
"outputs": [],
"source": [
"from llama_parse import LlamaParse"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "l_D4YsAHwUSk"
},
"source": [
"## Using LlamaParse to Parse Presentations\n",
"\n",
"Like Powerpoints, presentations are often hard to extract for RAG. With LlamaParse we can now parse them and unclock their content of presentations for RAG.\n",
"\n",
"Let's download a financial report from the World Meteorological Association."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "d3qeuiyawT0U",
"outputId": "cec0ea0a-be8b-49b6-9376-797c91f63be7",
"tags": []
},
"outputs": [],
"source": [
"! mkdir data; wget \"https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx\" -O data/presentation.pptx"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gbr8RiHEyF3-"
},
"source": [
"### Parsing the presentation\n",
"\n",
"Now let's parse it into Markdown with LlamaParse and the default LlamaIndex parser.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "osocsofoJ42S"
},
"source": [
"#### Llama Index default"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "PTVy5XCNJwW-",
"outputId": "d0e2cc4b-1407-45a9-b5e6-d06f91a533b4",
"tags": []
},
"outputs": [],
"source": [
"from llama_index.core import SimpleDirectoryReader\n",
"\n",
"vanilla_documents = SimpleDirectoryReader(\"./data/\").load_data()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oucbsciZJwxt"
},
"source": [
"#### Llama Parse"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3jKnXCuAyQ9_",
"outputId": "1f668f17-1e20-46e5-fbab-9a55e4b28891",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Started parsing the file under job_id 56724c0d-e45a-4e30-ae8c-e416173c608a\n"
]
}
],
"source": [
"llama_parse_documents = LlamaParse(result_type=\"markdown\").load_data(\"./data/presentation.pptx\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the parsed output from an example slide (see image below).\n",
"\n",
"As we can see the table is faithfully extracted!"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ation and mitigation\n",
"---\n",
"|Item|31 Dec 2022|31 Dec 2021|Change|\n",
"|---|---|---|---|\n",
"|Payables and accruals|4,685|4,066|619|\n",
"|Employee benefits|127,215|84,676|42,539|\n",
"|Contributions received in advance|6,975|10,192|(3,217)|\n",
"|Unearned revenue from exchange transactions|20|651|(631)|\n",
"|Deferred Revenue|71,301|55,737|15,564|\n",
"|Borrowings|28,229|29,002|(773)|\n",
"|Funds held in trust|30,373|29,014|1,359|\n",
"|Provisions|1,706|1,910|(204)|\n",
"|Total Liabilities|270,504|215,248|55,256|\n",
"---\n",
"## Liabilities\n",
"\n",
"Employee Ben\n"
]
}
],
"source": [
"print(llama_parse_documents[0].get_content()[-2800:-2300])"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Compared against the original slide image.\n",
"![Demo](demo_ppt_financial_1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p4GVOdWzzvYg"
},
"source": [
"## Comparing the two for RAG\n",
"\n",
"The main difference between LlamaParse and the previous directory reader approach, it that LlamaParse will extract the document in a structured format, allowing better RAG."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oVcdGus5NDxi"
},
"source": [
"### Query Engine on SimpleDirectoryReader results"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "DqXYsLCWNg9_",
"tags": []
},
"outputs": [],
"source": [
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
"\n",
"vanilla_index = VectorStoreIndex.from_documents(vanilla_documents)\n",
"vanilla_query_engine = vanilla_index.as_query_engine()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZLkHt9l2Nbxx"
},
"source": [
"### Query Engine on LlamaParse Results\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "ZllaDcfRNLv3",
"tags": []
},
"outputs": [],
"source": [
"llama_parse_index = VectorStoreIndex.from_documents(llama_parse_documents)\n",
"llama_parse_query_engine = llama_parse_index.as_query_engine()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0dY_0_1bNg0X",
"tags": []
},
"source": [
"### Liability provision\n",
"What was the liability provision as of Dec 31 2021?\n",
"\n",
"<!-- <img src=\"https://drive.usercontent.google.com/download?id=184jVq0QyspDnmCyRfV0ebmJJxmAOJHba&authuser=0\" /> -->"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Tmn-qNTEN-cb",
"outputId": "a9bffc00-9cfc-43d8-b159-596a6c1aca64",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The liability provision as of December 31, 2021, included Employee Benefit Liabilities, Contributions received in advance (assessed contributions), and Deferred revenue.\n"
]
}
],
"source": [
"vanilla_response = vanilla_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n",
"print(vanilla_response)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4EZ_uqlROP7R",
"outputId": "0645a159-06c6-411e-d1f6-79ea95d32b42",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The liability provision as of December 31, 2021, was 1,910 CHF.\n"
]
}
],
"source": [
"llama_parse_response = llama_parse_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n",
"print(llama_parse_response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "llama_parse",
"language": "python",
"name": "llama_parse"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}