mirror of
https://github.com/run-llama/llama_cloud_services.git
synced 2026-07-01 21:44:37 -04:00
Compare commits
1 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 82f8bd5b58 |
@@ -0,0 +1,422 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "eld1dKaN7P8B"
|
||||
},
|
||||
"source": [
|
||||
"# LlamaParse - Parsing Financial Powerpoints 📊\n",
|
||||
"\n",
|
||||
"In this cookbook we show you how to use LlamaParse to parse a financial powerpoint."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "goB1sV8zu_Xl"
|
||||
},
|
||||
"source": [
|
||||
"## Installation\n",
|
||||
"\n",
|
||||
"Parsing instruction are part of the LlamaParse API. They can be access by directly specifying the parsing_instruction parameter in the API or by using LlamaParse python module (which we will use for this tutorial).\n",
|
||||
"\n",
|
||||
"To install llama-parse, just get it from `pip`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/"
|
||||
},
|
||||
"id": "7Y3_BwQLu-qK",
|
||||
"outputId": "b1129c52-7a70-44cc-ad03-1f8d3a8c794a"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install llama-index\n",
|
||||
"!pip install llama-parse\n",
|
||||
"!pip install torch transformers python-pptx Pillow"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "i-Rg2D_Rvf2i"
|
||||
},
|
||||
"source": [
|
||||
"## API Key\n",
|
||||
"\n",
|
||||
"The use of LlamaParse requires an API key which you can get here: https://cloud.llamaindex.ai/parse"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "af6i2P1vuU-U"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n",
|
||||
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "p8Eq-aX-wAEo"
|
||||
},
|
||||
"source": [
|
||||
"**NOTE**: Since LlamaParse is natively async, running the sync code in a notebook requires the use of nest_asyncio.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"id": "4OB0BkTqv_0l",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import nest_asyncio\n",
|
||||
"\n",
|
||||
"nest_asyncio.apply()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "dz927ecMyYo_"
|
||||
},
|
||||
"source": [
|
||||
"## Importing the package\n",
|
||||
"\n",
|
||||
"To import llama_parse simply do:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "nSW-6sEwyXwx",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from llama_parse import LlamaParse"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "l_D4YsAHwUSk"
|
||||
},
|
||||
"source": [
|
||||
"## Using LlamaParse to Parse Presentations\n",
|
||||
"\n",
|
||||
"Like Powerpoints, presentations are often hard to extract for RAG. With LlamaParse we can now parse them and unclock their content of presentations for RAG.\n",
|
||||
"\n",
|
||||
"Let's download a financial report from the World Meteorological Association."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/"
|
||||
},
|
||||
"id": "d3qeuiyawT0U",
|
||||
"outputId": "cec0ea0a-be8b-49b6-9376-797c91f63be7",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! mkdir data; wget \"https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx\" -O data/presentation.pptx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "Gbr8RiHEyF3-"
|
||||
},
|
||||
"source": [
|
||||
"### Parsing the presentation\n",
|
||||
"\n",
|
||||
"Now let's parse it into Markdown with LlamaParse and the default LlamaIndex parser.\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "osocsofoJ42S"
|
||||
},
|
||||
"source": [
|
||||
"#### Llama Index default"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/"
|
||||
},
|
||||
"id": "PTVy5XCNJwW-",
|
||||
"outputId": "d0e2cc4b-1407-45a9-b5e6-d06f91a533b4",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from llama_index.core import SimpleDirectoryReader\n",
|
||||
"\n",
|
||||
"vanilla_documents = SimpleDirectoryReader(\"./data/\").load_data()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "oucbsciZJwxt"
|
||||
},
|
||||
"source": [
|
||||
"#### Llama Parse"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/"
|
||||
},
|
||||
"id": "3jKnXCuAyQ9_",
|
||||
"outputId": "1f668f17-1e20-46e5-fbab-9a55e4b28891",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Started parsing the file under job_id 56724c0d-e45a-4e30-ae8c-e416173c608a\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"llama_parse_documents = LlamaParse(result_type=\"markdown\").load_data(\"./data/presentation.pptx\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's take a look at the parsed output from an example slide (see image below).\n",
|
||||
"\n",
|
||||
"As we can see the table is faithfully extracted!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"ation and mitigation\n",
|
||||
"---\n",
|
||||
"|Item|31 Dec 2022|31 Dec 2021|Change|\n",
|
||||
"|---|---|---|---|\n",
|
||||
"|Payables and accruals|4,685|4,066|619|\n",
|
||||
"|Employee benefits|127,215|84,676|42,539|\n",
|
||||
"|Contributions received in advance|6,975|10,192|(3,217)|\n",
|
||||
"|Unearned revenue from exchange transactions|20|651|(631)|\n",
|
||||
"|Deferred Revenue|71,301|55,737|15,564|\n",
|
||||
"|Borrowings|28,229|29,002|(773)|\n",
|
||||
"|Funds held in trust|30,373|29,014|1,359|\n",
|
||||
"|Provisions|1,706|1,910|(204)|\n",
|
||||
"|Total Liabilities|270,504|215,248|55,256|\n",
|
||||
"---\n",
|
||||
"## Liabilities\n",
|
||||
"\n",
|
||||
"Employee Ben\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(llama_parse_documents[0].get_content()[-2800:-2300])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"Compared against the original slide image.\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "p4GVOdWzzvYg"
|
||||
},
|
||||
"source": [
|
||||
"## Comparing the two for RAG\n",
|
||||
"\n",
|
||||
"The main difference between LlamaParse and the previous directory reader approach, it that LlamaParse will extract the document in a structured format, allowing better RAG."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "oVcdGus5NDxi"
|
||||
},
|
||||
"source": [
|
||||
"### Query Engine on SimpleDirectoryReader results"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"metadata": {
|
||||
"id": "DqXYsLCWNg9_",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
|
||||
"\n",
|
||||
"vanilla_index = VectorStoreIndex.from_documents(vanilla_documents)\n",
|
||||
"vanilla_query_engine = vanilla_index.as_query_engine()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "ZLkHt9l2Nbxx"
|
||||
},
|
||||
"source": [
|
||||
"### Query Engine on LlamaParse Results\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"metadata": {
|
||||
"id": "ZllaDcfRNLv3",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llama_parse_index = VectorStoreIndex.from_documents(llama_parse_documents)\n",
|
||||
"llama_parse_query_engine = llama_parse_index.as_query_engine()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "0dY_0_1bNg0X",
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Liability provision\n",
|
||||
"What was the liability provision as of Dec 31 2021?\n",
|
||||
"\n",
|
||||
"<!-- <img src=\"https://drive.usercontent.google.com/download?id=184jVq0QyspDnmCyRfV0ebmJJxmAOJHba&authuser=0\" /> -->"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/"
|
||||
},
|
||||
"id": "Tmn-qNTEN-cb",
|
||||
"outputId": "a9bffc00-9cfc-43d8-b159-596a6c1aca64",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"The liability provision as of December 31, 2021, included Employee Benefit Liabilities, Contributions received in advance (assessed contributions), and Deferred revenue.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vanilla_response = vanilla_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n",
|
||||
"print(vanilla_response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/"
|
||||
},
|
||||
"id": "4EZ_uqlROP7R",
|
||||
"outputId": "0645a159-06c6-411e-d1f6-79ea95d32b42",
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"The liability provision as of December 31, 2021, was 1,910 CHF.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"llama_parse_response = llama_parse_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n",
|
||||
"print(llama_parse_response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"provenance": []
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "llama_parse",
|
||||
"language": "python",
|
||||
"name": "llama_parse"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
Reference in New Issue
Block a user