llama_cloud_services/examples/split/document_splitting/document_splitting.ipynb
2026-02-02 11:42:47 -06:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Document Splitting with LlamaCloud\n",
"\n",
"This notebook demonstrates how to use the LlamaCloud **Split** API to automatically segment a concatenated PDF into logical document sections based on content categories.\n",
"\n",
"## Use Case\n",
"\n",
"When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:\n",
"\n",
"1. Analyze each page's content\n",
"2. Classify pages into user-defined categories\n",
"3. Group consecutive pages of the same category into segments\n",
"\n",
"## Example Document\n",
"\n",
"We'll use a PDF containing three concatenated documents:\n",
"- **Alan Turing's essay** \"Intelligent Machinery, A Heretical Theory\" (an essay)\n",
"- **ImageNet paper** (a research paper)\n",
"- **\"Attention is All You Need\"** paper (a research paper)\n",
"\n",
"We'll split this into segments categorized as either `essay` or `research_paper`.\n"
]
},
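{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 3 of the Use Case list (grouping consecutive pages of the same category) can be sketched in a few lines. This is only an illustration, not the service's implementation: `group_pages` and its `(page_number, category)` input are hypothetical names chosen for this sketch. Note that such naive grouping would merge two back-to-back documents of the same category, such as the two research papers in the example PDF, so the service presumably also detects document boundaries.\n",
"\n",
"```python\n",
"from itertools import groupby\n",
"\n",
"\n",
"def group_pages(page_labels):\n",
"    # page_labels: (page_number, category) pairs in page order (hypothetical input)\n",
"    segments = []\n",
"    # groupby yields one run per stretch of adjacent items sharing a category\n",
"    for category, run in groupby(page_labels, key=lambda item: item[1]):\n",
"        segments.append({'category': category, 'pages': [page for page, _ in run]})\n",
"    return segments\n",
"\n",
"\n",
"labels = [(1, 'essay'), (2, 'essay'), (3, 'research_paper'), (4, 'research_paper')]\n",
"print(group_pages(labels))\n",
"```\n",
"\n",
"The `category` and `pages` keys mirror the segment fields read later in this notebook.\n"
]
},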
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: llama-cloud in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (0.1.44)\n",
"Requirement already satisfied: python-dotenv in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (1.2.1)\n",
"Requirement already satisfied: requests in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (2.32.5)\n",
"Requirement already satisfied: certifi>=2024.7.4 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from llama-cloud) (2025.11.12)\n",
"Requirement already satisfied: httpx>=0.20.0 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from llama-cloud) (0.28.1)\n",
"Requirement already satisfied: pydantic>=1.10 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from llama-cloud) (2.12.5)\n",
"Requirement already satisfied: charset_normalizer<4,>=2 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from requests) (3.4.4)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from requests) (3.11)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from requests) (2.5.0)\n",
"Requirement already satisfied: anyio in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from httpx>=0.20.0->llama-cloud) (4.11.0)\n",
"Requirement already satisfied: httpcore==1.* in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from httpx>=0.20.0->llama-cloud) (1.0.9)\n",
"Requirement already satisfied: h11>=0.16 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from httpcore==1.*->httpx>=0.20.0->llama-cloud) (0.16.0)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from pydantic>=1.10->llama-cloud) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.41.5 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from pydantic>=1.10->llama-cloud) (2.41.5)\n",
"Requirement already satisfied: typing-extensions>=4.14.1 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from pydantic>=1.10->llama-cloud) (4.15.0)\n",
"Requirement already satisfied: typing-inspection>=0.4.2 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from pydantic>=1.10->llama-cloud) (0.4.2)\n",
"Requirement already satisfied: sniffio>=1.1 in /Users/javier/llama_cloud_services/.venv/lib/python3.11/site-packages (from anyio->httpx>=0.20.0->llama-cloud) (1.3.1)\n",
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"# Install required packages\n",
"%pip install llama-cloud python-dotenv requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ API configured with base URL: https://api.cloud.llamaindex.ai\n",
"✅ Project ID: using default project\n"
]
}
],
"source": [
"import os\n",
"import time\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"\n",
"# Load environment variables\n",
"load_dotenv()\n",
"\n",
"# Configuration\n",
"LLAMA_CLOUD_API_KEY = os.environ.get(\"LLAMA_CLOUD_API_KEY\", \"llx-...\")\n",
"BASE_URL = os.environ.get(\"LLAMA_CLOUD_BASE_URL\", \"https://api.cloud.llamaindex.ai\")\n",
"PROJECT_ID = os.environ.get(\"LLAMA_CLOUD_PROJECT_ID\", None)\n",
"\n",
"# Headers for API requests\n",
"headers = {\n",
"    \"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\",\n",
"    \"Content-Type\": \"application/json\",\n",
"}\n",
"\n",
"print(f\"✅ API configured with base URL: {BASE_URL}\")\n",
"print(f\"✅ Project ID: {PROJECT_ID or 'using default project'}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Upload the PDF File\n",
"\n",
"First, we'll upload our concatenated PDF to LlamaCloud using the Files API. This can be done using the `llama-cloud` SDK.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📤 Uploading ./data/turing+imagenet+attention.pdf...\n",
"✅ File uploaded successfully!\n",
" File name: turing+imagenet+attention.pdf\n"
]
}
],
"source": [
"from llama_cloud.client import LlamaCloud\n",
"\n",
"# Initialize the client\n",
"client = LlamaCloud(token=LLAMA_CLOUD_API_KEY, base_url=BASE_URL)\n",
"\n",
"# Path to the PDF file\n",
"pdf_path = \"./data/turing+imagenet+attention.pdf\"\n",
"\n",
"# Upload the file\n",
"print(f\"📤 Uploading {pdf_path}...\")\n",
"\n",
"with open(pdf_path, \"rb\") as f:\n",
"    uploaded_file = client.files.upload_file(upload_file=f, project_id=PROJECT_ID)\n",
"\n",
"file_id = uploaded_file.id\n",
"print(f\"✅ File uploaded successfully!\")\n",
"print(f\" File name: {uploaded_file.name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create a Split Job\n",
"\n",
"Now we'll create a split job using the Split API. Since the Split API is in beta and not yet available in the SDK, we'll use raw HTTP requests.\n",
"\n",
"We define two categories:\n",
"- **essay**: For philosophical or reflective writing\n",
"- **research_paper**: For formal academic documents with methodology and citations\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🔄 Creating split job...\n",
"✅ Split job created!\n",
" Job ID: spl-zsssb632a742aikliu96pqkb56t5\n",
" Status: pending\n",
" Categories: ['essay', 'research_paper']\n"
]
}
],
"source": [
"# Define the split job request\n",
"split_request = {\n",
"    \"document_input\": {\n",
"        \"type\": \"file_id\",  # only file_id is supported for now\n",
"        \"value\": file_id,\n",
"    },\n",
"    \"categories\": [\n",
"        {\n",
"            \"name\": \"essay\",\n",
"            \"description\": \"A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure\",\n",
"        },\n",
"        {\n",
"            \"name\": \"research_paper\",\n",
"            \"description\": \"A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references\",\n",
"        },\n",
"    ],\n",
"}\n",
"\n",
"# Create the split job\n",
"print(\"🔄 Creating split job...\")\n",
"response = requests.post(\n",
"    f\"{BASE_URL}/api/v1/beta/split/jobs\",\n",
"    params={\"project_id\": PROJECT_ID},\n",
"    headers=headers,\n",
"    json=split_request,\n",
")\n",
"response.raise_for_status()\n",
"\n",
"split_job = response.json()\n",
"job_id = split_job[\"id\"]\n",
"\n",
"print(f\"✅ Split job created!\")\n",
"print(f\" Job ID: {job_id}\")\n",
"print(f\" Status: {split_job['status']}\")\n",
"print(f\" Categories: {[c['name'] for c in split_job['categories']]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Poll for Job Completion\n",
"\n",
"The split job runs asynchronously. We'll poll the job status until it completes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"⏳ Waiting for split job to complete...\n",
" Status: processing (elapsed: 0s)\n",
" Status: processing (elapsed: 5s)\n",
" Status: processing (elapsed: 11s)\n",
" Status: completed (elapsed: 16s)\n",
"\n",
"✅ Split job completed successfully!\n"
]
}
],
"source": [
"def poll_split_job(job_id: str, max_wait_seconds: int = 180, poll_interval: int = 5):\n",
"    \"\"\"\n",
"    Poll a split job until it reaches a terminal state.\n",
"\n",
"    Args:\n",
"        job_id: The split job ID\n",
"        max_wait_seconds: Maximum time to wait for completion\n",
"        poll_interval: Seconds between poll attempts\n",
"\n",
"    Returns:\n",
"        The completed job response\n",
"    \"\"\"\n",
"    start_time = time.time()\n",
"\n",
"    while (time.time() - start_time) < max_wait_seconds:\n",
"        response = requests.get(\n",
"            f\"{BASE_URL}/api/v1/beta/split/jobs/{job_id}\",\n",
"            params={\"project_id\": PROJECT_ID},\n",
"            headers=headers,\n",
"        )\n",
"        response.raise_for_status()\n",
"        job = response.json()\n",
"\n",
"        status = job[\"status\"]\n",
"        elapsed = int(time.time() - start_time)\n",
"        print(f\" Status: {status} (elapsed: {elapsed}s)\")\n",
"\n",
"        if status in [\"completed\", \"failed\"]:\n",
"            return job\n",
"\n",
"        time.sleep(poll_interval)\n",
"\n",
"    raise TimeoutError(f\"Job did not complete within {max_wait_seconds} seconds\")\n",
"\n",
"\n",
"print(\"⏳ Waiting for split job to complete...\")\n",
"completed_job = poll_split_job(job_id)\n",
"\n",
"if completed_job[\"status\"] == \"completed\":\n",
"    print(\"\\n✅ Split job completed successfully!\")\n",
"else:\n",
"    print(\n",
"        f\"\\n❌ Split job failed: {completed_job.get('error_message', 'Unknown error')}\"\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Analyze the Results\n",
"\n",
"Let's examine the split results to see how the document was segmented.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📊 Split Results Summary\n",
"==================================================\n",
"Total segments found: 3\n",
"\n",
"Segments by category:\n",
" • essay: 1 segment(s)\n",
" • research_paper: 2 segment(s)\n"
]
}
],
"source": [
"# Get the segments from the result; `or {}` guards against a null result\n",
"segments = (completed_job.get(\"result\") or {}).get(\"segments\", [])\n",
"\n",
"print(\"📊 Split Results Summary\")\n",
"print(\"=\" * 50)\n",
"print(f\"Total segments found: {len(segments)}\")\n",
"print()\n",
"\n",
"# Count by category\n",
"category_counts = {}\n",
"for segment in segments:\n",
"    cat = segment[\"category\"]\n",
"    category_counts[cat] = category_counts.get(cat, 0) + 1\n",
"\n",
"print(\"Segments by category:\")\n",
"for cat, count in category_counts.items():\n",
"    print(f\" • {cat}: {count} segment(s)\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"📄 Segment Details\n",
"==================================================\n",
"\n",
"Segment 1:\n",
" Category: essay\n",
" Pages 1-4 (4 pages)\n",
" Confidence: high\n",
"\n",
"Segment 2:\n",
" Category: research_paper\n",
" Pages 5-13 (9 pages)\n",
" Confidence: high\n",
"\n",
"Segment 3:\n",
" Category: research_paper\n",
" Pages 14-24 (11 pages)\n",
" Confidence: high\n"
]
}
],
"source": [
"# Display detailed segment information\n",
"print(f\"\\n📄 Segment Details\")\n",
"print(f\"=\" * 50)\n",
"\n",
"for i, segment in enumerate(segments, 1):\n",
"    category = segment[\"category\"]\n",
"    pages = segment[\"pages\"]\n",
"    confidence = segment[\"confidence_category\"]\n",
"\n",
"    # Format page range\n",
"    if len(pages) == 1:\n",
"        page_range = f\"Page {pages[0]}\"\n",
"    else:\n",
"        page_range = f\"Pages {min(pages)}-{max(pages)}\"\n",
"\n",
"    print(f\"\\nSegment {i}:\")\n",
"    print(f\" Category: {category}\")\n",
"    print(f\" {page_range} ({len(pages)} page{'s' if len(pages) > 1 else ''})\")\n",
"    print(f\" Confidence: {confidence}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Expected Results\n",
"\n",
"Based on our test document, we expect:\n",
"- **1 essay segment**: Alan Turing's \"Intelligent Machinery, A Heretical Theory\"\n",
"- **2 research paper segments**: ImageNet paper and \"Attention is All You Need\" paper\n",
"\n",
"The pages should be grouped consecutively, with no overlap between segments.\n"
]
},
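{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both expectations above (consecutive pages within a segment, no overlap between segments) can be checked programmatically. The sketch below is a hypothetical helper, `check_segments`, which assumes each segment is a dict with a `pages` list of integers, as in the results read in Step 4; it is not part of the Split API.\n",
"\n",
"```python\n",
"def check_segments(segments):\n",
"    # Return a list of human-readable problems; an empty list means all checks passed.\n",
"    problems = []\n",
"    seen = set()\n",
"    for index, segment in enumerate(segments, 1):\n",
"        pages = sorted(segment['pages'])\n",
"        # Pages inside one segment should form an unbroken run.\n",
"        if pages and pages != list(range(pages[0], pages[-1] + 1)):\n",
"            problems.append(f'segment {index} has non-consecutive pages')\n",
"        # No page should appear in more than one segment.\n",
"        repeated = seen.intersection(pages)\n",
"        if repeated:\n",
"            problems.append(f'segment {index} repeats pages {sorted(repeated)}')\n",
"        seen.update(pages)\n",
"    return problems\n",
"\n",
"\n",
"example = [\n",
"    {'category': 'essay', 'pages': [1, 2, 3, 4]},\n",
"    {'category': 'research_paper', 'pages': [5, 6, 8]},\n",
"]\n",
"print(check_segments(example))\n",
"```\n",
"\n",
"Calling `check_segments(segments)` on the real results should return an empty list if the split behaved as expected.\n"
]
},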
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"✅ Validation\n",
"==================================================\n",
"Total pages assigned: 24\n",
"Unique pages: 24\n",
"✅ No page overlap detected - each page belongs to exactly one segment\n"
]
}
],
"source": [
"# Verify no page overlap\n",
"all_pages = []\n",
"for segment in segments:\n",
"    all_pages.extend(segment[\"pages\"])\n",
"\n",
"unique_pages = set(all_pages)\n",
"\n",
"print(f\"\\n✅ Validation\")\n",
"print(f\"=\" * 50)\n",
"print(f\"Total pages assigned: {len(all_pages)}\")\n",
"print(f\"Unique pages: {len(unique_pages)}\")\n",
"\n",
"if len(all_pages) == len(unique_pages):\n",
"    print(f\"✅ No page overlap detected - each page belongs to exactly one segment\")\n",
"else:\n",
"    print(\n",
"        f\"⚠️ Page overlap detected - {len(all_pages) - len(unique_pages)} duplicate assignments\"\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using `allow_uncategorized` Strategy\n",
"\n",
"You can also use the `allow_uncategorized` splitting strategy, which captures pages that don't match any defined category. The cell below only builds the request body; to run it, submit it to the same endpoint used in Step 2.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📝 With allow_uncategorized=True and only 'essay' category defined,\n",
" pages that don't match 'essay' will be grouped as 'uncategorized'.\n"
]
}
],
"source": [
"# Example with allow_uncategorized strategy\n",
"split_request_uncategorized = {\n",
"    \"document_input\": {\"type\": \"file_id\", \"value\": file_id},\n",
"    \"categories\": [\n",
"        {\n",
"            \"name\": \"essay\",\n",
"            \"description\": \"A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic\",\n",
"        }\n",
"        # Note: We only define 'essay' category\n",
"        # Research papers will be classified as 'uncategorized'\n",
"    ],\n",
"    \"splitting_strategy\": {\"allow_uncategorized\": True},\n",
"}\n",
"\n",
"print(\"📝 With allow_uncategorized=True and only 'essay' category defined,\")\n",
"print(\" pages that don't match 'essay' will be grouped as 'uncategorized'.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"The LlamaCloud Split API provides a powerful way to automatically segment concatenated documents based on content categories. This is useful for:\n",
"\n",
"- **Document processing pipelines**: Automatically separate bundled documents before further processing\n",
"- **Content organization**: Categorize and organize mixed document collections\n",
"- **Information extraction**: Identify different document types within a single file\n",
"\n",
"### Key Features\n",
"\n",
"- **AI-powered classification**: Uses LLMs to understand page content and assign categories\n",
"- **Flexible categories**: Define any categories relevant to your use case\n",
"- **Confidence scoring**: Each segment includes a confidence level\n",
"- **Page-level granularity**: Results include exact page numbers for each segment\n",
"\n",
"### API Reference\n",
"\n",
"- **Create Split Job**: `POST /api/v1/beta/split/jobs`\n",
"- **Get Split Job**: `GET /api/v1/beta/split/jobs/{job_id}`\n",
"- **List Split Jobs**: `GET /api/v1/beta/split/jobs`\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}