llama_cloud_services/examples/split/document_splitting/document_splitting.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Document Splitting with LlamaCloud\n",
    "\n",
    "This notebook demonstrates how to use the LlamaCloud **Split** API to automatically segment a concatenated PDF into logical document sections based on content categories.\n",
    "\n",
    "## Use Case\n",
    "\n",
    "When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:\n",
    "\n",
    "1. Analyze each page's content\n",
    "2. Classify pages into user-defined categories\n",
    "3. Group consecutive pages of the same category into segments\n",
    "\n",
    "## Example Document\n",
    "\n",
    "We'll use a PDF containing three concatenated documents:\n",
    "- **Alan Turing's** \"Intelligent Machinery, A Heretical Theory\" (an essay)\n",
    "- **ImageNet paper** (a research paper)\n",
    "- **\"Attention Is All You Need\"** paper (a research paper)\n",
    "\n",
    "We'll split this into segments categorized as either `essay` or `research_paper`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **⚠️ DEPRECATION NOTICE**\n",
    ">\n",
    "> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
    ">\n",
    "> **Please migrate to:**\n",
    "> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
    "> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
    ">\n",
    "> The new package provides the same functionality with improved performance and support."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "%pip install llama-cloud python-dotenv requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ API configured with base URL: https://api.cloud.llamaindex.ai\n",
      "✅ Project ID: using default project\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import time\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "# Load environment variables\n",
    "load_dotenv()\n",
    "\n",
    "# Configuration\n",
    "LLAMA_CLOUD_API_KEY = os.environ.get(\"LLAMA_CLOUD_API_KEY\", \"llx-...\")\n",
    "BASE_URL = os.environ.get(\"LLAMA_CLOUD_BASE_URL\", \"https://api.cloud.llamaindex.ai\")\n",
    "PROJECT_ID = os.environ.get(\"LLAMA_CLOUD_PROJECT_ID\", None)\n",
    "\n",
    "# Headers for API requests\n",
    "headers = {\n",
    "    \"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\",\n",
    "    \"Content-Type\": \"application/json\",\n",
    "}\n",
    "\n",
    "print(f\"✅ API configured with base URL: {BASE_URL}\")\n",
    "print(f\"✅ Project ID: {PROJECT_ID or 'using default project'}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Upload the PDF File\n",
    "\n",
    "First, we'll upload our concatenated PDF to LlamaCloud through the Files API, using the `llama-cloud` SDK client.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📤 Uploading ./data/turing+imagenet+attention.pdf...\n",
      "✅ File uploaded successfully!\n",
      "   File name: turing+imagenet+attention.pdf\n"
     ]
    }
   ],
   "source": [
    "from llama_cloud.client import LlamaCloud\n",
    "\n",
    "# Initialize the client\n",
    "client = LlamaCloud(token=LLAMA_CLOUD_API_KEY, base_url=BASE_URL)\n",
    "\n",
    "# Path to the PDF file\n",
    "pdf_path = \"./data/turing+imagenet+attention.pdf\"\n",
    "\n",
    "# Upload the file\n",
    "print(f\"📤 Uploading {pdf_path}...\")\n",
    "\n",
    "with open(pdf_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.upload_file(upload_file=f, project_id=PROJECT_ID)\n",
    "\n",
    "file_id = uploaded_file.id\n",
    "print(\"✅ File uploaded successfully!\")\n",
    "print(f\"   File name: {uploaded_file.name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create a Split Job\n",
    "\n",
    "Now we'll create a split job using the Split API. Since the Split API is in beta and not yet available in the SDK, we'll use raw HTTP requests.\n",
    "\n",
    "We define two categories:\n",
    "- **essay**: For philosophical or reflective writing\n",
    "- **research_paper**: For formal academic documents with methodology and citations\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔄 Creating split job...\n",
      "✅ Split job created!\n",
      "   Job ID: spl-zsssb632a742aikliu96pqkb56t5\n",
      "   Status: pending\n",
      "   Categories: ['essay', 'research_paper']\n"
     ]
    }
   ],
   "source": [
    "# Define the split job request\n",
    "split_request = {\n",
    "    \"document_input\": {\n",
    "        \"type\": \"file_id\",  # only file_id is supported for now\n",
    "        \"value\": file_id,\n",
    "    },\n",
    "    \"categories\": [\n",
    "        {\n",
    "            \"name\": \"essay\",\n",
    "            \"description\": \"A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure\",\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"research_paper\",\n",
    "            \"description\": \"A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references\",\n",
    "        },\n",
    "    ],\n",
    "}\n",
    "\n",
    "# Create the split job\n",
    "print(\"🔄 Creating split job...\")\n",
    "response = requests.post(\n",
    "    f\"{BASE_URL}/api/v1/beta/split/jobs\",\n",
    "    params={\"project_id\": PROJECT_ID},\n",
    "    headers=headers,\n",
    "    json=split_request,\n",
    ")\n",
    "response.raise_for_status()\n",
    "\n",
    "split_job = response.json()\n",
    "job_id = split_job[\"id\"]\n",
    "\n",
    "print(\"✅ Split job created!\")\n",
    "print(f\"   Job ID: {job_id}\")\n",
    "print(f\"   Status: {split_job['status']}\")\n",
    "print(f\"   Categories: {[c['name'] for c in split_job['categories']]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Poll for Job Completion\n",
    "\n",
    "The split job runs asynchronously. We'll poll the job status until it completes.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "⏳ Waiting for split job to complete...\n",
      "   Status: processing (elapsed: 0s)\n",
      "   Status: processing (elapsed: 5s)\n",
      "   Status: processing (elapsed: 11s)\n",
      "   Status: completed (elapsed: 16s)\n",
      "\n",
      "✅ Split job completed successfully!\n"
     ]
    }
   ],
   "source": [
    "def poll_split_job(job_id: str, max_wait_seconds: int = 180, poll_interval: int = 5):\n",
    "    \"\"\"\n",
    "    Poll a split job until it reaches a terminal state.\n",
    "\n",
    "    Args:\n",
    "        job_id: The split job ID\n",
    "        max_wait_seconds: Maximum time to wait for completion\n",
    "        poll_interval: Seconds between poll attempts\n",
    "\n",
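    "    Returns:\n",
    "        The job response once its status is 'completed' or 'failed'\n",
    "\n",
    "    Raises:\n",
    "        TimeoutError: If the job is still running after max_wait_seconds\n",
    "    \"\"\"\n",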
    "    start_time = time.time()\n",
    "\n",
    "    while (time.time() - start_time) < max_wait_seconds:\n",
    "        response = requests.get(\n",
    "            f\"{BASE_URL}/api/v1/beta/split/jobs/{job_id}\",\n",
    "            params={\"project_id\": PROJECT_ID},\n",
    "            headers=headers,\n",
    "            timeout=30,  # do not block forever on a stalled connection\n",
    "        )\n",
    "        response.raise_for_status()\n",
    "        job = response.json()\n",
    "\n",
    "        status = job[\"status\"]\n",
    "        elapsed = int(time.time() - start_time)\n",
    "        print(f\"   Status: {status} (elapsed: {elapsed}s)\")\n",
    "\n",
    "        if status in [\"completed\", \"failed\"]:\n",
    "            return job\n",
    "\n",
    "        time.sleep(poll_interval)\n",
    "\n",
    "    raise TimeoutError(f\"Job did not complete within {max_wait_seconds} seconds\")\n",
    "\n",
    "\n",
    "print(\"⏳ Waiting for split job to complete...\")\n",
    "completed_job = poll_split_job(job_id)\n",
    "\n",
    "if completed_job[\"status\"] == \"completed\":\n",
    "    print(\"\\n✅ Split job completed successfully!\")\n",
    "else:\n",
    "    print(\n",
    "        f\"\\n❌ Split job failed: {completed_job.get('error_message', 'Unknown error')}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Analyze the Results\n",
    "\n",
    "Let's examine the split results to see how the document was segmented.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📊 Split Results Summary\n",
      "==================================================\n",
      "Total segments found: 3\n",
      "\n",
      "Segments by category:\n",
      "   • essay: 1 segment(s)\n",
      "   • research_paper: 2 segment(s)\n"
     ]
    }
   ],
   "source": [
    "# Get the segments from the result\n",
    "# `result` may be missing or null (e.g. for a failed job), so fall back to an empty dict\n",
    "segments = (completed_job.get(\"result\") or {}).get(\"segments\", [])\n",
    "\n",
    "print(\"📊 Split Results Summary\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Total segments found: {len(segments)}\")\n",
    "print()\n",
    "\n",
    "# Count by category\n",
    "category_counts = {}\n",
    "for segment in segments:\n",
    "    cat = segment[\"category\"]\n",
    "    category_counts[cat] = category_counts.get(cat, 0) + 1\n",
    "\n",
    "print(\"Segments by category:\")\n",
    "for cat, count in category_counts.items():\n",
    "    print(f\"   • {cat}: {count} segment(s)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "📄 Segment Details\n",
      "==================================================\n",
      "\n",
      "Segment 1:\n",
      "   Category: essay\n",
      "   Pages 1-4 (4 pages)\n",
      "   Confidence: high\n",
      "\n",
      "Segment 2:\n",
      "   Category: research_paper\n",
      "   Pages 5-13 (9 pages)\n",
      "   Confidence: high\n",
      "\n",
      "Segment 3:\n",
      "   Category: research_paper\n",
      "   Pages 14-24 (11 pages)\n",
      "   Confidence: high\n"
     ]
    }
   ],
   "source": [
    "# Display detailed segment information\n",
    "print(\"\\n📄 Segment Details\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for i, segment in enumerate(segments, 1):\n",
    "    category = segment[\"category\"]\n",
    "    pages = segment[\"pages\"]\n",
    "    confidence = segment[\"confidence_category\"]\n",
    "\n",
    "    # Format page range\n",
    "    if len(pages) == 1:\n",
    "        page_range = f\"Page {pages[0]}\"\n",
    "    else:\n",
    "        page_range = f\"Pages {min(pages)}-{max(pages)}\"\n",
    "\n",
    "    print(f\"\\nSegment {i}:\")\n",
    "    print(f\"   Category: {category}\")\n",
    "    print(f\"   {page_range} ({len(pages)} page{'s' if len(pages) > 1 else ''})\")\n",
    "    print(f\"   Confidence: {confidence}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Expected Results\n",
    "\n",
    "Based on our test document, we expect:\n",
    "- **1 essay segment**: Alan Turing's \"Intelligent Machinery, A Heretical Theory\"\n",
    "- **2 research paper segments**: ImageNet paper and \"Attention Is All You Need\" paper\n",
    "\n",
    "The pages should be grouped consecutively, with no overlap between segments.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "✅ Validation\n",
      "==================================================\n",
      "Total pages assigned: 24\n",
      "Unique pages: 24\n",
      "✅ No page overlap detected - each page belongs to exactly one segment\n"
     ]
    }
   ],
   "source": [
    "# Verify no page overlap\n",
    "all_pages = []\n",
    "for segment in segments:\n",
    "    all_pages.extend(segment[\"pages\"])\n",
    "\n",
    "unique_pages = set(all_pages)\n",
    "\n",
    "print(\"\\n✅ Validation\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Total pages assigned: {len(all_pages)}\")\n",
    "print(f\"Unique pages: {len(unique_pages)}\")\n",
    "\n",
    "if len(all_pages) == len(unique_pages):\n",
    "    print(\"✅ No page overlap detected - each page belongs to exactly one segment\")\n",
    "else:\n",
    "    print(\n",
    "        f\"⚠️  Page overlap detected - {len(all_pages) - len(unique_pages)} duplicate assignments\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using `allow_uncategorized` Strategy\n",
    "\n",
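    "You can also use the `allow_uncategorized` splitting strategy. This is useful when you want to capture pages that don't match any defined category. The cell below only builds the request body; to run it, submit it to the same endpoint used in Step 2 and poll the job as in Step 3.\n"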
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📝 With allow_uncategorized=True and only 'essay' category defined,\n",
      "   pages that don't match 'essay' will be grouped as 'uncategorized'.\n"
     ]
    }
   ],
   "source": [
    "# Example with allow_uncategorized strategy\n",
    "split_request_uncategorized = {\n",
    "    \"document_input\": {\"type\": \"file_id\", \"value\": file_id},\n",
    "    \"categories\": [\n",
    "        {\n",
    "            \"name\": \"essay\",\n",
    "            \"description\": \"A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic\",\n",
    "        }\n",
    "        # Note: We only define 'essay' category\n",
    "        # Research papers will be classified as 'uncategorized'\n",
    "    ],\n",
    "    \"splitting_strategy\": {\"allow_uncategorized\": True},\n",
    "}\n",
    "\n",
    "print(\"📝 With allow_uncategorized=True and only 'essay' category defined,\")\n",
    "print(\"   pages that don't match 'essay' will be grouped as 'uncategorized'.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "The LlamaCloud Split API automatically segments concatenated documents based on content categories that you define. This is useful for:\n",
    "\n",
    "- **Document processing pipelines**: Automatically separate bundled documents before further processing\n",
    "- **Content organization**: Categorize and organize mixed document collections\n",
    "- **Information extraction**: Identify different document types within a single file\n",
    "\n",
    "### Key Features\n",
    "\n",
    "- **AI-powered classification**: Uses LLMs to understand page content and assign categories\n",
    "- **Flexible categories**: Define any categories relevant to your use case\n",
    "- **Confidence scoring**: Each segment includes a confidence level\n",
    "- **Page-level granularity**: Results include exact page numbers for each segment\n",
    "\n",
    "### API Reference\n",
    "\n",
    "- **Create Split Job**: `POST /api/v1/beta/split/jobs`\n",
    "- **Get Split Job**: `GET /api/v1/beta/split/jobs/{job_id}`\n",
    "- **List Split Jobs**: `GET /api/v1/beta/split/jobs`\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}