{
"cells": [
{
"cell_type": "markdown",
"id": "cell-0",
"metadata": {},
"source": [
"# Batch Parse with LlamaCloud Directories\n",
"\n",
"This notebook demonstrates how to use LlamaCloud's batch processing API to parse multiple files in a directory. The workflow includes:\n",
"\n",
"1. **Creating a Directory** - Set up a directory to organize your files\n",
"2. **Uploading Files** - Upload multiple files to the directory\n",
"3. **Starting a Batch Parse Job** - Kick off batch processing on all files\n",
"4. **Monitoring Progress** - Check the status and view results\n",
"\n",
"This is useful when you need to parse many documents at once, as the batch API handles the orchestration and provides progress tracking."
]
},
{
"cell_type": "markdown",
"id": "0c2b5e1a",
"metadata": {},
"source": [
 **⚠️ DEPRECATION NOTICE**">
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"id": "cell-1",
"metadata": {},
"source": [
"## Setup and Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-2",
"metadata": {},
"outputs": [],
"source": [
"%pip install llama-cloud python-dotenv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-3",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"import httpx\n",
"\n",
"# Load environment variables\n",
"load_dotenv()\n",
"\n",
"# Set your API key\n",
"LLAMA_CLOUD_API_KEY = os.environ.get(\"LLAMA_CLOUD_API_KEY\", \"llx-...\")\n",
"\n",
"# Optional: Set base URL (defaults to https://api.cloud.llamaindex.ai if not set)\n",
"LLAMA_CLOUD_BASE_URL = os.environ.get(\n",
" \"LLAMA_CLOUD_BASE_URL\", \"https://api.cloud.llamaindex.ai\"\n",
")\n",
"\n",
"# Optional: Set project_id if you have one, otherwise it will use your default project\n",
"PROJECT_ID = os.environ.get(\"LLAMA_CLOUD_PROJECT_ID\", None)\n",
"\n",
"print(\"✅ API key configured\")\n",
"print(f\" Base URL: {LLAMA_CLOUD_BASE_URL}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-4",
"metadata": {},
"source": [
"## Setup HTTP Client\n",
"\n",
"Since the current version of the llama-cloud SDK has some issues with the beta endpoints, we'll use direct HTTP requests with httpx for reliability."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-5",
"metadata": {},
"outputs": [],
"source": [
"# Create HTTP client with authentication\n",
"headers = {\n",
" \"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\",\n",
"}\n",
"\n",
"print(\"✅ HTTP client configured\")\n",
"print(f\" Using base URL: {LLAMA_CLOUD_BASE_URL}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-6",
"metadata": {},
"source": [
"## Step 1: Create a Directory\n",
"\n",
"First, we'll create a directory to organize our files. Directories help you group related files together for batch processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-7",
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create a directory with a timestamp in the name\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"directory_name = f\"batch-parse-demo-{timestamp}\"\n",
"\n",
"# Create directory using HTTP request\n",
"response = httpx.post(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/directories\",\n",
" headers=headers,\n",
" params={\"project_id\": PROJECT_ID},\n",
" json={\n",
" \"name\": directory_name,\n",
" \"description\": \"Demo directory for batch parse example\",\n",
" },\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code in [200, 201]:\n",
" directory = response.json()\n",
" directory_id = directory[\"id\"]\n",
" project_id = directory[\"project_id\"]\n",
"\n",
" print(f\"✅ Created directory: {directory['name']}\")\n",
" print(f\" Directory ID: {directory_id}\")\n",
" print(f\" Project ID: {project_id}\")\n",
"else:\n",
" raise Exception(\n",
" f\"Failed to create directory: {response.status_code} - {response.text}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "cell-8",
"metadata": {},
"source": [
"## Step 2: Upload Files to the Directory\n",
"\n",
"Now we'll upload some files to our directory. For this demo, we'll download some sample PDFs and upload them.\n",
"\n",
"You can replace these with your own files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-9",
"metadata": {},
"outputs": [],
"source": [
"# Create a directory for sample files\n",
"import requests\n",
"\n",
"os.makedirs(\"sample_files\", exist_ok=True)\n",
"\n",
"# Sample documents to download\n",
"sample_docs = {\n",
" \"attention.pdf\": \"https://arxiv.org/pdf/1706.03762.pdf\",\n",
" \"bert.pdf\": \"https://arxiv.org/pdf/1810.04805.pdf\",\n",
"}\n",
"\n",
"# Download sample documents\n",
"for filename, url in sample_docs.items():\n",
" filepath = f\"sample_files/{filename}\"\n",
" if not os.path.exists(filepath):\n",
" print(f\"📥 Downloading {filename}...\")\n",
" response = requests.get(url)\n",
" if response.status_code == 200:\n",
" with open(filepath, \"wb\") as f:\n",
" f.write(response.content)\n",
" print(f\" ✅ Downloaded {filename}\")\n",
" else:\n",
" print(f\" ❌ Failed to download {filename}\")\n",
" else:\n",
" print(f\"📁 {filename} already exists\")\n",
"\n",
"print(\"\\n✅ Sample files ready!\")"
]
},
{
"cell_type": "markdown",
"id": "cell-10",
"metadata": {},
"source": [
"### Upload Files to Directory\n",
"\n",
"Now let's upload the files to our directory using the `upload_file_to_directory` endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-11",
"metadata": {},
"outputs": [],
"source": [
"uploaded_files = []\n",
"\n",
"# Workaround: Use direct HTTP requests instead of SDK due to SDK bug\n",
"import httpx\n",
"\n",
"for filename in os.listdir(\"sample_files\"):\n",
" if filename.endswith(\".pdf\"):\n",
" filepath = f\"sample_files/{filename}\"\n",
"\n",
" print(f\"📤 Uploading {filename}...\")\n",
"\n",
" # Upload file using direct HTTP request (SDK has a bug with file uploads)\n",
" with open(filepath, \"rb\") as f:\n",
" # Prepare the multipart form data correctly\n",
" files = {\"upload_file\": (filename, f, \"application/pdf\")}\n",
"\n",
" # Make the request directly\n",
" response = httpx.post(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/directories/{directory_id}/files/upload\",\n",
" params={\"project_id\": project_id},\n",
" files=files,\n",
" headers={\"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\"},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code in [200, 201]:\n",
" directory_file = response.json()\n",
" uploaded_files.append(directory_file)\n",
" print(f\" ✅ Uploaded: {directory_file.get('display_name')}\")\n",
" print(f\" File ID: {directory_file.get('id')}\")\n",
" else:\n",
" print(f\" ❌ Upload failed: {response.status_code}\")\n",
" print(f\" Error: {response.text[:200]}\")\n",
"\n",
"print(f\"\\n✅ Uploaded {len(uploaded_files)} files to directory\")"
]
},
{
"cell_type": "markdown",
"id": "cell-12",
"metadata": {},
"source": [
"## Step 3: Create a Batch Parse Job\n",
"\n",
"Now that we have files in our directory, let's create a batch parse job to process them all at once.\n",
"\n",
"The batch processing API uses the same configuration as LlamaParse."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-13",
"metadata": {},
"outputs": [],
"source": [
"# Configure the parse job\n",
"# This configuration will apply to all files in the directory\n",
"job_config = {\n",
" \"job_name\": \"parse_raw_file_job\", # Must match the JobNames enum value\n",
" \"partitions\": {},\n",
" \"parameters\": {\n",
" \"type\": \"parse\",\n",
" \"lang\": \"en\",\n",
" \"fast_mode\": True,\n",
" },\n",
"}\n",
"\n",
"print(\"✅ Job configuration created\")\n",
"print(f\" Language: {job_config['parameters']['lang']}\")\n",
"print(f\" Fast mode: {job_config['parameters']['fast_mode']}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-14",
"metadata": {},
"source": [
"### Submit the Batch Job\n",
"\n",
"Now let's submit the batch job to process all files in the directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-15",
"metadata": {},
"outputs": [],
"source": [
"print(f\"🚀 Submitting batch parse job for directory: {directory_id}\")\n",
"print(f\" Processing {len(uploaded_files)} files...\\n\")\n",
"\n",
"# Submit batch job using HTTP request\n",
"response = httpx.post(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" json={\n",
" \"directory_id\": directory_id,\n",
" \"job_config\": job_config,\n",
" \"page_size\": 100, # Number of files to fetch per batch\n",
" \"continue_as_new_threshold\": 10, # Workflow continuation threshold\n",
" },\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code in [200, 201]:\n",
" batch_job = response.json()\n",
" batch_job_id = batch_job[\"id\"]\n",
"\n",
" print(\"✅ Batch job submitted successfully!\")\n",
" print(f\" Batch Job ID: {batch_job_id}\")\n",
" print(f\" Workflow ID: {batch_job.get('workflow_id')}\")\n",
" print(f\" Status: {batch_job.get('status')}\")\n",
" print(f\" Total Items: {batch_job.get('total_items')}\")\n",
"else:\n",
" raise Exception(\n",
" f\"Failed to create batch job: {response.status_code} - {response.text}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "cell-16",
"metadata": {},
"source": [
"## Step 4: Monitor Job Progress\n",
"\n",
"Now let's monitor the batch job progress. We'll poll the status endpoint to see how the job is progressing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-17",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"def print_job_status(status_data):\n",
" \"\"\"Helper function to print job status in a readable format.\"\"\"\n",
" job = status_data[\"job\"]\n",
" progress_pct = status_data[\"progress_percentage\"]\n",
"\n",
" print(f\"\\n{'='*60}\")\n",
" print(f\"Job Status: {job['status']}\")\n",
" print(f\"{'='*60}\")\n",
" print(f\"Total Items: {job['total_items']}\")\n",
" print(f\"Completed: {job['processed_items']}\")\n",
" print(f\"Failed: {job['failed_items']}\")\n",
" print(f\"Skipped: {job['skipped_items']}\")\n",
" print(f\"Progress: {progress_pct:.1f}%\")\n",
"\n",
" if job.get(\"completed_at\"):\n",
" print(f\"Completed At: {job['completed_at']}\")\n",
" elif job.get(\"started_at\"):\n",
" print(f\"Started At: {job['started_at']}\")\n",
"\n",
" print(f\"{'='*60}\")\n",
"\n",
"\n",
"# Poll for status updates\n",
"print(\"🔄 Monitoring batch job progress...\")\n",
"print(\n",
" \"Note: It may take a few seconds for the workflow to initialize and count files.\\n\"\n",
")\n",
"\n",
"max_polls = 60 # Maximum number of status checks (increased for longer jobs)\n",
"poll_interval = 10 # Seconds between checks\n",
"\n",
"for i in range(max_polls):\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/{batch_job_id}\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" status_data = response.json()\n",
" print_job_status(status_data)\n",
"\n",
" # Check if job is complete\n",
" job_status = status_data[\"job\"][\"status\"]\n",
" if job_status in [\"completed\", \"failed\", \"cancelled\"]:\n",
" print(f\"\\n✅ Job finished with status: {job_status}\")\n",
" break\n",
"\n",
" if i < max_polls - 1:\n",
" print(f\"\\n⏳ Waiting {poll_interval} seconds before next check...\")\n",
" time.sleep(poll_interval)\n",
" else:\n",
" print(f\"Error getting status: {response.status_code} - {response.text}\")\n",
" break\n",
"else:\n",
" print(f\"\\n⚠️ Reached maximum polling attempts. Job may still be running.\")"
]
},
{
"cell_type": "markdown",
"id": "cell-18",
"metadata": {},
"source": [
"## Step 5: View Job Items\n",
"\n",
"Let's look at the individual items in the batch job to see which files were processed successfully."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-19",
"metadata": {},
"outputs": [],
"source": [
"# Get all items in the batch job\n",
"response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/{batch_job_id}/items\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id, \"limit\": 100},\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code == 200:\n",
" items_response = response.json()\n",
"\n",
" print(f\"\\n📋 Batch Job Items ({items_response['total_size']} total)\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" for item in items_response[\"items\"]:\n",
" status_emoji = (\n",
" \"✅\"\n",
" if item[\"status\"] == \"completed\"\n",
" else \"❌\"\n",
" if item[\"status\"] == \"failed\"\n",
" else \"⏳\"\n",
" )\n",
" print(f\"{status_emoji} {item['item_name']}\")\n",
" print(f\" Status: {item['status']}\")\n",
" print(f\" Item ID: {item['item_id']}\")\n",
"\n",
" if item.get(\"error_message\"):\n",
" print(f\" Error: {item['error_message']}\")\n",
"\n",
" print()\n",
"else:\n",
" print(f\"Error listing items: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-20",
"metadata": {},
"source": [
"## Step 6: Retrieve Processing Results\n",
"\n",
"For each completed file, we can retrieve the processing results to see where the parsed output is stored."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-21",
"metadata": {},
"outputs": [],
"source": [
"# Get processing results for a specific item\n",
"if items_response[\"items\"]:\n",
" first_item = items_response[\"items\"][0]\n",
"\n",
" print(f\"\\n🔍 Processing results for: {first_item['item_name']}\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/items/{first_item['item_id']}/processing-results\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" results = response.json()\n",
"\n",
" print(f\"Item: {results['item_name']}\")\n",
" print(f\"Total processing runs: {len(results['processing_results'])}\\n\")\n",
"\n",
" for i, result in enumerate(results[\"processing_results\"], 1):\n",
" print(f\"Run {i}:\")\n",
" print(f\" Job Type: {result['job_type']}\")\n",
" print(f\" Processed At: {result['processed_at']}\")\n",
" print(f\" Parameters Hash: {result['parameters_hash']}\")\n",
"\n",
" if result.get(\"output_s3_path\"):\n",
" print(f\" Output S3 Path: {result['output_s3_path']}\")\n",
"\n",
" if result.get(\"output_metadata\"):\n",
" print(f\" Output Metadata: {result['output_metadata']}\")\n",
"\n",
" print()\n",
" else:\n",
" print(f\"Error getting results: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-22",
"metadata": {},
"source": [
"## Optional: List All Batch Jobs\n",
"\n",
"You can also list all batch jobs in your project to see the history of batch processing operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-23",
"metadata": {},
"outputs": [],
"source": [
"# List all parse jobs in the project\n",
"response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id, \"job_type\": \"parse\", \"limit\": 10},\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code == 200:\n",
" jobs_response = response.json()\n",
"\n",
" print(f\"\\n📊 Recent Batch Parse Jobs ({jobs_response['total_size']} total)\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" for job in jobs_response[\"items\"]:\n",
" status_emoji = (\n",
" \"✅\"\n",
" if job[\"status\"] == \"completed\"\n",
" else \"❌\"\n",
" if job[\"status\"] == \"failed\"\n",
" else \"⏳\"\n",
" )\n",
" print(f\"{status_emoji} Job ID: {job['id']}\")\n",
" print(f\" Status: {job['status']}\")\n",
" print(f\" Directory: {job['directory_id']}\")\n",
" print(f\" Total Items: {job['total_items']}\")\n",
" print(f\" Completed: {job['processed_items']}\")\n",
" print(f\" Created: {job['created_at']}\")\n",
" print()\n",
"else:\n",
" print(f\"Error listing jobs: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "uug7591rkq",
"metadata": {},
"source": [
"## Step 7: Retrieve Parsed Text Results\n",
"\n",
"Once the batch job is complete, each BatchJobItem will have a `job_id` field that maps to a parse job ID. We can use this ID with the standard parse client methods to fetch the actual parsed text results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vpp0vxtc0y",
"metadata": {},
"outputs": [],
"source": [
"# Get all completed items and their job IDs\n",
"completed_items = [\n",
" item for item in items_response[\"items\"] if item[\"status\"] == \"completed\"\n",
"]\n",
"\n",
"print(f\"📄 Found {len(completed_items)} completed items\\n\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"# Display the job_id for each completed item\n",
"for item in completed_items:\n",
" print(f\"📝 {item['item_name']}\")\n",
" print(f\" Item ID: {item['item_id']}\")\n",
" print(f\" Parse Job ID: {item['job_id']}\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"id": "4gck6hwpnl6",
"metadata": {},
"source": [
"### Fetch Parsed Text for a Specific Document\n",
"\n",
"Now let's use the `job_id` to retrieve the actual parsed text content using the parse client methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "g191kvgxxvk",
"metadata": {},
"outputs": [],
"source": [
"# Get the parsed text for the first completed item\n",
"if completed_items:\n",
" first_completed = completed_items[0]\n",
"\n",
" print(f\"📖 Retrieving parsed text for: {first_completed['item_name']}\")\n",
" print(f\" Using Parse Job ID: {first_completed['job_id']}\\n\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" # Use the job_id to fetch the parse result\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/text\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" parse_result = response.text\n",
"\n",
" print(f\"✅ Retrieved parsed text ({len(parse_result)} characters)\\n\")\n",
"\n",
" # Display first 1000 characters as a preview\n",
" print(\"Preview (first 1000 characters):\")\n",
" print(\"-\" * 80)\n",
" print(parse_result[:1000])\n",
" print(\"-\" * 80)\n",
"\n",
" if len(parse_result) > 1000:\n",
" print(f\"\\n... and {len(parse_result) - 1000} more characters\")\n",
" else:\n",
" print(\n",
" f\"Error retrieving parse result: {response.status_code} - {response.text}\"\n",
" )\n",
"else:\n",
" print(\"⚠️ No completed items found to retrieve results from\")"
]
},
{
"cell_type": "markdown",
"id": "2olccb4l8fj",
"metadata": {},
"source": [
"### Retrieve Parsed Results in Other Formats\n",
"\n",
"You can also retrieve the parsed results in JSON or Markdown format using different client methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lcqsfxiw0sr",
"metadata": {},
"outputs": [],
"source": [
"if completed_items:\n",
" first_completed = completed_items[0]\n",
"\n",
" print(\n",
" f\"📋 Retrieving parse results in different formats for: {first_completed['item_name']}\\n\"\n",
" )\n",
"\n",
" # Get as JSON (includes structured data with pages, images, etc.)\n",
" print(\"1️⃣ Retrieving as JSON...\")\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/json\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" json_result = response.json()\n",
" print(f\" ✅ JSON result with {len(json_result['pages'])} pages\")\n",
" print(f\" Keys: {list(json_result.keys())}\\n\")\n",
" else:\n",
" print(f\" Error: {response.status_code}\\n\")\n",
"\n",
" # Get as Markdown\n",
" print(\"2️⃣ Retrieving as Markdown...\")\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/markdown\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" markdown_result = response.text\n",
" print(f\" ✅ Markdown result ({len(markdown_result)} characters)\\n\")\n",
"\n",
" # Display markdown preview\n",
" print(\"Markdown Preview (first 500 characters):\")\n",
" print(\"-\" * 80)\n",
" print(markdown_result[:500])\n",
" print(\"-\" * 80)\n",
"\n",
" if len(markdown_result) > 500:\n",
" print(f\"\\n... and {len(markdown_result) - 500} more characters\")\n",
" else:\n",
" print(f\" Error: {response.status_code}\")\n",
"else:\n",
" print(\"⚠️ No completed items found to retrieve results from\")"
]
},
{
"cell_type": "markdown",
"id": "lr61wqkfq3",
"metadata": {},
"source": [
"### Batch Process All Parsed Results\n",
"\n",
"You can also loop through all completed items to retrieve and process all the parsed results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "kltydf9xzkl",
"metadata": {},
"outputs": [],
"source": [
"# Process all completed items\n",
"print(f\"🔄 Processing all {len(completed_items)} completed items...\\n\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"all_results = {}\n",
"\n",
"for item in completed_items:\n",
" print(f\"📄 Processing: {item['item_name']}\")\n",
" print(f\" Parse Job ID: {item['job_id']}\")\n",
"\n",
" try:\n",
" # Retrieve the parsed text for this item\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{item['job_id']}/result/text\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" parsed_text = response.text\n",
"\n",
" all_results[item[\"item_name\"]] = {\n",
" \"job_id\": item[\"job_id\"],\n",
" \"text\": parsed_text,\n",
" \"length\": len(parsed_text),\n",
" }\n",
"\n",
" print(f\" ✅ Retrieved {len(parsed_text)} characters\")\n",
" else:\n",
" all_results[item[\"item_name\"]] = {\n",
" \"job_id\": item[\"job_id\"],\n",
" \"error\": f\"HTTP {response.status_code}\",\n",
" }\n",
" print(f\" ❌ Error: HTTP {response.status_code}\")\n",
"\n",
" except Exception as e:\n",
" print(f\" ❌ Error: {str(e)}\")\n",
" all_results[item[\"item_name\"]] = {\"job_id\": item[\"job_id\"], \"error\": str(e)}\n",
"\n",
" print()\n",
"\n",
"print(f\"{'='*80}\")\n",
"print(f\"\\n✅ Processed {len(all_results)} items\")\n",
"print(f\"\\nSummary:\")\n",
"for name, result in all_results.items():\n",
" if \"error\" in result:\n",
" print(f\" ❌ {name}: Error - {result['error']}\")\n",
" else:\n",
" print(f\" ✅ {name}: {result['length']:,} characters\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}