{
"cells": [
{
"cell_type": "markdown",
"id": "cell-0",
"metadata": {},
"source": [
"# Batch Parse with LlamaCloud Directories\n",
"\n",
"This notebook demonstrates how to use LlamaCloud's batch processing API to parse multiple files in a directory. The workflow includes:\n",
"\n",
"1. **Creating a Directory** - Set up a directory to organize your files\n",
"2. **Uploading Files** - Upload multiple files to the directory\n",
"3. **Starting a Batch Parse Job** - Kick off batch processing on all files\n",
"4. **Monitoring Progress** - Poll the job status until it finishes\n",
"5. **Retrieving Results** - Inspect individual items and fetch the parsed output\n",
"\n",
"This is useful when you need to parse many documents at once, as the batch API handles the orchestration and provides progress tracking."
]
},
{
"cell_type": "markdown",
"id": "0c2b5e1a",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"id": "cell-1",
"metadata": {},
"source": [
"## Setup and Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-2",
"metadata": {},
"outputs": [],
"source": [
"%pip install llama-cloud httpx requests python-dotenv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-3",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"import httpx\n",
"\n",
"# Load environment variables from a local .env file, if present\n",
"load_dotenv()\n",
"\n",
"# Set your API key (\"llx-...\" is only a placeholder)\n",
"LLAMA_CLOUD_API_KEY = os.environ.get(\"LLAMA_CLOUD_API_KEY\", \"llx-...\")\n",
"\n",
"# Optional: Set base URL (defaults to https://api.cloud.llamaindex.ai if not set)\n",
"LLAMA_CLOUD_BASE_URL = os.environ.get(\n",
"    \"LLAMA_CLOUD_BASE_URL\", \"https://api.cloud.llamaindex.ai\"\n",
")\n",
"\n",
"# Optional: Set project_id if you have one, otherwise it will use your default project\n",
"PROJECT_ID = os.environ.get(\"LLAMA_CLOUD_PROJECT_ID\")\n",
"\n",
"if LLAMA_CLOUD_API_KEY == \"llx-...\":\n",
"    print(\"⚠️ LLAMA_CLOUD_API_KEY is not set; replace the placeholder before continuing\")\n",
"else:\n",
"    print(\"✅ API key configured\")\n",
"print(f\" Base URL: {LLAMA_CLOUD_BASE_URL}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-4",
"metadata": {},
"source": [
"## Set Up the HTTP Client\n",
"\n",
"Because the current version of the llama-cloud SDK has some issues with the beta endpoints, this notebook calls the REST API directly with `httpx`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-5",
"metadata": {},
"outputs": [],
"source": [
"# Authentication headers shared by the requests below\n",
"headers = {\n",
"    \"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\",\n",
"}\n",
"\n",
"print(\"✅ HTTP client configured\")\n",
"print(f\" Using base URL: {LLAMA_CLOUD_BASE_URL}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-6",
"metadata": {},
"source": [
"## Step 1: Create a Directory\n",
"\n",
"First, we'll create a directory to organize our files. Directories help you group related files together for batch processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-7",
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create a directory with a timestamp in the name\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"directory_name = f\"batch-parse-demo-{timestamp}\"\n",
"\n",
"# Create directory using HTTP request.\n",
"# Only send project_id when one is configured: httpx serializes None as an\n",
"# empty query value, which the API may reject.\n",
"response = httpx.post(\n",
"    f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/directories\",\n",
"    headers=headers,\n",
"    params={\"project_id\": PROJECT_ID} if PROJECT_ID else None,\n",
"    json={\n",
"        \"name\": directory_name,\n",
"        \"description\": \"Demo directory for batch parse example\",\n",
"    },\n",
"    timeout=60.0,\n",
")\n",
"\n",
"if response.status_code in [200, 201]:\n",
"    directory = response.json()\n",
"    directory_id = directory[\"id\"]\n",
"    project_id = directory[\"project_id\"]\n",
"\n",
"    print(f\"✅ Created directory: {directory['name']}\")\n",
"    print(f\" Directory ID: {directory_id}\")\n",
"    print(f\" Project ID: {project_id}\")\n",
"else:\n",
"    raise Exception(\n",
"        f\"Failed to create directory: {response.status_code} - {response.text}\"\n",
"    )"
]
},
{
"cell_type": "markdown",
"id": "cell-8",
"metadata": {},
"source": [
"## Step 2: Upload Files to the Directory\n",
"\n",
"Now we'll upload some files to our directory. For this demo, we'll download some sample PDFs and upload them.\n",
"\n",
"You can replace these with your own files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-9",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Create a local directory for sample files\n",
"os.makedirs(\"sample_files\", exist_ok=True)\n",
"\n",
"# Sample documents to download\n",
"sample_docs = {\n",
"    \"attention.pdf\": \"https://arxiv.org/pdf/1706.03762.pdf\",\n",
"    \"bert.pdf\": \"https://arxiv.org/pdf/1810.04805.pdf\",\n",
"}\n",
"\n",
"# Download sample documents (skipping any that are already on disk)\n",
"for filename, url in sample_docs.items():\n",
"    filepath = f\"sample_files/{filename}\"\n",
"    if not os.path.exists(filepath):\n",
"        print(f\"📥 Downloading {filename}...\")\n",
"        response = requests.get(url, timeout=60)\n",
"        if response.status_code == 200:\n",
"            with open(filepath, \"wb\") as f:\n",
"                f.write(response.content)\n",
"            print(f\" ✅ Downloaded {filename}\")\n",
"        else:\n",
"            print(f\" ❌ Failed to download {filename} (HTTP {response.status_code})\")\n",
"    else:\n",
"        print(f\"📁 {filename} already exists\")\n",
"\n",
"print(\"\\n✅ Sample files ready!\")"
]
},
{
"cell_type": "markdown",
"id": "cell-10",
"metadata": {},
"source": [
"### Upload Files to Directory\n",
"\n",
"Now let's upload the files to our directory using the `upload_file_to_directory` endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-11",
"metadata": {},
"outputs": [],
"source": [
"uploaded_files = []\n",
"\n",
"# Workaround: use direct HTTP requests instead of the SDK, which has a bug with file uploads\n",
"for filename in os.listdir(\"sample_files\"):\n",
"    if filename.endswith(\".pdf\"):\n",
"        filepath = f\"sample_files/{filename}\"\n",
"\n",
"        print(f\"📤 Uploading {filename}...\")\n",
"\n",
"        with open(filepath, \"rb\") as f:\n",
"            # Prepare the multipart form data correctly\n",
"            files = {\"upload_file\": (filename, f, \"application/pdf\")}\n",
"\n",
"            # Make the request while the file handle is still open\n",
"            response = httpx.post(\n",
"                f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/directories/{directory_id}/files/upload\",\n",
"                params={\"project_id\": project_id},\n",
"                files=files,\n",
"                headers=headers,\n",
"                timeout=60.0,\n",
"            )\n",
"\n",
"        if response.status_code in [200, 201]:\n",
"            directory_file = response.json()\n",
"            uploaded_files.append(directory_file)\n",
"            print(f\" ✅ Uploaded: {directory_file.get('display_name')}\")\n",
"            print(f\" File ID: {directory_file.get('id')}\")\n",
"        else:\n",
"            print(f\" ❌ Upload failed: {response.status_code}\")\n",
"            print(f\" Error: {response.text[:200]}\")\n",
"\n",
"print(f\"\\n✅ Uploaded {len(uploaded_files)} files to directory\")"
]
},
{
"cell_type": "markdown",
"id": "cell-12",
"metadata": {},
"source": [
"## Step 3: Create a Batch Parse Job\n",
"\n",
"Now that we have files in our directory, let's create a batch parse job to process them all at once.\n",
"\n",
"The batch processing API uses the same configuration as LlamaParse."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-13",
"metadata": {},
"outputs": [],
"source": [
"# Configure the parse job\n",
"# This configuration will apply to all files in the directory\n",
"job_config = {\n",
"    \"job_name\": \"parse_raw_file_job\",  # Must match the JobNames enum value\n",
"    \"partitions\": {},\n",
"    \"parameters\": {\n",
"        \"type\": \"parse\",\n",
"        \"lang\": \"en\",\n",
"        \"fast_mode\": True,\n",
"    },\n",
"}\n",
"\n",
"print(\"✅ Job configuration created\")\n",
"print(f\" Language: {job_config['parameters']['lang']}\")\n",
"print(f\" Fast mode: {job_config['parameters']['fast_mode']}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-14",
"metadata": {},
"source": [
"### Submit the Batch Job\n",
"\n",
"Now let's submit the batch job to process all files in the directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-15",
"metadata": {},
"outputs": [],
"source": [
"print(f\"🚀 Submitting batch parse job for directory: {directory_id}\")\n",
"print(f\" Processing {len(uploaded_files)} files...\\n\")\n",
"\n",
"# Submit batch job using HTTP request\n",
"response = httpx.post(\n",
"    f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing\",\n",
"    headers=headers,\n",
"    params={\"project_id\": project_id},\n",
"    json={\n",
"        \"directory_id\": directory_id,\n",
"        \"job_config\": job_config,\n",
"        \"page_size\": 100,  # Number of files to fetch per batch\n",
"        \"continue_as_new_threshold\": 10,  # Workflow continuation threshold\n",
"    },\n",
"    timeout=60.0,\n",
")\n",
"\n",
"if response.status_code in [200, 201]:\n",
"    batch_job = response.json()\n",
"    batch_job_id = batch_job[\"id\"]\n",
"\n",
"    print(\"✅ Batch job submitted successfully!\")\n",
"    print(f\" Batch Job ID: {batch_job_id}\")\n",
"    print(f\" Workflow ID: {batch_job.get('workflow_id')}\")\n",
"    print(f\" Status: {batch_job.get('status')}\")\n",
"    print(f\" Total Items: {batch_job.get('total_items')}\")\n",
"else:\n",
"    raise Exception(\n",
"        f\"Failed to create batch job: {response.status_code} - {response.text}\"\n",
"    )"
]
},
{
"cell_type": "markdown",
"id": "cell-16",
"metadata": {},
"source": [
"## Step 4: Monitor Job Progress\n",
"\n",
"Now let's monitor the batch job progress. We'll poll the status endpoint to see how the job is progressing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-17",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"def print_job_status(status_data):\n",
"    \"\"\"Helper function to print job status in a readable format.\"\"\"\n",
"    job = status_data[\"job\"]\n",
"    progress_pct = status_data[\"progress_percentage\"]\n",
"\n",
"    print(f\"\\n{'='*60}\")\n",
"    print(f\"Job Status: {job['status']}\")\n",
"    print(f\"{'='*60}\")\n",
"    print(f\"Total Items: {job['total_items']}\")\n",
"    print(f\"Completed: {job['processed_items']}\")\n",
"    print(f\"Failed: {job['failed_items']}\")\n",
"    print(f\"Skipped: {job['skipped_items']}\")\n",
"    print(f\"Progress: {progress_pct:.1f}%\")\n",
"\n",
"    if job.get(\"completed_at\"):\n",
"        print(f\"Completed At: {job['completed_at']}\")\n",
"    elif job.get(\"started_at\"):\n",
"        print(f\"Started At: {job['started_at']}\")\n",
"\n",
"    print(f\"{'='*60}\")\n",
"\n",
"\n",
"# Poll for status updates\n",
"print(\"🔄 Monitoring batch job progress...\")\n",
"print(\n",
"    \"Note: It may take a few seconds for the workflow to initialize and count files.\\n\"\n",
")\n",
"\n",
"max_polls = 60  # Maximum number of status checks (increased for longer jobs)\n",
"poll_interval = 10  # Seconds between checks\n",
"\n",
"for i in range(max_polls):\n",
"    response = httpx.get(\n",
"        f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/{batch_job_id}\",\n",
"        headers=headers,\n",
"        params={\"project_id\": project_id},\n",
"        timeout=60.0,\n",
"    )\n",
"\n",
"    if response.status_code == 200:\n",
"        status_data = response.json()\n",
"        print_job_status(status_data)\n",
"\n",
"        # Stop polling once the job reaches a terminal state\n",
"        job_status = status_data[\"job\"][\"status\"]\n",
"        if job_status in [\"completed\", \"failed\", \"cancelled\"]:\n",
"            icon = \"✅\" if job_status == \"completed\" else \"❌\"\n",
"            print(f\"\\n{icon} Job finished with status: {job_status}\")\n",
"            break\n",
"\n",
"        if i < max_polls - 1:\n",
"            print(f\"\\n⏳ Waiting {poll_interval} seconds before next check...\")\n",
"            time.sleep(poll_interval)\n",
"    else:\n",
"        print(f\"Error getting status: {response.status_code} - {response.text}\")\n",
"        break\n",
"else:\n",
"    # for/else: runs only when the loop finished without hitting a break\n",
"    print(\"\\n⚠️ Reached maximum polling attempts. Job may still be running.\")"
]
},
{
"cell_type": "markdown",
"id": "cell-18",
"metadata": {},
"source": [
"## Step 5: View Job Items\n",
"\n",
"Let's look at the individual items in the batch job to see which files were processed successfully."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-19",
"metadata": {},
"outputs": [],
"source": [
"# Get all items in the batch job\n",
"response = httpx.get(\n",
"    f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/{batch_job_id}/items\",\n",
"    headers=headers,\n",
"    params={\"project_id\": project_id, \"limit\": 100},\n",
"    timeout=60.0,\n",
")\n",
"\n",
"if response.status_code == 200:\n",
"    items_response = response.json()\n",
"\n",
"    print(f\"\\n📋 Batch Job Items ({items_response['total_size']} total)\")\n",
"    print(f\"{'='*80}\\n\")\n",
"\n",
"    for item in items_response[\"items\"]:\n",
"        status_emoji = (\n",
"            \"✅\"\n",
"            if item[\"status\"] == \"completed\"\n",
"            else \"❌\"\n",
"            if item[\"status\"] == \"failed\"\n",
"            else \"⏳\"\n",
"        )\n",
"        print(f\"{status_emoji} {item['item_name']}\")\n",
"        print(f\" Status: {item['status']}\")\n",
"        print(f\" Item ID: {item['item_id']}\")\n",
"\n",
"        if item.get(\"error_message\"):\n",
"            print(f\" Error: {item['error_message']}\")\n",
"\n",
"        print()\n",
"else:\n",
"    print(f\"Error listing items: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-20",
"metadata": {},
"source": [
"## Step 6: Retrieve Processing Results\n",
"\n",
"For each completed file, we can retrieve the processing results to see where the parsed output is stored."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-21",
"metadata": {},
"outputs": [],
"source": [
"# Get processing results for a specific item\n",
"if items_response[\"items\"]:\n",
"    first_item = items_response[\"items\"][0]\n",
"\n",
"    print(f\"\\n🔍 Processing results for: {first_item['item_name']}\")\n",
"    print(f\"{'='*80}\\n\")\n",
"\n",
"    response = httpx.get(\n",
"        f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/items/{first_item['item_id']}/processing-results\",\n",
"        headers=headers,\n",
"        params={\"project_id\": project_id},\n",
"        timeout=60.0,\n",
"    )\n",
"\n",
"    if response.status_code == 200:\n",
"        results = response.json()\n",
"\n",
"        print(f\"Item: {results['item_name']}\")\n",
"        print(f\"Total processing runs: {len(results['processing_results'])}\\n\")\n",
"\n",
"        for i, result in enumerate(results[\"processing_results\"], 1):\n",
"            print(f\"Run {i}:\")\n",
"            print(f\" Job Type: {result['job_type']}\")\n",
"            print(f\" Processed At: {result['processed_at']}\")\n",
"            print(f\" Parameters Hash: {result['parameters_hash']}\")\n",
"\n",
"            if result.get(\"output_s3_path\"):\n",
"                print(f\" Output S3 Path: {result['output_s3_path']}\")\n",
"\n",
"            if result.get(\"output_metadata\"):\n",
"                print(f\" Output Metadata: {result['output_metadata']}\")\n",
"\n",
"            print()\n",
"    else:\n",
"        print(f\"Error getting results: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-22",
"metadata": {},
"source": [
"## Optional: List All Batch Jobs\n",
"\n",
"You can also list all batch jobs in your project to see the history of batch processing operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-23",
"metadata": {},
"outputs": [],
"source": [
"# List all parse jobs in the project\n",
"response = httpx.get(\n",
"    f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing\",\n",
"    headers=headers,\n",
"    params={\"project_id\": project_id, \"job_type\": \"parse\", \"limit\": 10},\n",
"    timeout=60.0,\n",
")\n",
"\n",
"if response.status_code == 200:\n",
"    jobs_response = response.json()\n",
"\n",
"    print(f\"\\n📊 Recent Batch Parse Jobs ({jobs_response['total_size']} total)\")\n",
"    print(f\"{'='*80}\\n\")\n",
"\n",
"    for job in jobs_response[\"items\"]:\n",
"        status_emoji = (\n",
"            \"✅\"\n",
"            if job[\"status\"] == \"completed\"\n",
"            else \"❌\"\n",
"            if job[\"status\"] == \"failed\"\n",
"            else \"⏳\"\n",
"        )\n",
"        print(f\"{status_emoji} Job ID: {job['id']}\")\n",
"        print(f\" Status: {job['status']}\")\n",
"        print(f\" Directory: {job['directory_id']}\")\n",
"        print(f\" Total Items: {job['total_items']}\")\n",
"        print(f\" Completed: {job['processed_items']}\")\n",
"        print(f\" Created: {job['created_at']}\")\n",
"        print()\n",
"else:\n",
"    print(f\"Error listing jobs: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "uug7591rkq",
"metadata": {},
"source": [
"## Step 7: Retrieve Parsed Text Results\n",
"\n",
"Once the batch job is complete, each BatchJobItem has a `job_id` field that maps to a parse job ID. We can pass this ID to the standard parsing result endpoints to fetch the actual parsed text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vpp0vxtc0y",
"metadata": {},
"outputs": [],
"source": [
"# Get all completed items and their job IDs\n",
"completed_items = [\n",
"    item for item in items_response[\"items\"] if item[\"status\"] == \"completed\"\n",
"]\n",
"\n",
"print(f\"📄 Found {len(completed_items)} completed items\\n\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"# Display the job_id for each completed item\n",
"for item in completed_items:\n",
"    print(f\"📝 {item['item_name']}\")\n",
"    print(f\" Item ID: {item['item_id']}\")\n",
"    print(f\" Parse Job ID: {item['job_id']}\")\n",
"    print()"
]
},
{
"cell_type": "markdown",
"id": "4gck6hwpnl6",
"metadata": {},
"source": [
"### Fetch Parsed Text for a Specific Document\n",
"\n",
"Now let's use the `job_id` to retrieve the actual parsed text content from the parsing result endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "g191kvgxxvk",
"metadata": {},
"outputs": [],
"source": [
"# Get the parsed text for the first completed item\n",
"if completed_items:\n",
"    first_completed = completed_items[0]\n",
"\n",
"    print(f\"📖 Retrieving parsed text for: {first_completed['item_name']}\")\n",
"    print(f\" Using Parse Job ID: {first_completed['job_id']}\\n\")\n",
"    print(f\"{'='*80}\\n\")\n",
"\n",
"    # Use the job_id to fetch the parse result\n",
"    response = httpx.get(\n",
"        f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/text\",\n",
"        headers=headers,\n",
"        params={\"project_id\": project_id},\n",
"        timeout=60.0,\n",
"    )\n",
"\n",
"    if response.status_code == 200:\n",
"        parse_result = response.text\n",
"\n",
"        print(f\"✅ Retrieved parsed text ({len(parse_result)} characters)\\n\")\n",
"\n",
"        # Display first 1000 characters as a preview\n",
"        print(\"Preview (first 1000 characters):\")\n",
"        print(\"-\" * 80)\n",
"        print(parse_result[:1000])\n",
"        print(\"-\" * 80)\n",
"\n",
"        if len(parse_result) > 1000:\n",
"            print(f\"\\n... and {len(parse_result) - 1000} more characters\")\n",
"    else:\n",
"        print(\n",
"            f\"Error retrieving parse result: {response.status_code} - {response.text}\"\n",
"        )\n",
"else:\n",
"    print(\"⚠️ No completed items found to retrieve results from\")"
]
},
{
"cell_type": "markdown",
"id": "2olccb4l8fj",
"metadata": {},
"source": [
"### Retrieve Parsed Results in Other Formats\n",
"\n",
"You can also retrieve the parsed results in JSON or Markdown format by calling the corresponding result endpoints."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lcqsfxiw0sr",
"metadata": {},
"outputs": [],
"source": [
"if completed_items:\n",
"    first_completed = completed_items[0]\n",
"\n",
"    print(\n",
"        f\"📋 Retrieving parse results in different formats for: {first_completed['item_name']}\\n\"\n",
"    )\n",
"\n",
"    # Get as JSON (includes structured data with pages, images, etc.)\n",
"    print(\"1️⃣ Retrieving as JSON...\")\n",
"    response = httpx.get(\n",
"        f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/json\",\n",
"        headers=headers,\n",
"        params={\"project_id\": project_id},\n",
"        timeout=60.0,\n",
"    )\n",
"\n",
"    if response.status_code == 200:\n",
"        json_result = response.json()\n",
"        print(f\" ✅ JSON result with {len(json_result['pages'])} pages\")\n",
"        print(f\" Keys: {list(json_result.keys())}\\n\")\n",
"    else:\n",
"        print(f\" Error: {response.status_code}\\n\")\n",
"\n",
"    # Get as Markdown\n",
"    print(\"2️⃣ Retrieving as Markdown...\")\n",
"    response = httpx.get(\n",
"        f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/markdown\",\n",
"        headers=headers,\n",
"        params={\"project_id\": project_id},\n",
"        timeout=60.0,\n",
"    )\n",
"\n",
"    if response.status_code == 200:\n",
"        markdown_result = response.text\n",
"        print(f\" ✅ Markdown result ({len(markdown_result)} characters)\\n\")\n",
"\n",
"        # Display markdown preview\n",
"        print(\"Markdown Preview (first 500 characters):\")\n",
"        print(\"-\" * 80)\n",
"        print(markdown_result[:500])\n",
"        print(\"-\" * 80)\n",
"\n",
"        if len(markdown_result) > 500:\n",
"            print(f\"\\n... and {len(markdown_result) - 500} more characters\")\n",
"    else:\n",
"        print(f\" Error: {response.status_code}\")\n",
"else:\n",
"    print(\"⚠️ No completed items found to retrieve results from\")"
]
},
{
"cell_type": "markdown",
"id": "lr61wqkfq3",
"metadata": {},
"source": [
"### Batch Process All Parsed Results\n",
"\n",
"You can also loop through all completed items to retrieve and process all the parsed results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "kltydf9xzkl",
"metadata": {},
"outputs": [],
"source": [
"# Process all completed items\n",
"print(f\"🔄 Processing all {len(completed_items)} completed items...\\n\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"all_results = {}\n",
"\n",
"for item in completed_items:\n",
"    print(f\"📄 Processing: {item['item_name']}\")\n",
"    print(f\" Parse Job ID: {item['job_id']}\")\n",
"\n",
"    try:\n",
"        # Retrieve the parsed text for this item\n",
"        response = httpx.get(\n",
"            f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{item['job_id']}/result/text\",\n",
"            headers=headers,\n",
"            params={\"project_id\": project_id},\n",
"            timeout=60.0,\n",
"        )\n",
"\n",
"        if response.status_code == 200:\n",
"            parsed_text = response.text\n",
"\n",
"            all_results[item[\"item_name\"]] = {\n",
"                \"job_id\": item[\"job_id\"],\n",
"                \"text\": parsed_text,\n",
"                \"length\": len(parsed_text),\n",
"            }\n",
"\n",
"            print(f\" ✅ Retrieved {len(parsed_text)} characters\")\n",
"        else:\n",
"            all_results[item[\"item_name\"]] = {\n",
"                \"job_id\": item[\"job_id\"],\n",
"                \"error\": f\"HTTP {response.status_code}\",\n",
"            }\n",
"            print(f\" ❌ Error: HTTP {response.status_code}\")\n",
"\n",
"    except httpx.HTTPError as e:\n",
"        # Network-level failures (timeouts, connection errors) are recorded per item\n",
"        print(f\" ❌ Error: {str(e)}\")\n",
"        all_results[item[\"item_name\"]] = {\"job_id\": item[\"job_id\"], \"error\": str(e)}\n",
"\n",
"    print()\n",
"\n",
"print(f\"{'='*80}\")\n",
"print(f\"\\n✅ Processed {len(all_results)} items\")\n",
"print(\"\\nSummary:\")\n",
"for name, result in all_results.items():\n",
"    if \"error\" in result:\n",
"        print(f\" ❌ {name}: Error - {result['error']}\")\n",
"    else:\n",
"        print(f\" ✅ {name}: {result['length']:,} characters\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}