Update docs for extract (#852)

* Update docs for extract

* add more details on async
This commit is contained in:
Neeraj Pradhan
2025-08-07 13:59:53 -07:00
committed by GitHub
parent b407a5edb5
commit c56fb5d8f7
+93 -21
View File
@@ -2,6 +2,29 @@
LlamaExtract provides a simple API for extracting structured data from unstructured documents like PDFs, text files and images.
## Table of Contents
- [Quick Start](#quick-start)
- [Supported File Types](#supported-file-types)
- [Different Input Types](#different-input-types)
- [Async Extraction](#async-extraction)
- [Core Concepts](#core-concepts)
- [Defining Schemas](#defining-schemas)
- [Using Pydantic (Recommended)](#using-pydantic-recommended)
- [Using JSON Schema](#using-json-schema)
- [Important restrictions on JSON/Pydantic Schema](#important-restrictions-on-jsonpydantic-schema)
- [Extraction Configuration](#extraction-configuration)
- [Configuration Options](#configuration-options)
- [Extraction Agents (Advanced)](#extraction-agents-advanced)
- [Creating Agents](#creating-agents)
- [Agent Batch Processing](#agent-batch-processing)
- [Updating Agent Schemas](#updating-agent-schemas)
- [Managing Agents](#managing-agents)
- [When to Use Agents vs Direct Extraction](#when-to-use-agents-vs-direct-extraction)
- [Installation](#installation)
- [Tips & Best Practices](#tips--best-practices)
- [Additional Resources](#additional-resources)
## Quick Start
The simplest way to get started is to use the stateless API with the extraction configuration and the file/text to extract from:
@@ -12,7 +35,7 @@ from llama_cloud import ExtractConfig, ExtractMode
from pydantic import BaseModel, Field
# Initialize client
extractor = LlamaExtract()
extractor = LlamaExtract(api_key="YOUR_API_KEY")
# Define schema using Pydantic
@@ -64,7 +87,9 @@ result = extractor.extract(Resume, config, SourceText(text_content=text))
### Async Extraction
For better performance with multiple files or when integrating with async applications:
For better performance with multiple files or when integrating with async applications.
Here `queue_extraction` will enqueue the extraction jobs and exit. Alternatively, you
can use `aextract` to poll for the job and return the extraction results.
```python
import asyncio
@@ -80,10 +105,18 @@ async def extract_resumes():
Resume, config, ["resume1.pdf", "resume2.pdf"]
)
print(f"Queued {len(jobs)} extraction jobs")
return jobs
# Run async function
asyncio.run(extract_resumes())
jobs = asyncio.run(extract_resumes())
# Check job status
for job in jobs:
status = agent.get_extraction_job(job.id).status
print(f"Job {job.id}: {status}")
# Get results when complete
results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
```
## Core Concepts
@@ -159,24 +192,6 @@ config = ExtractConfig(extraction_mode=ExtractMode.FAST)
result = extractor.extract(schema, config, "resume.pdf")
```
## Extraction Configuration
Configure how extraction is performed using `ExtractConfig`:
```python
from llama_cloud import ExtractConfig, ExtractMode
# Fast extraction (lower accuracy, faster processing)
fast_config = ExtractConfig(extraction_mode=ExtractMode.FAST)
# Balanced extraction (good balance of speed and accuracy)
balanced_config = ExtractConfig(extraction_mode=ExtractMode.BALANCED)
# Use different configs for different needs
result = extractor.extract(schema, fast_config, "simple_document.pdf")
result = extractor.extract(schema, balanced_config, "complex_document.pdf")
```
### Important restrictions on JSON/Pydantic Schema
_LlamaExtract only supports a subset of the JSON Schema specification._ While limited, it should
@@ -194,6 +209,62 @@ be sufficient for a wide variety of use-cases.
your extraction workflow to fit within these constraints, e.g. by extracting subset of fields
and later merging them together.
## Extraction Configuration
Configure how extraction is performed using `ExtractConfig`. The schema is the most important part, but several configuration options can significantly impact the extraction process.
```python
from llama_cloud import ExtractConfig, ExtractMode, ChunkMode, ExtractTarget
# Basic configuration
config = ExtractConfig(
extraction_mode=ExtractMode.BALANCED, # FAST, BALANCED, MULTIMODAL, PREMIUM
extraction_target=ExtractTarget.PER_DOC, # PER_DOC, PER_PAGE
system_prompt="Focus on the most recent data",
page_range="1-5,10-15", # Extract from specific pages
)
# Advanced configuration
advanced_config = ExtractConfig(
extraction_mode=ExtractMode.MULTIMODAL,
chunk_mode=ChunkMode.PAGE, # PAGE, SECTION
high_resolution_mode=True, # Better OCR accuracy
invalidate_cache=False, # Bypass cached results
cite_sources=True, # Enable source citations
use_reasoning=True, # Enable reasoning (not in FAST mode)
confidence_scores=True, # MULTIMODAL/PREMIUM only
)
```
### Key Configuration Options
**Extraction Mode**: Controls processing quality and speed
- `FAST`: Fastest processing, suitable for simple documents with no OCR
- `BALANCED`: Good speed/accuracy tradeoff for text-rich documents
- `MULTIMODAL`: For visually rich documents with text, tables, and images (recommended)
- `PREMIUM`: Highest accuracy with OCR, complex table/header detection
**Extraction Target**: Defines extraction scope
- `PER_DOC`: Apply schema to entire document (default)
- `PER_PAGE`: Apply schema to each page, returns array of results
**Advanced Options**:
- `system_prompt`: Additional system-level instructions
- `page_range`: Specific pages to extract (e.g., "1,3,5-7,9")
- `chunk_mode`: Document splitting strategy (`PAGE` or `SECTION`)
- `high_resolution_mode`: Better OCR for small text (slower processing)
**Extensions** (return additional metadata):
- `cite_sources`: Source tracing for extracted fields
- `use_reasoning`: Explanations for extraction decisions
- `confidence_scores`: Quantitative confidence measures (MULTIMODAL/PREMIUM only)
For complete configuration options, advanced settings, and detailed examples, see the [LlamaExtract Configuration Documentation](https://docs.cloud.llamaindex.ai/llamaextract/features/options).
## Extraction Agents (Advanced)
For reusable extraction workflows, you can create extraction agents that encapsulate both schema and configuration:
@@ -326,6 +397,7 @@ Another option (orthogonal to the above) is to break the document into smaller s
## Additional Resources
- [Extract Documentation](https://docs.cloud.llamaindex.ai/llamaextract/getting_started) - Details on Extract features, API and examples.
- [Example Notebook](docs/examples-py/extract/resume_screening.ipynb) - Detailed walkthrough of resume parsing
- [Example Application with TypeScript](./examples-ts/extract/) - End-to-end examples using LlamaExtract TypeScript client.
- [Discord Community](https://discord.com/invite/eN6D2HQ4aX) - Get help and share feedback