mirror of
https://github.com/run-llama/llama_cloud_services.git
synced 2026-07-01 21:44:37 -04:00
Update docs for extract (#852)
* Update docs for extract * add more details on async
This commit is contained in:
+93
-21
@@ -2,6 +2,29 @@
|
||||
|
||||
LlamaExtract provides a simple API for extracting structured data from unstructured documents like PDFs, text files and images.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Quick Start](#quick-start)
|
||||
- [Supported File Types](#supported-file-types)
|
||||
- [Different Input Types](#different-input-types)
|
||||
- [Async Extraction](#async-extraction)
|
||||
- [Core Concepts](#core-concepts)
|
||||
- [Defining Schemas](#defining-schemas)
|
||||
- [Using Pydantic (Recommended)](#using-pydantic-recommended)
|
||||
- [Using JSON Schema](#using-json-schema)
|
||||
- [Important restrictions on JSON/Pydantic Schema](#important-restrictions-on-jsonpydantic-schema)
|
||||
- [Extraction Configuration](#extraction-configuration)
|
||||
- [Configuration Options](#configuration-options)
|
||||
- [Extraction Agents (Advanced)](#extraction-agents-advanced)
|
||||
- [Creating Agents](#creating-agents)
|
||||
- [Agent Batch Processing](#agent-batch-processing)
|
||||
- [Updating Agent Schemas](#updating-agent-schemas)
|
||||
- [Managing Agents](#managing-agents)
|
||||
- [When to Use Agents vs Direct Extraction](#when-to-use-agents-vs-direct-extraction)
|
||||
- [Installation](#installation)
|
||||
- [Tips & Best Practices](#tips--best-practices)
|
||||
- [Additional Resources](#additional-resources)
|
||||
|
||||
## Quick Start
|
||||
|
||||
The simplest way to get started is to use the stateless API with the extraction configuration and the file/text to extract from:
|
||||
@@ -12,7 +35,7 @@ from llama_cloud import ExtractConfig, ExtractMode
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
# Initialize client
|
||||
extractor = LlamaExtract()
|
||||
extractor = LlamaExtract(api_key="YOUR_API_KEY")
|
||||
|
||||
|
||||
# Define schema using Pydantic
|
||||
@@ -64,7 +87,9 @@ result = extractor.extract(Resume, config, SourceText(text_content=text))
|
||||
|
||||
### Async Extraction
|
||||
|
||||
For better performance with multiple files or when integrating with async applications:
|
||||
For better performance with multiple files or when integrating with async applications.
|
||||
Here `queue_extraction` will enqueue the extraction jobs and exit. Alternatively, you
|
||||
can use `aextract` to poll for the job and return the extraction results.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
@@ -80,10 +105,18 @@ async def extract_resumes():
|
||||
Resume, config, ["resume1.pdf", "resume2.pdf"]
|
||||
)
|
||||
print(f"Queued {len(jobs)} extraction jobs")
|
||||
return jobs
|
||||
|
||||
|
||||
# Run async function
|
||||
asyncio.run(extract_resumes())
|
||||
jobs = asyncio.run(extract_resumes())
|
||||
# Check job status
|
||||
for job in jobs:
|
||||
status = agent.get_extraction_job(job.id).status
|
||||
print(f"Job {job.id}: {status}")
|
||||
|
||||
# Get results when complete
|
||||
results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
|
||||
```
|
||||
|
||||
## Core Concepts
|
||||
@@ -159,24 +192,6 @@ config = ExtractConfig(extraction_mode=ExtractMode.FAST)
|
||||
result = extractor.extract(schema, config, "resume.pdf")
|
||||
```
|
||||
|
||||
## Extraction Configuration
|
||||
|
||||
Configure how extraction is performed using `ExtractConfig`:
|
||||
|
||||
```python
|
||||
from llama_cloud import ExtractConfig, ExtractMode
|
||||
|
||||
# Fast extraction (lower accuracy, faster processing)
|
||||
fast_config = ExtractConfig(extraction_mode=ExtractMode.FAST)
|
||||
|
||||
# Balanced extraction (good balance of speed and accuracy)
|
||||
balanced_config = ExtractConfig(extraction_mode=ExtractMode.BALANCED)
|
||||
|
||||
# Use different configs for different needs
|
||||
result = extractor.extract(schema, fast_config, "simple_document.pdf")
|
||||
result = extractor.extract(schema, balanced_config, "complex_document.pdf")
|
||||
```
|
||||
|
||||
### Important restrictions on JSON/Pydantic Schema
|
||||
|
||||
_LlamaExtract only supports a subset of the JSON Schema specification._ While limited, it should
|
||||
@@ -194,6 +209,62 @@ be sufficient for a wide variety of use-cases.
|
||||
your extraction workflow to fit within these constraints, e.g. by extracting subset of fields
|
||||
and later merging them together.
|
||||
|
||||
## Extraction Configuration
|
||||
|
||||
Configure how extraction is performed using `ExtractConfig`. The schema is the most important part, but several configuration options can significantly impact the extraction process.
|
||||
|
||||
```python
|
||||
from llama_cloud import ExtractConfig, ExtractMode, ChunkMode, ExtractTarget
|
||||
|
||||
# Basic configuration
|
||||
config = ExtractConfig(
|
||||
extraction_mode=ExtractMode.BALANCED, # FAST, BALANCED, MULTIMODAL, PREMIUM
|
||||
extraction_target=ExtractTarget.PER_DOC, # PER_DOC, PER_PAGE
|
||||
system_prompt="Focus on the most recent data",
|
||||
page_range="1-5,10-15", # Extract from specific pages
|
||||
)
|
||||
|
||||
# Advanced configuration
|
||||
advanced_config = ExtractConfig(
|
||||
extraction_mode=ExtractMode.MULTIMODAL,
|
||||
chunk_mode=ChunkMode.PAGE, # PAGE, SECTION
|
||||
high_resolution_mode=True, # Better OCR accuracy
|
||||
invalidate_cache=False, # Bypass cached results
|
||||
cite_sources=True, # Enable source citations
|
||||
use_reasoning=True, # Enable reasoning (not in FAST mode)
|
||||
confidence_scores=True, # MULTIMODAL/PREMIUM only
|
||||
)
|
||||
```
|
||||
|
||||
### Key Configuration Options
|
||||
|
||||
**Extraction Mode**: Controls processing quality and speed
|
||||
|
||||
- `FAST`: Fastest processing, suitable for simple documents with no OCR
|
||||
- `BALANCED`: Good speed/accuracy tradeoff for text-rich documents
|
||||
- `MULTIMODAL`: For visually rich documents with text, tables, and images (recommended)
|
||||
- `PREMIUM`: Highest accuracy with OCR, complex table/header detection
|
||||
|
||||
**Extraction Target**: Defines extraction scope
|
||||
|
||||
- `PER_DOC`: Apply schema to entire document (default)
|
||||
- `PER_PAGE`: Apply schema to each page, returns array of results
|
||||
|
||||
**Advanced Options**:
|
||||
|
||||
- `system_prompt`: Additional system-level instructions
|
||||
- `page_range`: Specific pages to extract (e.g., "1,3,5-7,9")
|
||||
- `chunk_mode`: Document splitting strategy (`PAGE` or `SECTION`)
|
||||
- `high_resolution_mode`: Better OCR for small text (slower processing)
|
||||
|
||||
**Extensions** (return additional metadata):
|
||||
|
||||
- `cite_sources`: Source tracing for extracted fields
|
||||
- `use_reasoning`: Explanations for extraction decisions
|
||||
- `confidence_scores`: Quantitative confidence measures (MULTIMODAL/PREMIUM only)
|
||||
|
||||
For complete configuration options, advanced settings, and detailed examples, see the [LlamaExtract Configuration Documentation](https://docs.cloud.llamaindex.ai/llamaextract/features/options).
|
||||
|
||||
## Extraction Agents (Advanced)
|
||||
|
||||
For reusable extraction workflows, you can create extraction agents that encapsulate both schema and configuration:
|
||||
@@ -326,6 +397,7 @@ Another option (orthogonal to the above) is to break the document into smaller s
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- [Extract Documentation](https://docs.cloud.llamaindex.ai/llamaextract/getting_started) - Details on Extract features, API and examples.
|
||||
- [Example Notebook](docs/examples-py/extract/resume_screening.ipynb) - Detailed walkthrough of resume parsing
|
||||
- [Example Application with TypeScript](./examples-ts/extract/) - End-to-end examples using LlamaExtract TypeScript client.
|
||||
- [Discord Community](https://discord.com/invite/eN6D2HQ4aX) - Get help and share feedback
|
||||
|
||||
Reference in New Issue
Block a user