mirror of
https://github.com/run-llama/llama_extract.git
synced 2026-07-01 01:37:54 -04:00
LlamaExtract
⚠️ EXPERIMENTAL: This library is under active development with frequent breaking changes. APIs and functionality may change significantly between versions. If you're interested in being an early adopter, please contact us at support@llamaindex.ai or join our Discord.
LlamaExtract provides a simple API for extracting structured data from unstructured documents such as PDFs and text files; image support is upcoming.
Quick Start
from llama_extract import LlamaExtract
from pydantic import BaseModel, Field
# Initialize client
extractor = LlamaExtract()
# Define schema using Pydantic
class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")
# Create extraction agent
agent = extractor.create_agent(name="resume-parser", data_schema=Resume)
# Extract data from document
result = agent.extract("resume.pdf")
print(result.data)
Core Concepts
- Extraction Agents: Reusable extractors configured with a specific schema and extraction settings.
- Data Schema: Structure definition for the data you want to extract.
- Extraction Jobs: Asynchronous extraction tasks that can be monitored.
Defining Schemas
Schemas can be defined using either Pydantic models or JSON Schema:
Using Pydantic (Recommended)
from pydantic import BaseModel, Field
from typing import List, Optional
class Experience(BaseModel):
    company: str = Field(description="Company name")
    title: str = Field(description="Job title")
    start_date: Optional[str] = Field(description="Start date of employment")
    end_date: Optional[str] = Field(description="End date of employment")

class Resume(BaseModel):
    name: str = Field(description="Candidate name")
    experience: List[Experience] = Field(description="Work history")
Using JSON Schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Candidate name"},
        "experience": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "title": {"type": "string"},
                    "start_date": {"type": "string"},
                    "end_date": {"type": "string"},
                },
            },
        },
    },
}
agent = extractor.create_agent(name="resume-parser", data_schema=schema)
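Before passing extracted data downstream, it can help to check that it has the shape the schema describes. The following is a minimal standard-library sketch: `conforms` is a hypothetical helper, not part of LlamaExtract, and it covers only the `object`, `array`, and `string` types used in the schema above. The sample records are made up for illustration. For full JSON Schema validation, a dedicated validator library would be a better fit.

```python
# Hypothetical helper, not part of LlamaExtract: a tiny structural check
# that understands only the "object", "array", and "string" types.
def conforms(value, schema):
    kind = schema.get("type")
    if kind == "string":
        return isinstance(value, str)
    if kind == "array":
        item_schema = schema.get("items", {})
        return isinstance(value, list) and all(
            conforms(item, item_schema) for item in value
        )
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Missing keys are tolerated; keys that are present must match.
        return all(
            conforms(value[key], sub) for key, sub in props.items() if key in value
        )
    # Unknown or absent "type": accept anything.
    return True


resume_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "experience": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "title": {"type": "string"},
                },
            },
        },
    },
}

# Made-up sample data, standing in for an extraction result.
sample = {
    "name": "Ada Lovelace",
    "experience": [{"company": "Analytical Engines Ltd", "title": "Programmer"}],
}

print(conforms(sample, resume_schema))        # True
print(conforms({"name": 42}, resume_schema))  # False
```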
Other Extraction APIs
Batch Processing
Process multiple files asynchronously:
# Queue multiple files for extraction
# (`await` needs an async context, e.g. a notebook cell or a coroutine)
jobs = await agent.queue_extraction(["resume1.pdf", "resume2.pdf"])
# Check job status
for job in jobs:
    status = agent.get_extraction_job(job.id).status
    print(f"Job {job.id}: {status}")
# Get results when complete
results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
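Because queued jobs finish at different times, results are only safe to fetch once each job has reached a final state. Below is a rough polling sketch using only the standard library. `wait_for_jobs` is a hypothetical helper, and the status names in `TERMINAL_STATUSES` are assumptions; check which values your installed version actually reports. A stand-in status function replaces the real client so the sketch runs offline.

```python
import time

# Assumed names for final states; adjust to what your client reports.
TERMINAL_STATUSES = {"SUCCESS", "ERROR", "CANCELLED"}


def wait_for_jobs(get_status, job_ids, poll_interval=2.0, timeout=300.0,
                  terminal=TERMINAL_STATUSES):
    """Poll get_status(job_id) until every job is terminal or the timeout passes.

    Returns a dict mapping each job id to the last status seen.
    """
    deadline = time.monotonic() + timeout
    statuses = {job_id: None for job_id in job_ids}
    while True:
        for job_id in job_ids:
            # Only re-check jobs that have not finished yet.
            if statuses[job_id] not in terminal:
                statuses[job_id] = get_status(job_id)
        if all(status in terminal for status in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            return statuses  # Timed out; some jobs may still be running.
        time.sleep(poll_interval)


# Stand-in for the real status lookup, so the sketch runs without a server.
_fake_progress = {
    "job-1": iter(["PENDING", "SUCCESS"]),
    "job-2": iter(["PENDING", "PENDING", "ERROR"]),
}


def fake_status(job_id):
    return next(_fake_progress[job_id])


final = wait_for_jobs(fake_status, ["job-1", "job-2"], poll_interval=0.01)
print(final)  # {'job-1': 'SUCCESS', 'job-2': 'ERROR'}
```

With a real agent, the status function could be `lambda job_id: agent.get_extraction_job(job_id).status` and the ids `[job.id for job in jobs]`, assuming the status values compare equal to the strings in `terminal`.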
Updating Schemas
An agent's schema can be modified after the agent is created:
# Assign a new schema (a Pydantic model or a JSON Schema dict)
agent.data_schema = new_schema
# Save changes
agent.save()
Managing Agents
# List all agents
agents = extractor.list_agents()
# Get specific agent
agent = extractor.get_agent(name="resume-parser")
# Delete agent
extractor.delete_agent(agent.id)
Installation
pip install llama-extract==0.1.0
Tips & Best Practices
- Schema Design:
  - Make fields optional when data might not always be present.
  - Use descriptive field names and detailed descriptions. Descriptions can also carry formatting instructions or few-shot examples.
  - Start simple and iterate on schema complexity.
- Running Extractions:
  - Assigning a new schema to agent.data_schema does not save it to the database until you call agent.save(), but the new schema is used for any extractions run in the meantime.
  - Check job status before accessing results. Any extraction error should be available in the job.error or extraction_run.error fields for debugging.
  - Consider async operations (queue_extraction) for large-scale extraction once you have finalized your schema.
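The schema-design tips can be illustrated with a small JSON Schema. This is a sketch under assumptions: the field names and description wording are invented for illustration, and this README does not confirm how LlamaExtract treats the `required` list or nullable types. In plain JSON Schema, a field is optional when it is left out of `required`, and `["string", "null"]` allows an explicit null.

```python
# Illustrative only: field names and descriptions are invented for this sketch.
resume_schema = {
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "Full name of the candidate as written on the resume",
        },
        "graduation_date": {
            "type": "string",
            # Formatting instruction carried in the description.
            "description": "Graduation date formatted as YYYY-MM, e.g. '2019-06'",
        },
        "phone": {
            # Nullable and not required: the document may not list a phone number.
            "type": ["string", "null"],
            "description": "Phone number in international format, e.g. '+1 555 0100'",
        },
    },
    "required": ["name", "graduation_date"],
}

# Fields that may be absent from a conforming result.
optional_fields = sorted(
    set(resume_schema["properties"]) - set(resume_schema["required"])
)
print(optional_fields)  # ['phone']
```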
Additional Resources
- Example Notebook - Detailed walkthrough of resume parsing
- Discord Community - Get help and share feedback