Update readmes on UI starters to link to datasets and explain application in more detail (#64)

* Update readmes on UI starters to link to datasets and be a little more explanatory

* Update README.md

* Update README.md

* Fix typo, version

* version
Adrian Lyjak
2025-11-24 14:16:18 -05:00
committed by GitHub
parent 299f1d6068
commit 28a1c6b7af
+62 -18
@@ -1,31 +1,75 @@
# Data Extraction and Ingestion
# Invoice Extraction and Contract Reconciliation
This is a starter for LlamaAgents. See the [LlamaAgents (llamactl) getting started guide](https://developers.llamaindex.ai/python/llamaagents/llamactl/getting-started/) for context on local development and deployment.
This template provides a LlamaAgents application for extracting structured data from invoices
and reconciling it against contract documents using LlamaExtract, LlamaCloud Index, and Agent Data.
It helps finance and operations teams validate that incoming invoices comply with agreed contract terms
by automatically detecting mismatches in payment terms, totals, and other key fields.
To run the application, install [`uv`](https://docs.astral.sh/uv/) and run `uvx llamactl serve`.
# Running the application
## Simple customizations
This is a starter for LlamaAgents. See the
[LlamaAgents (llamactl) getting started guide](https://developers.llamaindex.ai/python/llamaagents/llamactl/getting-started/)
for context on local development and deployment.
For some basic customizations, you can modify `src/extraction_review/config.py`
To run the application locally, clone this repo, install [`uv`](https://docs.astral.sh/uv/) and run `uvx llamactl serve`.
- **`USE_REMOTE_EXTRACTION_SCHEMA`**: Set to `False` to define your own Pydantic `ExtractionSchema` in this file. Set to `True` to reuse the schema from an existing LlamaCloud Extraction Agent.
- **`EXTRACTION_AGENT_NAME`**: Logical name for your Extraction Agent. When `USE_REMOTE_EXTRACTION_SCHEMA` is `False`, this name is used to upsert the agent with your local schema; when `True`, it is used to fetch an existing agent.
- **`EXTRACTED_DATA_COLLECTION`**: The Agent Data collection name used to store extractions (namespaced by agent name and environment).
- **`ExtractionSchema`**: When using a local schema, edit this Pydantic model to match the fields you want extracted. Prefer optional types where possible to allow for partial extractions.
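
As a rough illustration of the "prefer optional types" advice, a local schema might look like the sketch below. The field names (`vendor_name`, `invoice_date`, `total`, `line_items`) and the `LineItem` model are hypothetical and are not taken from the starter's actual `config.py`; only the class name `ExtractionSchema` comes from the text above.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    # Hypothetical nested model, for illustration only.
    description: Optional[str] = None
    quantity: Optional[float] = None
    unit_price: Optional[float] = None


class ExtractionSchema(BaseModel):
    # Every scalar field is optional, so a document where only some
    # values could be read still validates (a partial extraction).
    vendor_name: Optional[str] = Field(default=None, description="Name of the vendor")
    invoice_date: Optional[str] = Field(default=None, description="Invoice date as printed")
    total: Optional[float] = Field(default=None, description="Grand total")
    # Missing line items default to an empty list rather than failing validation.
    line_items: List[LineItem] = Field(default_factory=list)


# A partial result: only the vendor was extracted.
partial = ExtractionSchema(vendor_name="Acme Corp")
```

With required fields instead, the same partial result would raise a validation error, which is presumably why optional types are recommended here.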
This application can also be deployed directly to [LlamaCloud](https://cloud.llamaindex.ai) via the UI,
or with `llamactl deployment create`.
The UI fetches the JSON Schema and collection name from the backend metadata workflow at runtime, and dynamically
generates an editing UI based on the schema.
## Features
## Complex customizations
- **Invoice data extraction**: Uses a Pydantic `InvoiceExtractionSchema` to extract key invoice fields
(vendor, dates, PO number, line items, subtotals, tax, totals, and more) via a LlamaExtract agent.
- **Contract indexing and retrieval**: Includes an `index-contract` workflow that downloads contract files
from LlamaCloud and indexes them into a dedicated `contracts` LlamaCloud Index for retrieval.
- **Automated reconciliation**: Matches invoices to the most relevant contracts using retrieval plus an LLM,
then produces an `InvoiceWithReconciliation` record with match confidence, rationale, and structured discrepancies.
- **Agent Data storage**: Stores reconciled invoice records in LlamaCloud Agent Data, deduplicated by file hash,
so that re-processing the same file replaces prior results instead of duplicating them.
- **UI integration**: A web UI lets you upload invoices and contracts, monitor workflow progress,
and review or edit extracted and reconciled data.
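
The "deduplicated by file hash" behaviour can be pictured with a small in-memory sketch. This is not the Agent Data API: `InMemoryStore` is a hypothetical stand-in, and the choice of SHA-256 is an assumption, since the README does not say which hash the application uses.

```python
import hashlib
from typing import Any, Dict


def file_hash(content: bytes) -> str:
    # Assumption: SHA-256 over the raw file bytes.
    return hashlib.sha256(content).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for an Agent Data collection, keyed by file hash."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def upsert(self, content: bytes, data: Dict[str, Any]) -> str:
        # Writing under the same hash overwrites the earlier record,
        # so re-processing a file never produces a duplicate entry.
        key = file_hash(content)
        self._records[key] = {"file_hash": key, "data": data}
        return key

    def get(self, key: str) -> Dict[str, Any]:
        return self._records[key]

    def __len__(self) -> int:
        return len(self._records)
```

Calling `upsert` twice with identical bytes leaves a single record holding the most recent data, which mirrors the replace-instead-of-duplicate behaviour described in the list above.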
For more complex customizations, you can edit the rest of the application. For example, you could
- Modify the existing file processing workflow to provide additional context for the extraction process
- Take further action based on the extracted data.
- Add additional workflows to submit data upon approval.
## Example Documents
Sample invoice and contract PDF files for testing the application are available in the
[llama-datasets repository](https://github.com/run-llama/llama-datasets/tree/main/llama_agents/invoice-contracts).
## Configuration
All main configuration is in `src/extraction_review/config.py`.
## How It Works
The application uses a multi-step workflow powered by LlamaIndex:
1. **File Upload**: Users upload invoice or contract documents through the UI, which are stored in LlamaCloud.
2. **Index Contracts**: Contract files are processed by the `index-contract` workflow and indexed into
the `contracts` LlamaCloud Index.
3. **Download Invoice**: The `process-file` workflow downloads the selected invoice file from LlamaCloud storage.
4. **Extraction**: A LlamaExtract agent runs against the invoice using `InvoiceExtractionSchema`, returning
structured invoice data plus field-level metadata.
5. **Contract Retrieval**: The workflow queries the contracts index with a query built from invoice fields
(vendor, PO number, invoice number, etc.) and retrieves the most relevant contracts.
6. **Reconciliation**: An LLM compares the invoice to the retrieved contracts, selects the best match,
and produces an `InvoiceWithReconciliation` object with match confidence, rationale, and discrepancy list.
7. **Storage**: The reconciled invoice data is wrapped in an `ExtractedData` record (including file hash)
and stored in Agent Data, replacing any previous records for the same file hash.
8. **Review**: The UI displays the stored data for review, editing, and export.
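
Steps 5 and 6 above can be pictured with a small, rule-based sketch. In the application itself the comparison is performed by an LLM, so the code below is only a stand-in that shows the rough shape of a retrieval query and a discrepancy list. All names (`build_contract_query`, `find_discrepancies`, `Discrepancy`, and the dictionary keys) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Discrepancy:
    # Hypothetical record shape; the real schema may differ.
    field_name: str
    invoice_value: Optional[str]
    contract_value: Optional[str]


def build_contract_query(invoice: dict) -> str:
    # Build a retrieval query from whichever identifying fields were extracted.
    parts = []
    for key in ("vendor", "po_number", "invoice_number"):
        value = invoice.get(key)
        if value:
            parts.append(f"{key}: {value}")
    return "; ".join(parts)


def find_discrepancies(
    invoice: dict,
    contract: dict,
    fields: Sequence[str] = ("payment_terms", "total"),
) -> List[Discrepancy]:
    # Rule-based stand-in for the LLM comparison: flag fields present on
    # both sides whose values differ; skip fields missing on either side.
    found = []
    for name in fields:
        inv, con = invoice.get(name), contract.get(name)
        if inv is None or con is None:
            continue
        if inv != con:
            found.append(Discrepancy(name, str(inv), str(con)))
    return found
```

An exact-equality check like this would miss equivalent phrasings (for example "Net 30" versus "30 days net"), which is one reason an LLM is a plausible fit for the real reconciliation step.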
### Workflows
The application includes three main workflows:
- **`process-file`** (`src/extraction_review/process_file.py`): Main workflow for processing invoices
end-to-end (download → extract → reconcile → store).
- **`index-contract`** (`src/extraction_review/index_contract.py`): Workflow for downloading and indexing
contract documents into a LlamaCloud Index for later retrieval during reconciliation.
- **`metadata`** (`src/extraction_review/metadata_workflow.py`): Exposes configuration metadata to the UI,
returning the JSON Schema for `InvoiceWithReconciliation` and the Agent Data collection name.
## Linting and type checking
Python and javascript pacakges contain helpful scripts to lint, format, and type check the code.
Python and JavaScript packages contain helpful scripts to lint, format, and type-check the code.
To check and fix Python code:
@@ -45,4 +89,4 @@ pnpm run typecheck
pnpm run test
# run all at once
pnpm run all-fix
```
```