Compare commits

...

213 Commits

Author SHA1 Message Date
Logan f385e96ab8 Delete parse.md 2026-03-24 19:27:52 -06:00
Logan c3e4696b5f Delete index.md 2026-03-24 19:27:41 -06:00
Logan 1e40c9cf94 Delete extract.md 2026-03-24 19:27:25 -06:00
Logan 802bc2a9f8 Add deprecation notice and clean up README
Added deprecation notice and removed outdated content.
2026-03-24 19:26:59 -06:00
Neeraj Pradhan 5ea758b853 More robust extract tests with pytest xdist (#1117) 2026-02-16 16:16:15 -08:00
dependabot[bot] 208b6f2fa5 build(deps): bump slackapi/slack-github-action from 1.27.0 to 2.1.1 (#1092)
Bumps [slackapi/slack-github-action](https://github.com/slackapi/slack-github-action) from 1.27.0 to 2.1.1.
- [Release notes](https://github.com/slackapi/slack-github-action/releases)
- [Commits](https://github.com/slackapi/slack-github-action/compare/v1.27.0...v2.1.1)

---
updated-dependencies:
- dependency-name: slackapi/slack-github-action
  dependency-version: 2.1.1
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-14 21:03:05 -06:00
github-actions[bot] e1b9143f79 chore: version packages (#1116)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-02-13 15:29:09 -08:00
Neeraj Pradhan 232c55bd6a Bump up patch version (#1115) 2026-02-13 15:20:52 -08:00
Neeraj Pradhan ab6f2f8da5 Allows xlsx files in the sdk for extract (#1114) 2026-02-13 14:44:25 -08:00
github-actions[bot] 66c2639ec8 chore: version packages (#1112) 2026-02-11 15:18:43 -06:00
Logan da1916c69f more loudly deprecate ancient llama-parse package (#1111) 2026-02-11 15:16:01 -06:00
Neeraj Pradhan 345e272573 Lower frequency for e2e tests (#1110) 2026-02-11 09:07:15 -08:00
github-actions[bot] d70fbac1ce chore: version packages (#1103) 2026-02-02 11:46:39 -06:00
Logan 2358df10c6 add notice (don't merge until ready) (#1065) 2026-02-02 11:42:47 -06:00
Neeraj Pradhan 829628cc86 Use unique filenames when running dist tests (#1101) 2026-01-30 14:00:27 -08:00
Neeraj Pradhan 42b7bbd1ae Use sonnet when testing premium mode in extract e2e (#1098)
* Use sonnet when testing premium mode in extract e2e

* fix parse model
2026-01-27 16:16:48 -08:00
Neeraj Pradhan 38da9a52d7 Invalidate cache when running extract tests (#1097) 2026-01-26 17:33:23 -08:00
Neeraj Pradhan 1e7ec40ee7 Fix verbose logging on slack channel (#1096) 2026-01-26 17:12:50 -08:00
Neeraj Pradhan dd83c1a9d0 Add retries to all extract sdk functions uniformly (#1095) 2026-01-26 12:05:16 -08:00
Neeraj Pradhan 7cb83f5cd3 Change cron schedule for hourly extract tests (#1094) 2026-01-26 10:15:34 -08:00
Neeraj Pradhan b05266be6d Try to reparse scheduled workflow (#1093) 2026-01-26 09:56:22 -08:00
Neeraj Pradhan eab4798165 Force github reparse of the workflow (#1090) 2026-01-23 11:36:28 -08:00
Neeraj Pradhan b174fa8fab Run hourly extract tests to catch SDK schema drifts (#1089)
* Run hourly extract tests to catch SDK schema drifts

* fix url

* fix prod/staging env
2026-01-22 18:18:45 -08:00
github-actions[bot] b12ffef916 chore: version packages (#1087)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-21 12:44:43 -08:00
Neeraj Pradhan 07ec282257 Bump up patch version for python packages (#1086) 2026-01-21 12:30:23 -08:00
Neeraj Pradhan 013b689812 Bump up minor version for python packages (#1085) 2026-01-21 12:13:13 -08:00
Adrian Lyjak 3040951cb8 Use error description in invalid extraction error (#1081)
* fix: display extraction job error in InvalidExtractionData exception

Refactored InvalidExtractionData to read the `error` field from
ExtractRun and prominently display it in the exception message.
The job-level error is now stored in the `extraction_error` attribute
and included in the invalid_item's metadata as `job_error`.

* Create three-yaks-beg.md

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-01-18 17:43:21 -05:00
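The behaviour described in the commit above (#1081) can be pictured with a small sketch. This is not the SDK's code: `FakeExtractRun` and `InvalidExtractionDataSketch` are hypothetical stand-ins, and only the names `error`, `extraction_error`, and `job_error` come from the commit message. The message wording is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeExtractRun:
    """Hypothetical stand-in for ExtractRun; only `error` and `data` are modelled."""

    error: Optional[str] = None
    data: Optional[Dict[str, Any]] = None


class InvalidExtractionDataSketch(Exception):
    """Illustrative exception that surfaces the job-level error (assumed shape)."""

    def __init__(self, run: FakeExtractRun) -> None:
        # Keep the job-level error on the exception itself.
        self.extraction_error = run.error
        # Also copy it into the invalid item's metadata under `job_error`.
        self.invalid_item = {
            "data": run.data,
            "metadata": {"job_error": run.error},
        }
        message = "Extraction data failed validation"
        if run.error:
            # Put the job error first so it is hard to miss in a traceback.
            message = f"Extraction job error: {run.error}. {message}"
        super().__init__(message)
```

Under these assumptions, a caller that catches the exception can read `exc.extraction_error` directly instead of parsing the message string.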
github-actions[bot] 9239498945 chore: version packages (#1076)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-14 19:15:05 +01:00
Pierre-Loic Doulcet 19cbb25631 remove extension filter (#1075)
* remove extension filter

* changeset

* Update ninety-goats-look.md

Make it a patch version

* Update package.json

back out of version bump

* Update pyproject.toml

back out of version bump

* Update package.json

back out of version bump

* Update pyproject.toml

back out of version bump

---------

Co-authored-by: Adrian Lyjak <adrianlyjak@gmail.com>
2026-01-14 19:13:39 +01:00
github-actions[bot] 812e2f7d72 chore: version packages (#1073)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-12 19:03:13 +01:00
Clelia (Astra) Bertelli d7864afe3f fix: bug fix retry logic in Classify and Extract (#1066)
* fix: bug fix retry logic in Classify and Extract

* chore: apply suggestion

* chore: add PARTIAL_SUCCESS to classify
2026-01-12 18:57:40 +01:00
github-actions[bot] ade8d027a5 chore: version packages (#1071)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-09 20:29:00 -05:00
Adrian Lyjak 997bcc8531 forgot ts changeset (#1070) 2026-01-09 20:23:29 -05:00
github-actions[bot] 8be554c234 chore: version packages (#1068)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-09 18:56:51 -05:00
Adrian Lyjak f777cab0c5 Add bounding box type support to TS too (#1069)
ts too
2026-01-09 18:55:16 -05:00
Adrian Lyjak b9b83c953d Parse bounding boxes from extract jobs results in agent data (#1067) 2026-01-09 18:47:57 -05:00
github-actions[bot] 3ec7024626 chore: version packages (#1058) 2025-12-10 11:53:30 -06:00
Logan d5b18a03fa Remove generate from build path to fix publishing (#1057) 2025-12-10 11:52:43 -06:00
Clelia (Astra) Bertelli 18dd04b6de docs: correct links in readme (#1056) 2025-12-10 17:08:58 +01:00
github-actions[bot] 685a5e6ccc chore: version packages (#1054) 2025-12-09 15:30:13 -06:00
Jim Geurts 576c3d9076 feat: support zod v4 & v3 (#1052) 2025-12-09 15:29:23 -06:00
Logan c8321d2bc5 improve parse ts polling (#1053) 2025-12-09 15:21:19 -06:00
Tuana Çelik 131bbed7aa batch parse sctript with asyncio (#1051)
* batch parse sctript with asyncio

* lint

---------

Co-authored-by: Logan Markewich <logan.markewich@live.com>
2025-12-08 18:50:11 +01:00
Javier Torres 41c8ac2348 docs: Split Example Notebook (#1044)
* split notebook

* Lint
2025-12-08 13:57:20 +01:00
github-actions[bot] 32c53cdf96 chore: version packages (#1046) 2025-12-04 20:43:29 -06:00
Logan 71db318fc2 add tier/version to api (#1045) 2025-12-04 20:42:17 -06:00
George He dac0f79e51 Fix sheets API client (#1032) 2025-12-03 16:39:47 -06:00
github-actions[bot] 32487763d5 chore: version packages (#1043) 2025-12-03 14:52:26 -06:00
Daniel Bustamante Ospina 06c3c556e6 Add new fields to SpreadsheetParsingConfig and update validation tests (#1042) 2025-12-03 14:50:23 -06:00
github-actions[bot] e5dcaa83df chore: version packages (#1041)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-12-03 11:03:36 -08:00
Neeraj Pradhan 1b7198dc62 Bump llama cloud services and parse versions (#1040) 2025-12-03 10:39:35 -08:00
github-actions[bot] 9cfe074206 chore: version packages (#1039) 2025-12-02 12:16:50 -06:00
Logan ae30990ada line level bbox (#1038) 2025-12-02 12:12:17 -06:00
github-actions[bot] 8f1c359abc chore: version packages (#1037) 2025-12-02 09:50:07 -06:00
Logan 0a110de9c7 Dummy release (#1036) 2025-12-02 09:45:52 -06:00
github-actions[bot] d705b16923 chore: version packages (#1035) 2025-12-02 09:43:20 -06:00
Logan ca781132c8 No more presigned URLs by default (#1034) 2025-12-02 09:41:49 -06:00
Roman Isecke 7a68b0fb68 docs: add batch parse directory example notebook (#1009)
* create notebook to parse a batch of documents

* remove local dev code

* tidy

* don't git track the sample pdfs

* update notebook to use client

* add logic to fetch parse results using job id from batch item

* generate example for fetching results via parse job id

* fix linting

* convert notebook to use httpx rather than client for now

* fix linting
2025-12-01 13:57:18 -05:00
George He 87dec5433d Add timeouts to E2E GHA (#1031)
* Add timeouts

* Session timeouts too
2025-11-27 14:57:59 -08:00
Pierre-Loic Doulcet 99f4eba8d0 Pierre/more parse parameters (#1027)
* up python sdk

* bupmVErsion

* Update py/llama_cloud_services/parse/base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update py/llama_cloud_services/parse/base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-25 14:43:27 +01:00
github-actions[bot] 54561e2dd2 chore: version packages (#1025) 2025-11-24 16:41:22 -06:00
Logan Markewich bfaec79a8f changeset 2025-11-24 16:37:58 -06:00
Logan Markewich 3e0e522a6b update ts 2025-11-24 16:36:31 -06:00
Logan Markewich f70b6d87ec update py 2025-11-24 16:31:15 -06:00
Logan Markewich 693b5b83b1 improve llama-sheets example 2025-11-24 09:44:11 -06:00
Neeraj Pradhan ad38ef5cd7 Add notebook for tabular extraction (#1017) 2025-11-18 09:47:07 -08:00
Logan Markewich 4c4c6e6575 fix sheets test 2025-11-17 16:14:29 -06:00
github-actions[bot] 740b47d9dc chore: version packages (#1016)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-11-17 16:11:18 -06:00
Logan f3233deb2e propagate retrieval metadata to retrieved nodes (#1015) 2025-11-17 16:06:52 -06:00
github-actions[bot] fd45127678 chore: version packages (#1014)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-11-17 21:18:09 +01:00
Clelia (Astra) Bertelli 0506c88735 chore: rename classifyclient and keep it backward compatible (#1013)
* chore: rename classifyclient and keep it backward compatible

* chore: Replace ClassifyClient in notebooks

* chore: changesets
2025-11-17 21:16:23 +01:00
Logan 4bc9eb6c0d beta sheets API (#992) 2025-11-17 11:32:06 -06:00
Patricia 5a3dac655c Add support for custom metadata in file upload methods (#1012) 2025-11-17 11:18:11 -06:00
github-actions[bot] 519254efbe chore: version packages (#999)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-11-04 14:18:27 -05:00
Adrian Lyjak 6ab56b79f3 fix version breaking (#998) 2025-11-04 14:14:38 -05:00
Adrian Lyjak e020e3e2b1 Remove organization id from classify (#997) 2025-11-04 14:05:19 -05:00
Adrian Lyjak f293547910 destructured keyword params for classify (#996) 2025-11-04 14:04:41 -05:00
github-actions[bot] 662bc37462 chore: version packages (#995) 2025-11-03 20:15:50 -06:00
Neeraj Pradhan 9f1ef4ef1f Bump to version 0.6.78 (#994) 2025-11-03 20:11:18 -06:00
github-actions[bot] 1243573924 chore: version packages (#991) 2025-10-30 10:11:16 -06:00
Preston Carlson 407292b177 Fix: Return partial results on job failure (#990)
* Return partial result on failed job, especially job id

* Maintains NO_DATA_FOUND_IN_FILE throw behavior
2025-10-23 13:44:41 -07:00
Clelia (Astra) Bertelli a7df7c0912 docs: add llamaclassify demo (#989) 2025-10-23 17:38:57 +02:00
github-actions[bot] c758144bfe chore: version packages (#988)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-10-22 14:41:44 +02:00
Clelia (Astra) Bertelli fee516dd19 feat: add classify to ts sdk (#985)
* feat: add classify to ts sdk

* ci: changesets

* chore: camelCase for everyone; refactor: slimmer logic for fileContents/filePaths handling

* chore: implement claude suggestions
2025-10-22 14:39:20 +02:00
Neeraj Pradhan 032fbd5768 Add common SourceText class for classify/extract text inputs (#986) 2025-10-21 13:37:41 -07:00
Jerry Liu 970e864514 improve classify notebook (#983) 2025-10-20 10:07:35 -07:00
github-actions[bot] d0649ece6e chore: version packages (#982) 2025-10-16 16:58:29 -06:00
MartijnLeplae 5d4cabd843 Add ImageNode support in TypeScript (#969) 2025-10-16 16:56:28 -06:00
github-actions[bot] 9070a6ac16 chore: version packages (#981) 2025-10-15 12:01:34 -06:00
Bogdan Gheorghe 4f24f537f6 Add agressive table extraction argument (#980) 2025-10-15 11:57:34 -06:00
github-actions[bot] 8859a203e2 chore: version packages (#977) 2025-10-14 19:03:36 -06:00
dependabot[bot] b091364054 build(deps): bump astral-sh/setup-uv from 6 to 7 (#974) 2025-10-14 19:02:32 -06:00
dependabot[bot] 43b1a013ca build(deps): bump github/codeql-action from 3 to 4 (#973) 2025-10-14 19:02:20 -06:00
Logan f81532e7f2 safest types possible for parse (#976) 2025-10-14 19:02:07 -06:00
github-actions[bot] 986d3987d3 chore: version packages (#965)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-10-14 08:14:49 -06:00
Logan 1bf522311f fix default bbox values (#975) 2025-10-14 07:44:35 -06:00
Preston Carlson 24166dcfc8 Only escape single dollar sign in notebook md (#964)
* Limit escaping to lone dollar signs - preserve double dollar for latex equations

* Updated uv.lock via make lint

* Patch bump

* Unit test for _format_markdown_for_notebook

Test doesn't depend on getting real results/is just testing a string manipulation function, so inserting before other tests. Should move to its own file if we add additional formatting configurations
2025-10-07 08:06:03 -07:00
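The escaping rule from the commit above (#964) can be sketched as follows. This is a minimal illustration, not the repository's `_format_markdown_for_notebook`: the helper name `escape_lone_dollars` and the regular expression are assumptions, chosen to escape a single `$` while leaving `$$` LaTeX delimiters and already-escaped `\$` alone.

```python
import re

# A "$" that is not preceded by a backslash or another "$",
# and is not followed by another "$".
_LONE_DOLLAR = re.compile(r"(?<![\\$])\$(?!\$)")


def escape_lone_dollars(markdown: str) -> str:
    """Escape lone dollar signs; keep $$...$$ blocks untouched (hypothetical helper)."""
    # A callable replacement avoids any backslash handling in the template string.
    return _LONE_DOLLAR.sub(lambda match: "\\$", markdown)


if __name__ == "__main__":
    print(escape_lone_dollars("costs $5 and $$x^2$$"))
```

The lookbehind also skips a `$` that is already escaped, so applying the helper twice gives the same result as applying it once.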
github-actions[bot] bfb7f3973f chore: version packages (#956)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-10-06 11:15:55 -04:00
dependabot[bot] 979f643c77 build(deps): bump actions/checkout from 4 to 5 (#961) 2025-10-06 09:12:38 -06:00
dependabot[bot] aefd89cf1b build(deps): bump actions/setup-python from 5 to 6 (#960) 2025-10-06 09:12:30 -06:00
dependabot[bot] 8ea2b2c64e build(deps): bump pnpm/action-setup from 3 to 4 (#959) 2025-10-06 09:12:20 -06:00
dependabot[bot] 4a9a2a21d8 build(deps): bump astral-sh/setup-uv from 3 to 6 (#958) 2025-10-06 09:12:08 -06:00
Logan e6a7939206 loosen packaging requirements (#962) 2025-10-06 09:11:57 -06:00
Adrian Lyjak 104a03e829 fix: re-enable js publishing (#963) 2025-10-06 11:10:46 -04:00
Terry Zhao 6e0f2f4ca0 citation can be null (#869)
* citation can be null

* Add changeset

---------

Co-authored-by: Terry Zhao <terryzhao@runllama.ai>
Co-authored-by: Adrian Lyjak <adrianlyjak@gmail.com>
2025-10-04 16:26:11 -04:00
dependabot[bot] 0708d11f8a Bump actions/setup-node from 4 to 5 (#909)
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 4 to 5.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-04 16:21:50 -04:00
github-actions[bot] be19185503 chore: version packages (#954)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-10-03 20:14:04 -04:00
Adrian Lyjak 7571b0d6c4 Missed some things again with tag fixes (#955)
guh
2025-10-03 20:12:53 -04:00
Adrian Lyjak ad6734bf80 fixup tagging more better (#953)
* fix: correct private field type in py/package.json to be recognized by pnpm

* use packages more directly, make public

* add bump

* fix crash
2025-10-03 19:53:57 -04:00
github-actions[bot] 9ec2a8322e chore: version packages (#952) 2025-10-03 15:11:14 -06:00
Logan 51011b9f30 fix changeset harder (#951) 2025-10-03 15:09:58 -06:00
Logan 09805f9e15 swap changesets (#949) 2025-10-03 15:06:00 -06:00
Adrian Lyjak 8ced6f6eab fix: explicitly tag. I thought the action did this (#948) 2025-10-03 16:59:41 -04:00
Preston Carlson 081ddeca34 Escaping dollar signs in md output when running in a jupyter notebook (#945) 2025-10-03 14:52:26 -06:00
Adrian Lyjak 2460908789 Disable npm release (#946) 2025-10-03 16:13:16 -04:00
Adrian Lyjak c226d6a54c Fix more bugs in publishing (#944) 2025-10-03 11:16:43 -04:00
Adrian Lyjak 5d4c682eb2 fix: theres just one publish token (#943) 2025-10-03 10:56:10 -04:00
github-actions[bot] f72d3535c8 chore: version packages (#941)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-10-03 10:25:11 -04:00
Adrian Lyjak 1ea09a366e Update llama-cloud dep (#940) 2025-10-03 09:56:56 -04:00
Adrian Lyjak d4bbeb6389 ignore nvmrc (#942)
ignore npmrc
2025-10-03 00:21:32 -04:00
Adrian Lyjak d028397603 version and release via changesets (#849) 2025-10-03 00:08:52 -04:00
Emanuel Ferreira 35ea8476db docs: parse -> classify -> extract (#931) 2025-09-24 18:52:15 -03:00
Logan 3e5f7c4f1e Update parse.md 2025-09-24 11:35:13 -06:00
Adrian Lyjak 9d9b816644 Handle reasoning field conflict (#929)
* Handle reasoning field conflict

* update version to 0.6.69
2025-09-22 11:29:11 -04:00
Adrian Lyjak 83555f76e6 Handle validation errors for agent data retrieval (#928)
* feat: Add untyped agent data retrieval and handling

Introduces methods to retrieve agent data as untyped dictionaries,
handling validation errors gracefully. This allows for more flexible
data access when strict typing is not required or when data may be
malformed.

Co-authored-by: adrian <adrian@runllama.ai>

* Expose raw api result

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
2025-09-22 11:28:49 -04:00
Adrian Lyjak 5edf5f914a Support creating indexes in a specified project_id (#924)
* Support creating indexes in a specified project_id

* Bump
2025-09-18 11:07:07 -04:00
Adrian Lyjak 22e4975cb2 Refactor agent fields in llama_cloud_services (#921) 2025-09-17 15:14:40 -04:00
Peter Rowlands (변기호) bc2f04379b py: bump version to v.0.6.66 (#920) 2025-09-16 19:34:18 +09:00
Peter Rowlands (변기호) f9f951d5d8 parse: expose spreadsheet_force_formula_computation option (#919) 2025-09-16 19:28:03 +09:00
Emmanuel Ferdman 355129fea5 Fix colab broken links (#750)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-09-14 23:10:21 +02:00
Adrian Lyjak d9aed80ded fix: v prefix goes deeper. Fix more (#899) 2025-09-08 17:45:06 -04:00
Pierre-Loic Doulcet c07d2d70a8 update parse package (#911) 2025-09-08 09:46:32 -06:00
Neeraj Pradhan ed6937a5a9 Fix uv sync; remove poetry lock (#906) 2025-09-05 17:13:31 -07:00
Neeraj Pradhan 34c15932a3 Bump version to 0.6.64 (#904) 2025-09-05 17:05:21 -07:00
Neeraj Pradhan b18ea96d11 Remove report generation related code from llama_cloud_services (#905) 2025-09-05 16:41:28 -07:00
Clelia (Astra) Bertelli 196ab827f5 fix: make ts release beautiful again (#902) 2025-09-05 10:41:39 -06:00
Peter Rowlands (변기호) ba4cb4d5e9 parse: expose page.slideSpeakerNotes (#889) 2025-09-05 15:48:44 +09:00
Adrian Lyjak 58d883b825 fix: "v" prefix being added to js versions (#898) 2025-09-04 15:39:27 -04:00
Adrian Lyjak 5fc5ebfc6c client unification (#895)
read from the shared client
2025-09-04 14:12:28 -04:00
Adrian Lyjak fe3e20fd53 Update version script, and unify the linting so that prettier is more consistent (#897)
Add version script, and unify the linting so that prettier is more consistent
2025-09-04 14:09:27 -04:00
Jerry Liu e7e59459ab getting started LlamaCloudIndex notebook (#891) 2025-09-02 14:52:39 -06:00
Logan Markewich f4d7c84e19 remove stale param 2025-09-02 13:37:16 -06:00
Yannis Panagis 9050a346e4 Added "SourceText" to __init__.py (#892) 2025-09-02 13:28:24 -06:00
Sourabh Desai 9690ccf4ea Fix tag push command in CONTRIBUTING.md (#894)
seems to be missing one little `v`
2025-09-02 10:49:14 -07:00
Sourabh Desai 97745f0f1c version bump to 0.6.63 (#893) 2025-09-02 10:36:51 -07:00
Sourabh Desai 61a696b9db add file names in return values (#888) 2025-08-29 15:55:18 -07:00
Sourabh Desai 3e01adaf0e add alternative builder method (#887)
* add alternative builder method

* fix test
2025-08-29 15:55:04 -07:00
Adrian Lyjak 37393b7e98 fix: Make env based api url overrideable (#881) 2025-08-20 20:51:09 -06:00
Jerry Liu ecd859a67c fix preset notebook: give outputs in markdown (#883) 2025-08-20 20:50:11 -06:00
Logan decca8e671 update all example notebooks (#882) 2025-08-20 20:49:53 -06:00
Jerry Liu 5ea0815187 add a starter notebook for llamaparse presets (#874) 2025-08-19 09:22:07 -07:00
Sourabh Desai cf149650f5 add acreate_classify_job (#878) 2025-08-18 15:36:01 -07:00
dependabot[bot] 4c6c231ea4 Bump actions/checkout from 4 to 5 (#875) 2025-08-18 12:58:31 -06:00
Jerry Liu 5955b26509 fix composite retriever (#873)
* cr

* cr
2025-08-18 11:23:24 +02:00
Adrian Lyjak 31f54bca55 feat: support passing a pre-uploaded file directly (#871)
* feat: support passing a pre-uploaded file directly

* bump version
2025-08-14 15:32:55 -04:00
Adrian Lyjak b1ae7bb736 handle extract error field (#870) 2025-08-14 11:08:50 -04:00
Adrian Lyjak 31fe12e0da parallelize e2e tests (#867)
parallelise e2e tests
2025-08-14 10:00:12 -04:00
Terry Zhao 90b0c5e295 feat: export ExtractedFieldMetadata and ExtractedFieldMetadataDict types (#868)
* feat: export ExtractedFieldMetadata and ExtractedFieldMetadataDict types from beta/agent module

- Add missing type exports for ExtractedFieldMetadata and ExtractedFieldMetadataDict
- These types are used by ExtractedData interface but were not accessible externally
- Fixes issue where dependent types could not be imported separately

* bump version

* fix lint

---------

Co-authored-by: Terry Zhao <terryzhao@runllama.ai>
2025-08-13 14:43:48 -07:00
Adrian Lyjak 79fe1930cf Re-order extraction metadata union for better parsing (#865)
* Re-order args so that pydantic doesn't parse nested dict to a empty extraction result

* Use a citations array instead
2025-08-13 16:22:06 -04:00
Sourabh Desai ab225c3eab Classifier SDK (#837)
* add files client

* add classification SDK (beta/experimental)

* lint

* lint

* update files client

* add polling timeout

* move e2e test settings to conftest.py

* unused params

* use e2e settings class

* make org id optional

* ordering params

* fix tests

* add sync support
2025-08-13 09:50:39 -07:00
Sourabh Desai 6f1de75909 fix presigned urls + add very necessary test (#864) 2025-08-12 15:28:54 -07:00
Sourabh Desai 230ed64e41 missing await (#863)
missed this await
2025-08-12 13:54:34 -07:00
Logan ef126c3a93 remove print (#861) 2025-08-11 17:42:55 -07:00
Logan 51a7534733 support llama parse audio (#859) 2025-08-11 12:57:01 -07:00
Sourabh Desai 4f5d2bde13 add files client (#836)
* add files client

* lint

* update files client

* move e2e test settings to conftest.py

* unused params

* make org id optional
2025-08-08 15:54:00 -07:00
Clelia (Astra) Bertelli 3d05fe5d77 chore: bump ts version for parse (#855) 2025-08-08 11:43:28 +02:00
Clelia (Astra) Bertelli c16ca673af feat: add parse and getTables methods to LlamaParseReader (#851)
* feat: add parse and getTables methods to LlamaParseReader

* feat: add tests

* fix: loop logic to fix test 🙈

* chore: implement suggestions
2025-08-08 11:35:54 +02:00
Neeraj Pradhan 6619034bce Bump version to 0.6.56 (#853) 2025-08-07 15:42:19 -07:00
Neeraj Pradhan c56fb5d8f7 Update docs for extract (#852)
* Update docs for extract

* add more details on async
2025-08-07 13:59:53 -07:00
Peter Rowlands (변기호) b407a5edb5 parse: expose HTML output for result table items when possible (#850) 2025-08-07 08:44:09 -06:00
Clelia (Astra) Bertelli e6a27d17fb wip: implementing Extract in TS (#839)
* wip: implementing Extract in TS

* feat: main implementation (untested)

* ci: lint

* feat: add stateless api support and retries mechanisms

* refactor: working LlamaExtract + tests

* refactor: working LlamaExtract + tests

* correct stateless extraction test

* correct stateless extraction test

* chore: intervals are now in seconds, extractStateless -> extract, support for multiple file types

* fix: infer file type

* fix: infer file type

* fix: change agent name

* docs: adding example

* docs: add link to example in extract.md
2025-08-07 12:18:58 +02:00
Peter Rowlands (변기호) 34077fd479 py: bump version to 0.6.55 (#846) 2025-08-06 13:02:35 +09:00
Peter Rowlands (변기호) 7a68ad5a7f utils/parse: add method to check pypi for package updates (#844)
add utils method to check pypi for package updates
2025-08-06 12:36:42 +09:00
Neeraj Pradhan 74a1b6c2f2 Update Extract with stateless API (#840) 2025-08-05 13:33:07 -07:00
Clelia (Astra) Bertelli 9a90ae5264 fix: run e2e only on 3.12 (#838)
* fix: run e2e only on 3.12

* ci: workflow name and linting

* ci: job name correction 🤦

* fix: test e2e only on PR

* chore: differentiate between e2e and non-e2e tests

* ci: run all tests using explicit patterns

* chore: moving tests

* fix: change name to test_index in unit_tests
2025-08-05 21:45:16 +02:00
Clelia (Astra) Bertelli 310c1bc105 docs: move ts examples in their own top-level folder (#845) 2025-08-05 19:06:32 +02:00
Marcus Schiesser cd20b29299 chore: build before releaes (#843)
* chore: add e2e tests and use monorepo for TS

* chore: build main package to run e2e tests

* chore: add build before releasing

* fix linting

---------

Co-authored-by: Logan Markewich <logan.markewich@live.com>
2025-08-05 10:09:27 +02:00
Neeraj Pradhan 0cb7aeb81c Add claude code workflow with restricted access (#841) 2025-08-04 17:02:41 -07:00
Marcus Schiesser 98db5eeeae chore: remove llamaindex dep (#826)
* chore: remove llamaindex dep

* chore: remove all dependency on llamaindex

* feat: restructure docs/examples

* chore: remove llamaindex dep

* chore: remove all dependency on llamaindex

* simplify querytool

* fix tests

* revert version

* add missing import

* remove unused file

* feat: change default description to adapt it to LlamaCloud Index

---------

Co-authored-by: Clelia (Astra) Bertelli <clelia@runllama.ai>
2025-08-04 11:48:24 +02:00
Adrian Lyjak c21cb34ff6 fix: Fix bugs in ExtractedFieldMetadata parser (#834)
* fix: Fix bugs in ExtractedFieldMetadata parser

- Wasn't recursing through lists properly
- Fix field names, names changed or I copied incorrectly
- Handle reasoning on a parent object

* version script fixes

* update versions

* skip the unrelated failing test for now
2025-08-01 16:08:16 -04:00
Adrian Lyjak e28c7b9d92 Copy extracted citations to the new repo (#832)
* Copy extracted citations to the new repo

* fix spell check

* ignore examples too

* tweak timeout

* add changes to github actions

* shrug
2025-07-31 19:34:24 +02:00
Clelia (Astra) Bertelli ee4e565604 Example Notebooks (#829)
* fix: add symlink to avoid breaking links

* feat: copy examples
2025-07-31 16:54:12 +02:00
Clelia (Astra) Bertelli 6dbb089f4c delete examples (#830) 2025-07-31 16:53:54 +02:00
Logan Markewich c4b694db8d update symlink 2025-07-31 08:44:30 -06:00
Clelia (Astra) Bertelli 97f428ad06 fix: add symlink to avoid breaking links (#828) 2025-07-31 08:39:44 -06:00
Clelia (Astra) Bertelli ef92ee5408 feat: add ts examples (clean) (#822)
* feat: add ts examples (clean)

* chore: correct title
2025-07-31 11:25:29 +02:00
Logan d094668d03 Update extract.md 2025-07-30 14:58:25 -06:00
Logan 5bb5fc1625 Update parse.md 2025-07-30 14:58:09 -06:00
Logan 1d57e0071d Update parse.md 2025-07-30 14:57:31 -06:00
Logan 2a344c4f5c Update extract.md 2025-07-30 14:56:33 -06:00
Logan ce02559b8d Update README.md (#824) 2025-07-30 14:55:21 -06:00
Harshit Budhiraja e42746e372 docs(readme): update hyperlinks to correct targets (#820) 2025-07-30 14:53:43 -06:00
Clelia (Astra) Bertelli 3149dfd03a fix: no git checks on pnpm publish (#823) 2025-07-30 21:25:23 +02:00
Clelia (Astra) Bertelli e499fdbdab fix: add release to NPM (#819) 2025-07-30 20:55:41 +02:00
Clelia (Astra) Bertelli e57df39248 Merge index into main (#821)
* wip: monorepo changes

* fix ci for the time being

* fix ci for the time being pt2

* wip: first cloud refactoring for ts

* chore: restore original package

* fix: imports, package.json, tsconfig.json, client, reader

* feat: adjustments after local testing

* ci: github actions for typescript

* ci: typescript ci

* ci: nvmrc 🤦

* ci: remove cache 🤦

* ci: actions

* ci: actions (i lost count)

* ci: pnpm run format

* ci: pnpm run format

* chore: migrate llama-parse to uv

* add tests

* remove unneeded readme

* update workflows

* feat: modify py release workflow, adding uv version, bump version for llama-cloud-services to latest

* uv lock

* ci: python tests all tests

* fix: lock file pulling in wrong version of numpy

* feat: add index to llama-cloud-services (#817)

---------

Co-authored-by: Logan Markewich <logan.markewich@live.com>
Co-authored-by: Adrian Lyjak <adrianlyjak@gmail.com>
2025-07-30 19:46:36 +02:00
Clelia (Astra) Bertelli 09b192b98b Adding TS llama-cloud-services and moving llama-parse to uv (#811)
* wip: monorepo changes

* fix ci for the time being

* fix ci for the time being pt2

* wip: first cloud refactoring for ts

* chore: restore original package

* fix: imports, package.json, tsconfig.json, client, reader

* feat: adjustments after local testing

* ci: github actions for typescript

* ci: typescript ci

* ci: nvmrc 🤦

* ci: remove cache 🤦

* ci: actions

* ci: actions (i lost count)

* ci: pnpm run format

* ci: pnpm run format

* chore: migrate llama-parse to uv

* add tests

* remove unneeded readme

* update workflows

* feat: modify py release workflow, adding uv version, bump version for llama-cloud-services to latest

* uv lock

* ci: python tests all tests

* fix: lock file pulling in wrong version of numpy

---------

Co-authored-by: Logan Markewich <logan.markewich@live.com>
Co-authored-by: Adrian Lyjak <adrianlyjak@gmail.com>
2025-07-30 17:59:08 +02:00
Adrian Lyjak 13f01a0621 Adding support for page citations, and refactor the confidence into the field metadata (#815) 2025-07-30 10:55:29 -04:00
Javier Torres cf879a1a58 Bump llama-cloud version (#814) 2025-07-28 16:06:31 -05:00
Tuana Çelik fcdf2ab63e Fixes to multimodal report generation (#809) 2025-07-23 16:28:53 -06:00
Adrian Lyjak 083d8109c2 Make versioning a little easier, and fix llama_parse version (#808)
* Make versioning a little easier

* fix up ci
2025-07-21 18:49:07 -04:00
Adrian Lyjak 89cfc8b25f feat: default to _public agent data (#803)
* feat: default to _public agent data
* version bump
2025-07-21 15:58:03 -04:00
Peter Rowlands (변기호) c46e157f92 parse: expose preserve_very_small_text option (#806) 2025-07-21 14:19:15 +09:00
Peter Rowlands (변기호) 05d6026d37 bump version to v0.6.50 (#802) 2025-07-18 18:59:25 +09:00
Peter Rowlands (변기호) 8e98d5c146 parse: expose functionality to get raw job results (#801)
* add LlamaParse.get_result()

* add JobResult.get_text/get_markdown/get_json

* add tests
2025-07-18 18:50:29 +09:00
Adrian Lyjak 3f311c0669 Bump v0.6.49 (#797) 2025-07-16 19:42:09 -04:00
Adrian Lyjak b1a2f9d42b Add new method to fetch the full, non-paginated markdown (#796)
Add new method to fetch the full, non-paginated markdown for proper merge_tables_across_pages_in_markdown support
2025-07-16 19:29:57 -04:00
Neeraj Pradhan 142f55c94c Update to version 0.6.48 (#795)
* Update to version 0.6.48

* pin version

* poetry lock

* adjust warnings

* collect all agents for cleanup
2025-07-16 13:24:44 -07:00
Clelia (Astra) Bertelli 230a110e52 chore: vbump to 0.6.47 and example notebook (#794)
* chore: vbump to 0.6.47 and example notebook

* chore: update llama-parse pyproject.toml
2025-07-16 19:08:44 +02:00
Clelia (Astra) Bertelli 83e2b031cd feat: add table extraction for LlamaParse as CSV files (#793)
* feat: add table extraction for LlamaParse as CSV files

* chore: poetry lock

* chore: add tests

* fix: handle the case where no tables are present

* chore: implement suggestions
2025-07-16 17:08:09 +02:00
Adrian Lyjak 4844e26e5c Improve Agent Data interface, and add file related fields to extracted data for file tracking (#785)
Add file related fields for file tracking. Simplify API
2025-07-09 14:27:24 -04:00
Pierre-Loic Doulcet 70a049af3c merge_tables_across_pages_in_markdown parse parameter (#786)
* merge_tables_across_pages_in_markdown parse parameter

* base.py
2025-07-09 19:03:48 +02:00
Adrian Lyjak dc11776c86 Add nicer hand-written agent data interface (#782)
* Add nicer hand-written agent data interface

* bump to 0.6.44
2025-07-08 17:49:00 -04:00
Logan 2448a42b90 relax pydantic job object (#784) 2025-07-08 12:12:56 -06:00
324 changed files with 126030 additions and 17411 deletions
+8
View File
@@ -0,0 +1,8 @@
# Changesets
Hello and welcome! This folder has been automatically generated by `@changesets/cli`, a build tool that works
with multi-package repos, or single-package repos to help you version and publish your code. You can
find the full documentation for it [in our repository](https://github.com/changesets/changesets)
We have a quick list of common questions to get you started engaging with this project in
[our documentation](https://github.com/changesets/changesets/blob/main/docs/common-questions.md)
+11
View File
@@ -0,0 +1,11 @@
{
"$schema": "https://unpkg.com/@changesets/config@3.1.1/schema.json",
"changelog": "@changesets/cli/changelog",
"commit": false,
"fixed": [],
"linked": [],
"access": "restricted",
"baseBranch": "main",
"updateInternalDependencies": "patch",
"ignore": []
}
-48
View File
@@ -1,48 +0,0 @@
name: Build Package
# Build package on its own without additional pip install
on:
push:
branches:
- main
pull_request:
env:
POETRY_VERSION: "1.6.1"
jobs:
build:
runs-on: ${{ matrix.os }}
strategy:
# You can use PyPy versions in python-version.
# For example, pypy-2.7 and pypy-3.8
matrix:
os: [ubuntu-latest, windows-latest]
python-version: ["3.9"]
steps:
- uses: actions/checkout@v4
- name: Set up python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: ${{ env.POETRY_VERSION }}
- name: Install deps
shell: bash
run: poetry install
- name: Ensure lock works
shell: bash
run: poetry lock
- name: Build
shell: bash
run: poetry build
- name: Test installing built package
shell: bash
run: python -m pip install .
- name: Test import
shell: bash
working-directory: ${{ vars.RUNNER_TEMP }}
run: python -c "import llama_cloud_services"
+53
View File
@@ -0,0 +1,53 @@
name: Build Package - Python
# Build package on its own without additional pip install
on:
push:
branches:
- main
paths:
- "py/**"
pull_request:
paths:
- "py/**"
env:
UV_VERSION: "0.7.20"
jobs:
build:
runs-on: ${{ matrix.os }}
strategy:
# You can use PyPy versions in python-version.
# For example, pypy-2.7 and pypy-3.8
matrix:
os: [ubuntu-latest, windows-latest]
python-version: ["3.9"]
steps:
- uses: actions/checkout@v5
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
version: ${{ env.UV_VERSION }}
- name: Set up Python
run: uv python install
- name: Display Python version
run: python --version
- name: Build
working-directory: py
run: uv build
- name: Test installing built package
shell: bash
working-directory: py
run: |
uv venv
uv pip install dist/*.whl
- name: Test import
working-directory: py
run: uv run -- python -c "import llama_cloud_services"
+34
View File
@@ -0,0 +1,34 @@
name: Build Package - TypeScript
on:
push:
branches:
- main
paths:
- "ts/**"
pull_request:
paths:
- "ts/**"
jobs:
pre_release:
name: Pre Release
runs-on: ubuntu-latest
steps:
- name: Checkout Repo
uses: actions/checkout@v5
- uses: pnpm/action-setup@v4
- name: Setup Node.js
uses: actions/setup-node@v5
with:
node-version-file: "ts/llama_cloud_services/.nvmrc"
- name: Install dependencies
working-directory: ts/llama_cloud_services/
run: pnpm install --no-frozen-lockfile
- name: Build
working-directory: ts/llama_cloud_services/
run: pnpm run build
+95
View File
@@ -0,0 +1,95 @@
name: Claude Code
on:
issue_comment:
types: [created]
pull_request_review_comment:
types: [created]
issues:
types: [opened, assigned]
pull_request_review:
types: [submitted]
jobs:
claude:
if: |
(github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) ||
(github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
(github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
(github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: read
issues: read
id-token: write
steps:
- name: Check repository access
id: check-access
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Get the user who triggered the event
case "${{ github.event_name }}" in
"issue_comment")
USER="${{ github.event.comment.user.login }}"
;;
"pull_request_review_comment")
USER="${{ github.event.comment.user.login }}"
;;
"pull_request_review")
USER="${{ github.event.review.user.login }}"
;;
"issues")
USER="${{ github.event.issue.user.login }}"
;;
esac
echo "Checking repository access for user: $USER"
# Check if user has write access to the repository
REPO="${{ github.repository }}"
if gh api repos/$REPO/collaborators/$USER/permission --jq '.permission' | grep -E "(admin|write)" > /dev/null 2>&1; then
echo "User $USER has write access to the repository"
echo "authorized=true" >> $GITHUB_OUTPUT
else
echo "User $USER does not have write access to the repository"
echo "authorized=false" >> $GITHUB_OUTPUT
exit 1
fi
- name: Checkout repository
if: steps.check-access.outputs.authorized == 'true'
uses: actions/checkout@v5
with:
fetch-depth: 1
- name: Run Claude Code
if: steps.check-access.outputs.authorized == 'true'
id: claude
uses: anthropics/claude-code-action@beta
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_GITHUB_API_KEY }}
# Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
# model: "claude-opus-4-20250514"
# Optional: Customize the trigger phrase (default: @claude)
# trigger_phrase: "/claude"
# Optional: Trigger when specific user is assigned to an issue
# assignee_trigger: "claude-bot"
# Optional: Allow Claude to run specific commands
# Allow bash commands to be run, for things like running tests, linting, etc.
allowed_tools: "Bash(rg:*),Bash(find:*),Bash(grep:*),Bash(pnpm:*),Bash(npm:*),Bash(uv:*),Bash(pip:*),Bash(pipx:*),Bash(make:*),Bash(cd:*),WebFetch"
# Optional: Add custom instructions for Claude to customize its behavior for your project
# custom_instructions: |
# Follow our coding standards
# Ensure all new code has tests
# Use TypeScript for new files
# Optional: Custom environment variables for Claude
# claude_env: |
# NODE_ENV: test
+3 -3
View File
@@ -26,16 +26,16 @@ jobs:
steps:
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@v5
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v3
uses: github/codeql-action/init@v4
with:
languages: python
dependency-caching: true
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v3
uses: github/codeql-action/analyze@v4
with:
category: "/language:python"
+162
View File
@@ -0,0 +1,162 @@
name: Extract E2E Tests (every 4 hours)
on:
schedule:
- cron: "0 */4 * * *"
workflow_dispatch:
# Allows manual triggering
inputs:
environment:
description: "Environment to run the tests in"
required: false
default: staging
type: choice
options:
- staging
- production
notify_slack:
description: "Notify Slack"
required: false
default: false
type: boolean
workflow_call:
env:
UV_VERSION: "0.7.20"
PYTHON_VERSION: "3.12"
SLACK_CHANNEL_ID: C078PHNTF44 # Extract channel ID
API_E2E_LOG_PATH: ${{ github.workspace }}/extract-e2e.log
jobs:
extract-e2e:
name: "Extract E2E Tests (${{ matrix.environment }})"
runs-on: ubuntu-latest
timeout-minutes: 30
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.environment }}
cancel-in-progress: true
strategy:
fail-fast: false
matrix:
environment: ${{ github.event_name == 'schedule' && fromJson('["staging", "production"]') || fromJson(format('["{0}"]', github.event.inputs.environment || 'staging')) }}
steps:
- name: Set runtime inputs
id: runtime
run: |
environment=${{ matrix.environment }}
notify_slack=${{ github.event.inputs.notify_slack || github.event_name == 'schedule' }}
echo "environment=${environment}" >> $GITHUB_OUTPUT
echo "notify_slack=${notify_slack}" >> $GITHUB_OUTPUT
if [ "${environment}" = "production" ]; then
echo "LLAMA_CLOUD_BASE_URL=https://api.cloud.llamaindex.ai" >> $GITHUB_ENV
api_key_secret="${{ secrets.LLAMA_CLOUD_API_KEY }}"
project_id_secret="${{ secrets.LLAMA_CLOUD_PROJECT_ID }}"
else
echo "LLAMA_CLOUD_BASE_URL=https://api.staging.llamaindex.ai" >> $GITHUB_ENV
api_key_secret="${{ secrets.LLAMA_CLOUD_API_KEY_STAGING }}"
project_id_secret="${{ secrets.LLAMA_CLOUD_PROJECT_ID_STAGING }}"
fi
if [ -n "$api_key_secret" ]; then
echo "LLAMA_CLOUD_API_KEY=$api_key_secret" >> $GITHUB_ENV
fi
if [ -n "$project_id_secret" ]; then
echo "LLAMA_CLOUD_PROJECT_ID=$project_id_secret" >> $GITHUB_ENV
fi
- uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
version: ${{ env.UV_VERSION }}
- name: Set up Python
run: uv python install ${{ env.PYTHON_VERSION }} && uv python pin ${{ env.PYTHON_VERSION }}
- name: Run Extract E2E tests
id: extract-tests
continue-on-error: true
working-directory: py
run: |
set -o pipefail
rm -f "$API_E2E_LOG_PATH"
uv run pytest -v -n 8 --timeout=300 --session-timeout=1740 tests/extract/ 2>&1 | tee "$API_E2E_LOG_PATH"
- name: Extract pytest failure summary
id: failed-tests
if: steps.extract-tests.outcome == 'failure' || cancelled()
run: |
summary="$(python3 - <<'PY'
import os
import re
from pathlib import Path
log_path = Path(os.environ["API_E2E_LOG_PATH"])
if not log_path.exists():
print("Test log not found.")
raise SystemExit(0)
lines = log_path.read_text(errors="ignore").splitlines()
# Find the "short test summary info" section
start = None
for i, line in enumerate(lines):
if line.startswith("=") and "short test summary info" in line:
start = i + 1
break
if start is None:
print("No test summary found.")
raise SystemExit(0)
# Extract just the FAILED/ERROR lines (test name + short reason)
failed_tests = []
for line in lines[start:]:
if line.startswith("="):
break # End of section
if line.startswith("FAILED ") or line.startswith("ERROR "):
# Extract test name and truncate the error message
match = re.match(r"(FAILED|ERROR) ([\w/:.\[\]_-]+)", line)
if match:
failed_tests.append(f"{match.group(1)}: {match.group(2)}")
if failed_tests:
print("\n".join(failed_tests[:20])) # Limit to 20 tests max
else:
print("No failed tests found in summary.")
PY
)"
if [ -z "$summary" ]; then
summary="Failed test summary not available. Review the full run logs."
fi
{
printf 'summary<<EOF\n%s\nEOF\n' "$summary"
} >> "$GITHUB_OUTPUT"
- name: Check test results
if: always()
run: |
if [ "${{ steps.extract-tests.outcome }}" == "failure" ]; then
echo "Extract E2E tests failed"
exit 1
fi
- name: Post to Extract Slack channel
id: slack
if: (failure() || cancelled()) && steps.runtime.outputs.notify_slack == 'true'
uses: slackapi/slack-github-action@v2.1.1
with:
channel-id: ${{ env.SLACK_CHANNEL_ID }}
slack-message: |
:red_circle: *Extract E2E Failed* (${{ steps.runtime.outputs.environment }})
```
${{ steps.failed-tests.outputs.summary }}
```
<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
+22 -13
View File
@@ -1,4 +1,4 @@
name: Linting
name: Lint
on:
push:
@@ -7,7 +7,7 @@ on:
pull_request:
env:
POETRY_VERSION: "1.6.1"
UV_VERSION: "0.7.20"
jobs:
build:
@@ -18,20 +18,29 @@ jobs:
matrix:
python-version: ["3.9"]
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
with:
fetch-depth: ${{ github.event_name == 'pull_request' && 2 || 0 }}
- name: Set up python ${{ matrix.python-version }}
uses: actions/setup-python@v5
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
uses: snok/install-poetry@v1
version: ${{ env.UV_VERSION }}
- name: Set up Python
run: uv python install ${{ matrix.python-version }}
- uses: pnpm/action-setup@v4
- name: Setup Node.js
uses: actions/setup-node@v5
with:
version: ${{ env.POETRY_VERSION }}
- name: Install pre-commit
shell: bash
run: poetry run pip install pre-commit
node-version-file: "ts/llama_cloud_services/.nvmrc"
- name: Install dependencies
run: pnpm install --no-frozen-lockfile
- name: Run linter
shell: bash
run: poetry run make lint
working-directory: py
run: uv run -- pre-commit run -a
# the js checks are run roundaboutly through lint-staged, and -a doesn't run it. Run them directly.
- run: pnpm -w --filter llama-cloud-services run lint
- run: pnpm -w --filter llama-cloud-services run format:check
-83
View File
@@ -1,83 +0,0 @@
name: Publish llama-parse to PyPI / GitHub
on:
push:
tags:
- "v*"
workflow_dispatch:
env:
POETRY_VERSION: "1.6.1"
PYTHON_VERSION: "3.9"
jobs:
build-n-publish:
name: Build and publish to PyPI
if: github.repository == 'run-llama/llama_cloud_services'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: ${{ env.POETRY_VERSION }}
- name: Install deps
shell: bash
run: pip install -e .
- name: Build and publish llama-cloud-services
uses: JRubics/poetry-publish@v2.1
with:
pypi_token: ${{ secrets.LLAMA_PARSE_PYPI_TOKEN }}
poetry_install_options: "--without dev"
- name: Wait for PyPI to update
run: |
sleep 120
- name: Update llama-parse lock file
run: |
cd llama_parse && poetry lock
- name: Build and publish llama-parse
uses: JRubics/poetry-publish@v2.1
with:
package_directory: "./llama_parse"
pypi_token: ${{ secrets.LLAMA_PARSE_PYPI_TOKEN }}
poetry_install_options: "--without dev"
- name: Create GitHub Release
id: create_release
uses: actions/create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # This token is provided by Actions, you do not need to create your own token
with:
tag_name: ${{ github.ref }}
release_name: ${{ github.ref }}
draft: false
prerelease: false
- name: Get Asset name
run: |
export PKG=$(ls dist/ | grep tar)
set -- $PKG
echo "name=$1" >> $GITHUB_ENV
- name: Upload Release Asset (sdist) to GitHub
id: upload-release-asset
uses: actions/upload-release-asset@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
upload_url: ${{ steps.create_release.outputs.upload_url }}
asset_path: dist/${{ env.name }}
asset_name: ${{ env.name }}
asset_content_type: application/zip
+39
View File
@@ -0,0 +1,39 @@
name: Test end-to-end - Python
on:
pull_request:
paths:
- "py/**"
env:
UV_VERSION: "0.7.20"
LLAMA_CLOUD_API_KEY: ${{ secrets.LLAMA_CLOUD_API_KEY }}
jobs:
test_e2e:
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
# You can use PyPy versions in python-version.
# For example, pypy-2.7 and pypy-3.8
matrix:
python-version: ["3.12"]
steps:
- uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
version: ${{ env.UV_VERSION }}
- name: Set up Python
run: uv python install ${{ matrix.python-version }} && uv python pin ${{ matrix.python-version }}
- name: Run Tests
working-directory: py
run: make e2e
- name: Remove virtual environment
working-directory: py
run: rm -rf .venv/
+42
View File
@@ -0,0 +1,42 @@
name: Test - Python
on:
push:
branches:
- main
paths:
- "py/**"
pull_request:
paths:
- "py/**"
env:
UV_VERSION: "0.7.20"
jobs:
test:
runs-on: ubuntu-latest
strategy:
# You can use PyPy versions in python-version.
# For example, pypy-2.7 and pypy-3.8
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
version: ${{ env.UV_VERSION }}
- name: Set up Python
run: uv python install ${{ matrix.python-version }} && uv python pin ${{ matrix.python-version }}
- name: Run Tests
working-directory: py
run: uv run pytest unit_tests/ -v
- name: Remove virtual environment
working-directory: py
run: rm -rf .venv/
+39
View File
@@ -0,0 +1,39 @@
name: Test - TypeScript
on:
push:
branches:
- main
paths:
- "ts/**"
pull_request:
paths:
- "ts/**"
env:
TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
TURBO_TEAM: ${{ vars.TURBO_TEAM }}
TURBO_REMOTE_ONLY: true
LLAMA_CLOUD_API_KEY: ${{ secrets.LLAMA_CLOUD_API_KEY }}
jobs:
test:
name: Test - TypeScript
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- uses: pnpm/action-setup@v4
- name: Setup Node.js
uses: actions/setup-node@v5
with:
node-version-file: "ts/llama_cloud_services/.nvmrc"
- name: Install dependencies
run: pnpm -r install --no-frozen-lockfile
- name: Build package
run: pnpm --filter llama-cloud-services build
- name: Run Tests
working-directory: ts/llama_cloud_services/
run: pnpm test
- name: Run e2e tests
working-directory: ts/e2e-tests/
run: pnpm test
-40
View File
@@ -1,40 +0,0 @@
name: Unit Testing
on:
push:
branches:
- main
pull_request:
env:
POETRY_VERSION: "1.6.1"
LLAMA_CLOUD_API_KEY: ${{ secrets.LLAMA_CLOUD_API_KEY }}
jobs:
test:
runs-on: ubuntu-latest
strategy:
# You can use PyPy versions in python-version.
# For example, pypy-2.7 and pypy-3.8
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: ${{ env.POETRY_VERSION }}
- name: Install deps
shell: bash
run: poetry install --with dev
- name: Run testing
env:
CI: true
shell: bash
run: poetry run pytest tests
@@ -0,0 +1,61 @@
name: Version Bump and Release
on:
push:
branches:
- main
concurrency: ${{ github.workflow }}-${{ github.ref }}
jobs:
release:
name: Release
runs-on: ubuntu-latest
# Only run on main branch pushes
if: github.ref == 'refs/heads/main'
steps:
- name: Checkout Repo
uses: actions/checkout@v5
- uses: pnpm/action-setup@v4
- name: Setup Node.js
uses: actions/setup-node@v5
with:
node-version: "22"
cache: "pnpm"
- name: Setup Python
uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install uv
uses: astral-sh/setup-uv@v7
- name: Install dependencies
run: pnpm install
- name: Add auth token to .npmrc file
run: |
cat << EOF >> ".npmrc"
//registry.npmjs.org/:_authToken=$NPM_TOKEN
EOF
env:
NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
- name: Create Release Pull Request or Publish packages
id: changesets
uses: changesets/action@v1
with:
commit: "chore: version packages"
title: "chore: version packages"
# Custom version script
version: pnpm -w run version
# Custom publish script
publish: pnpm -w run publish
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
UV_PUBLISH_TOKEN: ${{ secrets.PYPI_TOKEN }}
LLAMA_PARSE_PYPI_TOKEN: ${{ secrets.LLAMA_PARSE_PYPI_TOKEN }}
+5
View File
@@ -5,3 +5,8 @@ __pycache__/
.idea
.env*
.ipynb_checkpoints*
*_cache/
node_modules/
.turbo/
dist/
.npmrc
+13 -10
View File
@@ -15,25 +15,26 @@ repos:
- id: end-of-file-fixer
- id: mixed-line-ending
- id: trailing-whitespace
exclude: ^ts/llama_cloud_services/src/client/
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.1.5
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
exclude: ".*poetry.lock"
exclude: ".*uv.lock|examples/"
- repo: https://github.com/psf/black-pre-commit-mirror
rev: 23.10.1
hooks:
- id: black-jupyter
name: black-src
alias: black
exclude: ".*poetry.lock"
exclude: ".*uv.lock|examples/extract/solar_panel_e2e_comparison.ipynb"
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.0.1
hooks:
- id: mypy
exclude: ^tests/
exclude: ^py/tests|^py/unit_tests|^examples
additional_dependencies:
[
"types-requests",
@@ -59,17 +60,19 @@ repos:
additional_dependencies: [black==23.10.1]
# Using PEP 8's line length in docs prevents excess left/right scrolling
args: [--line-length=79]
- repo: https://github.com/pre-commit/mirrors-prettier
rev: v3.0.3
- repo: local
hooks:
- id: prettier
exclude: poetry.lock
- id: lint-staged
name: Run lint-staged for TS files
entry: pnpm -w exec lint-staged
language: system
pass_filenames: false
- repo: https://github.com/codespell-project/codespell
rev: v2.2.6
hooks:
- id: codespell
additional_dependencies: [tomli]
exclude: ^(poetry.lock|examples)
exclude: ^(uv.lock|docs|ts|examples|pnpm-lock.yaml)
args:
[
"--ignore-words-list",
@@ -84,6 +87,6 @@ repos:
rev: v0.23.1
hooks:
- id: toml-sort-fix
exclude: ".*poetry.lock"
exclude: ".*uv.lock"
exclude: .github/ISSUE_TEMPLATE
exclude: ^(.github/ISSUE_TEMPLATE|ts/llama_cloud_services/src/client|pnpm-lock.yaml)
+33
View File
@@ -0,0 +1,33 @@
# Python
## Installation
This project uses uv. Create a virtual environment, and run `uv sync`
## Versioning (Maintainers only)
Before merging your changes, make sure to bump the versions.
Make a version bump to `pyproject.toml`. If the underlying dependency on the llamacloud platform OpenAPI
sdk needs bumping, make sure to bring that in as well. If updating dependencies, run `uv lock`.
The legacy `llama_parse` package re-exports some of `llama_cloud_services` in the old namespace. Its
version must stay in step with `llama_cloud_services`, so bump its version in `llama_parse/pyproject.toml`, and also bump its `llama-cloud-services` dependency version to match.
**Note**: Don't worry about updating the `llama_parse/poetry.lock` file when bumping versions. The GitHub action will automatically run `poetry lock` for the llama_parse package during the build process (though it doesn't commit the updated lockfile back to the repo).
You can also do this with `./scripts/version-bump.py set 0.x.x` if you have `uv` installed.
Once the change is merged, push a tag `git tag -a v0.x.x -m 0.x.x` and `git push origin v0.x.x`.
This tagging step can be done with `./scripts/version-bump tag`.
# Typescript
## Installation
...
## Versioning
...
-14
View File
@@ -1,14 +0,0 @@
GIT_ROOT ?= $(shell git rev-parse --show-toplevel)
help: ## Show all Makefile targets.
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'
format: ## Run code autoformatters (black).
pre-commit install
git ls-files | xargs pre-commit run black --files
lint: ## Run linters: pre-commit (black, ruff, codespell) and mypy
pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files
test: ## Run tests via pytest
pytest tests
+9 -64
View File
@@ -4,67 +4,12 @@
# Llama Cloud Services
This repository contains the code for hand-written SDKs and clients for interacting with LlamaCloud.
This includes:
- [LlamaParse](./parse.md) - A GenAI-native document parser that can parse complex document data for any downstream LLM use case (Agents, RAG, data processing, etc.).
- [LlamaReport (beta/invite-only)](./report.md) - A prebuilt agentic report builder that can be used to build reports from a variety of data sources.
- [LlamaExtract](./extract.md) - A prebuilt agentic data extractor that can be used to transform data into a structured JSON representation.
## Getting Started
Install the package:
```bash
pip install llama-cloud-services
```
Then, get your API key from [LlamaCloud](https://cloud.llamaindex.ai/).
Then, you can use the services in your code:
```python
from llama_cloud_services import LlamaParse, LlamaReport, LlamaExtract
parser = LlamaParse(api_key="YOUR_API_KEY")
report = LlamaReport(api_key="YOUR_API_KEY")
extract = LlamaExtract(api_key="YOUR_API_KEY")
```
See the quickstart guides for each service for more information:
- [LlamaParse](./parse.md)
- [LlamaReport (beta/invite-only)](./report.md)
- [LlamaExtract](./extract.md)
## Switch to EU SaaS 🇪🇺
If you are interested in using LlamaCloud services in the EU, you can adjust your base URL to `https://api.cloud.eu.llamaindex.ai`.
You can also create your API key in the EU region [here](https://cloud.eu.llamaindex.ai).
```python
from llama_cloud_services import (
LlamaParse,
LlamaReport,
LlamaExtract,
EU_BASE_URL,
)
parser = LlamaParse(api_key="YOUR_API_KEY", base_url=EU_BASE_URL)
report = LlamaReport(api_key="YOUR_API_KEY", base_url=EU_BASE_URL)
extract = LlamaExtract(api_key="YOUR_API_KEY", base_url=EU_BASE_URL)
```
## Documentation
You can see complete SDK and API documentation for each service on [our official docs](https://docs.cloud.llamaindex.ai/).
## Terms of Service
See the [Terms of Service Here](./TOS.pdf).
## Get in Touch (LlamaCloud)
You can get in touch with us by following our [contact link](https://www.llamaindex.ai/contact).
> **⚠️ DEPRECATION NOTICE**
>
> This repository and its packages are deprecated and will be maintained until **May 1, 2026**.
>
> **Please migrate to the new packages:**
> - **Python**: `pip install "llama-cloud>=1.0"` ([GitHub](https://github.com/run-llama/llama-cloud-py))
> - **TypeScript**: `npm install @llamaindex/llama-cloud` ([GitHub](https://github.com/run-llama/llama-cloud-ts))
>
> The new packages provide the same functionality with improved performance, better support, and active development.
+8
View File
@@ -0,0 +1,8 @@
# LlamaCloud Services Examples - TypeScript
In this folder you will find several TypeScript end-to-end applications with examples of:
- [LlamaParse](./parse/)
- [LlamaCloud Index](./index/)
Follow the instructions in each example folder to get started!
+21
View File
@@ -0,0 +1,21 @@
node_modules
package-lock.json
yarn.lock
.DS_Store
.cache
.env
.vercel
.output
.nitro
/build/
/api/
/server/build
/public/build
# Sentry Config File
.env.sentry-build-plugin
/test-results/
/playwright-report/
/blob-report/
/playwright/.cache/
.tanstack
.vscode
+4
View File
@@ -0,0 +1,4 @@
**/build
**/public
pnpm-lock.yaml
routeTree.gen.ts
+88
View File
@@ -0,0 +1,88 @@
# LlamaClassify Demo
A TypeScript demo application showcasing **LlamaClassify**, an agentic document classification service from [LlamaCloud](https://cloud.llamaindex.ai). This demo lets you classify financial documents into three types: Cash Flow Statement, Income Statement, and Balance Sheet.
## Table of Contents
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Start the Demo](#start-the-demo)
- [How It Works](#how-it-works)
- [Troubleshooting](#troubleshooting)
- [Common Issues](#common-issues)
- [License](#license)
- [Contributing](#contributing)
## Features
- 📄 **Document Classification**: Classify files based on well-defined rules that you can customize and experiment with.
- 🤖 **Reasoning-based Actionable Insights**: Get in-depth, reasoning-based insights on the document classification, accompanied by confidence scores.
- 🎨 **Beautiful UI**: [DaisyUI](https://daisyui.com)-based interface powered by [TanStack](https://tanstack.com)
- **Fast Development**: Hot reload support with development mode
- 🛠️ **TypeScript**: Full TypeScript support with strict type checking
## Prerequisites
- Node.js (version 22 or higher)
- pnpm package manager
- LlamaCloud API key
## Installation
1. Clone the repository:
```bash
git clone https://github.com/run-llama/llama_cloud_services
cd llama_cloud_services/examples-ts/classify/
```
2. Install dependencies:
```bash
npm install
```
3. Set up your environment variables:
```bash
# Add your API key to your environment
export LLAMA_CLOUD_API_KEY="your-llamacloud-api-key"
```
## Usage
### Start the Demo
```bash
npm run dev
```
The application will be up and running on http://localhost:3000
## How It Works
1. **Document Input**: Enter the path to your document when prompted
2. **Classification**: LlamaClassify processes the document and classifies it according to the rules defined [here](./src/utils/classifier.ts)
3. **Results**: The classification outcome, as well as the reasoning behind it and the confidence score, are displayed in the UI.
## Troubleshooting
### Common Issues
1. **Module Resolution Errors**: Ensure you're using Node.js 22+ and have all dependencies installed
2. **API Key Issues**: Verify your LlamaCloud API key is correctly set
3. **File Path Errors**: Use absolute paths or ensure relative paths are correct from the project root
## License
MIT License - see the [LICENSE](../../LICENSE) file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run `npm run format` and `npm run lint`
5. Submit a pull request
+34
View File
@@ -0,0 +1,34 @@
{
"name": "tanstack-start-example-basic",
"private": true,
"sideEffects": false,
"type": "module",
"scripts": {
"dev": "vite dev",
"build": "vite build && tsc --noEmit",
"start": "node .output/server/index.mjs"
},
"dependencies": {
"@tanstack/react-router": "^1.133.22",
"@tanstack/react-router-devtools": "^1.133.22",
"@tanstack/react-start": "^1.133.22",
"llama-cloud-services": "file:../../ts/llama_cloud_services",
"react": "^19.0.0",
"react-dom": "^19.0.0",
"tailwind-merge": "^2.6.0",
"zod": "^3.24.2"
},
"devDependencies": {
"@tailwindcss/postcss": "^4.1.15",
"@types/node": "^22.5.4",
"@types/react": "^19.0.8",
"@types/react-dom": "^19.0.3",
"@vitejs/plugin-react": "^4.6.0",
"daisyui": "^5.3.7",
"postcss": "^8.5.1",
"tailwindcss": "^4.1.15",
"typescript": "^5.7.2",
"vite": "^7.1.7",
"vite-tsconfig-paths": "^5.1.4"
}
}
+5
View File
@@ -0,0 +1,5 @@
export default {
plugins: {
'@tailwindcss/postcss': {},
},
}
Binary file not shown (3.3 KiB).
Binary file not shown (21 KiB).
Binary file not shown (3.8 KiB).
Binary file not shown (862 B).
Binary file not shown (1.1 KiB).
Binary file not shown (1.1 KiB).
Binary file not shown (2.0 KiB).
@@ -0,0 +1,19 @@
{
"name": "",
"short_name": "",
"icons": [
{
"src": "/android-chrome-192x192.png",
"sizes": "192x192",
"type": "image/png"
},
{
"src": "/android-chrome-512x512.png",
"sizes": "512x512",
"type": "image/png"
}
],
"theme_color": "#ffffff",
"background_color": "#ffffff",
"display": "standalone"
}
@@ -0,0 +1,53 @@
import {
ErrorComponent,
Link,
rootRouteId,
useMatch,
useRouter,
} from '@tanstack/react-router'
import type { ErrorComponentProps } from '@tanstack/react-router'
export function DefaultCatchBoundary({ error }: ErrorComponentProps) {
const router = useRouter()
const isRoot = useMatch({
strict: false,
select: (state) => state.id === rootRouteId,
})
console.error('DefaultCatchBoundary Error:', error)
return (
<div className="min-w-0 flex-1 p-4 flex flex-col items-center justify-center gap-6">
<ErrorComponent error={error} />
<div className="flex gap-2 items-center flex-wrap">
<button
onClick={() => {
router.invalidate()
}}
className={`px-2 py-1 bg-gray-600 dark:bg-gray-700 rounded-sm text-white uppercase font-extrabold`}
>
Try Again
</button>
{isRoot ? (
<Link
to="/"
className={`px-2 py-1 bg-gray-600 dark:bg-gray-700 rounded-sm text-white uppercase font-extrabold`}
>
Home
</Link>
) : (
<Link
to="/"
className={`px-2 py-1 bg-gray-600 dark:bg-gray-700 rounded-sm text-white uppercase font-extrabold`}
onClick={(e) => {
e.preventDefault()
window.history.back()
}}
>
Go Back
</Link>
)}
</div>
</div>
)
}
@@ -0,0 +1,25 @@
import { Link } from '@tanstack/react-router'
export function NotFound({ children }: { children?: any }) {
return (
<div className="space-y-2 p-2">
<div className="text-gray-600 dark:text-gray-400">
{children || <p>The page you are looking for does not exist.</p>}
</div>
<p className="flex items-center gap-2 flex-wrap">
<button
onClick={() => window.history.back()}
className="bg-emerald-500 text-white px-2 py-1 rounded-sm uppercase font-black text-sm"
>
Go back
</button>
<Link
to="/"
className="bg-cyan-600 text-white px-2 py-1 rounded-sm uppercase font-black text-sm"
>
Start Over
</Link>
</p>
</div>
)
}
@@ -0,0 +1,225 @@
/* eslint-disable */
// @ts-nocheck
// noinspection JSUnusedGlobalSymbols
// This file was automatically generated by TanStack Router.
// You should NOT make any changes in this file as it will be overwritten.
// Additionally, you should also exclude this file from your linter and/or formatter to prevent it from being checked or modified.
import { Route as rootRouteImport } from './routes/__root'
import { Route as UsersRouteImport } from './routes/users'
import { Route as IndexRouteImport } from './routes/index'
import { Route as UsersIndexRouteImport } from './routes/users.index'
import { Route as PostsIndexRouteImport } from './routes/posts.index'
import { Route as UsersUserIdRouteImport } from './routes/users.$userId'
import { Route as PostsPostIdRouteImport } from './routes/posts.$postId'
import { Route as ApiClassifyRouteImport } from './routes/api/classify'
import { Route as PostsPostIdDeepRouteImport } from './routes/posts_.$postId.deep'
const UsersRoute = UsersRouteImport.update({
id: '/users',
path: '/users',
getParentRoute: () => rootRouteImport,
} as any)
const IndexRoute = IndexRouteImport.update({
id: '/',
path: '/',
getParentRoute: () => rootRouteImport,
} as any)
const UsersIndexRoute = UsersIndexRouteImport.update({
id: '/',
path: '/',
getParentRoute: () => UsersRoute,
} as any)
const PostsIndexRoute = PostsIndexRouteImport.update({
id: '/posts/',
path: '/posts/',
getParentRoute: () => rootRouteImport,
} as any)
const UsersUserIdRoute = UsersUserIdRouteImport.update({
id: '/$userId',
path: '/$userId',
getParentRoute: () => UsersRoute,
} as any)
const PostsPostIdRoute = PostsPostIdRouteImport.update({
id: '/posts/$postId',
path: '/posts/$postId',
getParentRoute: () => rootRouteImport,
} as any)
const ApiClassifyRoute = ApiClassifyRouteImport.update({
id: '/api/classify',
path: '/api/classify',
getParentRoute: () => rootRouteImport,
} as any)
const PostsPostIdDeepRoute = PostsPostIdDeepRouteImport.update({
id: '/posts_/$postId/deep',
path: '/posts/$postId/deep',
getParentRoute: () => rootRouteImport,
} as any)
export interface FileRoutesByFullPath {
'/': typeof IndexRoute
'/users': typeof UsersRouteWithChildren
'/api/classify': typeof ApiClassifyRoute
'/posts/$postId': typeof PostsPostIdRoute
'/users/$userId': typeof UsersUserIdRoute
'/posts': typeof PostsIndexRoute
'/users/': typeof UsersIndexRoute
'/posts/$postId/deep': typeof PostsPostIdDeepRoute
}
export interface FileRoutesByTo {
'/': typeof IndexRoute
'/api/classify': typeof ApiClassifyRoute
'/posts/$postId': typeof PostsPostIdRoute
'/users/$userId': typeof UsersUserIdRoute
'/posts': typeof PostsIndexRoute
'/users': typeof UsersIndexRoute
'/posts/$postId/deep': typeof PostsPostIdDeepRoute
}
export interface FileRoutesById {
__root__: typeof rootRouteImport
'/': typeof IndexRoute
'/users': typeof UsersRouteWithChildren
'/api/classify': typeof ApiClassifyRoute
'/posts/$postId': typeof PostsPostIdRoute
'/users/$userId': typeof UsersUserIdRoute
'/posts/': typeof PostsIndexRoute
'/users/': typeof UsersIndexRoute
'/posts_/$postId/deep': typeof PostsPostIdDeepRoute
}
export interface FileRouteTypes {
fileRoutesByFullPath: FileRoutesByFullPath
fullPaths:
| '/'
| '/users'
| '/api/classify'
| '/posts/$postId'
| '/users/$userId'
| '/posts'
| '/users/'
| '/posts/$postId/deep'
fileRoutesByTo: FileRoutesByTo
to:
| '/'
| '/api/classify'
| '/posts/$postId'
| '/users/$userId'
| '/posts'
| '/users'
| '/posts/$postId/deep'
id:
| '__root__'
| '/'
| '/users'
| '/api/classify'
| '/posts/$postId'
| '/users/$userId'
| '/posts/'
| '/users/'
| '/posts_/$postId/deep'
fileRoutesById: FileRoutesById
}
export interface RootRouteChildren {
IndexRoute: typeof IndexRoute
UsersRoute: typeof UsersRouteWithChildren
ApiClassifyRoute: typeof ApiClassifyRoute
PostsPostIdRoute: typeof PostsPostIdRoute
PostsIndexRoute: typeof PostsIndexRoute
PostsPostIdDeepRoute: typeof PostsPostIdDeepRoute
}
declare module '@tanstack/react-router' {
interface FileRoutesByPath {
'/users': {
id: '/users'
path: '/users'
fullPath: '/users'
preLoaderRoute: typeof UsersRouteImport
parentRoute: typeof rootRouteImport
}
'/': {
id: '/'
path: '/'
fullPath: '/'
preLoaderRoute: typeof IndexRouteImport
parentRoute: typeof rootRouteImport
}
'/users/': {
id: '/users/'
path: '/'
fullPath: '/users/'
preLoaderRoute: typeof UsersIndexRouteImport
parentRoute: typeof UsersRoute
}
'/posts/': {
id: '/posts/'
path: '/posts'
fullPath: '/posts'
preLoaderRoute: typeof PostsIndexRouteImport
parentRoute: typeof rootRouteImport
}
'/users/$userId': {
id: '/users/$userId'
path: '/$userId'
fullPath: '/users/$userId'
preLoaderRoute: typeof UsersUserIdRouteImport
parentRoute: typeof UsersRoute
}
'/posts/$postId': {
id: '/posts/$postId'
path: '/posts/$postId'
fullPath: '/posts/$postId'
preLoaderRoute: typeof PostsPostIdRouteImport
parentRoute: typeof rootRouteImport
}
'/api/classify': {
id: '/api/classify'
path: '/api/classify'
fullPath: '/api/classify'
preLoaderRoute: typeof ApiClassifyRouteImport
parentRoute: typeof rootRouteImport
}
'/posts_/$postId/deep': {
id: '/posts_/$postId/deep'
path: '/posts/$postId/deep'
fullPath: '/posts/$postId/deep'
preLoaderRoute: typeof PostsPostIdDeepRouteImport
parentRoute: typeof rootRouteImport
}
}
}
interface UsersRouteChildren {
UsersUserIdRoute: typeof UsersUserIdRoute
UsersIndexRoute: typeof UsersIndexRoute
}
const UsersRouteChildren: UsersRouteChildren = {
UsersUserIdRoute: UsersUserIdRoute,
UsersIndexRoute: UsersIndexRoute,
}
const UsersRouteWithChildren = UsersRoute._addFileChildren(UsersRouteChildren)
const rootRouteChildren: RootRouteChildren = {
IndexRoute: IndexRoute,
UsersRoute: UsersRouteWithChildren,
ApiClassifyRoute: ApiClassifyRoute,
PostsPostIdRoute: PostsPostIdRoute,
PostsIndexRoute: PostsIndexRoute,
PostsPostIdDeepRoute: PostsPostIdDeepRoute,
}
export const routeTree = rootRouteImport
._addFileChildren(rootRouteChildren)
._addFileTypes<FileRouteTypes>()
import type { getRouter } from './router.tsx'
import type { createStart } from '@tanstack/react-start'
declare module '@tanstack/react-start' {
interface Register {
ssr: true
router: Awaited<ReturnType<typeof getRouter>>
}
}
@@ -0,0 +1,15 @@
import { createRouter } from '@tanstack/react-router'
import { routeTree } from './routeTree.gen'
import { DefaultCatchBoundary } from './components/DefaultCatchBoundary'
import { NotFound } from './components/NotFound'
export function getRouter() {
const router = createRouter({
routeTree,
defaultPreload: 'intent',
defaultErrorComponent: DefaultCatchBoundary,
defaultNotFoundComponent: () => <NotFound />,
scrollRestoration: true,
})
return router
}
@@ -0,0 +1,128 @@
/// <reference types="vite/client" />
import {
HeadContent,
Scripts,
createRootRoute,
} from '@tanstack/react-router'
import * as React from 'react'
import { DefaultCatchBoundary } from '~/components/DefaultCatchBoundary'
import { NotFound } from '~/components/NotFound'
import { seo } from '~/utils/seo'
export const Route = createRootRoute({
head: () => ({
meta: [
{
charSet: 'utf-8',
},
{
name: 'viewport',
content: 'width=device-width, initial-scale=1',
},
...seo({
title:
'Financial Documents Classification Agent',
description: `Classify financial documents as balance sheets, income statements, and cash flow statements.`,
}),
],
links: [
{ rel: 'stylesheet', href: "https://cdn.jsdelivr.net/npm/daisyui@5" },
{
rel: 'apple-touch-icon',
sizes: '180x180',
href: '/apple-touch-icon.png',
},
{
rel: 'icon',
type: 'image/png',
sizes: '32x32',
href: '/favicon-32x32.png',
},
{
rel: 'icon',
type: 'image/png',
sizes: '16x16',
href: '/favicon-16x16.png',
},
{ rel: 'manifest', href: '/site.webmanifest', color: '#ffffff' },
{ rel: 'icon', href: '/favicon.ico' },
],
scripts: [
{
src: '/customScript.js',
type: 'text/javascript',
},
{
src: "https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4",
type: "text/javascript",
}
],
}),
errorComponent: DefaultCatchBoundary,
notFoundComponent: () => <NotFound />,
shellComponent: RootDocument,
})
function RootDocument({ children }: { children: React.ReactNode }) {
return (
<html>
<head>
<HeadContent />
</head>
<body>
<div className="navbar bg-base-100 shadow-sm">
<div className="navbar-start">
<div className="dropdown">
<div tabIndex={0} role="button" className="btn btn-ghost btn-circle">
<svg
xmlns="http://www.w3.org/2000/svg"
className="h-5 w-5"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
>
<path
strokeLinecap="round"
strokeLinejoin="round"
strokeWidth="2"
d="M4 6h16M4 12h16M4 18h7"
/>
</svg>
</div>
<ul
tabIndex={0}
className="menu menu-lg dropdown-content bg-base-100 rounded-box z-1 mt-3 w-80 p-2 shadow"
>
<li><a href="/">Home</a></li>
<li><a href="https://cloud.llamaindex.ai">Get Started with LlamaCloud</a></li>
<li><a href="https://developers.llamaindex.ai/python/cloud/llamaclassify/getting_started/">LlamaClassify Docs</a></li>
</ul>
</div>
</div>
<div className="navbar-center">
<a className="btn btn-ghost text-xl" href="/">Financial Documents Classification Agent</a>
</div>
<div className="navbar-end">
<a href="https://github.com/run-llama/llama_cloud_services/tree/main/examples-ts/classify">
<button className="btn btn-ghost btn-circle">
<div className="indicator">
<svg
xmlns="http://www.w3.org/2000/svg"
className="h-10 w-10"
fill="currentColor"
viewBox="0 0 640 512"
>
<path d="M237.9 461.4C237.9 463.4 235.6 465 232.7 465C229.4 465.3 227.1 463.7 227.1 461.4C227.1 459.4 229.4 457.8 232.3 457.8C235.3 457.5 237.9 459.1 237.9 461.4zM206.8 456.9C206.1 458.9 208.1 461.2 211.1 461.8C213.7 462.8 216.7 461.8 217.3 459.8C217.9 457.8 216 455.5 213 454.6C210.4 453.9 207.5 454.9 206.8 456.9zM251 455.2C248.1 455.9 246.1 457.8 246.4 460.1C246.7 462.1 249.3 463.4 252.3 462.7C255.2 462 257.2 460.1 256.9 458.1C256.6 456.2 253.9 454.9 251 455.2zM316.8 72C178.1 72 72 177.3 72 316C72 426.9 141.8 521.8 241.5 555.2C254.3 557.5 258.8 549.6 258.8 543.1C258.8 536.9 258.5 502.7 258.5 481.7C258.5 481.7 188.5 496.7 173.8 451.9C173.8 451.9 162.4 422.8 146 415.3C146 415.3 123.1 399.6 147.6 399.9C147.6 399.9 172.5 401.9 186.2 425.7C208.1 464.3 244.8 453.2 259.1 446.6C261.4 430.6 267.9 419.5 275.1 412.9C219.2 406.7 162.8 398.6 162.8 302.4C162.8 274.9 170.4 261.1 186.4 243.5C183.8 237 175.3 210.2 189 175.6C209.9 169.1 258 202.6 258 202.6C278 197 299.5 194.1 320.8 194.1C342.1 194.1 363.6 197 383.6 202.6C383.6 202.6 431.7 169 452.6 175.6C466.3 210.3 457.8 237 455.2 243.5C471.2 261.2 481 275 481 302.4C481 398.9 422.1 406.6 366.2 412.9C375.4 420.8 383.2 435.8 383.2 459.3C383.2 493 382.9 534.7 382.9 542.9C382.9 549.4 387.5 557.3 400.2 555C500.2 521.8 568 426.9 568 316C568 177.3 455.5 72 316.8 72zM169.2 416.9C167.9 417.9 168.2 420.2 169.9 422.1C171.5 423.7 173.8 424.4 175.1 423.1C176.4 422.1 176.1 419.8 174.4 417.9C172.8 416.3 170.5 415.6 169.2 416.9zM158.4 408.8C157.7 410.1 158.7 411.7 160.7 412.7C162.3 413.7 164.3 413.4 165 412C165.7 410.7 164.7 409.1 162.7 408.1C160.7 407.5 159.1 407.8 158.4 408.8zM190.8 444.4C189.2 445.7 189.8 448.7 192.1 450.6C194.4 452.9 197.3 453.2 198.6 451.6C199.9 450.3 199.3 447.3 197.3 445.4C195.1 443.1 192.1 442.8 190.8 444.4zM179.4 429.7C177.8 430.7 177.8 433.3 179.4 435.6C181 437.9 183.7 438.9 185 437.9C186.6 436.6 186.6 434 185 431.7C183.6 429.4 181 428.4 179.4 429.7z" />
</svg>
</div>
</button>
</a>
</div>
</div>
<hr />
{children}
<Scripts />
</body>
</html>
)
}
@@ -0,0 +1,45 @@
import { createFileRoute } from '@tanstack/react-router'
import { classifier, classificationRules, parsingConfig } from '~/utils/classifier'
export const Route = createFileRoute('/api/classify')({
component: RouteComponent,
server: {
handlers: {
POST: async ({ request }) => {
const body = await request.formData()
const fl = body.get("file") as File;
if (!fl) {
// Signal a client error instead of replying with the default 200 status.
return new Response(JSON.stringify({"result": "you need to provide a file"}), { status: 400 })
}
const buff = await fl.arrayBuffer()
const rawRes = await classifier.classify(
classificationRules,
parsingConfig,
{ fileContents: [new Uint8Array(buff)] },
)
const results = rawRes.items
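// Escape untrusted values before embedding them in the HTML string,
// because the client renders this response with dangerouslySetInnerHTML.
const escapeHtml = (value: unknown) =>
String(value).replace(/[&<>"']/g, (ch) => `&#${ch.charCodeAt(0)};`)
let classification = ""
for (const result of results) {
if ("result" in result && result.result) {
classification += `
<div class="card bg-base-100 shadow-xl p-6 mb-4">
<div class="space-y-3">
<p><span class="font-semibold">📄 Document:</span> ${escapeHtml(fl.name)}</p>
<p><span class="font-semibold">🏷️ Type:</span> <span class="badge badge-primary">${escapeHtml(result.result.type)}</span></p>
<p><span class="font-semibold">📊 Confidence:</span> ${(result.result.confidence * 100).toFixed(1)}%</p>
<p><span class="font-semibold">💭 Reasoning:</span> ${escapeHtml(result.result.reasoning)}</p>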
</div>
</div>
`
}
}
return new Response(JSON.stringify({"result": classification}))
},
},
},
})
function RouteComponent() {
// API-only route: there is nothing to render.
return null
}
@@ -0,0 +1,99 @@
import { createFileRoute } from '@tanstack/react-router'
import { useRef, useState } from 'react'
export const Route = createFileRoute('/')({
component: Home,
})
function Home() {
const [file, setFile] = useState<null | File>(null)
const fileInputRef = useRef<HTMLInputElement>(null)
const [reply, setReply] = useState<null | string>(null)
const [loading, setLoading] = useState<boolean>(false)
const handleFileChange = (event: React.ChangeEvent<HTMLInputElement>) => {
const selectedFile = event.target.files?.[0]
if (selectedFile) {
setFile(selectedFile)
}
}
const handleClearFile = () => {
if (file) {
setFile(null)
}
if (fileInputRef.current) {
fileInputRef.current.value = ''
}
if (reply) {
setReply(null)
}
}
const handleClassify = async () => {
if (!file) return
if (reply) {
setReply(null)
}
setLoading(true)
try {
const formData = new FormData()
formData.append('file', file)
const res = await fetch('/api/classify', {
method: 'POST',
body: formData,
})
const data = await res.json()
setReply(data.result)
} catch (error) {
console.error('Error:', error)
} finally {
setLoading(false)
}
}
return (
<div className="flex flex-col justify-center items-center gap-y-8">
<br />
<h1 className="text-xl font-bold text-gray-700">AI-Powered financial document classification</h1>
<h2 className="text-lg font-semibold text-gray-500">Need help sorting out the financial documents jungle? Let our classification agent handle it!</h2>
<fieldset className="fieldset bg-base-100 border-base-300 rounded-box w-200 border p-4">
<legend className="fieldset-legend text-lg">Upload your financial document here</legend>
<label className="label flex justify-center">
<input type="file" className="file-input" onChange={handleFileChange} accept='application/pdf' ref={fileInputRef} />
</label>
</fieldset>
{file && (
<div className="flex flex-col justify-center items-center gap-y-8">
<p className="text-sm text-gray-600">Selected file: {file.name}</p>
<div className='grid grid-cols-2 gap-x-6'>
<button
type="button"
className='btn bg-gray-500 text-white shadow-lg hover:bg-gray-600 hover:shadow-xl rounded'
onClick={handleClassify}
>
Classify
</button>
<button
onClick={handleClearFile}
type="button"
className="px-4 py-2 bg-red-300 text-black rounded hover:bg-red-400 hover:shadow-xl shadow-lg"
>
Clear
</button>
</div>
</div>
)}
{loading && (
<span className="loading loading-spinner text-primary"></span>
)}
{reply && (
<div
className="max-w-2xl w-full"
dangerouslySetInnerHTML={{ __html: reply }}
/>
)}
</div>
)
}
@@ -0,0 +1,23 @@
import { LlamaClassify, ClassifierRule, ClassifyParsingConfiguration } from "llama-cloud-services"
export const classifier = new LlamaClassify(process.env.LLAMA_CLOUD_API_KEY);
export const classificationRules: ClassifierRule[] = [
{
description: "Shows a company's assets, liabilities, and shareholders' equity at a specific point in time, providing a snapshot of financial position.",
type: "balance_sheet"
},
{
description: "Reports cash inflows and outflows from operating, investing, and financing activities, highlighting liquidity and cash management.",
type: "cash_flow_statement"
},
{
description: "Summarizes revenues, expenses, and profits over a period, indicating financial performance and profitability.",
type: "income_statement"
},
];
export const parsingConfig: ClassifyParsingConfiguration = {
lang: "en",
max_pages: 20,
}
@@ -0,0 +1,33 @@
export const seo = ({
title,
description,
keywords,
image,
}: {
title: string
description?: string
image?: string
keywords?: string
}) => {
const tags = [
{ title },
{ name: 'description', content: description },
{ name: 'keywords', content: keywords },
{ name: 'twitter:title', content: title },
{ name: 'twitter:description', content: description },
{ name: 'twitter:creator', content: '@tannerlinsley' },
{ name: 'twitter:site', content: '@tannerlinsley' },
{ name: 'og:type', content: 'website' },
{ name: 'og:title', content: title },
{ name: 'og:description', content: description },
...(image
? [
{ name: 'twitter:image', content: image },
{ name: 'twitter:card', content: 'summary_large_image' },
{ name: 'og:image', content: image },
]
: []),
]
return tags
}
@@ -0,0 +1,22 @@
{
"include": ["**/*.ts", "**/*.tsx"],
"compilerOptions": {
"strict": true,
"esModuleInterop": true,
"jsx": "react-jsx",
"module": "ESNext",
"moduleResolution": "Bundler",
"lib": ["DOM", "DOM.Iterable", "ES2022"],
"isolatedModules": true,
"resolveJsonModule": true,
"skipLibCheck": true,
"target": "ES2022",
"allowJs": true,
"forceConsistentCasingInFileNames": true,
"baseUrl": ".",
"paths": {
"~/*": ["./src/*"]
},
"noEmit": true
}
}
@@ -0,0 +1,19 @@
import { tanstackStart } from '@tanstack/react-start/plugin/vite'
import { defineConfig } from 'vite'
import tsConfigPaths from 'vite-tsconfig-paths'
import viteReact from '@vitejs/plugin-react'
export default defineConfig({
server: {
port: 3000,
},
plugins: [
tsConfigPaths({
projects: ['./tsconfig.json'],
}),
tanstackStart({
srcDirectory: 'src',
}),
viteReact(),
],
})
@@ -0,0 +1,122 @@
# LlamaExtract Demo
A TypeScript demo application showcasing **LlamaExtract**, an agentic structured data extraction service from [LlamaCloud](https://cloud.llamaindex.ai). This demo extracts structured information from scientific papers and renders it as markdown.
## Table of Contents
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
  - [Start the Demo](#start-the-demo)
  - [Development Mode](#development-mode)
  - [Build the Project](#build-the-project)
  - [Code Quality](#code-quality)
- [How It Works](#how-it-works)
- [Troubleshooting](#troubleshooting)
  - [Common Issues](#common-issues)
- [License](#license)
- [Contributing](#contributing)
## Features
- 📄 **Structured Data Extraction**: Extract data from your files effortlessly, and structure them the way you want!
- 🤖 **Markdown Rendering**: Generate markdown directly from your extracted data
- 🎨 **Beautiful CLI**: Styled console interface with colors and ASCII art
- ⚡ **Fast Development**: Hot reload support with watch mode
- 🛠️ **TypeScript**: Full TypeScript support with strict type checking
## Prerequisites
- Node.js (version 18 or higher)
- npm package manager (bundled with Node.js)
- LlamaCloud API key
## Installation
1. Clone the repository:
```bash
git clone https://github.com/run-llama/llama_cloud_services
cd llama_cloud_services/examples-ts/extract/
```
2. Install dependencies:
```bash
npm install
```
3. Set up your environment variables:
```bash
# Add your API key to your environment
export LLAMA_CLOUD_API_KEY="your-llamacloud-api-key"
```
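
As a minimal sketch of how the key set above could be checked before any client is created, the helper below fails early with a readable message. The name `requireEnv` and its signature are hypothetical and not part of the demo; the environment is passed in explicitly so the function is easy to test.

```typescript
// Hypothetical helper (not part of the demo): read a required environment
// variable and fail with a clear message if it is missing or blank.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In a Node.js entry point this would typically be called as `requireEnv("LLAMA_CLOUD_API_KEY", process.env)` before constructing the client.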
## Usage
### Start the Demo
```bash
npm run start
```
The application will display a welcome screen and prompt you to enter the path to a document you'd like to process.
### Development Mode
For development with hot reload:
```bash
npm run dev
```
### Build the Project
```bash
npm run build
```
### Code Quality
Format code:
```bash
npm run format
```
Lint code:
```bash
npm run lint
```
## How It Works
1. **Document Input**: Enter the path to your document when prompted
2. **Extraction**: LlamaExtract processes the document and extracts structured data according to the schema defined in [`src/schema.ts`](./src/schema.ts)
3. **Markdown Rendering**: The extracted content is rendered into beautiful markdown
4. **Results**: View the results directly in your terminal
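
The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the demo's actual code: `PaperSummary`, `Extractor`, `processDocument`, and `fakeExtract` are hypothetical names, and the extractor is a stub standing in for the real LlamaExtract call so the sketch runs without network access.

```typescript
// Assumed, simplified shape of an extraction result (the real schema has more fields).
type PaperSummary = { title: string; mainFindings: string[] };

// Hypothetical extractor signature standing in for the LlamaExtract call.
type Extractor = (filePath: string) => Promise<PaperSummary>;

// Step 3: turn the structured data into markdown.
function toMarkdown(paper: PaperSummary): string {
  const findings = paper.mainFindings.map((f) => `- ${f}`).join("\n");
  return `# ${paper.title}\n\n## Main Findings\n${findings}`;
}

// Steps 1 to 3: take a path, extract structured data, render it.
async function processDocument(
  filePath: string,
  extract: Extractor,
): Promise<string> {
  const data = await extract(filePath);
  return toMarkdown(data);
}

// Stub extractor (an assumption for illustration only).
const fakeExtract: Extractor = async (filePath) => ({
  title: `Paper at ${filePath}`,
  mainFindings: ["Finding A", "Finding B"],
});
```

Passing the extractor as a parameter keeps the rendering logic testable on its own; step 4 then amounts to printing the returned string.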
## Troubleshooting
### Common Issues
1. **Module Resolution Errors**: Ensure you're using Node.js 18+ and have all dependencies installed
2. **API Key Issues**: Verify your LlamaCloud API key is correctly set
3. **File Path Errors**: Use absolute paths or ensure relative paths are correct from the project root
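
For file path errors in particular, one possible approach is to normalize whatever the user types before handing it to the extractor. The function below is a hedged sketch; `resolveInputPath` is a hypothetical name, and the demo itself may handle paths differently.

```typescript
import * as path from "node:path";

// Hypothetical helper: trim the input, drop one pair of surrounding quotes
// (common when a path is pasted from a shell), and resolve it against `cwd`.
// Absolute paths are returned as-is (normalized); relative paths are resolved
// relative to the directory the command was started from.
function resolveInputPath(userInput: string, cwd: string): string {
  const cleaned = userInput.trim().replace(/^["']|["']$/g, "");
  return path.resolve(cwd, cleaned);
}
```

A typical call would be `resolveInputPath(answer, process.cwd())`, followed by an existence check before starting the extraction.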
## License
MIT License - see the [LICENSE](../../LICENSE) file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run `npm run format` and `npm run lint`
5. Submit a pull request
@@ -0,0 +1,14 @@
import js from "@eslint/js";
import globals from "globals";
import tseslint from "typescript-eslint";
import { defineConfig } from "eslint/config";
export default defineConfig([
{
files: ["**/*.{js,mjs,cjs,ts,mts,cts}"],
plugins: { js },
extends: ["js/recommended"],
languageOptions: { globals: globals.browser },
},
tseslint.configs.recommended,
]);
File diff suppressed because it is too large
@@ -0,0 +1,37 @@
{
"name": "llama-extract-demo",
"version": "0.1.0",
"description": "Demo for LlamaExtract in TypeScript",
"main": "index.js",
"scripts": {
"test": "echo \"There are no tests\"",
"start": "npm exec tsx src/index.ts",
"lint": "eslint ./src/",
"format": "prettier --write ./src/",
"build": "tsc",
"dev": "npm exec tsx --watch src/index.ts"
},
"author": "LlamaIndex",
"license": "MIT",
"dependencies": {
"cli-markdown": "^3.5.1",
"consola": "^3.4.2",
"figlet": "^1.8.2",
"llama-cloud-services": "file:../../ts/llama_cloud_services",
"marked": "^15.0.12",
"marked-terminal": "^7.3.0",
"picocolors": "^1.1.1"
},
"devDependencies": {
"@eslint/js": "^9.32.0",
"@types/figlet": "^1.7.0",
"@types/marked-terminal": "^6.1.1",
"@types/node": "^24.2.0",
"eslint": "^9.32.0",
"globals": "^16.3.0",
"jiti": "^2.5.1",
"prettier": "^3.6.2",
"typescript": "^5.9.2",
"typescript-eslint": "^8.39.0"
}
}
@@ -0,0 +1,47 @@
import { LlamaExtract, ExtractConfig } from "llama-cloud-services";
import cliMarkdown from "cli-markdown";
import { logger } from "./logger";
import pc from "picocolors";
import { consoleInput, renderLogo } from "./utils";
import { dataSchema } from "./schema";
import { renderMarkdown, ResearchData } from "./markdown";
export async function main(): Promise<number> {
const extractClient = new LlamaExtract(
process.env.LLAMA_CLOUD_API_KEY!,
"https://api.cloud.llamaindex.ai",
);
await renderLogo();
logger.log(
`Welcome to ${pc.bold(
pc.magentaBright("LlamaExtract Demo✨"),
)}, our demo for ${pc.bold(pc.green("LlamaExtract"))}, a ${pc.bold(
pc.cyan("LlamaCloud☁️"),
)} (https://cloud.llamaindex.ai) product!\nIn this demo we are going to try extracting relevant information ${pc.bold(
pc.yellowBright("from scientific papers"),
)}. Type the path to the paper you would like to process below👇\nIf you wish to exit, just type ${pc.bold(
pc.gray("quit"),
)}.\n`,
);
while (true) {
const userInput = await consoleInput();
if (userInput.trim().toLowerCase() === "quit") {
break;
}
try {
const generatedData = await extractClient.extract(
dataSchema,
{} as ExtractConfig,
userInput,
);
const research = renderMarkdown(generatedData?.data as ResearchData);
logger.log(`${pc.bold(pc.cyan("Extracted information:✨"))}\n`);
logger.log(cliMarkdown(research));
} catch (error) {
logger.error(`Error processing file: ${error}`);
}
}
return 0;
}
main().catch(console.error);
@@ -0,0 +1,8 @@
import { createConsola } from "consola";
import type { ConsolaInstance } from "consola";
export const logger: ConsolaInstance = createConsola({
formatOptions: {
date: false,
},
});
@@ -0,0 +1,172 @@
type Author = {
name: string;
affiliation?: string;
email?: string;
};
type Methodology = {
approach?: string;
participants?: string;
methods?: string[];
};
type Result = {
finding?: string;
significance?: string;
supportingData?: string;
};
type Reference = {
title: string;
authors: string;
year?: string;
relevance?: string;
};
type Discussion = {
implications?: string[];
limitations?: string[];
futureWork?: string[];
};
type Publication = {
journal?: string;
year: string;
doi?: string;
url?: string;
};
export type ResearchData = {
title: string;
authors: Author[];
abstract: string;
keywords?: string[];
mainFindings: string[];
methodology?: Methodology;
results?: Result[];
discussion?: Discussion;
references?: Reference[];
publication?: Publication;
};
export function renderMarkdown(data: ResearchData): string {
const {
title,
authors,
abstract,
keywords,
mainFindings,
methodology,
results,
discussion,
references,
publication,
} = data;
const md: string[] = [];
md.push(`# ${title}\n`);
// Authors
md.push(`## Authors`);
md.push(
authors
.map(
(author) =>
`- **${author.name}**${
author.affiliation ? `, *${author.affiliation}*` : ""
}${author.email ? ` (${author.email})` : ""}`,
)
.join("\n"),
);
// Abstract
md.push(`\n## Abstract\n${abstract}`);
// Keywords
if (keywords && keywords.length > 0) {
md.push(`\n## Keywords\n${keywords.map((k) => `- ${k}`).join("\n")}`);
}
// Main Findings
md.push(
`\n## Main Findings\n${mainFindings.map((f) => `- ${f}`).join("\n")}`,
);
// Methodology
if (methodology) {
md.push(`\n## Methodology`);
if (methodology.approach) md.push(`**Approach:** ${methodology.approach}`);
if (methodology.participants)
md.push(`**Participants:** ${methodology.participants}`);
if (methodology.methods?.length) {
md.push(
`**Methods:**\n${methodology.methods.map((m) => `- ${m}`).join("\n")}`,
);
}
}
// Results
if (results?.length) {
md.push(`\n## Results`);
results.forEach((result, i) => {
md.push(`\n### Result ${i + 1}`);
if (result.finding) md.push(`- **Finding:** ${result.finding}`);
if (result.significance)
md.push(`- **Significance:** ${result.significance}`);
if (result.supportingData)
md.push(`- **Supporting Data:** ${result.supportingData}`);
});
}
// Discussion
if (discussion) {
md.push(`\n## Discussion`);
if (discussion.implications?.length) {
md.push(
`### Implications\n${discussion.implications
.map((d) => `- ${d}`)
.join("\n")}`,
);
}
if (discussion.limitations?.length) {
md.push(
`### Limitations\n${discussion.limitations
.map((d) => `- ${d}`)
.join("\n")}`,
);
}
if (discussion.futureWork?.length) {
md.push(
`### Future Work\n${discussion.futureWork
.map((d) => `- ${d}`)
.join("\n")}`,
);
}
}
// References
if (references?.length) {
md.push(`\n## References`);
references.forEach((ref, i) => {
md.push(
`\n**[${i + 1}]** ${ref.title} — *${ref.authors}*${
ref.year ? ` (${ref.year})` : ""
}`,
);
if (ref.relevance) md.push(`> ${ref.relevance}`);
});
}
// Publication Info
if (publication) {
md.push(`\n## Publication`);
if (publication.journal) md.push(`- **Journal:** ${publication.journal}`);
if (publication.year) md.push(`- **Year:** ${publication.year}`);
if (publication.doi) md.push(`- **DOI:** ${publication.doi}`);
if (publication.url)
md.push(`- **URL:** [${publication.url}](${publication.url})`);
}
return md.join("\n");
}
@@ -0,0 +1,169 @@
export const dataSchema = {
type: "object",
required: ["title", "authors", "abstract", "mainFindings"],
properties: {
title: {
type: "string",
description: "The full title of the research paper",
},
authors: {
type: "array",
description: "List of all authors of the paper",
items: {
type: "object",
properties: {
name: {
type: "string",
description: "Full name of the author",
},
affiliation: {
type: "string",
description:
"Institution or organization the author is affiliated with",
},
email: {
type: "string",
description: "Contact email of the author if provided",
},
},
// Matches the `Author` type in markdown.ts, where `name` is mandatory.
required: ["name"],
},
},
abstract: {
type: "string",
description: "Complete abstract or summary of the paper",
},
keywords: {
type: "array",
description:
"Key terms and phrases that describe the paper's main topics",
items: {
type: "string",
},
},
mainFindings: {
type: "array",
description: "Key findings, conclusions, or contributions of the paper",
items: {
type: "string",
},
},
methodology: {
type: "object",
description: "Research methods and approaches used",
properties: {
approach: {
type: "string",
description: "Overall research approach or study design",
},
participants: {
type: "string",
description: "Description of study participants or data sources",
},
methods: {
type: "array",
description: "Specific methods, techniques, or tools used",
items: {
type: "string",
},
},
},
},
results: {
type: "array",
description: "Main results and outcomes of the research",
items: {
type: "object",
properties: {
finding: {
type: "string",
description: "Description of the specific result or finding",
},
significance: {
type: "string",
description:
"Statistical significance or importance of the finding",
},
supportingData: {
type: "string",
description: "Relevant statistics, measurements, or data points",
},
},
},
},
discussion: {
type: "object",
properties: {
implications: {
type: "array",
description: "Theoretical or practical implications of the findings",
items: {
type: "string",
},
},
limitations: {
type: "array",
description: "Study limitations or constraints",
items: {
type: "string",
},
},
futureWork: {
type: "array",
description: "Suggested future research directions",
items: {
type: "string",
},
},
},
},
references: {
type: "array",
description:
"Key papers cited that are crucial to understanding this work",
items: {
type: "object",
properties: {
title: {
type: "string",
description: "Title of the cited paper",
},
authors: {
type: "string",
description: "Authors of the cited paper",
},
year: {
type: "string",
description: "Publication year",
},
relevance: {
type: "string",
description: "Why this reference is important to the current paper",
},
},
required: ["title", "authors"],
},
},
publication: {
type: "object",
properties: {
journal: {
type: "string",
description: "Name of the journal or conference",
},
year: {
type: "string",
description: "Year of publication",
},
doi: {
type: "string",
description: "Digital Object Identifier (DOI) of the paper",
},
url: {
type: "string",
description: "URL where the paper can be accessed",
},
},
required: ["year"],
},
},
};
@@ -0,0 +1,4 @@
declare module "cli-markdown" {
function cliMarkdown(input: string): string;
export default cliMarkdown;
}
@@ -0,0 +1,33 @@
import * as readline from "readline/promises";
import figlet from "figlet";
import pc from "picocolors";
export async function renderLogo(): Promise<void> {
const logoText = figlet.textSync("Extract Demo", {
font: "ANSI Shadow",
horizontalLayout: "default",
verticalLayout: "default",
width: 100,
whitespaceBreak: true,
});
// Add some styling with picocolors
const styledLogo = pc.bold(pc.redBright(logoText));
// Add some padding/margin
console.log("\n");
console.log(styledLogo);
console.log(pc.gray("─".repeat(60)));
console.log("\n");
}
export async function consoleInput(): Promise<string> {
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
const answer = await rl.question("Path to your file: ");
rl.close();
return answer;
}
@@ -0,0 +1,131 @@
# LlamaCloud Index Demo
A TypeScript demo application showcasing **LlamaCloud Index**, a fully automated document ingestion and retrieval service offered within [LlamaCloud](https://cloud.llamaindex.ai). This demo lets you ask questions, retrieve relevant context from your index, and generate AI-powered responses using OpenAI's GPT models.
## Table of Contents
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Start the Demo](#start-the-demo)
- [Development Mode](#development-mode)
- [Build the Project](#build-the-project)
- [Code Quality](#code-quality)
- [Quick Commands Reference](#quick-commands-reference)
- [How It Works](#how-it-works)
- [API Dependencies](#api-dependencies)
- [Troubleshooting](#troubleshooting)
- [Common Issues](#common-issues)
- [License](#license)
- [Contributing](#contributing)
## Features
- 🤖 **RAG**: Simple-yet-effective Retrieval Augmented Generation pipeline built on top of LlamaCloud Index and OpenAI
- 🎨 **Beautiful CLI**: Styled console interface with colors and ASCII art
- **Fast Development**: Hot reload support with watch mode
- 🛠️ **TypeScript**: Full TypeScript support with strict type checking
## Prerequisites
- Node.js (version 18 or higher)
- pnpm package manager
- OpenAI API key
- LlamaCloud API key
- An existing LlamaCloud Index pipeline
## Installation
1. Clone the repository:
```bash
git clone https://github.com/run-llama/llama_cloud_services
cd llama_cloud_services/examples-ts/index/
```
2. Install dependencies:
```bash
pnpm install
```
3. Set up your environment variables:
```bash
export OPENAI_API_KEY="your-openai-api-key"
export LLAMA_CLOUD_API_KEY="your-llamacloud-api-key"
export PIPELINE_NAME="your-pipeline-name"
```
4. Or write them into a `.env` file:
```env
OPENAI_API_KEY="your-openai-api-key"
LLAMA_CLOUD_API_KEY="your-llamacloud-api-key"
PIPELINE_NAME="your-pipeline-name"
```
## Usage
### Start the Demo
```bash
pnpm run start
```
The application will display a welcome screen and prompt you to start chatting!
### Development Mode
For development with hot reload:
```bash
pnpm run dev
```
### Build the Project
```bash
pnpm run build
```
### Code Quality
Format code:
```bash
pnpm run format
```
Lint code:
```bash
pnpm run lint
```
## How It Works
1. **Message Input**: Enter a message
2. **Retrieval**: Several nodes are retrieved from the LlamaCloud index you specified
3. **AI Response Generation**: The retrieved nodes are passed to the AI model together with their relevance scores, and the model generates a reply to your original message from that context
4. **Results**: View the AI-generated answer in your terminal
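The retrieval and generation steps above hinge on how the retrieved nodes and their scores are folded into a single prompt. The following is a minimal sketch of that step, not the demo's exact code: `RetrievedChunk` and `buildRagPrompt` are hypothetical names used for illustration (the demo itself works with `NodeWithScore` from `llamaindex`), and `-1` is assumed as the marker for a missing score.

```typescript
// Hypothetical minimal shape of a retrieved node; the demo uses NodeWithScore instead.
interface RetrievedChunk {
  text: string;
  score?: number;
}

// Serialize each chunk with its relevance score, then append the user's question.
// A missing score is encoded as -1 so the prompt can explain what that value means.
function buildRagPrompt(chunks: RetrievedChunk[], question: string): string {
  const context = chunks
    .map((chunk) =>
      JSON.stringify({
        information: chunk.text,
        relevanceScore: chunk.score ?? -1,
      }),
    )
    .join("\n");
  return (
    `[\n${context}\n]\n\n` +
    "Based on the information above and its relevance scores " +
    "(where -1 means no score is available), answer this user prompt: " +
    JSON.stringify(question)
  );
}
```

Serializing with `JSON.stringify` keeps quotes inside the retrieved text from breaking the prompt structure, which plain string interpolation would not guarantee.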
## Troubleshooting
### Common Issues
1. **Module Resolution Errors**: Ensure you're using Node.js 18+ and have all dependencies installed
2. **API Key Issues**: Verify your OpenAI and LlamaCloud API keys are correctly set
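To narrow down API key issues before the first request is sent, a small startup check can help. This is a sketch under assumptions: `findMissingEnvVars` is a hypothetical helper that is not part of the demo, and the variable names are the ones listed in the Installation section.

```typescript
// Variable names taken from the Installation section of this README.
const REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY", "PIPELINE_NAME"];

// Return the names of required variables that are unset or contain only whitespace.
function findMissingEnvVars(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_ENV_VARS,
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}
```

Calling `findMissingEnvVars(process.env)` at the top of `main()` and logging the returned names would point at the misconfigured variable directly.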
## License
MIT License - see the [LICENSE](../../LICENSE) file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run `pnpm run format` and `pnpm run lint`
5. Submit a pull request
+15
View File
@@ -0,0 +1,15 @@
import js from "@eslint/js";
import globals from "globals";
import tseslint from "typescript-eslint";
import { defineConfig } from "eslint/config";
export default defineConfig([
{
files: ["**/*.{js,mjs,cjs,ts,mts,cts}"],
plugins: { js },
extends: ["js/recommended"],
languageOptions: { globals: globals.browser },
},
{ files: ["**/*.js"], languageOptions: { sourceType: "script" } },
tseslint.configs.recommended,
]);
+48
View File
@@ -0,0 +1,48 @@
{
"name": "llama-chat",
"version": "0.1.0",
"description": "Demo for LlamaCloud Index in TypeScript",
"type": "module",
"main": "index.js",
"scripts": {
"test": "echo \"There are no tests\"",
"start": "pnpm exec tsx src/index.ts",
"lint": "eslint ./src/",
"format": "prettier --write ./src/",
"build": "tsc",
"dev": "pnpm exec tsx --watch src/index.ts"
},
"keywords": [
"ai",
"rag",
"retrieval",
"pipeline",
"llms",
"chatbot"
],
"author": "LlamaIndex",
"license": "MIT",
"packageManager": "pnpm@10.12.4",
"devDependencies": {
"@eslint/js": "^9.32.0",
"@types/figlet": "^1.7.0",
"@types/node": "^24.1.0",
"@typescript-eslint/eslint-plugin": "^8.38.0",
"@typescript-eslint/parser": "^8.38.0",
"eslint": "^9.32.0",
"globals": "^16.3.0",
"jiti": "^2.5.1",
"prettier": "^3.6.2",
"typescript": "^5.8.3",
"typescript-eslint": "^8.38.0"
},
"dependencies": {
"@ai-sdk/openai": "^1.3.23",
"ai": "^4.3.19",
"consola": "^3.4.2",
"dotenv": "^17.2.1",
"figlet": "^1.8.2",
"llama-cloud-services": "link:../../ts/llama_cloud_services",
"picocolors": "^1.1.1"
}
}
+1770
View File
File diff suppressed because it is too large Load Diff
+48
View File
@@ -0,0 +1,48 @@
import { LlamaCloudIndex } from "llama-cloud-services";
import { logger } from "./logger";
import pc from "picocolors";
import {
consoleInput,
retrievalAugmentedGeneration,
renderLogo,
} from "./utils";
import dotenv from "dotenv";
dotenv.config();
export async function main(): Promise<number> {
const index = new LlamaCloudIndex({
name: process.env.PIPELINE_NAME as string,
projectName: "Default",
apiKey: process.env.LLAMA_CLOUD_API_KEY, // can provide API-key in the constructor or in the env
});
const retriever = index.asRetriever({
similarityTopK: 5,
});
await renderLogo();
logger.log(
`Welcome to ${pc.bold(
pc.magentaBright("✨LlamaChat✨"),
)}, our demo for ${pc.bold(pc.green("Index🦙"))}, a ${pc.bold(
pc.cyan("LlamaCloud☁️"),
)} (https://cloud.llamaindex.ai) product!\nType a question below, and you will get an answer!👇\nIf you wish to exit, just type ${pc.bold(
pc.gray("quit"),
)}.\n`,
);
while (true) {
const userInput = await consoleInput();
if (userInput.trim().toLowerCase() === "quit") {
break;
}
try {
const nodes = await retriever.retrieve(userInput);
const summary = await retrievalAugmentedGeneration(nodes, userInput);
logger.log(`${pc.bold(pc.magentaBright("LlamaChat✨:"))}\n${summary}`);
} catch (error) {
logger.error(`Error processing your request: ${error}`);
}
}
return 0;
}
main().catch(console.error);
+8
View File
@@ -0,0 +1,8 @@
import { createConsola } from "consola";
import type { ConsolaInstance } from "consola";
export const logger: ConsolaInstance = createConsola({
formatOptions: {
date: false,
},
});
+56
View File
@@ -0,0 +1,56 @@
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { NodeWithScore, MetadataMode } from "llamaindex";
import * as readline from "readline/promises";
import figlet from "figlet";
import pc from "picocolors";
export async function renderLogo(): Promise<void> {
const logoText = figlet.textSync("LlamaChat", {
font: "ANSI Shadow",
horizontalLayout: "default",
verticalLayout: "default",
width: 100,
whitespaceBreak: true,
});
// Add some styling with picocolors
const styledLogo = pc.bold(pc.yellowBright(logoText));
// Add some padding/margin
console.log("\n");
console.log(styledLogo);
console.log(pc.gray("─".repeat(60)));
console.log("\n");
}
export async function consoleInput(): Promise<string> {
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
const answer = await rl.question(pc.cyanBright("You✨:"));
rl.close();
return answer;
}
export async function retrievalAugmentedGeneration(
nodes: NodeWithScore[],
prompt: string,
): Promise<string> {
let mainText: string = "";
for (const node of nodes) {
mainText += `\t{information: '${node.node.getContent(
MetadataMode.ALL,
)}', relevanceScore: '${node.score ?? -1}'}\n`;
}
const { text } = await generateText({
model: openai("gpt-4.1"),
prompt: `[\n${mainText}\n]\n\nBased on the information you are given and on its relevance score (where -1 means no score is available), answer this user prompt: '${prompt}'`,
});
return text;
}
+22
View File
@@ -0,0 +1,22 @@
{
"compilerOptions": {
"target": "ES2022",
"module": "ES2022",
"lib": ["ES2022"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"declaration": true,
"declarationMap": true,
"sourceMap": true,
"types": ["node"],
"moduleResolution": "bundler",
"allowSyntheticDefaultImports": true,
"resolveJsonModule": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist"]
}
+124
View File
@@ -0,0 +1,124 @@
# LlamaParse Demo
A TypeScript demo application showcasing the power of **LlamaParse** - an intelligent document parsing service from [LlamaCloud](https://cloud.llamaindex.ai). This demo allows you to parse various document formats and generate AI-powered summaries using OpenAI's GPT models.
## Table of Contents
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Start the Demo](#start-the-demo)
- [Development Mode](#development-mode)
- [Build the Project](#build-the-project)
- [Code Quality](#code-quality)
- [Quick Commands Reference](#quick-commands-reference)
- [How It Works](#how-it-works)
- [API Dependencies](#api-dependencies)
- [Troubleshooting](#troubleshooting)
- [Common Issues](#common-issues)
- [License](#license)
- [Contributing](#contributing)
## Features
- 📄 **Document Parsing**: Parse PDFs, Word docs, and other formats using LlamaParse
- 🤖 **AI Summaries**: Generate intelligent summaries using OpenAI's GPT-4.1 model
- 🎨 **Beautiful CLI**: Styled console interface with colors and ASCII art
- **Fast Development**: Hot reload support with watch mode
- 🛠️ **TypeScript**: Full TypeScript support with strict type checking
## Prerequisites
- Node.js (version 18 or higher)
- pnpm package manager
- OpenAI API key
- LlamaCloud API key
## Installation
1. Clone the repository:
```bash
git clone https://github.com/run-llama/llama_cloud_services
cd llama_cloud_services/examples-ts/parse/
```
2. Install dependencies:
```bash
pnpm install
```
3. Set up your environment variables:
```bash
# Add your API keys to your environment
export OPENAI_API_KEY="your-openai-api-key"
export LLAMA_CLOUD_API_KEY="your-llamacloud-api-key"
```
## Usage
### Start the Demo
```bash
pnpm run start
```
The application will display a welcome screen and prompt you to enter the path to a document you'd like to process.
### Development Mode
For development with hot reload:
```bash
pnpm run dev
```
### Build the Project
```bash
pnpm run build
```
### Code Quality
Format code:
```bash
pnpm run format
```
Lint code:
```bash
pnpm run lint
```
## How It Works
1. **Document Input**: Enter the path to your document when prompted
2. **Parsing**: LlamaParse processes the document and extracts structured content
3. **AI Summary**: The extracted content is sent to OpenAI's GPT-4.1 model for summarization
4. **Results**: View the AI-generated summary in your terminal
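The summarization step above amounts to joining the parsed documents and wrapping them in a prompt. The sketch below illustrates that idea under assumptions: `ParsedDocument` and `buildSummaryPrompt` are hypothetical names, and the `maxChars` cut-off is an addition for illustration that the demo does not apply.

```typescript
// Hypothetical minimal shape of a parsed document; the demo uses Document from llamaindex.
interface ParsedDocument {
  text: string;
}

// Join the parsed documents with a visible separator and wrap them in a summary prompt.
// maxChars is an assumed guard against oversized prompts, not something the demo does.
function buildSummaryPrompt(documents: ParsedDocument[], maxChars = 20000): string {
  const joined = documents.map((doc) => doc.text).join("\n\n---\n\n");
  const body = joined.length > maxChars ? joined.slice(0, maxChars) : joined;
  return (
    `<chat>\n\t<text>${body}</text>\n` +
    "\t<instructions>Generate a summary of the given text.</instructions>\n</chat>"
  );
}
```

A character-based cut-off is only a rough stand-in for a token limit; a tokenizer-aware limit would be more precise if prompt size becomes a problem.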
## Troubleshooting
### Common Issues
1. **Module Resolution Errors**: Ensure you're using Node.js 18+ and have all dependencies installed
2. **API Key Issues**: Verify your OpenAI and LlamaCloud API keys are correctly set
3. **File Path Errors**: Use absolute paths or ensure relative paths are correct from the project root
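For file path errors, resolving the input against a known base directory and checking that it exists before calling the parser gives a clearer error message. This is a minimal sketch using only Node.js built-ins; `resolveInputFile` is a hypothetical helper and not part of the demo.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Resolve a user-supplied path against a base directory (the current working
// directory by default) and confirm that it points to a regular file.
function resolveInputFile(input: string, baseDir: string = process.cwd()): string {
  const resolved = path.resolve(baseDir, input.trim());
  if (!fs.existsSync(resolved) || !fs.statSync(resolved).isFile()) {
    throw new Error(`File not found: ${resolved}`);
  }
  return resolved;
}
```

Passing the returned absolute path to the reader means relative paths behave the same regardless of where the demo was started from.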
## License
MIT License - see the [LICENSE](../../LICENSE) file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run `pnpm run format` and `pnpm run lint`
5. Submit a pull request
Binary file not shown.
+15
View File
@@ -0,0 +1,15 @@
import js from "@eslint/js";
import globals from "globals";
import tseslint from "typescript-eslint";
import { defineConfig } from "eslint/config";
export default defineConfig([
{
files: ["**/*.{js,mjs,cjs,ts,mts,cts}"],
plugins: { js },
extends: ["js/recommended"],
languageOptions: { globals: globals.browser },
},
{ files: ["**/*.js"], languageOptions: { sourceType: "script" } },
tseslint.configs.recommended,
]);
+47
View File
@@ -0,0 +1,47 @@
{
"name": "llamaparse-demo",
"version": "0.1.0",
"description": "Demo for LlamaParse in TypeScript",
"type": "module",
"main": "index.js",
"scripts": {
"test": "echo \"There are no tests\"",
"start": "pnpm exec tsx src/index.ts",
"lint": "eslint ./src/",
"format": "prettier --write ./src/",
"build": "tsc",
"dev": "pnpm exec tsx --watch src/index.ts"
},
"keywords": [
"ai",
"ocr",
"parsing",
"intelligent-document-processing",
"pdf",
"llms"
],
"author": "LlamaIndex",
"license": "MIT",
"packageManager": "pnpm@10.12.4",
"devDependencies": {
"@eslint/js": "^9.32.0",
"@types/figlet": "^1.7.0",
"@types/node": "^24.1.0",
"@typescript-eslint/eslint-plugin": "^8.38.0",
"@typescript-eslint/parser": "^8.38.0",
"eslint": "^9.32.0",
"globals": "^16.3.0",
"jiti": "^2.5.1",
"prettier": "^3.6.2",
"typescript": "^5.8.3",
"typescript-eslint": "^8.38.0"
},
"dependencies": {
"@ai-sdk/openai": "^1.3.23",
"ai": "^4.3.19",
"consola": "^3.4.2",
"figlet": "^1.8.2",
"llama-cloud-services": "link:../../ts/llama_cloud_services",
"picocolors": "^1.1.1"
}
}
+1758
View File
File diff suppressed because it is too large Load Diff
+34
View File
@@ -0,0 +1,34 @@
import { LlamaParseReader } from "llama-cloud-services";
import { logger } from "./logger";
import pc from "picocolors";
import { consoleInput, generateSummary, renderLogo } from "./utils";
export async function main(): Promise<number> {
const reader = new LlamaParseReader({ resultType: "markdown" });
await renderLogo();
logger.log(
`Welcome to ${pc.bold(
pc.magentaBright("✨LlamaParse Demo✨"),
)}, our demo for ${pc.bold(pc.green("LlamaParse🦙"))}, a ${pc.bold(
pc.cyan("LlamaCloud☁️"),
)} (https://cloud.llamaindex.ai) product!\nType the path to the document you would like to process below👇\nIf you wish to exit, just type ${pc.bold(
pc.gray("quit"),
)}.\n`,
);
while (true) {
const userInput = await consoleInput();
if (userInput.trim().toLowerCase() === "quit") {
break;
}
try {
const documents = await reader.loadData(userInput);
const summary = await generateSummary(documents);
logger.log(`${pc.bold(pc.cyan("AI-generated summary✨"))}:\n${summary}`);
} catch (error) {
logger.error(`Error processing file: ${error}`);
}
}
return 0;
}
main().catch(console.error);
+8
View File
@@ -0,0 +1,8 @@
import { createConsola } from "consola";
import type { ConsolaInstance } from "consola";
export const logger: ConsolaInstance = createConsola({
formatOptions: {
date: false,
},
});
+51
View File
@@ -0,0 +1,51 @@
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { Document } from "llamaindex";
import * as readline from "readline/promises";
import figlet from "figlet";
import pc from "picocolors";
export async function renderLogo(): Promise<void> {
const logoText = figlet.textSync("LlamaParse Demo", {
font: "ANSI Shadow",
horizontalLayout: "default",
verticalLayout: "default",
width: 100,
whitespaceBreak: true,
});
// Add some styling with picocolors
const styledLogo = pc.bold(pc.magentaBright(logoText));
// Add some padding/margin
console.log("\n");
console.log(styledLogo);
console.log(pc.gray("─".repeat(60)));
console.log("\n");
}
export async function consoleInput(): Promise<string> {
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
const answer = await rl.question("Path to your file: ");
rl.close();
return answer;
}
export async function generateSummary(documents: Document[]): Promise<string> {
let mainText: string = "";
for (const document of documents) {
mainText += `${document.text}\n\n---\n\n`;
}
const { text } = await generateText({
model: openai("gpt-4.1"),
prompt: `<chat>\n\t<text>${mainText}</text>\n\t<instructions>Could you please generate a summary of the given text?</instructions>\n</chat>`,
});
return text;
}
+22
View File
@@ -0,0 +1,22 @@
{
"compilerOptions": {
"target": "ES2022",
"module": "ES2022",
"lib": ["ES2022"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"declaration": true,
"declarationMap": true,
"sourceMap": true,
"types": ["node"],
"moduleResolution": "bundler",
"allowSyntheticDefaultImports": true,
"resolveJsonModule": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist"]
}
+19
View File
@@ -0,0 +1,19 @@
# LlamaCloud Services Examples - Python
> **⚠️ DEPRECATION NOTICE**
>
> This repository and its packages are deprecated and will be maintained until **May 1, 2026**.
>
> **Please migrate to the new packages:**
> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))
> - **TypeScript**: `npm install @llamaindex/llama-cloud` ([GitHub](https://github.com/run-llama/llama-cloud-ts))
>
> The new packages provide the same functionality with improved performance, better support, and active development.
In this folder you will find several Python notebooks with examples for:
- [LlamaParse](./parse/)
- [LlamaExtract](./extract/)
- [LlamaCloudIndex](./index/)
Follow the instructions in each notebook to get started!
+1
View File
@@ -0,0 +1 @@
sample_files/
@@ -0,0 +1,815 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cell-0",
"metadata": {},
"source": [
"# Batch Parse with LlamaCloud Directories\n",
"\n",
"This notebook demonstrates how to use LlamaCloud's batch processing API to parse multiple files in a directory. The workflow includes:\n",
"\n",
"1. **Creating a Directory** - Set up a directory to organize your files\n",
"2. **Uploading Files** - Upload multiple files to the directory\n",
"3. **Starting a Batch Parse Job** - Kick off batch processing on all files\n",
"4. **Monitoring Progress** - Check the status and view results\n",
"\n",
"This is useful when you need to parse many documents at once, as the batch API handles the orchestration and provides progress tracking."
]
},
{
"cell_type": "markdown",
"id": "0c2b5e1a",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"id": "cell-1",
"metadata": {},
"source": [
"## Setup and Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-2",
"metadata": {},
"outputs": [],
"source": [
"%pip install llama-cloud python-dotenv httpx requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-3",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"import httpx\n",
"\n",
"# Load environment variables\n",
"load_dotenv()\n",
"\n",
"# Set your API key\n",
"LLAMA_CLOUD_API_KEY = os.environ.get(\"LLAMA_CLOUD_API_KEY\", \"llx-...\")\n",
"\n",
"# Optional: Set base URL (defaults to https://api.cloud.llamaindex.ai if not set)\n",
"LLAMA_CLOUD_BASE_URL = os.environ.get(\n",
" \"LLAMA_CLOUD_BASE_URL\", \"https://api.cloud.llamaindex.ai\"\n",
")\n",
"\n",
"# Optional: Set project_id if you have one, otherwise it will use your default project\n",
"PROJECT_ID = os.environ.get(\"LLAMA_CLOUD_PROJECT_ID\", None)\n",
"\n",
"print(\"✅ API key configured\")\n",
"print(f\" Base URL: {LLAMA_CLOUD_BASE_URL}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-4",
"metadata": {},
"source": [
"## Setup HTTP Client\n",
"\n",
"Since the current version of the llama-cloud SDK has some issues with the beta endpoints, we'll use direct HTTP requests with httpx for reliability."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-5",
"metadata": {},
"outputs": [],
"source": [
"# Create HTTP client with authentication\n",
"headers = {\n",
" \"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\",\n",
"}\n",
"\n",
"print(\"✅ HTTP client configured\")\n",
"print(f\" Using base URL: {LLAMA_CLOUD_BASE_URL}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-6",
"metadata": {},
"source": [
"## Step 1: Create a Directory\n",
"\n",
"First, we'll create a directory to organize our files. Directories help you group related files together for batch processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-7",
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Create a directory with a timestamp in the name\n",
"timestamp = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"directory_name = f\"batch-parse-demo-{timestamp}\"\n",
"\n",
"# Create directory using HTTP request\n",
"response = httpx.post(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/directories\",\n",
" headers=headers,\n",
" params={\"project_id\": PROJECT_ID},\n",
" json={\n",
" \"name\": directory_name,\n",
" \"description\": \"Demo directory for batch parse example\",\n",
" },\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code in [200, 201]:\n",
" directory = response.json()\n",
" directory_id = directory[\"id\"]\n",
" project_id = directory[\"project_id\"]\n",
"\n",
" print(f\"✅ Created directory: {directory['name']}\")\n",
" print(f\" Directory ID: {directory_id}\")\n",
" print(f\" Project ID: {project_id}\")\n",
"else:\n",
" raise Exception(\n",
" f\"Failed to create directory: {response.status_code} - {response.text}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "cell-8",
"metadata": {},
"source": [
"## Step 2: Upload Files to the Directory\n",
"\n",
"Now we'll upload some files to our directory. For this demo, we'll download some sample PDFs and upload them.\n",
"\n",
"You can replace these with your own files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-9",
"metadata": {},
"outputs": [],
"source": [
"# Create a directory for sample files\n",
"import requests\n",
"\n",
"os.makedirs(\"sample_files\", exist_ok=True)\n",
"\n",
"# Sample documents to download\n",
"sample_docs = {\n",
" \"attention.pdf\": \"https://arxiv.org/pdf/1706.03762.pdf\",\n",
" \"bert.pdf\": \"https://arxiv.org/pdf/1810.04805.pdf\",\n",
"}\n",
"\n",
"# Download sample documents\n",
"for filename, url in sample_docs.items():\n",
" filepath = f\"sample_files/{filename}\"\n",
" if not os.path.exists(filepath):\n",
" print(f\"📥 Downloading {filename}...\")\n",
" response = requests.get(url)\n",
" if response.status_code == 200:\n",
" with open(filepath, \"wb\") as f:\n",
" f.write(response.content)\n",
" print(f\" ✅ Downloaded {filename}\")\n",
" else:\n",
" print(f\" ❌ Failed to download {filename}\")\n",
" else:\n",
" print(f\"📁 {filename} already exists\")\n",
"\n",
"print(\"\\n✅ Sample files ready!\")"
]
},
{
"cell_type": "markdown",
"id": "cell-10",
"metadata": {},
"source": [
"### Upload Files to Directory\n",
"\n",
"Now let's upload the files to our directory using the `upload_file_to_directory` endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-11",
"metadata": {},
"outputs": [],
"source": [
"uploaded_files = []\n",
"\n",
"# Workaround: Use direct HTTP requests instead of SDK due to SDK bug\n",
"import httpx\n",
"\n",
"for filename in os.listdir(\"sample_files\"):\n",
" if filename.endswith(\".pdf\"):\n",
" filepath = f\"sample_files/{filename}\"\n",
"\n",
" print(f\"📤 Uploading {filename}...\")\n",
"\n",
" # Upload file using direct HTTP request (SDK has a bug with file uploads)\n",
" with open(filepath, \"rb\") as f:\n",
" # Prepare the multipart form data correctly\n",
" files = {\"upload_file\": (filename, f, \"application/pdf\")}\n",
"\n",
" # Make the request directly\n",
" response = httpx.post(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/directories/{directory_id}/files/upload\",\n",
" params={\"project_id\": project_id},\n",
" files=files,\n",
" headers={\"Authorization\": f\"Bearer {LLAMA_CLOUD_API_KEY}\"},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code in [200, 201]:\n",
" directory_file = response.json()\n",
" uploaded_files.append(directory_file)\n",
" print(f\" ✅ Uploaded: {directory_file.get('display_name')}\")\n",
" print(f\" File ID: {directory_file.get('id')}\")\n",
" else:\n",
" print(f\" ❌ Upload failed: {response.status_code}\")\n",
" print(f\" Error: {response.text[:200]}\")\n",
"\n",
"print(f\"\\n✅ Uploaded {len(uploaded_files)} files to directory\")"
]
},
{
"cell_type": "markdown",
"id": "cell-12",
"metadata": {},
"source": [
"## Step 3: Create a Batch Parse Job\n",
"\n",
"Now that we have files in our directory, let's create a batch parse job to process them all at once.\n",
"\n",
"The batch processing API uses the same configuration as LlamaParse."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-13",
"metadata": {},
"outputs": [],
"source": [
"# Configure the parse job\n",
"# This configuration will apply to all files in the directory\n",
"job_config = {\n",
" \"job_name\": \"parse_raw_file_job\", # Must match the JobNames enum value\n",
" \"partitions\": {},\n",
" \"parameters\": {\n",
" \"type\": \"parse\",\n",
" \"lang\": \"en\",\n",
" \"fast_mode\": True,\n",
" },\n",
"}\n",
"\n",
"print(\"✅ Job configuration created\")\n",
"print(f\" Language: {job_config['parameters']['lang']}\")\n",
"print(f\" Fast mode: {job_config['parameters']['fast_mode']}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-14",
"metadata": {},
"source": [
"### Submit the Batch Job\n",
"\n",
"Now let's submit the batch job to process all files in the directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-15",
"metadata": {},
"outputs": [],
"source": [
"print(f\"🚀 Submitting batch parse job for directory: {directory_id}\")\n",
"print(f\" Processing {len(uploaded_files)} files...\\n\")\n",
"\n",
"# Submit batch job using HTTP request\n",
"response = httpx.post(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" json={\n",
" \"directory_id\": directory_id,\n",
" \"job_config\": job_config,\n",
" \"page_size\": 100, # Number of files to fetch per batch\n",
" \"continue_as_new_threshold\": 10, # Workflow continuation threshold\n",
" },\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code in [200, 201]:\n",
" batch_job = response.json()\n",
" batch_job_id = batch_job[\"id\"]\n",
"\n",
" print(\"✅ Batch job submitted successfully!\")\n",
" print(f\" Batch Job ID: {batch_job_id}\")\n",
" print(f\" Workflow ID: {batch_job.get('workflow_id')}\")\n",
" print(f\" Status: {batch_job.get('status')}\")\n",
" print(f\" Total Items: {batch_job.get('total_items')}\")\n",
"else:\n",
" raise Exception(\n",
" f\"Failed to create batch job: {response.status_code} - {response.text}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "cell-16",
"metadata": {},
"source": [
"## Step 4: Monitor Job Progress\n",
"\n",
"Now let's monitor the batch job progress. We'll poll the status endpoint to see how the job is progressing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-17",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"def print_job_status(status_data):\n",
" \"\"\"Helper function to print job status in a readable format.\"\"\"\n",
" job = status_data[\"job\"]\n",
" progress_pct = status_data[\"progress_percentage\"]\n",
"\n",
" print(f\"\\n{'='*60}\")\n",
" print(f\"Job Status: {job['status']}\")\n",
" print(f\"{'='*60}\")\n",
" print(f\"Total Items: {job['total_items']}\")\n",
" print(f\"Completed: {job['processed_items']}\")\n",
" print(f\"Failed: {job['failed_items']}\")\n",
" print(f\"Skipped: {job['skipped_items']}\")\n",
" print(f\"Progress: {progress_pct:.1f}%\")\n",
"\n",
" if job.get(\"completed_at\"):\n",
" print(f\"Completed At: {job['completed_at']}\")\n",
" elif job.get(\"started_at\"):\n",
" print(f\"Started At: {job['started_at']}\")\n",
"\n",
" print(f\"{'='*60}\")\n",
"\n",
"\n",
"# Poll for status updates\n",
"print(\"🔄 Monitoring batch job progress...\")\n",
"print(\n",
" \"Note: It may take a few seconds for the workflow to initialize and count files.\\n\"\n",
")\n",
"\n",
"max_polls = 60 # Maximum number of status checks (increased for longer jobs)\n",
"poll_interval = 10 # Seconds between checks\n",
"\n",
"for i in range(max_polls):\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/{batch_job_id}\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" status_data = response.json()\n",
" print_job_status(status_data)\n",
"\n",
" # Check if job is complete\n",
" job_status = status_data[\"job\"][\"status\"]\n",
" if job_status in [\"completed\", \"failed\", \"cancelled\"]:\n",
" print(f\"\\n✅ Job finished with status: {job_status}\")\n",
" break\n",
"\n",
" if i < max_polls - 1:\n",
" print(f\"\\n⏳ Waiting {poll_interval} seconds before next check...\")\n",
" time.sleep(poll_interval)\n",
" else:\n",
" print(f\"Error getting status: {response.status_code} - {response.text}\")\n",
" break\n",
"else:\n",
" print(f\"\\n⚠️ Reached maximum polling attempts. Job may still be running.\")"
]
},
{
"cell_type": "markdown",
"id": "cell-18",
"metadata": {},
"source": [
"## Step 5: View Job Items\n",
"\n",
"Let's look at the individual items in the batch job to see which files were processed successfully."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-19",
"metadata": {},
"outputs": [],
"source": [
"# Get all items in the batch job\n",
"response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/{batch_job_id}/items\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id, \"limit\": 100},\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code == 200:\n",
" items_response = response.json()\n",
"\n",
" print(f\"\\n📋 Batch Job Items ({items_response['total_size']} total)\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" for item in items_response[\"items\"]:\n",
" status_emoji = (\n",
" \"✅\"\n",
" if item[\"status\"] == \"completed\"\n",
" else \"❌\"\n",
" if item[\"status\"] == \"failed\"\n",
" else \"⏳\"\n",
" )\n",
" print(f\"{status_emoji} {item['item_name']}\")\n",
" print(f\" Status: {item['status']}\")\n",
" print(f\" Item ID: {item['item_id']}\")\n",
"\n",
" if item.get(\"error_message\"):\n",
" print(f\" Error: {item['error_message']}\")\n",
"\n",
" print()\n",
"else:\n",
" print(f\"Error listing items: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-20",
"metadata": {},
"source": [
"## Step 6: Retrieve Processing Results\n",
"\n",
"For each completed file, we can retrieve the processing results to see where the parsed output is stored."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-21",
"metadata": {},
"outputs": [],
"source": [
"# Get processing results for a specific item\n",
"if items_response[\"items\"]:\n",
" first_item = items_response[\"items\"][0]\n",
"\n",
" print(f\"\\n🔍 Processing results for: {first_item['item_name']}\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing/items/{first_item['item_id']}/processing-results\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" results = response.json()\n",
"\n",
" print(f\"Item: {results['item_name']}\")\n",
" print(f\"Total processing runs: {len(results['processing_results'])}\\n\")\n",
"\n",
" for i, result in enumerate(results[\"processing_results\"], 1):\n",
" print(f\"Run {i}:\")\n",
" print(f\" Job Type: {result['job_type']}\")\n",
" print(f\" Processed At: {result['processed_at']}\")\n",
" print(f\" Parameters Hash: {result['parameters_hash']}\")\n",
"\n",
" if result.get(\"output_s3_path\"):\n",
" print(f\" Output S3 Path: {result['output_s3_path']}\")\n",
"\n",
" if result.get(\"output_metadata\"):\n",
" print(f\" Output Metadata: {result['output_metadata']}\")\n",
"\n",
" print()\n",
" else:\n",
" print(f\"Error getting results: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-22",
"metadata": {},
"source": [
"## Optional: List All Batch Jobs\n",
"\n",
"You can also list all batch jobs in your project to see the history of batch processing operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-23",
"metadata": {},
"outputs": [],
"source": [
"# List all parse jobs in the project\n",
"response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/beta/batch-processing\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id, \"job_type\": \"parse\", \"limit\": 10},\n",
" timeout=60.0,\n",
")\n",
"\n",
"if response.status_code == 200:\n",
" jobs_response = response.json()\n",
"\n",
" print(f\"\\n📊 Recent Batch Parse Jobs ({jobs_response['total_size']} total)\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" for job in jobs_response[\"items\"]:\n",
" status_emoji = (\n",
" \"✅\"\n",
" if job[\"status\"] == \"completed\"\n",
" else \"❌\"\n",
" if job[\"status\"] == \"failed\"\n",
" else \"⏳\"\n",
" )\n",
" print(f\"{status_emoji} Job ID: {job['id']}\")\n",
" print(f\" Status: {job['status']}\")\n",
" print(f\" Directory: {job['directory_id']}\")\n",
" print(f\" Total Items: {job['total_items']}\")\n",
" print(f\" Completed: {job['processed_items']}\")\n",
" print(f\" Created: {job['created_at']}\")\n",
" print()\n",
"else:\n",
" print(f\"Error listing jobs: {response.status_code} - {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "uug7591rkq",
"metadata": {},
"source": [
"## Step 7: Retrieve Parsed Text Results\n",
"\n",
"Once the batch job is complete, each BatchJobItem has a `job_id` field that maps to a parse job ID. We can pass this ID to the standard parse result endpoints to fetch the actual parsed text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vpp0vxtc0y",
"metadata": {},
"outputs": [],
"source": [
"# Get all completed items and their job IDs\n",
"completed_items = [\n",
" item for item in items_response[\"items\"] if item[\"status\"] == \"completed\"\n",
"]\n",
"\n",
"print(f\"📄 Found {len(completed_items)} completed items\\n\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"# Display the job_id for each completed item\n",
"for item in completed_items:\n",
" print(f\"📝 {item['item_name']}\")\n",
" print(f\" Item ID: {item['item_id']}\")\n",
" print(f\" Parse Job ID: {item['job_id']}\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"id": "4gck6hwpnl6",
"metadata": {},
"source": [
"### Fetch Parsed Text for a Specific Document\n",
"\n",
"Now let's use the `job_id` to retrieve the actual parsed text content from the parse result endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "g191kvgxxvk",
"metadata": {},
"outputs": [],
"source": [
"# Get the parsed text for the first completed item\n",
"if completed_items:\n",
" first_completed = completed_items[0]\n",
"\n",
" print(f\"📖 Retrieving parsed text for: {first_completed['item_name']}\")\n",
" print(f\" Using Parse Job ID: {first_completed['job_id']}\\n\")\n",
" print(f\"{'='*80}\\n\")\n",
"\n",
" # Use the job_id to fetch the parse result\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/text\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" parse_result = response.text\n",
"\n",
" print(f\"✅ Retrieved parsed text ({len(parse_result)} characters)\\n\")\n",
"\n",
" # Display first 1000 characters as a preview\n",
" print(\"Preview (first 1000 characters):\")\n",
" print(\"-\" * 80)\n",
" print(parse_result[:1000])\n",
" print(\"-\" * 80)\n",
"\n",
" if len(parse_result) > 1000:\n",
" print(f\"\\n... and {len(parse_result) - 1000} more characters\")\n",
" else:\n",
" print(\n",
" f\"Error retrieving parse result: {response.status_code} - {response.text}\"\n",
" )\n",
"else:\n",
" print(\"⚠️ No completed items found to retrieve results from\")"
]
},
{
"cell_type": "markdown",
"id": "2olccb4l8fj",
"metadata": {},
"source": [
"### Retrieve Parsed Results in Other Formats\n",
"\n",
"You can also retrieve the parsed results in JSON or Markdown format by requesting the corresponding result endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lcqsfxiw0sr",
"metadata": {},
"outputs": [],
"source": [
"if completed_items:\n",
" first_completed = completed_items[0]\n",
"\n",
" print(\n",
" f\"📋 Retrieving parse results in different formats for: {first_completed['item_name']}\\n\"\n",
" )\n",
"\n",
" # Get as JSON (includes structured data with pages, images, etc.)\n",
" print(\"1️⃣ Retrieving as JSON...\")\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/json\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" json_result = response.json()\n",
" print(f\" ✅ JSON result with {len(json_result['pages'])} pages\")\n",
" print(f\" Keys: {list(json_result.keys())}\\n\")\n",
" else:\n",
" print(f\" Error: {response.status_code}\\n\")\n",
"\n",
" # Get as Markdown\n",
" print(\"2️⃣ Retrieving as Markdown...\")\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{first_completed['job_id']}/result/markdown\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" markdown_result = response.text\n",
" print(f\" ✅ Markdown result ({len(markdown_result)} characters)\\n\")\n",
"\n",
" # Display markdown preview\n",
" print(\"Markdown Preview (first 500 characters):\")\n",
" print(\"-\" * 80)\n",
" print(markdown_result[:500])\n",
" print(\"-\" * 80)\n",
"\n",
" if len(markdown_result) > 500:\n",
" print(f\"\\n... and {len(markdown_result) - 500} more characters\")\n",
" else:\n",
" print(f\" Error: {response.status_code}\")\n",
"else:\n",
" print(\"⚠️ No completed items found to retrieve results from\")"
]
},
{
"cell_type": "markdown",
"id": "lr61wqkfq3",
"metadata": {},
"source": [
"### Batch Process All Parsed Results\n",
"\n",
"You can also loop through all completed items to retrieve and process all the parsed results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "kltydf9xzkl",
"metadata": {},
"outputs": [],
"source": [
"# Process all completed items\n",
"print(f\"🔄 Processing all {len(completed_items)} completed items...\\n\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"all_results = {}\n",
"\n",
"for item in completed_items:\n",
" print(f\"📄 Processing: {item['item_name']}\")\n",
" print(f\" Parse Job ID: {item['job_id']}\")\n",
"\n",
" try:\n",
" # Retrieve the parsed text for this item\n",
" response = httpx.get(\n",
" f\"{LLAMA_CLOUD_BASE_URL}/api/v1/parsing/job/{item['job_id']}/result/text\",\n",
" headers=headers,\n",
" params={\"project_id\": project_id},\n",
" timeout=60.0,\n",
" )\n",
"\n",
" if response.status_code == 200:\n",
" parsed_text = response.text\n",
"\n",
" all_results[item[\"item_name\"]] = {\n",
" \"job_id\": item[\"job_id\"],\n",
" \"text\": parsed_text,\n",
" \"length\": len(parsed_text),\n",
" }\n",
"\n",
" print(f\" ✅ Retrieved {len(parsed_text)} characters\")\n",
" else:\n",
" all_results[item[\"item_name\"]] = {\n",
" \"job_id\": item[\"job_id\"],\n",
" \"error\": f\"HTTP {response.status_code}\",\n",
" }\n",
" print(f\" ❌ Error: HTTP {response.status_code}\")\n",
"\n",
" except Exception as e:\n",
" print(f\" ❌ Error: {str(e)}\")\n",
" all_results[item[\"item_name\"]] = {\"job_id\": item[\"job_id\"], \"error\": str(e)}\n",
"\n",
" print()\n",
"\n",
"print(f\"{'='*80}\")\n",
"print(f\"\\n✅ Processed {len(all_results)} items\")\n",
"print(\"\\nSummary:\")\n",
"for name, result in all_results.items():\n",
" if \"error\" in result:\n",
" print(f\" ❌ {name}: Error - {result['error']}\")\n",
" else:\n",
" print(f\" ✅ {name}: {result['length']:,} characters\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
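The retrieval cells in the notebook above all request the same endpoint pattern and vary only the final path segment. A minimal sketch of a helper that builds those URLs, assuming only the `text`, `json`, and `markdown` result formats that appear in the notebook; `build_result_url` and `RESULT_FORMATS` are hypothetical names for illustration and are not part of any LlamaCloud SDK:

```python
# Result formats used in the notebook cells above (assumption: no others are needed here).
RESULT_FORMATS = ("text", "json", "markdown")


def build_result_url(base_url: str, job_id: str, result_format: str = "text") -> str:
    """Build the parse-result URL for a job (hypothetical helper, not an SDK call)."""
    if result_format not in RESULT_FORMATS:
        raise ValueError(f"Unsupported result format: {result_format!r}")
    # Strip a trailing slash so the joined URL never contains a double slash.
    return f"{base_url.rstrip('/')}/api/v1/parsing/job/{job_id}/result/{result_format}"
```

The returned URL could then be passed to `httpx.get` with the same `headers` and `params` used in the cells above.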
@@ -7,7 +7,7 @@
"source": [
"# Extraction and Analysis over a Fidelity Multi-Fund Annual Report\n",
"\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services-demo/blob/main/examples/extract/asset_manager_fund_analysis.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/extract/asset_manager_fund_analysis.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"In this notebook we show you how to create an agentic document workflow over a complex document that contains annual reports for multiple funds - each fund reports financials in a standardized reporting structure, and it's all consolidated in the same document.\n",
"\n",
@@ -16,6 +16,14 @@
"![](asset_manager_fund_analysis.png)\n"
]
},
{
"cell_type": "markdown",
"id": "cbafd7ee",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"id": "cda2e5e9-fe9d-42d9-9387-f529d970ff7b",
@@ -7,7 +7,7 @@
"source": [
"# Automotive Equity Research: A Multi-Step Agentic Workflow\n",
"\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services-demo/blob/main/examples/extract/automotive_sector_analysis.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/extract/automotive_sector_analysis.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"This notebook demonstrates an end-to-end agentic workflow using LlamaExtract and the LlamaIndex event-driven workflow framework for automotive sector analysis.\n",
"\n",
@@ -20,6 +20,14 @@
"This workflow is designed for equity research analysts and investment professionals."
]
},
{
"cell_type": "markdown",
"id": "e7979faf",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
Binary file not shown. (Size: 287 KiB)
Binary file not shown. (Size: 769 KiB)
Binary file not shown. (Size: 942 KiB)
Binary file not shown. (Size: 1.5 MiB)

@@ -19,6 +19,13 @@
"The example we go through below is also replicable within Llama Cloud as well, where you will also be able to pick between a number of pre-defined schemas, instead of building your own."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -15,6 +15,13 @@
"Dow Jones Industrial Average (DJIA) is a stock market index that consists of 30 large companies listed on the New York Stock Exchange and the NASDAQ and is considered a good proxy for the overall US stock market. For this exercise, we will extract the insider transactions for all the companies in the DJIA. Let's first get the list of tickers in the Dow Jones Industrial Average using Wikipedia."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -16,6 +16,14 @@
"This approach reduces manual data entry, improves extraction accuracy and standardization, and provides traceability for each technical detail."
]
},
{
"cell_type": "markdown",
"id": "8d1efe6e",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"id": "a3b8c8d5-ff3e-48ce-b0b8-29b6b1f517f8",
@@ -11,6 +11,13 @@
"Take a look at one of the resumes in the `data/resumes` directory. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -20,6 +20,14 @@
"> **Note:** This principle of what fields generalize across your target documents and what might be optional is an important one to keep in mind when designing your schema. \n"
]
},
{
"cell_type": "markdown",
"id": "355adfd4",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -21,6 +21,14 @@
"The following notebook uses the event-driven syntax (with custom events, steps, and a workflow class) adapted from the technical datasheet and contract review examples."
]
},
{
"cell_type": "markdown",
"id": "ab7be988",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "markdown",
"id": "36d8e34e-ed98-46ac-b744-1642f6e253d5",
@@ -0,0 +1,516 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a7oq3cfnync",
"metadata": {},
"source": [
"# Extracting Repeating Entities from Documents\n",
"\n",
"This notebook demonstrates how to use the `PER_TABLE_ROW` extraction target to extract structured data from documents containing repeating entities like tables, lists, or catalogs.\n",
"\n",
"## Why Use the Tabular Extraction Target?\n",
"\n",
"`PER_DOC` (refer to the table below for a quick overview of the different extraction targets) is the default extraction target in LlamaExtract, which looks at the entire document's context when doing an extraction. When extracting lists of entities, LLM-based extraction has a critical failure mode — it often **only extracts the first few tens of entries** from a long list. This happens because LLMs have limited attention spans for repetitive data. Document-level extraction doesn't guarantee exhaustive coverage, and long lists lead to incomplete extractions.\n",
"\n",
"**The Solution**: `PER_TABLE_ROW` solves this by processing each entity individually or in smaller batches, ensuring **exhaustive extraction** of all entries regardless of list length.\n",
"\n",
"### Entity-Level Extraction\n",
"\n",
"When using `extraction_target=ExtractTarget.PER_TABLE_ROW`, you define a schema for a **single entity** (e.g., one hospital, one product, one invoice line item), not the full document. LlamaExtract automatically:\n",
"- Detects the formatting patterns that distinguish individual entities (table rows, list items, section headers, etc.)\n",
"- Applies your schema to each identified entity\n",
"- Returns a `list[YourSchema]` with one object per entity\n",
"\n",
"This approach is ideal when each entity locally contains all the information needed for your schema.\n",
"\n",
"### Choosing the Right Extraction Target\n",
"\n",
"| Extraction Target | Best For | Returns |\n",
"|-------------------|----------|---------|\n",
"| `PER_DOC` | Single-entity documents, summaries, or short lists | One JSON object for entire document |\n",
"| `PER_PAGE` | Multi-page documents where each page is independent | One JSON object per page |\n",
"| `PER_TABLE_ROW` | **Long lists, tables, catalogs with repeating entities** | List of JSON objects (one per entity) |\n",
"\n",
"📖 For more details, see the [Extraction Target documentation](https://developers.llamaindex.ai/python/cloud/llamaextract/features/concepts/#extraction-target)."
]
},
{
"cell_type": "markdown",
"id": "cb760594",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9427d1de",
"metadata": {},
"outputs": [],
"source": [
"from dotenv import load_dotenv\n",
"from llama_cloud_services import LlamaExtract\n",
"\n",
"\n",
"# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)\n",
"load_dotenv(override=True)\n",
"\n",
"# Optionally, add your project id/organization id\n",
"llama_extract = LlamaExtract()"
]
},
{
"cell_type": "markdown",
"id": "4426b360",
"metadata": {},
"source": [
"## Table of Hospitals by County and Insurance Plans\n",
"\n",
"We have a PDF document with a list of hospitals by county and different insurance plans offered by Blue Shield of California. \n",
"\n",
"\n",
"![First few entries from the PDF](./data/tables/bsc_page1.png)"
]
},
{
"cell_type": "markdown",
"id": "c86sjymhn1r",
"metadata": {},
"source": [
"We want to extract each hospital from this table along with a list of applicable insurance plans. \n",
"\n",
"### Example 1: Structured Table\n",
"\n",
"This is an ideal use case for `PER_TABLE_ROW` extraction:\n",
"- **Clear structure**: The document has explicit table formatting with rows and columns\n",
"- **Repeating entities**: Each row represents one hospital with consistent attributes\n",
"- **Local information**: All data for each hospital (county, name, plans) is contained within its row\n",
"\n",
"Notice that our `Hospital` schema describes a **single hospital**, not the full document. LlamaExtract will return a `list[Hospital]` with one entry per table row."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c61a802",
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class Hospital(BaseModel):\n",
" \"\"\"List of hospitals by county available for different BSC plans\"\"\"\n",
"\n",
" county: str = Field(description=\"County name\")\n",
" hospital_name: str = Field(description=\"Name of the hospital\")\n",
" plan_names: list[str] = Field(\n",
" description=\"List of plans available at the hospital. One of: Trio HMO, SaveNet, Access+ HMO, BlueHPN PPO, Tandem PPO, PPO\"\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8a69b7a",
"metadata": {},
"outputs": [],
"source": [
"from llama_cloud_services.extract import ExtractConfig, ExtractMode, ExtractTarget\n",
"\n",
"\n",
"result = await llama_extract.aextract(\n",
" data_schema=Hospital,\n",
" files=\"./data/tables/BSC-Hospital-List-by-County.pdf\",\n",
" config=ExtractConfig(\n",
" extraction_mode=ExtractMode.PREMIUM,\n",
" extraction_target=ExtractTarget.PER_TABLE_ROW,\n",
" parse_model=\"anthropic-sonnet-4.5\",\n",
" ),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "43722cda",
"metadata": {},
"source": [
"### Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95b5aca6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"380"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(result.data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e355770",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'county': 'Alameda',\n",
" 'hospital_name': 'Alameda Hospital',\n",
" 'plan_names': ['Trio HMO',\n",
" 'SaveNet',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Alta Bates Med Ctr Herrick Campus',\n",
" 'plan_names': ['Trio HMO',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Alta Bates Summit Med Ctr Alta Bates Campus',\n",
" 'plan_names': ['Trio HMO',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Alta Bates Summit Med Ctr Summit Campus',\n",
" 'plan_names': ['Trio HMO',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Alta Bates Summit Medical Center',\n",
" 'plan_names': ['Trio HMO',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'BHC Fremont Hospital',\n",
" 'plan_names': ['Trio HMO',\n",
" 'SaveNet',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Centre For Neuro Skills San Francisco',\n",
" 'plan_names': ['Trio HMO',\n",
" 'SaveNet',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Eden Medical Center',\n",
" 'plan_names': ['Trio HMO', 'Access+ HMO', 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Fairmont Hospital',\n",
" 'plan_names': ['Trio HMO',\n",
" 'SaveNet',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']},\n",
" {'county': 'Alameda',\n",
" 'hospital_name': 'Highland Hospital',\n",
" 'plan_names': ['Trio HMO',\n",
" 'SaveNet',\n",
" 'Access+ HMO',\n",
" 'BlueHPN PPO',\n",
" 'Tandem PPO',\n",
" 'PPO']}]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result.data[:10]"
]
},
{
"cell_type": "markdown",
"id": "e28f0de8",
"metadata": {},
"source": [
"![](./data/tables/bsc_results.png)"
]
},
{
"cell_type": "markdown",
"id": "di156pb7s6j",
"metadata": {},
"source": [
"**Success!** We extracted all **380 hospitals** from the multi-page PDF. Each entity was correctly parsed with its county, hospital name, and applicable insurance plans. With `PER_DOC`, we would likely have only gotten the first 20-30 entries."
]
},
{
"cell_type": "markdown",
"id": "gelvl6db268",
"metadata": {},
"source": [
"## Extracting from a Toy Catalog\n",
"\n",
"### Example 2: Semi-Structured List\n",
"\n",
"The `PER_TABLE_ROW` extraction target also works well for documents that aren't explicit tables but have similar properties:\n",
"- **Ordered listing**: The toys are listed sequentially with visual separation (section headers, spacing)\n",
"- **Repeating pattern**: Each toy entry has a consistent structure (code, name, specs, description)\n",
"- **Local information**: All attributes for each toy are grouped together in its entry\n",
"\n",
"Even though this isn't a traditional table format, each toy entity locally contains all the information needed for our schema. LlamaExtract detects the formatting patterns that distinguish each toy and extracts them as separate entities.\n",
"\n",
"![](./data/tables/toy_catalog_page.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cf0b2db",
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class ToyCatalog(BaseModel):\n",
" \"\"\"Product information from a toy catalog.\"\"\"\n",
"\n",
" section_name: str = Field(\n",
" description=\"The name of the toy section (e.g. Table Toys, Active Toys).\"\n",
" )\n",
" product_code: str = Field(\n",
" description=\"The unique product code for the toy (e.g., GA457).\"\n",
" )\n",
" toy_name: str = Field(description=\"The name of the toy.\")\n",
" age_range: str = Field(\n",
" description=\"The recommended age range for the toy (e.g., 6 +, 4 +).\",\n",
" )\n",
" player_range: str = Field(\n",
" description=\"The number of players the toy is designed for (e.g., 2, 2-4, 1-6).\",\n",
" )\n",
" material: str = Field(\n",
" description=\"The primary material(s) the toy is made of (e.g., wood, cardboard).\",\n",
" )\n",
" description: str = Field(\n",
" description=\"A brief description of the toy and its components and dimensions.\",\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "mysu1i2qo9e",
"metadata": {},
"source": [
"### Results\n",
"\n",
"Again, our schema represents a **single toy product**, not the entire catalog. The system will return a `list[ToyCatalog]` with one entry per toy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b38b806",
"metadata": {},
"outputs": [],
"source": [
"result = await llama_extract.aextract(\n",
" data_schema=ToyCatalog,\n",
" files=\"./data/tables/Click-BS-Toys-Catalogue-2024.pdf\",\n",
" config=ExtractConfig(\n",
" extraction_mode=ExtractMode.PREMIUM,\n",
" extraction_target=ExtractTarget.PER_TABLE_ROW,\n",
" parse_model=\"anthropic-sonnet-4.5\",\n",
" ),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91aface0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"153"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(result.data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51278736",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'section_name': 'Table Toys',\n",
" 'product_code': 'GA457',\n",
" 'toy_name': 'Dots and Boxes',\n",
" 'age_range': '6+',\n",
" 'player_range': '2',\n",
" 'material': 'wood',\n",
" 'description': 'base 17x17 cm\\n50 border pieces 4x1,2x0,3 cm\\n34 trees 2,6x1,4 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA456',\n",
" 'toy_name': '3 In a Row',\n",
" 'age_range': '8+',\n",
" 'player_range': '2',\n",
" 'material': 'wood, pine, cardboard',\n",
" 'description': 'base 24x22,5x2,5 cm\\n30 cards 5,5x5 cm\\n6 chips'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA467',\n",
" 'toy_name': 'Which Cow am i?',\n",
" 'age_range': '6+',\n",
" 'player_range': '2',\n",
" 'material': 'wood, beech',\n",
" 'description': '2 cow bases 56x4x4,5 cm\\n16 cards 4x5 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA460',\n",
" 'toy_name': 'Balance Bunnies',\n",
" 'age_range': '4+',\n",
" 'player_range': '2',\n",
" 'material': 'wood',\n",
" 'description': '1 base 35x12x25 cm\\n7 bunnies 7 foxes\\n1 dice 3 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA462',\n",
" 'toy_name': 'Color Combination Race',\n",
" 'age_range': '4+',\n",
" 'player_range': '2-4',\n",
" 'material': 'wood, cardboard',\n",
" 'description': 'base 6,5x6,5x15 cm, rings 5,5x5,5x0,5 mm\\ncardholder 6x6x2 cm, cards 5,5x5,5 cm\\ncolor cards Ø 15,5 cm - Ø 7 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA465',\n",
" 'toy_name': 'Plop It',\n",
" 'age_range': '6+',\n",
" 'player_range': '2-4',\n",
" 'material': 'wood, elastic, cardboard',\n",
" 'description': 'Catch the right balls and plop them in the net!\\n* 2 ploppers 8x5 cm\\n* 2 net holders Ø 5cm, length 55 cm\\n* 6 cards 1,5x2,5 cm, 30 balls Ø 2,5 cm\\n* 1 rope 120 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA466',\n",
" 'toy_name': 'Whack a Shape',\n",
" 'age_range': '4+',\n",
" 'player_range': '2-4',\n",
" 'material': 'wood',\n",
" 'description': '* base 38,5x15,5 cm\\n* 2 stands 36 half balls, 4 hammers\\n* 1 dice 2,5 cm\\n* 4 cards'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA458',\n",
" 'toy_name': 'Sling Puck | Table Hockey',\n",
" 'age_range': '6+',\n",
" 'player_range': '2',\n",
" 'material': 'wood',\n",
" 'description': '* double sides base 39x21x3 cm\\n* 10 chips Ø 2,5 cm\\n* 2 pushers 4x4x3 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA039',\n",
" 'toy_name': 'DIY Birdhouse',\n",
" 'age_range': '3+',\n",
" 'player_range': '1',\n",
" 'material': 'wood',\n",
" 'description': '* house 9x9x13 cm'},\n",
" {'section_name': 'Table Toys',\n",
" 'product_code': 'GA319',\n",
" 'toy_name': 'Triangle Domino',\n",
" 'age_range': '6+',\n",
" 'player_range': '2-4',\n",
" 'material': 'wood',\n",
" 'description': '* 35 triangles 10x10 x10 cm'}]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result.data[:10]"
]
},
{
"cell_type": "markdown",
"id": "d1810c0a",
"metadata": {},
"source": [
"![](./data/tables/toy_catalog_results.png)"
]
},
{
"cell_type": "markdown",
"id": "ezur9gnhmsb",
"metadata": {},
"source": [
"**Success!** Despite the semi-structured format, we extracted all **152 toy products** from the catalog. The 153rd result is a duplicate: one toy is extracted a second time from the Appendix section. LlamaExtract automatically detected the visual patterns separating each toy entry and applied our schema to each one."
]
},
{
"cell_type": "markdown",
"id": "aeyr3io29u",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"The `PER_TABLE_ROW` extraction target is powerful for extracting repeating structured entities from documents. Key takeaways:\n",
"\n",
"1. **Schema design**: Define your schema for a single entity, not the full document. The system returns `list[YourSchema]`.\n",
"\n",
"2. **Works with various formats**: Not just traditional tables—any document with distinguishable repeating entities (bullets, numbering, headers, visual separation, etc.). The common requirement is that each entity should contain all the necessary data for your schema within its local context.\n",
"\n",
"3. **Automatic pattern detection**: LlamaExtract identifies the formatting patterns that distinguish entities and applies your schema to each one."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
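The toy catalog run above returns 153 results for 152 products because one toy is extracted a second time from the Appendix. A minimal sketch of how such repeats could be dropped after extraction, assuming the results are a list of dicts shaped like the output shown above and that `product_code` identifies a toy uniquely; `dedupe_by_key` is a hypothetical helper, not part of LlamaExtract:

```python
def dedupe_by_key(records: list[dict], key: str = "product_code") -> list[dict]:
    """Keep the first record seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is None:
            # Records without the key cannot be compared, so keep them as-is.
            unique.append(record)
            continue
        if value in seen:
            continue  # a record with this key was already kept
        seen.add(value)
        unique.append(record)
    return unique
```

Keeping the first occurrence is a design choice that favors the main catalog entry over the later Appendix repeat; it is worth checking that the dropped record does not carry details the kept one lacks.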
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -7,7 +7,7 @@
"source": [
"# Dynamic Section Retrieval with LlamaParse\n",
"\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services-demo/blob/main/examples/parse/advanced_rag/dynamic_section_retrieval.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/advanced_rag/dynamic_section_retrieval.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"This notebook showcases a concept called \"dynamic section retrieval\".\n",
"\n",
@@ -19,7 +19,20 @@
"\n",
"![](dynamic_section_retrieval_img.png)\n",
"\n",
"This helps provide a solution to the common chunking problem of retrieving chunks that are only subsets of the entire section you're meant to retrieve."
"This helps provide a solution to the common chunking problem of retrieving chunks that are only subsets of the entire section you're meant to retrieve.\n",
"\n",
"Status:\n",
"| Last Executed | Version | State |\n",
"|---------------|---------|------------|\n",
"| Aug-19-2025 | 0.6.61 | Maintained |"
]
},
{
"cell_type": "markdown",
"id": "e2b422f5",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**>> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.>> **Please migrate to:**> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/>> The new package provides the same functionality with improved performance and support."
]
},
{
@@ -32,18 +45,6 @@
"Install core packages and download relevant files. Here we load some popular ICLR 2024 papers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71bd0714-324f-48b3-8a93-72c6c3a10b53",
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -51,8 +52,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install llama-index\n",
"!pip install llama-index-core\n",
"!pip install \"llama-index>=0.13.0<0.14.0\" \"llama-index-vector-stores-chroma>=0.5.1<0.6.0\"\n",
"!pip install llama-cloud-services"
]
},
@@ -101,48 +101,7 @@
"execution_count": null,
"id": "80137d15-f22b-47eb-adce-ac295ced7e71",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mkdir: iclr_docs: File exists\n",
"--2024-11-10 16:18:56-- https://openreview.net/pdf?id=VTF8yNQM66\n",
"Resolving openreview.net (openreview.net)... 35.184.86.251\n",
"Connecting to openreview.net (openreview.net)|35.184.86.251|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2680380 (2.6M) [application/pdf]\n",
"Saving to: iclr_docs/swebench.pdf\n",
"\n",
"iclr_docs/swebench. 100%[===================>] 2.56M 7.22MB/s in 0.4s \n",
"\n",
"2024-11-10 16:18:57 (7.22 MB/s) - iclr_docs/swebench.pdf saved [2680380/2680380]\n",
"\n",
"--2024-11-10 16:18:57-- https://openreview.net/pdf?id=hSyW5go0v8\n",
"Resolving openreview.net (openreview.net)... 35.184.86.251\n",
"Connecting to openreview.net (openreview.net)|35.184.86.251|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 1244749 (1.2M) [application/pdf]\n",
"Saving to: iclr_docs/selfrag.pdf\n",
"\n",
"iclr_docs/selfrag.p 100%[===================>] 1.19M 4.21MB/s in 0.3s \n",
"\n",
"2024-11-10 16:18:58 (4.21 MB/s) - iclr_docs/selfrag.pdf saved [1244749/1244749]\n",
"\n",
"--2024-11-10 16:18:58-- https://openreview.net/pdf?id=c5pwL0Soay\n",
"Resolving openreview.net (openreview.net)... 35.184.86.251\n",
"Connecting to openreview.net (openreview.net)|35.184.86.251|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 4775879 (4.6M) [application/pdf]\n",
"Saving to: iclr_docs/metra.pdf\n",
"\n",
"iclr_docs/metra.pdf 100%[===================>] 4.55M 4.06MB/s in 1.1s \n",
"\n",
"2024-11-10 16:19:00 (4.06 MB/s) - iclr_docs/metra.pdf saved [4775879/4775879]\n",
"\n"
]
}
],
"outputs": [],
"source": [
"!mkdir \"{data_dir}\"\n",
"for url, paper in zip(urls, papers):\n",
@@ -168,8 +127,8 @@
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"\n",
"embed_model = OpenAIEmbedding(model=\"text-embedding-3-large\")\n",
"llm = OpenAI(model=\"gpt-4o\")\n",
"embed_model = OpenAIEmbedding(model=\"text-embedding-3-large\", api_key=\"sk-...\")\n",
"llm = OpenAI(model=\"gpt-5-mini\", api_key=\"sk-...\")\n",
"\n",
"Settings.embed_model = embed_model\n",
"Settings.llm = llm"
@@ -192,7 +151,15 @@
"source": [
"from llama_cloud_services import LlamaParse\n",
"\n",
"parser = LlamaParse(result_type=\"markdown\")"
"parser = LlamaParse(\n",
" parse_mode=\"parse_page_with_agent\",\n",
" model=\"openai-gpt-4-1-mini\",\n",
" high_res_ocr=True,\n",
" adaptive_long_table=True,\n",
" outlined_table_extraction=True,\n",
" output_tables_as_HTML=True,\n",
" api_key=\"llx-...\",\n",
")"
]
},
{
@@ -201,30 +168,56 @@
"id": "f9d6f0e8-323e-4786-a4a8-e393441ecd61",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Getting job results: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Started parsing the file under job_id 827f328d-b72e-4b70-8b4b-47dbba859d69\n",
"Started parsing the file under job_id d3104cd5-731e-4def-bdbc-889e8731989c\n",
"Started parsing the file under job_id 6046274e-e522-46af-9185-3c036e9c3ad6\n"
"Started parsing the file under job_id d8f0df2d-5b55-4e4f-bbe9-81cf4b8a4782\n",
"Started parsing the file under job_id 6aef247f-f548-43f5-9ddb-cf8ba8373130\n",
"Started parsing the file under job_id 5c1c4baf-fa43-4ed4-b671-16c45f99461c\n",
"..."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Getting job results: 67%|██████▋ | 2/3 [01:40<00:46, 46.97s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"....."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Getting job results: 100%|██████████| 3/3 [05:49<00:00, 116.59s/it]\n"
]
}
],
"source": [
"from pathlib import Path\n",
"\n",
"paper_dicts = {}\n",
"\n",
"paths_to_parse = []\n",
"for paper_path in papers:\n",
" paper_base = Path(paper_path).stem\n",
" full_paper_path = str(Path(data_dir) / paper_path)\n",
" md_json_objs = parser.get_json_result(full_paper_path)\n",
" json_dicts = md_json_objs[0][\"pages\"]\n",
" paper_dicts[paper_path] = {\n",
" \"paper_path\": full_paper_path,\n",
" \"json_dicts\": json_dicts,\n",
" }"
" paths_to_parse.append(full_paper_path)\n",
"\n",
"\n",
"results = await parser.aparse(paths_to_parse)"
]
},
{
@@ -234,44 +227,7 @@
"source": [
"#### Get Text Nodes\n",
"\n",
"Convert the dictionary above into TextNode objects that we can put into a vector store."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18c24174-05ce-417f-8dd2-79c3f375db03",
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.schema import TextNode\n",
"from typing import Optional"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e331dfe-a627-4e23-8c57-70ab1d9342e4",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: these are utility functions to sort the dumped images by the page number\n",
"# (they are formatted like \"{uuid}-{page_num}.jpg\"\n",
"import re\n",
"\n",
"\n",
"def get_page_number(file_name):\n",
" match = re.search(r\"-page-(\\d+)\\.jpg$\", str(file_name))\n",
" if match:\n",
" return int(match.group(1))\n",
" return 0\n",
"\n",
"\n",
"def _get_sorted_image_files(image_dir):\n",
" \"\"\"Get image files sorted by page.\"\"\"\n",
" raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]\n",
" sorted_files = sorted(raw_files, key=get_page_number)\n",
" return sorted_files"
"Using each result object, we can create a list of text nodes with metadata attached."
]
},
{
@@ -281,21 +237,20 @@
"metadata": {},
"outputs": [],
"source": [
"from copy import deepcopy\n",
"from pathlib import Path\n",
"from llama_index.core.schema import TextNode\n",
"\n",
"\n",
"# attach image metadata to the text nodes\n",
"def get_text_nodes(json_dicts, paper_path):\n",
"def get_text_nodes(result):\n",
" \"\"\"Split docs into nodes, by separator.\"\"\"\n",
" nodes = []\n",
"\n",
" md_texts = [d[\"md\"] for d in json_dicts]\n",
" md_texts = [page.md for page in result.pages]\n",
"\n",
" for idx, md_text in enumerate(md_texts):\n",
" chunk_metadata = {\n",
" \"page_num\": idx + 1,\n",
" \"paper_path\": paper_path,\n",
" \"paper_path\": result.file_name,\n",
" }\n",
" node = TextNode(\n",
" text=md_text,\n",
@@ -316,11 +271,28 @@
"# this will combine all nodes from all papers into a single list\n",
"all_text_nodes = []\n",
"text_nodes_dict = {}\n",
"for paper_path, paper_dict in paper_dicts.items():\n",
" json_dicts = paper_dict[\"json_dicts\"]\n",
" text_nodes = get_text_nodes(json_dicts, paper_dict[\"paper_path\"])\n",
"for result in results:\n",
" text_nodes = get_text_nodes(result)\n",
" all_text_nodes.extend(text_nodes)\n",
" text_nodes_dict[paper_path] = text_nodes"
" text_nodes_dict[result.file_name] = text_nodes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e8fb9df",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"106\n"
]
}
],
"source": [
"print(len(all_text_nodes))"
]
},
{
@@ -442,18 +414,15 @@
" The user will give the document text below.\n",
" \n",
" \"\"\"\n",
" llm = llm or OpenAI(model=\"gpt-4o\")\n",
" llm = llm or OpenAI(model=\"gpt-5-mini\", api_key=\"sk-...\")\n",
" sllm = llm.as_structured_llm(SectionsOutput)\n",
"\n",
" chat_template = ChatPromptTemplate(\n",
" [\n",
" ChatMessage.from_str(system_prompt, \"system\"),\n",
" ChatMessage.from_str(\"Document text: {doc_text}\", \"user\"),\n",
" ]\n",
" )\n",
" result = await llm.astructured_predict(\n",
" SectionsOutput, chat_template, doc_text=doc_text\n",
" )\n",
" return result.sections\n",
" messages = [\n",
" ChatMessage(content=system_prompt, role=\"system\"),\n",
" ChatMessage(content=f\"Document text: {doc_text}\", role=\"user\"),\n",
" ]\n",
" result = await sllm.achat(messages)\n",
" return result.raw.sections\n",
"\n",
"\n",
"async def arefine_sections(\n",
@@ -472,23 +441,20 @@
" Given this, return the list of indexes that are valid. Do NOT include the indexes to be removed.\n",
" \n",
" \"\"\"\n",
" llm = llm or OpenAI(model=\"gpt-4o\")\n",
"\n",
" chat_template = ChatPromptTemplate(\n",
" [\n",
" ChatMessage.from_str(system_prompt, \"system\"),\n",
" ChatMessage.from_str(\"Sections in text:\\n\\n{sections}\", \"user\"),\n",
" ]\n",
" )\n",
" llm = llm or OpenAI(model=\"gpt-5-mini\", api_key=\"sk-...\")\n",
" sllm = llm.as_structured_llm(ValidSections)\n",
"\n",
" section_texts = \"\\n\".join(\n",
" [f\"{idx}: {json.dumps(s.dict())}\" for idx, s in enumerate(sections)]\n",
" [f\"{idx}: {json.dumps(s.model_dump())}\" for idx, s in enumerate(sections)]\n",
" )\n",
"\n",
" result = await llm.astructured_predict(\n",
" ValidSections, chat_template, sections=section_texts\n",
" )\n",
" valid_indexes = result.valid_indexes\n",
" messages = [\n",
" ChatMessage(content=system_prompt, role=\"system\"),\n",
" ChatMessage(content=f\"Sections in text:\\n\\n{section_texts}\", role=\"user\"),\n",
" ]\n",
"\n",
" result = await sllm.achat(messages)\n",
" valid_indexes = result.raw.valid_indexes\n",
"\n",
" new_sections = [s for idx, s in enumerate(sections) if idx in valid_indexes]\n",
" return new_sections\n",
@@ -514,17 +480,7 @@
"execution_count": null,
"id": "6e360a5c-29bd-4d86-9a21-f46013bab39a",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████| 51/51 [00:11<00:00, 4.35it/s]\n",
"100%|██████████████████████████████████████████████████████████████████████| 30/30 [00:09<00:00, 3.05it/s]\n",
"100%|██████████████████████████████████████████████████████████████████████| 25/25 [00:07<00:00, 3.22it/s]\n"
]
}
],
"outputs": [],
"source": [
"sections_dict = asyncio_run(acreate_sections(text_nodes_dict))"
]
@@ -538,36 +494,36 @@
{
"data": {
"text/plain": [
"[SectionOutput(section_name='1', section_title='INTRODUCTION', start_page_number=1, is_subsection=False, description='# 1 INTRODUCTION'),\n",
" SectionOutput(section_name='2', section_title='BENCHMARK CONSTRUCTION', start_page_number=2, is_subsection=False, description='# BENCHMARK CONSTRUCTION'),\n",
" SectionOutput(section_name='2.2', section_title='TASK FORMULATION', start_page_number=3, is_subsection=True, description='# 2.2 TASK FORMULATION'),\n",
" SectionOutput(section_name='2.3', section_title='FEATURES OF SWE-BENCH', start_page_number=3, is_subsection=True, description='# 2.3 FEATURES OF SWE-BENCH'),\n",
" SectionOutput(section_name='3', section_title='SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH', start_page_number=3, is_subsection=False, description='# 3 SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH'),\n",
"[SectionOutput(section_name='1', section_title='Introduction', start_page_number=1, is_subsection=False, description='## 1 Introduction'),\n",
" SectionOutput(section_name='2.2', section_title='TASK FORMULATION', start_page_number=3, is_subsection=True, description='## 2.2 TASK FORMULATION'),\n",
" SectionOutput(section_name='2.3', section_title='FEATURES OF SWE-BENCH', start_page_number=3, is_subsection=True, description='## 2.3 FEATURES OF SWE-BENCH'),\n",
" SectionOutput(section_name='3', section_title='SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH', start_page_number=3, is_subsection=False, description='## 3 SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH'),\n",
" SectionOutput(section_name='4', section_title='EXPERIMENTAL SETUP', start_page_number=4, is_subsection=False, description='# 4 EXPERIMENTAL SETUP'),\n",
" SectionOutput(section_name='4.1', section_title='RETRIEVAL-BASED APPROACH', start_page_number=4, is_subsection=True, description='# 4.1 RETRIEVAL-BASED APPROACH'),\n",
" SectionOutput(section_name='4.2', section_title='INPUT FORMAT', start_page_number=5, is_subsection=True, description='# 4.2 INPUT FORMAT'),\n",
" SectionOutput(section_name='4.3', section_title='MODELS', start_page_number=5, is_subsection=True, description='# 4.3 MODELS'),\n",
" SectionOutput(section_name='4.1', section_title='RETRIEVAL-BASED APPROACH', start_page_number=4, is_subsection=True, description='## 4.1 RETRIEVAL-BASED APPROACH'),\n",
" SectionOutput(section_name='4.2', section_title='INPUT FORMAT', start_page_number=5, is_subsection=True, description='## 4.2 INPUT FORMAT'),\n",
" SectionOutput(section_name='4.3', section_title='MODELS', start_page_number=5, is_subsection=True, description='## 4.3 MODELS'),\n",
" SectionOutput(section_name='5', section_title='RESULTS', start_page_number=5, is_subsection=False, description='# 5 RESULTS'),\n",
" SectionOutput(section_name='5.1', section_title='A QUALITATIVE ANALYSIS OF SWE-LLAMA GENERATIONS', start_page_number=8, is_subsection=True, description='# 5.1 A QUALITATIVE ANALYSIS OF SWE-LLAMA GENERATIONS'),\n",
" SectionOutput(section_name='6', section_title='RELATED WORK', start_page_number=8, is_subsection=False, description='# 6 RELATED WORK'),\n",
" SectionOutput(section_name='7', section_title='DISCUSSION', start_page_number=9, is_subsection=False, description='# 7 DISCUSSION'),\n",
" SectionOutput(section_name='8', section_title='ETHICS STATEMENT', start_page_number=10, is_subsection=False, description='# 8 ETHICS STATEMENT'),\n",
" SectionOutput(section_name='9', section_title='REPRODUCIBILITY STATEMENT', start_page_number=10, is_subsection=False, description='# 9 REPRODUCIBILITY STATEMENT'),\n",
" SectionOutput(section_name='10', section_title='ACKNOWLEDGEMENTS', start_page_number=10, is_subsection=False, description='# 10 ACKNOWLEDGEMENTS'),\n",
" SectionOutput(section_name='A', section_title='BENCHMARK DETAILS', start_page_number=15, is_subsection=False, description='# A BENCHMARK DETAILS'),\n",
" SectionOutput(section_name='A.1', section_title='HIGH LEVEL OVERVIEW', start_page_number=15, is_subsection=True, description='# A.1 HIGH LEVEL OVERVIEW'),\n",
" SectionOutput(section_name='A.2', section_title='CONSTRUCTION PROCESS', start_page_number=16, is_subsection=True, description='# A.2 CONSTRUCTION PROCESS'),\n",
" SectionOutput(section_name='A.3', section_title='Execution-Based Validation', start_page_number=18, is_subsection=True, description='# A.3 EXECUTION-BASED VALIDATION'),\n",
" SectionOutput(section_name='A.5', section_title='Evaluation Test Set Characterization', start_page_number=20, is_subsection=True, description='# A.5 EVALUATION TEST SET CHARACTERIZATION'),\n",
" SectionOutput(section_name='A.6', section_title='DEVELOPMENT SET CHARACTERIZATION', start_page_number=23, is_subsection=True, description='# A.6 DEVELOPMENT SET CHARACTERIZATION'),\n",
" SectionOutput(section_name='B', section_title='ADDITIONAL DETAILS ON TRAINING SWE-LLAMA', start_page_number=24, is_subsection=False, description='# B ADDITIONAL DETAILS ON TRAINING SWE-LLAMA'),\n",
" SectionOutput(section_name='B.1', section_title='TRAINING DETAILS', start_page_number=24, is_subsection=True, description='# B.1 TRAINING DETAILS'),\n",
" SectionOutput(section_name='D', section_title='ADDITIONAL EXPERIMENTAL DETAILS', start_page_number=28, is_subsection=False, description='# D ADDITIONAL EXPERIMENTAL DETAILS'),\n",
" SectionOutput(section_name='D.1', section_title='RETRIEVAL DETAILS', start_page_number=28, is_subsection=True, description='# D.1 RETRIEVAL DETAILS'),\n",
" SectionOutput(section_name='D.2', section_title='INFERENCE SETTINGS', start_page_number=29, is_subsection=True, description='# D.2 INFERENCE SETTINGS'),\n",
" SectionOutput(section_name='D.3', section_title='PROMPT TEMPLATE EXAMPLE', start_page_number=29, is_subsection=True, description='# D.3 PROMPT TEMPLATE EXAMPLE'),\n",
" SectionOutput(section_name='E', section_title='Societal Impact', start_page_number=31, is_subsection=False, description='# E SOCIETAL IMPACT'),\n",
" SectionOutput(section_name='F', section_title='In-Depth Analysis of SWE-Llama Generations', start_page_number=31, is_subsection=False, description='# F IN-DEPTH ANALYSIS OF SWE-LLAMA GENERATIONS')]"
" SectionOutput(section_name='A.1', section_title='HIGH LEVEL OVERVIEW', start_page_number=15, is_subsection=True, description='### A.1 HIGH LEVEL OVERVIEW'),\n",
" SectionOutput(section_name='A.2', section_title='CONSTRUCTION PROCESS', start_page_number=16, is_subsection=True, description='## A.2 CONSTRUCTION PROCESS'),\n",
" SectionOutput(section_name='A.3', section_title='EXECUTION-BASED VALIDATION', start_page_number=18, is_subsection=True, description='### A.3 EXECUTION-BASED VALIDATION'),\n",
" SectionOutput(section_name='A.4', section_title='EVALUATION PROCEDURE', start_page_number=19, is_subsection=True, description='## A.4 EVALUATION PROCEDURE'),\n",
" SectionOutput(section_name='A.5', section_title='EVALUATION TEST SET CHARACTERIZATION', start_page_number=20, is_subsection=True, description='## A.5 EVALUATION TEST SET CHARACTERIZATION'),\n",
" SectionOutput(section_name='A.6', section_title='DEVELOPMENT SET CHARACTERIZATION', start_page_number=23, is_subsection=True, description='## A.6 DEVELOPMENT SET CHARACTERIZATION'),\n",
" SectionOutput(section_name='B.1', section_title='TRAINING DETAILS', start_page_number=24, is_subsection=True, description='## B.1 TRAINING DETAILS'),\n",
" SectionOutput(section_name='C.1', section_title='RESULTS WITH “ORACLE” RETRIEVAL', start_page_number=24, is_subsection=True, description='## C.1 RESULTS WITH “ORACLE” RETRIEVAL'),\n",
" SectionOutput(section_name='C.2', section_title='EVALUATION TEST SET', start_page_number=24, is_subsection=True, description='## C.2 EVALUATION TEST SET'),\n",
" SectionOutput(section_name='C.3', section_title='GPT-4 EVALUATION SUBSET RESULTS', start_page_number=24, is_subsection=True, description='## C.3 GPT-4 EVALUATION SUBSET RESULTS'),\n",
" SectionOutput(section_name='C.4', section_title='EXTENDED TEMPORAL ANALYSIS', start_page_number=25, is_subsection=True, description='## C.4 EXTENDED TEMPORAL ANALYSIS'),\n",
" SectionOutput(section_name='C.5', section_title='F2P, P2P RATE ANALYSIS', start_page_number=25, is_subsection=True, description='## C.5 F2P, P2P RATE ANALYSIS'),\n",
" SectionOutput(section_name='C.7', section_title='SOFTWARE ENGINEERING METRICS', start_page_number=27, is_subsection=True, description='## C.7 SOFTWARE ENGINEERING METRICS'),\n",
" SectionOutput(section_name='D.1', section_title='RETRIEVAL DETAILS', start_page_number=28, is_subsection=True, description='## D.1 RETRIEVAL DETAILS'),\n",
" SectionOutput(section_name='D.2', section_title='INFERENCE SETTINGS', start_page_number=29, is_subsection=True, description='## D.2 INFERENCE SETTINGS'),\n",
" SectionOutput(section_name='D.3', section_title='PROMPT TEMPLATE EXAMPLE', start_page_number=29, is_subsection=True, description='## D.3 PROMPT TEMPLATE EXAMPLE')]"
]
},
"execution_count": null,
@@ -576,7 +532,7 @@
}
],
"source": [
"sections_dict[\"swebench.pdf\"]"
"sections_dict[\"iclr_docs/swebench.pdf\"]"
]
},
{
@@ -755,7 +711,7 @@
"from llama_index.vector_stores.chroma import ChromaVectorStore\n",
"from llama_index.core import VectorStoreIndex\n",
"\n",
"persist_dir = \"storage_chroma\"\n",
"persist_dir = \"chroma_storage\"\n",
"\n",
"vector_store = ChromaVectorStore.from_params(\n",
" collection_name=\"text_nodes\", persist_dir=persist_dir\n",
@@ -805,7 +761,7 @@
"source": [
"from llama_index.llms.openai import OpenAI\n",
"\n",
"llm = OpenAI(model=\"gpt-4o\")"
"llm = OpenAI(model=\"gpt-5-mini\", api_key=\"sk-...\")"
]
},
{
@@ -833,6 +789,7 @@
" FilterCondition,\n",
")\n",
"from llama_index.core.schema import NodeWithScore\n",
"from typing import List\n",
"\n",
"\n",
"def section_retrieve(query: str, verbose: bool = False) -> List[NodeWithScore]:\n",
@@ -870,57 +827,6 @@
" return all_section_nodes.values()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f721e770-ce4c-4511-96d5-8a89d16c7281",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">> Identifying the right sections to retrieve\n",
">> Retrieving section: A: BENCHMARK DETAILS\n",
">> Retrieving section: 2: BENCHMARK CONSTRUCTION\n",
">> Retrieving section: A: BENCHMARK DETAILS\n"
]
}
],
"source": [
"nodes = section_retrieve(\n",
" \"Give me a full overview of the benchmark details in SWE Bench\", verbose=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e99eaa71-7d93-40c0-bba0-a9c983a6cbd3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'page_num': 15, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.1: HIGH LEVEL OVERVIEW'}\n",
"{'page_num': 16, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.2: CONSTRUCTION PROCESS'}\n",
"{'page_num': 17, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.2: CONSTRUCTION PROCESS'}\n",
"{'page_num': 18, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.3: Execution-Based Validation'}\n",
"{'page_num': 19, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.3: Execution-Based Validation'}\n",
"{'page_num': 20, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.5: Evaluation Test Set Characterization'}\n",
"{'page_num': 21, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.5: Evaluation Test Set Characterization'}\n",
"{'page_num': 22, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.5: Evaluation Test Set Characterization'}\n",
"{'page_num': 23, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.6: DEVELOPMENT SET CHARACTERIZATION'}\n",
"{'page_num': 2, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': '2: BENCHMARK CONSTRUCTION', 'sub_section_id': '2: BENCHMARK CONSTRUCTION'}\n"
]
}
],
"source": [
"for n in nodes:\n",
" print(n.node.metadata)"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -932,9 +838,9 @@
"output_type": "stream",
"text": [
">> Identifying the right sections to retrieve\n",
">> Retrieving section: F: ADDITIONAL RESULTS\n",
">> Retrieving section: 6: Conclusion\n",
">> Retrieving section: 5: EXPERIMENTS\n",
">> Retrieving section: F: ADDITIONAL RESULTS\n"
">> Retrieving section: 5: EXPERIMENTS\n"
]
}
],
@@ -955,11 +861,26 @@
"name": "stdout",
"output_type": "stream",
"text": [
"{'page_num': 21, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': 'F: ADDITIONAL RESULTS', 'sub_section_id': 'F.1: FULL QUALITATIVE RESULTS'}\n",
"{'page_num': 22, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': 'F: ADDITIONAL RESULTS', 'sub_section_id': 'F.4: Additional Baselines'}\n",
"{'page_num': 9, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 10, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 11, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 12, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 13, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 14, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 15, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 16, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}\n",
"{'page_num': 17, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'C.1: Universality of Inner Product Decomposition'}\n",
"{'page_num': 18, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'C.2: Lipschitz Constraint under the Temporal Distance Metric'}\n",
"{'page_num': 19, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'C.2: Lipschitz Constraint under the Temporal Distance Metric'}\n",
"{'page_num': 20, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'E.2: DADS'}\n",
"{'page_num': 21, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'F.1: FULL QUALITATIVE RESULTS'}\n",
"{'page_num': 22, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'F.4: ADDITIONAL BASELINES'}\n",
"{'page_num': 23, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'G.1: Environments'}\n",
"{'page_num': 24, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'G.2: IMPLEMENTATION DETAILS'}\n",
"{'page_num': 25, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': 'G.2: IMPLEMENTATION DETAILS'}\n",
"{'page_num': 6, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '5: EXPERIMENTS', 'sub_section_id': '5: EXPERIMENTS'}\n",
"{'page_num': 7, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '5: EXPERIMENTS', 'sub_section_id': '5.2: QUALITATIVE COMPARISON'}\n",
"{'page_num': 8, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '5: EXPERIMENTS', 'sub_section_id': '5.3: QUANTITATIVE COMPARISON'}\n"
"{'page_num': 8, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '5: EXPERIMENTS', 'sub_section_id': '5.3: Quantitative Comparison'}\n"
]
}
],
@@ -1027,10 +948,24 @@
"output_type": "stream",
"text": [
">> Identifying the right sections to retrieve\n",
">> Retrieving section: A: BENCHMARK DETAILS\n",
">> Retrieving section: 5: RESULTS\n",
">> Retrieving section: A: BENCHMARK DETAILS\n",
"In SWEBench, difficulty correlates with context length in a way that as the total context length increases, model performance tends to drop. This is observed across various models, including Claude 2, which shows a significant decrease in performance with longer context lengths. The models often struggle to localize the problematic code that needs updating when presented with a lot of code that may not be directly related to the issue at hand. This suggests that models can become distracted by additional context, which aligns with findings from other studies indicating that models may be sensitive to the relative location of target sequences. Even when increasing the maximum context size improves recall with respect to the oracle files, performance still drops, indicating that models are ineffective at localizing the necessary code changes.\n"
">> Retrieving section: 3: SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH\n",
">> Retrieving section: 4: EXPERIMENTAL SETUP\n",
"Key findings about how difficulty correlates with context length\n",
"\n",
"- Performance falls as total input/context size grows. As the amount of code and other context provided to models increases, their ability to localize and produce correct edits drops noticeably (this behavior was observed across multiple models, e.g., Claude 2 and others).\n",
"\n",
"- Extra (irrelevant) context distracts models. When models are given a lot of code that is unrelated to the actual edit, they frequently struggle to find the problematic lines that need changing. This sensitivity includes the relative location of the target code within the larger context.\n",
"\n",
"- Increasing retriever recall doesn't fix it. Expanding retrieval windows (to include more files and therefore raise oracle recall) can actually hurt end-to-end performance because models become less effective at pinpointing the needed edits amid the extra material.\n",
"\n",
"- Collapsing context around the true edits helps. An ablation that collapses retrieved files to only the lines actually modified in the reference patch (±15 lines) improved results — for example, one models resolved rate rose from 4.8% to 5.9%, and another increased from ~1.3% to 3.4% — showing that concentrating context on the most relevant snippets makes the task easier.\n",
"\n",
"- Finetuned models are sensitive to context-distribution shifts. Models fine-tuned on tightly scoped (oracle) contexts performed worse when given BM25-retrieved context that contained many irrelevant files, indicating that training with one style of context can reduce robustness to different retrieval outputs.\n",
"\n",
"Implications\n",
"- Better retrieval or context-compression methods (e.g., more precise retrieval, collapsing to edited regions, or preprocessing to highlight likely relevant locations) are likely more useful than simply increasing context size.\n",
"- Robust model behavior requires not just larger windows but mechanisms for localization and filtering of relevant code within long contexts.\n"
]
}
],
@@ -1052,18 +987,98 @@
"output_type": "stream",
"text": [
">> Identifying the right sections to retrieve\n",
">> Retrieving section: A: BENCHMARK DETAILS\n",
">> Retrieving section: 2: BENCHMARK CONSTRUCTION\n",
">> Retrieving section: A: BENCHMARK DETAILS\n",
"SWE-bench is a benchmark designed to evaluate language models in a realistic software engineering setting by using GitHub issues and pull requests from popular repositories. The benchmark involves generating a pull request that addresses a given issue and passes related tests. The construction of SWE-bench involves a three-stage pipeline:\n",
">> Retrieving section: 10: ACKNOWLEDGEMENTS\n",
">> Retrieving section: 1: Introduction\n",
">> Retrieving section: 3: SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH\n",
"High-level summary\n",
"- SWE-bench is a repository-scale, execution-validated benchmark of real GitHub issues paired with merged pull-request solutions. Each task gives a snapshot of a real codebase plus an issue description; the model must produce a patch that, when applied, makes the repository pass the tests that verify the issue was addressed.\n",
"- The benchmark emphasizes realistic, hard software-engineering problems: large codebases, multi-file edits, long issue descriptions, and unit tests used for automatic verification.\n",
"\n",
"1. **Repo Selection and Data Scraping**: Pull requests are collected from 12 popular open-source Python repositories on GitHub, resulting in approximately 90,000 PRs. These repositories are chosen for their better maintenance, clear contributor guidelines, and comprehensive test coverage.\n",
"Data sources and collection\n",
"- Candidate PRs are sourced from popular Python projects (selected from highly downloaded PyPI packages and mapped to their GitHub repositories). Repositories are filtered to ensure permissible licenses.\n",
"- Pull requests are collected via the GitHub API and then filtered automatically.\n",
"\n",
"2. **Attribute-Based Filtering**: Candidate tasks are created by selecting merged PRs that resolve a GitHub issue and contribute tests. This indicates that the user likely added tests to verify the resolution of the issue.\n",
"Task-instance selection criteria\n",
"A PR becomes a candidate task only if it satisfies all of:\n",
"- Status = merged (the PR was accepted).\n",
"- The PR resolves one or more GitHub issues (detected via links like “fixes #N” in title/body/commits).\n",
"- The PR introduces or edits test files (file paths containing test-related keywords).\n",
"Only candidates that pass execution-based validation are kept.\n",
"\n",
"3. **Execution-Based Filtering**: For each candidate task, the PR's test content is applied, and test results are logged before and after applying the PR's other content. Tasks are filtered out if they do not have at least one test that changes from fail to pass or if they result in installation or runtime errors.\n",
"Task-instance components\n",
"Each task instance encodes:\n",
"- Codebase reference C: repo owner/name and the base commit (mirrored repositories are created so code can be retrieved reproducibly).\n",
"- Problem statement P: aggregated issue titles and descriptions and any issue/PR comments up to the PRs first commit (no post-solution comments that would leak the fix).\n",
"- Tests T: the tests introduced/edited by the PR (extracted from the PR diff and stored as a .patch).\n",
"- Solution δ (gold patch): the PRs code changes excluding test edits (stored as a .patch).\n",
"- Metadata fields: base_commit, created_at, instance_id, issue_numbers, repo, pull_number, version, env_install_commit, hints_text (collected comments), and cached test result mappings like FAIL_TO_PASS and PASS_TO_PASS.\n",
"\n",
"The benchmark is designed to be extensible, allowing for updates with new task instances as new language models are released. It includes a robust framework for execution-based evaluation, ensuring that generated solutions can be verified by running unit tests. SWE-bench also provides a training dataset, SWE-bench-train, and fine-tuned models like SWE-Llama 7b and 13b, which are based on the CodeLlama model. These models are evaluated on their ability to resolve issues, with SWE-Llama 13b showing competitive performance in some settings.\n"
"Execution-based validation (quality control)\n",
"- Virtual execution contexts are created per repository release version (manual inspection of README/contributing to determine Python version, dependencies, install commands). Conda environments are used.\n",
"- For each candidate instance the pipeline:\n",
" 1. Checks out the base commit.\n",
" 2. Installs the codebase in the corresponding env.\n",
" 3. Applies the test patch T and runs tests (log_pre).\n",
" 4. Applies the solution patch δ and runs tests again (log_post).\n",
"- Candidates are discarded if any step fails (checkout, install, apply patch, test run).\n",
"- Instances are retained only if at least one test changes from fail → pass (a true FAIL_TO_PASS) and if there are no trivial issues (e.g., ImportError or AttributeError in log_pre that indicate missing dependency/name issues).\n",
"- Instances whose tests exercise newly created functions/classes (i.e., tests requiring names introduced by δ) are excluded because they would be impossible to solve from the problem statement alone.\n",
"\n",
"Task-instance format and artifacts\n",
"- Finalized instances are saved in a single JSON file (task metadata and patch contents are included as patch-format strings).\n",
"- For each instance the validation engine caches parsed test-to-status mappings for log_pre/log_post and creates ground-truth lists: FAIL_TO_PASS, PASS_TO_PASS (used during evaluation to check both that the fix was implemented and that prior behavior is preserved).\n",
"- Mirrors of original repositories are created and stored to preserve exact base commits and enable reproducible checkout.\n",
"\n",
"Evaluation procedure (how models are scored)\n",
"- Model input: problem statement P and the codebase C (usually limited by retrieval/long-context strategy). The model must generate a single .patch (a git/unified-diff style patch).\n",
"- Per predicted patch the evaluation harness:\n",
" 1. Resets repo to base commit.\n",
" 2. Activates the executable context for the instance version.\n",
" 3. Installs the codebase.\n",
" 4. Applies the test patch T.\n",
" 5. Attempts to apply the predicted patch \\hat{δ}. If applying fails, an automatic \"patch-fix\" step tries to repair the patch (e.g., strip extraneous context lines and recalculate headers); if it still fails the prediction is scored as failure.\n",
" 6. Runs the repositorys test command to generate log_{\\hat{δ}}.\n",
" 7. Parses log_{\\hat{δ}} into a test-to-status mapping using repository-specific parsers.\n",
" 8. Declares the task solved only if all tests listed in FAIL_TO_PASS and PASS_TO_PASS have status = pass in log_{\\hat{δ}}.\n",
"- The principal metric is % Resolved: fraction of task instances fully solved (all required tests pass).\n",
"\n",
"Patch-fixing and robustness\n",
"- If a generated patch does not apply, the harness attempts an automated repair (e.g., removing context lines, fixing header offsets) before giving up. Applied-but-broken patches that then fail tests are classified according to pass/fail patterns (Resolved, Breaking Resolved, Partially Resolved, Work-in-Progress, No-Op, Regression) to provide finer-grained analysis.\n",
"\n",
"Dataset scale and characterization\n",
"- Raw crawl: ~93k PRs across selected repositories; after conversion/filters and execution validation the final evaluation set contains 2,294 task instances.\n",
"- Instances come from 12 widely used Python repositories with varied sizes and purposes (e.g., scikit-learn, Django, matplotlib, requests, pytest, sympy, astropy, etc.).\n",
"- Typical instance properties: long problem descriptions (median ~140 words), large repositories (median ~thousands of files and hundreds of thousands of lines), and reference edits that usually touch ~12 files, edit a few functions, and modify a few dozen lines on average.\n",
"- Tests: each instance has at least one FAIL_TO_PASS; many instances include many PASS_TO_PASS tests for regression protection (median tens to hundreds of pass-to-pass tests).\n",
"\n",
"Development set, train set, and extensions\n",
"- A smaller development set (~225 instances, >10% of the main set) is provided for tuning and debugging.\n",
"- A separate SWE-bench-train dataset (19k non-testing task instances from many repos) was prepared for fine-tuning models; fine-tuned models were released (SWE-Llama 7B and 13B) to study open-model performance on long contexts.\n",
"- The collection pipeline and mirror strategy were designed to be easily extendable so the benchmark can be updated continuously with new PRs and support additional languages or repos.\n",
"\n",
"Reproducibility and release commitments\n",
"- The codebase used to collect, validate, and evaluate task instances is organized and documented; mirrors and the JSON of task instances are provided so others can reproduce experiments.\n",
"- Execution contexts, validation logs, and ground-truth test mappings are cached to avoid re-running expensive validation at evaluation time.\n",
"- Plans include open-sourcing the task instances, collection/evaluation infrastructure, training data used for fine-tuning, and model weights along with documentation.\n",
"\n",
"Design decisions and safeguards\n",
"- Using merged PRs that added tests provides a strong ground-truth signal that the PR truly solved the issue and allowed for reproducible verification.\n",
"- Excluding instances with trivial dependency/name errors or tests that require newly-introduced symbol names ensures tasks are solvable from the given P + C without hidden knowledge.\n",
"- Mirroring repositories preserves commit history and avoids breakage from later upstream edits.\n",
"\n",
"What solving a task means (concrete criterion)\n",
"- A generated patch must apply and, after applying the repositorys tests, every test that the validation flagged as verifying the issue (FAIL_TO_PASS) must now pass, and all tests that previously passed but were intended to remain passing (PASS_TO_PASS) must still pass. Only then is the task counted as solved.\n",
"\n",
"Utility and intended uses\n",
"- The benchmark measures model ability to: localize defects, reason across a large codebase, produce multi-line and multi-file edits in patch format, and use execution feedback (tests) as verification.\n",
"- It is intended both as a hard evaluation for current models and as a development target for models and systems that perform repository-scale code edits, retrieval from large codebases, iterative editing with execution feedback, or agent-style multi-step repair.\n",
"\n",
"Limitations to be aware of\n",
"- The benchmark focuses on repositories with permissive licenses and decent test coverage (popular projects), so it emphasizes bug fixes and features that were covered by tests and merged in those projects.\n",
"- Some tasks that require creating new symbol names first introduced in the solution are excluded because they would not be solvable from the baseline inputs.\n",
"- Execution environments are created per release version (manual aspects exist), and some instances are discarded when installation or environment setup cannot be reliably reproduced.\n",
"\n",
"Overall, SWE-bench provides a large, execution-validated, reproducible suite of real-world repository-scale code-editing tasks that require understanding long contexts and producing correct patch-format edits verified by the projects own tests.\n"
]
}
],
@@ -1074,34 +1089,6 @@
"print(str(response))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d747bf8-0ed2-4c10-8108-9d0e8d53a4fb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'page_num': 15, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.1: HIGH LEVEL OVERVIEW'}\n",
"{'page_num': 16, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.2: CONSTRUCTION PROCESS'}\n",
"{'page_num': 17, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.2: CONSTRUCTION PROCESS'}\n",
"{'page_num': 18, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.3: Execution-Based Validation'}\n",
"{'page_num': 19, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.3: Execution-Based Validation'}\n",
"{'page_num': 20, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.5: Evaluation Test Set Characterization'}\n",
"{'page_num': 21, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.5: Evaluation Test Set Characterization'}\n",
"{'page_num': 22, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.5: Evaluation Test Set Characterization'}\n",
"{'page_num': 23, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': 'A: BENCHMARK DETAILS', 'sub_section_id': 'A.6: DEVELOPMENT SET CHARACTERIZATION'}\n",
"{'page_num': 2, 'paper_path': 'iclr_docs/swebench.pdf', 'section_id': '2: BENCHMARK CONSTRUCTION', 'sub_section_id': '2: BENCHMARK CONSTRUCTION'}\n"
]
}
],
"source": [
"for n in response.source_nodes:\n",
" print(n.metadata)"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -1113,20 +1100,76 @@
"output_type": "stream",
"text": [
">> Identifying the right sections to retrieve\n",
">> Retrieving section: F: ADDITIONAL RESULTS\n",
">> Retrieving section: 6: Conclusion\n",
">> Retrieving section: 5: EXPERIMENTS\n",
">> Retrieving section: F: ADDITIONAL RESULTS\n",
"The additional experimental results in the METRA paper include several key findings:\n",
">> Retrieving section: 5: EXPERIMENTS\n",
"Here are the additional experimental results and analyses reported.\n",
"\n",
"1. **Full Qualitative Results**: METRA discovers diverse locomotion behaviors across different environments, including state-based Ant and HalfCheetah, and pixel-based Quadruped and Humanoid. The results are consistent across multiple random seeds, indicating robustness in behavior discovery.\n",
"1) Full qualitative results (complete skill behaviors, 8 seeds)\n",
"- Environments: state-based Ant and HalfCheetah; pixel-based Quadruped and Humanoid.\n",
"- Skill parameterizations used in these visualizations: 2-D continuous skills for Ant and Humanoid, 4-D continuous skills for Quadruped, 16 discrete skills for HalfCheetah.\n",
"- Main finding: across 8 random seeds METRA consistently discovers diverse locomotion behaviors (radial/x-y coverage, different locomotion modes) regardless of seed. The paper shows multiple sample trajectories per seed to illustrate robustness and diversity.\n",
"\n",
"2. **Latent Space Visualization**: METRA effectively captures the most temporally spread-out dimensions in the state space, such as x-y coordinates, in its latent space. This is demonstrated in both state-based and pixel-based environments, with higher-dimensional latent spaces capturing more diverse behaviors.\n",
"2) Latent-space visualization\n",
"- Setup: METRA trained with 2-D continuous latent space on Ant (state inputs) and Humanoid (pixel inputs).\n",
"- Observation: the learned representation φ(s) captures the agents x-y coordinates in the 2-D latent space in both Ant and Humanoid. The learned φ trajectories align with the x-y trajectories, indicating METRA finds the temporally most spread-out manifold (x-y plane) even from pixels.\n",
"- Note: with higher-dimensional or discrete latent spaces, METRA captures more diverse, non-linear behaviors beyond simple locomotion.\n",
"\n",
"3. **Ablation Study of Latent Space Sizes**: The study shows that increasing the size of the latent space generally enhances the diversity of skills learned by METRA. Different dimensions of continuous and discrete skills were tested on Ant and HalfCheetah.\n",
"3) Ablation: effect of latent-space size on learned skills\n",
"- Latent-space sizes tested: 1-D, 2-D, 4-D continuous; discrete sets of sizes {2}, {4}, {8}, {16}, {24}.\n",
"- Environments: Ant and HalfCheetah.\n",
"- Result: skill diversity increases as the capacity (dimensionality / cardinality) of Z grows.\n",
" - 1-D: simple linear/one-dimensional coverage\n",
" - 2-D: radial coverage / 2-D spread\n",
" - 4-D: more complex radial / richer behaviors\n",
" - Discrete increases produce progressively more distinct discrete behaviors (more segments, more diverse skill classes)\n",
"- Conclusion: METRA maximizes state coverage under latent capacity, so increasing Zs capacity yields more diverse discovered behaviors.\n",
"\n",
"4. **Comparison with Additional Baselines**: METRA was compared with DGPO, a method focused on finding diverse behaviors that maximize task rewards. The comparison was conducted in a controlled Markov process setting without external rewards, using only intrinsic rewards.\n",
"4) Additional baseline: DGPO comparison (discrete-skill comparison; 4 seeds)\n",
"- Experimental setup: DIAYN, DGPO, and METRA were trained with 16 discrete skills for 10,000 epochs (≈16M environment steps).\n",
"- Metrics reported: policy state coverage and total state coverage (means ± std).\n",
"- Results (Table reproduced):\n",
" - HalfCheetah (policy state coverage)\n",
" - DIAYN: 6.75 ± 2.22\n",
" - DGPO: 6.75 ± 2.06\n",
" - METRA: 186.75 ± 16.21\n",
" - HalfCheetah (total state coverage)\n",
" - DIAYN: 19.50 ± 3.87\n",
" - DGPO: 22.25 ± 5.85\n",
" - METRA: 177.75 ± 17.10\n",
" - Ant (policy state coverage)\n",
" - DIAYN: 11.25 ± 5.44\n",
" - DGPO: 7.00 ± 3.83\n",
" - METRA: 1387.75 ± 77.38\n",
" - Ant (total state coverage)\n",
" - DIAYN: 107.75 ± 17.00\n",
" - DGPO: 121.50 ± 4.36\n",
" - METRA: 6313.25 ± 747.92\n",
"- Interpretation given: DGPO (which maximizes a metric-agnostic KL-style objective in discrete Z) still produces limited state coverage similar to DIAYN, whereas METRA (a metric-aware Wasserstein formulation) achieves substantially greater coverage in these locomotion environments.\n",
"\n",
"These results highlight METRA's ability to discover diverse and meaningful behaviors in various environments, its effective use of latent spaces, and its performance relative to other methods.\n"
"5) Skill examples / qualitative descriptions by latent size\n",
"- A tabulated description shows how skills change qualitatively with latent-size choices (examples):\n",
" - Ant (continuous Z):\n",
" - 1-D: linearly increasing coverage\n",
" - 2-D: radial coverage with 2-D spread\n",
" - 4-D: more complex radial coverage\n",
" - Ant / HalfCheetah (discrete Z):\n",
" - Discrete 2 / 4 / 8 / 16 / 24 skills: progressively more segments and more diverse behaviors, with 24 discrete skills showing the highest diversity.\n",
"- The paper notes that with discrete Z METRA can discover qualitatively distinct behaviors such as flips or static postures (in addition to locomotion) when capacity is sufficient.\n",
"\n",
"6) Details on coverage metrics, datasets, and protocol used in these additional results\n",
"- Policy state coverage: computed by sampling 48 deterministic trajectories using 48 randomly sampled skills at each evaluation epoch (used for skill-discovery method policy coverage plots).\n",
"- Queue state coverage: computed from most recent 100,000 training trajectories (used for some comparisons).\n",
"- Total state coverage: computed from the entire set of training trajectories up to the current epoch (used as a generous metric for pure-exploration baselines).\n",
"- For locomotion coverage counting: x-y bins of 1×1 are counted for Ant, Quadruped, Humanoid; x bins for HalfCheetah. Kitchen uses task success counts for pre-defined subtasks.\n",
"- Seeds: most qualitative and skill-discovery comparisons use 8 seeds; the DGPO comparison reported used 4 seeds.\n",
"\n",
"7) Additional notes and takeaways from the extra experiments\n",
"- METRAs learned φ(s) is effective for zero-shot goal selection because φ preserves temporal distances; the latent difference φ(g) φ(s) gives a direction in Z to reach a goal.\n",
"- Increasing latent capacity helps but requires choosing continuous vs. discrete Z appropriately for the desired types of behaviors.\n",
"- The DGPO comparison further supports that metric-aware objectives (METRA) lead to substantially higher state coverage than metric-agnostic mutual-information/KL-style objectives.\n",
"\n",
"If you want, I can extract and present the specific numeric tables and captions (e.g., the full Table 1 numbers above) in CSV or another concise format, or summarize the visual findings into representative example trajectories for each latent-size setting.\n"
]
}
],
@@ -1140,9 +1183,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "llama_index_v3",
"display_name": ".venv",
"language": "python",
"name": "llama_index_v3"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -6,7 +6,19 @@
"source": [
"# LlamaParse Agent\n",
"\n",
"This demo walks through using an OpenAI Agent with [LlamaParse](https://cloud.llamaindex.ai)."
"This demo walks through using an OpenAI Agent with [LlamaParse](https://cloud.llamaindex.ai).\n",
"\n",
"Status:\n",
"| Last Executed | Version | State |\n",
"|---------------|---------|------------|\n",
"| Aug-19-2025 | 0.6.61 | Maintained |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
 **⚠️ DEPRECATION NOTICE**">
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
@@ -22,7 +34,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install llama-cloud-services llama-index llama-index-postprocessor-sbert-rerank"
"!pip install llama-cloud-services \"llama-index>=0.13.0<0.14.0\""
]
},
{
@@ -48,7 +60,7 @@
"from llama_index.llms.openai import OpenAI\n",
"\n",
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\")\n",
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.2)"
"Settings.llm = OpenAI(model=\"gpt-5-mini\")"
]
},
{
@@ -83,9 +95,15 @@
"outputs": [],
"source": [
"from llama_cloud_services import LlamaParse\n",
"from sympy import O\n",
"\n",
"parser = LlamaParse(\n",
" result_type=\"markdown\",\n",
" parse_mode=\"parse_page_with_agent\",\n",
" model=\"openai-gpt-4-1-mini\",\n",
" high_res_ocr=True,\n",
" adaptive_long_table=True,\n",
" outlined_table_extraction=True,\n",
" output_tables_as_HTML=True,\n",
")"
]
},
@@ -98,53 +116,27 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Started parsing the file under job_id 81251f39-01be-434e-99e8-1c1b83b82098\n"
"Started parsing the file under job_id cd1958b0-b260-4a63-aa74-bf829a0c125f\n",
".."
]
}
],
"source": [
"documents = await parser.aload_data(\"paper.pdf\")"
"result = await parser.aparse(\"paper.pdf\")\n",
"documents = result.get_markdown_documents(split_by_page=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Embeddings have been explicitly disabled. Using MockEmbedding.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"41it [00:00, 26765.21it/s]\n",
"100%|██████████| 41/41 [00:13<00:00, 2.98it/s]\n"
]
}
],
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"from llama_index.core.node_parser import (\n",
" MarkdownElementNodeParser,\n",
" SentenceSplitter,\n",
")\n",
"\n",
"# explicitly extract tables with the MarkdownElementNodeParser\n",
"node_parser = MarkdownElementNodeParser(num_workers=8)\n",
"nodes = node_parser.get_nodes_from_documents(documents)\n",
"nodes, objects = node_parser.get_nodes_and_objects(nodes)\n",
"from llama_index.core.node_parser import SentenceSplitter\n",
"\n",
"# Chain splitters to ensure chunk size requirements are met\n",
"nodes = SentenceSplitter(chunk_size=512, chunk_overlap=20).get_nodes_from_documents(\n",
" nodes\n",
"nodes = SentenceSplitter(chunk_size=2048, chunk_overlap=256).get_nodes_from_documents(\n",
" documents\n",
")"
]
},
@@ -173,30 +165,41 @@
"metadata": {},
"outputs": [],
"source": [
"from llama_index.agent.openai import OpenAIAgent\n",
"from llama_index.core.tools import QueryEngineTool, ToolMetadata\n",
"from llama_index.postprocessor.colbert_rerank import ColbertRerank\n",
"from llama_index.core.agent import FunctionAgent\n",
"from llama_index.core.tools import QueryEngineTool\n",
"\n",
"tools = [\n",
" QueryEngineTool(\n",
" QueryEngineTool.from_defaults(\n",
" vector_index.as_query_engine(\n",
" similarity_top_k=8, node_postprocessors=[ColbertRerank(top_n=3)]\n",
" ),\n",
" metadata=ToolMetadata(\n",
" name=\"search\",\n",
" description=\"Search the document, pass the entire user message in the query\",\n",
" similarity_top_k=4,\n",
" ),\n",
" name=\"query\",\n",
" description=\"Send a query that requires only a subset of the top-k documents to be considered\",\n",
" ),\n",
" QueryEngineTool(\n",
" QueryEngineTool.from_defaults(\n",
" summary_index.as_query_engine(),\n",
" metadata=ToolMetadata(\n",
" name=\"summarize\",\n",
" description=\"Summarize the document using the user message\",\n",
" ),\n",
" name=\"query_all_docs\",\n",
" description=\"Send a query that requires all documents to be considered\",\n",
" ),\n",
"]\n",
"\n",
"agent = OpenAIAgent.from_tools(tools=tools, verbose=True)"
"agent = FunctionAgent(\n",
" tools=tools,\n",
" llm=Settings.llm,\n",
" system_prompt=\"You are a helpful assistant that can answer questions about the paper.\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.workflow import Context\n",
"\n",
"# Context to persist the agent session\n",
"ctx = Context(agent)"
]
},
{
@@ -208,18 +211,40 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Added user message to memory: What is the summary of the paper?\n",
"=== Calling Function ===\n",
"Calling function: summarize with args: {\"input\":\"summary\"}\n",
"Got output: The research focuses on developing Multimodal Large Language Models (MLLMs) by incorporating image-caption, interleaved image-text, and text-only data for pre-training. It highlights the importance of factors like the image encoder, resolution, and token count, while downplaying the design of the vision-language connector. With models scaling up to 30B parameters, the MM1 family demonstrates impressive performance in pre-training metrics and competitive outcomes on diverse multimodal benchmarks. It demonstrates abilities such as in-context learning and multi-image reasoning, aiming to provide valuable insights for creating MLLMs that benefit the research community.\n",
"========================\n",
"\n"
"Calling tool query_all_docs with args {'input': 'Provide the summary of the paper (concise abstract-like summary).'}\n",
"Tool call query_all_docs({'input': 'Provide the summary of the paper (concise abstract-like summary).'}) returned This paper presents a practical recipe and empirical analysis for building high-performing multimodal large language models (MLLMs). Through systematic ablations of image encoders, visionlanguage connectors, and pre-training data mixtures, the work identifies key design lessons: image resolution and the number of image tokens drive the largest gains, followed by encoder capacity and pre-training data; architectural choices for the visionlanguage connector matter far less. Data-wise, a careful mixture of captioned images, interleaved imagetext documents, and some text-only data is critical — caption data boosts zero-shot captioning, interleaved documents enable strong few-shot and text performance, and text-only data preserves language capabilities. The authors apply these lessons to scale MM1: ViT-H image encoders at high resolution feeding 144 visual tokens into decoder-only LLMs (dense and MoE variants) trained on a 45/45/10 mixture (interleaved/caption/text), for ~200k steps (~400B tokens). MM1 models (dense up to 30B, MoE up to effectively tens of billions of parameters) achieve state-of-the-art few-shot pre-training metrics and competitive supervised fine-tuning results across many established multimodal benchmarks, while exhibiting enhanced in-context learning, multi-image reasoning, and few-shot chain-of-thought capabilities. Practical training details (learning-rate scaling, unfreezing the encoder during SFT, high-resolution support via positional interpolation and sub-image decomposition) and the positive impact of synthetic caption data are reported to guide reproducing and extending these findings.\n",
"\n",
"================\n",
"\n",
"Here is a concise, abstractstyle summary of the paper:\n",
"\n",
"- Goal: provide a practical recipe and empirical analysis for building highperforming multimodal LLMs (MLLMs) and identify which design choices matter most.\n",
"- Key findings: image resolution and number of image tokens yield the largest performance gains, followed by visionencoder capacity and pretraining data; the specific architecture of the visionlanguage connector matters far less.\n",
"- Data mix: a careful pretraining mixture is critical—captioned images boost zeroshot captioning, interleaved imagetext documents enable strong fewshot and text performance, and some textonly data preserves language capabilities. The authors use a 45/45/10 split (interleaved/caption/text).\n",
"- MM1 models: applying these lessons, they scale ViTH encoders at high resolution producing 144 visual tokens into decoderonly LLMs (dense up to 30B, MoE variants effectively larger), trained ~200k steps (~400B tokens).\n",
"- Results: MM1 achieves stateoftheart fewshot pretraining metrics and competitive supervised finetuning across many multimodal benchmarks, with improved incontext learning, multiimage reasoning, and fewshot chainofthought behavior.\n",
"- Practical guidance: reportable tricks include learningrate scaling, unfreezing the encoder during SFT, supporting high resolution via positional interpolation and subimage decomposition, and the positive impact of synthetic caption data.\n",
"\n",
"Overall, the paper offers both empirical insights about what drives MLLM performance and a concrete, reproducible recipe (MM1) that attains strong multimodal capabilities.\n"
]
}
],
"source": [
"# note -- this will take a while with local LLMs, its sending every node in the document to the LLM\n",
"resp = agent.chat(\"What is the summary of the paper?\")"
"from llama_index.core.agent import ToolCall, ToolCallResult\n",
"\n",
"handler = agent.run(\n",
" \"What is the summary of the paper that you have access to?\", ctx=ctx\n",
")\n",
"async for ev in handler.stream_events():\n",
" if isinstance(ev, ToolCall):\n",
" print(f\"Calling tool {ev.tool_name} with args {ev.tool_kwargs}\")\n",
" elif isinstance(ev, ToolCallResult):\n",
" print(f\"Tool call {ev.tool_name}({ev.tool_kwargs}) returned {ev.tool_output}\")\n",
"\n",
"print(\"\\n================\\n\")\n",
"\n",
"resp = await handler\n",
"print(resp)"
]
},
{
@@ -231,57 +256,191 @@
"name": "stdout",
"output_type": "stream",
"text": [
"The summary of the paper highlights the development of Multimodal Large Language Models (MLLMs) by incorporating image-caption, interleaved image-text, and text-only data for pre-training. The research emphasizes factors like the image encoder, resolution, and token count, while de-emphasizing the design of the vision-language connector. The MM1 family of models, scaling up to 30B parameters, shows impressive performance in pre-training metrics and competitive outcomes on various multimodal benchmarks. These models demonstrate capabilities such as in-context learning and multi-image reasoning, aiming to provide valuable insights for creating MLLMs that benefit the research community.\n"
"Calling tool query_all_docs with args {'input': 'Describe in detail how the authors evaluate their work: which benchmarks and tasks they use (pretraining metrics, few-shot evaluation, supervised fine-tuning, multimodal benchmarks, in-context learning, chain-of-thought, multi-image reasoning), the metrics reported, baselines compared, and ablation studies conducted. Include mentions of training steps, model sizes, and any special evaluation setups (e.g., positional interpolation, sub-image decomposition, synthetic caption data).'}\n",
"Tool call query_all_docs({'input': 'Describe in detail how the authors evaluate their work: which benchmarks and tasks they use (pretraining metrics, few-shot evaluation, supervised fine-tuning, multimodal benchmarks, in-context learning, chain-of-thought, multi-image reasoning), the metrics reported, baselines compared, and ablation studies conducted. Include mentions of training steps, model sizes, and any special evaluation setups (e.g., positional interpolation, sub-image decomposition, synthetic caption data).'}) returned Overview\n",
"- Evaluation covers both pre-training (zero-/few-shot) and supervised fine-tuning (SFT) regimes, plus targeted analyses of in-context learning, multi-image reasoning, and chain-of-thought prompting. Evaluations include captioning, VQA, a set of text-only tasks (TextCore), and a wide collection of modern multimodal benchmarks. Results are reported for multiple model scales (dense 3B, 7B, 30B and MoE variants) and compared to several published baselines.\n",
"\n",
"Pre-training evaluation\n",
"- Tasks and benchmarks:\n",
" - Image captioning: COCO (Karpathy test), NoCaps (val), TextCaps (val). Captioning use standard caption prompts and reporting.\n",
" - Visual question answering / text-in-image tasks: VQAv2 (testdev), TextVQA (val), VizWiz (testdev), GQA, OK-VQA (val).\n",
" - A text-only evaluation suite called TextCore (ARC, PIQA, LAMBADA, WinoGrande, HellaSWAG, SciQ, TriviaQA, WebQS) to measure preservation/quality of language capabilities.\n",
"- Prompting and generation:\n",
" - Captioning prompt: \"{IMAGE} A photo of\" (or equivalent). VQA prompt: \"{IMAGE} Question: {QUESTION} Short answer:\".\n",
" - Greedy decoding until EOS or task-specific stop tokens. For captioning the newline is a stop token; for VQA additional stop tokens include \".\", \",\", \"Question\".\n",
" - VQA postprocessing follows the same logic used by OpenFlamingo implementations.\n",
"- Metrics:\n",
" - Captioning: CIDEr (computed via nlg-eval).\n",
" - VQA and related QA tasks: task-appropriate accuracy metrics (reported as percentages).\n",
" - TextCore: aggregated scores reported to indicate text-only capabilities.\n",
" - Pre-training few-shot evaluation reported for 0-shot, 4-shot, and 8-shot settings (4- and 8-shot used as main few-shot points).\n",
"- Splits and sampling:\n",
" - Few-shot prompts are sampled from training when available, otherwise validation, ensuring the query example is not one of the shots.\n",
"- Scale and settings for pre-training evaluation runs:\n",
" - Most pre-training evaluations use smaller ablation setups: base ablation LLM = 1.2B (but some encoder ablations use a 2.9B LLM to ensure capacity).\n",
" - Final pre-trained models evaluated at 3B, 7B, and 30B (dense) and MoE variants (3B backbone with 64 experts; 7B backbone with 32 experts).\n",
"- Baselines for pre-training comparisons:\n",
" - Flamingo (various sizes), Emu2 (14B, 37B), IDEFICS (9B, 80B), and other published pre-trained MLLMs where few-shot pre-training numbers are available.\n",
"\n",
"Supervised fine-tuning (SFT) evaluation\n",
"- SFT data and setup:\n",
" - SFT mixture contains ≈1.45M examples: GPT-4/GPT-4V-generated instruction-response data (e.g., LLaVA-Conv/Complex, ShareGPT-4V), many academic VL datasets (VQAv2, GQA, OKVQA, A-OKVQA, COCO Captions, OCRVQA, TextCaps, DVQA, ChartQA, AI2D, DocVQA, InfoVQA, SynthDog-En), and a small internal text-only SFT set.\n",
" - Fine-tuning: 10k steps, batch size 256, sequence length 2048; optimizer AdaFactor with peak LR 1e-5 and cosine decay to 0. Both image encoder and LLM are unfrozen unless noted in ablations.\n",
"- Benchmarks & aggregated evaluation:\n",
" - A large set of 12+ multimodal benchmarks is used for SFT evaluation, including VQAv2, TextVQA, ScienceQA-IMG, MMMU, MathVista, MME (perception/cognition splits), MMBench, SEED-Bench, POPE, LLaVA-Bench-in-the-Wild, MM-Vet, etc.\n",
" - Results reported per-dataset and combined into a meta-average for comparisons; the meta-average is normalized relative to a compact baseline to make metrics comparable across tasks.\n",
"- Baselines and SFT comparisons:\n",
" - Compared against a range of SOTA and contemporary multimodal models after instruction tuning: LLaVA variants (1.5/NeXT), InstructBLIP, Qwen-VL, Emu2-Chat, CogVLM, Gemini family, GPT4V where available, and others. Both dense and MoE variants are compared when available.\n",
"- High-resolution and multi-image SFT evaluation:\n",
" - Two techniques are used to support high-resolution inputs during SFT:\n",
" - Positional embedding interpolation to adapt ViT positional embeddings to larger resolutions (used to support 448×448, 560×560, 672×672, etc.).\n",
" - Sub-image decomposition (crop-based): for very high resolution (e.g., 1344×1344) the image is split into multiple sub-images (e.g., five 672×672 crops) that are encoded independently and concatenated as a sequence to the LLM.\n",
" - Default SFT evaluation results reported at an effective high resolution (1344×1344) via these strategies. Reported improvement with higher resolution (e.g., relative gains up to ~15% average when supporting 1344×1344 vs 336×336).\n",
"- Chain-of-thought & few-shot in-context evaluation after SFT:\n",
" - MathVista is used to quantify few-shot chain-of-thought capability: example results show 0-shot 39.4, 4-shot 41.9, and an 8-shot mixed-resolution in-context setup achieves 44.4.\n",
" - Mixed-resolution in-context strategy: to fit more examples in context while managing token cost of high-resolution sub-image decomposition, some in-context examples are encoded at lower resolution and only the last N examples use full high-resolution decomposition (N=3 in reported experiments).\n",
"\n",
"Ablation studies and analyses\n",
"- Overall ablation design:\n",
" - A compact base configuration is used for systematic ablations: ViT-L/14 image encoder (CLIP), C-Abstractor connector with 144 image tokens, pre-training mixture 45% captioned images / 45% interleaved image-text / 10% text-only, and a 1.2B decoder-only LLM for many ablations.\n",
" - One component changed at a time; evaluations are zero-/few-shot across the same captioning and VQA benchmarks.\n",
"- Image encoder ablations:\n",
" - Compared contrastive (CLIP variants trained on DFN-5B, VeCap-300M, OpenAI CLIP) against reconstructive losses (AIM models).\n",
" - Resolution ablations: 224 → 336 → 378 px; clear finding that image resolution has the largest impact, followed by encoder capacity and training data composition. Increasing resolution yielded ~3% absolute boost in many metrics.\n",
" - Encoder size: ViT-L → ViT-H shows modest gains (typically <1% absolute).\n",
" - Training data for encoders: inclusion of synthetic caption data (VeCap) yields non-trivial few-shot improvements.\n",
" - Table-based reporting of 0-/4-/8-shot metrics for these variants.\n",
"- Vision-language (VL) connector ablations:\n",
" - Connector types: average pooling (grid pooling + linear), attention pooling (learnable queries), and C-Abstractor (convolutional mapping / ResNet-based projector).\n",
" - Image token counts: experiments with 64 vs 144 image tokens per image.\n",
" - Findings: number of visual tokens and image resolution matter most; the particular connector architecture has comparatively little effect on final performance. Detailed 0/4/8-shot tables compare pooling strategies across token counts and resolutions.\n",
"- Pre-training data mixture ablations:\n",
" - Systematically varied mixes of captioned image pairs vs interleaved image-text documents vs text-only data. Examples tested: 100% caption, mixtures such as 66/33, 50/50, and 0/100, and image/text-only ratios (e.g., 91/9, 86/14, 66/33).\n",
" - Key lessons:\n",
" - Interleaved documents are critical for few-shot and text-only performance; captioning data strongly lifts zero-shot captioning performance.\n",
" - Text-only data helps preserve/boost few-shot and text-only performance; including ~9-14% text-only yields a better balance.\n",
" - A final recommended pre-training mix is 45% interleaved / 45% image-caption / 10% text-only to balance zero- and few-shot capabilities.\n",
" - Impact of synthetic VeCap captions: even though small (~7% of caption pool), VeCap gives measurable few-shot gains (e.g., 2.4% and 4% absolute in reported settings).\n",
"- SFT-specific ablations:\n",
" - Repeating data-mixture and connector ablations in the SFT context: caption-pretraining helps SFT zero-shot metrics; choice of VL connector still has limited effect though finer differences appear at high token counts; freezing vs unfreezing the image encoder matters (frozen better at lower resolution; unfrozen better for high-resolution SFT).\n",
"- Hyperparameter and optimization ablations:\n",
" - Learning-rate grid searches run at small scales (models 9M, 85M, 302M, 1.2B) and 50k-step probes, then a log-linear fit extrapolated to larger model sizes. Grid-search experiments used 50k training steps for each setting.\n",
" - Resulting scaling rule and fitted formula for optimal peak learning rate as a function of LLM parameter count is provided and used to choose LRs for the 3B/7B/30B models (e.g., final LRs used: 6e-5 (3B), 4e-5 (7B), 2e-5 (30B)). Weight decay scaled as λ = 0.1 · η.\n",
"- MoE (mixture-of-experts) experiments:\n",
" - Two MoE designs: 3B-MoE with 64 experts (64B total params, top-2 gating, replace every-2 layers) and 7B-MoE with 32 experts (47B total params, replace every-4 layers).\n",
" - Training used top-2 gating, load-balance loss coefficient 0.01, router z-loss 0.001, and otherwise the same hyperparameters and data mixture as the dense backbones. MoE variants show uniform improvements over dense counterparts on many SFT benchmarks.\n",
"- Additional implementation/evaluation notes:\n",
" - Pre-training: models trained unfrozen for 200k steps (≈400B tokens) with batch size 512 and sequence length 4096, allowing up to 16 images per sequence and 144 tokens per image (≈1M text tokens + 1M image tokens per batch in the final setup). The pre-training mixture is fixed deterministically for reproducibility.\n",
" - Pre-training evaluation prompts, stop tokens, and postprocessing are standardized (greedy decoding), and detailed splits used for each benchmark are specified.\n",
" - SFT evaluation meta-average: benchmarks are normalized to a compact baseline configuration prior to averaging so disparate metrics can be compared.\n",
" - For high-resolution SFT, the positional interpolation approach (to support larger patches) and the sub-image decomposition scheme (to represent very large images as multiple crops) are both used and evaluated; sub-image decomposition increases the number of image tokens dramatically, which motivates mixed-resolution in-context examples for few-shot prompting.\n",
"\n",
"Reporting and comparisons\n",
"- Tabular reporting:\n",
" - Pre-training few-shot results are reported in detailed tables per model scale (3B, 7B, 30B) for 0/4/8/16-shot where applicable, across captioning and VQA datasets.\n",
" - SFT comparisons show per-benchmark numbers and a combined meta-average; both dense and MoE model variants are included.\n",
"- Baselines and contemporaries cited for direct comparison include Flamingo, IDEFICS, Emu2, LLaVA-NeXT, CogVLM, Gemini family, GPT4V, and many instruction-tuned MLLMs. Where appropriate, notes on differences in prompting setups (e.g., some baselines include text-only demonstrations in “0” prompts) are documented.\n",
"- Qualitative analysis:\n",
" - A variety of qualitative examples shown for counting, OCR, multi-image reasoning, style following, instruction following, and chain-of-thought reasoning; these accompany quantitative results to illustrate capabilities such as multi-image reasoning and few-shot chain-of-thought.\n",
"\n",
"Key reported evaluation figures (examples)\n",
"- Pre-training duration: 200k steps (~400B tokens).\n",
"- Pre-training batch & context: batch 512, sequence length 4096, up to 16 images per sequence, 144 tokens per image.\n",
"- SFT: 10k steps; batch 256; seq length 2048; AdaFactor with peak LR 1e-5.\n",
"- MoE variants: 3B backbone + 64 experts (64B total); 7B backbone + 32 experts (47B total); top-2 gating; load-balance and router regularizers used.\n",
"- Example few-shot chain-of-thought: MathVista 0-shot 39.4 → 4-shot 41.9 → 8-shot with mixed-resolution 44.4.\n",
"\n",
"In summary\n",
"- Evaluation is multi-faceted: systematic pre-training zero-/few-shot tests on captioning and VQA, text-only TextCore checks, extensive SFT across a broad benchmark suite, ablations covering image encoder, VL connector, data mixtures, training hyperparameters, and input-resolution strategies, plus experiments with MoE scaling. Metrics include CIDEr for captioning, accuracy for VQA and other benchmarks, TextCore aggregated scores, and a normalized meta-average for SFT. The authors report results across multiple model sizes and variants and compare to a broad set of recent multimodal models.\n",
"\n",
"================\n",
"\n",
"Short answer: the authors evaluate across (1) pre-training zero-/few-shot benchmarks (captioning, VQA, and a text-only suite), (2) supervised instruction finetuning (SFT) on a large multimodal mixture with extensive downstream benchmarks, and (3) targeted analyses (in-context/few-shot learning, chain-of-thought, multi-image reasoning). They report standard task metrics (CIDEr for captioning, accuracy for VQA/QA, aggregated TextCore scores, and a normalized SFT meta-average), compare to many recent MLLMs, and run systematic ablations (encoder, connector, data mixtures, hyperparameters, resolution/tokenization, MoE). Key training/eval settings and special setups are also evaluated (positional interpolation, sub-image decomposition, synthetic caption data). Details:\n",
"\n",
"1) Pretraining evaluation\n",
"- Tasks and datasets:\n",
" - Image captioning: COCO (Karpathy test), NoCaps (val), TextCaps (val).\n",
" - VQA/text-in-image: VQAv2 (testdev), TextVQA, VizWiz, GQA, OK-VQA, etc.\n",
" - TextCore: a text-only suite (ARC, PIQA, LAMBADA, WinoGrande, HellaSWAG, SciQ, TriviaQA, WebQS) to check language preservation.\n",
"- Prompting & decoding:\n",
" - Zero/4/8 (and sometimes 16) shot prompts; few-shot examples sampled from train/val ensuring no leakage.\n",
" - Greedy decoding with task-specific stop tokens; VQA postprocessing matches Flamingo style.\n",
"- Metrics:\n",
" - CIDEr for captioning, accuracy (%) for VQA/QA tasks, aggregated TextCore scores for language capability.\n",
"- Model scales for evaluation:\n",
" - Ablations often use a small base LLM (1.2B, sometimes 2.9B). Final pretrained models evaluated at 3B, 7B, 30B (dense) and MoE variants.\n",
"- Baselines:\n",
" - Compared against Flamingo, Emu2, IDEFICS, and other published pretrained MLLMs when few-shot pretraining numbers are available.\n",
"\n",
"2) Supervised finetuning (SFT) evaluation\n",
"- SFT data:\n",
" - ≈1.45M instruction examples: GPT-4/GPT-4V synthetic instruction data (LLaVA-Conv/Complex, ShareGPT-4V), many academic VL datasets (VQAv2, GQA, OKVQA, COCO Captions, TextCaps, OCRVQA, ChartQA, DocVQA, etc.), and a small internal text SFT set.\n",
"- Finetuning procedure:\n",
" - 10k steps, batch 256, seq length 2048, AdaFactor optimizer, peak LR 1e-5 with cosine decay. Image encoder and LLM unfrozen unless ablated.\n",
"- Downstream benchmarks and reporting:\n",
" - 12+ multimodal benchmarks for SFT evaluation (VQAv2, TextVQA, ScienceQA-IMG, MMMU, MathVista, MME, MMBench, SEED-Bench, POPE, LLaVA-BiW, MM-Vet, etc.). Results reported per dataset and combined into a normalized meta-average for fair aggregation across heterogeneous metrics.\n",
"- Baselines:\n",
" - Compared to instruction-tuned contemporaries: LLaVA/NeXT, InstructBLIP, Qwen-VL, Emu2-Chat, CogVLM, Gemini family, GPT4V where available.\n",
"\n",
"3) Targeted analyses (in-context learning, CoT, multi-image)\n",
"- In-context/few-shot: standard 0/4/8-shot probes across captioning and VQA.\n",
"- Chain-of-thought: MathVista used to quantify few-shot CoT; reported example: 0-shot 39.4 → 4-shot 41.9 → 8-shot mixed-resolution 44.4.\n",
"- Multi-image reasoning: evaluated qualitatively and quantitatively on multi-image benchmarks and examples.\n",
"\n",
"4) Ablation studies (systematic and extensive)\n",
"- Image encoder ablations:\n",
" - Contrastive (CLIP variants) vs reconstructive (AIM); encoder size (ViT-L → ViT-H); encoder training data (including synthetic caption data VeCap).\n",
" - Resolution ablations (e.g., 224 → 336 → 378 px): resolution and number of visual tokens give the largest gains.\n",
"- Vision-language connector ablations:\n",
" - Connector types (avg-pooling, attention pooling, C-Abstractor) and visual token counts (e.g., 64 vs 144). Finding: connector architecture matters far less than token count/resolution.\n",
"- Pretraining data mixture ablations:\n",
" - Varied mixes of caption pairs / interleaved image-text documents / text-only. Key finding: 45% interleaved / 45% caption / 10% text gives the best balance (interleaved documents help few-shot/text performance; captions boost zero-shot captioning; text-only preserves language capabilities).\n",
" - Small synthetic caption pool (VeCap) provides measurable few-shot gains.\n",
"- SFT ablations:\n",
" - Freezing vs unfreezing image encoder in SFT (unfreeze better for high-resolution), data-mix effects in SFT, connector behavior at high token counts.\n",
"- Hyperparameter & optimizer ablations:\n",
" - LR grid searches at small scales (9M → 1.2B) with 50k-step probes and a fitted scaling rule; final LRs chosen (e.g., ~6e-5 for 3B, 4e-5 for 7B, 2e-5 for 30B for pretraining). Weight decay scaled proportionally.\n",
"- MoE experiments:\n",
" - Two MoE setups: 3B backbone + 64 experts (~64B params) and 7B + 32 experts (~47B params), top-2 gating, load-balance/reg losses; MoE variants yield uniform improvements on many SFT benchmarks.\n",
"\n",
"5) Special evaluation/training setups and numbers\n",
"- Pretraining infrastructure & settings:\n",
" - Pretraining: ≈200k steps (~400B tokens), batch 512, seq length 4096, allow up to 16 images per sequence, 144 tokens per image in final setup. Pretraining mixture fixed deterministically.\n",
"- High-resolution support:\n",
" - Positional embedding interpolation to adapt ViT positional embeddings to larger resolutions.\n",
" - Sub-image decomposition (split very large images into multiple crops, encode independently, and concatenate visual tokens) to support extremely high effective resolution (e.g., 1344×1344 as five 672×672 crops).\n",
" - Mixed-resolution in-context strategy to keep context capacity reasonable while enabling high-resolution targets in the last few shots.\n",
"- Decoding/postprocessing:\n",
" - Greedy decoding; task-specific stops; standardized postprocessing to align with prior work.\n",
"- Reporting conventions:\n",
" - 0/4/8-shot pretraining tables, SFT per-dataset numbers and a normalized meta-average, and qualitative examples (counting, OCR, style following, multi-image reasoning, CoT).\n",
"\n",
"6) Qualitative analysis\n",
"- Numerous qualitative examples illustrating multi-image reasoning, counting, OCR, instruction following, and chain-of-thought behaviors accompany the quantitative results.\n",
"\n",
"In short: the evaluation is broad (pretraining few-shot, SFT, targeted capability probes), quantitatively rigorous (CIDEr/accuracy/meta-averages), compares to many contemporary MLLMs, and is supported by wide ablations (encoder, connector, data, optimization, resolution, MoE) and practical high-resolution evaluation techniques (positional interpolation, sub-image decomposition, mixed-resolution in-context).\n"
]
}
],
"source": [
"print(str(resp))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Added user message to memory: How do the authors evaluate their work?\n",
"=== Calling Function ===\n",
"Calling function: search with args: {\"input\":\"evaluation methods\"}\n",
"Got output: The evaluation methods involve synthesizing all benchmark results into a single meta-average number to simplify comparisons. This is achieved by normalizing the evaluation metrics with respect to a baseline configuration, standardizing the results for each task, adjusting every metric by dividing it by its respective baseline, and then averaging across all metrics.\n",
"========================\n",
"\n"
]
}
],
"source": [
"resp = agent.chat(\"How do the authors evaluate their work?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The authors evaluate their work by synthesizing all benchmark results into a single meta-average number to simplify comparisons. They normalize the evaluation metrics with respect to a baseline configuration, standardize the results for each task, adjust every metric by dividing it by its respective baseline, and then average across all metrics for evaluation.\n"
]
}
],
"source": [
"print(str(resp))"
"handler = agent.run(\"How do the authors evaluate their work?\", ctx=ctx)\n",
"async for ev in handler.stream_events():\n",
"    if isinstance(ev, ToolCall):\n",
"        print(f\"Calling tool {ev.tool_name} with args {ev.tool_kwargs}\")\n",
"    elif isinstance(ev, ToolCallResult):\n",
"        print(f\"Tool call {ev.tool_name}({ev.tool_kwargs}) returned {ev.tool_output}\")\n",
"\n",
"\n",
"print(\"\\n================\\n\")\n",
"\n",
"resp = await handler\n",
"print(resp)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llama-parse-aNC435Vv-py3.10",
"display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -0,0 +1,188 @@
"""
⚠️ DEPRECATION NOTICE:
This example uses the deprecated llama-cloud-services package, which will be maintained until May 1, 2026.
Please migrate to: pip install llama-cloud>=1.0 (https://github.com/run-llama/llama-cloud-py)
"""
"""
Example: Batch Processing a Folder of PDFs with LlamaParse
This script demonstrates how to process multiple PDFs from a folder
using LlamaParse with controlled concurrency using asyncio and semaphores.
Usage:
python batch_parse_folder.py --input-dir ./pdfs --max-concurrent 5
"""
import asyncio
import argparse
from pathlib import Path
from typing import List, Dict, Any
from datetime import datetime
from dotenv import load_dotenv
import os
from llama_cloud_services import LlamaParse
# Load environment variables from .env file
load_dotenv()


async def parse_single_file(
    parser: LlamaParse,
    file_path: Path,
    semaphore: asyncio.Semaphore,
) -> Dict[str, Any]:
    """
    Parse a single PDF file with concurrency control.

    Args:
        parser: LlamaParse instance
        file_path: Path to the PDF file
        semaphore: Semaphore to control concurrent requests

    Returns:
        Dictionary with file info and parse result
    """
    async with semaphore:
        try:
            print(f"Starting parse: {file_path.name}")
            result = await parser.aparse(str(file_path))
            # Count pages once, guarding against an empty or missing page list
            page_count = len(result.pages) if result.pages else 0
            print(f"✓ Completed: {file_path.name} ({page_count} pages)")
            return {
                "file": file_path.name,
                "status": "success",
                "result": result,
                "pages": page_count,
            }
        except Exception as e:
            print(f"✗ Error parsing {file_path.name}: {str(e)}")
            return {
                "file": file_path.name,
                "status": "error",
                "error": str(e),
            }


async def parse_folder(
    input_dir: Path,
    max_concurrent: int = 5,
    api_key: str = None,
) -> List[Dict[str, Any]]:
    """
    Parse all PDFs in a folder with controlled concurrency.

    Args:
        input_dir: Directory containing PDF files
        max_concurrent: Maximum number of concurrent parse operations
        api_key: LlamaCloud API key (loaded from .env file)

    Returns:
        List of parse results for each file
    """
    # Find all PDF files
    pdf_files = list(input_dir.glob("*.pdf"))

    if not pdf_files:
        print(f"No PDF files found in {input_dir}")
        return []

    print(f"Found {len(pdf_files)} PDF files to parse")

    # Initialize parser
    parser = LlamaParse(
        api_key=api_key,
        num_workers=1,  # We control concurrency with semaphore
        show_progress=False,  # We'll show our own progress
    )

    # Create semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_concurrent)

    # Create tasks for all files
    tasks = [parse_single_file(parser, pdf_file, semaphore) for pdf_file in pdf_files]

    # Run all tasks concurrently (but limited by semaphore)
    print(
        f"Processing {len(tasks)} files with max {max_concurrent} concurrent operations..."
    )
    start_time = datetime.now()
    results = await asyncio.gather(*tasks)
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()

    # Process results
    successful = [
        r for r in results if isinstance(r, dict) and r.get("status") == "success"
    ]
    failed = [r for r in results if isinstance(r, dict) and r.get("status") == "error"]

    # Print summary
    print("PARSE SUMMARY \n")
    print(f"Total files: {len(pdf_files)}")
    print(f"Successful: {len(successful)}")
    print(f"Failed: {len(failed)}")
    print(f"Total time: {duration:.2f} seconds")
    print(f"Average time per file: {duration / len(pdf_files):.2f} seconds")

    if failed:
        print("\nFailed files:")
        for result in failed:
            print(f" - {result['file']}: {result.get('error', 'Unknown error')}")

    return results


def main():
    """Main entry point for the script."""
    parser = argparse.ArgumentParser(
        description="Batch process PDFs in a folder with LlamaParse"
    )
    parser.add_argument(
        "--input-dir",
        type=str,
        required=True,
        help="Directory containing PDF files to parse",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=5,
        help="Maximum number of concurrent parse operations (default: 5)",
    )

    args = parser.parse_args()
    input_dir = Path(args.input_dir)

    # Validate input directory
    if not input_dir.exists():
        print(f"Error: Input directory does not exist: {input_dir}")
        return
    if not input_dir.is_dir():
        print(f"Error: Input path is not a directory: {input_dir}")
        return

    # Get API key from environment (loaded from .env file)
    api_key = os.getenv("LLAMA_CLOUD_API_KEY")
    if not api_key:
        print("Error: LLAMA_CLOUD_API_KEY not found. Please set it in your .env file")
        return

    # Run async function
    asyncio.run(
        parse_folder(
            input_dir=input_dir,
            max_concurrent=args.max_concurrent,
            api_key=api_key,
        )
    )


if __name__ == "__main__":
    main()
@@ -11,9 +11,18 @@
"\n",
"This example shows off LlamaParse parsing capabilities to build a functioning query pipeline over the Caltrain weekend schedule, a big timetable containing all trains northbound and southbound and their stops in various cities.\n",
"\n",
"Naive parsing solutions mess up in representing this tabular representation, leading to LLM hallucinations. In contrast, LlamaParse text-mode spatially lays out the table in a neat format, enabling more sophisticated LLMs like gpt-4-turbo to understand the spacing and reason over all the numbers.\n",
"\n",
"**NOTE**: LlamaParse markdown mode doesn't quite work yet - it's in development!"
"Status:\n",
"| Last Executed | Version | State |\n",
"|---------------|---------|------------|\n",
"| Aug-19-2025 | 0.6.61 | Maintained |"
]
},
{
"cell_type": "markdown",
"id": "0cb82ca8",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n>\n> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n>\n> **Please migrate to:**\n> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n>\n> The new package provides the same functionality with improved performance and support."
]
},
{
@@ -26,18 +35,6 @@
"Download the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6ae2e38-30c9-4865-aa13-47780bc3848f",
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -55,7 +52,7 @@
"source": [
"## Initialize LlamaParse\n",
"\n",
"Initialize LlamaParse in `text` mode which will represent complex documents incl. text, tables, and figures as nicely formatted text."
"Parse the text results from `LlamaParse`, which will represent complex documents incl. text, tables, and figures as nicely formatted text."
]
},
{
@@ -64,26 +61,29 @@
"id": "54aa9579-84d4-49bc-ab54-5474e69c1188",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jerryliu/Programming/llama_parse/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Started parsing the file under job_id 5f73353a-1f4b-480d-9eea-58d1d22b75f6\n"
"Started parsing the file under job_id d162724f-dcb9-4bfe-9bd4-337244906fb8\n",
".."
]
}
],
"source": [
"from llama_cloud_services import LlamaParse\n",
"\n",
"docs = LlamaParse(result_type=\"text\").load_data(\"./caltrain_schedule_weekend.pdf\")"
"result = await LlamaParse(\n",
"    parse_mode=\"parse_page_with_agent\",\n",
"    model=\"openai-gpt-4-1-mini\",\n",
"    high_res_ocr=True,\n",
"    adaptive_long_table=True,\n",
"    outlined_table_extraction=True,\n",
"    output_tables_as_HTML=True,\n",
"    api_key=\"llx-...\",\n",
").aparse(\"./caltrain_schedule_weekend.pdf\")\n",
"\n",
"documents = result.get_text_documents(split_by_page=True)"
]
},
{
@@ -104,73 +104,44 @@
"name": "stdout",
"output_type": "stream",
"text": [
"ZONE 2ZONE 3ZONE 4ZONE 4 ZONE 3ZONE 2ZONE 1ZONE 1\n",
" Printer-Friendly Caltrain Schedule\n",
" Northbound WEEKEND SERVICE to SAN FRANCISCO 2XX Local\n",
" Printer Friendly WEEKEND Caltrain Schedule\n",
" Morning to Early Afternoon Page 1 of 2\n",
" Northbound WEEKEND SERVICE to SAN FRANCISCO 6XX Local\n",
" Train No. 601 603 605 607 609 611 613 615 617 619 621 623 625 627 629 631\n",
" Tamien 6:51a 7:51a 8:51a 9:51a 10:51a 11:51a 12:51p 1:51p\n",
" San Jose Diridon 6:56a 7:26a 7:56a 8:26a 8:56a 9:26a 9:56a 10:26a 10:56a 11:26a 11:56a 12:26p 12:56p 1:26p 1:56p 2:26p\n",
" Santa Clara 7:03a 7:33a 8:03a 8:33a 9:03a 9:33a 10:03a 10:33a 11:03a 11:33a 12:03p 12:33p 1:03p 1:33p 2:03p 2:33p\n",
"ZONE 4 Lawrence 7:08a 7:38a 8:08a 8:38a 9:08a 9:38a 10:08a 10:38a 11:08a 11:38a 12:08p 12:38p 1:08p 1:38p 2:08p 2:38p\n",
"\n",
" Sunnyvale 7:12a 7:42a 8:12a 8:42a 9:12a 9:42a 10:12a 10:42a 11:12a 11:42a 12:12p 12:42p 1:12p 1:42p 2:12p 2:42p\n",
" Mountain View 7:16a 7:46a 8:16a 8:46a 9:16a 9:46a 10:16a 10:46a 11:16a 11:46a 12:16p 12:46p 1:16p 1:46p 2:16p 2:46p\n",
" San Antonio 7:19a 7:49a 8:19a 8:49a 9:19a 9:49a 10:19a 10:49a 11:19a 11:49a 12:19p 12:49p 1:19p 1:49p 2:19p 2:49p\n",
" California Ave 7:22a 7:52a 8:22a 8:52a 9:22a 9:52a 10:22a 10:52a 11:22a 11:52a 12:22p 12:52p 1:22p 1:52p 2:22p 2:52p\n",
" Palo Alto 7:25a 7:55a 8:25a 8:55a 9:25a 9:55a 10:25a 10:55a 11:25a 11:55a 12:25p 12:55p 1:25p 1:55p 2:25p 2:55p\n",
"ZONE 3 Menlo Park 7:27a 7:57a 8:27a 8:57a 9:27a 9:57a 10:27a 10:57a 11:27a 11:57a 12:27p 12:57p 1:27p 1:57p 2:27p 2:57p\n",
"\n",
" Train No. 221 225 229 233 237 241 245 249 253 257 261 265 269 273 *277 *281\n",
" Service Types L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2\n",
" Tamien 7:12a 9:05a 10:05a 11:05a 1:05p 3:05p 5:05p 7:05p 9:05p 11:05p\n",
" San Jose Diridon 7:19a 9:12a 10:12a 11:12a 12:12p 1:12p 2:12p 3:12p 4:12p 5:12p 6:12p 7:12p 8:12p 9:12p 10:19p 11:12p\n",
" Santa Clara 7:25a 9:18a 10:18a 11:18a 12:18p 1:18p 2:18p 3:18p 4:18p 5:18p 6:18p 7:18p 8:18p 9:18p 10:25p 11:18p\n",
" Lawrence 7:31a 9:24a 10:24a 11:24a 12:24p 1:24p 2:24p 3:24p 4:24p 5:24p 6:24p 7:24p 8:24p 9:24p 10:31p 11:24p\n",
" Sunnyvale 7:35a 9:28a 10:28a 11:28a 12:28p 1:28p 2:28p 3:28p 4:28p 5:28p 6:28p 7:28p 8:28p 9:28p 10:35p 11:28p\n",
" Mountain View 7:40a 9:34a 10:34a 11:34a 12:34p 1:34p 2:34p 3:34p 4:34p 5:34p 6:34p 7:34p 8:34p 9:34p 10:40p 11:34p\n",
" San Antonio 7:43a 9:37a 10:37a 11:37a 12:37p 1:37p 2:37p 3:37p 4:37p 5:37p 6:37p 7:37p 8:37p 9:37p 10:44p 11:37p\n",
" California Ave 7:48a 9:42a 10:42a 11:42a 12:42p 1:42p 2:42p 3:42p 4:42p 5:42p 6:42p 7:42p 8:42p 9:42p 10:48p 11:42p\n",
" Palo Alto 7:52a 9:46a 10:46a 11:46a 12:46p 1:46p 2:46p 3:46p 4:46p 5:46p 6:46p 7:46p 8:46p 9:46p 10:53p 11:46p\n",
" Menlo Park 7:55a 9:50a 10:50a 11:50a 12:50p 1:50p 2:50p 3:50p 4:50p 5:50p 6:50p 7:50p 8:50p 9:50p 10:56p 11:50p\n",
" Redwood City 8:01a 9:56a 10:56a 11:56a 12:56p 1:56p 2:56p 3:56p 4:56p 5:56p 6:56p 7:56p 8:56p 9:56p 11:02p 11:56p\n",
" San Carlos 8:05a 10:01a 11:01a 12:01p 1:01p 2:01p 3:01p 4:01p 5:01p 6:01p 7:01p 8:01p 9:01p 10:01p 11:07p 12:01a\n",
" Belmont 8:09a 10:04a 11:04a 12:04p 1:04p 2:04p 3:04p 4:04p 5:04p 6:04p 7:04p 8:04p 9:04p 10:04p 11:10p 12:04a\n",
" Hillsdale 8:12a 10:08a 11:08a 12:08p 1:08p 2:08p 3:08p 4:08p 5:08p 6:08p 7:08p 8:08p 9:08p 10:08p 11:14p 12:08a\n",
" Hayward Park 8:15a 10:11a 11:11a 12:11p 1:11p 2:11p 3:11p 4:11p 5:11p 6:11p 7:11p 8:11p 9:11p 10:11p 11:17p 12:11a\n",
" San Mateo 8:19a 10:15a 11:15a 12:15p 1:15p 2:15p 3:15p 4:15p 5:15p 6:15p 7:15p 8:15p 9:15p 10:15p 11:21p 12:15a\n",
" Burlingame 8:22a 10:19a 11:19a 12:19p 1:19p 2:19p 3:19p 4:19p 5:19p 6:19p 7:19p 8:19p 9:19p 10:19p 11:25p 12:19a\n",
" Broadway 8:25a 10:22a 11:22a 12:22p 1:22p 2:22p 3:22p 4:22p 5:22p 6:22p 7:22p 8:22p 9:22p 10:22p 11:28p 12:22a\n",
" Millbrae 8:29a 10:26a 11:26a 12:26p 1:26p 2:26p 3:26p 4:26p 5:26p 6:26p 7:26p 8:26p 9:26p 10:26p 11:32p 12:26a\n",
" San Bruno 8:34a 10:30a 11:30a 12:30p 1:30p 2:30p 3:30p 4:30p 5:30p 6:30p 7:30p 8:30p 9:30p 10:30p 11:37p 12:30a\n",
" S. San Francisco 8:38a 10:34a 11:34a 12:34p 1:34p 2:34p 3:34p 4:34p 5:34p 6:34p 7:34p 8:34p 9:34p 10:34p 11:41p 12:34a\n",
" Bayshore 8:44a 10:41a 11:41a 12:41p 1:41p 2:41p 3:41p 4:41p 5:41p 6:41p 7:41p 8:41p 9:41p 10:41p 11:47p 12:41a\n",
" 22 ndStreet 8:50a 10:46a 11:46a 12:46p 1:46p 2:46p 3:46p 4:46p 5:46p 6:46p 7:46p 8:46p 9:46p 10:46p 11:53p 12:46a\n",
" San Francisco 8:56a 10:52a 11:53a 12:53p 1:52p 2:52p 3:52p 4:52p 5:52p 6:52p 7:52p 8:52p 9:52p 10:52p 11:59p 12:52a\n",
" *On SAP Center event days, Train 277 or Train 281departure from San Jose Diridon station may be delayed and will depart no later than 10:30p or 11:30p respectively.\n",
" Redwood City 7:32a 8:02a 8:32a 9:02a 9:32a 10:02a 10:32a 11:02a 11:32a 12:02p 12:32p 1:02p 1:32p 2:02p 2:32p 3:02p\n",
" San Carlos 7:35a 8:05a 8:35a 9:05a 9:35a 10:05a 10:35a 11:05a 11:35a 12:05p 12:35p 1:05p 1:35p 2:05p 2:35p 3:05p\n",
" Belmont 7:38a 8:08a 8:38a 9:08a 9:38a 10:08a 10:38a 11:08a 11:38a 12:08p 12:38p 1:08p 1:38p 2:08p 2:38p 3:08p\n",
" Hillsdale 7:41a 8:11a 8:41a 9:11a 9:41a 10:11a 10:41a 11:11a 11:41a 12:11p 12:41p 1:11p 1:41p 2:11p 2:41p 3:11p\n",
" Hayward Park 7:43a 8:13a 8:43a 9:13a 9:43a 10:13a 10:43a 11:13a 11:43a 12:13p 12:43p 1:13p 1:43p 2:13p 2:43p 3:13p\n",
" San Mateo 7:46a 8:16a 8:46a 9:16a 9:46a 10:16a 10:46a 11:16a 11:46a 12:16p 12:46p 1:16p 1:46p 2:16p 2:46p 3:16p\n",
" Burlingame 7:48a 8:18a 8:48a 9:18a 9:48a 10:18a 10:48a 11:18a 11:48a 12:18p 12:48p 1:18p 1:48p 2:18p 2:48p 3:18p\n",
" Broadway 7:51a 8:21a 8:51a 9:21a 9:51a 10:21a 10:51a 11:21a 11:51a 12:21p 12:51p 1:21p 1:51p 2:21p 2:51p 3:21p\n",
"ZONE 2 Millbrae 7:54a 8:24a 8:54a 9:24a 9:54a 10:24a 10:54a 11:24a 11:54a 12:24p 12:54p 1:24p 1:54p 2:24p 2:54p 3:24p\n",
"\n",
" San Bruno 7:57a 8:27a 8:57a 9:27a 9:57a 10:27a 10:57a 11:27a 11:57a 12:27p 12:57p 1:27p 1:57p 2:27p 2:57p 3:27p\n",
" S. San Francisco 8:00a 8:30a 9:00a 9:30a 10:00a 10:30a 11:00a 11:30a 12:00p 12:30p 1:00p 1:30p 2:00p 2:30p 3:00p 3:30p\n",
" Bayshore 8:05a 8:35a 9:05a 9:35a 10:05a 10:35a 11:05a 11:35a 12:05p 12:35p 1:05p 1:35p 2:05p 2:35p 3:05p 3:35p\n",
" 22ⁿᵈ Street 8:10a 8:40a 9:10a 9:40a 10:10a 10:40a 11:10a 11:40a 12:10p 12:40p 1:10p 1:40p 2:10p 2:40p 3:10p 3:40p\n",
"ZONE 1 San Francisco 8:15a 8:45a 9:15a 9:45a 10:15a 10:45a 11:15a 11:45a 12:15p 12:45p 1:15p 1:45p 2:15p 2:45p 3:15p 3:45p\n",
"\n",
" Southbound WEEKEND SERVICE to SAN JOSE 2XX Local\n",
" Train No. 224 228 232 236 240 244 248 252 256 260 264 268 272 276 280 284\n",
" Service Types L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2\n",
" San Francisco 8:28a 9:58a 10:58a 11:58a 12:58p 1:58p 2:58p 3:58p 4:58p 5:58p 6:58p 7:58p 8:58p 9:58p 10:58p 12:05a\n",
" 22 ndStreet 8:33a 10:03a 11:03a 12:03p 1:03p 2:03p 3:03p 4:03p 5:03p 6:03p 7:03p 8:03p 9:03p 10:03p 11:03p 12:10a\n",
" Bayshore 8:38a 10:08a 11:08a 12:08p 1:08p 2:08p 3:08p 4:08p 5:08p 6:08p 7:08p 8:08p 9:08p 10:08p 11:08p 12:15a\n",
" S. San Francisco 8:45a 10:15a 11:15a 12:15p 1:15p 2:15p 3:15p 4:15p 5:15p 6:15p 7:15p 8:15p 9:15p 10:15p 11:15p 12:22a\n",
" San Bruno 8:49a 10:19a 11:19a 12:19p 1:19p 2:19p 3:19p 4:19p 5:19p 6:19p 7:19p 8:19p 9:19p 10:19p 11:19p 12:26a\n",
" Millbrae 8:53a 10:24a 11:24a 12:24p 1:24p 2:24p 3:24p 4:24p 5:24p 6:24p 7:24p 8:24p 9:24p 10:24p 11:24p 12:31a\n",
" Broadway 8:57a 10:27a 11:27a 12:27p 1:27p 2:27p 3:27p 4:27p 5:27p 6:27p 7:27p 8:27p 9:27p 10:27p 11:27p 12:35a\n",
" Burlingame 9:00a 10:31a 11:31a 12:31p 1:31p 2:31p 3:31p 4:31p 5:31p 6:31p 7:31p 8:31p 9:31p 10:31p 11:31p 12:38a\n",
" San Mateo 9:04a 10:34a 11:34a 12:34p 1:34p 2:34p 3:34p 4:34p 5:34p 6:34p 7:34p 8:34p 9:34p 10:34p 11:34p 12:41a\n",
" Hayward Park 9:07a 10:37a 11:37a 12:37p 1:37p 2:37p 3:37p 4:37p 5:37p 6:37p 7:37p 8:37p 9:37p 10:37p 11:37p 12:45a\n",
" Hillsdale 9:10a 10:41a 11:41a 12:41p 1:41p 2:41p 3:41p 4:41p 5:41p 6:41p 7:41p 8:41p 9:41p 10:41p 11:41p 12:48a\n",
" Belmont 9:14a 10:44a 11:44a 12:44p 1:44p 2:44p 3:44p 4:44p 5:44p 6:44p 7:44p 8:44p 9:44p 10:44p 11:44p 12:52a\n",
" San Carlos 9:17a 10:48a 11:48a 12:48p 1:48p 2:48p 3:48p 4:48p 5:48p 6:48p 7:48p 8:48p 9:48p 10:48p 11:48p 12:55a\n",
" Redwood City 9:21a 10:52a 11:52a 12:52p 1:52p 2:52p 3:52p 4:52p 5:52p 6:52p 7:52p 8:52p 9:52p 10:52p 11:52p 12:59a\n",
" Menlo Park 9:28a 10:58a 11:58a 12:58p 1:58p 2:58p 3:58p 4:58p 5:58p 6:58p 7:58p 8:58p 9:58p 10:58p 11:58p 1:05a\n",
" Palo Alto 9:32a 11:02a 12:02p 1:02p 2:02p 3:02p 4:02p 5:02p 6:02p 7:02p 8:02p 9:02p 10:02p 11:02p 12:02a 1:09a\n",
" California Avenue 9:36a 11:06a 12:06p 1:06p 2:06p 3:06p 4:06p 5:06p 6:06p 7:06p 8:06p 9:06p 10:06p 11:06p 12:06a 1:12a\n",
" San Antonio 9:41a 11:11a 12:11p 1:11p 2:11p 3:11p 4:11p 5:11p 6:11p 7:11p 8:11p 9:11p 10:11p 11:11p 12:10a 1:17a\n",
" Mountain View 9:45a 11:16a 12:16p 1:16p 2:16p 3:16p 4:16p 5:16p 6:16p 7:16p 8:16p 9:16p 10:16p 11:16p 12:15a 1:21a\n",
" Sunnyvale 9:51a 11:21a 12:21p 1:21p 2:21p 3:21p 4:21p 5:21p 6:21p 7:21p 8:21p 9:21p 10:21p 11:21p 12:20a 1:26a\n",
" Lawrence 9:55a 11:26a 12:26p 1:26p 2:26p 3:26p 4:26p 5:26p 6:26p 7:26p 8:26p 9:26p 10:26p 11:26p 12:25a 1:31a\n",
" Santa Clara 10:01a 11:32a 12:32p 1:32p 2:32p 3:32p 4:32p 5:32p 6:32p 7:32p 8:32p 9:32p 10:32p 11:32p 12:31a 1:37a\n",
" San Jose Diridon 10:10a 11:40a 12:40p 1:38p 2:40p 3:38p 4:40p 5:38p 6:40p 7:38p 8:40p 9:38p 10:40p 11:38p 12:39a 1:44a\n",
" Tamien 10:15a 11:45a 12:45p 2:45p 4:45p 6:45p 8:45p 10:45p 12:44a 1:49a\n",
" EFFECTIVE September 12, 2022 Timetable subject to change without notice.\n"
"EFFECTIVE September 21, 2024 Timetable subject to change without notice See Page 2 For Afternoon and Evening Times\n"
]
}
],
"source": [
"print(docs[0].get_content())"
"print(documents[0].text)"
]
},
{
@@ -180,9 +151,7 @@
"source": [
"## Initialize Query Engine\n",
"\n",
"We now initialize a query engine over this data. Here we use a baseline summary index, which doesn't do vector indexing/chunking and instead dumps the entire text into the prompt.\n",
"\n",
"We see that the LLM (gpt-4-turbo) is able to provide all the stops for train no 225 northbound."
"We now initialize a query engine over this data. Here we use a baseline summary index, which doesn't do vector indexing/chunking and instead dumps the entire text into the prompt."
]
},
{
@@ -195,8 +164,8 @@
"from llama_index.core import SummaryIndex\n",
"from llama_index.llms.openai import OpenAI\n",
"\n",
"llm = OpenAI(model=\"gpt-4o\")\n",
"index = SummaryIndex.from_documents(docs)\n",
"llm = OpenAI(model=\"gpt-5-mini\", api_key=\"sk-...\")\n",
"index = SummaryIndex.from_documents(documents)\n",
"query_engine = index.as_query_engine(llm=llm)"
]
},
@@ -208,7 +177,7 @@
"outputs": [],
"source": [
"response = query_engine.query(\n",
" \"What are the stops (and times) for train no 237 northbound?\"\n",
" \"What are the stops (and times) for train no 609 northbound?\"\n",
")"
]
},
@@ -222,31 +191,32 @@
"name": "stdout",
"output_type": "stream",
"text": [
"The stops and times for train no. 237 northbound are as follows:\n",
"Train No. 609 northbound (stops and times):\n",
"\n",
"- San Jose Diridon: 12:12 PM\n",
"- Santa Clara: 12:18 PM\n",
"- Lawrence: 12:24 PM\n",
"- Sunnyvale: 12:28 PM\n",
"- Mountain View: 12:34 PM\n",
"- San Antonio: 12:37 PM\n",
"- California Ave: 12:42 PM\n",
"- Palo Alto: 12:46 PM\n",
"- Menlo Park: 12:50 PM\n",
"- Redwood City: 12:56 PM\n",
"- San Carlos: 1:01 PM\n",
"- Belmont: 1:04 PM\n",
"- Hillsdale: 1:08 PM\n",
"- Hayward Park: 1:11 PM\n",
"- San Mateo: 1:15 PM\n",
"- Burlingame: 1:19 PM\n",
"- Broadway: 1:22 PM\n",
"- Millbrae: 1:26 PM\n",
"- San Bruno: 1:30 PM\n",
"- S. San Francisco: 1:34 PM\n",
"- Bayshore: 1:41 PM\n",
"- 22nd Street: 1:46 PM\n",
"- San Francisco: 1:52 PM\n"
"- Tamien — 8:51a\n",
"- San Jose Diridon — 8:56a\n",
"- Santa Clara — 9:03a\n",
"- Lawrence — 9:08a\n",
"- Sunnyvale — 9:12a\n",
"- Mountain View — 9:16a\n",
"- San Antonio — 9:19a\n",
"- California Ave — 9:22a\n",
"- Palo Alto — 9:25a\n",
"- Menlo Park — 9:27a\n",
"- Redwood City — 9:32a\n",
"- San Carlos — 9:35a\n",
"- Belmont — 9:38a\n",
"- Hillsdale — 9:41a\n",
"- Hayward Park — 9:43a\n",
"- San Mateo — 9:46a\n",
"- Burlingame — 9:48a\n",
"- Broadway — 9:51a\n",
"- Millbrae — 9:54a\n",
"- San Bruno — 9:57a\n",
"- S. San Francisco — 10:00a\n",
"- Bayshore — 10:05a\n",
"- 22nd Street — 10:10a\n",
"- San Francisco — 10:15a\n"
]
}
],
@@ -262,18 +232,10 @@
"outputs": [],
"source": [
"response = query_engine.query(\n",
" \"What are all the trains (and times) that end at Tamien going Southbound?\"\n",
" \"What are all the trains (and times) that end at Redwood City going Southbound?\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "6cf9fce0-5067-48f6-a7ef-62aa9e2edc3d",
"metadata": {},
"source": [
"It gets most of the answers correct (to be fair it misses two trains)."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -284,233 +246,20 @@
"name": "stdout",
"output_type": "stream",
"text": [
"The trains that end at Tamien going Southbound are:\n",
"\n",
"- Train 224 at 10:15a\n",
"- Train 228 at 11:45a\n",
"- Train 240 at 2:45p\n",
"- Train 248 at 4:45p\n",
"- Train 256 at 6:45p\n",
"- Train 264 at 8:45p\n",
"- Train 272 at 10:45p\n",
"- Train 284 at 1:49a\n"
"None. On this weekend schedule no southbound trains terminate at Redwood City — every listed southbound train continues beyond Redwood City to later stations (Menlo Park/Palo Alto and onward).\n"
]
}
],
"source": [
"print(str(response))"
]
},
{
"cell_type": "markdown",
"id": "e51e7feb-b74f-4101-8963-933ac7ec9763",
"metadata": {},
"source": [
"## Try Baseline\n",
"\n",
"In contrast, we try a baseline approach with the default PDF reader (PyPDF) in `SimpleDirectoryReader`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "364e5155-cc75-4302-a754-9444ae28e6b1",
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import SimpleDirectoryReader\n",
"from llama_index.core import SummaryIndex\n",
"from llama_index.llms.openai import OpenAI\n",
"\n",
"llm = OpenAI(model=\"gpt-4o\")\n",
"input_file = \"caltrain_schedule_weekend.pdf\"\n",
"reader = SimpleDirectoryReader(input_files=[input_file])\n",
"base_docs = reader.load_data()\n",
"index = SummaryIndex.from_documents(base_docs)\n",
"base_query_engine = index.as_query_engine(llm=llm)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4011389-2d27-4a1a-bf8d-7309da28ab15",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Southbound WEEKEND SERVICE to SAN JOSE\n",
"Train No. 224 228 232 236 240 244 248 252 256 260 264 268 272 276 280 284\n",
"Service Types L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2\n",
"San Francisco 8:28a 9:58a 10:58a 11:58a 12:58p 1:58p 2:58p 3:58p 4:58p 5:58p 6:58p 7:58p 8:58p 9:58p 10:58p 12:05a\n",
"22nd Street 8:33a 10:03a 11:03a 12:03p 1:03p 2:03p 3:03p 4:03p 5:03p 6:03p 7:03p 8:03p 9:03p 10:03p 11:03p 12:10a\n",
"Bayshore 8:38a 10:08a 11:08a 12:08p 1:08p 2:08p 3:08p 4:08p 5:08p 6:08p 7:08p 8:08p 9:08p 10:08p 11:08p 12:15a\n",
"S. San Francisco 8:45a 10:15a 11:15a 12:15p 1:15p 2:15p 3:15p 4:15p 5:15p 6:15p 7:15p 8:15p 9:15p 10:15p 11:15p 12:22a\n",
"San Bruno 8:49a 10:19a 11:19a 12:19p 1:19p 2:19p 3:19p 4:19p 5:19p 6:19p 7:19p 8:19p 9:19p 10:19p 11:19p 12:26a\n",
"Millbrae 8:53a 10:24a 11:24a 12:24p 1:24p 2:24p 3:24p 4:24p 5:24p 6:24p 7:24p 8:24p 9:24p 10:24p 11:24p 12:31a\n",
"Broadway 8:57a 10:27a 11:27a 12:27p 1:27p 2:27p 3:27p 4:27p 5:27p 6:27p 7:27p 8:27p 9:27p 10:27p 11:27p 12:35a\n",
"Burlingame 9:00a 10:31a 11:31a 12:31p 1:31p 2:31p 3:31p 4:31p 5:31p 6:31p 7:31p 8:31p 9:31p 10:31p 11:31p 12:38a\n",
"San Mateo 9:04a 10:34a 11:34a 12:34p 1:34p 2:34p 3:34p 4:34p 5:34p 6:34p 7:34p 8:34p 9:34p 10:34p 11:34p 12:41a\n",
"Hayward Park 9:07a 10:37a 11:37a 12:37p 1:37p 2:37p 3:37p 4:37p 5:37p 6:37p 7:37p 8:37p 9:37p 10:37p 11:37p 12:45a\n",
"Hillsdale 9:10a 10:41a 11:41a 12:41p 1:41p 2:41p 3:41p 4:41p 5:41p 6:41p 7:41p 8:41p 9:41p 10:41p 11:41p 12:48a\n",
"Belmont 9:14a 10:44a 11:44a 12:44p 1:44p 2:44p 3:44p 4:44p 5:44p 6:44p 7:44p 8:44p 9:44p 10:44p 11:44p 12:52a\n",
"San Carlos 9:17a 10:48a 11:48a 12:48p 1:48p 2:48p 3:48p 4:48p 5:48p 6:48p 7:48p 8:48p 9:48p 10:48p 11:48p 12:55a\n",
"Redwood City 9:21a 10:52a 11:52a 12:52p 1:52p 2:52p 3:52p 4:52p 5:52p 6:52p 7:52p 8:52p 9:52p 10:52p 11:52p 12:59a\n",
"Menlo Park 9:28a 10:58a 11:58a 12:58p 1:58p 2:58p 3:58p 4:58p 5:58p 6:58p 7:58p 8:58p 9:58p 10:58p 11:58p 1:05a\n",
"Palo Alto 9:32a 11:02a 12:02p 1:02p 2:02p 3:02p 4:02p 5:02p 6:02p 7:02p 8:02p 9:02p 10:02p 11:02p 12:02a 1:09a\n",
"California Avenue 9:36a 11:06a 12:06p 1:06p 2:06p 3:06p 4:06p 5:06p 6:06p 7:06p 8:06p 9:06p 10:06p 11:06p 12:06a 1:12a\n",
"San Antonio 9:41a 11:11a 12:11p 1:11p 2:11p 3:11p 4:11p 5:11p 6:11p 7:11p 8:11p 9:11p 10:11p 11:11p 12:10a 1:17a\n",
"Mountain View 9:45a 11:16a 12:16p 1:16p 2:16p 3:16p 4:16p 5:16p 6:16p 7:16p 8:16p 9:16p 10:16p 11:16p 12:15a 1:21a\n",
"Sunnyvale 9:51a 11:21a 12:21p 1:21p 2:21p 3:21p 4:21p 5:21p 6:21p 7:21p 8:21p 9:21p 10:21p 11:21p 12:20a 1:26a\n",
"Lawrence 9:55a 11:26a 12:26p 1:26p 2:26p 3:26p 4:26p 5:26p 6:26p 7:26p 8:26p 9:26p 10:26p 11:26p 12:25a 1:31a\n",
"Santa Clara 10:01a 11:32a 12:32p 1:32p 2:32p 3:32p 4:32p 5:32p 6:32p 7:32p 8:32p 9:32p 10:32p 11:32p 12:31a 1:37a\n",
"San Jose Diridon 10:10a 11:40a 12:40p 1:38p 2:40p 3:38p 4:40p 5:38p 6:40p 7:38p 8:40p 9:38p 10:40p 11:38p 12:39a 1:44a\n",
"Tamien 10:15a 11:45a 12:45p 2:45p 4:45p 6:45p 8:45p 10:45p 12:44a 1:49aPrinter-Friendly Caltrain Schedule\n",
"Northbound WEEKEND SERVICE to SAN FRANCISCO\n",
"Train No. 221 225 229 233 237 241 245 249 253 257 261 265 269 273 *277 *281\n",
"Service Types L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2\n",
"Tamien 7:12a 9:05a 10:05a 11:05a 1:05p 3:05p 5:05p 7:05p 9:05p 11:05p\n",
"San Jose Diridon 7:19a 9:12a 10:12a 11:12a 12:12p 1:12p 2:12p 3:12p 4:12p 5:12p 6:12p 7:12p 8:12p 9:12p 10:19p 11:12p\n",
"Santa Clara 7:25a 9:18a 10:18a 11:18a 12:18p 1:18p 2:18p 3:18p 4:18p 5:18p 6:18p 7:18p 8:18p 9:18p 10:25p 11:18p\n",
"Lawrence 7:31a 9:24a 10:24a 11:24a 12:24p 1:24p 2:24p 3:24p 4:24p 5:24p 6:24p 7:24p 8:24p 9:24p 10:31p 11:24p\n",
"Sunnyvale 7:35a 9:28a 10:28a 11:28a 12:28p 1:28p 2:28p 3:28p 4:28p 5:28p 6:28p 7:28p 8:28p 9:28p 10:35p 11:28p\n",
"Mountain View 7:40a 9:34a 10:34a 11:34a 12:34p 1:34p 2:34p 3:34p 4:34p 5:34p 6:34p 7:34p 8:34p 9:34p 10:40p 11:34p\n",
"San Antonio 7:43a 9:37a 10:37a 11:37a 12:37p 1:37p 2:37p 3:37p 4:37p 5:37p 6:37p 7:37p 8:37p 9:37p 10:44p 11:37p\n",
"California Ave 7:48a 9:42a 10:42a 11:42a 12:42p 1:42p 2:42p 3:42p 4:42p 5:42p 6:42p 7:42p 8:42p 9:42p 10:48p 11:42p\n",
"Palo Alto 7:52a 9:46a 10:46a 11:46a 12:46p 1:46p 2:46p 3:46p 4:46p 5:46p 6:46p 7:46p 8:46p 9:46p 10:53p 11:46p\n",
"Menlo Park 7:55a 9:50a 10:50a 11:50a 12:50p 1:50p 2:50p 3:50p 4:50p 5:50p 6:50p 7:50p 8:50p 9:50p 10:56p 11:50p\n",
"Redwood City 8:01a 9:56a 10:56a 11:56a 12:56p 1:56p 2:56p 3:56p 4:56p 5:56p 6:56p 7:56p 8:56p 9:56p 11:02p 11:56p\n",
"San Carlos 8:05a 10:01a 11:01a 12:01p 1:01p 2:01p 3:01p 4:01p 5:01p 6:01p 7:01p 8:01p 9:01p 10:01p 11:07p 12:01a\n",
"Belmont 8:09a 10:04a 11:04a 12:04p 1:04p 2:04p 3:04p 4:04p 5:04p 6:04p 7:04p 8:04p 9:04p 10:04p 11:10p 12:04a\n",
"Hillsdale 8:12a 10:08a 11:08a 12:08p 1:08p 2:08p 3:08p 4:08p 5:08p 6:08p 7:08p 8:08p 9:08p 10:08p 11:14p 12:08a\n",
"Hayward Park 8:15a 10:11a 11:11a 12:11p 1:11p 2:11p 3:11p 4:11p 5:11p 6:11p 7:11p 8:11p 9:11p 10:11p 11:17p 12:11a\n",
"San Mateo 8:19a 10:15a 11:15a 12:15p 1:15p 2:15p 3:15p 4:15p 5:15p 6:15p 7:15p 8:15p 9:15p 10:15p 11:21p 12:15a\n",
"Burlingame 8:22a 10:19a 11:19a 12:19p 1:19p 2:19p 3:19p 4:19p 5:19p 6:19p 7:19p 8:19p 9:19p 10:19p 11:25p 12:19a\n",
"Broadway 8:25a 10:22a 11:22a 12:22p 1:22p 2:22p 3:22p 4:22p 5:22p 6:22p 7:22p 8:22p 9:22p 10:22p 11:28p 12:22a\n",
"Millbrae 8:29a 10:26a 11:26a 12:26p 1:26p 2:26p 3:26p 4:26p 5:26p 6:26p 7:26p 8:26p 9:26p 10:26p 11:32p 12:26a\n",
"San Bruno 8:34a 10:30a 11:30a 12:30p 1:30p 2:30p 3:30p 4:30p 5:30p 6:30p 7:30p 8:30p 9:30p 10:30p 11:37p 12:30a\n",
"S. San Francisco 8:38a 10:34a 11:34a 12:34p 1:34p 2:34p 3:34p 4:34p 5:34p 6:34p 7:34p 8:34p 9:34p 10:34p 11:41p 12:34a\n",
"Bayshore 8:44a 10:41a 11:41a 12:41p 1:41p 2:41p 3:41p 4:41p 5:41p 6:41p 7:41p 8:41p 9:41p 10:41p 11:47p 12:41a\n",
"22nd Street 8:50a 10:46a 11:46a 12:46p 1:46p 2:46p 3:46p 4:46p 5:46p 6:46p 7:46p 8:46p 9:46p 10:46p 11:53p 12:46a\n",
"San Francisco 8:56a 10:52a 11:53a 12:53p 1:52p 2:52p 3:52p 4:52p 5:52p 6:52p 7:52p 8:52p 9:52p 10:52p 11:59p 12:52aZONE 2 ZONE 3 ZONE 4 ZONE 4 ZONE 3 ZONE 2 ZONE 1 ZONE 12XX Local\n",
"2XX Local\n",
"EFFECTIVE September 12, 2022 Timetable subject to change without notice. *On SAP Center event days, Train 277 or Train 281departure from San Jose Diridon station may be delayed and will depart no later than 10:30p or 11:30p respectively.\n"
]
}
],
"source": [
"print(base_docs[0].get_content())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42203c70-7ca7-4200-bf47-6282eefca3bf",
"metadata": {},
"outputs": [],
"source": [
"base_response = base_query_engine.query(\n",
" \"What are the stops (and times) for train no 237 northbound?\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06aa47b6-0f31-4b2d-90f0-bf6c74befd38",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train No. 237 northbound stops at the following stations and times:\n",
"\n",
"- Tamien: 1:05p\n",
"- San Jose Diridon: 1:12p\n",
"- Santa Clara: 1:18p\n",
"- Lawrence: 1:24p\n",
"- Sunnyvale: 1:28p\n",
"- Mountain View: 1:34p\n",
"- San Antonio: 1:37p\n",
"- California Ave: 1:42p\n",
"- Palo Alto: 1:46p\n",
"- Menlo Park: 1:50p\n",
"- Redwood City: 1:56p\n",
"- San Carlos: 2:01p\n",
"- Belmont: 2:04p\n",
"- Hillsdale: 2:08p\n",
"- Hayward Park: 2:11p\n",
"- San Mateo: 2:15p\n",
"- Burlingame: 2:19p\n",
"- Broadway: 2:22p\n",
"- Millbrae: 2:26p\n",
"- San Bruno: 2:30p\n",
"- S. San Francisco: 2:34p\n",
"- Bayshore: 2:41p\n",
"- 22nd Street: 2:46p\n",
"- San Francisco: 2:52p\n"
]
}
],
"source": [
"print(str(base_response))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f3c1de7-3351-4cd8-991c-34a777952194",
"metadata": {},
"outputs": [],
"source": [
"base_response = base_query_engine.query(\n",
" \"What are all the trains (and times) that end at Tamien going Southbound?\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "513b1007-7508-4fb1-836c-de9353433a67",
"metadata": {},
"source": [
"Note that the trains don't line up with the times!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "108edb92-76af-406b-a139-8b9e7c6528f2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The trains that end at Tamien going Southbound are:\n",
"\n",
"- Train 224 at 10:15a\n",
"- Train 228 at 11:45a\n",
"- Train 240 at 2:45p\n",
"- Train 252 at 4:45p\n",
"- Train 264 at 6:45p\n",
"- Train 276 at 8:45p\n",
"- Train 284 at 10:45p\n",
"- Train 284 at 12:44a\n"
]
}
],
"source": [
"print(str(base_response))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llama_parse",
"display_name": ".venv",
"language": "python",
"name": "llama_parse"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
+207 -55
@@ -6,11 +6,23 @@
"source": [
"# Advanced RAG with LlamaParse\n",
"\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/parse/demo_advanced.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"This notebook is a complete walkthrough for using LlamaParse with advanced indexing/retrieval techniques in LlamaIndex over the Apple 10K Filing. \n",
"\n",
"This allows us to ask sophisticated questions that aren't possible with \"naive\" parsing/indexing techniques with existing models."
"This allows us to ask sophisticated questions that aren't possible with \"naive\" parsing/indexing techniques with existing models.\n",
"\n",
"Status:\n",
"| Last Executed | Version | State |\n",
"|---------------|---------|------------|\n",
"| Aug-18-2025 | 0.6.61 | Maintained |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**>> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.>> **Please migrate to:**> - **Python**: `pip install llama-cloud>=1.0` ([GitHub](https://github.com/run-llama/llama-cloud-py))> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/>> The new package provides the same functionality with improved performance and support."
]
},
{
@@ -19,7 +31,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install llama-index llama-cloud-services"
"%pip install llama-cloud-services \"llama-index>=0.13.2<0.14.0\" \"llama-index-embeddings-huggingface>=0.6.0<0.7.0\" torchvision \"sentence-transformers<5.0\""
]
},
{
@@ -50,7 +62,7 @@
"os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n",
"\n",
"# Using OpenAI API for embeddings/llms\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-proj-...\""
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
]
},
{
@@ -64,7 +76,7 @@
"from llama_index.core import Settings\n",
"\n",
"embed_model = OpenAIEmbedding(model_name=\"text-embedding-3-small\")\n",
"llm = OpenAI(model=\"gpt-4o-mini\")\n",
"llm = OpenAI(model=\"gpt-5-mini\")\n",
"\n",
"Settings.llm = llm\n",
"Settings.embed_model = embed_model"
@@ -91,14 +103,27 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Started parsing the file under job_id e403a457-1721-4093-82bf-4a316d2d637a\n"
"Started parsing the file under job_id f347cb97-dfe2-4677-991a-5ceba6d9fc6a\n"
]
}
],
"source": [
"from llama_cloud_services import LlamaParse\n",
"\n",
"result = await LlamaParse(take_screenshot=True).aparse(\"./apple_2021_10k.pdf\")\n",
"result = await LlamaParse(\n",
" # The parsing mode\n",
" parse_mode=\"parse_page_with_agent\",\n",
" # The model to use\n",
" model=\"openai-gpt-4-1-mini\",\n",
" # Whether to use high resolution OCR (Slower)\n",
" high_res_ocr=True,\n",
" # Adaptive long table. LlamaParse will try to detect long tables across pages\n",
" adaptive_long_table=True,\n",
" outlined_table_extraction=True,\n",
" output_tables_as_HTML=True,\n",
" # Whether to take a screenshot of the page, needed for screenshot-retrieval\n",
" take_screenshot=True,\n",
").aparse(\"./apple_2021_10k.pdf\")\n",
"\n",
"markdown_nodes = await result.aget_markdown_nodes(split_by_page=True)\n",
"screenshot_image_nodes = await result.aget_image_nodes(\n",
@@ -134,7 +159,16 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 20:53:51,246 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 20:53:52,143 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
]
}
],
"source": [
"from llama_index.core import VectorStoreIndex\n",
"\n",
@@ -158,7 +192,15 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 20:53:53,070 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
]
}
],
"source": [
"from llama_index.core import VectorStoreIndex\n",
"\n",
@@ -170,7 +212,22 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/loganmarkewich/llama_parse/py/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"2025-08-18 20:53:55,230 - INFO - Load pretrained SentenceTransformer: llamaindex/vdr-2b-multi-v1\n",
"Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n",
"2025-08-18 20:54:05,369 - INFO - 2 prompts are loaded, with the keys: ['query', 'text']\n",
"Generating embeddings: 0%| | 0/82 [00:00<?, ?it/s]2025-08-18 20:54:06,599 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"Generating embeddings: 100%|██████████| 82/82 [00:01<00:00, 61.24it/s]\n",
"Generating image embeddings: 100%|██████████| 82/82 [26:06<00:00, 19.11s/it]\n"
]
}
],
"source": [
"from llama_index.core.indices import MultiModalVectorStoreIndex\n",
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
@@ -182,7 +239,7 @@
" model_name=\"llamaindex/vdr-2b-multi-v1\",\n",
" embed_batch_size=2,\n",
" trust_remote_code=True,\n",
" cache_folder=\"./hf_cache_2\",\n",
" cache_folder=\"./hf_cache\",\n",
" device=\"cpu\", # set to \"cuda\" if you have a GPU or remove to auto-detect\n",
")\n",
"\n",
@@ -337,19 +394,58 @@
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:20:29,006 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:20:38,721 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********Baseline Query Engine***********\n",
"The total fair value of marketable securities in 2020 was $190,516 million.\n",
"The total fair value of marketable securities in 2020 was $153,814 million (approximately $153.8 billion).\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:20:39,233 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:20:48,185 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********Markdown Query Engine***********\n",
"The total fair value of marketable securities in 2020 was $191,830 million.\n",
"The total fair value was $191,830 million (approximately $191.83 billion).\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:20:48,515 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:21:09,275 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********MultiModal Query Engine***********\n",
"The total fair value of marketable securities in 2020 was $191,830 million.\n"
"The table shows:\n",
"\n",
"- Total fair value (cash, cash equivalents and marketable securities) in 2020: $191,830 million (≈ $191.83 billion). \n",
"- Total marketable securities (current + noncurrent) in 2020: $52,927 + $100,887 = $153,814 million (≈ $153.81 billion).\n"
]
}
],
@@ -391,7 +487,7 @@
{
"data": {
"text/plain": [
"'images/page_41.jpg'"
"'images/page_42.jpg'"
]
},
"execution_count": null,
@@ -415,32 +511,64 @@
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:35:33,281 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:35:40,959 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********Baseline Query Engine***********\n",
"The effective interest rates for the debt issuances in 2021 were as follows:\n",
"\n",
"- Floating-rate notes: 0.48% 0.63%\n",
"- Fixed-rate notes: 0.03% 4.78% for maturities from 2022 to 2060\n",
"- Fixed-rate notes issued in the second quarter: 0.75% 2.81% for maturities from 2026 to 2061\n",
"- Fixed-rate notes issued in the fourth quarter: 1.43% 2.86% for maturities from 2028 to 2061\n",
"- Second quarter 2021 fixed-rate notes (20262061): effective interest rates 0.75%2.81%\n",
"- Fourth quarter 2021 fixed-rate notes (20282061): effective interest rates 1.43%2.86%\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:35:41,285 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:35:49,132 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********Markdown Query Engine***********\n",
"The effective interest rates for the debt issuances in 2021 were as follows:\n",
"\n",
"- Floating-rate notes: 0.48% 0.63%\n",
"- Fixed-rate notes: 0.03% 4.78% for the 0.000% 4.650% notes, 0.75% 2.81% for the 0.700% 2.800% notes, and 1.43% 2.86% for the 1.400% 2.850% notes.\n",
"- Floating-rate notes (2022): 0.48% 0.63%\n",
"- Fixed-rate 0.000% 4.650% notes (2022 2060): 0.03% 4.78%\n",
"- Second-quarter 2021 fixed-rate notes (0.700% 2.800%, 2026 2061): 0.75% 2.81%\n",
"- Fourth-quarter 2021 fixed-rate notes (1.400% 2.850%, 2028 2061): 1.43% 2.86%\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:35:49,411 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:36:06,767 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********MultiModal Query Engine***********\n",
"The effective interest rates of all debt issuances in 2021 were as follows:\n",
"The effective interest rate ranges reported for the 2021 debt issuances were:\n",
"\n",
"1. **Floating-rate notes**: 0.48% 0.63%\n",
"2. **Fixed-rate 0.000% 4.650% notes**: 0.03% 4.78%\n",
"3. **Fixed-rate 0.700% 2.800% notes**: 0.75% 2.81%\n",
"4. **Fixed-rate 1.400% 2.850% notes**: 1.43% 2.86%\n"
"- Floatingrate notes (2022): 0.48% 0.63% \n",
"- Fixedrate 0.000% 4.650% notes (20222060): 0.03% 4.78% \n",
"- Q2 2021 fixedrate notes (0.700% 2.800%, maturities 20262061): 0.75% 2.81% \n",
"- Q4 2021 fixedrate notes (1.400% 2.850%, maturities 20282061): 1.43% 2.86%\n"
]
}
],
@@ -539,42 +667,66 @@
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:36:07,790 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:36:14,197 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********Baseline Query Engine***********\n",
"The current state taxes for the years 2019 to 2021 are as follows (in millions):\n",
"\n",
"- 2021: $1,620\n",
"- 2020: $455\n",
"- 2019: $475\n",
"\n",
"This indicates an increase of $1,165 million from 2020 to 2021, a decrease of $20 million from 2018 to 2019, and an increase of $80 million from 2019 to 2020.\n",
"State current tax (in millions):\n",
"- 2019: +$475 million\n",
"- 2020: +$455 million\n",
"- 2021: +$1,620 million\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:36:14,584 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:36:22,084 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********Markdown Query Engine***********\n",
"The current state taxes for the years 2019 to 2021 are as follows (in millions):\n",
"\n",
"- **2021**: $1,620\n",
"- **2020**: $455\n",
"- **2019**: $475\n",
"\n",
"The changes in current state taxes from year to year are:\n",
"\n",
"- From 2019 to 2020: Decrease of $20 million\n",
"- From 2020 to 2021: Increase of $1,165 million\n",
"2019 — Current state taxes: $475 million (change vs prior year: n/a) \n",
"2020 — Current state taxes: $455 million (change vs 2019: -$20 million) \n",
"2021 — Current state taxes: $1,620 million (change vs 2020: +$1,165 million)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-18 21:36:22,441 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
"2025-08-18 21:36:33,498 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"***********MultiModal Query Engine***********\n",
"The current state taxes for the years 2019 to 2021 are as follows (in millions):\n",
"The current state tax amounts (in millions) per the Note 5 table are:\n",
"\n",
"- **2021**: $1,620\n",
"- **2020**: $455\n",
"- **2019**: $475\n",
"- 2019: $475\n",
"- 2020: $455 (-$20 vs 2019; -4.2%)\n",
"- 2021: $1,620 (+$1,165 vs 2020; +256.0%)\n",
"\n",
"So, the changes are:\n",
"- From 2019 to 2020: Decrease of $20 million\n",
"- From 2020 to 2021: Increase of $1,165 million\n"
"All amounts are in millions of dollars.\n"
]
}
],
@@ -597,7 +749,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "llama-parse-aNC435Vv-py3.10",
"display_name": ".venv",
"language": "python",
"name": "python3"
},
+104 -36
@@ -6,32 +6,26 @@
"source": [
"# Using the Raw API\n",
"\n",
"This notebook walks through how to use the raw API and how"
"This notebook walks through how to use the raw API to parse documents.\n",
"\n",
"Status:\n",
"| Last Executed | Version | State |\n",
"|---------------|---------|------------|\n",
"| Aug-18-2025 | N/A | Maintained |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ DEPRECATION NOTICE**\n",
">\n",
"> This example uses the deprecated `llama-cloud-services` package, which will be maintained until **May 1, 2026**.\n",
">\n",
"> **Please migrate to:**\n",
"> - **Python**: `pip install \"llama-cloud>=1.0\"` ([GitHub](https://github.com/run-llama/llama-cloud-py))\n",
"> - **New Package Documentation**: https://docs.cloud.llamaindex.ai/\n",
">\n",
"> The new package provides the same functionality with improved performance and support."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-02-02 11:11:39-- https://arxiv.org/pdf/1706.03762.pdf\n",
"Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.3.42, 151.101.67.42, ...\n",
"Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2215244 (2.1M) [application/pdf]\n",
"Saving to: ./attention.pdf\n",
"\n",
"./attention.pdf 100%[===================>] 2.11M --.-KB/s in 0.08s \n",
"\n",
"2024-02-02 11:11:39 (27.3 MB/s) - ./attention.pdf saved [2215244/2215244]\n",
"\n"
]
}
],
"outputs": [],
"source": [
"!wget \"https://arxiv.org/pdf/1706.03762.pdf\" -O \"./attention.pdf\""
]
@@ -62,15 +56,23 @@
"with open(file_path, \"rb\") as f:\n",
" mime_type = mimetypes.guess_type(file_path)[0]\n",
" files = {\"file\": (f.name, f, mime_type)}\n",
" body = {\n",
" \"parse_mode\": \"parse_page_with_agent\",\n",
" \"model\": \"openai-gpt-4-1-mini\",\n",
" \"high_res_ocr\": True,\n",
" \"adaptive_long_table\": True,\n",
" \"outlined_table_extraction\": True,\n",
" \"output_tables_as_HTML\": True,\n",
" }\n",
"\n",
" # send the request, upload the file\n",
" url = f\"{base_url}/upload\"\n",
" response = requests.post(url, headers=headers, files=files)\n",
" response = requests.post(url, headers=headers, files=files, data=body)\n",
"\n",
"response.raise_for_status()\n",
"# get the job id for the result_url\n",
"job_id = response.json()[\"id\"]\n",
"result_type = \"text\" # or \"markdown\"\n",
"result_type = \"json\" # or \"text\" or \"markdown\"\n",
"result_url = f\"{base_url}/job/{job_id}/result/{result_type}\"\n",
"\n",
"# check for the result until it's ready\n",
@@ -82,8 +84,7 @@
" time.sleep(2)\n",
"\n",
"# download the result\n",
"result = response.json()\n",
"output = result[result_type]"
"result = response.json()"
]
},
{
@@ -95,27 +96,94 @@
"name": "stdout",
"output_type": "stream",
"text": [
" Provided proper attribution is provided, Google hereby grants permission to\n",
" reproduce the tables and figures in this paper solely for use in journalistic or\n",
" scholarly works.\n",
" Attention Is All You Need\n",
"arXiv:1706.03762v7 [cs.CL] 2 Aug 2023\n",
" Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit\n",
" Google Brain Google Brain Google Research Google Research\n",
" avaswani@google.com noam@google.com nikip@google.com usz@google.com\n",
" Llion Jones Aidan N. Gomez † Łukasz Kaiser\n",
" Google Research University of Toronto \n"
"dict_keys(['pages', 'job_metadata'])\n"
]
}
],
"source": [
"print(output[:1000])"
"print(result.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_keys(['page', 'text', 'md', 'images', 'charts', 'items', 'status', 'originalOrientationAngle', 'links', 'width', 'height', 'triggeredAutoMode', 'parsingMode', 'structuredData', 'noStructuredContent', 'noTextContent', 'pageHeaderMarkdown', 'pageFooterMarkdown', 'confidence'])\n"
]
}
],
"source": [
"print(result[\"pages\"][0].keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n",
"\n",
"# Attention Is All You Need\n",
"\n",
"**Ashish Vaswani*** \n",
"Google Brain \n",
"avaswani@google.com \n",
"\n",
"**Noam Shazeer*** \n",
"Google Brain \n",
"noam@google.com \n",
"\n",
"**Niki Parmar*** \n",
"Google Research \n",
"nikip@google.com \n",
"\n",
"**Jakob Uszkoreit*** \n",
"Google Research \n",
"usz@google.com \n",
"\n",
"**Llion Jones*** \n",
"Google Research \n",
"llion@google.com \n",
"\n",
"**Aidan N. Gomez* †** \n",
"University of Toronto \n",
"aidan@cs.toronto.edu \n",
"\n",
"**Łukasz Kaiser*** \n",
"Google Brain \n",
"lukaszkaiser@google.com \n",
"\n",
"**Illia Polosukhin* ‡** \n",
"illia.polosukhin@gmail.com \n",
"\n",
"## Abstract\n",
"\n",
"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.\n",
"\n",
"----\n",
"\n",
"*Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Il\n"
]
}
],
"source": [
"print(result[\"pages\"][0][\"md\"][:2000])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llama-parse-aNC435Vv-py3.11",
"display_name": ".venv",
"language": "python",
"name": "python3"
},

Some files were not shown because too many files have changed in this diff
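
The raw-API notebook changed in this diff builds a `result_url`, sleeps two seconds between checks until the result is ready, and then reads `result["pages"][0]["md"]`. The following is a minimal sketch of those two steps, not the notebook's own code. `poll_until_ready` and `join_page_markdown` are hypothetical helper names, the 300-second timeout is an assumption, and the `pages`, `page`, and `md` keys are taken from the `dict_keys` output shown in the notebook. The `fetch` callable stands in for whatever HTTP call checks the job, so the sketch runs without network access.

```python
import time
from typing import Any, Callable, Dict, Optional


def poll_until_ready(
    fetch: Callable[[], Optional[Dict[str, Any]]],
    interval: float = 2.0,
    timeout: float = 300.0,  # assumed limit; the notebook does not set one
) -> Dict[str, Any]:
    """Call `fetch` until it returns a result dict or `timeout` seconds pass.

    `fetch` is a hypothetical callable that returns None while the job is
    still running and the parsed JSON result once it has finished.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("parse job did not finish in time")
        time.sleep(interval)


def join_page_markdown(result: Dict[str, Any]) -> str:
    """Concatenate the `md` field of every page, ordered by page number."""
    pages = sorted(result.get("pages", []), key=lambda p: p.get("page", 0))
    return "\n\n".join(p.get("md", "") for p in pages)


if __name__ == "__main__":
    # Simulated job: not ready twice, then a two-page result.
    responses = iter(
        [None, None, {"pages": [{"page": 1, "md": "# Title"}, {"page": 2, "md": "Body"}]}]
    )
    result = poll_until_ready(lambda: next(responses), interval=0.0)
    print(join_page_markdown(result))
```

With `requests`, `fetch` could wrap a GET on `result_url` and return `response.json()` on a 200 status and `None` otherwise; that mapping is an assumption about the endpoint's behavior, not something this diff shows.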