[PR #621] [CLOSED] Update docugami reader to optionally return XML markup in chunks #684

opened 2026-02-15 18:15:55 -05:00 by yindo · 0 comments


## 📋 Pull Request Information

**Original PR:** https://github.com/run-llama/llama-hub/pull/621
**Author:** [@tjaffri](https://github.com/tjaffri)
**Created:** 11/3/2023
**Status:** ❌ Closed

**Base:** `main` ← **Head:** `tjaffri/dgloader_xml`

---

### 📝 Commits (9)

- [`fd88a01`](https://github.com/run-llama/llama-hub/commit/fd88a015a7612c6ac95e7244c5dd24fe4a218b7f) Support xml markup in chunks
- [`172a885`](https://github.com/run-llama/llama-hub/commit/172a885dd20764f4059b71794895c9b5acacbfe5) notebook updates, and metadata length
- [`6f6355f`](https://github.com/run-llama/llama-hub/commit/6f6355f3abf1d161493920abf3bb3f67480fde5a) lint
- [`c3767aa`](https://github.com/run-llama/llama-hub/commit/c3767aa50d4894d4f77f4cbbd110cc21d75c1667) revert unintentional change
- [`a4f7569`](https://github.com/run-llama/llama-hub/commit/a4f7569c3662d85a50ab3b6a2d7a059e942240e0) Fix typo
- [`6498188`](https://github.com/run-llama/llama-hub/commit/6498188ed2740ff224f03d844f980fb9f006b6ba) Merge branch 'tjaffri/dgloader_xml' of https://github.com/docugami/llama-hub into tjaffri/dgloader_xml
- [`d73f0ea`](https://github.com/run-llama/llama-hub/commit/d73f0ea7f00a5cf18b554cbcd599cbd96b8f1f59) Make sub-chunking tables optional
- [`2f2f6fb`](https://github.com/run-llama/llama-hub/commit/2f2f6fb00e07c7a7dbf108b3256262814dee17d7) Update default subchunking for tables
- [`59c6dd3`](https://github.com/run-llama/llama-hub/commit/59c6dd38be2cd3ec53bd51ca8c30e5cec7815f68) Switch to using dgml-utils package

### 📊 Changes

**3 files changed** (+279 additions, -1115 deletions)

<details>
<summary>View changed files</summary>

➕ `llama_hub/docugami/.gitignore` (+1 -0)
📝 `llama_hub/docugami/base.py` (+71 -108)
📝 `llama_hub/docugami/docugami.ipynb` (+207 -1007)

</details>

### 📄 Description

# Description

Optionally return XML markup in chunks from the docugami reader. This can be used to enhance RAG, using semantic cues on chunks.

## Type of Change

Please delete options that are not relevant.

- [ ] New Loader/Tool
- [x] Bug fix / Smaller change
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] This change requires a documentation update

# How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

- [ ] Added new unit/integration tests
- [x] Added new notebook (that tests end-to-end)
- [ ] I stared at the code and made sure it makes sense

# Suggested Checklist:

- [ ] I have added a library.json file if a new loader/tool was added
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] I ran `make format; make lint` to appease the lint gods

---

<sub>🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.</sub>
yindo added the pull-request label 2026-02-15 18:15:55 -05:00
yindo closed this issue 2026-02-15 18:15:55 -05:00
Reference: run-llama/llama-hub#684