Merge pull request #6 from run-llama/clelia/testing-against-rag

experiment: testing against rag
2026-07-01 21:45:00 -04:00 · 2026-01-07 18:03:45 +01:00
parent 3c9f0a4c99 41af58cf97
commit 37ce90f731
37 changed files with 29708 additions and 910 deletions
@@ -10,4 +10,9 @@ wheels/
 .venv

 # caches
-*_cache/
+*_cache/
+tmp/
+qdrant_storage/
+
+# logs
+fs-explorer.log
@@ -9,6 +9,7 @@ test:
 lint:
 	$(info ****************** linting ******************)
 	uv run pre-commit run -a
+	uv run ruff check --fix

 format:
 	$(info ****************** formatting ******************)
@@ -29,3 +29,9 @@ Run:
 explore --task "Within the data/ directory, can you help me find the PDF files that contain an order or a complaint and, once you have found them, ask me which one I would like you to summarize"
 ```

+## Evaluation against traditional RAG
+
+The [`packages`](./packages/) directory contains two packages:
+
+- [**rag-starterkit**](./packages/rag-starterkit/): a traditional RAG application implementing hybrid search (more in the [dedicated README](./packages/rag-starterkit/README.md)).
+- [**eval-framework**](./packages/eval-framework/): a simple evaluation framework that produces LLM-as-a-judge evaluations and collects statistics on execution time, tool usage, and file search. See the [dedicated README](./packages/eval-framework/README.md) for how to run the evaluation and collect statistics.
@@ -0,0 +1,27 @@
+[
+    {
+        "question": "What is the key takeway (and/or term revisiting) for memory and learning from a nested learning perspective?", 
+        "answer": "Key takeaway (Nested Learning perspective): From a nested learning (NL) viewpoint, memory and learning are not separate modules. Instead, memory is any neural parameter update caused by input, and learning is the process of acquiring, encoding, and organizing these memories effectively. All updates—across different levels of the learning hierarchy and at different frequencies—constitute memory, stored on multiple time scales. This distributed, continuum view aligns with neuropsychology and reframes gradient-based optimization itself as a form of (associative) memory, enabling more robust behavior and mitigating catastrophic forgetting.", 
+        "file": "nested_learning.pdf"
+    },
+    {
+        "question": "What is the architecture of the DeepEncoder in DeepSeek OCR?",
+        "answer": "DeepEncoder uses a two-stage architecture: first, a window-attention-based visual perception module (SAM-base) to extract local image features, and second, a global-attention-based visual knowledge module (CLIP-large) to capture global semantics. Between them, a 2-layer convolutional downsampling module compresses vision tokens (from 4096 to 256 for a 1024x1024 image), keeping computation and memory manageable while preserving important information.",
+        "file": "deepseek_ocr.pdf"
+    },
+    {
+        "question": "What related work in the field of Byte-Level LMs do the author of the BOLMO paper individuate?",
+        "answer": "The authors point to byte-level LMs that replace subwords with UTF-8 bytes, especially Latent Tokenizer Language Models (LTLMs) such as Hourglass-style models that use pooling, depooling, and dynamic, learned tokenization, which can match subword models at similar compute.",
+        "file": "bolmo.pdf"
+    },
+    {
+        "question": "What datasets was MUVERA evaluated on?",
+        "answer": "MUVERA's evaluation includes results from six of the well-studied BEIR information retrieval datasets: MS MARCO, HotpotQA, NQ, Quora, SciDocs, and ArguAna. These datasets were selected for varying corpus size (8K-8.8M) and average number of document tokens (18-165).",
+        "file": "muvera.pdf"
+    }, 
+    {
+        "question": "What limitations and future work perspectives are outlined for Recursive Language Models?",
+        "answer": "Limitations and future work for RLMs include exploring asynchronous or sandboxed sub-calls to reduce inference cost, deeper recursion beyond a single sub-call, and training models specifically as RLMs to improve reasoning and context efficiency.",
+        "file": "recursive_language_models.pdf"
+    }
+]
@@ -0,0 +1,60 @@
+# Evaluation Results Report
+
+## Summary
+
+Total tasks evaluated: **5**
+
+---
+
+## Time Performance
+
+Average execution time across all tasks:
+
+| Framework | Average Time (seconds) | Status |
+|-----------|------------------------|--------|
+| Agentic File Search with FileSystem Tools (fs-explorer) | 11.17s |  |
+| **Traditional RAG** | 7.36s | **Faster** |
+
+**Winner:** Traditional RAG (3.81s faster)
+
+---
+
+## LLM Evaluation Metrics
+
+### Correctness Scores
+
+| Framework | Average Score | Status |
+|-----------|---------------|--------|
+| **Agentic File Search with FileSystem Tools (fs-explorer)** | 8.40 | **Higher** |
+| Traditional RAG | 6.40 |  |
+
+**Winner:** Agentic File Search with FileSystem Tools (fs-explorer) (+2.00 points)
+
+### Relevance Scores
+
+| Framework | Average Score | Status |
+|-----------|---------------|--------|
+| **Agentic File Search with FileSystem Tools (fs-explorer)** | 9.60 | **Higher** |
+| Traditional RAG | 8.00 |  |
+
+**Winner:** Agentic File Search with FileSystem Tools (fs-explorer) (+1.60 points)
+
+---
+
+## Overall Comparison
+
+| Metric | FS-Explorer | RAG | Winner |
+|--------|-------------|-----|--------|
+| **Speed** | 11.17s | 7.36s | Traditional RAG |
+| **Correctness** | 8.40 | 6.40 | Agentic File Search with FileSystem Tools (fs-explorer) |
+| **Relevance** | 9.60 | 8.00 | Agentic File Search with FileSystem Tools (fs-explorer) |
+
+---
+
+## Key Takeaways
+
+- **Fastest Framework:** Traditional RAG is 3.81s faster on average
+- **Most Correct:** Agentic File Search with FileSystem Tools (fs-explorer) produces more correct answers
+- **Most Relevant:** Agentic File Search with FileSystem Tools (fs-explorer) produces more relevant answers
+
+**Overall Winner:** FS-Explorer (2/3 metrics)
@@ -0,0 +1,207 @@
+[
+  {
+    "task": {
+      "question": "What is the key takeway (and/or term revisiting) for memory and learning from a nested learning perspective?",
+      "answer": "Key takeaway (Nested Learning perspective): From a nested learning (NL) viewpoint, memory and learning are not separate modules. Instead, memory is any neural parameter update caused by input, and learning is the process of acquiring, encoding, and organizing these memories effectively. All updates\u2014across different levels of the learning hierarchy and at different frequencies\u2014constitute memory, stored on multiple time scales. This distributed, continuum view aligns with neuropsychology and reframes gradient-based optimization itself as a form of (associative) memory, enabling more robust behavior and mitigating catastrophic forgetting.",
+      "file": "nested_learning.pdf"
+    },
+    "tool_calls": [
+      "check_api_key",
+      "parse_file"
+    ],
+    "llm_evaluations": {
+      "fs_explorer": {
+        "relevance": 10,
+        "correctness": 9,
+        "reason": "The proposed answer closely matches the ground truth: it captures the nested learning view that memory and learning aren\u2019t separate modules, defines memory as any parameter update caused by input, and treats optimization/gradient updates as memory. It also frames learning as organizing/acquiring useful memories. It misses or only implies some nuances from the ground truth (e.g., explicit emphasis on multiple time scales/hierarchical frequencies, link to neuropsychology, catastrophic forgetting mitigation), but the core takeaway is correct and highly relevant."
+      },
+      "rag": {
+        "relevance": 10,
+        "correctness": 9,
+        "reason": "The proposed answer closely matches the ground truth: it states that memory is any input-induced parameter update, learning is acquiring useful memories, rejects a separate memory module, and frames optimizer/gradient updates as memory in a distributed way consistent with neuropsychology. It slightly under-emphasizes the explicit 'multiple time scales / different frequencies across hierarchy' and the point about mitigating catastrophic forgetting, but these are implied rather than stated."
+      }
+    },
+    "used_files": {
+      "fs_explorer": [
+        "./nested_learning.pdf"
+      ],
+      "rag": null
+    },
+    "has_error": {
+      "fs_explorer": false,
+      "rag": false
+    },
+    "time_taken": {
+      "fs_explorer": 10.877681016921997,
+      "rag": 6.647925853729248
+    },
+    "answers": {
+      "fs_explorer": "From a nested learning perspective, the key takeaway for memory and learning is that memory is not an isolated system but is distributed throughout the parameters, defined as any neural update caused by an input. Learning is the process of acquiring effective and useful memory, which involves effectively storing, encoding, and acquiring these updates. Consequently, any update performed by gradient descent or other optimization algorithms at any level of a neural learning module is considered a form of memory.",
+      "rag": "From the nested learning (NL) perspective, the key takeaway for memory and learning is that \"memory\" should be understood as any neural update induced by input, and \"learning\" is the process of acquiring such useful memories. This view contrasts with traditional machine learning architectures, which often treat memory as a distinct module (e.g., short-term or long-term memory blocks). Instead, NL aligns more closely with neuropsychological understanding: memory is a distributed, interconnected process resulting from updates across all levels of neural learning modules. Thus, any parameter update\u2014from gradient descent or another optimizer\u2014at any layer or component of a neural network constitutes a form of memory. This broadens and unifies the definition of memory in artificial neural networks, blurring the lines between architecture components traditionally labeled as \"memory\" and other parameters, all under the umbrella of ongoing learning updates."
+    }
+  },
+  {
+    "task": {
+      "question": "What is the architecture of the DeepEncoder in DeepSeek OCR?",
+      "answer": "DeepEncoder uses a two-stage architecture: first, a window-attention-based visual perception module (SAM-base) to extract local image features, and second, a global-attention-based visual knowledge module (CLIP-large) to capture global semantics. Between them, a 2-layer convolutional downsampling module compresses vision tokens (from 4096 to 256 for a 1024x1024 image), keeping computation and memory manageable while preserving important information.",
+      "file": "deepseek_ocr.pdf"
+    },
+    "tool_calls": [
+      "check_api_key",
+      "parse_file"
+    ],
+    "llm_evaluations": {
+      "fs_explorer": {
+        "relevance": 10,
+        "correctness": 9,
+        "reason": "The proposed answer matches the ground truth\u2019s key points: a two-stage encoder with SAM-base window attention for local features, a 2-layer convolutional downsampling/token compressor between stages, and CLIP-large global attention for global semantics. It also correctly conveys the purpose (reducing tokens/activation memory). Minor deductions: it adds extra specifics not present in the ground truth (parameter counts, exact conv channel sizes, patch-size detail, \u2018first patch embedding removed\u2019) and does not explicitly state the token reduction example (4096\u2192256 for 1024\u00d71024), though the 16\u00d7 compression is consistent."
+      },
+      "rag": {
+        "relevance": 10,
+        "correctness": 9,
+        "reason": "The answer matches the ground truth\u2019s core points: a two-stage encoder with SAM-base using window/local attention followed by CLIP-large using global attention, connected by a token compression/downsampling module that reduces token count (\u224816\u00d7). It is highly relevant and largely correct. Minor issues: it describes the bridge as a \u201c16\u00d7 token compressor\u201d but does not explicitly mention the ground-truth detail that it is a 2-layer convolutional downsampling module, nor the specific example reduction from 4096\u2192256 for a 1024\u00d71024 image (though 16\u00d7 implies it). Parameter counts and decoder mention are extra but not conflicting."
+      }
+    },
+    "used_files": {
+      "fs_explorer": [
+        "./deepseek_ocr.pdf"
+      ],
+      "rag": null
+    },
+    "has_error": {
+      "fs_explorer": false,
+      "rag": false
+    },
+    "time_taken": {
+      "fs_explorer": 11.284583806991577,
+      "rag": 6.58709716796875
+    },
+    "answers": {
+      "fs_explorer": "The architecture of the DeepEncoder in DeepSeek-OCR consists of three main components connected in series: \n1. A visual perception feature extraction component dominated by window attention, specifically an 80M SAM-base (patch-size 16).\n2. A 16x token compressor, which is a 2-layer convolutional module that performs 16x downsampling. Each convolutional layer has a kernel size of 3, stride of 2, padding of 1, and channels increasing from 256 to 1024.\n3. A visual knowledge feature extraction component with dense global attention, specifically a 300M CLIP-large (with the first patch embedding layer removed).\n\nThe DeepEncoder is approximately 380M in parameters and is designed to maintain low activation memory while achieving high compression ratios for vision tokens.",
+      "rag": "The DeepEncoder in DeepSeek-OCR is a key part of its architecture dedicated to extracting image features, tokenizing, and compressing visual representations before passing them to the decoder. According to the provided context and Figure 3 description, DeepEncoder is approximately 380 million parameters in size and is composed mainly of two parts: (1) an 80M parameter SAM-base module, which focuses on perception and uses local (window) attention to process the input image, and (2) a 300M parameter CLIP-large module, which provides knowledge integration through dense global attention. These two components are connected in series and bridged by a 16\u00d7 token compressor that downsamples and compresses the feature tokens. The output from DeepEncoder is then ready for use by the DeepSeek-3B-MoE decoder. In summary, DeepEncoder employs both local and global attention mechanisms (via SAM and CLIP), combined with efficient token compression, to prepare rich and compact visual representations for downstream tasks."
+    }
+  },
+  {
+    "task": {
+      "question": "What related work in the field of Byte-Level LMs do the author of the BOLMO paper individuate?",
+      "answer": "The authors point to byte-level LMs that replace subwords with UTF-8 bytes, especially Latent Tokenizer Language Models (LTLMs) such as Hourglass-style models that use pooling, depooling, and dynamic, learned tokenization, which can match subword models at similar compute.",
+      "file": "bolmo.pdf"
+    },
+    "tool_calls": [
+      "check_api_key",
+      "parse_file"
+    ],
+    "llm_evaluations": {
+      "fs_explorer": {
+        "relevance": 8,
+        "correctness": 6,
+        "reason": "The response is largely on-topic (byte-level LMs and especially LTLM/hourglass-style approaches), and it captures the core idea that related work includes replacing subwords with UTF-8 bytes and LTLMs with pooling/depooling and learned/dynamic tokenization. However, it goes well beyond the ground-truth scope with many specific model names/citations that are not supported by the provided ground truth and may include inaccuracies or extraneous details. Because the ground truth highlights primarily LTLM/hourglass-style models matching subword compute, the added taxonomy and comparisons reduce correctness relative to what can be verified."
+      },
+      "rag": {
+        "relevance": 3,
+        "correctness": 2,
+        "reason": "The answer is only loosely related to byte-level LM related work and does not identify the key related work singled out in the BOLMO paper per the ground truth (UTF-8 byte replacement and especially Latent Tokenizer Language Models / hourglass pooling\u2013depooling with learned tokenization). It instead speculates about other byte-level models (ByT5, CANINE) and adds uncertainty, so it largely misses the specific point."
+      }
+    },
+    "used_files": {
+      "fs_explorer": [
+        "./bolmo.pdf"
+      ],
+      "rag": "BOLMO.pdf"
+    },
+    "has_error": {
+      "fs_explorer": false,
+      "rag": false
+    },
+    "time_taken": {
+      "fs_explorer": 14.328953981399536,
+      "rag": 11.591348886489868
+    },
+    "answers": {
+      "fs_explorer": "The authors of the Bolmo paper identify several key related works in the field of Byte-Level LMs, grouping them into distinct approaches:\n\n1. **Direct Byte-to-Byte Models**: Works that replace subword tokens with UTF-8 bytes while keeping architectures mostly the same, such as ByT5 (Xue et al., 2022), MambaByte (Wang et al., 2024), EvaByte (Zheng et al., 2025), and Minixhofer et al. (2025b).\n2. **Hierarchical and Pooling Architectures**: Models that use local encoders/decoders to pool tokens into representation patches, including the Hourglass Transformer (Nawrot et al., 2022), MEGABYTE (Yu et al., 2023), and Block Transformer (Ho et al., 2024).\n3. **Latent Tokenizer Language Models (LTLMs)**: Architectures featuring dynamic tokenization or learned boundaries, which the authors refer to collectively as LTLMs. Key examples include:\n   - **DTP** (Dynamic Token Pooling; Nawrot et al., 2023)\n   - **Spacebyte** (Slagle, 2024)\n   - **H-Net** (Dynamic Chunking; Hwang et al., 2025)\n   - **BLT** (Byte Latent Transformer; Pagnoni et al., 2025)\n4. **Additional Comparative Models**: In their results, they also compare against **TFree-Hat** (Neitemeier et al., 2025).\n\nThey distinguish Bolmo from these works by focusing on 'byteifying' existing subword-level LMs through distillation and adaptation rather than training a byte-level model from scratch.",
+      "rag": "The context provided is empty (''), so I don't have access to the actual content of the BOLMO paper. However, based on common practice in papers about Byte-Level Language Models (LMs) such as BOLMO, the authors typically reference related work including pioneering byte-level models like ByT5 (Google research), CANINE, and byte-level versions of transformer architectures such as GPT and BERT variants. They often discuss the advantages of byte-level models over subword or character-level models, such as improved multilinguality and robustness to out-of-vocabulary words. For a precise and paper-specific answer, please provide excerpts or the related work section from the BOLMO paper."
+    }
+  },
+  {
+    "task": {
+      "question": "What datasets was MUVERA evaluated on?",
+      "answer": "MUVERA's evaluation includes results from six of the well-studied BEIR information retrieval datasets: MS MARCO, HotpotQA, NQ, Quora, SciDocs, and ArguAna. These datasets were selected for varying corpus size (8K-8.8M) and average number of document tokens (18-165).",
+      "file": "muvera.pdf"
+    },
+    "tool_calls": [
+      "check_api_key",
+      "parse_file"
+    ],
+    "llm_evaluations": {
+      "fs_explorer": {
+        "relevance": 10,
+        "correctness": 9,
+        "reason": "The answer correctly lists all six BEIR datasets mentioned in the ground truth (MS MARCO, HotpotQA, NQ, Quora, SciDocs, ArguAna), so it fully addresses the question. It omits the extra ground-truth detail about why they were selected (corpus size/token lengths), but that detail isn\u2019t required to answer which datasets were used."
+      },
+      "rag": {
+        "relevance": 8,
+        "correctness": 5,
+        "reason": "The response is on-topic (it addresses which datasets MUVERA was evaluated on) but it is incomplete and partially incorrect relative to the ground truth. It correctly notes there are 6 BEIR datasets and names three of them (MS MARCO, HotpotQA, NQ), but fails to name the other three (Quora, SciDocs, ArguAna) and instead claims they were not explicitly named. The added detail about comparing against PLAID is not asked for and not part of the ground-truth dataset list."
+      }
+    },
+    "used_files": {
+      "fs_explorer": [
+        "./muvera.pdf"
+      ],
+      "rag": null
+    },
+    "has_error": {
+      "fs_explorer": false,
+      "rag": false
+    },
+    "time_taken": {
+      "fs_explorer": 8.626154899597168,
+      "rag": 4.070914030075073
+    },
+    "answers": {
+      "fs_explorer": "MUVERA was evaluated on six datasets from the BEIR (Benchmarking Efficient Information Retrieval) benchmark: MS MARCO, HotpotQA, NQ (Natural Questions), Quora, SciDocs, and ArguAna.",
+      "rag": "MUVERA was evaluated on 6 datasets from the BEIR benchmark, as described earlier in the context. Specifically, the datasets mentioned by name are MS MARCO, HotpotQA, and NQ, with the other three not explicitly named in the provided excerpt. The evaluation compared MUVERA's performance against PLAID on these datasets."
+    }
+  },
+  {
+    "task": {
+      "question": "What limitations and future work perspectives are outlined for Recursive Language Models?",
+      "answer": "Limitations and future work for RLMs include exploring asynchronous or sandboxed sub-calls to reduce inference cost, deeper recursion beyond a single sub-call, and training models specifically as RLMs to improve reasoning and context efficiency.",
+      "file": "recursive_language_models.pdf"
+    },
+    "tool_calls": [
+      "check_api_key",
+      "parse_file"
+    ],
+    "llm_evaluations": {
+      "fs_explorer": {
+        "relevance": 10,
+        "correctness": 9,
+        "reason": "The proposed answer closely matches the ground truth: it mentions asynchronous/sandboxed sub-calls to reduce inference cost, deeper recursion beyond a single sub-call, and training models specifically as RLMs. It also adds minor extra detail (implementation mechanisms and reasoning-trajectory bootstrapping) that is consistent with the theme but not explicitly required by the ground truth, so correctness is slightly below perfect only due to potential overreach beyond the minimal stated points."
+      },
+      "rag": {
+        "relevance": 9,
+        "correctness": 7,
+        "reason": "The proposed answer captures two core ground-truth points: (1) deeper recursion beyond a single sub-call, and (2) training models specifically as RLMs. However, it omits the ground-truth limitation/future-work item about exploring asynchronous or sandboxed sub-calls to reduce inference cost. It also adds extra details (e.g., bootstrapping/supervision, \u201cnew scaling axis\u201d) that are not in the ground truth; these don\u2019t necessarily make it wrong, but they go beyond what\u2019s required for exact alignment."
+      }
+    },
+    "used_files": {
+      "fs_explorer": [
+        "./recursive_language_models.pdf"
+      ],
+      "rag": null
+    },
+    "has_error": {
+      "fs_explorer": false,
+      "rag": false
+    },
+    "time_taken": {
+      "fs_explorer": 10.710226058959961,
+      "rag": 7.890398979187012
+    },
+    "answers": {
+      "fs_explorer": "The limitations and future work perspectives outlined for Recursive Language Models (RLMs) include: \n1. Implementation Mechanisms: The optimal mechanism for implementing RLMs is still under-explored. Future work could investigate alternative strategies like asynchronous sub-calls and sandboxed REPLs to reduce runtime and inference costs.\n2. Recursion Depth: The study used a maximum recursion depth of one (where sub-calls are LMs). Future work should investigate the impact of deeper layers of recursion.\n3. Model Training: The experiments focused on existing frontier models. Future work could involve explicitly training models to be used specifically as RLMs (either as root or sub-LMs) for better performance.\n4. Reasoning Trajectories: There is a hypothesis that RLM trajectories can be viewed as a form of reasoning that can be trained by bootstrapping existing frontier models.",
+      "rag": "The limitations and future work perspectives for Recursive Language Models (RLMs), as outlined in the provided context, are:\n\n**Limitations:**\n- The experiments used a maximum recursion depth of one, meaning sub-calls were only to language models themselves rather than deeper recursive structures. This restricts the exploration of potentially beneficial deeper recursion for reasoning and inference.\n- The current RLM framework was evaluated using existing pre-trained 'frontier' models, which were not specifically trained to serve in recursive or sub-model roles. As found in the experiments, these models are not yet efficient decision-makers when applied in the RLM setting, suggesting a mismatch between the model's training setup and its use in RLMs.\n\n**Future Work:**\n- Investigate deeper layers of recursion beyond the current depth-one limit, which may provide stronger reasoning and inference capabilities.\n- Explicitly train models for use within the RLM framework, either as root or sub-language models, which could improve overall performance and efficiency in reasoning tasks.\n- Explore and develop methods for bootstrapping and training RLMs as a form of explicit reasoning, potentially viewing RLM trajectories as reasoning chains that can be supervised and improved over time.\n- The authors express excitement about these directions and believe purpose-trained RLMs could represent a new scaling axis for the advancement of language model systems."
+    }
+  }
+]
@@ -0,0 +1,19 @@
+{
+  "time_stats": {
+    "fs_explorer": 11.165519952774048,
+    "rag": 7.35753698348999,
+    "best": "rag"
+  },
+  "llm_stats": {
+    "fs_explorer": {
+      "correctness": 8.4,
+      "relevance": 9.6
+    },
+    "rag": {
+      "correctness": 6.4,
+      "relevance": 8
+    },
+    "best_correctness": "fs-explorer",
+    "best_relevance": "fs-explorer"
+  }
+}
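The summary figures in this statistics file can be re-derived from the per-task entries in the results file. The following is a minimal sketch, assuming the field names shown in the results (`time_taken`, `llm_evaluations`); the `summarize` helper is hypothetical and is not necessarily how `get-stats` computes its output. Times are rounded to three decimals for brevity.

```python
from statistics import mean

FRAMEWORKS = ("fs_explorer", "rag")


def _task(fs_time, rag_time, fs_scores, rag_scores):
    """Build one result entry; scores are (relevance, correctness) pairs."""
    return {
        "time_taken": {"fs_explorer": fs_time, "rag": rag_time},
        "llm_evaluations": {
            "fs_explorer": {"relevance": fs_scores[0], "correctness": fs_scores[1]},
            "rag": {"relevance": rag_scores[0], "correctness": rag_scores[1]},
        },
    }


# Values copied from the five tasks in the results file (times rounded).
RESULTS = [
    _task(10.878, 6.648, (10, 9), (10, 9)),   # nested_learning.pdf
    _task(11.285, 6.587, (10, 9), (10, 9)),   # deepseek_ocr.pdf
    _task(14.329, 11.591, (8, 6), (3, 2)),    # bolmo.pdf
    _task(8.626, 4.071, (10, 9), (8, 5)),     # muvera.pdf
    _task(10.710, 7.890, (10, 9), (9, 7)),    # recursive_language_models.pdf
]


def summarize(results):
    """Average time and judge scores per framework (hypothetical helper)."""
    stats = {"time_stats": {}, "llm_stats": {}}
    for fw in FRAMEWORKS:
        stats["time_stats"][fw] = mean(r["time_taken"][fw] for r in results)
        # Skip tasks with no evaluation, since the result types allow None.
        evals = [r["llm_evaluations"][fw] for r in results if r["llm_evaluations"][fw]]
        stats["llm_stats"][fw] = {
            "correctness": mean(e["correctness"] for e in evals),
            "relevance": mean(e["relevance"] for e in evals),
        }
    # Lower is better for time, so the fastest framework is the best one.
    stats["time_stats"]["best"] = min(FRAMEWORKS, key=lambda fw: stats["time_stats"][fw])
    return stats


if __name__ == "__main__":
    print(summarize(RESULTS))
```

With only five tasks, a single outlier moves the averages considerably: the `bolmo.pdf` task alone, where the RAG answer reports an empty context, accounts for a large share of the correctness and relevance gap.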
@@ -0,0 +1,37 @@
+# Evaluation Framework
+
+Framework to evaluate the performance of the fs-explorer agent (agentic file search with filesystem tools) against the performance of traditional RAG.
+
+## Run the evaluation
+
+> This assumes that you have `fs-explorer` and `rag-starterkit` installed.
+
+Start a local Qdrant server (from the `packages/eval-framework` folder):
+
+```bash
+docker compose up -d
+```
+
+Move to the benchmark folder with the data:
+
+```bash
+cd ../../data/benchmark
+```
+
+Pre-parse all PDF files with LlamaParse so that the benchmark runs faster:
+
+```bash
+explore load-cache --directory . --skip tmp --skip questions_and_answers.json --skip fs-explorer.log --skip rag-starterkit.log
+```
+
+Run the evaluation (you might need to split the questions in [`questions_and_answers.json`](../../data/benchmark/questions_and_answers.json) into sub-groups to avoid rate-limiting issues):
+
+```bash
+run-eval -df questions_and_answers.json
+```
+
+This produces a [`results.json`](../../data/benchmark/results.json) file containing the full details of each evaluation task and its results. You can use it to produce a [`statistics.json`](../../data/benchmark/statistics.json) file (summary statistics from the evaluation results) and a [`report.md`](../../data/benchmark/report.md) file (a human-readable report on those statistics). To get the statistics, run:
+
+```bash
+get-stats
+```
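If rate limits make it necessary to split the questions into sub-groups, as noted in the evaluation step, the dataset file can be divided mechanically. This is a minimal sketch under stated assumptions: the `chunk` and `split_dataset` helpers and the `questions_part_<n>.json` naming are hypothetical and not part of the framework.

```python
import json
from pathlib import Path


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def split_dataset(dataset_file, size, out_dir="."):
    """Write each group to questions_part_<n>.json and return the new paths."""
    tasks = json.loads(Path(dataset_file).read_text(encoding="utf-8"))
    paths = []
    for n, group in enumerate(chunk(tasks, size), start=1):
        path = Path(out_dir) / f"questions_part_{n}.json"
        # ensure_ascii=False keeps characters such as dashes readable on disk.
        path.write_text(
            json.dumps(group, indent=4, ensure_ascii=False), encoding="utf-8"
        )
        paths.append(path)
    return paths


if __name__ == "__main__":
    for p in split_dataset("questions_and_answers.json", size=2):
        print(p)
```

Each generated file can then be passed to `run-eval -df` in turn. Whether successive runs overwrite `results.json` is not shown here, so it may be worth moving the results file aside between runs.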
@@ -0,0 +1,11 @@
+name: evaluation-framework
+
+services:
+  qdrant:
+    image: qdrant/qdrant:latest
+    restart: always
+    ports:
+      - "6333:6333"
+      - "6334:6334"
+    volumes:
+      - "./qdrant_storage:/qdrant/storage"
@@ -0,0 +1,26 @@
+[build-system]
+requires = ["uv_build>=0.9.10,<0.10.0"]
+build-backend = "uv_build"
+
+[project]
+name = "eval-framework"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "fs-explorer",
+    "rag-starterkit",
+    "typer>=0.20.0",
+]
+
+[tool.uv.sources]
+rag-starterkit = { workspace = true }
+fs-explorer = { workspace = true }
+
+[tool.uv.build-backend]
+module-name = "eval_framework"
+
+[project.scripts]
+run-eval = "eval_framework.main:app_eval"
+get-stats = "eval_framework.main:app_stats"
@@ -0,0 +1,57 @@
+import re
+
+PATTERN = re.compile(r"\{\{([^\}]+)\}\}")
+
+
+class TemplateValidationError(Exception):
+    """Raised when the arguments to render a template fail to validate"""
+
+
+class Template:
+    """
+    Jinja2-like class for string templating
+
+    Attributes:
+        content (str): original template string
+        _to_render (list[str]): fields of the string that have to be rendered with the template
+    """
+
+    def __init__(self, content: str):
+        """
+        Create a template from a string.
+
+        Args:
+            content (str): the template string
+        """
+        self.content = content
+        self._to_render = PATTERN.findall(content)
+
+    def _validate(self, args: dict[str, str]) -> bool:
+        return all(el in args for el in self._to_render) and all(
+            isinstance(args[k], str) for k in args
+        )
+
+    def render(self, args: dict[str, str]) -> str:
+        """
+        Render the template.
+
+        Args:
+            args (dict[str, str]): a dictionary of arguments for the template to be rendered. The keys represent the fields in the template, and the values represent the strings with which to fill the template.
+
+        Returns:
+            str: The rendered template string.
+        """
+        if self._validate(args):
+            content = self.content
+            for word in self._to_render:
+                content = content.replace("{{" + word + "}}", args[word])
+            return content
+        # Validation failed: report missing fields first (deduplicated, in template order).
+        missing = [f for f in dict.fromkeys(self._to_render) if f not in args]
+        if missing:
+            raise TemplateValidationError(
+                f"Missing the following arguments for the template: {', '.join(missing)}"
+            )
+        raise TemplateValidationError(
+            "You should provide a dictionary with only string values."
+        )
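One property of the replace loop in `render` is worth noting: because fields are substituted one after another, a placeholder that appears inside an already substituted value can be expanded by a later iteration. This could matter for the judge prompt, where the answer text comes from a model. The sketch below contrasts that behavior with a single regex pass; `render_single_pass` and `render_sequential` are hypothetical standalone functions written for illustration, not part of the module.

```python
import re

PATTERN = re.compile(r"\{\{([^\}]+)\}\}")


def render_sequential(content: str, args: dict[str, str]) -> str:
    """Mirror of the replace loop in Template.render, for comparison."""
    for word in PATTERN.findall(content):
        content = content.replace("{{" + word + "}}", args[word])
    return content


def render_single_pass(content: str, args: dict[str, str]) -> str:
    """Fill every {{field}} in one regex pass (hypothetical alternative)."""
    missing = [f for f in dict.fromkeys(PATTERN.findall(content)) if f not in args]
    if missing:
        raise KeyError(f"Missing template arguments: {', '.join(missing)}")
    # Each match in the original template is replaced exactly once, so
    # placeholders that occur inside a substituted value are left untouched.
    return PATTERN.sub(lambda m: args[m.group(1)], content)


if __name__ == "__main__":
    template = "Q: {{question}} A: {{answer}}"
    values = {"question": "What does {{answer}} mean?", "answer": "42"}
    print(render_sequential(template, values))
    print(render_single_pass(template, values))
```

In the sequential version the `{{answer}}` embedded in the question is also filled in, while the single-pass version leaves it as literal text.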
@@ -0,0 +1,182 @@
+import os
+import json
+import asyncio
+
+from openai import AsyncOpenAI
+from openai.types.responses.easy_input_message_param import EasyInputMessageParam
+from openai.types.shared_params import Reasoning
+from pydantic import BaseModel, Field
+from typing import TypedDict, cast
+from .run import run_pipeline, run_workflow
+from ._templating import Template
+
+
+class EvalTask(TypedDict):
+    question: str
+    answer: str
+    file: str
+
+
+class LLMEvaluation(TypedDict):
+    relevance: int
+    correctness: int
+    reason: str
+
+
+class BestTime(TypedDict):
+    fs_explorer: float
+    rag: float
+
+
+class FilePath(TypedDict):
+    fs_explorer: list[str] | str | None
+    rag: list[str] | str | None
+
+
+class HasError(TypedDict):
+    fs_explorer: bool
+    rag: bool
+
+
+class LLMEvaluations(TypedDict):
+    fs_explorer: LLMEvaluation | None
+    rag: LLMEvaluation | None
+
+
+class Answers(TypedDict):
+    fs_explorer: str | None
+    rag: str | None
+
+
+class EvalResult(TypedDict):
+    task: EvalTask
+    llm_evaluations: LLMEvaluations
+    answers: Answers
+    time_taken: BestTime
+    used_files: FilePath
+    tool_calls: list[str]
+    has_error: HasError
+
+
+class Evaluation(BaseModel):
+    relevance: int = Field(
+        description="Evaluation of the response, based on its relevance compared with the ground truth. Ranges between 0 and 10",
+        ge=0,
+        le=10,
+    )
+    correctness: int = Field(
+        description="Evaluation of the response, based on its correctness compared with the ground truth. Ranges between 0 and 10",
+        ge=0,
+        le=10,
+    )
+    reason: str = Field(description="Brief explanation of the evaluation")
+
+    def to_llm_evaluation(self) -> LLMEvaluation:
+        return LLMEvaluation(**self.model_dump())
+
+
+LLM_AS_A_JUDGE_PROMPT = Template(
+    "The following question: '{{question}}' has this ground truth answer: '{{ground_truth}}'. Please evaluate this answer: '{{answer}}' grading its correctness and relevance between 0 and 10, and providing a brief explanation of the evaluation."
+)
+LLM_AS_A_JUDGE_MODEL = "gpt-5.2"
+
+
+async def llm_as_a_judge(
+    question: str, ground_truth: str, produced_answer: str
+) -> Evaluation | None:
+    content = LLM_AS_A_JUDGE_PROMPT.render(
+        {"question": question, "ground_truth": ground_truth, "answer": produced_answer}
+    )
+    message = EasyInputMessageParam(content=content, role="user")
+    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+    response = await client.responses.parse(
+        text_format=Evaluation,
+        input=[message],
+        reasoning=Reasoning(effort="none"),
+        model=LLM_AS_A_JUDGE_MODEL,
+    )
+    return response.output_parsed
+
+
+def get_evaluation_dataset(dataset_file: str) -> list[EvalTask]:
+    with open(dataset_file, "r") as f:
+        data = json.load(f)
+    assert isinstance(data, list)
+    eval_tasks: list[EvalTask] = []
+    for d in data:
+        assert isinstance(d, dict)
+        assert "question" in d
+        assert "answer" in d
+        assert "file" in d
+        eval_tasks.append(cast(EvalTask, d))
+    return eval_tasks
+
+
+async def run_evaluation(dataset_file: str, results_file: str = "results.json") -> None:
+    tasks = get_evaluation_dataset(dataset_file)
+    results: list[EvalResult] = []
+    try:
+        for i, task in enumerate(tasks):
+            print(f"Starting task {i + 1} of {len(tasks)}")
+            wf_result = await run_workflow(question=task["question"])
+            pipeline_result = await run_pipeline(question=task["question"])
+            best_time = BestTime(
+                fs_explorer=wf_result["time_taken"], rag=pipeline_result["time_taken"]
+            )
+            file_check = FilePath(
+                fs_explorer=wf_result["file_path"], rag=pipeline_result["file_path"]
+            )
+            tool_calls = wf_result["tool_calls"] or []
+            has_error = HasError(fs_explorer=True, rag=True)
+            if wf_result["error"] is None:
+                has_error["fs_explorer"] = False
+                if wf_result["final_answer"] is not None:
+                    wf_evaluation = await llm_as_a_judge(
+                        question=task["question"],
+                        ground_truth=task["answer"],
+                        produced_answer=wf_result["final_answer"],
+                    )
+                    if wf_evaluation is not None:
+                        wf_evaluation = wf_evaluation.to_llm_evaluation()
+                else:
+                    wf_evaluation = None
+            else:
+                wf_evaluation = None
+            if pipeline_result["error"] is None:
+                has_error["rag"] = False
+                if pipeline_result["final_answer"] is not None:
+                    pipeline_evaluation = await llm_as_a_judge(
+                        question=task["question"],
+                        ground_truth=task["answer"],
+                        produced_answer=pipeline_result["final_answer"],
+                    )
+                    if pipeline_evaluation is not None:
+                        pipeline_evaluation = pipeline_evaluation.to_llm_evaluation()
+                else:
+                    pipeline_evaluation = None
+            else:
+                pipeline_evaluation = None
+            llm_evaluations = LLMEvaluations(
+                fs_explorer=wf_evaluation, rag=pipeline_evaluation
+            )
+            eval_result = EvalResult(
+                task=task,
+                tool_calls=tool_calls,
+                llm_evaluations=llm_evaluations,
+                used_files=file_check,
+                has_error=has_error,
+                time_taken=best_time,
+                answers=Answers(
+                    fs_explorer=wf_result["final_answer"],
+                    rag=pipeline_result["final_answer"],
+                ),
+            )
+            results.append(eval_result)
+            print(
+                f"Finished task {i + 1} of {len(tasks)}, sleeping 1 sec to avoid rate limiting issues..."
+            )
+            await asyncio.sleep(1)
+    except Exception as e:
+        print(f"An error occurred: {e}")
+    with open(results_file, "w") as f:
+        json.dump(results, f, indent=2)
@@ -0,0 +1,65 @@
+import asyncio
+
+from typing import Annotated
+from typer import Typer, Option
+from .evaluate import run_evaluation
+from .stats import get_eval_stats
+
+app_eval = Typer()
+app_stats = Typer()
+
+
+@app_eval.command()
+def run_evaluations(
+    dataset_file: Annotated[
+        str,
+        Option(
+            "-df",
+            "--dataset-file",
+            help="JSON file containing the questions and answers dataset",
+        ),
+    ],
+    results_file: Annotated[
+        str,
+        Option(
+            "-o",
+            "--output",
+            help="JSON file in which to save the evaluation output.",
+        ),
+    ] = "results.json",
+) -> None:
+    asyncio.run(run_evaluation(dataset_file=dataset_file, results_file=results_file))
+
+
+@app_stats.command()
+def get_stats(
+    results_file: Annotated[
+        str,
+        Option(
+            "-rf",
+            "--results-file",
+            help="JSON file where the evaluation output has been saved.",
+        ),
+    ] = "results.json",
+    stats_file: Annotated[
+        str,
+        Option(
+            "-j",
+            "--output-json",
+            help="JSON file where the extracted statistics will be saved",
+        ),
+    ] = "statistics.json",
+    report_file: Annotated[
+        str,
+        Option(
+            "-r",
+            "--output-report",
+            help="Markdown file where a human-readable report on the extracted statistics will be saved",
+        ),
+    ] = "report.md",
+) -> None:
+    get_eval_stats(
+        results_file=results_file,
+        result_json_file=stats_file,
+        result_md_file=report_file,
+    )
@@ -0,0 +1,76 @@
+import time
+
+from typing import TypedDict
+from fs_explorer.workflow import (
+    workflow,
+    InputEvent,
+    ToolCallEvent,
+    ExplorationEndEvent,
+)
+from qdrant_client import AsyncQdrantClient
+from rag_starterkit.pipeline import Pipeline
+from ._templating import Template
+
+FS_EXPLORER_PROMPT = Template(
+    "Search for the answer to the following question: '{{question}}' by using one of the PDF files available in the current directory. In your final response, you must report the answer to the question. In this task, you MUST NOT ask for any human assistance and you MUST ONLY use tool calling."
+)
+
+
+class RunResult(TypedDict):
+    time_taken: float
+    tool_calls: list[str] | None
+    error: str | None
+    final_answer: str | None
+    file_path: str | list[str] | None
+
+
+async def run_workflow(question: str) -> RunResult:
+    start_event = InputEvent(task=FS_EXPLORER_PROMPT.render({"question": question}))
+    tool_calls = []
+    file_names: list[str] = []
+    start_time = time.perf_counter()  # monotonic clock for measuring elapsed time
+    handler = workflow.run(start_event=start_event)
+    async for event in handler.stream_events():
+        if isinstance(event, ToolCallEvent):
+            tool_calls.append(event.tool_name)
+            if event.tool_name == "parse_file":
+                file_name = event.tool_input.get("file_path")
+                if file_name is not None:
+                    file_names.append(file_name)
+    result = await handler
+    end_time = time.perf_counter()
+    assert isinstance(result, ExplorationEndEvent)
+    return RunResult(
+        time_taken=(end_time - start_time),
+        tool_calls=tool_calls,
+        error=result.error,
+        final_answer=result.final_result,
+        file_path=file_names,
+    )
+
+
+PIPELINE = Pipeline(
+    qdrant_client=AsyncQdrantClient(location="http://localhost:6333"),
+    qdrant_collection_name="rag-benchmark",
+    cache_directory="tmp/cache",
+)
+
+
+async def run_pipeline(question: str) -> RunResult:
+    await PIPELINE.prepare()
+    start_time = time.perf_counter()  # monotonic clock for measuring elapsed time
+    try:
+        result, file_path = await PIPELINE.run(question)
+        error = None
+    except Exception as e:
+        file_path = None
+        result = None
+        error = str(e)
+    end_time = time.perf_counter()
+    return RunResult(
+        time_taken=(end_time - start_time),
+        tool_calls=None,
+        error=error,
+        final_answer=result,
+        file_path=file_path,
+    )
@@ -0,0 +1,212 @@
+import json
+
+from statistics import mean
+from typing import cast, TypedDict, Literal
+from .evaluate import EvalResult, BestTime, LLMEvaluations
+
+FrameworkType = Literal["rag", "fs-explorer"]
+
+
+class TimeAverage(TypedDict):
+    fs_explorer: float
+    rag: float
+    best: FrameworkType
+
+
+class LLMAverage(TypedDict):
+    correctness: float
+    relevance: float
+
+
+class LLMStats(TypedDict):
+    fs_explorer: LLMAverage
+    rag: LLMAverage
+    best_correctness: FrameworkType
+    best_relevance: FrameworkType
+
+
+class EvalStats(TypedDict):
+    time_stats: TimeAverage
+    llm_stats: LLMStats
+
+
+def get_results(results_file: str) -> list[EvalResult]:
+    with open(results_file, "r") as f:
+        data = json.load(f)
+    assert isinstance(data, list)
+    results: list[EvalResult] = []
+    for d in data:
+        assert isinstance(d, dict)
+        assert "task" in d
+        assert "llm_evaluations" in d
+        assert "answers" in d
+        assert "time_taken" in d
+        assert "used_files" in d
+        assert "tool_calls" in d
+        assert "has_error" in d
+        results.append(cast(EvalResult, d))
+    return results
+
+
+def get_time_average(time_stats: list[BestTime]) -> TimeAverage:
+    fs_expl = []
+    rag = []
+    for time in time_stats:
+        fs_expl.append(time["fs_explorer"])
+        rag.append(time["rag"])
+    fs_expl_mean = mean(fs_expl)
+    rag_mean = mean(rag)
+    best = "rag" if fs_expl_mean > rag_mean else "fs-explorer"
+    return TimeAverage(fs_explorer=fs_expl_mean, rag=rag_mean, best=best)
+
+
+def get_llm_stats(llm_stats: list[LLMEvaluations]) -> LLMStats:
+    fs_expl_corr = []
+    rag_corr = []
+    fs_expl_rel = []
+    rag_rel = []
+    for stat in llm_stats:
+        if stat["fs_explorer"] is not None:
+            fs_expl_corr.append(stat["fs_explorer"]["correctness"])
+            fs_expl_rel.append(stat["fs_explorer"]["relevance"])
+        if stat["rag"] is not None:
+            rag_corr.append(stat["rag"]["correctness"])
+            rag_rel.append(stat["rag"]["relevance"])
+    rag_corr_avg = mean(rag_corr) if rag_corr else 0.0
+    rag_rel_avg = mean(rag_rel) if rag_rel else 0.0
+    fs_expl_corr_avg = mean(fs_expl_corr) if fs_expl_corr else 0.0
+    fs_expl_rel_avg = mean(fs_expl_rel) if fs_expl_rel else 0.0
+    best_corr = "fs-explorer" if fs_expl_corr_avg > rag_corr_avg else "rag"
+    best_rel = "fs-explorer" if fs_expl_rel_avg > rag_rel_avg else "rag"
+    return LLMStats(
+        fs_explorer=LLMAverage(correctness=fs_expl_corr_avg, relevance=fs_expl_rel_avg),
+        rag=LLMAverage(correctness=rag_corr_avg, relevance=rag_rel_avg),
+        best_correctness=best_corr,
+        best_relevance=best_rel,
+    )
+
+
+def create_markdown_report(eval_stats: EvalStats, num_tasks: int) -> str:
+    """Generate a markdown report from evaluation statistics."""
+    time_stats = eval_stats["time_stats"]
+    llm_stats = eval_stats["llm_stats"]
+
+    # Helper function to format framework names
+    def format_framework(name: str) -> str:
+        return (
+            "Agentic File Search with FileSystem Tools (fs-explorer)"
+            if name == "fs-explorer"
+            else "Traditional RAG"
+        )
+
+    # Helper function to render the winning framework's name in bold
+    def add_winner(framework: str, best: str) -> str:
+        return (
+            f"**{format_framework(framework)}**"
+            if framework == best
+            else format_framework(framework)
+        )
+
+    md = f"""# Evaluation Results Report
+
+## Summary
+
+Total tasks evaluated: **{num_tasks}**
+
+---
+
+## Time Performance
+
+Average execution time across all tasks:
+
+| Framework | Average Time (seconds) | Status |
+|-----------|------------------------|--------|
+| {add_winner("fs-explorer", time_stats["best"])} | {time_stats["fs_explorer"]:.2f}s | {"**Faster**" if time_stats["best"] == "fs-explorer" else ""} |
+| {add_winner("rag", time_stats["best"])} | {time_stats["rag"]:.2f}s | {"**Faster**" if time_stats["best"] == "rag" else ""} |
+
+**Winner:** {format_framework(time_stats["best"])} ({abs(time_stats["fs_explorer"] - time_stats["rag"]):.2f}s faster)
+
+---
+
+## LLM Evaluation Metrics
+
+### Correctness Scores
+
+| Framework | Average Score | Status |
+|-----------|---------------|--------|
+| {add_winner("fs-explorer", llm_stats["best_correctness"])} | {llm_stats["fs_explorer"]["correctness"]:.2f} | {"**Higher**" if llm_stats["best_correctness"] == "fs-explorer" else ""} |
+| {add_winner("rag", llm_stats["best_correctness"])} | {llm_stats["rag"]["correctness"]:.2f} | {"**Higher**" if llm_stats["best_correctness"] == "rag" else ""} |
+
+**Winner:** {format_framework(llm_stats["best_correctness"])} (+{abs(llm_stats["fs_explorer"]["correctness"] - llm_stats["rag"]["correctness"]):.2f} points)
+
+### Relevance Scores
+
+| Framework | Average Score | Status |
+|-----------|---------------|--------|
+| {add_winner("fs-explorer", llm_stats["best_relevance"])} | {llm_stats["fs_explorer"]["relevance"]:.2f} | {"**Higher**" if llm_stats["best_relevance"] == "fs-explorer" else ""} |
+| {add_winner("rag", llm_stats["best_relevance"])} | {llm_stats["rag"]["relevance"]:.2f} | {"**Higher**" if llm_stats["best_relevance"] == "rag" else ""} |
+
+**Winner:** {format_framework(llm_stats["best_relevance"])} (+{abs(llm_stats["fs_explorer"]["relevance"] - llm_stats["rag"]["relevance"]):.2f} points)
+
+---
+
+## Overall Comparison
+
+| Metric | FS-Explorer | RAG | Winner |
+|--------|-------------|-----|--------|
+| **Speed** | {time_stats["fs_explorer"]:.2f}s | {time_stats["rag"]:.2f}s | {format_framework(time_stats["best"])} |
+| **Correctness** | {llm_stats["fs_explorer"]["correctness"]:.2f} | {llm_stats["rag"]["correctness"]:.2f} | {format_framework(llm_stats["best_correctness"])} |
+| **Relevance** | {llm_stats["fs_explorer"]["relevance"]:.2f} | {llm_stats["rag"]["relevance"]:.2f} | {format_framework(llm_stats["best_relevance"])} |
+
+---
+
+## Key Takeaways
+
+- **Fastest Framework:** {format_framework(time_stats["best"])} is {abs(time_stats["fs_explorer"] - time_stats["rag"]):.2f}s faster on average
+- **Most Correct:** {format_framework(llm_stats["best_correctness"])} produces more correct answers
+- **Most Relevant:** {format_framework(llm_stats["best_relevance"])} produces more relevant answers
+"""
+
+    # Add overall winner summary
+    fs_wins = sum(
+        [
+            time_stats["best"] == "fs-explorer",
+            llm_stats["best_correctness"] == "fs-explorer",
+            llm_stats["best_relevance"] == "fs-explorer",
+        ]
+    )
+    rag_wins = 3 - fs_wins
+
+    if fs_wins > rag_wins:
+        overall_winner = "FS-Explorer"
+    elif rag_wins > fs_wins:
+        overall_winner = "RAG"
+    else:
+        overall_winner = "Tie"
+
+    md += (
+        f"\n**Overall Winner:** {overall_winner} ({max(fs_wins, rag_wins)}/3 metrics)\n"
+    )
+
+    return md
+
+
+def get_eval_stats(
+    results_file: str, result_json_file: str, result_md_file: str
+) -> None:
+    results = get_results(results_file)
+    times = [result["time_taken"] for result in results]
+    llm_evals = [result["llm_evaluations"] for result in results]
+    time_stats = get_time_average(times)
+    llm_stats = get_llm_stats(llm_evals)
+    eval_stats = EvalStats(
+        time_stats=time_stats,
+        llm_stats=llm_stats,
+    )
+
+    with open(result_json_file, "w") as f:
+        json.dump(eval_stats, f, indent=2)
+
+    markdown_report = create_markdown_report(eval_stats, len(results))
+    with open(result_md_file, "w") as f:
+        f.write(markdown_report)
@@ -0,0 +1,75 @@
+# rag-starterkit
+
+A starter-kit RAG application used as a baseline for benchmarking the performance of the fs-explorer agent.
+
+## Stack
+
+- [LlamaParse](https://developers.llamaindex.ai/python/cloud/llamaparse/) for advanced, OCR-driven text parsing and extraction
+- [Chonkie](https://chonkie.ai) for sentence-based chunking
+- OpenAI for dense embeddings
+- [FastEmbed](https://github.com/qdrant/fastembed) for sparse embeddings
+- [Qdrant](https://qdrant.tech) for vector storage and search
+
+## Flow
+
+### Data Ingestion
+
+```mermaid
+flowchart TD
+    A(PDF files)
+    B[Pre-parse with LlamaParse and cache]
+    C(Load texts from Cache)
+    D[Chunk with Chonkie]
+    H(Chunks)
+    E[Embed with OpenAI - dense]
+    I(Dense Embeddings)
+    F[Embed with FastEmbed - sparse]
+    J(Sparse Embeddings)
+    G[Upload to Qdrant]
+    A --> B
+    B --> C
+    C --> D
+    D --> H
+    H --> E
+    H --> F
+    E --> I
+    F --> J 
+    I --> G
+    J --> G
+```
+
+### Retrieval Augmented Generation
+
+```mermaid
+flowchart TD
+    A(Query)
+    M[Determine which file to use]
+    N(File filter for Qdrant search)
+    B[Embed with OpenAI - Dense]
+    C(Dense Query Embedding)
+    D[Embed with FastEmbed - Sparse]
+    E(Sparse Query Embedding)
+    F[Search Qdrant]
+    G(Dense Search Results)
+    H(Sparse Search Results)
+    I[Reranking with Reciprocal Rank Fusion]
+    J(Most relevant context - hybrid)
+    K[OpenAI LLM]
+    L(Generated Response)
+    A --> M
+    M --> N
+    N --> F
+    A --> B
+    B --> C 
+    A --> D
+    D --> E
+    E --> F
+    C --> F
+    F --> G
+    F --> H 
+    H --> I
+    G --> I
+    I --> J
+    J --> K
+    K --> L
+```
@@ -0,0 +1,21 @@
+[build-system]
+requires = ["uv_build>=0.9.10,<0.10.0"]
+build-backend = "uv_build"
+
+[project]
+name = "rag-starterkit"
+version = "0.1.0"
+description = "Traditional RAG application implementing hybrid search, used as a baseline against fs-explorer"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "chonkie>=1.5.2",
+    "diskcache>=5.6.3",
+    "fastembed>=0.7.4",
+    "llama-cloud-services>=0.6.88",
+    "openai>=2.14.0",
+    "qdrant-client>=1.16.2",
+]
+
+[tool.uv.build-backend]
+module-name = "rag_starterkit"
@@ -0,0 +1,35 @@
+from chonkie import SentenceChunker, Chunk
+from typing import TypedDict
+from fastembed import SparseEmbedding
+
+
+class ChunkWithMetadata(TypedDict):
+    chunk: Chunk
+    file_path: str
+    embedding: list[float]
+    sparse_embedding: SparseEmbedding | None
+
+
+class Chunker:
+    def __init__(self) -> None:
+        self._chunker = SentenceChunker(
+            chunk_overlap=200,  # roughly 10% of the chunk size
+            chunk_size=2048,
+        )
+
+    def chunk_texts(self, contents: dict[str, str]) -> list[ChunkWithMetadata]:
+        texts = list(contents.values())
+        files = list(contents.keys())
+        batch_chunks = self._chunker.chunk_batch(texts=texts)
+        chunks_w_meta: list[ChunkWithMetadata] = []
+        for i, batch_chunk in enumerate(batch_chunks):
+            for chunk in batch_chunk:
+                chunks_w_meta.append(
+                    ChunkWithMetadata(
+                        chunk=chunk,
+                        file_path=files[i],
+                        embedding=[],
+                        sparse_embedding=None,
+                    )
+                )
+        return chunks_w_meta
@@ -0,0 +1,56 @@
+from openai import AsyncOpenAI
+from fastembed import SparseTextEmbedding, SparseEmbedding
+
+from .chunk import ChunkWithMetadata
+
+DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"
+DEFAULT_FASTEMBED_MODEL = "Qdrant/bm25"
+
+
+class Embedder:
+    def __init__(
+        self,
+        api_key: str,
+        openai_model: str | None = None,
+        fastembed_model: str | None = None,
+    ):
+        self._client = AsyncOpenAI(api_key=api_key)
+        self.model = openai_model or DEFAULT_EMBEDDING_MODEL
+        self._sparse_embedder = SparseTextEmbedding(
+            model_name=(fastembed_model or DEFAULT_FASTEMBED_MODEL),
+            cache_dir="tmp/fastembed",
+        )
+
+    async def embed_chunks(
+        self, chunks: list[ChunkWithMetadata]
+    ) -> list[ChunkWithMetadata]:
+        texts = [chunk["chunk"].text for chunk in chunks]
+        embeddings = await self._client.embeddings.create(
+            input=texts,
+            model=self.model,
+            dimensions=768,
+        )
+        for i, embedding in enumerate(embeddings.data):
+            chunks[i]["embedding"] = embedding.embedding
+        return chunks
+
+    def sparse_embed_chunks(
+        self, chunks: list[ChunkWithMetadata]
+    ) -> list[ChunkWithMetadata]:
+        texts = [chunk["chunk"].text for chunk in chunks]
+        embeddings = list(self._sparse_embedder.embed(texts))
+        for i, embedding in enumerate(embeddings):
+            chunks[i]["sparse_embedding"] = embedding
+        return chunks
+
+    async def embed_query(self, query: str) -> list[float]:
+        embeddings = await self._client.embeddings.create(
+            input=query,
+            model=self.model,
+            dimensions=768,
+        )
+        return embeddings.data[0].embedding
+
+    def sparse_embed_query(self, query: str) -> SparseEmbedding:
+        embeddings = list(self._sparse_embedder.query_embed(query=query))
+        return embeddings[0]
@@ -0,0 +1,70 @@
+from openai import AsyncOpenAI
+from openai.types.responses.easy_input_message_param import EasyInputMessageParam
+from pathlib import Path
+from typing import Any
+from pydantic import BaseModel, Field
+
+DEFAULT_OPENAI_MODEL = "gpt-4.1"
+SYSTEM_PROMPT = """
+Your task is to identify, among the files that the user provides, the one that is most likely to contain the answer to the user's query. Provide the file and the confidence you have in your response (as an integer between 0 and 100).
+"""
+
+
+class FileFilter(BaseModel):
+    file_path: str = Field(
+        description="File path to filter on when searching for an answer to the query"
+    )
+    confidence: int = Field(
+        description="Confidence in your choice for the file_path field. Must be between 0 and 100",
+        ge=0,
+        le=100,
+    )
+
+
+class GroundedResponse(BaseModel):
+    response: str = Field(description="Response based on the provided context")
+
+
+class LLMFilter:
+    def __init__(self, api_key: str, model: str | None = None):
+        self._client = AsyncOpenAI(api_key=api_key)
+        self.model = model or DEFAULT_OPENAI_MODEL
+
+    def _build_user_message(
+        self, query: str, file_paths: list[str]
+    ) -> EasyInputMessageParam:
+        file_paths = [str(Path(file_path).resolve()) for file_path in file_paths]
+        fls = "\n- ".join(file_paths)
+        content = f"Find, among these files:\n\n- {fls}\n\nThe one that would be the most likely to contain the answer to this query: '{query}'"
+        return EasyInputMessageParam(role="user", content=content, type="message")
+
+    def _build_context_message(self, query: str, context: str) -> EasyInputMessageParam:
+        content = f"Based on this context: '{context}' provide a response to this query: '{query}'"
+        return EasyInputMessageParam(role="user", content=content, type="message")
+
+    async def generate_filter(
+        self, query: str, file_paths: list[str]
+    ) -> FileFilter | None:
+        messages: list[Any] = [
+            EasyInputMessageParam(content=SYSTEM_PROMPT, role="system", type="message")
+        ]
+        messages.append(self._build_user_message(query, file_paths))
+        response = await self._client.responses.parse(
+            text_format=FileFilter,
+            input=messages,
+            model=self.model,
+        )
+        return response.output_parsed
+
+    async def generate_response(
+        self,
+        query: str,
+        context: str,
+    ) -> GroundedResponse | None:
+        message = self._build_context_message(query, context)
+        response = await self._client.responses.parse(
+            text_format=GroundedResponse,
+            input=[message],
+            model=self.model,
+        )
+        return response.output_parsed
@@ -0,0 +1,96 @@
+import logging
+import asyncio
+import os
+
+from pathlib import Path
+from typing import cast
+from diskcache import Cache
+from llama_cloud_services.parse.utils import ResultType
+from llama_cloud_services.parse.types import JobResult
+from llama_cloud_services import LlamaParse
+
+
+async def parse_directory(
+    directory: str, recursive: bool, to_skip: list[str]
+) -> dict[str, str]:
+    logging.basicConfig(
+        filename="rag-starterkit.log",
+        filemode="w",
+        level=logging.INFO,
+        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+    )
+    dir_path = Path(directory)
+    to_skip_resolved = [str((dir_path / path).resolve()) for path in to_skip]
+    if not recursive:
+        files = []
+        fls = os.listdir(dir_path)
+        for fl in fls:
+            resolved = str(Path(dir_path / fl).resolve())
+            if resolved not in to_skip_resolved:
+                files.append(resolved)
+    else:
+        files = []
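+        for root, dirs, fls in os.walk(dir_path):
+            # Compare resolved paths (not bare names) against the skip list, and
+            # keep bare names in `dirs` so os.walk prunes skipped directories.
+            dirs[:] = [
+                d
+                for d in dirs
+                if str((Path(root) / d).resolve()) not in to_skip_resolved
+            ]
+            for fl in fls:
+                resolved = str((Path(root) / fl).resolve())
+                # Skipped files are filtered out with the same resolved-path check.
+                if resolved not in to_skip_resolved:
+                    files.append(resolved)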
+    semaphore = asyncio.Semaphore(5)
+    parser = LlamaParse(
+        api_key=cast(str, os.getenv("LLAMA_CLOUD_API_KEY")),
+        result_type=ResultType.TXT,
+        fast_mode=True,
+    )
+
+    async def parse_job(file_path: str) -> tuple[str, str] | None:
+        async with semaphore:
+            result = cast(JobResult, await parser.aparse(file_path=file_path))
+            if result.error is None:
+                text = await result.aget_text()
+                return file_path, text
+            else:
+                logging.info(
+                    f"Could not parse file {file_path} because of {result.error}"
+                )
+
+    files_contents = await asyncio.gather(*(parse_job(file) for file in files))
+    data: dict[str, str] = {}
+    for el in files_contents:
+        if el is not None:
+            data[el[0]] = el[1]
+    return data
+
+
+def contents_from_cache(cache_directory: str = "tmp/cache") -> dict[str, str]:
+    logging.basicConfig(
+        filename="rag-starterkit.log",
+        filemode="w",
+        level=logging.INFO,
+        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+    )
+    cache = Cache(directory=cache_directory)
+    data: dict[str, str] = {}
+    cache_keys = list(cache.iterkeys())
+    for key in cache_keys:
+        if isinstance(key, str):
+            data_key = key
+        elif isinstance(key, bytes):
+            data_key = str(key, encoding="utf-8")
+        else:
+            raise ValueError(
+                "Cache keys should be either strings or byte-encoded strings"
+            )
+        value = cache.get(data_key)
+        if value is not None:
+            data[data_key] = cast(str, value)
+        else:
+            logging.info(f"Skipping file {data_key} because its cached content is null")
+    return data
@@ -0,0 +1,88 @@
+import os
+import inspect
+
+from qdrant_client import AsyncQdrantClient
+from typing import Any, cast
+from .parse import parse_directory, contents_from_cache
+from .chunk import Chunker
+from .embed import Embedder
+from .vectordb import VectorDB
+from .llm_filter import LLMFilter
+
+
+class Pipeline:
+    def __init__(
+        self,
+        qdrant_client: AsyncQdrantClient,
+        qdrant_collection_name: str,
+        rrf_constant: int = 60,
+        parsing_kwargs: dict[str, Any] | None = None,
+        cache_directory: str | None = None,
+        openai_api_key: str | None = None,
+        openai_emebdding_model: str | None = None,
+        fastembed_model: str | None = None,
+        openai_llm_model: str | None = None,
+    ):
+        if cache_directory is None and parsing_kwargs is None:
+            raise ValueError(
+                "At least one of parsing_kwargs and cache_directory must be provided"
+            )
+        self.parsing_strategy = (
+            parse_directory if cache_directory is None else contents_from_cache
+        )
+        self.parsing_kwargs = (
+            parsing_kwargs
+            if cache_directory is None
+            else {"cache_directory": cache_directory}
+        )
+        if openai_api_key is None:
+            openai_api_key = os.getenv("OPENAI_API_KEY")
+        if openai_api_key is None:
+            raise ValueError(
+                "OPENAI_API_KEY must be set in the environment if openai_api_key is not provided as an argument"
+            )
+        self.chunker = Chunker()
+        self.embedder = Embedder(
+            api_key=openai_api_key,
+            openai_model=openai_emebdding_model,
+            fastembed_model=fastembed_model,
+        )
+        self.vector_db = VectorDB(
+            qdrant_client=qdrant_client,
+            collection_name=qdrant_collection_name,
+            embedder=self.embedder,
+            rrf_constant=rrf_constant,
+        )
+        self.filter_llm = LLMFilter(api_key=openai_api_key, model=openai_llm_model)
+        self.file_paths: list[str] = []
+        self.is_ready = False
+
+    async def prepare(self) -> None:
+        if not self.is_ready and not (await self.vector_db.check_if_loaded()):
+            if inspect.iscoroutinefunction(self.parsing_strategy):
+                assert self.parsing_kwargs is not None, "parsing_kwargs cannot be null"
+                contents = await self.parsing_strategy(**self.parsing_kwargs)
+            else:
+                contents = self.parsing_strategy(**self.parsing_kwargs)  # type: ignore
+            contents = cast(dict[str, str], contents)
+            self.file_paths = [key for key in contents]
+            chunks = self.chunker.chunk_texts(contents)
+            chunks = await self.embedder.embed_chunks(chunks)
+            chunks = self.embedder.sparse_embed_chunks(chunks)
+            await self.vector_db.configure_collection()
+            await self.vector_db.upload(chunks)
+        self.is_ready = True
+
+    async def run(self, query: str, limit: int = 1) -> tuple[str | None, str | None]:
+        if not self.is_ready:
+            raise ValueError("Pipeline has not been prepared before running")
+        filter_file = await self.filter_llm.generate_filter(query, self.file_paths)
+        file_path = (
+            filter_file.file_path
+            if filter_file is not None and filter_file.confidence > 50
+            else None
+        )
+        results = await self.vector_db.search(query, file_path=file_path, limit=limit)
+        context = "\n\n".join([result["content"] for result in results])
+        response = await self.filter_llm.generate_response(query, context)
+        return response.response if response is not None else None, file_path
@@ -0,0 +1,195 @@
+from qdrant_client import AsyncQdrantClient
+from qdrant_client.models import (
+    VectorParams,
+    SparseVectorParams,
+    Distance,
+    SparseIndexParams,
+    SparseVector,
+    Filter,
+    FieldCondition,
+    MatchValue,
+)
+from typing import TypedDict, Literal, cast
+
+
+from .chunk import ChunkWithMetadata
+from .embed import Embedder
+
+
+class SearchResult(TypedDict):
+    id: int
+    content: str
+    file_path: str
+    score: float
+    type: Literal["sparse", "dense"]
+
+
+class SimpleReranker:
+    def __init__(self, k: int = 60) -> None:
+        """
+        Args:
+            k: Constant for RRF formula. Higher values reduce the impact of top-ranked items. Default of 60 is commonly used in literature.
+        """
+        self.k = k
+
+    def _reciprocal_rank_fusion(
+        self, dense_results: list[SearchResult], sparse_results: list[SearchResult]
+    ) -> dict[str, float]:
+        rrf_scores: dict[str, float] = {}
+        for rank, result in enumerate(dense_results, start=1):
+            content = result["content"]
+            rrf_scores[content] = rrf_scores.get(content, 0.0) + 1 / (self.k + rank)
+        for rank, result in enumerate(sparse_results, start=1):
+            content = result["content"]
+            rrf_scores[content] = rrf_scores.get(content, 0.0) + 1 / (self.k + rank)
+
+        return rrf_scores
+
+    def _dedupe_and_merge(
+        self, dense_results: list[SearchResult], sparse_results: list[SearchResult]
+    ) -> dict[str, SearchResult]:
+        results_map: dict[str, SearchResult] = {}
+
+        for result in dense_results:
+            if result["content"] not in results_map:
+                results_map[result["content"]] = result
+
+        for result in sparse_results:
+            if result["content"] not in results_map:
+                results_map[result["content"]] = result
+
+        return results_map
+
+    def rerank(
+        self,
+        dense_results: list[SearchResult],
+        sparse_results: list[SearchResult],
+        limit: int = 1,
+    ) -> list[SearchResult]:
+        rrf_scores = self._reciprocal_rank_fusion(dense_results, sparse_results)
+        results_map = self._dedupe_and_merge(dense_results, sparse_results)
+        reranked_results: list[SearchResult] = []
+        for content, result in results_map.items():
+            result_copy = result.copy()
+            result_copy["score"] = rrf_scores[content]
+            reranked_results.append(result_copy)
+        reranked_results.sort(key=lambda x: x["score"], reverse=True)
+        return reranked_results[:limit]
+
+
+class VectorDB:
+    def __init__(
+        self,
+        qdrant_client: AsyncQdrantClient,
+        collection_name: str,
+        embedder: Embedder,
+        rrf_constant: int = 60,
+    ) -> None:
+        self._client = qdrant_client
+        self.collection_name = collection_name
+        self.embedder = embedder
+        self._reranker = SimpleReranker(k=rrf_constant)
+
+    async def configure_collection(self) -> None:
+        if await self._client.collection_exists(self.collection_name):
+            return None
+        else:
+            await self._client.create_collection(
+                collection_name=self.collection_name,
+                vectors_config={
+                    "dense-text": VectorParams(size=768, distance=Distance.COSINE)
+                },
+                sparse_vectors_config={
+                    "sparse-text": SparseVectorParams(
+                        index=SparseIndexParams(on_disk=False)
+                    )
+                },
+            )
+
+    async def check_if_loaded(self) -> bool:
+        if not await self._client.collection_exists(self.collection_name):
+            return False
+        collection = await self._client.get_collection(self.collection_name)
+        return collection.points_count is not None and collection.points_count > 0
+
+    async def upload(self, data: list[ChunkWithMetadata]) -> None:
+        sparse_embeddings: list[dict[str, SparseVector]] = []
+        dense_embeddings: list[dict[str, list[float]]] = []
+        payloads: list[dict[str, str]] = []
+        for d in data:
+            assert d["sparse_embedding"] is not None
+            sparse_embedding = {
+                "sparse-text": SparseVector(
+                    indices=d["sparse_embedding"].indices.tolist(),
+                    values=d["sparse_embedding"].values.tolist(),
+                )
+            }
+            dense_embedding = {"dense-text": d["embedding"]}
+            payload = {"content": d["chunk"].text, "file_path": d["file_path"]}
+            sparse_embeddings.append(sparse_embedding)
+            dense_embeddings.append(dense_embedding)
+            payloads.append(payload)
+        self._client.upload_collection(
+            self.collection_name,
+            vectors=dense_embeddings,
+            payload=payloads,
+            ids=range(len(dense_embeddings)),
+        )
+        self._client.upload_collection(
+            self.collection_name,
+            vectors=sparse_embeddings,
+            payload=payloads,
+            ids=range(
+                len(dense_embeddings), len(dense_embeddings) + len(sparse_embeddings)
+            ),
+        )
+
+    async def search(
+        self, query: str, file_path: str | None = None, limit: int = 1
+    ) -> list[SearchResult]:
+        dense_embedding = await self.embedder.embed_query(query)
+        sparse_embedding = self.embedder.sparse_embed_query(query)
+        if file_path:
+            filt = Filter(
+                must=FieldCondition(key="file_path", match=MatchValue(value=file_path))
+            )
+        else:
+            filt = None
+        result_dense = await self._client.query_points(
+            collection_name=self.collection_name,
+            query=dense_embedding,
+            using="dense-text",
+            query_filter=filt,
+        )
+        result_sparse = await self._client.query_points(
+            collection_name=self.collection_name,
+            query=SparseVector(
+                indices=sparse_embedding.indices.tolist(),
+                values=sparse_embedding.values.tolist(),
+            ),
+            using="sparse-text",
+            query_filter=filt,
+        )
+        dense_results: list[SearchResult] = []
+        sparse_results: list[SearchResult] = []
+        for point in result_dense.points:
+            if point.payload is not None:
+                result = SearchResult(
+                    id=cast(int, point.id),
+                    content=point.payload.get("content", ""),
+                    file_path=point.payload.get("file_path", ""),
+                    score=point.score,
+                    type="dense",
+                )
+                dense_results.append(result)
+        for point in result_sparse.points:
+            if point.payload is not None:
+                result = SearchResult(
+                    id=cast(int, point.id),
+                    content=point.payload.get("content", ""),
+                    file_path=point.payload.get("file_path", ""),
+                    score=point.score,
+                    type="sparse",
+                )
+                sparse_results.append(result)
+        return self._reranker.rerank(dense_results, sparse_results, limit)
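Note on the fusion step in `SimpleReranker`: the scoring it applies is reciprocal rank fusion, where each item receives `1 / (k + rank)` from every ranking it appears in and the contributions are summed. The following is a minimal standalone sketch of that formula, not the code from this PR; `reciprocal_rank_fusion` and the `chunk-*` identifiers are hypothetical names used only for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking in which an item appears."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, matching enumerate(..., start=1) in the reranker
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical result lists from a dense and a sparse search
dense = ["chunk-a", "chunk-b", "chunk-c"]
sparse = ["chunk-b", "chunk-d"]

scores = reciprocal_rank_fusion([dense, sparse])
# Items found by both searches accumulate two contributions
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

An item returned by both searches, such as `chunk-b` here, outranks an item that tops only one list, which is the behaviour hybrid search relies on.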
@@ -9,6 +9,7 @@ description = "Explore and understand your filesystem better with AI."
 readme = "README.md"
 requires-python = ">=3.10"
 dependencies = [
+    "diskcache>=5.6.3",
    "google-genai>=1.55.0",
    "llama-cloud-services>=0.6.88",
    "llama-index-workflows>=2.11.5",
@@ -18,6 +19,15 @@ dependencies = [
 [tool.uv.build-backend]
 module-name = "fs_explorer"

+[tool.uv.workspace]
+members = [
+    "packages/rag-starterkit",
+    "packages/eval-framework",
+]
+
+[tool.uv.sources]
+rag-starterkit = {workspace = true}
+
 [dependency-groups]
 dev = [
    "pre-commit>=4.5.0",
@@ -1,6 +1,6 @@
 import os
 from typing import Callable, Any, cast
-from google.genai.types import Content, HttpOptions, Part
+from google.genai.types import Content, Part
 from google.genai import Client as GenAIClient
 from .models import Action, ActionType, ToolCallAction, Tools
 from .fs import read_file, grep_file_content, glob_paths, parse_file, check_api_key
@@ -23,7 +23,7 @@ Every time, you will be asked to take one of the following actions:
    + `grep`: grep the content of a file, providing its path and the pattern (`file_path` and `pattern` parameters, both strings)
    + `glob`: list files within a directory that comply with a certain pattern, providing the directory path and the pattern to search for (`directory` and `pattern` parameters, both strings)
    + `check_api_key`: check whether or not the `LLAMA_CLOUD_API_KEY` is set before using the `parse_file` tool. No paramaeter needed for this tool. Use only once per session, as you can assume that the API key will not change status throughout the course of the session.
-    + `parse_file`: read the content of an **unstructured file** (allowed extensions: .pdf, .doc, .docx, .pptx, .xlsx). Call only if `LLAMA_CLOUD_API_KEY` is set within the environment.
+    + `parse_file`: read the content of an **unstructured file** (allowed extensions: .pdf, .doc, .docx, .pptx, .xlsx). Call only if `LLAMA_CLOUD_API_KEY` is set within the environment or if a cache with files is ready.
 - Go deeper - go one level deeper in the filesystem, accessing a subfolder of the folder you are currently exploring
 - Ask human - ask a question to the user in order to clarify their intent for a task or if you are uncertain about how to proceed when you reached a certain point. This should be treated as an emergency measure, and you should try to not use human help unless you **really** need it.
 - Stop - you have reached your goal, so you can exit, returning to the user with a final result of all the operations
@@ -40,9 +40,7 @@ class FsExplorerAgent:
            raise ValueError(
                "GOOGLE_API_KEY not found within the current environment: please export it or provide it to the class constructor."
            )
-        self._client = GenAIClient(
-            api_key=api_key, http_options=HttpOptions(api_version="v1")
-        )
+        self._client = GenAIClient(api_key=api_key)
        self._chat_history: list[Content] = [
            Content(role="system", parts=[Part.from_text(text=SYSTEM_PROMPT)])
        ]
@@ -0,0 +1,95 @@
+import os
+import asyncio
+import logging
+
+from typing import cast
+from diskcache import Cache
+from pathlib import Path
+from llama_cloud_services.parse.utils import ResultType
+from llama_cloud_services.parse.types import JobResult
+from llama_cloud_services import LlamaParse
+
+CACHING_DIR = Path("tmp/cache")
+
+
+class ParsedFileCache:
+    def __init__(self) -> None:
+        self._cache = Cache(directory=str(CACHING_DIR))
+        self._is_warmed_up = CACHING_DIR.is_dir()
+
+    def warmup(self) -> None:
+        if not self._is_warmed_up:
+            os.makedirs(CACHING_DIR, exist_ok=True)
+            self._is_warmed_up = True
+        return None
+
+    @property
+    def is_empty(self) -> bool:
+        return len(self._cache) == 0
+
+    def add_file(self, file_path: str, content: str) -> None:
+        resolved_path = str(Path(file_path).resolve())
+        self._cache.add(resolved_path, content)
+
+    def get_file(self, file_path: str) -> str | None:
+        resolved_path = str(Path(file_path).resolve())
+        return cast(str | None, self._cache.get(resolved_path))
+
+    def close(self) -> None:
+        self._cache.close()
+
+
+CACHE = ParsedFileCache()
+
+
+async def parse_and_cache(directory: str, recursive: bool, to_skip: list[str]) -> None:
+    logging.basicConfig(
+        filename="fs-explorer.log",
+        filemode="w",
+        level=logging.INFO,
+        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+    )
+    CACHE.warmup()
+    dir_path = Path(directory)
+    to_skip_resolved = [str((dir_path / path).resolve()) for path in to_skip]
+    if not recursive:
+        files = []
+        for fl in os.listdir(dir_path):
+            resolved = str((dir_path / fl).resolve())
+            # os.listdir also lists subdirectories: keep regular files only
+            if os.path.isfile(resolved) and resolved not in to_skip_resolved:
+                files.append(resolved)
+    else:
+        files = []
+        for root, dirs, fls in os.walk(dir_path):
+            # Prune skipped directories in place (bare names) so that
+            # os.walk does not descend into them
+            dirs[:] = [
+                d
+                for d in dirs
+                if str((Path(root) / d).resolve()) not in to_skip_resolved
+            ]
+            for fl in fls:
+                # Compare resolved paths: to_skip_resolved holds absolute paths
+                resolved = str((Path(root) / fl).resolve())
+                if resolved not in to_skip_resolved:
+                    files.append(resolved)
+    semaphore = asyncio.Semaphore(5)
+    parser = LlamaParse(
+        api_key=cast(str, os.getenv("LLAMA_CLOUD_API_KEY")),
+        result_type=ResultType.TXT,
+        fast_mode=True,
+    )
+
+    async def parse_job(file_path: str) -> None:
+        async with semaphore:
+            result = cast(JobResult, await parser.aparse(file_path=file_path))
+            if result.error is None:
+                text = await result.aget_text()
+                CACHE.add_file(file_path, text)
+            else:
+                logging.error(
+                    f"Could not parse file {file_path} because of {result.error}"
+                )
+
+    await asyncio.gather(*(parse_job(file) for file in files))
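Note on the recursive traversal in `parse_and_cache`: skipping paths during `os.walk` depends on comparing like with like (resolved path against resolved path) and on leaving bare directory names in `dirs` when pruning in place. The following is a minimal sketch of that pattern using only the standard library; `collect_files` and the temporary layout are hypothetical and are not part of this PR.

```python
import os
import tempfile
from pathlib import Path


def collect_files(directory: str, to_skip: list[str]) -> list[str]:
    """Return resolved file paths under `directory`, honouring `to_skip`."""
    base = Path(directory)
    # Skip entries are given relative to the target directory
    skipped = {str((base / p).resolve()) for p in to_skip}
    files: list[str] = []
    for root, dirs, names in os.walk(base):
        # Mutating dirs in place tells os.walk which subdirectories to visit
        dirs[:] = [d for d in dirs if str((Path(root) / d).resolve()) not in skipped]
        for name in names:
            resolved = str((Path(root) / name).resolve())
            if resolved not in skipped:
                files.append(resolved)
    return sorted(files)


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "keep").mkdir()
    (Path(tmp) / "skipme").mkdir()
    (Path(tmp) / "keep" / "a.txt").write_text("a")
    (Path(tmp) / "skipme" / "b.txt").write_text("b")
    (Path(tmp) / "c.txt").write_text("c")
    found = collect_files(tmp, ["skipme", "c.txt"])
    names_found = [Path(f).name for f in found]
    print(names_found)
```

Only the file inside `keep` survives: the `skipme` directory is never entered and `c.txt` is filtered by its resolved path.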
@@ -6,6 +6,7 @@ from typing import cast
 from llama_cloud_services.parse.utils import ResultType
 from llama_cloud_services.parse.types import JobResult
 from llama_cloud_services import LlamaParse
+from .caching import CACHE, CACHING_DIR


 def describe_dir_content(directory: str) -> str:
@@ -60,18 +61,27 @@ def glob_paths(directory: str, pattern: str) -> str:


 def check_api_key() -> str:
-    return (
-        "LLAMA_CLOUD_API_KEY is set and you can use the 'parse_file' tool"
-        if os.getenv("LLAMA_CLOUD_API_KEY") is not None
-        else "LLAMA_CLOUD_API_KEY is not set and you cannot use the 'parse_file' tool"
-    )
+    message = ""
+    if os.getenv("LLAMA_CLOUD_API_KEY") is not None:
+        message += "LLAMA_CLOUD_API_KEY is set and you can use the 'parse_file' tool"
+        if CACHING_DIR.is_dir():
+            message += " in all its functionalities"
+        return message
+    else:
+        if CACHING_DIR.is_dir():
+            message += "LLAMA_CLOUD_API_KEY is not set and you can use 'parse_file', but you will only have access to cached files. You should try to use the tool nevertheless."
+        else:
+            message += "LLAMA_CLOUD_API_KEY is not set and you cannot use the 'parse_file' tool"
+        return message


 async def parse_file(file_path: str) -> str:
    if not os.path.exists(file_path) or not os.path.isfile(file_path):
        return f"No such file: {file_path}"
+    if (content := CACHE.get_file(file_path)) is not None:
+        return content
    if os.getenv("LLAMA_CLOUD_API_KEY") is None:
-        return f"Not possible to parse {file_path} as the necessary credentials (`LLAMA_CLOUD_API_KEY`) are not set in the environment"
+        return f"Not possible to parse {file_path} because it has not been cached and the necessary credentials (`LLAMA_CLOUD_API_KEY`) are not set in the environment"
    parser = LlamaParse(
        api_key=cast(str, os.getenv("LLAMA_CLOUD_API_KEY")),
        result_type=ResultType.TXT,
@@ -15,6 +15,7 @@ from .workflow import (
    AskHumanEvent,
    HumanAnswerEvent,
 )
+from .caching import parse_and_cache, CACHE

 app = Typer()

@@ -76,7 +77,10 @@ async def run_workflow(task: str):
    return None


-@app.command()
+@app.command(
+    name="run",
+    help="Run the exploration with a specific task",
+)
 def main(
    task: Annotated[
        str,
@@ -88,3 +92,75 @@ def main(
    ],
 ) -> None:
    asyncio.run(run_workflow(task))
+
+
+@app.command(
+    name="load-cache",
+    help="Parse all the files in a directory at once (also recursively) and add them to a persistent cache for faster retrieval at agent runtime",
+)
+def load_cache(
+    directory: Annotated[
+        str,
+        Option(
+            "--directory",
+            "-d",
+            help="Directory containing the files to parse and load to cache. Defaults to current working directory.",
+        ),
+    ] = ".",
+    recursive: Annotated[
+        bool,
+        Option(
+            "--recursive/--no-recursive",
+            "-r",
+            help="Find files recursively within the target directory",
+            is_flag=True,
+        ),
+    ] = False,
+    to_skip: Annotated[
+        list[str],
+        Option(
+            "--skip",
+            "-s",
+            help="Skip one or more directories or files within the target directory. The path should be relative to the target directory (e.g. `testfile.txt` and not `data/testfile.txt` if `data` is the target directory). Can be used multiple times. Defaults to an empty list. Applies both with and without `--recursive`.",
+        ),
+    ] = [],
+) -> None:
+    asyncio.run(parse_and_cache(directory, recursive, to_skip))
+
+
+@app.command(
+    name="get-cached",
+    help="Get the content of a cached file, if it exists",
+)
+def get_cached(
+    file: Annotated[
+        str,
+        Option(
+            "--file",
+            "-f",
+            help="The cached file whose content should be retrieved",
+        ),
+    ],
+    max_chars: Annotated[
+        int,
+        Option("--max", "-m", help="Max characters to display. Defaults to 10000"),
+    ] = 10000,
+) -> None:
+    content = CACHE.get_file(file)
+    console = Console()
+    if content is not None:
+        content = (
+            content[:max_chars] + "\n\nCONTINUES..."
+            if len(content) > max_chars
+            else content
+        )
+        markdown = Markdown(content)
+        panel = Panel(
+            markdown,
+            title_align="left",
+            title=f"Content for {file}",
+            border_style="bold",
+        )
+        console.print(panel)
+    else:
+        console.print(f"[bold yellow]No cached content for {file}[/]")
@@ -1,7 +1,6 @@
 import pytest
 import os

-from unittest.mock import patch
 from fs_explorer.fs import (
    describe_dir_content,
    read_file,
@@ -18,7 +17,7 @@ def test_describe_dir_content() -> None:
        == "Content of tests/testfiles\nFILES:\n- tests/testfiles/file1.txt\n- tests/testfiles/file2.md\nSUBFOLDERS:\n- tests/testfiles/last"
    )
    description = describe_dir_content("tests/testfile")
-    assert description == f"No such directory: tests/testfile"
+    assert description == "No such directory: tests/testfile"
    description = describe_dir_content("tests/testfiles/last")
    assert (
        description
@@ -71,5 +70,5 @@ async def test_parse_file_without_api_key() -> None:
    content = await parse_file("data/testfile.txt")
    assert (
        content
-        == "Not possible to parse data/testfile.txt as the necessary credentials (`LLAMA_CLOUD_API_KEY`) are not set in the environment"
+        == "Not possible to parse data/testfile.txt because it has not been cached and the necessary credentials (`LLAMA_CLOUD_API_KEY`) are not set in the environment"
    )