Label-Free Embedding Fine-Tuning for RAG
Why fine-tune embeddings
Like everyone else, you build your RAG system with OpenAI embedding. But is that the end all? What is my data is very domain specific, with a lot of technical jargons not commonly seen on the internet. Or the formatting of the file is very esoteric. Or the retrieval just sucks, and you want to make it better.
You realize the generation is good. It's actually the retrieval that's failing. You have some domain specific dataset, now you want to finetune the embedding model on it.
How to run this
In your environment, run pip install -r requirements.txt to install necessary requirements.
- Run
generate_dataset.ipynbto generate a synthetic dataset from a corpus of documents. - Run
fine_tune.ipynbto fine-tune the embedding, and evaluate on the validation set.
That's it.
How this works
The key idea is to use LLM to generate synthetic positive example pairs.
Step 1: generate synthetic
notebook for generating a synthetic dataset of question & answer pairs.
We split this into train/val split.
Step 2: setup evaluation
We consider two ways of setting up the evaluator:
- evaluator from sentence transformer: the benefit of this is that you can directly get access to a suite of evaluation
- custom evaluator: the benefit for this is that you can evaluate arbitrary embedding models (include openAI embeddings)
Step 3: Fine-tune your embeddings
fine_tune.ipyb Once we have our training dataset, to fine-tune is super straightforward. We just need to select a reasonable loss function, and have a good strategy for when to stop training.