Compare commits

...

11 Commits

Author SHA1 Message Date
Eugene Yurtsev c1c5585d3a Fix list of env variables in benchmark all notebook (#173)
Fix list of env variables
2024-04-10 22:06:44 -04:00
ccurme c45993617b add tool calling benchmark notebook (#171)
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-04-10 22:03:19 -04:00
Eugene Yurtsev d13b33e956 Update deps (#170)
Update deps
2024-04-10 09:47:14 -04:00
Eugene Yurtsev 20a4aee5c1 Add factory for regular tool using agents (#169)
add factory for regular tool using agents

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-04-10 09:27:32 -04:00
Eugene Yurtsev 4139ac8632 update model providers (#168)
* Update packages to be used with different providers
* Register Anthropic models
2024-04-09 17:44:02 -04:00
Eugene Yurtsev 89be01737d update dependencies (#167)
Update dependencies
2024-04-09 17:17:25 -04:00
Bagatur 29e4e878a4 docs: add high cardinality links (#166) 2024-03-13 23:39:42 -07:00
Bagatur ffc2832088 docs: include high cardinality (#165) 2024-03-13 23:09:07 -07:00
Bagatur 8b5feab7b2 Add high cardinality benchmark (#164) 2024-03-08 09:10:03 -08:00
Konjeti Maruthi a805c985a6 Missing Word in comparing_techniques.ipynb (#160)
Fixing a missing word in
https://langchain-ai.github.io/langchain-benchmarks/notebooks/retrieval/comparing_techniques.html

The sentence after the heading is incomplete since I have added the word
`documents` which would complete the sentence.

Before changing:
<img width="527" alt="LangChainFix"
src="https://github.com/langchain-ai/langchain-benchmarks/assets/63769209/4859bbf0-19ae-4b87-830d-85f6242b9b61">
2024-02-16 15:23:11 -05:00
Eugene Yurtsev c0ac497ed4 Update README.md to fix archived links (#162) 2024-02-06 12:35:26 -05:00
19 changed files with 3716 additions and 1989 deletions
@@ -39,7 +39,7 @@ jobs:
- name: Install dependencies
shell: bash
run: poetry install
run: poetry install --with test
- name: Install the opposite major version of pydantic
# If normal tests use pydantic v1, here we'll use v2, and vice versa.
+1 -1
View File
@@ -158,5 +158,5 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.idea/
.DS_Store
+4 -4
View File
@@ -47,10 +47,10 @@ The other directories are legacy and may be moved in the future.
Below are archived benchmarks that require cloning this repo to run.
- [CSV Question Answering](https://github.com/langchain-ai/langchain-benchmarks/tree/main/csv-qa)
- [Extraction](https://github.com/langchain-ai/langchain-benchmarks/tree/main/extraction)
- [Q&A over the LangChain docs](https://github.com/langchain-ai/langchain-benchmarks/tree/main/langchain-docs-benchmarking)
- [Meta-evaluation of 'correctness' evaluators](https://github.com/langchain-ai/langchain-benchmarks/tree/main/meta-evals)
- [CSV Question Answering](https://github.com/langchain-ai/langchain-benchmarks/tree/main/archived/csv-qa)
- [Extraction](https://github.com/langchain-ai/langchain-benchmarks/tree/main/archived/extraction)
- [Q&A over the LangChain docs](https://github.com/langchain-ai/langchain-benchmarks/tree/main/archived/langchain-docs-benchmarking)
- [Meta-evaluation of 'correctness' evaluators](https://github.com/langchain-ai/langchain-benchmarks/tree/main/archived/meta-evals)
## Related
+1 -1
View File
@@ -1194,7 +1194,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.11.4"
}
},
"nbformat": 4,
@@ -0,0 +1,749 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e4d1cb60-6d32-4337-abee-1b6c794b7f4c",
"metadata": {},
"source": [
"# Extracting high-cardinality categoricals\n",
"\n",
"Suppose we built a book recommendation chatbot, and as part of it we want to extract and filter on author name if that's part of the user input. A user might ask a question like:\n",
"\n",
"> \"what are books about aliens by Steven King\"\n",
"\n",
"If we're not careful, our extraction system would most likely extract the author name \"Steven King\" from this input. This might cause us to miss all the most relevant book results, since the user was almost certainly looking for books by *Stephen King*.\n",
"\n",
"This is a case of having to extract a **high-cardinality categorical** value. Given a dataset of books and their respective authors, there's a large but finite number of valid author names, and we need some way of making sure our extraction system outputs valid and relevant author names even if the user input refers to invalid names. \n",
"\n",
"We've built a dataset to help benchmark different approaches for dealing with this challenge. The dataset is simple: it is a collection of 23 mispelled and corrected human names. To use it for high-cardinality categorical testing, we're going to generate a large set of valid names (~10,000) that includes the correct spellings of all the names in the dataset. Using this, we'll test the ability of various extraction systems to extract a corrected name from the user question:\n",
"\n",
"> \"what are books about aliens by {mispelled_name}\"\n",
"\n",
"where for each datapoint in our dataset, we'll use the mispelled name as the input and expect the corrected name as the extracted output."
]
},
{
"cell_type": "markdown",
"id": "dbe58c19-c29d-41d8-844a-b03c6ee1e07a",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"We need to install a few packages and set some env vars first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a478941-ca99-40ee-b4f0-635f74d94a11",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-benchmarks langchain-openai faker chromadb numpy scikit-learn"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c0aa002-c334-4c51-bdf9-ffe9ae7bd56f",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9c3dc147-2681-437e-8a26-204f10ed4d41",
"metadata": {},
"outputs": [],
"source": [
"from operator import attrgetter\n",
"\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"from langchain_openai import ChatOpenAI\n",
"from langsmith import Client\n",
"\n",
"from langchain_benchmarks import registry"
]
},
{
"cell_type": "markdown",
"id": "318e0ed7-1ab5-4219-9223-900b250066de",
"metadata": {},
"source": [
"This is the `Name Correction` benchmark in langchain-benchmarket:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3f2be995-b6a9-4c3d-a19f-001c0e05ac9c",
"metadata": {},
"outputs": [],
"source": [
"client = Client()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cd3d005c-9b60-4bc6-a467-815e7e3bbc7c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://smith.langchain.com/public/78df83ee-ba7f-41c6-832c-2b23327d4cf7/d'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"task = registry[\"Name Correction\"]\n",
"task.dataset_url"
]
},
{
"cell_type": "markdown",
"id": "fc4d14ea-6a46-43b1-a0ac-8e632e1297d2",
"metadata": {},
"source": [
"**NOTE**: If you are running this notebook for the first time, clone the public dataset into your LangSmith organization by uncommenting the below:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dca18a40-85f1-4911-9e41-936975fbddf8",
"metadata": {},
"outputs": [],
"source": [
"# client.clone_public_dataset(task.dataset_url)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3f9ad08e-69cc-436e-94f9-b0e1e2c4a9d1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'name': 'Tracy Cook'} {'name': 'Traci Cook'}\n",
"{'name': 'Dan Klein'} {'name': 'Daniel Klein'}\n",
"{'name': 'Jen Mcintosh'} {'name': 'Jennifer Mcintosh'}\n",
"{'name': 'Cassie Hull'} {'name': 'Cassandra Hull'}\n",
"{'name': 'Andy Williams'} {'name': 'Andrew Williams'}\n"
]
}
],
"source": [
"examples = list(client.list_examples(dataset_name=task.dataset_name))\n",
"for example in examples[:5]:\n",
" print(example.inputs, example.outputs)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "35c85a6f-5d8d-4018-9b83-b6cab0587c1c",
"metadata": {},
"outputs": [],
"source": [
"def run_on_dataset(chain, run_name):\n",
" client.run_on_dataset(\n",
" dataset_name=task.dataset_name,\n",
" llm_or_chain_factory=chain,\n",
" evaluation=task.eval_config,\n",
" project_name=run_name,\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "4fd7318a-4195-4da8-94d7-34ee6b7c2097",
"metadata": {},
"source": [
"## Augmenting with more fake names\n",
"\n",
"For our tests we'll create a list of 10,000 names that represent all the possible values for this category. This will include our target names from the dataset."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "06098983-f5cf-4de3-ae07-4cdbe091522c",
"metadata": {},
"outputs": [],
"source": [
"from faker import Faker\n",
"\n",
"Faker.seed(42)\n",
"fake = Faker()\n",
"fake.seed_instance(0)\n",
"\n",
"incorrect_names = [example.inputs[\"name\"] for example in examples]\n",
"correct_names = [example.outputs[\"name\"] for example in examples]\n",
"\n",
"# We'll make sure that our list of valid names contains the correct spellings\n",
"# and not the incorrect spellings from our dataset\n",
"valid_names = list(\n",
" set([fake.name() for _ in range(10_000)] + correct_names).difference(\n",
" incorrect_names\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ab6d9b4b-717b-4947-ac17-a100a0ced088",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9382"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(valid_names)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "6e7d27bf-c82c-43e1-961a-ea67733b1dec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Debra Lee', 'Kevin Harper', 'Donald Anderson']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"valid_names[:3]"
]
},
{
"cell_type": "markdown",
"id": "bd801ab5-b2a4-49bc-9c11-698dc760eb28",
"metadata": {},
"source": [
"## Chain 1: Baseline\n",
"\n",
"As a baseline we'll create a function-calling chain that has no information about the set of valid names."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "1e0694d9-d67d-4f90-b40c-f8373389f5c4",
"metadata": {},
"outputs": [],
"source": [
"class Search(BaseModel):\n",
" query: str\n",
" author: str\n",
"\n",
"\n",
"system = \"\"\"Generate a relevant search query for a library system\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", \"{system}\"),\n",
" (\"human\", \"what are books about aliens by {name}\"),\n",
" ]\n",
")\n",
"llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", temperature=0)\n",
"structured_llm = llm.with_structured_output(Search)\n",
"\n",
"query_analyzer_1 = (\n",
" prompt.partial(system=system) | structured_llm | {\"name\": attrgetter(\"author\")}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f4a4d81f-532a-4efb-86cb-cc0555dbc4e7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'GPT-3.5' at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6/compare?selectedSessions=f429ec84-b879-4e66-b7fb-ef7be69d1acd\n",
"\n",
"View all tests for Dataset Extracting Corrected Names at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6\n",
"[------------------------------------------------->] 23/23"
]
}
],
"source": [
"run_on_dataset(query_analyzer_1, \"GPT-3.5\")"
]
},
{
"cell_type": "markdown",
"id": "b4f42968-069f-450b-a03b-f47934956f89",
"metadata": {},
"source": [
"As we might have expected, this gives us a `Correct rate: 0%`. Let's see if we can do better :)\n",
"\n",
"See the test run in LangSmith [here](https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d/compare?selectedSessions=f429ec84-b879-4e66-b7fb-ef7be69d1acd)."
]
},
{
"cell_type": "markdown",
"id": "08ef2fc6-0ad9-4a3e-a306-bd7100f7b1fb",
"metadata": {},
"source": [
"## Chain 2: All candidates in prompt\n",
"\n",
"Next, let's dump the full list of valid names in the system prompt. We'll need a model with a longer context window than the 16k token window of gpt-3.5-turbo-0125 so we'll use gpt-4-0125-preview."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "d0f65f4f-5461-43b1-9c7b-5fcdaf48c2ce",
"metadata": {},
"outputs": [],
"source": [
"valid_names_str = \"\\n\".join(valid_names)\n",
"\n",
"system_2 = \"\"\"Generate a relevant search query for a library system.\n",
"\n",
"`author` attribute MUST be one of:\n",
"\n",
"{valid_names_str}\n",
"\n",
"Do NOT hallucinate author name!\"\"\"\n",
"\n",
"formatted_system = system_2.format(valid_names_str=valid_names_str)\n",
"structured_llm_2 = ChatOpenAI(\n",
" model=\"gpt-4-0125-preview\", temperature=0\n",
").with_structured_output(Search)\n",
"query_analyzer_2 = (\n",
" prompt.partial(system=formatted_system)\n",
" | structured_llm_2\n",
" | {\"name\": attrgetter(\"author\")}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "de679906-c69d-4ceb-bc5e-73a291b21cdc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'GPT-4, all names in prompt' at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6/compare?selectedSessions=8c4cfdfc-3646-438e-be47-43a40d66292a\n",
"\n",
"View all tests for Dataset Extracting Corrected Names at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6\n",
"[------------------------------------------------->] 23/23"
]
}
],
"source": [
"run_on_dataset(query_analyzer_2, \"GPT-4, all names in prompt\")"
]
},
{
"cell_type": "markdown",
"id": "eb678fdd-0e57-4063-adea-56248aea11e5",
"metadata": {},
"source": [
"This gets us up to `Correct rate: 26%`.\n",
"\n",
"See the test run in LangSmith [here](https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d/compare?selectedSessions=8c4cfdfc-3646-438e-be47-43a40d66292a)."
]
},
{
"cell_type": "markdown",
"id": "0aa394b5-a665-4f4c-809d-c0d756c9b23e",
"metadata": {},
"source": [
"## Chain 3: Top k candidates from vectorstore in prompt\n",
"\n",
"10,000 names is a lot to have in the prompt. Perhaps we could get better performance by shortening the list using vector search first to only include names that have the highest similarity to the user question. We can return to using GPT-3.5 as a result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9439e3f-5aa2-45b7-ab1f-149060744e03",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import Chroma\n",
"from langchain_core.prompts import PromptTemplate\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"k = 10\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n",
"vectorstore = Chroma.from_texts(valid_names, embeddings, collection_name=\"author_names\")\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": k})"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "04018b30-2378-4c96-8515-39d66c554459",
"metadata": {},
"outputs": [],
"source": [
"system_chain = (\n",
" (lambda name: f\"what are books about aliens by {name}\")\n",
" | retriever\n",
" | (\n",
" lambda docs: system_2.format(\n",
" valid_names_str=\"\\n\".join(d.page_content for d in docs)\n",
" )\n",
" )\n",
")\n",
"query_analyzer_3 = (\n",
" RunnablePassthrough.assign(system=system_chain)\n",
" | prompt\n",
" | structured_llm\n",
" | {\"name\": attrgetter(\"author\")}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "fd5af75e-41fa-42ee-b9ac-62eb13e21022",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'GPT-3.5, top 10 names in prompt, vecstore' at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6/compare?selectedSessions=af93ec50-ccbb-4b3c-908a-70c75e5516ea\n",
"\n",
"View all tests for Dataset Extracting Corrected Names at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6\n",
"[------------------------------------------------->] 23/23"
]
}
],
"source": [
"run_on_dataset(query_analyzer_3, f\"GPT-3.5, top {k} names in prompt, vecstore\")"
]
},
{
"cell_type": "markdown",
"id": "b7e0f097-7432-4728-a60b-b980046c1275",
"metadata": {},
"source": [
"This gets us up to `Correct rate: 57%`\n",
"\n",
"See the test run in LangSmith [here](https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d/compare?selectedSessions=af93ec50-ccbb-4b3c-908a-70c75e5516ea)."
]
},
{
"cell_type": "markdown",
"id": "20aaa33a-d475-41a1-8f1a-53e18382b3d7",
"metadata": {},
"source": [
"## Chain 4: Top k candidates by ngram overlap in prompt\n",
"\n",
"Instead of using vector search, which requires embeddings and vector stores, a cheaper and faster approach would be to compare ngram overlap between the user question and the list of valid names:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "05b2fc1c-0f61-4638-bbf5-fed5b634db51",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"\n",
"# Function to generate character n-grams\n",
"def ngrams(string, n=3):\n",
" string = \"START\" + string.replace(\" \", \"\").lower() + \"END\"\n",
" ngrams = zip(*[string[i:] for i in range(n)])\n",
" return [\"\".join(ngram) for ngram in ngrams]\n",
"\n",
"\n",
"# Vectorize documents using TfidfVectorizer with the custom n-grams function\n",
"vectorizer = TfidfVectorizer(analyzer=ngrams)\n",
"tfidf_matrix = vectorizer.fit_transform(valid_names)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "2994aff8-4bfd-4cf3-9b73-2bda7c470ba4",
"metadata": {},
"outputs": [],
"source": [
"def get_names(query):\n",
" # Vectorize query\n",
" query_tfidf = vectorizer.transform([query])\n",
"\n",
" # Compute cosine similarity\n",
" cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()\n",
"\n",
" # Find the index of the most similar document\n",
" most_similar_document_indexes = np.argsort(-cosine_similarities)\n",
"\n",
" return \"\\n\".join([valid_names[i] for i in most_similar_document_indexes[:k]])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "a549a347-1449-4ae2-a30d-e8f0b917d50e",
"metadata": {},
"outputs": [],
"source": [
"def get_system_prompt(input):\n",
" name = input[\"name\"]\n",
" valid_names_str = get_names(f\"what are books about aliens by {name}\")\n",
" return system_2.format(valid_names_str=valid_names_str)\n",
"\n",
"\n",
"query_analyzer_4 = (\n",
" RunnablePassthrough.assign(system=get_system_prompt)\n",
" | prompt\n",
" | structured_llm\n",
" | {\"name\": attrgetter(\"author\")}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "dd1b69a8-5ca6-4a2d-9ad3-567d0105b672",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'GPT-3.5, top 10 names in prompt, ngram' at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6/compare?selectedSessions=bc28b761-2ac9-4391-8df1-758f0a4d5100\n",
"\n",
"View all tests for Dataset Extracting Corrected Names at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6\n",
"[------------------------------------------------->] 23/23"
]
}
],
"source": [
"run_on_dataset(query_analyzer_4, f\"GPT-3.5, top {k} names in prompt, ngram\")"
]
},
{
"cell_type": "markdown",
"id": "b4e16c1b-33d5-4ca1-932b-8234ffc668bf",
"metadata": {},
"source": [
"This gets us up to `Correct rate: 65%`\n",
"\n",
"See the test run in LangSmith [here](https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d/compare?selectedSessions=bc28b761-2ac9-4391-8df1-758f0a4d5100)."
]
},
{
"cell_type": "markdown",
"id": "d3045376-e102-4ec6-877a-91448677f3f3",
"metadata": {},
"source": [
"## Chain 5: Replace with top candidate from vectorstore\n",
"\n",
"Instead of (or in addition to) searching for similar candidates before extraction, we can also compare and correct the extracted value after-the-fact a search over the valid names. With Pydantic classes this is easy using a validator:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "ac719651-0775-4fa4-bd22-9fddebcc6918",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.pydantic_v1 import validator\n",
"\n",
"\n",
"class Search(BaseModel):\n",
" query: str\n",
" author: str\n",
"\n",
" @validator(\"author\")\n",
" def double(cls, v: str) -> str:\n",
" return vectorstore.similarity_search(v, k=1)[0].page_content\n",
"\n",
"\n",
"structured_llm_3 = llm.with_structured_output(Search)\n",
"query_analyzer_5 = (\n",
" prompt.partial(system=system) | structured_llm_3 | {\"name\": attrgetter(\"author\")}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "fc1cfdcb-47fb-40c4-898d-f290cd53a37d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'GPT-3.5, correct name, vecstore' at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6/compare?selectedSessions=e3eda1e1-bc25-46e8-a4fb-db324cefd1c9\n",
"\n",
"View all tests for Dataset Extracting Corrected Names at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6\n",
"[------------------------------------------------->] 23/23"
]
}
],
"source": [
"run_on_dataset(query_analyzer_5, f\"GPT-3.5, correct name, vecstore\")"
]
},
{
"cell_type": "markdown",
"id": "e6e96a2c-506e-461f-bd05-cb88fe0ea3aa",
"metadata": {},
"source": [
"This gets us up to `Correct rate: 83%`\n",
"\n",
"See the test run in LangSmith [here](https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d/compare?selectedSessions=e3eda1e1-bc25-46e8-a4fb-db324cefd1c9)."
]
},
{
"cell_type": "markdown",
"id": "f1f8ce77-01a3-41d1-a047-103cb2e552f9",
"metadata": {},
"source": [
"## Chain 6: Replace with top candidate by ngram overlap\n",
"\n",
"We can do the same with ngram overlap search instead of vector search:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "21ffa8c9-907b-453a-9b32-01a981bca5ec",
"metadata": {},
"outputs": [],
"source": [
"class Search(BaseModel):\n",
" query: str\n",
" author: str\n",
"\n",
" @validator(\"author\")\n",
" def double(cls, v: str) -> str:\n",
" return get_names(v).split(\"\\n\")[0]\n",
"\n",
"\n",
"structured_llm_4 = llm.with_structured_output(Search)\n",
"query_analyzer_6 = (\n",
" prompt.partial(system=system) | structured_llm_4 | {\"name\": attrgetter(\"author\")}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "126354dd-c54e-4391-8a5e-5e200d006a18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'GPT-3.5, correct name, ngram' at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6/compare?selectedSessions=8f8846c8-2ada-41bc-8d2c-e1d56e7c92ce\n",
"\n",
"View all tests for Dataset Extracting Corrected Names at:\n",
"https://smith.langchain.com/o/43ae1439-dbb7-53b8-bef4-155154d3f962/datasets/1765d6b2-aa2e-46ec-9158-9f4ca8f228c6\n",
"[------------------------------------------------->] 23/23"
]
}
],
"source": [
"run_on_dataset(query_analyzer_6, f\"GPT-3.5, correct name, ngram\")"
]
},
{
"cell_type": "markdown",
"id": "b8c8cd81-61d0-4c1f-957d-1910be7706e7",
"metadata": {},
"source": [
"This gets us up to `Correct rate: 74%`, slightly worse than Chain 5 (same thing using vector search insteadf of ngram).\n",
"\n",
"See the test run in LangSmith [here](https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d/compare?selectedSessions=8f8846c8-2ada-41bc-8d2c-e1d56e7c92ce)."
]
},
{
"cell_type": "markdown",
"id": "4d7f7ab4-466d-434c-98bd-ebe1906599a9",
"metadata": {},
"source": [
"## See all results in LangSmith\n",
"\n",
"To see the full dataset and all the test results, head to LangSmith: https://smith.langchain.com/public/8c0a4c25-426d-4582-96fc-d7def170be76/d"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "benchmarks-venv",
"language": "python",
"name": "benchmarks-venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
+1 -1
View File
@@ -122,7 +122,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.11.4"
}
},
"nbformat": 4,
File diff suppressed because one or more lines are too long
@@ -311,7 +311,7 @@
"\n",
"## Customizing Chunking\n",
"\n",
"The simplest change you can make to the index is configure how you split the "
"The simplest change you can make to the index is configure how you split the documents."
]
},
{
File diff suppressed because one or more lines are too long
+1
View File
@@ -27,6 +27,7 @@
./notebooks/extraction/intro
./notebooks/extraction/email
./notebooks/extraction/chat_extraction
./notebooks/extraction/high_cardinality
```
```{toctree}
@@ -0,0 +1,5 @@
from langchain_benchmarks.extraction.tasks.high_cardinality.name_correction import (
NAME_CORRECTION_TASK,
)
__all__ = ["NAME_CORRECTION_TASK"]
@@ -0,0 +1,36 @@
from langchain.smith import RunEvalConfig
from langchain_core.pydantic_v1 import BaseModel, Field
from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run
from langchain_benchmarks.schema import ExtractionTask
@run_evaluator
def correct_name(run: Run, example: Example) -> EvaluationResult:
if "name" in run.outputs:
prediction = run.outputs["name"]
else:
prediction = run.outputs["output"]["name"]
name = example.outputs["name"]
score = int(name == prediction)
return EvaluationResult(key="correct", score=score)
class Person(BaseModel):
"""Information about a person."""
name: str = Field(..., description="The person's name")
NAME_CORRECTION_TASK = ExtractionTask(
name="Name Correction",
dataset_id="https://smith.langchain.com/public/78df83ee-ba7f-41c6-832c-2b23327d4cf7/d",
schema=Person,
description="""A dataset of 23 misspelled full names and their correct spellings.""",
dataset_url="https://smith.langchain.com/public/78df83ee-ba7f-41c6-832c-2b23327d4cf7/d",
dataset_name="Extracting Corrected Names",
eval_config=RunEvalConfig(
custom_evaluators=[correct_name],
),
)
+22 -1
View File
@@ -212,10 +212,31 @@ _FIREWORKS_MODELS = [
]
_ANTHROPIC_MODELS = [
RegisteredModel(
provider="anthropic",
name="claude-3-haiku-20240307",
description="Fastest and most compact model for near-instant responsiveness",
type="chat",
params={"model": "claude-3-haiku-20240307"},
),
RegisteredModel(
provider="anthropic",
name="claude-3-sonnet-20240229",
description="Ideal balance of intelligence and speed for enterprise workloads",
type="chat",
params={"model": "claude-3-sonnet-20240229"},
),
RegisteredModel(
provider="anthropic",
name="claude-3-opus-20240229",
description="Most powerful model for highly complex tasks",
type="chat",
params={"model": "claude-3-opus-20240229"},
),
RegisteredModel(
provider="anthropic",
name="claude-2",
description=("Superior performance on tasks that require complex reasoning"),
description="Superior performance on tasks that require complex reasoning",
type="chat",
params={
"model": "claude-2",
+6 -1
View File
@@ -1,6 +1,10 @@
"""Registry of environments for ease of access."""
from langchain_benchmarks.extraction.tasks import chat_extraction, email_task
from langchain_benchmarks.extraction.tasks import (
chat_extraction,
email_task,
high_cardinality,
)
from langchain_benchmarks.rag.tasks import (
LANGCHAIN_DOCS_TASK,
MULTI_MODAL_SLIDE_DECKS_TASK,
@@ -27,5 +31,6 @@ registry = Registry(
LANGCHAIN_DOCS_TASK,
SEMI_STRUCTURED_REPORTS_TASK,
MULTI_MODAL_SLIDE_DECKS_TASK,
high_cardinality.NAME_CORRECTION_TASK,
]
)
+25 -8
View File
@@ -130,11 +130,14 @@ class ExtractionTask(BaseTask):
# We might want to make this optional / or support more types
# and add validation, but let's wait until we have more examples
instructions: ChatPromptTemplate
instructions: Optional[ChatPromptTemplate] = None
"""Get the prompt for the task.
This is the default prompt to use for the task.
"""
dataset_url: Optional[str] = None
dataset_name: Optional[str] = None
eval_config: Optional[RunEvalConfig] = None
@dataclasses.dataclass(frozen=True)
@@ -253,7 +256,13 @@ class Registry:
Provider = Literal["fireworks", "openai", "anthropic", "anyscale"]
ModelType = Literal["chat", "llm"]
AUTHORIZED_NAMESPACES = {"langchain", "langchain_google_genai"}
AUTHORIZED_NAMESPACES = {
"langchain",
"langchain_google_genai",
"langchain_openai",
"langchain_anthropic",
"langchain_fireworks",
}
def _get_model_class_from_path(
@@ -270,7 +279,15 @@ def _get_model_class_from_path(
)
# Import the module dynamically
module = importlib.import_module(module_name)
try:
module = importlib.import_module(module_name)
except ImportError:
raise ImportError(
f"Could not import module {module_name}. "
f"Perhaps you need to run to pip install the package? "
f"`pip install {module_name}`."
)
model_class = getattr(module, attribute_name)
if not issubclass(model_class, (BaseLanguageModel, BaseChatModel)):
raise ValueError(
@@ -282,13 +299,13 @@ def _get_model_class_from_path(
def _get_default_path(provider: str, type_: ModelType) -> str:
"""Get the default path for a model."""
paths = {
("fireworks", "chat"): "langchain.chat_models.fireworks.ChatFireworks",
("fireworks", "llm"): "langchain.llms.fireworks.Fireworks",
("anthropic", "chat"): "langchain_anthropic.ChatAnthropic",
("anyscale", "chat"): "langchain.chat_models.anyscale.ChatAnyscale",
("anyscale", "llm"): "langchain.llms.anyscale.Anyscale",
("openai", "chat"): "langchain.chat_models.openai.ChatOpenAI",
("openai", "llm"): "langchain.llms.openai.OpenAI",
("anthropic", "chat"): "langchain.chat_models.anthropic.ChatAnthropic",
("fireworks", "chat"): "langchain_fireworks.ChatFireworks",
("fireworks", "llm"): "langchain_fireworks.Fireworks",
("openai", "chat"): "langchain_openai.ChatOpenAI",
("openai", "llm"): "langchain_openai.OpenAI",
(
"google-genai",
"chat",
@@ -12,6 +12,7 @@ from langchain_benchmarks.tool_usage.agents.openai_functions import OpenAIAgentF
from langchain_benchmarks.tool_usage.agents.runnable_agent import (
CustomRunnableAgentFactory,
)
from langchain_benchmarks.tool_usage.agents.tool_using_agent import StandardAgentFactory
__all__ = [
"OpenAIAgentFactory",
@@ -20,4 +21,5 @@ __all__ = [
"CustomAgentFactory",
"AnthropicToolUserFactory",
"CustomRunnableAgentFactory",
"StandardAgentFactory",
]
@@ -0,0 +1,82 @@
"""Factory for creating agents.
This is useful for agents that follow the standard LangChain tool format.
"""
from typing import Optional
from langchain.agents import AgentExecutor
from langchain_core.language_models import BaseChatModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import Runnable
from langchain_benchmarks.rate_limiting import RateLimiter, with_rate_limit
from langchain_benchmarks.schema import ToolUsageTask
from langchain_benchmarks.tool_usage.agents.adapters import apply_agent_executor_adapter
class StandardAgentFactory:
"""A standard agent factory.
Use this factory with chat models that support the standard LangChain tool
calling API where the chat model populates the tool_calls attribute on AIMessage.
"""
def __init__(
self,
task: ToolUsageTask,
model: BaseChatModel,
prompt: ChatPromptTemplate,
*,
rate_limiter: Optional[RateLimiter] = None,
) -> None:
"""Create an agent factory for the given tool usage task.
Args:
task: The task to create an agent factory for
model: chat model to use, must support tool usage
prompt: This is a chat prompt at the moment.
Must include an agent_scratchpad
For example,
ChatPromptTemplate.from_messages(
[
("system", "{instructions}"),
("human", "{input}"),
MessagesPlaceholder("agent_scratchpad"),
]
)
rate_limiter: will be appended to the agent runnable
"""
self.task = task
self.model = model
self.prompt = prompt
self.rate_limiter = rate_limiter
def __call__(self) -> Runnable:
"""Call the factory to create Runnable agent."""
# Temporarily import here until new langchain is released with create_tools_agent
from langchain.agents import create_tool_calling_agent
env = self.task.create_environment()
if "instructions" in self.prompt.input_variables:
finalized_prompt = self.prompt.partial(instructions=self.task.instructions)
else:
finalized_prompt = self.prompt
agent = create_tool_calling_agent(self.model, env.tools, finalized_prompt)
if self.rate_limiter:
agent = with_rate_limit(agent, self.rate_limiter)
executor = AgentExecutor(
agent=agent,
tools=env.tools,
handle_parsing_errors=True,
return_intermediate_steps=True,
)
return apply_agent_executor_adapter(
executor, state_reader=env.read_state
).with_config({"run_name": "Agent", "metadata": {"task": self.task.name}})
Generated
+2341 -1525
View File
File diff suppressed because it is too large Load Diff
+42 -4
View File
@@ -1,6 +1,6 @@
[tool.poetry]
name = "langchain-benchmarks"
version = "0.0.10"
version = "0.0.11"
description = "🦜💪 Flex those feathers!"
authors = ["LangChain AI"]
license = "MIT"
@@ -8,21 +8,50 @@ readme = "README.md"
[tool.poetry.dependencies]
python = "^3.8.1"
langchain = ">=0.0.300"
langchain = "^0.1.0"
langsmith = ">=0.0.70"
tqdm = "^4"
ipywidgets = "^8"
tabulate = ">=0.8.0"
[tool.poetry.group.dev]
optional = true
[tool.poetry.group.dev.dependencies]
jupyterlab = "^3.6.1"
jupyter = "^1.0.0"
langchain-core = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/core"}
langchain = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/langchain"}
langchain-anthropic = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/anthropic"}
langchain-fireworks = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/fireworks"}
langchain-mistralai = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/mistralai"}
langchain-cohere = {git = "https://github.com/langchain-ai/langchain-cohere.git", subdirectory="libs/cohere"}
langchain-groq = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/groq"}
langchain-openai = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/openai"}
[tool.poetry.group.typing]
optional = true
[tool.poetry.group.typing.dependencies]
mypy = "^1.7.0"
langchain-core = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/core"}
langchain = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/langchain"}
langchain-anthropic = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/anthropic"}
langchain-fireworks = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/fireworks"}
langchain-mistralai = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/mistralai"}
langchain-cohere = {git = "https://github.com/langchain-ai/langchain-cohere.git", subdirectory="libs/cohere"}
langchain-groq = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/groq"}
langchain-openai = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/openai"}
[tool.poetry.group.lint]
optional = true
[tool.poetry.group.lint.dependencies]
ruff = "^0.1.5"
[tool.poetry.group.docs]
optional = true
[tool.poetry.group.docs.dependencies]
nbsphinx = ">=0.8.9"
sphinx = ">=5.2.0"
@@ -32,6 +61,8 @@ myst-nb = { version = "^1.0.0", python = "^3.9" }
toml = "^0.10.2"
sphinx-copybutton = ">=0.5.1"
[tool.poetry.group.test]
optional = true
[tool.poetry.group.test.dependencies]
pytest = "^7.2.1"
@@ -42,7 +73,14 @@ pytest-socket = "^0.6.0"
pytest-watch = "^4.2.0"
pytest-timeout = "^2.2.0"
freezegun = "^1.3.1"
langchain-core = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/core"}
langchain = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/langchain"}
langchain-anthropic = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/anthropic"}
langchain-fireworks = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/fireworks"}
langchain-mistralai = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/mistralai"}
langchain-cohere = {git = "https://github.com/langchain-ai/langchain-cohere.git", subdirectory="libs/cohere"}
langchain-groq = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/groq"}
langchain-openai = {git = "https://github.com/langchain-ai/langchain.git", subdirectory = "libs/partners/openai"}
[tool.ruff]
select = [