diff --git a/math/README.md b/math/README.md
index 24343a9..dc53baa 100644
--- a/math/README.md
+++ b/math/README.md
@@ -11,7 +11,7 @@ High level, math agents are expected to take a math question and output the answ
 ```json
 {
   "type": "object",
-  "title": "math_input",
+  "title": "calculate",
   "required": [
     "question"
   ],
@@ -57,12 +57,12 @@ High level, math agents are expected to take a math question and output the answ
 
 There is a standard math problems dataset for evaluation in LangSmith:
 
-- [Dataset](https://smith.langchain.com/public/e0993f2f-c055-4446-afc2-e52da6a4dda0/d). This dataset has a list of math problems to solve ("question" and "answer").
+- [Simple Math Problems Dataset](https://smith.langchain.com/public/4295b2cf-7a79-415d-97d0-b3639e990848/d). This dataset has a list of math problems consisting of questions and answers.
   Example input:
 
   ```json
   {
-    "Question": "Find the second derivative of f(x)=ln(x) and evaluate it at x=0.5."
+    "question": "Find the second derivative of f(x)=ln(x) and evaluate it at x=0.5."
   }
   ```
 
@@ -70,13 +70,20 @@ There is a standard math problems dataset for evaluation in LangSmith:
 
   ```json
   {
-    "Answer": "-4"
+    "answer": "-4"
   }
   ```
 
 ## Evaluation Metric
 
-Currently there is a single evaluation metric: whether the answer is close to the expected answer (within a precision tolerance).
+A score is calculated based on the correctness of the answer based on the following rules:
+
+| **Condition**                                                     | **Score** |
+|-------------------------------------------------------------------|-----------|
+| Answer is correct (within precision tolerance of expected answer) | 1         |
+| Answer is incorrect                                               | -1        |
+| Answer is not provided, but question can be answered              | 0         |
+| Answer is not provided, but question cannot be answered           | 1         |
 
 These can be adjusted in the `run_eval.py` script if you're adapting this to your own dataset.
 
@@ -90,7 +97,7 @@ To evaluate the agent, you can run `math/run_eval.py` script. This will create n
 python math/run_eval.py
 ```
 
-By default this will use the `Math problems` dataset & `Calc you later` agent by LangChain.
+By default, this will use the `Math problems` dataset & `Calc you later` agent by LangChain.
 
 **Advanced usage:**
 
@@ -123,4 +130,4 @@ def make_agent_runner(agent_id: str, agent_url: str):
         return transformed_outputs
 
     return run_agent
-```
\ No newline at end of file
+```
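
Review note: the scoring table introduced under "Evaluation Metric" could be read roughly as the sketch below. This is not the code from `run_eval.py`, which the diff does not show. The function name `score_answer`, the use of `None` to mark a missing answer or an unanswerable question, and the tolerance values are all assumptions made for illustration. The table also does not say how to score an answer given to an unanswerable question; the sketch treats it as incorrect.

```python
import math
from typing import Optional


def score_answer(
    predicted: Optional[float],
    expected: Optional[float],
    rel_tol: float = 1e-6,  # assumed tolerance; the README does not give a value
    abs_tol: float = 1e-9,  # assumed; lets values near zero compare as equal
) -> int:
    """Hypothetical scorer following the README's condition/score table.

    predicted is None -> the agent gave no answer.
    expected is None  -> the question cannot be answered (assumed encoding).
    """
    answerable = expected is not None

    if predicted is None:
        # No answer: 0 if an answer existed, 1 if declining was the right call.
        return 0 if answerable else 1

    if not answerable:
        # Assumption: answering an unanswerable question counts as incorrect.
        return -1

    # Correct within the precision tolerance scores 1, otherwise -1.
    is_close = math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=abs_tol)
    return 1 if is_close else -1
```

With the dataset example from this diff (expected answer `-4`), `score_answer(-4.0, -4.0)` returns `1`, `score_answer(-3.0, -4.0)` returns `-1`, and `score_answer(None, -4.0)` returns `0`.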