Answer Augmentation (AnsAug)
Generated Answer: James buys 5 packs of beef that are 4 pounds each, so he buys a total of 5 * 4 = 20 pounds of beef. The price of beef is $5.50 per pound, so he pays 20 * $5.50 = $110. The answer is: 110
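The arithmetic in this sample can be verified mechanically, and the sample itself illustrates the AnsAug idea: sample several chains of thought for a known question and keep only those whose final answer matches the ground truth. Below is a minimal sketch of that rejection-sampling loop; the `generate` callable and the prompt template are hypothetical stand-ins for whatever LLM sampling interface is available, not the paper's exact implementation.

```python
import re

def extract_answer(completion: str):
    """Pull the final numeric answer from a 'The answer is: X' chain of thought."""
    match = re.search(r"The answer is:\s*([-+]?\d[\d,]*\.?\d*)", completion)
    return match.group(1).replace(",", "") if match else None

def answer_augment(question: str, gold_answer: str, generate, k: int = 8):
    """Sample k chains of thought and keep those that end in the gold answer.

    `generate` is a placeholder for any LLM sampling call (temperature > 0),
    e.g. a wrapper around a hosted or local model.
    """
    kept = []
    for _ in range(k):
        completion = generate(f"Question: {question}\nAnswer: Let's think step by step.")
        if extract_answer(completion) == gold_answer:
            kept.append(completion)
    return kept
```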
Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this success, most existing open-source LLMs (e.g., LLaMA-2) still fall far short on mathematical problems due to the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a finetuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives to produce a new dataset called MetaMathQA. We then finetune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks for mathematical reasoning (i.e., GSM8K and MATH) demonstrate that MetaMath outperforms all open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.5% on GSM8K and 19.8% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%, respectively. Notably, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models of different model sizes, and the training code for public use.
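For concreteness, here is a minimal supervised finetuning sketch in the spirit of the recipe above: each MetaMathQA record is formatted as an instruction/response pair and trained with the standard causal-LM objective. The dataset id and field names (`query`, `response`) follow the public Hugging Face release of MetaMathQA, but the prompt template and hyperparameters here are illustrative assumptions, not the exact training configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Llama-2-7b-hf"  # base model to finetune
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("meta-math/MetaMathQA", split="train")

def format_example(ex):
    # Alpaca-style instruction template; treat as an assumption, not the paper's exact prompt.
    text = ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{ex['query']}\n\n### Response: {ex['response']}")
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="metamath-7b", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-5, bf16=True),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```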
| Model | #params | GSM8K (%) | MATH (%) |
|---|---|---|---|
| **Closed-source models** | | | |
| GPT-4 | - | 92.0 | 42.5 |
| GPT-3.5-Turbo | - | 80.8 | 34.1 |
| PaLM | 8B | 4.1 | 1.5 |
| PaLM | 62B | 33.0 | 4.4 |
| PaLM | 540B | 56.5 | 8.8 |
| PaLM-2 | 540B | 80.7 | 34.3 |
| Flan-PaLM 2 | 540B | 84.7 | 33.2 |
| Minerva | 8B | 16.2 | 14.1 |
| Minerva | 62B | 52.4 | 27.6 |
| Minerva | 540B | 58.8 | 33.6 |
| **Open-source models (1-10B)** | | | |
| LLaMA-1 | 7B | 11.0 | 2.9 |
| LLaMA-2 | 7B | 14.6 | 2.5 |
| MPT | 7B | 6.8 | 3.0 |
| Falcon | 7B | 6.8 | 2.3 |
| InternLM | 7B | 31.2 | - |
| GPT-J | 6B | 34.9 | - |
| ChatGLM 2 | 6B | 32.4 | - |
| Qwen | 7B | 51.6 | - |
| Baichuan-2 | 7B | 24.5 | 5.6 |
| SFT | 7B | 41.6 | - |
| RFT | 7B | 50.3 | - |
| WizardMath | 7B | 54.9 | 10.7 |
| MetaMath (ours) | 7B | 66.5 | 19.8 |
| **Open-source models (11-50B)** | | | |
| LLaMA-1 | 13B | 17.8 | 3.9 |
| LLaMA-1 | 33B | 35.6 | 7.1 |
| LLaMA-2 | 13B | 28.7 | 3.9 |
| LLaMA-2 | 34B | 42.2 | 6.2 |
| MPT | 30B | 15.2 | 3.1 |
| Falcon | 40B | 19.6 | 2.5 |
| GAL | 30B | - | 12.7 |
| Vicuna | 13B | 27.6 | - |
| Baichuan-2 | 13B | 52.8 | 10.1 |
| SFT | 13B | 50.0 | - |
| RFT | 13B | 54.8 | - |
| WizardMath | 13B | 63.9 | 14.0 |
| MetaMath (ours) | 13B | 72.3 | 22.4 |
| **Open-source models (50-70B)** | | | |
| LLaMA-1 | 65B | 50.9 | 10.6 |
| LLaMA-2 | 70B | 56.8 | 13.5 |
| RFT | 70B | 64.8 | - |
| WizardMath | 70B | 81.6 | 22.7 |
| MetaMath (ours) | 70B | 82.3 | 26.6 |
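As a rough guide to how accuracies like those in the table are obtained, the sketch below runs greedy decoding on a slice of the GSM8K test set, extracts the final number from each completion, and exact-matches it against the gold label. The checkpoint id and prompt template are assumptions based on the public release; the official evaluation harness may differ in its prompt and answer parsing.

```python
import re
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-math/MetaMath-7B-V1.0"  # assumed id of the released 7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def final_number(text: str):
    """Return the last number in a completion as a float, or None."""
    nums = re.findall(r"[-+]?\d[\d,]*\.?\d*", text)
    return float(nums[-1].replace(",", "")) if nums else None

gsm8k = load_dataset("gsm8k", "main", split="test")
n, correct = 100, 0  # small slice for a quick estimate, not a full benchmark run
for ex in gsm8k.select(range(n)):
    prompt = f"Question: {ex['question']}\nAnswer: Let's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy
    pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    # GSM8K gold answers follow the chain of thought after a '####' marker.
    gold = float(ex["answer"].split("####")[-1].strip().replace(",", ""))
    correct += final_number(pred) == gold
print(f"accuracy on {n}-example slice: {correct / n:.1%}")
```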
@article{yu2023metamath,
title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
journal={arXiv preprint arXiv:2309.12284},
year={2023}
}