MetaMath:
Bootstrap Your Own Mathematical Questions for Large Language Models

1University of Cambridge, 2Southern University of Science and Technology, 3Hong Kong University of Science and Technology, 4Huawei Noah's Ark Lab, 5The Alan Turing Institute, 6Max Planck Institute for Intelligent Systems
*Equal Contribution, Corresponding author

longhuiyu98@gmail.com, shi.han@huawei.com, wl396@cam.ac.uk

Abstract

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this great success, most existing open-source LLMs (e.g., LLaMA-2) still fall far short of satisfactory performance on mathematical problems due to the complex reasoning procedures they require. To bridge this gap, we propose MetaMath, a finetuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives, which results in a new dataset called MetaMathQA. We then finetune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks for mathematical reasoning (i.e., GSM8K and MATH) demonstrate that MetaMath outperforms all open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.5% on GSM8K and 19.8% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Notably, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the finetuned MetaMath models of different sizes, and the training code for public use.

Overview

Figure 1: Overview of the MetaMathQA data and MetaMath.

MetaMathQA

Answer Augmentation (AnsAug)

Question: James buys 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay?

Generated Answer: James buys 5 packs of beef that are 4 pounds each, so he buys a total of 5 * 4 = 20 pounds of beef. The price of beef is $5.50 per pound, so he pays 20 * $5.50 = $110. The answer is: 110
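
Answer augmentation amounts to rejection sampling: draw several chain-of-thought answers per training question and keep only those whose final answer matches the ground truth. A minimal sketch in Python, where generate is a hypothetical stand-in for the underlying LLM call and the extraction regex assumes the "The answer is:" convention shown above:

import re

def extract_answer(completion):
    # Pull the final value out of "The answer is: <value>".
    match = re.search(r"The answer is:\s*([-$\d.,]+)", completion)
    if not match:
        return None
    return match.group(1).rstrip(".").replace(",", "").replace("$", "")

def answer_augment(question, ground_truth, generate, k=4, temperature=0.7):
    # Sample k reasoning paths; keep the ones that reach the correct answer.
    kept = []
    for _ in range(k):
        completion = generate(question, temperature=temperature)
        if extract_answer(completion) == str(ground_truth):
            kept.append({"query": question, "response": completion})
    return kept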

Rephrasing Question

Question: What is the total amount that James paid when he purchased 5 packs of beef, each weighing 4 pounds, at a price of $5.50 per pound?

Answer: Each pack of beef weighs 4 pounds, so 5 packs weigh 4 * 5 = 20 pounds in total. The price per pound of beef is $5.50, so the total cost for 20 pounds is 20 * $5.50 = $110. Therefore, James paid a total of $110. The answer is: 110
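
Rephrased questions are themselves generated by an LLM and kept only if they pass an answer-consistency check: the model answers the rephrased question, and the rephrasing survives only if that answer matches the original ground truth. A sketch under the same assumptions as above; the instruction wording and the generate callable are illustrative, not the paper's exact prompt or API:

import re

REPHRASE_TEMPLATE = (
    "You are an AI assistant to help me rephrase questions.\n"
    "Rephrase the following question: {question}"
)

def rephrase_and_validate(question, ground_truth, generate):
    # Ask the model for a rephrasing, then check answer consistency.
    rephrased = generate(REPHRASE_TEMPLATE.format(question=question), temperature=0.7)
    completion = generate(rephrased, temperature=0.0)
    match = re.search(r"The answer is:\s*([-$\d.,]+)", completion)
    answer = match.group(1).rstrip(".").replace("$", "") if match else None
    return rephrased if answer == str(ground_truth) else None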

Self-Verification Question

Question: James buys x packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. He paid $110. What is the value of unknown variable x?

Answer: To solve this problem, we need to determine the value of x, which represents the number of packs of beef that James bought. Each pack of beef weighs 4 pounds and costs $5.50 per pound. The total amount James paid is $110. We can set up the equation as follows: Number of packs of beef * Weight per pack * Price per pound = Total amount paid; x * 4 * $5.50 = $110; Let's simplify and solve for x: 4x * $5.50 = $110; 22x = $110; To isolate x, we divide both sides of the equation by 22: 22x / 22 = $110 / 22; x = 5; The value of x is 5.
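
The algebra above can be checked mechanically; a quick sanity check with sympy (an illustration only, not part of the MetaMath pipeline):

from sympy import Eq, Rational, solve, symbols

x = symbols("x")
# packs (x) * 4 pounds per pack * $5.50 per pound = $110 paid
print(solve(Eq(x * 4 * Rational("5.50"), 110), x))  # [5]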

FOBAR Question

Question: James buys x packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay? If we know the answer to the above question is 110, what is the value of unknown variable x?

Answer: James buys x packs of beef that are 4 pounds each, so he buys a total of 4x pounds of beef. The price of beef is $5.50 per pound, so the total cost of the beef is 5.50 * 4x = 22x. We are given that the total cost is $110, so we can write: 22x = 110. Dividing both sides by 22, we get: x = 5. The value of x is 5.
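
FOBAR questions are built mechanically from the forward question: one given number is masked as x and a fixed suffix stating the known answer is appended (Self-Verification works similarly but first rewrites the question as a declarative statement). A minimal sketch; the suffix is taken verbatim from the example above, while the single-pass masking helper is a simplification for illustration:

FOBAR_SUFFIX = (
    " If we know the answer to the above question is {answer},"
    " what is the value of unknown variable x?"
)

def make_fobar(question, masked_number, answer):
    # Mask one given number as x, then append the known final answer.
    masked = question.replace(str(masked_number), "x", 1)
    return masked + FOBAR_SUFFIX.format(answer=answer)

question = ("James buys 5 packs of beef that are 4 pounds each. "
            "The price of beef is $5.50 per pound. How much did he pay?")
print(make_fobar(question, 5, 110))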

Comprehensive Results

Model              #Params  GSM8K  MATH
Closed-source models
GPT-4              -        92.0   42.5
GPT-3.5-Turbo      -        80.8   34.1
PaLM               8B       4.1    1.5
PaLM               62B      33.0   4.4
PaLM               540B     56.5   8.8
PaLM-2             540B     80.7   34.3
Flan-PaLM 2        540B     84.7   33.2
Minerva            8B       16.2   14.1
Minerva            62B      52.4   27.6
Minerva            540B     58.8   33.6
Open-source models (1-10B)
LLaMA-1            7B       11.0   2.9
LLaMA-2            7B       14.6   2.5
MPT                7B       6.8    3.0
Falcon             7B       6.8    2.3
InternLM           7B       31.2   -
GPT-J              6B       34.9   -
ChatGLM 2          6B       32.4   -
Qwen               7B       51.6   -
Baichuan-2         7B       24.5   5.6
SFT                7B       41.6   -
RFT                7B       50.3   -
WizardMath         7B       54.9   10.7
MetaMath (ours)    7B       66.5   19.8
Open-source models (11-50B)
LLaMA-1            13B      17.8   3.9
LLaMA-1            33B      35.6   7.1
LLaMA-2            13B      28.7   3.9
LLaMA-2            34B      42.2   6.2
MPT                30B      15.2   3.1
Falcon             40B      19.6   2.5
GAL                30B      -      12.7
Vicuna             13B      27.6   -
Baichuan-2         13B      52.8   10.1
SFT                13B      50.0   -
RFT                13B      54.8   -
WizardMath         13B      63.9   14.0
MetaMath (ours)    13B      72.3   22.4
Open-source models (50-70B)
LLaMA-1            65B      50.9   10.6
LLaMA-2            70B      56.8   13.5
RFT                70B      64.8   -
WizardMath         70B      81.6   22.7
MetaMath (ours)    70B      82.3   26.6

Table 1: Comparison of test accuracy (%) with existing LLMs on GSM8K and MATH. Due to computing resource limitations, MetaMath-70B is finetuned with QLoRA.

BibTeX

@article{yu2023metamath,
  title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
  author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
  journal={arXiv preprint arXiv:2309.12284},
  year={2023}
}