Answer Augmentation (AnsAug)
Generated Answer: James buys 5 packs of beef that are 4 pounds each, so he buys a total of 5 * 4 = 20 pounds of beef. The price of beef is $5.50 per pound, so he pays 20 * $5.50 = $110. The answer is: 110
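The arithmetic in this sample can be verified mechanically, and the sample itself illustrates the AnsAug idea: sample several chains of thought for a known question and keep only those whose final answer matches the ground truth. Below is a minimal sketch of that rejection-sampling loop; the `generate` callable and the prompt template are hypothetical stand-ins for whatever LLM sampling interface is available, not the paper's exact implementation.

```python
import re

def extract_answer(completion: str):
    """Pull the final numeric answer from a 'The answer is: X' chain of thought."""
    match = re.search(r"The answer is:\s*([-+]?\d[\d,]*\.?\d*)", completion)
    return match.group(1).replace(",", "") if match else None

def answer_augment(question: str, gold_answer: str, generate, k: int = 8):
    """Sample k chains of thought and keep those that end in the gold answer.

    `generate` is a placeholder for any LLM sampling call (temperature > 0),
    e.g. a wrapper around a hosted or local model.
    """
    kept = []
    for _ in range(k):
        completion = generate(f"Question: {question}\nAnswer: Let's think step by step.")
        if extract_answer(completion) == gold_answer:
            kept.append(completion)
    return kept
```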
Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this success, most existing open-source LLMs (e.g., LLaMA-2) still fall far short on mathematical problems due to the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a finetuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives to produce a new dataset called MetaMathQA. We then finetune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks for mathematical reasoning (i.e., GSM8K and MATH) demonstrate that MetaMath outperforms all open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.5% on GSM8K and 19.8% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%, respectively. Notably, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models of different model sizes, and the training code for public use.
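For concreteness, here is a minimal supervised finetuning sketch in the spirit of the recipe above: each MetaMathQA record is formatted as an instruction/response pair and trained with the standard causal-LM objective. The dataset id and field names (`query`, `response`) follow the public Hugging Face release of MetaMathQA, but the prompt template and hyperparameters here are illustrative assumptions, not the exact training configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Llama-2-7b-hf"  # base model to finetune
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("meta-math/MetaMathQA", split="train")

def format_example(ex):
    # Alpaca-style instruction template; treat as an assumption, not the paper's exact prompt.
    text = ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{ex['query']}\n\n### Response: {ex['response']}")
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="metamath-7b", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-5, bf16=True),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```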
| Model | #params | GSM8K (%) | MATH (%) |
|---|---|---|---|
| **Closed-source models** | | | |
| GPT-4 | - | 92.0 | 42.5 |
| GPT-3.5-Turbo | - | 80.8 | 34.1 |
| PaLM | 8B | 4.1 | 1.5 |
| PaLM | 62B | 33.0 | 4.4 |
| PaLM | 540B | 56.5 | 8.8 |
| PaLM-2 | 540B | 80.7 | 34.3 |
| Flan-PaLM 2 | 540B | 84.7 | 33.2 |
| Minerva | 8B | 16.2 | 14.1 |
| Minerva | 62B | 52.4 | 27.6 |
| Minerva | 540B | 58.8 | 33.6 |
| **Open-source models (1-10B)** | | | |
| LLaMA-1 | 7B | 11.0 | 2.9 |
| LLaMA-2 | 7B | 14.6 | 2.5 |
| MPT | 7B | 6.8 | 3.0 |
| Falcon | 7B | 6.8 | 2.3 |
| InternLM | 7B | 31.2 | - |
| GPT-J | 6B | 34.9 | - |
| ChatGLM 2 | 6B | 32.4 | - |
| Qwen | 7B | 51.6 | - |
| Baichuan-2 | 7B | 24.5 | 5.6 |
| SFT | 7B | 41.6 | - |
| RFT | 7B | 50.3 | - |
| WizardMath | 7B | 54.9 | 10.7 |
| MetaMath (ours) | 7B | 66.5 | 19.8 |
| **Open-source models (11-50B)** | | | |
| LLaMA-1 | 13B | 17.8 | 3.9 |
| LLaMA-1 | 33B | 35.6 | 7.1 |
| LLaMA-2 | 13B | 28.7 | 3.9 |
| LLaMA-2 | 34B | 42.2 | 6.2 |
| MPT | 30B | 15.2 | 3.1 |
| Falcon | 40B | 19.6 | 2.5 |
| GAL | 30B | - | 12.7 |
| Vicuna | 13B | 27.6 | - |
| Baichuan-2 | 13B | 52.8 | 10.1 |
| SFT | 13B | 50.0 | - |
| RFT | 13B | 54.8 | - |
| WizardMath | 13B | 63.9 | 14.0 |
| MetaMath (ours) | 13B | 72.3 | 22.4 |
| **Open-source models (50-70B)** | | | |
| LLaMA-1 | 65B | 50.9 | 10.6 |
| LLaMA-2 | 70B | 56.8 | 13.5 |
| RFT | 70B | 64.8 | - |
| WizardMath | 70B | 81.6 | 22.7 |
| MetaMath (ours) | 70B | 82.3 | 26.6 |
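As a rough guide to how accuracies like those in the table are obtained, the sketch below runs greedy decoding on a slice of the GSM8K test set, extracts the final number from each completion, and exact-matches it against the gold label. The checkpoint id and prompt template are assumptions based on the public release; the official evaluation harness may differ in its prompt and answer parsing.

```python
import re
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-math/MetaMath-7B-V1.0"  # assumed id of the released 7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def final_number(text: str):
    """Return the last number in a completion as a float, or None."""
    nums = re.findall(r"[-+]?\d[\d,]*\.?\d*", text)
    return float(nums[-1].replace(",", "")) if nums else None

gsm8k = load_dataset("gsm8k", "main", split="test")
n, correct = 100, 0  # small slice for a quick estimate, not a full benchmark run
for ex in gsm8k.select(range(n)):
    prompt = f"Question: {ex['question']}\nAnswer: Let's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy
    pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    # GSM8K gold answers follow the chain of thought after a '####' marker.
    gold = float(ex["answer"].split("####")[-1].strip().replace(",", ""))
    correct += final_number(pred) == gold
print(f"accuracy on {n}-example slice: {correct / n:.1%}")
```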
@article{yu2023metamath,
title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
journal={arXiv preprint arXiv:2309.12284},
year={2023}
}