Math word problems are critical K-8 educational tools, but writing them is time-consuming and requires domain expertise. We suggest that language models can support K-8 math education by automatically generating problems at scale. To be educational, generated problems must be 1) solvable, 2) accurate, and 3) appropriate. Existing datasets are not labeled for these criteria, making them ill-suited for training problem generators. We introduce MATHWELL, a Llama-2 (70B) model iteratively finetuned to generate K-8 math word problems using data from expert annotation. Using MATHWELL, we generate the largest English word problem dataset with Program of Thought (PoT) rationales to date, containing 20,490 problems. Domain experts scored 3,484 problems and found that MATHWELL has a 40% higher share of problems that have executable solutions and meet all criteria than alternatives, with 74% of its problems with executable solutions being solvable, accurate, and appropriate. We release our model, data, and annotations.