Math word problems are critical K-8 educational tools, but writing them is time-consuming and requires domain expertise. We suggest that language models can support K-8 math education by automatically generating problems. To be educational, generated problems must be 1) solvable, 2) accurate, and 3) appropriate. Existing datasets are unlabeled for these criteria, making them ill-suited for training problem generators. To address this gap, we use domain expert annotation to curate a high-quality synthetic training dataset for this task. We demonstrate the value of this data by using it to iteratively finetune Llama-2 (70B), creating MATHWELL, a K-8 word problem generator. Domain experts find that MATHWELL produces a 40% higher share of problems with executable solutions that meet all criteria than existing open-source models, with 74% of its problems with executable solutions being solvable, accurate, and appropriate. MATHWELL achieves 94.9% of GPT-4 Turbo's performance on this task while producing problems written at a more appropriate reading level for K-8 students. MATHWELL's performance, despite being trained only through finetuning, highlights the quality of our synthetic data for training age-appropriate word problem generators. We release our model, data, and annotations.