Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the metacognitive skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. Requiring two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$, a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH; and (b) higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the two datasets: the success rate on MATH$^2$ is approximately the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
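The squared relationship admits a simple heuristic reading, sketched here under the assumption (not established by the abstract itself) that the two skills in a question must each be applied correctly and roughly independently:

```latex
% Sketch: let p denote a model's success rate on MATH, taken as a
% proxy for applying a single skill correctly. If a MATH^2 question
% requires two skills, applied (assumed) independently, then
\Pr\!\left[\text{solve a MATH}^2 \text{ question}\right]
  \;\approx\; p \cdot p \;=\; p^2 .
```

Under this reading, the observed success rates are consistent with each MATH$^2$ question genuinely demanding both of its constituent skills rather than being solvable via either one alone.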