Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. Leveraging the metacognitive skills [Didolkar et al., 2024] of a strong LLM, we extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$, a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH; (b) higher performance on MATH when MATH$^2$ questions are used as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship between models' performance on the two datasets: the success rate on MATH$^2$ is approximately the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
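The squared-success-rate observation admits a simple reading: if solving a MATH$^2$ question requires independently applying two distinct skills, each succeeding at roughly the model's MATH-level rate $p$, then the expected MATH$^2$ success rate is $p^2$. A minimal sketch of this interpretation (the rates below are illustrative, not figures from the paper):

```python
def predicted_math2_rate(math_rate: float) -> float:
    """Under the independence reading, a MATH^2 question demands two
    distinct skills, each succeeding at the model's MATH success rate,
    so the combined success probability is the square of that rate."""
    return math_rate ** 2

# Hypothetical MATH success rates for three models of varying strength.
for math_rate in (0.9, 0.7, 0.5):
    print(f"MATH: {math_rate:.2f} -> predicted MATH^2: "
          f"{predicted_math2_rate(math_rate):.2f}")
```

Note that the quadratic fit also implies the gap between MATH and MATH$^2$ performance shrinks as models get stronger, which is consistent with the claim that weaker models are disproportionately penalized by compositional questions.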