Recent advances in reasoning models have yielded impressive results in mathematics and coding. However, most approaches rely on static datasets, which have been suggested to encourage memorisation and limit generalisation. We introduce DéjàQ, a framework that departs from this paradigm by jointly evolving a diverse set of synthetic mathematical problems alongside model training. This evolutionary process adapts to the model's ability throughout training, optimising problems for learnability. We propose two LLM-driven mutation strategies in which the model itself mutates the training data, either by altering contextual details or by directly modifying problem structure. We find that the model can generate novel and meaningful problems, and that these LLM-driven mutations improve reinforcement learning (RL) training. We analyse key aspects of DéjàQ, including the validity of generated problems and the computational overhead of the framework. Our results underscore the potential of dynamically evolving training data to enhance mathematical reasoning and suggest broader applicability; we will open-source our code to support further work.
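To make the evolve-and-filter loop described above concrete, the following Python sketch illustrates one plausible shape it could take: a problem pool is mutated by the model itself and re-ranked by a learnability score as training proceeds. All names here (the `model.rewrite` and `model.solve` helpers, the mutation functions, and the Bernoulli-variance scoring rule) are our illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a DéjàQ-style evolution step; `model` is an abstract
# interface with assumed rewrite/solve methods, not the paper's actual API.
import random
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    answer: str

def mutate_context(model, p: Problem) -> Problem:
    # LLM rewrites surface details (names, scenario) without changing structure.
    return model.rewrite(p, instruction="alter contextual details only")

def mutate_structure(model, p: Problem) -> Problem:
    # LLM directly modifies the underlying problem structure.
    return model.rewrite(p, instruction="modify the problem structure")

def learnability(model, p: Problem, k: int = 8) -> float:
    # Score by pass rate over k sampled attempts; p(1-p) peaks at a 50% pass
    # rate, so problems the model sometimes-but-not-always solves rank highest.
    successes = sum(model.solve(p.statement) == p.answer for _ in range(k))
    rate = successes / k
    return rate * (1.0 - rate)

def evolve(pool, model, n_mutations: int = 32, keep: int = 256):
    mutators = [mutate_context, mutate_structure]
    children = [random.choice(mutators)(model, random.choice(pool))
                for _ in range(n_mutations)]
    # Re-rank the combined pool each step so the curriculum tracks the
    # model's current ability.
    scored = sorted(pool + children,
                    key=lambda p: learnability(model, p), reverse=True)
    return scored[:keep]
```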