Large language models (LLMs) have demonstrated significant advances in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess models' genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions for new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity than their seed datasets. Moreover, our algorithm provides control over the novelty, diversity, and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel, and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
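As a minimal, hypothetical illustration of the kind of information-theoretic metrics described above (the abstract does not give the paper's exact definitions, so the functional forms, the Gaussian fit, and the placeholder embeddings below are assumptions), the sketch estimates novelty as the KL divergence between Gaussian fits of seed and synthesized problem embeddings, and diversity as the differential entropy of the synthesized set; neither requires running a model on the benchmark.

```python
# Illustrative sketch, not the paper's formulation: novelty as
# KL( N(synthesized) || N(seed) ) over problem embeddings, diversity as the
# differential entropy of a Gaussian fit to the synthesized embeddings.
# Embeddings here are random placeholders; in practice they would come from a
# sentence/code embedding model applied to the benchmark problems.
import numpy as np


def _gaussian_fit(x: np.ndarray, ridge: float = 1e-6):
    """Mean and regularized covariance of an (n, d) embedding matrix."""
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + ridge * np.eye(x.shape[1])
    return mu, cov


def kl_novelty(new_emb: np.ndarray, seed_emb: np.ndarray) -> float:
    """KL divergence between Gaussian fits: how far the synthesized
    benchmark's embedding distribution drifts from the seed dataset."""
    mu0, cov0 = _gaussian_fit(new_emb)
    mu1, cov1 = _gaussian_fit(seed_emb)
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - d + logdet1 - logdet0)


def entropy_diversity(emb: np.ndarray) -> float:
    """Differential entropy of a Gaussian fit: larger values indicate a
    more spread-out (diverse) set of problems."""
    _, cov = _gaussian_fit(emb)
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seed_emb = rng.normal(0.0, 1.0, size=(200, 8))   # seed benchmark embeddings
    new_emb = rng.normal(0.5, 1.3, size=(200, 8))    # synthesized benchmark embeddings
    print("novelty (KL):", kl_novelty(new_emb, seed_emb))
    print("diversity (entropy):", entropy_diversity(new_emb))
```

Because both quantities are computed directly from embedding statistics, they can score candidate problems during generation (e.g., as fitness signals in a genetic-algorithm loop) without any costly model evaluation; how InfoSynth actually defines and uses its metrics is specified in the paper itself.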