A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models score higher when they produce a larger set of strong, diverse paths. This task shares key demands of real creativity tasks such as hypothesis generation, including an extremely large search space. Unlike those tasks, however, it allows us to collect a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others. At the same time, the large number of valid answers and the complexity of the search make the benchmark difficult to saturate. Furthermore, our results show that thinking models are not always more effective on our task, even with high token budgets. Recent creative prompting approaches yield only limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.