Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and they already have limited validity even as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that significantly predicts scientific ideation ability, and this result is robust across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.
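To make the idea of a vocabulary-space test concrete, the following minimal sketch scores a word list in the style of the DAT, which is commonly described as the average pairwise cosine distance between word embeddings, scaled by 100. It is an illustration under stated assumptions and not the scoring procedure used in this study: the function names and the three-dimensional toy embeddings are hypothetical, and a real evaluation would use pretrained word vectors and the test's own word-selection rules.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_style_score(words, embeddings):
    """Average pairwise cosine distance between the words, scaled by 100.

    `embeddings` maps each word to a vector. This is a hypothetical helper
    written for illustration, not an API from any library.
    """
    vectors = [embeddings[w] for w in words]
    pairs = list(combinations(vectors, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumption): hand-made vectors, not real word vectors.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "galaxy": [0.0, 1.0, 0.0],
    "justice": [0.0, 0.0, 1.0],
}

# Semantically close words should score lower than unrelated words.
related = dat_style_score(["cat", "dog"], toy_embeddings)
unrelated = dat_style_score(["cat", "galaxy", "justice"], toy_embeddings)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

In this sketch, a higher score means the words are spread further apart in embedding space, which is the property a divergent-thinking test rewards.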
