Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty while ignoring appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on the DAT and show that their scores are lower than those of two baselines that possess no creative abilities, undermining the task's validity for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce the Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than the DAT while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment shift models along this frontier, making outputs more appropriate but less creative. We release the dataset and code.