CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity

Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li

Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as generating creative text, there is still no holistic and scalable framework to evaluate their creativity across diverse scenarios. Existing methods for evaluating LLM creativity either rely heavily on humans, which limits speed and scalability, or are fragmented across different domains and different definitions of creativity. To address this gap, we propose CREATIVITYPRISM, an evaluation and analysis framework that consolidates eight tasks from three domains (divergent thinking, creative writing, and logical reasoning) into a taxonomy of creativity that emphasizes three dimensions: quality, novelty, and diversity of LLM generations. The framework is designed to be scalable, with reliable automatic evaluation judges that have been validated against human annotations. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CREATIVITYPRISM and find that while proprietary LLMs dominate creative writing and logical reasoning tasks, with a 15% lead over open-source ones, they offer no significant advantage in divergent thinking, a domain much less explored in existing post-training regimes. Our analysis also shows that high performance in one creative dimension or domain rarely generalizes to others; specifically, novelty metrics often show weak or negative correlations with other metrics. This fragmentation confirms that a holistic, multi-dimensional framework like CREATIVITYPRISM is essential for meaningful assessment of LLM creativity.
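The claim that novelty metrics correlate weakly or negatively with other metrics can be made concrete with a rank correlation computed across models. The sketch below is a minimal illustration, assuming one score per model for each dimension; the abstract does not state which correlation measure the authors use, so Spearman's rho is an assumption here, and the names `ranks`, `pearson`, `spearman`, and `toy_scores` as well as all numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy per-model scores, invented for illustration only (NOT from the paper).
# Each list holds one score per hypothetical model, in the same model order.
toy_scores = {
    "quality":   [0.62, 0.71, 0.55, 0.80, 0.67],
    "novelty":   [0.48, 0.35, 0.52, 0.30, 0.41],
    "diversity": [0.50, 0.58, 0.47, 0.66, 0.60],
}

if __name__ == "__main__":
    names = list(toy_scores)
    for a_idx, a in enumerate(names):
        for b in names[a_idx + 1:]:
            rho = spearman(toy_scores[a], toy_scores[b])
            print(f"{a} vs {b}: rho = {rho:+.2f}")
```

In this toy data, models that rank high on quality rank low on novelty, so the two dimensions would order the models in opposite ways. A pattern of that kind is what makes a single aggregate creativity score hard to interpret and motivates reporting the dimensions separately.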

