Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as generating creative text, there is still no holistic and scalable framework for evaluating their creativity across diverse scenarios. Existing methods for evaluating LLM creativity either rely heavily on humans, which limits speed and scalability, or are fragmented across different domains and different definitions of creativity. To address this gap, we propose CREATIVITYPRISM, an evaluation and analysis framework that consolidates eight tasks from three domains (divergent thinking, creative writing, and logical reasoning) into a taxonomy of creativity that emphasizes three dimensions: quality, novelty, and diversity of LLM generations. The framework is designed to be scalable, with reliable automatic evaluation judges that have been validated against human annotations. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CREATIVITYPRISM and find that while proprietary LLMs lead open-source ones by 15% on creative writing and logical reasoning tasks, they offer no significant advantage in divergent thinking, a domain much less explored in existing post-training regimes. Our analysis also shows that high performance in one creative dimension or domain rarely generalizes to others; in particular, novelty metrics often show weak or negative correlations with other metrics. This fragmentation confirms that a holistic, multi-dimensional framework like CREATIVITYPRISM is essential for meaningful assessment of LLM creativity.