Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

Large language models (LLMs) have made remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks: they embed domain assumptions into the evaluation process, which limits scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, which we validate against human annotations, LLM-based novelty judgments, and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment while improving efficiency by over 60%. We validate our framework in three qualitatively distinct domains, namely problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties such as size, temperature, recency, and reasoning affect creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
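
To make the divergent-creativity measure concrete, the following is a minimal sketch, assuming semantic entropy is taken as the Shannon entropy of the empirical distribution of sampled responses over clusters of semantically equivalent outputs. The clustering step itself is not specified in this abstract, so the cluster labels are treated as given inputs; the function name `semantic_entropy` and the example labels are hypothetical and illustrative only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) over semantic clusters of sampled responses.

    Assumption: each response has already been assigned a cluster label by
    some separate semantic-equivalence step (not shown here).
    """
    if not cluster_ids:
        raise ValueError("need at least one response")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # H = sum_k p_k * log(1 / p_k), where p_k is the share of responses in cluster k.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical cluster labels for responses sampled from two models on one prompt.
repetitive_model = ["tape", "tape", "tape", "tape"]    # one idea repeated
diverse_model = ["tape", "string", "magnet", "lever"]  # four distinct ideas

low = semantic_entropy(repetitive_model)
high = semantic_entropy(diverse_model)
```

Under this reading, a model that keeps producing the same idea scores zero, while a model whose samples spread evenly over `k` distinct ideas scores `log(k)`, so higher values indicate more diverse output for the same prompt.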
