Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset of 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval, which aligns with human judgments substantially better than existing methods. Experimental results underscore the importance of combining human and synthetic data to train robust evaluators, and demonstrate the practical utility of CrEval in boosting the creativity of LLMs.
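To make the pairwise-comparison setup concrete, below is a minimal, self-contained sketch of how two candidate responses could be judged under a shared contextual instruction. It is an illustration only: the prompt wording, the `build_pairwise_prompt` and `judge_pair` helpers, and the mock LLM are hypothetical and do not reflect the paper's actual prompts, CrEval's training setup, or its serving interface.

```python
# Illustrative sketch (not the paper's actual prompt): pairwise creativity
# comparison in which both candidates are shown under the same contextual
# instruction, so the judgment is grounded in a shared context.

def build_pairwise_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Compose a judge prompt presenting both candidates under one instruction."""
    return (
        "You are a creativity judge.\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is more creative? Answer with 'A', 'B', or 'Tie'."
    )


def judge_pair(llm, instruction: str, response_a: str, response_b: str) -> str:
    """Query an evaluator LLM and normalise its verdict to {'A', 'B', 'Tie'}."""
    verdict = llm(build_pairwise_prompt(instruction, response_a, response_b)).strip()
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"


if __name__ == "__main__":
    # A stub LLM so the sketch runs without external dependencies; in practice
    # this would be a call to an evaluator such as CrEval.
    mock_llm = lambda prompt: "A"
    print(judge_pair(
        mock_llm,
        "Write a witty one-line excuse for being late.",
        "My alarm clock joined a union and went on strike.",
        "I was late because of traffic.",
    ))
```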