We examine, analyze, and compare four representative creativity measures across diverse creative domains such as creative writing, unconventional problem-solving, and research ideation: perplexity, LLM-as-a-Judge, the Creativity Index (CI; measuring n-gram overlap with web corpora), and syntactic templates (detecting repetition of common part-of-speech patterns). For each domain, we compile datasets of human-aligned creative and uncreative examples and evaluate each metric's ability to discriminate between the two sets. Our analyses reveal limited consistency both across domains and across metrics: metrics that distinguish creativity in one domain fail in others (e.g., CI discriminates correctly in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI rates one set as more creative while perplexity favors the other). We highlight key limitations: perplexity reflects fluency rather than novelty; LLM-as-a-Judge produces inconsistent judgments under minor prompt variations and exhibits bias toward particular labels; CI primarily measures lexical diversity and is highly sensitive to implementation choices; and syntactic templates are ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
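As a rough illustration only (not the paper's implementation, whose tokenizer, n-gram size, and corpus differ), a CI-style n-gram overlap score can be sketched as follows: lower overlap with a reference corpus is read as higher novelty.

```python
from typing import Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    # Lowercase whitespace tokenization; a real CI implementation
    # would use a proper tokenizer and a large web corpus.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(text: str, corpus_ngrams: Set[Tuple[str, ...]], n: int = 3) -> float:
    # Fraction of the text's n-grams that also appear in the reference
    # corpus; a CI-style score treats LOWER overlap as MORE novel.
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)

# Toy stand-in for a web corpus (assumption for illustration only).
corpus = ngrams("the quick brown fox jumps over the lazy dog", 3)
print(overlap_fraction("the quick brown fox ran away", corpus))  # prints 0.5
```

This toy version already exhibits the sensitivity the abstract notes: the score changes with the choice of n, the tokenization scheme, and the reference corpus.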