We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands across 12 categories and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, implying the highest-rated model beats the lowest in only about $61\%$ of head-to-head comparisons. We also analyse model diversity using cosine distances, capturing intra- and inter-model variation as well as sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
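The quoted win probability follows directly from the Bradley-Terry model's standard parameterisation, in which the probability that model $i$ is preferred to model $j$ depends only on the difference of their latent strengths:

$$P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \sigma(\theta_i - \theta_j), \qquad \sigma(0.45) = \frac{1}{1 + e^{-0.45}} \approx 0.61.$$

Thus a top-to-bottom spread of $\Delta\theta \approx 0.45$ corresponds to roughly a $61\%$ head-to-head win rate for the best model over the worst.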