We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands across 12 categories and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, implying the highest-rated model beats the lowest in only about $61\%$ of head-to-head comparisons. We also analyse model diversity using cosine distances, capturing intra- and inter-model variation as well as sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
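The quoted win probability follows directly from the Bradley-Terry model's standard parameterisation, in which the probability that model $i$ is preferred to model $j$ depends only on the difference of their latent strengths:

$$P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \sigma(\theta_i - \theta_j), \qquad \sigma(0.45) = \frac{1}{1 + e^{-0.45}} \approx 0.61.$$

Thus a top-to-bottom spread of $\Delta\theta \approx 0.45$ corresponds to roughly a $61\%$ head-to-head win rate for the best model over the worst.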