N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 8,618 expert writer annotations of novelty, pragmaticality, and sensicality, collected via close reading of human- and AI-generated text. We find that while n-gram novelty is positively associated with expert-judged creativity, approximately 91% of expressions in the top quartile of n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike in human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely than humans to produce creative expressions. Using our dataset, we test whether zero-shot, few-shot, and finetuned models can identify expressions perceived by experts as novel (a positive aspect of writing) or non-pragmatic (a negative aspect). Overall, frontier LLMs perform well above chance but leave room for improvement, struggling especially to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty ratings align more closely with expert writer preferences on an out-of-distribution dataset than an n-gram-based metric does.