We evaluate a range of recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style. We use a difficult, open-ended scenario chosen to avoid training data reuse: an epic narration of a single combat between Ignatius J. Reilly, the protagonist of the Pulitzer Prize-winning novel A Confederacy of Dunces (1980), and a pterodactyl, a prehistoric flying reptile. We ask several LLMs and humans to write such a story and conduct a human evaluation involving various criteria such as fluency, coherence, originality, humor, and style. Our results show that some state-of-the-art commercial LLMs match or slightly outperform our writers in most dimensions, whereas open-source LLMs lag behind. Humans retain an edge in creativity, while humor shows a binary divide between LLMs that can handle it comparably to humans and those that fail at it. We discuss the implications and limitations of our study and suggest directions for future research.