Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet whether these models can generate diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the lens of narrative similarity. With a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, using both human evaluations and three automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a ``mean'' generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.