To improve the quality of generated stories, recent story generation models have explored higher-level attributes such as plots or commonsense knowledge. Prompt-based learning with large language models (LLMs), exemplified by GPT-3, has shown remarkable performance on diverse natural language processing (NLP) tasks. This paper presents a comprehensive investigation, using both automatic and human evaluation, that compares the story generation capacity of LLMs with that of recent models on three datasets whose stories vary in style, register, and length. The results show that LLMs generate stories of significantly higher quality than other story generation models. Moreover, their performance rivals that of human authors, although preliminary observations indicate that they tend to replicate real stories in situations involving world knowledge, which resembles a form of plagiarism.