We evaluate recent large language models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and can include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and are therefore unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle to interpret difficult subtext. However, at their best, the models can provide thoughtful thematic analysis of stories. We additionally demonstrate that LLM judgments of summary quality do not match the feedback from the writers.