NarrativeXL: A Large-scale Dataset For Long-Term Memory Models

Despite their tremendous successes, most large language models do not have any long-term memory mechanisms, which restricts their applications. Overcoming this limitation would require not only changes to the typical Transformer architectures or training procedures, but also a dataset on which these new models could be trained and evaluated. We argue that existing resources lack a few key properties and that, at present, there are no naturalistic datasets of sufficient scale to train (and not only evaluate) long-term memory language models. We then present our solution, which capitalizes on advances in short-term memory language models to create such a dataset. Using GPT-3.5, we summarized each scene in 1,500 hand-curated books from Project Gutenberg, which resulted in approximately 150 scene-level summaries per book. We then created a number of reading comprehension questions based on these summaries, including three types of multiple-choice scene recognition questions as well as free-form narrative reconstruction questions. Each book is thus associated with more than 500 reading comprehension questions. Crucially, most questions have a known "retention demand", indicating how long-term a memory is needed to answer them, which should aid the evaluation of long-term memory performance. We validate our data in three small-scale experiments: one with human labelers and two with existing language models. We show that our questions 1) adequately represent the source material, 2) can be used to diagnose a model's memory capacity, and 3) are not trivial for modern language models even when the memory demand does not exceed those models' context lengths. Lastly, we provide our code, which can be used to further expand the dataset in an automated manner.
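To make the idea of a question with a known retention demand more concrete, the sketch below shows one possible way to represent a multiple-choice scene recognition question. It is not the dataset's released code. The class and function names, the choice to measure retention demand in tokens from the end of the target scene to the point where the question is asked, and the use of a summary from another book as a distractor are all illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Scene:
    """One scene of a book together with its generated summary (hypothetical layout)."""
    index: int       # position of the scene within the book
    end_token: int   # assumption: token offset at which the scene ends
    summary: str


@dataclass
class SceneRecognitionQuestion:
    asked_at_token: int    # token offset at which the question is posed
    options: List[str]     # candidate summaries, shuffled
    answer: int            # index of the summary of the scene actually read
    retention_demand: int  # tokens the model must remember across


def make_scene_recognition_question(
    scenes: Sequence[Scene],
    target_idx: int,
    asked_at_token: int,
    distractors: Sequence[str],
    rng: random.Random,
) -> SceneRecognitionQuestion:
    """Build a hypothetical 'which of these scenes have you read?' question.

    Assumption (not taken from the dataset's code): retention demand is the
    number of tokens between the end of the target scene and the question.
    """
    target = scenes[target_idx]
    if asked_at_token < target.end_token:
        raise ValueError("the target scene must be fully read before the question")

    # The true summary sits at candidate index 0; shuffling an index permutation
    # lets us track where it lands even if a distractor happens to be identical.
    candidates = [target.summary, *distractors]
    order = list(range(len(candidates)))
    rng.shuffle(order)
    options = [candidates[i] for i in order]
    answer = order.index(0)

    return SceneRecognitionQuestion(
        asked_at_token=asked_at_token,
        options=options,
        answer=answer,
        retention_demand=asked_at_token - target.end_token,
    )


# Toy book with three summarized scenes (illustrative data only).
book = [
    Scene(0, 1_200, "Alice follows a white rabbit down a hole."),
    Scene(1, 2_900, "Alice shrinks after drinking from a bottle."),
    Scene(2, 4_500, "Alice attends a mad tea party."),
]

# Hypothetical distractor: a summary taken from a different book.
q = make_scene_recognition_question(
    book,
    target_idx=0,
    asked_at_token=4_500,
    distractors=["A captain pursues a white whale."],
    rng=random.Random(0),
)
print(q.options, q.answer, q.retention_demand)
```

Asking about the first scene only after the third has been read yields a retention demand of 3,300 tokens in this toy setup. Varying `asked_at_token` for the same target is one way such questions could probe memory over progressively longer spans.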
