Despite their tremendous successes, most large language models lack any long-term memory mechanism, which restricts their applications. Overcoming this limitation would require not only changes to typical transformer architectures or training procedures, but also a dataset on which such models could be trained and evaluated. We argue that existing resources lack a few key properties and that, at present, no naturalistic dataset is large enough to train (and not only evaluate) long-term memory language models. We then present a solution that capitalizes on advances in short-term memory language models to create such a dataset. Using GPT 3.5, we summarized each scene in 1500 hand-curated books from Project Gutenberg, which resulted in approximately 150 scene-level summaries per book. We then created a number of reading comprehension questions based on these summaries, including three types of multiple-choice scene recognition questions as well as free-form narrative reconstruction questions. Each book is thus associated with more than 500 reading comprehension questions. Crucially, most questions have a known ``retention demand'', indicating how long a memory span is needed to answer them, which should aid the evaluation of long-term memory performance. We validate our data in three small-scale experiments: one with human labelers and two with existing language models. We show that our questions 1) adequately represent the source material, 2) can be used to diagnose a model's memory capacity, and 3) are not trivial for modern language models even when the memory demand does not exceed those models' context lengths. Lastly, we provide our code, which can be used to further expand the dataset in an automated manner.
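As a rough illustration of how a known retention demand could support memory diagnosis, the sketch below groups answered questions into retention-demand buckets and reports accuracy per bucket. It is a minimal sketch under stated assumptions, not the released code: the `Question` record, its field names, and the choice of scenes as the unit of retention demand are all hypothetical and are not taken from the dataset's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record; field names are illustrative, not the dataset's schema."""
    book_id: str
    retention_demand: int  # assumed unit: scenes between the source scene and the question
    correct_option: int    # index of the correct multiple-choice answer
    predicted_option: int  # index of the answer chosen by the model under test


def accuracy_by_retention(questions, bucket_size):
    """Group questions into retention-demand buckets and return accuracy per bucket.

    Keys of the returned dict are the lower bounds of each bucket.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in questions:
        bucket = (q.retention_demand // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(q.predicted_option == q.correct_option)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


if __name__ == "__main__":
    # Made-up sample answers, used only to exercise the function.
    sample = [
        Question("book-a", 3, 1, 1),
        Question("book-a", 8, 2, 2),
        Question("book-a", 57, 0, 3),
        Question("book-a", 62, 1, 1),
    ]
    print(accuracy_by_retention(sample, bucket_size=50))
```

If accuracy falls as the bucket's lower bound grows, that would suggest the model loses information over longer spans; the bucket size is a free parameter that would need tuning to the unit in which retention demand is actually recorded.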