This paper addresses the challenge of processing long documents with generative transformer models. To compare approaches, we introduce BABILong, a new benchmark that assesses a model's ability to extract and process facts distributed across long texts. Our evaluation, which includes GPT-4 and RAG baselines, shows that common methods are effective only for sequences of up to $10^4$ elements. In contrast, GPT-2 fine-tuned with recurrent memory augmentations handles tasks involving up to $10^7$ elements. This is by far the longest input processed by any open neural network model to date and a substantial improvement in long-sequence processing capability.