Since transformers were first proposed, they have been limited to bounded input lengths because they need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, where the returned kNN distances are the attention dot-product scores. This kNN index can be kept in either GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process inputs as long as 500k tokens from the BookSum dataset without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer.
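To make the top-k retrieval idea concrete, the following is a minimal sketch of attention restricted to the k keys with the largest dot product against a query. It is not the authors' implementation: the function names (`dot`, `softmax`, `topk_attention`) and the toy data are hypothetical, and a brute-force scan stands in for the kNN index, which in practice would return the same scores in sub-linear time.

```python
import heapq
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the largest dot product with `query`.

    Returns the attention output vector and the retrieved key positions.
    """
    # Assumption: this exhaustive scan is a stand-in for a kNN index lookup.
    # The retrieved scores are the attention dot products themselves.
    scored = ((dot(query, key), i) for i, key in enumerate(keys))
    top = heapq.nlargest(k, scored)

    # Softmax is computed over the retrieved scores only.
    weights = softmax([score for score, _ in top])

    dim = len(values[0])
    output = [0.0] * dim
    for weight, (_, i) in zip(weights, top):
        for d in range(dim):
            output[d] += weight * values[i][d]
    return output, [i for _, i in top]


# Toy example (hypothetical data): four keys, one-dimensional values.
keys = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [-1.0, -1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [1.0, 1.0]

approx, retrieved = topk_attention(query, keys, values, k=2)
exact, _ = topk_attention(query, keys, values, k=len(keys))
print(retrieved, approx, exact)
```

Setting `k` to the number of keys recovers full attention. When the softmax mass is concentrated on a few keys, as in this toy example, the top-k output stays close to the full-attention output, which is the intuition behind retrieving only the top-k keys per attention head.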