Transformer models are foundational to natural language processing (NLP) and computer vision. Despite various recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length $n$), efficiently handling ultra-long sequences (e.g., more than 16K tokens) remains challenging. Applications such as answering questions based on an entire book or summarizing a scientific article remain inefficient or infeasible. In this paper, we propose to significantly reduce the dependency of a Transformer model's complexity on $n$ by compressing the input, at each layer, into a representation whose size $r$ is independent of $n$. Specifically, by exploiting the fact that in many tasks only a small subset of special tokens (which we call VIP-tokens) is most relevant to the final prediction, we propose a VIP-token centric compression (Vcc) scheme that selectively compresses the tokens of the input sequence based on their impact on approximating the representation of these VIP-tokens. Compared with competitive baselines, the proposed algorithm is not only efficient (achieving more than $3\times$ efficiency improvement over baselines on 4K and 16K lengths), but also achieves competitive or better performance on a large number of tasks. Further, we show that our algorithm can be scaled to 128K tokens (or more) while consistently offering accuracy improvements.
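To make the idea of a compressed representation whose size $r$ does not grow with $n$ more concrete, the following toy sketch may help. It is not the Vcc algorithm itself, whose details are not given in this abstract. It assumes a simple stand-in: non-VIP tokens are scored by the attention they receive from the VIP-tokens, the top few are kept as-is, and the rest are averaged into a fixed number of summary rows. The function name `compress_around_vip` and the parameters `keep` and `buckets` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compress_around_vip(tokens, vip_idx, keep, buckets):
    """Toy stand-in (an assumption, not the paper's method).

    tokens  : list of n embedding vectors (lists of floats)
    vip_idx : positions of the VIP-tokens, always kept verbatim
    keep    : number of non-VIP tokens kept at full resolution
    buckets : maximum number of averaged summary rows (must be >= 1)
    Returns at most len(vip_idx) + keep + buckets rows, regardless of n.
    """
    vip_set = set(vip_idx)
    others = [i for i in range(len(tokens)) if i not in vip_set]
    if not others:
        return [tokens[i] for i in vip_idx]
    scale = math.sqrt(len(tokens[0]))

    # Importance of a non-VIP token = total attention it receives from VIP-tokens.
    importance = {i: 0.0 for i in others}
    for v in vip_idx:
        weights = softmax([dot(tokens[v], tokens[i]) / scale for i in others])
        for i, w in zip(others, weights):
            importance[i] += w

    ranked = sorted(others, key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])   # most relevant tokens, in original order
    rest = sorted(ranked[keep:])   # everything else gets averaged

    # Average the less relevant tokens in contiguous chunks, one row per chunk.
    summaries = []
    if rest:
        chunk = math.ceil(len(rest) / buckets)
        for start in range(0, len(rest), chunk):
            group = [tokens[i] for i in rest[start:start + chunk]]
            summaries.append(mean_vector(group))

    return [tokens[i] for i in vip_idx] + [tokens[i] for i in kept] + summaries


if __name__ == "__main__":
    random.seed(0)
    for n in (64, 1024):
        tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(n)]
        compressed = compress_around_vip(tokens, vip_idx=[0, 1], keep=6, buckets=8)
        print(n, "->", len(compressed))  # prints "64 -> 16" then "1024 -> 16"
```

In this sketch the scoring pass costs time proportional to the number of VIP-tokens times $n$, so it avoids the quadratic all-pairs attention, and the output length is bounded by `len(vip_idx) + keep + buckets` no matter how long the input is. How the actual Vcc scheme selects, compresses, and later recovers tokens may differ substantially from this illustration.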