We introduce Transformer-VQ, a decoder-only transformer that computes softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code is available at \url{https://github.com/transformer-vq/transformer_vq}
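To illustrate why vector-quantized keys can make softmax attention cheaper, the following is a minimal sketch and not the authors' implementation. It assumes a simplified non-causal setting with a fixed codebook and omits the caching mechanism, causal masking, and numerical stabilization. The function names (`quantize`, `dense_attention`, `vq_attention`) and the toy sizes are hypothetical. The idea shown is that when every key is replaced by one of S codebook rows, the softmax numerator and denominator can be grouped by code, so each query needs only S exponentials plus per-code value sums that are built in a single pass over the sequence.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize(key, codebook):
    """Index of the codebook row nearest to `key` in squared L2 distance."""
    return min(range(len(codebook)),
               key=lambda s: sum((a - b) ** 2 for a, b in zip(key, codebook[s])))


def dense_attention(queries, keys_hat, values):
    """Reference: quadratic-time softmax attention over the quantized keys."""
    dim_v = len(values[0])
    out = []
    for q in queries:
        w = [math.exp(dot(q, k)) for k in keys_hat]  # no max-subtraction: toy scale only
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, values)) / z
                    for d in range(dim_v)])
    return out


def vq_attention(queries, codes, codebook, values):
    """Same output, but O(T * S) work instead of O(T^2) (non-causal sketch)."""
    n_codes, dim_v = len(codebook), len(values[0])
    counts = [0] * n_codes                          # how many keys map to each code
    vsum = [[0.0] * dim_v for _ in range(n_codes)]  # sum of values per code
    for z, v in zip(codes, values):
        counts[z] += 1
        for d in range(dim_v):
            vsum[z][d] += v[d]
    out = []
    for q in queries:
        w = [math.exp(dot(q, c)) for c in codebook]  # S exponentials per query
        denom = sum(wi * n for wi, n in zip(w, counts))
        out.append([sum(w[s] * vsum[s][d] for s in range(n_codes)) / denom
                    for d in range(dim_v)])
    return out


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy data: T tokens, S codes, D-dimensional queries/keys/values (all hypothetical sizes).
rng = random.Random(0)
T, S, D = 64, 4, 8
codebook = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(S)]
queries = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(T)]
keys = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(T)]
values = [[rng.gauss(0, 1.0) for _ in range(D)] for _ in range(T)]

codes = [quantize(k, codebook) for k in keys]
keys_hat = [codebook[z] for z in codes]

dense_out = dense_attention(queries, keys_hat, values)
vq_out = vq_attention(queries, codes, codebook, values)
print("max abs difference:", max_abs_diff(dense_out, vq_out))
```

The two functions agree up to floating-point rounding because the grouping by code is an exact algebraic rearrangement of the softmax sums. Extending the sketch to the decoder-only (causal) setting would require running per-code statistics over the sequence, which is the role the abstract assigns to the caching mechanism.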