We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
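To make the layer-count scaling concrete, here is a minimal back-of-the-envelope sketch (not the GoldFinch implementation; the compression ratio and model dimensions below are assumed placeholders). A traditional transformer stores keys and values per layer, so its cache grows with depth, while a single compressed cache shared across layers does not; the savings ratio therefore grows linearly with layer count.

```python
def kv_cache_bytes(layers, d_model, seq_len, bytes_per_param=2):
    # Traditional transformer KV-cache: K and V tensors stored for every layer.
    return 2 * layers * d_model * seq_len * bytes_per_param

def shared_compressed_cache_bytes(d_model, seq_len, compression=16, bytes_per_param=2):
    # Hypothetical compressed cache in the GoldFinch style: a single
    # key-like tensor shared across all layers. The 16x compression
    # factor is an illustrative assumption, not a figure from the paper.
    return (d_model // compression) * seq_len * bytes_per_param

layers, d_model, seq_len = 24, 2048, 4096
ratio = kv_cache_bytes(layers, d_model, seq_len) // shared_compressed_cache_bytes(d_model, seq_len)
print(ratio)  # savings ratio; doubling `layers` doubles it
```

Because the per-layer term drops out of the shared cache, deeper models see proportionally larger savings, which is why the reported 756-2550x range depends on model size.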