Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
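To make the term "gated linear recurrence" concrete, the following is a minimal sketch of one generic form, in which the hidden state is updated as h_t = a_t * h_{t-1} + (1 - a_t) * x_t with an input-dependent gate a_t in (0, 1). This form, the function name `gated_linear_recurrence`, and the `gate_logits` parameter are assumptions made for illustration. The abstract does not specify the recurrent layer used in Hawk or Griffin, and this sketch should not be read as that layer.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, gate_logits, h0=0.0):
    """Scan h_t = a_t * h_{t-1} + (1 - a_t) * x_t over a scalar sequence.

    Illustrative assumption, not the layer from the paper:
    a_t = sigmoid(gate_logits[t]) is a gate in (0, 1). The update is
    linear in h_{t-1}; no nonlinearity is applied to the state itself.
    """
    if len(xs) != len(gate_logits):
        raise ValueError("xs and gate_logits must have the same length")
    h = h0
    states = []
    for x, g in zip(xs, gate_logits):
        a = sigmoid(g)             # how much of the previous state to keep
        h = a * h + (1.0 - a) * x  # linear update of the hidden state
        states.append(h)
    return states


if __name__ == "__main__":
    # A gate logit of 0 gives a_t = 0.5, so each step averages state and input.
    print(gated_linear_recurrence([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

Because the state is a fixed-size value carried from step to step, each new token costs the same amount of work regardless of how long the sequence already is, which is the property behind the fast-inference claim for RNNs in general.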