We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures and models converted from the popular Qwen2.5 open-source models at 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than \$2,000 USD at today's prices, yet inference quality remains close to that of the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models, which are additionally governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
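As a rough sanity check on the token-budget claim, the arithmetic below relates the 350-700M conversion tokens to an assumed teacher pretraining corpus. The ~18T-token figure is an assumption drawn from public reporting on Qwen2.5 pretraining, not a number stated in this abstract:

```python
# Hedged sanity check: conversion token budget as a fraction of an
# assumed teacher pretraining corpus size (assumption, not from the text).
teacher_tokens = 18e12      # assumed Qwen2.5 pretraining token count
conversion_tokens = 700e6   # upper end of the 350-700M range above

fraction = conversion_tokens / teacher_tokens
print(f"{fraction:.6%}")    # comfortably under the quoted 0.005% bound
```

Under that assumption, even the 700M upper bound works out to under four thousandths of a percent of the teacher's training tokens, consistent with the "less than 0.005%" claim.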