We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through modifications to positional embedding, linear attention acceleration, the gating mechanism, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces its memory usage by a factor of four. To further improve TransNormer, we use a gating mechanism to smooth training and a new tensor normalization scheme that speeds up the model by over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and a consistent inference speed regardless of sequence length, so the model is efficient during both training and inference. Scalability is central to our model's design: it can be deployed on large-scale clusters and extended to larger models while maintaining its performance. We validate our model design through a series of comprehensive experiments on our self-collected corpus, which exceeds 6TB in size and contains over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter the collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
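To illustrate why linear attention with an exponential decay can support an inference speed that does not depend on sequence length, the following is a minimal sketch in plain Python. It is not the Lightning Attention kernel or the inference algorithm described here, and it omits LRPE, gating, and normalization. The function names, the scalar decay `lam`, and the update rule `state = lam * state + k_t v_t^T` are assumptions chosen for illustration. The sketch checks that a quadratic causal form and a recurrent form with a fixed-size state produce the same outputs.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def attention_parallel(q, k, v, lam):
    """Quadratic causal form: o_t = sum_{s<=t} lam^(t-s) * (q_t . k_s) * v_s."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        o = [0.0] * d_v
        for s in range(t + 1):
            w = (lam ** (t - s)) * dot(q[t], k[s])
            for j in range(d_v):
                o[j] += w * v[s][j]
        out.append(o)
    return out


def attention_recurrent(q, k, v, lam):
    """Recurrent form with a (d_k x d_v) state; work per token is constant."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for t in range(len(q)):
        # Decay the running summary, then add the new key-value outer product.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = lam * state[i][j] + k[t][i] * v[t][j]
        # Read out with the current query: o_t = q_t^T state.
        out.append([sum(q[t][i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


if __name__ == "__main__":
    random.seed(0)
    n, d = 16, 4  # hypothetical sequence length and head dimension
    q, k, v = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
               for _ in range(3))
    a = attention_parallel(q, k, v, lam=0.9)
    b = attention_recurrent(q, k, v, lam=0.9)
    max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print("forms agree:", max_diff < 1e-9)
```

In the recurrent form the state has a fixed size of `d_k x d_v`, so each new token costs the same amount of work however many tokens came before it. The quadratic form, in contrast, revisits every earlier position for each new token.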