In this technical report, we present the Ring-linear model series, comprising Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 has 16B total parameters with 957M activated per token, while Ring-flash-linear-2.0 has 104B total parameters with 6.1B activated per token. Both models adopt a hybrid architecture that integrates linear attention with softmax attention, substantially reducing I/O and computational overhead in long-context inference scenarios. Relative to a 32B dense model, this series cuts inference cost to 1/10, and relative to the original Ring series it cuts cost by more than 50%. Furthermore, through a systematic exploration of the ratio between the two attention mechanisms in the hybrid architecture, we identify the currently optimal model structure. In addition, our self-developed high-performance FP8 operator library, linghe, improves overall training efficiency by 50%. Because the training and inference engines share highly aligned operators, the models can be optimized stably and efficiently over long reinforcement learning runs, consistently maintaining SOTA performance across multiple challenging complex-reasoning benchmarks.
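To make the hybrid design concrete, the following is a minimal PyTorch sketch of one way to interleave the two attention types in a layer stack. The interleave `ratio`, the elu-based feature map, the dimensions, and all module names are illustrative assumptions for exposition, not the actual Ring-linear-2.0 implementation.

```python
# Minimal sketch of a hybrid linear/softmax attention stack (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Causal linear attention: O(L) state via prefix sums instead of an O(L^2) map."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, l, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, l, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1  # positive feature map (an assumption)
        # Running causal state: cumulative sums of k^T v outer products and of k.
        kv = torch.einsum("bhld,bhle->bhlde", k, v).cumsum(dim=2)
        z = k.cumsum(dim=2)
        num = torch.einsum("bhld,bhlde->bhle", q, kv)
        den = torch.einsum("bhld,bhld->bhl", q, z).clamp(min=1e-6)
        y = num / den.unsqueeze(-1)
        return self.out(y.transpose(1, 2).reshape(b, l, d))

class SoftmaxAttention(nn.Module):
    """Standard causal softmax attention (O(L^2) in sequence length)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, l, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, l, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, l, d))

class HybridStack(nn.Module):
    """Place one softmax layer per `ratio` layers; the rest use linear attention."""
    def __init__(self, dim: int, depth: int, ratio: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            SoftmaxAttention(dim) if (i + 1) % ratio == 0 else LinearAttention(dim)
            for i in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection
        return x

x = torch.randn(2, 128, 256)
print(HybridStack(dim=256, depth=8, ratio=4)(x).shape)  # torch.Size([2, 128, 256])
```

The intuition behind such a mix: linear attention layers keep per-token cost and KV state constant in sequence length, while the periodic softmax layers retain exact global token-to-token interactions. The report's contribution is determining the best value of this ratio empirically; the `ratio=4` above is only a placeholder.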