In this work, we introduce Localized Filtering-based Attention (LFA), which incorporates prior knowledge of the local dependencies of natural language into the attention mechanism. Based on LFA, we develop and release Yuan 2.0, a family of large language models ranging from 2.1 billion to 102.6 billion parameters. We present a data filtering and generation method for building high-quality pretraining and fine-tuning datasets. We also propose a distributed training method that combines non-uniform pipeline parallelism, data parallelism, and optimizer parallelism; it greatly reduces the bandwidth requirements of intra-node communication and achieves good performance in large-scale distributed training. Compared with existing models, Yuan 2.0 models display impressive ability in code generation, math problem-solving, and chat. The latest version of Yuan 2.0, including model weights and source code, is available on GitHub.