Residual networks, viewed as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advances in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision with which an ODE is solved significantly affects parameter optimization and, in turn, model performance. In this work, we present a series of explorations of Transformer architecture design aimed at minimizing the error relative to the true ``solution.'' First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen the high-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT'14 English-German and English-French tasks, our model achieves BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a strong 3.8B-parameter DeepNet by an average of 2.9 SacreBLEU while using only one-third of its parameters. Notably, it also outperforms LLaMA models by 5.7 accuracy points on the LM Evaluation Harness.
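To make the predictor-corrector idea concrete, the sketch below shows one plausible instantiation under the ODE view of a residual block, where the block function $F$ plays the role of the right-hand side of $dy/dt = F(y)$: an Adams-Bashforth-style multistep predictor whose coefficients follow a learned exponential-moving-average schedule $\gamma(1-\gamma)^i$, followed by a trapezoidal (Adams-Moulton-style) one-step corrector. The class name \texttt{PredictorCorrectorBlock}, the stand-in $F$, and the normalized EMA weighting are illustrative assumptions, not the exact formulation of this work.

\begin{verbatim}
import torch
import torch.nn as nn

class PredictorCorrectorBlock(nn.Module):
    """Illustrative sketch (not the paper's exact method): one
    predictor-corrector step for dy/dt = F(y), with EMA-weighted
    multistep prediction and a trapezoidal correction."""

    def __init__(self, d_model, num_steps=4):
        super().__init__()
        # Stand-in for the Transformer sublayers (attention + FFN).
        self.F = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model)
        )
        # Single learnable scalar generating all EMA coefficients.
        self.gamma = nn.Parameter(torch.tensor(0.0))
        self.num_steps = num_steps

    def ema_coeffs(self, n):
        # Coefficients g*(1-g)^i over the n cached derivatives,
        # normalized to sum to 1 (an assumed parameterization).
        g = torch.sigmoid(self.gamma)
        w = torch.stack([g * (1 - g) ** i for i in range(n)])
        return w / w.sum()

    def forward(self, y, history):
        # history: derivatives from earlier layers, most recent first.
        f_t = self.F(y)
        feats = [f_t] + history[: self.num_steps - 1]
        w = self.ema_coeffs(len(feats))
        # High-order predictor: EMA-weighted multistep combination.
        y_pred = y + sum(wi * fi for wi, fi in zip(w, feats))
        # Corrector: re-evaluate F at the predicted point
        # and apply the trapezoidal rule.
        y_corr = y + 0.5 * (f_t + self.F(y_pred))
        return y_corr, ([f_t] + history)[: self.num_steps]

# Usage: stack blocks, threading the derivative history through layers.
blocks = [PredictorCorrectorBlock(d_model=512) for _ in range(6)]
y, history = torch.randn(2, 10, 512), []
for block in blocks:
    y, history = block(y, history)
\end{verbatim}

In this sketch the corrector uses only the current and predicted derivatives; a multistep corrector, as described above, would additionally reuse the cached history with implicit-style coefficients.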