Training models of varying capacities is useful when they must be deployed in different scenarios. High-capacity models offer better performance, whereas low-capacity models require fewer computing resources for training and inference. In this work, we propose a novel one-stop training framework that consists of two composite model architectures and a joint training algorithm called Two-Stage Joint-Training (TSJT). Unlike knowledge distillation, in which models of different capacities are trained from scratch separately, our approach integrates supervision from different flexible-capacity models simultaneously, leading to faster and more efficient convergence. Extensive experiments on the WMT10 benchmark show that our method outperforms low-capacity baseline models and achieves comparable or better performance for high-capacity models. Notably, our analysis shows that the method significantly influences the initial training process, leading to more efficient convergence and superior solutions.