Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.
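The abstract does not state how the residuals are formed, partitioned, or scaled, so the following is only a minimal sketch of one possible reading of SVD-partitioned initialization, not the authors' implementation. The function name `svd_partitioned_experts`, the round-robin assignment of singular components to experts, and the scale `alpha` are all assumptions introduced for illustration.

```python
import numpy as np


def svd_partitioned_experts(weight, num_experts, alpha=0.1):
    """Hypothetical sketch: derive per-expert weights from one pretrained FFN matrix.

    Assumptions (not taken from the paper): singular components are assigned
    to experts round-robin, and each expert adds its own slice, scaled by
    `alpha`, on top of the unchanged pretrained weight.
    """
    # Thin SVD: weight == (u * s) @ vt
    u, s, vt = np.linalg.svd(weight, full_matrices=False)

    experts, residuals = [], []
    for e in range(num_experts):
        # Indices of the singular components owned by expert e
        idx = np.arange(e, s.shape[0], num_experts)
        # Low-rank piece rebuilt from this expert's slice of the spectrum
        residual = (u[:, idx] * s[idx]) @ vt[idx, :]
        residuals.append(residual)
        # Every expert stays close to the pretrained weight but differs by its slice
        experts.append(weight + alpha * residual)
    return experts, residuals


# Toy stand-in for a pretrained FFN projection matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))
experts, residuals = svd_partitioned_experts(w, num_experts=4, alpha=0.1)
```

Under these assumptions the slices are disjoint, so they sum back to the pretrained matrix and are mutually orthogonal under the Frobenius inner product. Each expert therefore starts near the pretrained weights while differing from the others along distinct spectral directions.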