Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging modalities in early stages, suppressing the contribution of slower yet complementary ones, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
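To make the three components concrete, below is a minimal PyTorch-style sketch of one plausible realization. All names (`update_wa_teacher`, `adaptive_modality_dropout`, `gradient_consistency`, `cross_modal_distill`) and the exact forms of the losses are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_wa_teacher(teacher, student, beta=0.999):
    # WA-based teacher: exponential moving average of student weights
    # (one common way to maintain a weight-averaged model).
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(beta).add_(s, alpha=1.0 - beta)

def adaptive_modality_dropout(feats, drop_probs):
    # Zero out whole modality features at random; assigning higher drop
    # probability to faster-converging modalities (assumed schedule) curbs
    # early-stage bias toward dominant modalities.
    kept = []
    for f, p in zip(feats, drop_probs):
        mask = (torch.rand(f.size(0), 1, device=f.device) > p).float()
        kept.append(f * mask)
    return kept

def gradient_consistency(loss_fused, loss_uni, shared_params):
    # Penalize misalignment between gradients of the fused-branch loss and a
    # uni-modal branch loss w.r.t. shared parameters (cosine form assumed).
    g_f = torch.autograd.grad(loss_fused, shared_params,
                              retain_graph=True, create_graph=True)
    g_u = torch.autograd.grad(loss_uni, shared_params,
                              retain_graph=True, create_graph=True)
    g_f = torch.cat([g.flatten() for g in g_f])
    g_u = torch.cat([g.flatten() for g in g_u])
    return 1.0 - F.cosine_similarity(g_f, g_u, dim=0)

def cross_modal_distill(teacher_fused_logits, student_uni_logits, T=2.0):
    # Distill the WA teacher's fused prediction into each uni-modal branch
    # via temperature-scaled KL divergence.
    p_t = F.softmax(teacher_fused_logits.detach() / T, dim=-1)
    loss = 0.0
    for logits in student_uni_logits:
        loss = loss + F.kl_div(F.log_softmax(logits / T, dim=-1),
                               p_t, reduction="batchmean") * (T * T)
    return loss / len(student_uni_logits)
```

In a training step, the student's per-modality losses and the fused loss would be combined with the gradient-consistency and distillation terms, and `update_wa_teacher` would be called after each optimizer step; the weighting of these terms is a design choice the abstract does not specify.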