Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, revealing a U-shaped robustness pattern in which multimodal fusion is favored under both complete and severely missing-modality conditions. The code will be available on GitHub.
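The reconstruction step described above can be illustrated with a minimal sketch: gradient descent on an energy function recovers a representation for a missing modality, and the final energy value serves as an uncertainty score. This is not the paper's implementation — the quadratic energy `E(z) = 0.5 * ||A z - b||^2` stands in for the learned energy function, and the names `A`, `b`, `energy`, `grad`, and `reconstruct` are all illustrative assumptions.

```python
import numpy as np

# Toy quadratic energy standing in for the learned energy function.
# A and b are random placeholders; in DCER the energy would be a trained network.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)

def energy(z):
    """E(z) = 0.5 * ||A z - b||^2 — lower energy means a more plausible representation."""
    r = A @ z - b
    return 0.5 * float(r @ r)

def grad(z):
    """Analytic gradient of the quadratic energy: A^T (A z - b)."""
    return A.T @ (A @ z - b)

def reconstruct(z0, lr=0.05, steps=200):
    """Recover a missing-modality representation by gradient descent on E.

    Returns the recovered vector and its final energy, which doubles as an
    intrinsic uncertainty estimate (higher residual energy = less confident).
    """
    z = z0.copy()
    for _ in range(steps):
        z -= lr * grad(z)
    return z, energy(z)

z_hat, e_final = reconstruct(np.zeros(4))
```

Under this sketch, `reconstruct` starts from a neutral initialization and descends the energy surface; the residual `e_final` is what the abstract refers to as intrinsic uncertainty quantification.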