Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local feature responses within the same modality can disagree with evidence from other sources. This work investigates how to adjust multimodal representations before they are merged by a downstream predictor. We develop a compact calibration module that compares each modality with the others at the summary level, extracts cues of cross-source support and conflict, and converts these cues into instance-wise and dimension-wise modulation signals. The calibration is applied to the original modality features rather than to already fused representations, enabling the model to suppress misleading components, preserve weak but useful evidence, and emphasize responses that are better supported by the current multimodal context. The module is designed as a plug-in component and can be attached to different fusion backbones without changing their prediction heads. Across five benchmarks covering sentiment understanding, action recognition, audio-visual event detection, and audio-visual emotion classification, the proposed pre-combination calibration strategy improves performance under both sequence-based and convolutional fusion settings. Additional analyses of modality removal, synthetic corruption, training dynamics, and feature-level visualizations show that calibrating signals before fusion can reduce interference from unreliable modalities and produce more stable multimodal optimization.
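The calibration idea described above (compare summaries, derive support and conflict cues, then rescale the original features before fusion) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that every modality has already been projected to a shared feature dimension, that the summary is a mean over time steps, that the instance-wise signal is a sigmoid of the cosine similarity between a modality's summary and the average summary of the others, and that the dimension-wise signal combines an element-wise product (support) and an absolute difference (conflict). The function names and the scalars `w_support`, `w_conflict`, and `bias` are hypothetical stand-ins for what would presumably be learned parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_pool(seq):
    """Average a (T x d) sequence into one d-dimensional summary vector."""
    d = len(seq[0])
    return [sum(step[i] for step in seq) / len(seq) for i in range(d)]


def cosine(u, v, eps=1e-8):
    """Cosine similarity; eps guards against zero-norm vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def calibrate(features, w_support=1.0, w_conflict=1.0, bias=0.0):
    """Rescale each modality's original features before fusion.

    features: dict mapping modality name -> (T x d) list of lists.
    All modalities are assumed to share the same d (an assumption of
    this sketch); sequence lengths T may differ.
    """
    if len(features) < 2:
        raise ValueError("need at least two modalities to compare")
    summaries = {name: mean_pool(seq) for name, seq in features.items()}
    if len({len(s) for s in summaries.values()}) != 1:
        raise ValueError("all modalities must share one feature dimension")

    calibrated = {}
    for name, seq in features.items():
        own = summaries[name]
        others = [s for other, s in summaries.items() if other != name]
        # Cross-modal context: average summary of all other modalities.
        context = [sum(vals) / len(others) for vals in zip(*others)]

        # Instance-wise signal: one scalar in (0, 1) per modality.
        instance_gate = sigmoid(cosine(own, context))

        # Dimension-wise signal: support raises a gate, conflict lowers it.
        dim_gate = [
            sigmoid(w_support * o * c - w_conflict * abs(o - c) + bias)
            for o, c in zip(own, context)
        ]

        # Apply both signals to the ORIGINAL features, not a fused vector.
        calibrated[name] = [
            [x * instance_gate * g for x, g in zip(step, dim_gate)]
            for step in seq
        ]
    return calibrated
```

Because the output keeps the same shapes as the input, a call such as `calibrate({"text": text_seq, "audio": audio_seq, "video": video_seq})` could sit in front of any fusion backbone without touching its prediction head, which matches the plug-in design the abstract describes.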