Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under such asymmetric shift, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate the fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Building on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct class ranking, together with a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97% to 66.51% under semantics-preserving textual shift and from 21.68% to 26.27% under joint visual-textual shift, while remaining competitive on the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
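As a concrete illustration of the objective described above, the following is a minimal sketch under stated assumptions: $g \in [0,1]$ denotes a scalar fusion gate, $p_v$ and $p_t$ the visual and textual posteriors, $H$ the Shannon entropy, $R$ the reliability-aware gate prior, and $\lambda$ a trade-off weight. The convex-combination fusion and all symbols here are illustrative assumptions, not the paper's exact formulation.

$$\min_{g \in [0,1]} \; H\!\big(\, g\, p_v + (1-g)\, p_t \,\big) \;+\; \lambda\, R(g)$$

Under a form like this, the prior $R(g)$ would penalize placing gate weight on a modality whose anchor-based consistency is low or whose cross-modal conflict is high, so that entropy is reduced only in directions the reliability evidence supports.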