Object-context shortcuts remain a persistent challenge for vision-language models, undermining zero-shot reliability when test-time scenes depart from the co-occurrence patterns seen during training. We recast this issue as a causal inference problem and ask: would the prediction hold if the object appeared in a different environment? To answer this question at inference time, we estimate object and background expectations within CLIP's representation space and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating an intervention, we further subtract the background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond raw performance, our framework offers a lightweight, representation-level counterfactual approach and a practical causal route to debiased, reliable multimodal reasoning.
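To make the scoring rule concrete, the sketch below shows one way the counterfactual recombination and Total-Direct-Effect subtraction could be realized in PyTorch. It is a minimal illustration under assumptions, not the paper's implementation: the function name `tde_score`, the additive composition of object and context components, and the subtraction weight `alpha` are all hypothetical, and the object/background decomposition of the image embedding is taken as given.

```python
import torch
import torch.nn.functional as F

def tde_score(obj_emb, bg_emb, ctx_embs, text_embs, alpha=1.0):
    """Illustrative TDE-style zero-shot scoring (hypothetical sketch).

    obj_emb:   (d,)   estimated object component of the image embedding
    bg_emb:    (d,)   estimated background component
    ctx_embs:  (k, d) alternative context embeddings (e.g., from external
                      datasets, batch neighbors, or text descriptions)
    text_embs: (c, d) class text embeddings from CLIP's text encoder
    """
    txt = F.normalize(text_embs, dim=-1)                      # (c, d)

    # Counterfactual embeddings: the same object recombined with each
    # alternative context (simple additive composition assumed here).
    cf = F.normalize(obj_emb.unsqueeze(0) + ctx_embs, dim=-1)  # (k, d)
    cf_logits = cf @ txt.T                                     # (k, c)

    # Background-only activation: what each class scores when only the
    # context is present; this is the "hallucinated" part to remove.
    bg_logits = F.normalize(bg_emb, dim=-1) @ txt.T            # (c,)

    # TDE: average score across counterfactual environments, minus the
    # background-only term, keeping the object's direct contribution.
    return cf_logits.mean(dim=0) - alpha * bg_logits           # (c,)
```

Averaging over many sampled contexts approximates asking the counterfactual question ("would the prediction hold in a different environment?"), while the subtracted background-only term removes the score the context alone would have induced.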