Vision-language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality (visual or textual), leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules minimize the conditional mutual information between spurious cues and predictions, steering the model toward causal, domain-invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracy over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
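To make the textual-side idea concrete, the following is a minimal sketch (not the authors' implementation) of a balanced prompt normalization step: class text embeddings are shifted by their mean plus a learnable center and then renormalized, nudging the class-embedding set toward a more isotropic layout. The module name, the zero initialization of the center, and the 512-dimensional stand-in features are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BalancedPromptNormalization(nn.Module):
    """Illustrative re-centering of class text embeddings (assumed design)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Learnable re-centering vector; zero initialization is an assumption.
        self.center = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (num_classes, embed_dim), e.g. CLIP text features.
        # Subtract the empirical mean and the learnable center, then renormalize,
        # so the class embeddings spread more evenly on the unit sphere.
        shifted = class_embeddings - class_embeddings.mean(dim=0) - self.center
        return F.normalize(shifted, dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in for CLIP text features of 10 classes with 512-dim embeddings.
    text_feats = F.normalize(torch.randn(10, 512), dim=-1)
    bpn = BalancedPromptNormalization(embed_dim=512)
    balanced = bpn(text_feats)
    print(balanced.shape)  # torch.Size([10, 512])
```

In such a sketch, only `self.center` would be updated during test-time adaptation, keeping the adaptation lightweight relative to fine-tuning the text encoder itself.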