Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence -- a phenomenon termed ``text shortcut learning.'' We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring the accuracy degradation (Drop) that occurs when semantically conflicting text is paired with unchanged images. Four adversarial strategies -- shape\_swap, color\_swap, position\_swap, and random\_text -- are applied to a controlled geometric-shapes dataset ($n{=}1{,}000$). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (which integrates hard negative mining, label smoothing, layer-wise learning rates, cosine restarts, curriculum learning, and data augmentation). The optimized model reduces the average Drop from 27.5\% to 9.8\% (a 64.4\% relative improvement, $p{<}0.001$) while maintaining 97\% accuracy on normal (non-adversarial) inputs. Attention visualization and embedding-space analysis confirm that the optimized model attends more to visual features and achieves tighter cross-modal alignment.
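
One way to read the Drop metric, stated here as an assumption because the formula is not given above: for an adversarial strategy $s$, Drop is the gap between accuracy on unmodified image-text pairs and accuracy under $s$, and the average is taken over the four strategies. The symbols $\mathrm{Acc}_{\mathrm{normal}}$, $\mathrm{Acc}_{s}$, and $\mathcal{S}$ are illustrative notation introduced for this sketch.
\[
\mathrm{Drop}_{s} = \mathrm{Acc}_{\mathrm{normal}} - \mathrm{Acc}_{s},
\qquad
\overline{\mathrm{Drop}} = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \mathrm{Drop}_{s}
\]
Whatever the exact definition, the reported relative improvement follows directly from the two average Drop values:
\[
\frac{27.5 - 9.8}{27.5} = \frac{17.7}{27.5} \approx 0.644 = 64.4\%.
\]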