Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, a failure mode that arises when a task requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising) and find that, while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.