Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), causing them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of bidirectional relation analysis choices and transitivity reference selections highlights the further potential of our method for incorporating constraints to mitigate spatial relation hallucinations.
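To make the two constraints concrete, the following is a minimal sketch of what each one requires of a set of predicted relations. It is an illustration under stated assumptions, not the prompting framework itself: the relation vocabulary, the `INVERSE` table, and both function names are hypothetical, and the checks are applied to already-predicted relation strings rather than inside a prompt.

```python
# Hypothetical relation vocabulary (an assumption, not taken from the paper).
# Each relation maps to its inverse; all listed relations are treated as transitive.
INVERSE = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
    "in front of": "behind",
    "behind": "in front of",
}


def bidirectional_consistent(rel_ab: str, rel_ba: str) -> bool:
    """Bidirectional constraint: relation(A, B) and relation(B, A) must be inverses."""
    return INVERSE.get(rel_ab) == rel_ba


def transitive_consistent(rel_ab: str, rel_bc: str, rel_ac: str) -> bool:
    """Transitivity constraint: if A-B and B-C hold the same known relation,
    then A-C must hold it as well. If the premises differ or the relation is
    not in the vocabulary, nothing is implied and the check passes."""
    if rel_ab == rel_bc and rel_ab in INVERSE:
        return rel_ac == rel_ab
    return True
```

For example, a prediction that the cup is "left of" the plate is consistent with the plate being "right of" the cup, and inconsistent with the plate also being "left of" the cup.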