Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting, scenario-based visual grounding, in which the target must be inferred from roles, intentions, and relational context rather than from explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. Its queries are paragraph-length texts that describe object roles, user goals, and contextual cues, and they deliberately mention distractor objects, so resolving the target often requires deep understanding. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k-example out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that serves as a reference point for this setting and combines supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
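
As a rough illustration of how per-instance difficulty tags can support fine-grained analysis, the sketch below groups predictions by tag and reports accuracy per slice. The `Instance` schema, the tag values, the function names, and the IoU >= 0.5 correctness criterion are all assumptions made for this example; the text above does not specify RSC's data format or evaluation metric.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


@dataclass
class Instance:
    """Hypothetical record layout; not the benchmark's actual schema."""
    query: str            # paragraph-length scenario description
    gt_box: Box           # ground-truth target region
    tags: dict[str, str]  # e.g. {"clutter": "high", "size": "small"}


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_tag(instances, preds, thresh=0.5):
    """Accuracy per (tag name, tag value) slice.

    A prediction counts as correct when its IoU with the ground-truth
    box reaches `thresh` (an assumed criterion for this sketch).
    """
    hits = defaultdict(list)
    for inst, pred in zip(instances, preds):
        correct = iou(inst.gt_box, pred) >= thresh
        for name, value in inst.tags.items():
            hits[(name, value)].append(correct)
    return {key: sum(vals) / len(vals) for key, vals in hits.items()}
```

Reporting one number per slice, rather than a single aggregate, is what lets a failure on, say, high-clutter scenes remain visible even when overall accuracy looks acceptable.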