Training-free visual prompting techniques, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques partition the input image into object regions and annotate them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, they treat marked objects as isolated entities and fail to capture the relationships between them. To address this limitation, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across three open-source MLMs and four datasets, conducting extensive ablations on the drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results show that GoM consistently improves the zero-shot capability of MLMs to interpret object positions and relative directions, raising base accuracy in visual question answering and localization by up to 11 percentage points.
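To make the idea of relating marked objects concrete, the following is a minimal sketch, not the authors' implementation. It assumes each mark is a numeric identifier with a label and an axis-aligned box in image coordinates, derives one directional relation per pair of marks from the box centers, and serializes the resulting edges as an auxiliary graph description for the text prompt. The names `Mark`, `relative_direction`, `build_scene_graph`, and `describe_graph`, the center-based heuristic, and the example boxes are all hypothetical; drawing the graph onto the image is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """A marked object region: numeric identifier, label, and box (x0, y0, x1, y1)."""
    mark_id: int
    label: str
    box: tuple[float, float, float, float]

    @property
    def center(self) -> tuple[float, float]:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2, (y0 + y1) / 2)


def relative_direction(a: Mark, b: Mark) -> str:
    """Hypothetical heuristic: direction of `a` relative to `b` from box centers.

    Image coordinates are assumed, so y grows downward.
    """
    ax, ay = a.center
    bx, by = b.center
    dx, dy = ax - bx, ay - by
    # The dominant axis of displacement decides the relation.
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def build_scene_graph(marks: list[Mark]) -> list[tuple[int, str, int]]:
    """Return one (subject_id, relation, object_id) edge per unordered pair of marks."""
    edges = []
    for i, a in enumerate(marks):
        for b in marks[i + 1:]:
            edges.append((a.mark_id, relative_direction(a, b), b.mark_id))
    return edges


def describe_graph(marks: list[Mark], edges: list[tuple[int, str, int]]) -> str:
    """Serialize the edges as plain text, one relation per line."""
    labels = {m.mark_id: m.label for m in marks}
    return "\n".join(
        f"[{s}] {labels[s]} is {rel} [{o}] {labels[o]}" for s, rel, o in edges
    )


# Made-up example regions, for illustration only.
marks = [
    Mark(1, "cup", (10, 40, 30, 60)),
    Mark(2, "laptop", (50, 35, 110, 75)),
    Mark(3, "lamp", (60, 0, 80, 30)),
]
edges = build_scene_graph(marks)
description = describe_graph(marks, edges)
print(description)
```

In a full pipeline, the same edges would presumably also be rendered as lines or arrows between the marks on the image itself; the text description here corresponds only to the optional auxiliary prompt component.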