Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with the Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
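The interaction loop described above can be illustrated with a minimal sketch. It is not the AMIGO implementation: the benchmark works on images and natural-language questions, whereas this toy version stands in symbolic attribute annotations for the images and (attribute, value) pairs for the questions. The names `Oracle`, `filter_candidates`, and `flip_prob`, the dictionary gallery format, and the answer-flipping noise model are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Oracle:
    """Toy oracle holding a hidden target (hypothetical, not the AMIGO code)."""
    gallery: dict            # image_id -> {attribute: value}; assumed annotation format
    target_id: str           # privately selected target image
    flip_prob: float = 0.0   # assumed noise model: chance of flipping a Yes/No answer
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def answer(self, question):
        # Assumption: a valid action is an (attribute, value) pair;
        # anything else is treated as invalid and answered with "Skip".
        if not (isinstance(question, tuple) and len(question) == 2):
            return "Skip"
        attribute, value = question
        target = self.gallery[self.target_id]
        if attribute not in target:
            return "Unsure"
        truthful = "Yes" if target[attribute] == value else "No"
        # Controlled imperfection: occasionally return the opposite answer.
        if self.rng.random() < self.flip_prob:
            return "No" if truthful == "Yes" else "Yes"
        return truthful


def filter_candidates(gallery, candidates, question, reply):
    """Keep only the candidates consistent with a Yes/No reply."""
    if reply not in ("Yes", "No"):
        return set(candidates)  # Unsure and Skip carry no constraint
    attribute, value = question
    want_match = reply == "Yes"
    return {
        c for c in candidates
        if (gallery[c].get(attribute) == value) == want_match
    }


gallery = {
    "dress_a": {"color": "red", "sleeves": "long"},
    "dress_b": {"color": "red", "sleeves": "short"},
    "dress_c": {"color": "blue", "sleeves": "long"},
}
oracle = Oracle(gallery, target_id="dress_b")
candidates = set(gallery)
for question in [("color", "red"), "not a question", ("sleeves", "long")]:
    reply = oracle.answer(question)
    candidates = filter_candidates(gallery, candidates, question, reply)
```

With a noise-free oracle, the two valid questions narrow the gallery to the target, and the invalid action costs a turn without adding evidence. Setting `flip_prob` above zero makes hard filtering unsafe, which is the situation where re-asking or cross-checking earlier answers becomes necessary.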