With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack an explicit mechanism for deciding when to ask clarification questions, relying instead on their implicitly learned representations. CLUE addresses this gap by converting a VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN that detects referential ambiguity, while a LoRA fine-tuned decoder conducts the dialogue and emits grounding location tokens. We train the model on InViG, a real-world interactive IVG dataset, and train the detector on a mixed ambiguity set. With InViG-only supervision and parameter-efficient fine-tuning, our model surpasses a state-of-the-art method, and the ambiguity detector likewise outperforms prior baselines. The data and code are publicly available at: mouadabrini.github.io/clue
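The attention-based ambiguity detection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel layout of the stacked attention maps, the CNN depth, and all layer sizes are assumptions, and the detector here simply emits a binary "ask a clarification question?" logit from text-to-image cross-attention maps.

```python
# Hedged sketch of a lightweight CNN ambiguity detector over
# text-to-image cross-attention maps. All architectural choices
# (channel count, two conv layers, pooled linear head) are assumptions
# for illustration, not the paper's exact design.
import torch
import torch.nn as nn

class AmbiguityDetector(nn.Module):
    """Scores referential ambiguity from stacked cross-attention maps.

    Input: (batch, n_maps, H, W), where each channel is one attention
    map extracted from the VLM (per layer or head group; layout assumed).
    Output: (batch, 1) logit -- high means "ambiguous, ask a question".
    """
    def __init__(self, in_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to a fixed-size descriptor
            nn.Flatten(),
            nn.Linear(32, 1),         # single ambiguity logit
        )

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        return self.net(attn_maps)

# Dummy usage with random "attention maps" (batch of 2, 8 maps, 24x24).
detector = AmbiguityDetector(in_channels=8)
maps = torch.rand(2, 8, 24, 24)
logits = detector(maps)
print(tuple(logits.shape))
```

At inference time, the logit would be thresholded to decide whether the dialogue module asks a clarification question or commits to a grounding prediction.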