Interactive visual grounding in Human-Robot Interaction (HRI) is practically important yet challenging because natural language is inevitably ambiguous. It requires robots to disambiguate user input by actively gathering information. Previous approaches often rely on predefined templates to ask disambiguation questions, which degrades performance in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained jointly on extensive public data and shows superior generalization to diverse and challenging open-world scenarios. In the experiments, we validate TiO on the GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on 150 carefully selected challenging scenes as well as on real-robot platforms. Results show that our method generalizes to diverse visual and language inputs with a high success rate. Code and demos are available at https://github.com/jxu124/TiO.