Interactive visual grounding in Human-Robot Interaction (HRI) is both challenging and practically important because natural language is inevitably ambiguous. It requires robots to disambiguate user input through active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, which degrades performance in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained jointly on extensive public data and shows superior generalization to diverse and challenging open-world scenarios. In experiments, we validate TiO on the GuessWhat?! and InViG benchmarks, setting a new state of the art by a clear margin. Moreover, we conduct HRI experiments on 150 carefully selected challenging scenes as well as on real-robot platforms. Results show that our method generalizes to diverse visual and language inputs with a high success rate. Code and demos are available at https://github.com/jxu124/TiO.