Multimodal learning has improved performance on many vision-language tasks. However, most existing work in embodied dialog research focuses on navigation and leaves the localization task understudied. The few existing dialog-based localization approaches assume that the entire dialog is available prior to localization, which is impractical for deployed dialog-based localization. In this paper, we propose DiaLoc, a new dialog-based localization framework that aligns with the behavior of a real human operator. Specifically, we iteratively refine location predictions, which lets us visualize the current pose beliefs after each dialog turn. DiaLoc effectively utilizes multimodal data for multi-shot localization, where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task in both single-shot (+7.08% in Acc5@valUnseen) and multi-shot (+10.85% in Acc5@valUnseen) settings. DiaLoc narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.