Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: http://vision.cs.utexas.edu/projects/chat2map.