Active Sparse Conversations for Improved Audio-Visual Embodied Navigation

Efficient navigation towards an audio goal requires an embodied agent not only to use audio-visual cues effectively, but also to actively (yet sparingly) seek human/oracle assistance without sacrificing autonomy, e.g., when it is uncertain where to navigate while locating a noisy or sporadic audio goal. To this end, we present CAVEN, a conversational audio-visual embodied navigation agent that can pose navigation questions to a human/oracle and process the oracle's responses, both in free-form natural language. At the core of CAVEN is a multimodal hierarchical reinforcement learning (RL) setup in which a high-level policy is trained to choose, at every step, one of three low-level policies: (i) navigate using audio-visual cues, (ii) frame a question to the oracle and receive a short or detailed response, or (iii) ask a generic question (when unsure of what to ask) and receive instructions. Key to generating the agent's questions are our novel TrajectoryNet, which forecasts the most likely next steps to the goal, and a QuestionNet, which uses these steps to produce a question. All policies are learned end-to-end via the RL setup, with penalties that enforce sparsity in receiving navigation instructions from the oracle. To evaluate CAVEN, we present extensive experiments on the SoundSpaces framework for the task of semantic audio-visual navigation. Our results show that CAVEN achieves up to a 12% gain in performance over competing methods, especially in localizing new sound sources, even in the presence of auditory distractions.
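As a rough illustration of the hierarchical setup described above, the following Python sketch shows a high-level choice among the three low-level options and a reward that is reduced whenever the oracle is queried. It is a minimal sketch under stated assumptions, not the CAVEN implementation: the names `Option`, `select_option`, `shaped_reward`, and the value of `QUERY_PENALTY` are hypothetical, and the epsilon-greedy rule over fixed scores merely stands in for the learned high-level policy, whose actual form the abstract does not specify.

```python
import random
from enum import Enum


class Option(Enum):
    """The three low-level policies named in the abstract."""
    NAVIGATE = 0      # (i) act on audio-visual cues
    ASK_SPECIFIC = 1  # (ii) frame a question, receive a short or detailed response
    ASK_GENERIC = 2   # (iii) ask a generic question, receive instructions


# Hypothetical penalty per oracle query; the abstract gives no value.
QUERY_PENALTY = 0.2


def select_option(scores, epsilon, rng):
    """Epsilon-greedy stand-in for the learned high-level policy.

    `scores` maps each Option to a scalar preference (assumed to come
    from some learned model that is not shown here).
    """
    if rng.random() < epsilon:
        return rng.choice(list(Option))
    return max(Option, key=lambda option: scores[option])


def shaped_reward(task_reward, option, penalty=QUERY_PENALTY):
    """Subtract a fixed penalty whenever the chosen option queries the oracle."""
    if option is Option.NAVIGATE:
        return task_reward
    return task_reward - penalty


if __name__ == "__main__":
    rng = random.Random(0)
    scores = {Option.NAVIGATE: 0.5, Option.ASK_SPECIFIC: 0.9, Option.ASK_GENERIC: 0.1}
    choice = select_option(scores, epsilon=0.0, rng=rng)
    print(choice.name, shaped_reward(1.0, choice))
```

Charging every query the same fixed cost is only one way to encourage sparse interaction; a per-episode query budget would be another, and the abstract does not say which form the penalties take.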
