Current mobile manipulation research predominantly follows an instruction-driven paradigm, in which agents rely on predefined textual commands to execute tasks. However, this setting confines agents to a passive role, restricting their autonomy and their ability to react to dynamic environmental events. To address these limitations, we introduce sound-triggered mobile manipulation, where agents must actively perceive and interact with sound-emitting objects without explicit action instructions. To support these tasks, we develop Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction. We further propose a baseline comprising a high-level task planner and low-level policy models to complete these tasks. Extensive experiments show that the proposed baseline enables agents to actively detect and respond to auditory events, eliminating the need for case-by-case instructions. Notably, in the challenging dual-source scenario, the agent successfully isolates the primary source from overlapping acoustic interference to execute the first interaction, and subsequently proceeds to manipulate the secondary object, verifying the robustness of the baseline.