Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is irrelevant to the specific task at hand. This introduces noise into the learning process and distracts the agent from task-relevant visual cues. Inspired by selective attention in humans (the process through which people filter their perception based on their experiences, knowledge, and the task at hand), we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly with the agent to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments show state-of-the-art performance for object goal navigation and object displacement across five benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook also generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and that their representations retain task-relevant information, such as target object recognition, while ignoring superfluous information about other objects. Code and pretrained models are available at our project website: https://embodied-codebook.github.io.
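The abstract does not spell out the internals of the codebook module, so the following is only a minimal sketch of one way a task-conditioned codebook bottleneck could work. It assumes that the input is a visual embedding concatenated with a goal embedding, that the module scores a small set of learnable codes with a softmax, and that the filtered representation is a convex combination of those codes projected back to the input size. The class name `CodebookBottleneck`, the method `forward`, all dimensions, and the random initialization are hypothetical; in practice the weights would be learned jointly with the policy from task reward rather than left random.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class CodebookBottleneck:
    """Hypothetical task-conditioned bottleneck; names and sizes are illustrative."""

    def __init__(self, in_dim, num_codes, code_dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.score_w = rand_matrix(num_codes, in_dim)  # one score per code
        self.codes = rand_matrix(num_codes, code_dim)  # the codebook entries
        self.up_w = rand_matrix(in_dim, code_dim)      # projects back to in_dim

    def forward(self, task_conditioned_embedding):
        # 1) Score every code against the task-conditioned input.
        probs = softmax(matvec(self.score_w, task_conditioned_embedding))
        # 2) Bottleneck: convex combination of the low-dimensional codes.
        code_dim = len(self.codes[0])
        mixed = [
            sum(p * code[j] for p, code in zip(probs, self.codes))
            for j in range(code_dim)
        ]
        # 3) Project the compact summary back to the original width.
        filtered = matvec(self.up_w, mixed)
        return probs, mixed, filtered


# Toy inputs standing in for a visual embedding and a goal embedding.
visual = [0.5, -1.0, 0.3, 0.8]
goal = [1.0, 0.0]
bottleneck = CodebookBottleneck(in_dim=6, num_codes=8, code_dim=3)
probs, mixed, filtered = bottleneck.forward(visual + goal)
```

Because the output must pass through a weighted mix of a few small codes, the downstream policy only sees what those codes can express. Under the assumptions above, that is the sense in which such a module acts as a filter on the visual observation.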