We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long-sequence representations for embodied navigation. Our approach unifies world modeling, localization, and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables the spatio-temporal sequence encoder to be shared across multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines, while also performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
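To make the training signal concrete, the following is a minimal sketch of predicting a vector-quantized future state. It is an illustration under stated assumptions, not the ENTL implementation: the abstract does not specify the tokenizer, the codebook, or the loss, so the nearest-neighbor codebook lookup, the cross-entropy objective, and every name and value here (`quantize`, `cross_entropy`, `codebook`, `next_frame`, `logits`) are hypothetical.

```python
import math

def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Hypothetical toy codebook with three 2-D code vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical embedding of the next frame; its nearest code is the
# discrete target the sequence model is asked to predict.
next_frame = [0.9, 0.1]
target = quantize(next_frame, codebook)

# Hypothetical logits over the codebook, standing in for the output of a
# sequence model conditioned on the current state and action.
logits = [0.0, 2.0, 0.0]
loss = cross_entropy(logits, target)
```

Under these assumptions, future-state prediction becomes classification over codebook indices, so no reward signal enters the objective; this is consistent with the reward-free pre-training described in the abstract.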