We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long-sequence representations for embodied navigation. Our approach unifies world modeling, localization, and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture allows the spatio-temporal sequence encoder to be shared across multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines, while also performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
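As a rough illustration of what "vector-quantized predictions of future states conditioned on current states and actions" can look like as training data, the sketch below maps each state feature to the index of its nearest codebook vector and then pairs every (current token, action) with the token of the next state. The abstract does not specify the quantizer, the codebook, or the feature encoder, so the nearest-neighbor lookup and the names `quantize` and `make_training_pairs` are assumptions made purely for illustration, not ENTL's actual implementation.

```python
from typing import Hashable, List, Sequence, Tuple

Vector = Sequence[float]


def quantize(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to `feature`.

    Assumption: plain squared-Euclidean nearest-neighbor lookup.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def make_training_pairs(
    states: Sequence[Vector],
    actions: Sequence[Hashable],
    codebook: Sequence[Vector],
) -> List[Tuple[Tuple[int, Hashable], int]]:
    """Build ((current_token, action), next_token) pairs from one trajectory.

    `actions[t]` is the action taken between `states[t]` and `states[t + 1]`,
    so a trajectory with N states needs exactly N - 1 actions.
    """
    if len(actions) != len(states) - 1:
        raise ValueError("expected exactly len(states) - 1 actions")
    tokens = [quantize(s, codebook) for s in states]
    return [((tokens[t], actions[t]), tokens[t + 1]) for t in range(len(actions))]


# Toy trajectory: three 2-D state features and the two actions between them.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
states = [(0.1, 0.0), (0.9, 0.1), (0.2, 0.8)]
actions = ["forward", "left"]
pairs = make_training_pairs(states, actions, codebook)
```

A sequence model trained on such pairs would treat the next token as a classification target, which needs no reward signal; only observed states and actions enter the data.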