We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long-sequence representations for embodied navigation. Our approach unifies world modeling, localization, and imitation learning into a single sequence prediction task. We train the model to make vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture allows the spatio-temporal sequence encoder to be shared across multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines, while also performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
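As a rough illustration of the training signal described above, the following minimal sketch shows how a vector-quantized future-state target could be formed and scored. It assumes a nearest-neighbor codebook lookup, one token per state, and a cross-entropy loss over code indices. None of these details are specified in the abstract, and the names `quantize`, `make_training_pairs`, and `cross_entropy`, the toy codebook, and the action labels are all hypothetical.

```python
import math

def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def make_training_pairs(states, actions, codebook):
    """Pair (current state token, action) with the next state's token as the target."""
    tokens = [quantize(s, codebook) for s in states]
    return [((tokens[t], actions[t]), tokens[t + 1]) for t in range(len(actions))]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# Hypothetical toy data: a 4-entry codebook of 2-D codes and a 3-step trajectory.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = [[0.1, -0.1], [0.9, 0.2], [0.8, 1.1]]
actions = ["forward", "turn_left"]

pairs = make_training_pairs(states, actions, codebook)
for (state_token, action), next_token in pairs:
    # A real model would produce these logits; uniform logits stand in here.
    loss = cross_entropy([0.0] * len(codebook), next_token)
    print(state_token, action, next_token, round(loss, 4))
```

Because the target is a discrete code index rather than a raw frame, future-state prediction reduces to classification, which is what lets it be treated as one sequence prediction task alongside action prediction in this sketch.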