OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav

We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-the-art results on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find a chair") tasks without any task-specific modules such as object detection, segmentation, mapping, or planning. Such general-purpose methods offer simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks. Our work builds upon the recent success of self-supervised learning (SSL) for pre-training vision transformers (ViTs). However, while the training recipes for convolutional networks are mature and robust, the recipes for ViTs are contingent and brittle and, in the case of ViTs for visual navigation, yet to be fully discovered. Specifically, we find that vanilla ViTs do not outperform ResNets on visual navigation. We propose a compression layer that operates over ViT patch representations to preserve spatial information, along with policy training improvements. These improvements allow us to demonstrate positive scaling laws for the first time in visual navigation tasks. Consequently, our model advances state-of-the-art performance on ImageNav from 54.2% to 82.0% success and performs competitively against the concurrent state of the art on ObjectNav, with a success rate of 64.0% vs. 65.0%. Overall, this work does not present a fundamentally new approach, but rather recommendations for training a general-purpose architecture that achieves state-of-the-art performance today and could serve as a strong baseline for future methods.
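
To make the idea of a compression layer over ViT patch representations more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the layer is a small 3x3 convolution applied to the patch tokens after they are arranged back into their image grid, which reduces the channel count while keeping one output per patch location. The function names (`tokens_to_grid`, `conv3x3`, `compress`, `mean_pool`), the kernel layout, and the toy sizes are all hypothetical, and plain Python lists are used only to keep the example dependency-free; a real model would use a deep learning framework.

```python
def tokens_to_grid(tokens, grid_h, grid_w):
    """Arrange a flat, row-major list of patch tokens into a grid_h x grid_w grid."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def conv3x3(grid, weights):
    """3x3 convolution with zero padding and stride 1 (no bias, no activation).

    grid:    [H][W][C_in] patch features
    weights: [C_out][3][3][C_in] kernel (hypothetical layout)
    returns: [H][W][C_out] features, same spatial size as the input
    """
    h, w = len(grid), len(grid[0])
    c_in = len(grid[0][0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            cell = []
            for kernel in weights:
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:  # zero padding
                            k = kernel[dr + 1][dc + 1]
                            feat = grid[rr][cc]
                            acc += sum(k[i] * feat[i] for i in range(c_in))
                cell.append(acc)
            row.append(cell)
        out.append(row)
    return out


def compress(tokens, grid_h, grid_w, weights):
    """Reduce channels per patch location, then flatten in row-major order."""
    compressed = conv3x3(tokens_to_grid(tokens, grid_h, grid_w), weights)
    return [v for row in compressed for cell in row for v in cell]


def mean_pool(tokens):
    """Baseline for contrast: average all tokens, discarding their positions."""
    n, d = len(tokens), len(tokens[0])
    return [sum(t[i] for t in tokens) / n for i in range(d)]


# Toy input: a 3x3 grid of 2-channel tokens with a single active patch.
blank = [0.0, 0.0]
top_left = [[1.0, 1.0]] + [blank[:] for _ in range(8)]
bottom_right = [blank[:] for _ in range(8)] + [[1.0, 1.0]]
ones_kernel = [[[[1.0, 1.0] for _ in range(3)] for _ in range(3)]]  # C_out = 1

print(mean_pool(top_left) == mean_pool(bottom_right))
print(compress(top_left, 3, 3, ones_kernel))
print(compress(bottom_right, 3, 3, ones_kernel))
```

In this toy case, averaging the tokens gives the same vector wherever the active patch sits, whereas the flattened compressed output differs between the two inputs. This is the sense in which such a layer can preserve spatial information for the downstream policy.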
