One powerful paradigm in visual navigation is to predict actions directly from observations. Training such an end-to-end system allows representations useful for downstream tasks to emerge automatically. However, the lack of inductive bias makes such a system data-inefficient. We hypothesize that a representation of the current view and the goal view that is sufficient for a navigation policy can be learned by predicting the location and size of a crop of the current view that corresponds to the goal. We further show that training this random crop prediction in a self-supervised fashion purely on synthetic noise images transfers well to natural home images. The learned representation can then be bootstrapped to learn a navigation policy efficiently with little interaction data. The code is available at https://yanweiw.github.io/noise2ptz
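To make the self-supervised crop-prediction task concrete, the following is a minimal sketch of how one training example could be generated from a synthetic noise image. It is an illustration under stated assumptions, not the released implementation: the function name `sample_crop_pair`, the use of uniform noise, square crops, the scale range, nearest-neighbour resizing, and the `(cx, cy, size)` target parameterization are all hypothetical choices made here for clarity.

```python
import numpy as np


def sample_crop_pair(rng, size=64, min_scale=0.2, max_scale=0.8):
    """Return (current_view, goal_view, target) for one self-supervised example.

    All parameters and the target layout are illustrative assumptions.
    """
    # Current view: an image of uniform random noise (assumed noise type).
    current = rng.random((size, size, 3), dtype=np.float32)

    # Sample a square crop: side length first, then a top-left corner
    # chosen so that the crop stays fully inside the image.
    side = int(rng.integers(int(min_scale * size), int(max_scale * size) + 1))
    top = int(rng.integers(0, size - side + 1))
    left = int(rng.integers(0, size - side + 1))
    crop = current[top:top + side, left:left + side]

    # Goal view: the crop resized back to full resolution
    # using nearest-neighbour index lookup.
    idx = np.arange(size) * side // size
    goal = crop[idx][:, idx]

    # Regression target: crop centre (x, y) and side length, normalised to [0, 1].
    target = np.array(
        [(left + side / 2) / size, (top + side / 2) / size, side / size],
        dtype=np.float32,
    )
    return current, goal, target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    current, goal, target = sample_crop_pair(rng)
    print(current.shape, goal.shape, target)
```

A network trained on such pairs would take `current` and `goal` as input and regress `target`; because the labels come from the sampling procedure itself, no human annotation or environment interaction is needed at this stage.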