The main challenge in vision-and-language navigation (VLN) is understanding natural-language instructions in an unseen environment. A key limitation of conventional VLN algorithms is that a single mistaken action can cause the agent to stop following the instructions or to explore unnecessary regions, leading it onto an irrecoverable path. To tackle this problem, we propose Meta-Explore, a hierarchical navigation method that deploys an exploitation policy to correct recent misled actions. We show that an exploitation policy that moves the agent toward a well-chosen local goal among unvisited but observable states outperforms a method that moves the agent back to a previously visited state. We also highlight the need to imagine regretful explorations using semantically meaningful clues. The key to our approach is understanding the object placements around the agent in the spectral domain. Specifically, we present a novel visual representation, called scene object spectrum (SOS), which applies a category-wise 2D Fourier transform to detected objects. By combining the exploitation policy with SOS features, the agent can correct its path by choosing a promising local goal. We evaluate our method on three VLN benchmarks: R2R, SOON, and REVERIE. Meta-Explore outperforms other baselines and shows significant generalization performance. In addition, local goal search using the proposed spectral-domain SOS features significantly improves the success rate by 17.1% and SPL by 20.6% on the SOON benchmark.
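To make the idea of a category-wise 2D Fourier transform of detected objects concrete, the following is a minimal sketch and not the authors' implementation. It assumes that detections arrive as bounding boxes with integer category ids, that each category is rasterized into a binary mask, and that the feature is the log-magnitude of the first few frequency bins in each direction. The function name `scene_object_spectrum`, the `keep` parameter, and the box format are all hypothetical choices made for illustration.

```python
import numpy as np


def scene_object_spectrum(boxes, num_categories, image_hw, keep=4):
    """Illustrative SOS-style feature (assumed design, not the paper's code).

    boxes: iterable of (category_id, x0, y0, x1, y1) in pixel coordinates,
           with x1 and y1 exclusive.
    Returns an array of shape (num_categories, keep, keep).
    """
    height, width = image_hw
    # One binary occupancy mask per object category (assumption).
    masks = np.zeros((num_categories, height, width), dtype=float)
    for cat, x0, y0, x1, y1 in boxes:
        masks[cat, y0:y1, x0:x1] = 1.0

    # 2D FFT applied independently to each category mask.
    spectrum = np.fft.fft2(masks, axes=(-2, -1))

    # Log-magnitude compresses the dynamic range; the top-left corner holds
    # the DC term and the first few non-negative frequency bins (assumption).
    magnitude = np.log1p(np.abs(spectrum))
    return magnitude[:, :keep, :keep]


# Hypothetical detections on an 8x8 view: two categories present, one absent.
detections = [(0, 2, 2, 6, 6), (1, 0, 0, 4, 8)]
sos = scene_object_spectrum(detections, num_categories=3, image_hw=(8, 8))
```

In this sketch the zero-frequency entry of each category encodes the total area covered by that category, and a category with no detections yields an all-zero slice. Because only the magnitude is kept, the feature does not change when a mask is circularly shifted, which is one plausible reason a spectral representation could summarize object layout compactly. How the actual method builds and uses SOS features is not specified in the abstract.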