PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires an agent to follow language instructions to navigate through 3D environments. One main challenge in VLN is the limited availability of photorealistic training environments, which makes it hard for agents to generalize to new, unseen environments. To address this problem, we propose PanoGen, a generation method that can potentially create an infinite number of diverse panoramic environments conditioned on text. Specifically, we collect room descriptions by captioning the room images in existing Matterport3D environments, and leverage a state-of-the-art text-to-image diffusion model to generate the new panoramic environments. We use recursive outpainting over the generated images to create consistent 360-degree panorama views. Because they are conditioned on text descriptions, our new panoramic environments share similar semantic information with the original environments, which ensures that the co-occurrence of objects in each panorama follows human intuition, while image outpainting creates enough diversity in room appearance and layout. We then explore two ways of utilizing PanoGen in VLN pre-training and fine-tuning. First, we generate instructions for paths in our PanoGen environments with a speaker built on a pre-trained vision-and-language model, and use them for VLN pre-training. Second, we augment the visual observations with our panoramic environments during agent fine-tuning to avoid overfitting to seen environments. Empirically, learning with our PanoGen environments achieves new state-of-the-art results on the Room-to-Room, Room-for-Room, and CVDN datasets. Pre-training with our PanoGen speaker data is especially effective for CVDN, which has under-specified instructions and requires commonsense knowledge. Lastly, we show that the agent can benefit from training with more generated panoramic environments, suggesting promising results for scaling up the PanoGen environments.
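
The recursive outpainting step can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the window size, the shift per step, or how the caption is used at each step, so the names `recursive_outpaint`, `fill_fn`, and `toy_fill` and all parameters below are hypothetical. Images are represented as plain lists of pixel columns, and the diffusion model is replaced by a placeholder callable. The sketch only shows the control flow: keep the most recent columns as context, ask the model for a few new columns, append them, and repeat until the panorama reaches the target width.

```python
from typing import Callable, List, Sequence

Column = List[int]  # one image column as a list of pixel values (toy stand-in)
# Hypothetical model interface: (context columns, number of new columns, caption)
FillFn = Callable[[Sequence[Column], int, str], List[Column]]


def recursive_outpaint(
    seed: Sequence[Column],
    total_width: int,
    shift: int,
    caption: str,
    fill_fn: FillFn,
) -> List[Column]:
    """Extend `seed` to the right until it is `total_width` columns wide.

    Each step passes the most recent columns as context and asks
    `fill_fn` for at most `shift` new columns. The model window is
    assumed to be as wide as the seed image.
    """
    window = len(seed)
    if not 0 < shift < window:
        raise ValueError("shift must be between 1 and len(seed) - 1")

    panorama = [list(col) for col in seed]
    while len(panorama) < total_width:
        n_new = min(shift, total_width - len(panorama))
        # Context plus new columns together fill one model window.
        context = panorama[-(window - n_new):]
        new_cols = fill_fn(context, n_new, caption)
        if len(new_cols) != n_new:
            raise ValueError("fill_fn returned the wrong number of columns")
        panorama.extend(new_cols)
    return panorama


def toy_fill(context: Sequence[Column], n_new: int, caption: str) -> List[Column]:
    """Placeholder for a text-conditioned diffusion model (ignores the caption).

    It continues the last context column by incrementing its values,
    which makes the result easy to check by hand.
    """
    last = context[-1]
    return [[p + k + 1 for p in last] for k in range(n_new)]


if __name__ == "__main__":
    seed_image = [[0, 0], [1, 1], [2, 2], [3, 3]]
    pano = recursive_outpaint(seed_image, 10, 2, "a bedroom with a large bed", toy_fill)
    print(len(pano), pano[-1])
```

In a real pipeline, `fill_fn` would wrap a text-to-image inpainting model that receives the context region together with a mask over the new region. A full 360-degree panorama would also need the final view to be consistent with the first one, which this sketch does not address.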
