Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation

A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been increasing interest in synthesizing 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, to provide immersive experiences in scenarios such as virtual reality. Yet existing methods typically fall short in synthesizing intricate visual details or in ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by outpainting an incomplete 360-degree image progressively with NFoV and text guidance, jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process, but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidance, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention-based transformers into a global stream and a local stream, which are fed into a conditioned generative backbone model. As AOG-Net can leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidance. Comprehensive experiments on two commonly used 360-degree image datasets, covering both indoor and outdoor settings, demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.
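The progressive outpainting idea can be illustrated with a toy sketch. This is not AOG-Net: the function names, the window and stride parameters, the choice of starting column, and the mean-colour `placeholder_fill` are all assumptions made for illustration, and text guidance is left out entirely. The sketch only shows the control flow of filling an incomplete equirectangular image one overlapping window at a time, with columns wrapping around because longitudes 0 and 360 degrees are adjacent.

```python
import numpy as np

def placeholder_fill(window, known):
    """Hypothetical stand-in for a generative model.

    Fills unknown pixels with the mean colour of the known pixels in the window.
    """
    out = window.copy()
    out[~known] = window[known].mean(axis=0)
    return out

def autoregressive_outpaint(canvas, known, window_w=8, stride=4,
                            fill_fn=placeholder_fill):
    """Fill an incomplete equirectangular image one window per step.

    canvas: (H, W, C) float array; known: (H, W) bool mask of valid pixels.
    Each step conditions on whatever earlier steps have already produced.
    """
    h, w, _ = canvas.shape
    if not 0 < stride < window_w <= w:
        raise ValueError("need 0 < stride < window_w <= image width")
    if not known.any():
        raise ValueError("this sketch needs at least one known pixel")
    canvas, known = canvas.copy(), known.copy()
    # Assumption: start at the column holding the most known pixels.
    start = int(np.argmax(known.sum(axis=0)))
    for offset in range(0, w, stride):
        # Modulo indexing wraps the window across the left/right seam.
        cols = (start + offset + np.arange(window_w)) % w
        win_known = known[:, cols]
        if win_known.all():
            continue  # nothing left to generate in this window
        canvas[:, cols] = fill_fn(canvas[:, cols], win_known)
        known[:, cols] = True
    return canvas, known

# Usage: a small grey NFoV-like patch placed on an otherwise empty canvas.
canvas = np.zeros((8, 32, 3))
known = np.zeros((8, 32), dtype=bool)
canvas[2:6, 10:16] = 0.5
known[2:6, 10:16] = True
filled, filled_mask = autoregressive_outpaint(canvas, known)
```

Because the stride is smaller than the window width, every window after the first overlaps a region that has already been filled, so each step has context to condition on. A real generator would replace `placeholder_fill` and would additionally take text and geometry conditions.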
