We present a cross-domain inference technique that learns from synthetic data to estimate depth and normals for omnidirectional 3D scenes encountered in uncontrolled, real-world settings. To this end, we introduce UBotNet, an architecture that combines UNet and Bottleneck Transformer elements to predict consistent scene normals and depth. We also introduce the OmniHorizon synthetic dataset containing 24,335 omnidirectional images that represent a wide variety of outdoor environments, including buildings, streets, and diverse vegetation. This dataset is generated from expansive, lifelike virtual spaces and encompasses dynamic scene elements such as changing lighting conditions, different times of day, pedestrians, and vehicles. Our experiments show that UBotNet achieves significantly higher accuracy in depth and normal estimation than existing models. Lastly, we validate cross-domain synthetic-to-real depth and normal estimation on real outdoor images using UBotNet trained solely on our synthetic OmniHorizon dataset, demonstrating the potential of both the dataset and the proposed network for real-world scene understanding applications.