High-resolution wide-angle fisheye images are increasingly important for robotics applications such as autonomous driving. However, applying ordinary convolutional neural networks or vision transformers to this data is problematic because mapping the images onto a rectangular grid in the plane introduces projection and distortion losses. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model that can be trained on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, so that the spherical data can be handled in a one-dimensional representation with minimal computational overhead. We demonstrate the superior performance of our model on semantic segmentation and depth regression tasks, using both synthetic and real automotive datasets. Our code is available at https://github.com/JanEGerken/HEAL-SWIN.
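To illustrate why the nested HEALPix structure makes patching and windowing cheap, the following minimal sketch groups a one-dimensional, nested-ordered pixel list into consecutive blocks. It relies on a standard property of the HEALPix nested scheme: the four pixels with indices 4k to 4k+3 at resolution `nside` are the children of pixel k at resolution `nside / 2`. The function names `healpix_npix` and `group_nested` are hypothetical and chosen for illustration; this is not the authors' implementation, and the window shifting used by SWIN is not covered here.

```python
def healpix_npix(nside):
    """Number of pixels on the full sphere for HEALPix resolution nside."""
    return 12 * nside * nside


def is_power_of_four(n):
    """True if n is 1, 4, 16, 64, ..."""
    return n > 0 and n & (n - 1) == 0 and (n.bit_length() - 1) % 2 == 0


def group_nested(values, group_size):
    """Split a nested-ordered sequence into consecutive blocks.

    In nested ordering, every run of 4**m consecutive pixels forms one
    pixel of a coarser grid, so grouping reduces to slicing the 1D list.
    """
    if not is_power_of_four(group_size):
        raise ValueError("group_size must be a power of 4")
    if len(values) % group_size != 0:
        raise ValueError("sequence length must be divisible by group_size")
    return [values[i:i + group_size] for i in range(0, len(values), group_size)]


# Illustrative sizes (assumptions, not taken from the paper).
nside = 4
pixels = list(range(healpix_npix(nside)))  # 192 pixel indices, nested order

patches = group_nested(pixels, 4)   # each patch = 4 sibling pixels
windows = group_nested(patches, 4)  # each window = 4 neighbouring patches
```

With these sizes, the 192 pixels form 48 patches, matching the pixel count at `nside = 2`, and the patches form 12 windows, one per HEALPix base pixel. No neighbour lookup or two-dimensional indexing is needed, which is the sense in which the one-dimensional representation keeps overhead low.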