Existing image foundation models, having been trained primarily on perspective images, are not optimized for spherical images. PanoSAMic incorporates the pre-trained Segment Anything (SAM) encoder, leveraging its extensive training, into a multi-modal semantic segmentation model for panoramic images. We modify the SAM encoder to output multi-stage features and introduce a novel spatio-modal fusion module that allows the model to select the most relevant modalities, and the best features from each modality, for different regions of the input. Furthermore, our semantic decoder uses spherical attention and dual-view fusion to overcome the distortion and edge discontinuities commonly associated with panoramic images. PanoSAMic achieves state-of-the-art (SotA) results on Stanford2D3DS for RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D modalities. Code is available at https://github.com/dfki-av/PanoSAMic
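To make the spatio-modal fusion idea concrete, the following is a minimal, hypothetical sketch of per-location gating over modalities and encoder stages; the class name, shapes, and gating design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: per-pixel gating over M modalities and S encoder stages.
# Shapes and module design are assumptions for illustration only.
import torch
import torch.nn as nn

class SpatioModalFusion(nn.Module):
    def __init__(self, num_modalities: int, num_stages: int, channels: int):
        super().__init__()
        # Predict one gate per (modality, stage) pair at every spatial location.
        self.gate = nn.Conv2d(
            num_modalities * num_stages * channels,
            num_modalities * num_stages,
            kernel_size=1,
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, M, S, C, H, W) - features from M modalities and S stages,
        # assumed already resized to a common spatial resolution.
        b, m, s, c, h, w = feats.shape
        stacked = feats.reshape(b, m * s * c, h, w)
        # Softmax over the M*S candidates so the model selects the most
        # relevant modality/stage at each location.
        weights = self.gate(stacked).softmax(dim=1)      # (B, M*S, H, W)
        weights = weights.reshape(b, m * s, 1, h, w)
        fused = (feats.reshape(b, m * s, c, h, w) * weights).sum(dim=1)
        return fused                                     # (B, C, H, W)

if __name__ == "__main__":
    fusion = SpatioModalFusion(num_modalities=3, num_stages=4, channels=64)
    x = torch.randn(2, 3, 4, 64, 32, 64)   # e.g. RGB-D-N inputs, four stages
    print(fusion(x).shape)                  # torch.Size([2, 64, 32, 64])
```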