Exploring the semantic context in scene images is essential for indoor scene recognition. However, because spatial layouts vary widely within a class and the same objects coexist across classes, modeling contextual relationships that adapt to diverse image characteristics remains a great challenge. Existing contextual modeling methods for indoor scene recognition exhibit two limitations: 1) during training, space-independent information, such as color, may hinder the optimization of the network's capacity to represent spatial context; 2) these methods often overlook the differences in coexisting objects across different scenes, which limits scene recognition performance. To address these limitations, we propose SpaCoNet, which simultaneously models the Spatial relation and Co-occurrence of objects based on semantic segmentation. First, the semantic spatial relation module (SSRM) is designed to explore the spatial relations among objects within a scene. With the help of semantic segmentation, this module decouples the spatial information from the image, effectively avoiding the influence of irrelevant features. Second, both spatial context features from the SSRM and deep features from the Image Feature Extraction Module are used to distinguish the coexisting objects across different scenes. Finally, utilizing these discriminative features, we employ the self-attention mechanism to explore the long-range co-occurrence among objects and generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method. The code will be made publicly available after the blind review process is completed.
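To make the object-level co-occurrence step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that object features are obtained by averaging per-pixel features inside each segmentation region, and that self-attention uses identity query, key, and value projections in place of learned ones. The function names `object_features` and `self_attention`, the toy segmentation map, and the toy feature values are all hypothetical and chosen only for illustration.

```python
import math


def object_features(seg, feats, num_classes):
    """Average the feature vectors of all pixels assigned to each class.

    Assumption: masked average pooling stands in for whatever pooling
    the actual method uses to obtain object-level features.
    """
    dim = len(feats[0][0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for r, row in enumerate(seg):
        for c, cls in enumerate(row):
            counts[cls] += 1
            for d in range(dim):
                sums[cls][d] += feats[r][c][d]
    # Keep only the classes that actually occur in the image.
    return {
        cls: [s / counts[cls] for s in sums[cls]]
        for cls in range(num_classes)
        if counts[cls] > 0
    }


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Assumption: a real model would apply learned linear projections
    and typically several attention heads.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Toy 2x3 segmentation map with three hypothetical object classes.
seg = [[0, 0, 1],
       [2, 2, 1]]
# Toy 2-dimensional feature vector per pixel (hypothetical values).
feats = [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [1.0, 1.0], [0.0, 1.0]]]

objs = object_features(seg, feats, num_classes=3)
tokens = [objs[c] for c in sorted(objs)]
attended, weights = self_attention(tokens)
```

In this sketch each row of `weights` indicates how strongly one object attends to every other object present in the image, which is one simple way to read "long-range co-occurrence" regardless of where the objects are located spatially.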