Exploring the semantic context in scene images is essential for indoor scene recognition. However, owing to diverse intra-class spatial layouts and coexisting inter-class objects, modeling contextual relationships that adapt to varied image characteristics is a great challenge. Existing contextual modeling methods for indoor scene recognition exhibit two limitations: 1) During training, space-independent information, such as color, may hinder the optimization of the network's capacity to represent spatial context. 2) These methods often overlook the differences among coexisting objects across different scenes, which limits scene recognition performance. To address these limitations, we propose SpaCoNet, a novel approach that simultaneously models the Spatial relations and Co-occurrence of objects based on semantic segmentation. First, a semantic spatial relation module (SSRM) is designed to explore the spatial relations among objects within a scene. With the help of semantic segmentation, this module decouples the spatial information from the image, avoiding the influence of irrelevant features. Second, both the spatial context features from the SSRM and the deep features from an RGB feature extractor are used to distinguish the coexisting objects across different scenes. Finally, using these discriminative features, we employ a self-attention mechanism to explore the long-range co-occurrence relationships among objects and to generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three publicly available datasets demonstrate the effectiveness and generality of the proposed method. The code will be made publicly available after the blind-review process is completed.
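To make two ideas from the abstract concrete, the following is a minimal sketch, not the SpaCoNet implementation. It assumes a toy segmentation label map, turns it into per-class binary masks (so that color plays no role), and applies scaled dot-product self-attention across per-object descriptors. The helper names (`one_hot_masks`, `object_feature`, `self_attention`), the hand-crafted area-and-centroid descriptor, and the identity query/key/value projections are all illustrative assumptions; the actual method presumably learns its features and projections.

```python
import math


def one_hot_masks(label_map, num_classes):
    """Split a segmentation label map into one binary mask per class.

    The masks keep only where each object class is, not its color.
    """
    return [
        [[1 if label == c else 0 for label in row] for row in label_map]
        for c in range(num_classes)
    ]


def object_feature(mask):
    """Hypothetical hand-crafted descriptor: [area, centroid row, centroid col].

    All three values are normalized to [0, 1]. A learned spatial module
    would replace this in practice.
    """
    height, width = len(mask), len(mask[0])
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not cells:  # class absent from the image
        return [0.0, 0.0, 0.0]
    n = len(cells)
    return [
        n / (height * width),
        sum(r for r, _ in cells) / n / height,
        sum(c for _, c in cells) / n / width,
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    Assumption: no learned query/key/value weights, so each output row is
    a convex combination of the input rows.
    """
    dim = len(features[0])
    attended = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, features)) for i in range(dim)]
        )
    return attended


# Toy 3x3 label map with three object classes (0, 1, 2).
label_map = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]
masks = one_hot_masks(label_map, num_classes=3)
features = [object_feature(m) for m in masks]
attended = self_attention(features)
```

Each row of `attended` mixes information from every object in the image, which is the sense in which self-attention can relate objects regardless of their distance from one another.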