Exploring the semantic context in scene images is essential for indoor scene recognition. However, because spatial layouts vary widely within a class and the same objects coexist across classes, modeling contextual relationships that adapt to varied image characteristics is a great challenge. Existing contextual modeling methods for scene recognition exhibit two limitations: 1) they typically model only one kind of spatial relationship among objects within a scene, in a manually predefined manner, with limited exploration of diverse spatial layouts; 2) they often overlook the differences in coexisting objects across different scenes, which limits recognition performance. To overcome these limitations, we propose SpaCoNet, which simultaneously models the Spatial relation and Co-occurrence of objects, guided by semantic segmentation. First, the Semantic Spatial Relation Module (SSRM) is constructed to model scene spatial features. With the help of semantic segmentation, this module decouples the spatial information from the scene image and explores all spatial relationships among objects in an end-to-end manner. Second, both the spatial features from the SSRM and the deep features from the Image Feature Extraction Module are allocated to each object, so as to distinguish coexisting objects across different scenes. Finally, using these discriminative features, we design a Global-Local Dependency Module to explore the long-range co-occurrence among objects and to generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
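To make the second and third steps concrete, the following is a minimal NumPy sketch of one way per-object feature allocation and object-level dependency modeling could look. It is an illustration under stated assumptions, not the SpaCoNet implementation: masked average pooling stands in for the allocation step, a single-head self-attention without learned projections stands in for the Global-Local Dependency Module, and all function names (`allocate_object_features`, `self_attention`, `scene_descriptor`) are hypothetical. The feature maps and the segmentation map are assumed to share the same spatial resolution.

```python
import numpy as np


def allocate_object_features(features, seg_map, num_classes):
    """Masked average pooling: one feature vector per semantic class.

    features: (C, H, W) feature map; seg_map: (H, W) integer class labels.
    Returns (num_classes, C) pooled features and a (num_classes,) presence mask.
    Classes that do not appear in seg_map keep an all-zero vector.
    """
    C, H, W = features.shape
    flat = features.reshape(C, H * W)                    # (C, HW)
    one_hot = np.eye(num_classes)[seg_map.reshape(-1)]   # (HW, K)
    counts = one_hot.sum(axis=0)                         # pixels per class
    pooled = (flat @ one_hot).T                          # (K, C) per-class sums
    present = counts > 0
    pooled[present] /= counts[present, None]             # sums -> means
    return pooled, present


def self_attention(tokens, present):
    """Scaled dot-product attention over object tokens (assumption: no
    learned query/key/value projections, absent classes masked out as keys)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)              # (K, K)
    scores[:, ~present] = -np.inf                        # never attend to absent objects
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights


def scene_descriptor(spatial_feats, deep_feats, seg_map, num_classes):
    """Allocate spatial and deep features to each object, relate the objects
    through attention, and average the present objects into one vector."""
    spatial_obj, present = allocate_object_features(spatial_feats, seg_map, num_classes)
    deep_obj, _ = allocate_object_features(deep_feats, seg_map, num_classes)
    tokens = np.concatenate([spatial_obj, deep_obj], axis=1)  # (K, Cs + Cd)
    attended, weights = self_attention(tokens, present)
    return attended[present].mean(axis=0), weights


# Toy example: a 4x4 image with three object classes present out of four.
rng = np.random.default_rng(0)
seg = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 2, 2],
                [2, 2, 2, 2]])
spatial = rng.normal(size=(3, 4, 4))   # stand-in for spatial features
deep = rng.normal(size=(5, 4, 4))      # stand-in for deep image features
desc, w = scene_descriptor(spatial, deep, seg, num_classes=4)
print(desc.shape, w.shape)
```

In this sketch the attention weights `w` play the role of pairwise object dependencies: each row describes how strongly one object class draws on the others present in the scene, which is one simple way to expose co-occurrence structure to a downstream classifier.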