Despite the remarkable success of convolutional neural networks in various computer vision tasks, recognizing indoor scenes remains a significant challenge because of their complex composition. Consequently, effectively leveraging the semantic information in a scene has become a key issue in advancing indoor scene recognition. Unfortunately, the limited accuracy of semantic segmentation has constrained the effectiveness of existing approaches that leverage semantic information. As a result, many of these approaches remain at the stage of auxiliary labeling or co-occurrence statistics, and few directly explore the contextual relationships between semantic elements within the scene. In this paper, we propose the Semantic Region Relationship Model (SRRM), which starts directly from the semantic information inside the scene. Specifically, SRRM adopts an adaptive and efficient approach to mitigate the negative impact of semantic ambiguity and then models the semantic region relationships to perform scene recognition. Additionally, to exploit the information contained in the scene more comprehensively, we combine the proposed SRRM with the PlacesCNN module to create the Combined Semantic Region Relation Model (CSRRM), and propose a novel information-combining approach to effectively explore the complementary content between them. CSRRM significantly outperforms state-of-the-art (SOTA) methods on MIT Indoor 67, the reduced Places365 dataset, and SUN RGB-D without retraining. The code is available at: https://github.com/ChuanxinSong/SRRM