Indoor scene classification has become an important task in perception modules and is widely used in various applications. However, intra-category variability and inter-category similarity continue to limit model performance, which motivates new types of features that yield a more meaningful scene representation. A semantic segmentation mask provides pixel-level information about the objects present in the scene, making it a promising source for a more meaningful local representation of the scene. Therefore, this work proposes a novel approach that uses a semantic segmentation mask to obtain a 2D spatial layout of the object categories across the scene, termed segmentation-based semantic features (SSFs). For each object category, these features encode the pixel count as well as the 2D average position and the corresponding standard deviation values. Moreover, a two-branch network, GS2F2App, is also proposed; it exploits CNN-based global features extracted from RGB images together with segmentation-based features extracted from the proposed SSFs. GS2F2App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving state-of-the-art results on both.
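As a rough illustration of the per-category statistics described in the abstract (pixel count, 2D average position, and standard deviation), the following minimal sketch computes them from an integer label mask with NumPy. The function name `segmentation_semantic_features`, the column order, the use of raw pixel coordinates, the population standard deviation, and the all-zero row for absent categories are assumptions made for this example; the abstract does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np


def segmentation_semantic_features(mask, num_classes):
    """Per-category layout statistics from a 2D label mask (hypothetical helper).

    Returns an array of shape (num_classes, 5) with columns
    [pixel_count, mean_row, mean_col, std_row, std_col].
    Categories absent from the mask keep an all-zero row (an assumption).
    """
    mask = np.asarray(mask)
    # Row and column index of every pixel, each with the same shape as the mask.
    rows, cols = np.indices(mask.shape)
    features = np.zeros((num_classes, 5), dtype=np.float64)
    for category in range(num_classes):
        selected = mask == category
        count = int(selected.sum())
        if count == 0:
            continue  # category not present in this scene
        r = rows[selected]
        c = cols[selected]
        # Population standard deviation (ddof=0) is assumed here.
        features[category] = [count, r.mean(), c.mean(), r.std(), c.std()]
    return features


# Toy 2x3 mask with two categories present out of three.
toy_mask = np.array([[0, 0, 1],
                     [0, 1, 1]])
ssf = segmentation_semantic_features(toy_mask, num_classes=3)
print(ssf)
```

In practice one might normalize the count by the image area and the positions by the image size so that features are comparable across resolutions; that choice is likewise an assumption and is left out of the sketch.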