Deep learning-based scene recognition has made significant progress, but its performance remains limited by inter-class similarity and intra-class dissimilarity. Moreover, prior research has focused primarily on improving classification accuracy and has paid less attention to making scene classification both interpretable and precise. We therefore propose EnTri, an ensemble scene recognition framework that applies ensemble learning to a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel level, semantic segmentation level, and object class and frequency level. By combining feature encoding schemes of differing complexity with ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability. To achieve interpretability, we devise an extension algorithm that generates visual and textual explanations highlighting the properties of a given scene that contribute to the final prediction of its category, including information about objects, statistics, spatial layout, and textural details. In experiments on benchmark scene classification datasets, EnTri achieves recognition accuracy competitive with state-of-the-art approaches: 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
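As a rough illustration of the three-level ensemble idea described above, the following minimal sketch fuses class probabilities from three hypothetical branches (pixel level, segmentation level, object level) by weighted soft voting and produces a one-line textual explanation. The abstract does not specify EnTri's actual fusion rule, encoders, or explanation algorithm, so everything here is an assumption made for illustration: the function names (`object_frequency_vector`, `soft_vote`, `explain`), the use of normalized object counts as the object class and frequency encoding, the averaging scheme, and the example probabilities.

```python
import numpy as np


def object_frequency_vector(detected_labels, vocabulary):
    """Encode a scene as normalized counts of detected object classes.

    Hypothetical stand-in for an object class and frequency encoding;
    labels outside the vocabulary are ignored.
    """
    index = {name: i for i, name in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary), dtype=float)
    for label in detected_labels:
        if label in index:
            counts[index[label]] += 1.0
    total = counts.sum()
    return counts / total if total > 0 else counts


def soft_vote(branch_probs, weights=None):
    """Weighted average of per-branch class probabilities (late fusion).

    Assumed fusion rule, not necessarily the one used by EnTri.
    """
    probs = np.asarray(branch_probs, dtype=float)  # shape: (branches, classes)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = weights @ probs
    return fused / fused.sum()


def explain(fused, branch_probs, branch_names, class_names):
    """Build a short textual explanation naming the most supportive branch."""
    pred = int(np.argmax(fused))
    support = {
        name: float(p[pred])
        for name, p in zip(branch_names, np.asarray(branch_probs, dtype=float))
    }
    strongest = max(support, key=support.get)
    return (
        f"Predicted '{class_names[pred]}' (score {fused[pred]:.2f}); "
        f"strongest support from the {strongest} branch."
    )


if __name__ == "__main__":
    class_names = ["bedroom", "kitchen", "office"]
    branch_names = ["pixel", "segmentation", "object"]
    # Made-up probabilities standing in for three trained classifiers.
    branch_probs = [
        [0.5, 0.3, 0.2],
        [0.2, 0.6, 0.2],
        [0.1, 0.7, 0.2],
    ]
    fused = soft_vote(branch_probs)
    print(explain(fused, branch_probs, branch_names, class_names))
```

Soft voting is chosen here only because it is a simple, common late-fusion baseline; passing a `weights` list to `soft_vote` lets one branch count more than the others, for example when one feature level is known to be more reliable on a given dataset.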