Hierarchical classification (HC) assigns each object multiple labels organized into a hierarchical structure. Existing deep-learning-based HC methods usually classify an instance by descending from the root node until a leaf node is reached. In the real world, however, images degraded by noise, occlusion, blur, or low resolution may not provide sufficient information for classification at subordinate levels. To address this issue, we propose a novel semantic guided level-category hybrid prediction network (SGLCHPN) that jointly performs level and category prediction in an end-to-end manner. SGLCHPN comprises two modules: a visual transformer that extracts feature vectors from the input images, and a semantic guided cross-attention module that uses category word embeddings as queries to guide the learning of category-specific representations. To evaluate the proposed method, we construct two new datasets whose images span a broad range of quality and are therefore labeled to different levels (depths) in the hierarchy according to their individual quality. Experimental results demonstrate the effectiveness of the proposed HC method.
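The semantic guided cross-attention idea, in which category word embeddings act as queries over image features, can be illustrated with a minimal sketch. This is not the SGLCHPN implementation: it assumes plain scaled dot-product attention with no learned projections, and the function names, toy vectors, and dimensions below are all hypothetical stand-ins chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Scaled dot-product cross-attention (illustrative assumption).

    queries:  one vector per category (stand-ins for word embeddings)
    features: one vector per image patch (stand-ins for transformer outputs)
    Returns one category-specific vector per query, plus the attention weights.
    Learned query/key/value projections are omitted for brevity.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Similarity between this category query and every patch feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / scale for f in features]
        attn = softmax(scores)
        # Attention-weighted average of patch features for this category.
        pooled = [sum(a * f[d] for a, f in zip(attn, features)) for d in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


# Hypothetical toy data: 2 categories, 3 patches, 4-dimensional vectors.
category_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patch_features = [
    [4.0, 0.0, 1.0, 0.0],
    [0.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

category_repr, attn_weights = cross_attention(category_embeddings, patch_features)
```

Each category query attends most strongly to the patch whose features align with it, so `category_repr` holds one pooled vector per category. In a full model these vectors would presumably feed the level and category prediction heads, but that wiring is not specified in the abstract and is left out here.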