Enhancing Classification with Hierarchical Scalable Query on Fusion Transformer

Real-world vision-based applications in areas such as e-commerce, mobile applications, and warehouse management require fine-grained classification, where reducing the severity of mistakes and improving classification accuracy are of utmost importance. This paper proposes a method to boost fine-grained classification through a hierarchical approach based on learnable, independent query embeddings. A classification network uses coarse class predictions to improve fine class accuracy in a stage-wise, sequential manner. We exploit the hierarchy to learn query embeddings that scale across all levels, which keeps the approach relevant even for extreme classification with a very large number of classes. Each query is initialized with a weighted Eigen image computed from training samples so that it best represents and captures the variance of the object. We introduce transformer blocks that fuse the intermediate layers at which query attention happens, enhancing the spatial representation of feature maps at different scales. This multi-scale fusion helps improve accuracy on small objects. We propose a two-fold approach to obtain a unique representation for each learnable query. First, at each hierarchical level, we use a cluster-based loss that ensures maximum separation between the query embeddings of different classes and helps learn a better query representation in higher-dimensional spaces. Second, we fuse coarse-level queries with finer-level queries, weighted by a learned scale factor. We additionally introduce a novel block, the Cross Attention on Multi-level queries with Prior (CAMP) block, that helps reduce error propagation from coarse to finer levels, a common problem in hierarchical classifiers. Our method outperforms existing methods with an improvement of ~11% on fine-grained classification.
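The query initialization and the coarse-to-fine query fusion described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify how the Eigen image is weighted or how the fusion is formed, so this sketch assumes the weights are explained-variance ratios and that the fusion is additive with a scalar factor. The function names `weighted_eigen_image` and `fuse_queries`, the parameter `k`, the sign convention, and the fixed value of `alpha` (which would be learned during training) are all hypothetical.

```python
import numpy as np


def weighted_eigen_image(images, k=3):
    """Combine the top-k eigenimages of one class into a single vector.

    images: array of shape (n_samples, height, width).
    Assumption: weights are explained-variance ratios; the abstract
    does not state the actual weighting scheme.
    """
    n, h, w = images.shape
    flat = images.reshape(n, h * w).astype(np.float64)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions (eigenimages) in pixel space.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(k, len(s))
    var = s[:k] ** 2
    weights = var / max(var.sum(), 1e-12)  # guard against zero variance
    # Eigenvector signs are arbitrary; fix them so each component sums >= 0
    # (an illustrative convention, not taken from the paper).
    comps = vt[:k]
    signs = np.where(comps.sum(axis=1) < 0, -1.0, 1.0)
    comps = comps * signs[:, None]
    return weights @ comps  # shape: (height * width,)


def fuse_queries(fine_q, coarse_q, parent, alpha):
    """Add each fine query's parent coarse query, scaled by alpha.

    fine_q:   (n_fine, d) fine-level query embeddings
    coarse_q: (n_coarse, d) coarse-level query embeddings
    parent:   (n_fine,) index of the coarse parent of each fine class
    alpha:    scalar scale factor (learned in training; fixed here)
    """
    return fine_q + alpha * coarse_q[parent]


rng = np.random.default_rng(0)
class_images = rng.normal(size=(10, 4, 4))      # synthetic stand-in data
query_init = weighted_eigen_image(class_images, k=3)

fine_q = rng.normal(size=(4, 16))
coarse_q = rng.normal(size=(2, 16))
parent = np.array([0, 0, 1, 1])                 # two fine classes per coarse class
fused = fuse_queries(fine_q, coarse_q, parent, alpha=0.5)
print(query_init.shape, fused.shape)
```

In a trainable model the fused queries and `alpha` would be parameters updated by backpropagation; here they are plain arrays so that the data flow is easy to follow.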
