Multi-label image classification involves recognizing multiple objects within a single image. Since both the semantic information carried by the labels and the visual features presented in the image are valuable, tight visual-linguistic interaction plays a vital role in improving classification performance. Moreover, because objects within a single image can vary in size and appearance, attending to features at different scales helps to discover all possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging their ability to model long-range dependencies, but they have several limitations. First, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Second, they extract visual features and perform cross-modal fusion at a single scale only, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1)~a hierarchical multi-scale architecture with a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted at multiple scales to recognize objects of varying sizes and appearances; (2)~Interactive Visual-Linguistic Attention, a novel attention mechanism that tightly couples cross-modal interaction, enabling the joint updating of visual, linguistic, and multi-modal features. We evaluate our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods at lower computational cost.
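To make the joint-updating idea concrete, the following is a minimal PyTorch sketch of an attention block that mixes visual, linguistic, and multi-modal tokens in a single shared attention operation so that all three feature sets are updated together. This is an illustration only: the class name `InteractiveVLAttention`, the token dimensions, and the concatenate-then-attend formulation are assumptions for exposition, not the exact design of HSVLT.

```python
# Hypothetical sketch of joint visual-linguistic updating; the actual
# Interactive Visual-Linguistic Attention in HSVLT may differ.
import torch
import torch.nn as nn


class InteractiveVLAttention(nn.Module):
    """Jointly updates visual, linguistic, and multi-modal features
    with one shared attention over the concatenated token set."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, lin, joint):
        # vis:   (B, Nv, D) visual tokens from the current scale
        # lin:   (B, Nl, D) label-embedding (linguistic) tokens
        # joint: (B, Nj, D) multi-modal tokens carried across scales
        tokens = torch.cat([vis, lin, joint], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)  # cross-modal mixing
        tokens = self.norm(tokens + out)            # residual update
        nv, nl = vis.shape[1], lin.shape[1]
        # Split back so each modality is returned in updated form.
        return tokens[:, :nv], tokens[:, nv:nv + nl], tokens[:, nv + nl:]


if __name__ == "__main__":
    block = InteractiveVLAttention()
    v = torch.randn(2, 49, 256)  # e.g. a flattened 7x7 visual feature map
    l = torch.randn(2, 20, 256)  # e.g. embeddings for 20 candidate labels
    j = torch.randn(2, 20, 256)  # joint multi-modal features
    v2, l2, j2 = block(v, l, j)
    print(v2.shape, l2.shape, j2.shape)
```

Under this sketch, running such a block at each level of a hierarchical backbone and aggregating the `joint` tokens across levels would correspond to the multi-scale fusion role that the Cross-Scale Aggregation module plays in the proposed architecture.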