Visual-Language Models (VLMs) have significantly advanced video action recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Although these works have proven effective, we believe that the potential of VLMs has yet to be fully harnessed. We therefore exploit the semantic units (SUs) hidden behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignment between visual content and the SUs, we introduce a multi-region module (MRA) into the visual branch of the VLM. The MRA enables the perception of region-aware visual features beyond the original global feature. Our method uses the visual features of frames to adaptively attend to and select relevant SUs. A cross-modal decoder then uses the selected SUs to decode spatiotemporal video representations. In summary, using SUs as a medium boosts both discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieves 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, it surpasses the previous state of the art by +7.1% and +15.0% on HMDB-51 and UCF-101, respectively.
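As a rough illustration of the step in which frame features attend to and select relevant SUs, the following minimal sketch scores each SU against each frame by cosine similarity, turns the scores into attention weights with a softmax, and pools the SU embeddings per frame. This is not the authors' implementation: the function names, the hand-written 4-dimensional vectors standing in for VLM text and frame embeddings, and the temperature of 0.07 are all assumptions made for the example, and the MRA and the cross-modal decoder are not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_semantic_units(frame_feats, su_embeds, temperature=0.07):
    """Weight every SU by its similarity to each frame, then pool SU embeddings.

    frame_feats: list of T frame vectors, each of length D
    su_embeds:   list of N semantic-unit vectors, each of length D
    Returns (weights, pooled): T x N attention weights and T x D pooled features.
    The temperature is an assumed value, not one taken from the paper.
    """
    frames = [l2_normalize(f) for f in frame_feats]
    units = [l2_normalize(u) for u in su_embeds]
    dim = len(units[0])
    weights, pooled = [], []
    for frame in frames:
        # Cosine similarity between this frame and every SU, sharpened by temperature.
        sims = [sum(a * b for a, b in zip(frame, u)) / temperature for u in units]
        w = softmax(sims)
        weights.append(w)
        # Attention-weighted sum of SU embeddings for this frame.
        pooled.append([sum(w[n] * units[n][d] for n in range(len(units)))
                       for d in range(dim)])
    return weights, pooled


# Hypothetical SUs with toy one-hot embeddings (real ones would come from a text encoder).
su_names = ["hand", "ball", "court", "throwing"]
su_embeds = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Two toy frame features (real ones would come from the visual branch).
frame_feats = [
    [0.1, 0.9, 0.2, 0.1],
    [0.7, 0.1, 0.0, 0.6],
]

weights, pooled = attend_semantic_units(frame_feats, su_embeds)
# Most strongly attended SU for each frame.
top_units = [su_names[max(range(len(su_names)), key=lambda n: w[n])] for w in weights]
print(top_units)  # ['ball', 'hand']
```

A soft weighting is used here instead of a hard top-k selection so that every SU contributes in proportion to its similarity; a lower temperature would make the selection closer to picking a single SU per frame.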