This paper presents a novel approach to Zero-Shot Action Recognition. Recent work has explored detecting and classifying objects to obtain semantic information from videos, with remarkable performance. Inspired by this, we propose using video captioning methods to extract semantic information about objects, scenes, humans, and their relationships. To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences. More specifically, we represent videos using sentences generated by video captioning methods, and classes using sentences extracted from documents retrieved through Internet search engines. Using these representations, we build a shared semantic space with BERT-based embedders pre-trained on the paraphrasing task over multiple text datasets. Because both the visual and the semantic information are expressed as sentences, projecting them onto this space is straightforward, and classification reduces to the nearest neighbor rule. We demonstrate that representing videos and labels with sentences alleviates the domain adaptation problem. Additionally, we show that word vectors are unsuitable for building the semantic embedding space of our descriptions. Our method outperforms the state of the art on the UCF101 dataset by 3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50% training/testing split). Our code is available at https://github.com/valterlej/zsarcap.
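The classification step described above, in which video captions and class descriptions live in one sentence-embedding space and a video takes the label of its nearest class sentence, can be illustrated with a minimal sketch. This is not the released implementation. The names `toy_embed`, `cosine`, and `classify` are hypothetical, the bag-of-words embedder is only a dependency-free stand-in for the BERT-based paraphrase embedders the paper uses, and taking the single best caption/sentence pair is an assumed aggregation rule.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]  # sparse vector: token -> weight


def toy_embed(sentence: str) -> Vector:
    """Hypothetical stand-in embedder (bag-of-words counts).

    Assumption for illustration only: the paper embeds sentences with
    BERT-based models pre-trained on paraphrasing, not word counts.
    """
    return {w: float(c) for w, c in Counter(sentence.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(
    captions: List[str],
    class_sentences: Dict[str, List[str]],
    embed: Callable[[str], Vector] = toy_embed,
) -> str:
    """Nearest neighbor rule over sentences.

    Returns the label whose descriptive sentence is most similar to any
    caption generated for the video. Using the single best pair is an
    assumption of this sketch.
    """
    best_label, best_score = "", -1.0
    for caption in captions:
        caption_vec = embed(caption)
        for label, sentences in class_sentences.items():
            for sentence in sentences:
                score = cosine(caption_vec, embed(sentence))
                if score > best_score:
                    best_label, best_score = label, score
    return best_label


# Made-up class descriptions, standing in for sentences taken from web documents.
class_sentences = {
    "playing guitar": ["a person plays a guitar by strumming the strings"],
    "horse riding": ["someone rides a horse along a trail"],
}

# Made-up caption, standing in for the output of a video captioning model.
predicted = classify(["a man is sitting and strumming a guitar"], class_sentences)
```

No classifier is trained on the unseen classes here: adding a new action only requires adding descriptive sentences for it, which is what makes the setup zero-shot. A learned sentence embedder would replace `toy_embed` through the `embed` parameter, with `cosine` adapted to dense vectors.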