Video retrieval (VR) involves retrieving the ground-truth video from a video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. Each of these components (objects & attributes, actions, and syntax) plays an important role in distinguishing among videos and retrieving the correct ground-truth video. However, it is unclear how each component affects video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD, and DIDEMO. The study covers two categories of video retrieval models: (i) models that are pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ) and (ii) models that adapt pre-trained image-text representations such as CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better syntactic and compositional understanding than models pre-trained on video-text data. The code is available at https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/ICSVR
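As a rough illustration of how the effect of a caption component on retrieval could be measured, the sketch below computes text-to-video Recall@K from a caption-by-video similarity matrix and compares the score before and after a caption perturbation. It is not the authors' evaluation code: the function name `recall_at_k`, the assumption that video `i` is the ground truth for caption `i`, and the toy similarity scores are all hypothetical and chosen only for illustration.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Fraction of captions whose ground-truth video ranks in the top k.

    Assumption (hypothetical layout): similarity[i][j] scores caption i
    against video j, and video i is the ground truth for caption i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of videos scoring strictly higher than the ground truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores invented for illustration only (not results from the study).
original = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.2, 0.1, 0.7],
]
# The same captions after a hypothetical perturbation (e.g. shuffled word order).
perturbed = [
    [0.6, 0.7, 0.1],
    [0.3, 0.8, 0.4],
    [0.2, 0.1, 0.7],
]

r1_original = recall_at_k(original, k=1)
r1_perturbed = recall_at_k(perturbed, k=1)
# A large drop would suggest the model relies on the perturbed component.
drop = r1_original - r1_perturbed
```

Under this reading, a small drop after perturbing actions or word order, compared with a larger drop after perturbing objects and attributes, would be consistent with the finding that actions and syntax play a minor role.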