The surge in video and social media content underscores the need for a deeper understanding of multimedia data. Most mature video understanding techniques perform well on short formats and on content that requires only shallow understanding, but perform poorly on long-format videos that require deep understanding and reasoning. The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics to address the problem of holistically analyzing long videos and extracting useful knowledge to answer different types of queries. This paper introduces a query-aware method for long video localization and relation discrimination that leverages an image-language pretrained model. The model selects frames pertinent to each query, obviating the need for a complete movie-level knowledge graph. Our approach achieved first and fourth positions for two groups of movie-level queries. Extensive experiments and the final rankings demonstrate its effectiveness and robustness.