Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them jointly with a transformer-based architecture. However, we observe that the two tasks differ in emphasis: MR requires perceiving local relationships, whereas HD prioritizes understanding global context. Consequently, a model without task-specific design is limited in its ability to capture the intrinsic characteristics of each task. To address this issue, we propose a Unified Video COMprehension framework (UVCOM) that bridges the gap and solves MR and HD jointly and effectively. By progressively integrating intra-modality and inter-modality information across multiple granularities, UVCOM achieves a comprehensive understanding of a video. Moreover, we present multi-aspect contrastive learning, which consolidates local relation modeling and global knowledge accumulation through a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights, and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.