Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal methods, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks such as video scene/shot segmentation cater to short videos with clear visual shifts but falter on long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution significantly surpasses baseline methods in both accuracy and transferability, in intra-domain as well as cross-domain settings.