Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

The video topic segmentation (VTS) task partitions a video into intelligible, non-overlapping topical segments, enabling efficient comprehension of video content and quick access to specific parts of it. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods, which rely on shallow features or unsupervised approaches, struggle to discern the nuances of topical transitions accurately. Recently, supervised approaches have outperformed unsupervised ones on video action and scene segmentation. In this work, we improve supervised VTS by thoroughly exploring multimodal fusion and multimodal coherence modeling. Specifically, (1) we enhance multimodal fusion by exploring different architectures that use cross-attention and mixture of experts; (2) to strengthen multimodal alignment and fusion more generally, we pre-train and fine-tune the model with multimodal contrastive learning; and (3) we propose a new pre-training task tailored to VTS and a novel fine-tuning task that enhances multimodal coherence modeling for VTS. We evaluate the proposed approaches on educational videos in the form of lectures, because topic segmentation of educational videos plays a vital role in improving the learning experience. Additionally, we introduce a large-scale Chinese lecture video dataset to complement the existing English corpus and to promote further research in VTS. Experiments on both English and Chinese lecture datasets demonstrate that our model achieves superior VTS performance compared to competitive unsupervised and supervised baselines.
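
To make the multimodal contrastive learning mentioned in point (2) more concrete, the following is a minimal sketch of one common formulation, a symmetric InfoNCE loss over paired visual and text clip features. The abstract does not specify the loss actually used, so the function names, the temperature value, and the choice of InfoNCE itself are assumptions made purely for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired clip features.

    Hypothetical helper, not the paper's implementation.
    visual, text: arrays of shape (batch, dim). Row i of each array is
    assumed to come from the same clip (a positive pair); every other
    row in the batch serves as a negative.
    """
    v = l2_normalize(visual)
    t = l2_normalize(text)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))                 # positives lie on the diagonal
    loss_v2t = -log_softmax(logits)[idx, idx].mean()    # visual -> text
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()  # text -> visual
    return 0.5 * (loss_v2t + loss_t2v)


# Usage: aligned pairs should score a lower loss than misaligned pairs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
aligned = symmetric_info_nce(feats, feats)
misaligned = symmetric_info_nce(feats, np.roll(feats, 1, axis=0))
```

Minimizing a loss of this kind pulls the visual and text features of the same clip together and pushes apart those of different clips, which is one way alignment between modalities could be encouraged before fusion.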
