Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Video topic segmentation (VTS) divides a video into intelligible, non-overlapping topics, enabling efficient comprehension of video content and quick access to specific segments. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods, which rely on shallow features or unsupervised approaches, struggle to discern the nuances of topical transitions. Recently, supervised approaches have outperformed unsupervised ones on video action or scene segmentation. In this work, we improve supervised VTS by thoroughly exploring multimodal fusion and multimodal coherence modeling. Specifically, (1) we enhance multimodal fusion by exploring different architectures that use cross-attention and mixture of experts; (2) to strengthen multimodal alignment and fusion more generally, we pre-train and fine-tune the model with multimodal contrastive learning; and (3) we propose a new pre-training task tailored to VTS and a novel fine-tuning task that enhances multimodal coherence modeling for VTS. We evaluate the proposed approaches on educational videos in the form of lectures, because topic segmentation of educational videos plays a vital role in improving learning experiences. Additionally, we introduce a large-scale Chinese lecture video dataset to complement the existing English corpus and promote further research on VTS. Experiments on both English and Chinese lecture datasets demonstrate that our model achieves superior VTS performance compared with competitive unsupervised and supervised baselines.
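
To make the multimodal contrastive learning mentioned above concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between visual and text clip embeddings. It is not the authors' formulation: the function names, the temperature value of 0.07, and the assumption that row i of each matrix describes the same video clip are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative, hypothetical helper).

    visual, text: arrays of shape (batch, dim). Row i of each array is
    assumed to come from the same clip (the positive pair); every other
    row in the batch serves as a negative.
    """
    v = l2_normalize(np.asarray(visual, dtype=np.float64))
    t = l2_normalize(np.asarray(text, dtype=np.float64))
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Visual-to-text: each visual row should rank its own text row highest.
    loss_v2t = -log_softmax(logits)[idx, idx].mean()
    # Text-to-visual: the same objective on the transposed matrix.
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(8, 16))
    text = visual + 0.01 * rng.normal(size=(8, 16))  # nearly aligned modalities
    print("aligned loss:   ", symmetric_contrastive_loss(visual, text))
    print("misaligned loss:", symmetric_contrastive_loss(visual, np.roll(text, 1, axis=0)))
```

Under these assumptions, the loss is low when paired visual and text embeddings point in similar directions and high when the pairing is shuffled, which is the alignment pressure that contrastive pre-training and fine-tuning are meant to provide.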
