Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.