Recent advances in speech-based topic segmentation have highlighted the potential of pretrained speech encoders to capture semantic representations directly from speech. Traditionally, topic segmentation has relied on a pipeline approach in which transcripts are first generated by an automatic speech recognition (ASR) system and text-based segmentation algorithms are then applied. In this paper, we introduce an end-to-end scheme that bypasses this conventional two-step process by directly employing semantic speech encoders for segmentation. Focusing on the broadcast news domain, which poses unique challenges due to the diversity of speakers and topics within a single recording, we address the challenge of efficiently locating topic change points in an end-to-end manner. Furthermore, we propose a new benchmark for spoken news topic segmentation based on a dataset of approximately 1,000 hours of publicly available recordings across six European languages, together with an evaluation set in Hindi for testing cross-domain performance in a cross-lingual, zero-shot scenario. This setup reflects real-world diversity and the need for models that adapt to varied linguistic settings. Our results demonstrate that while the traditional pipeline approach achieves a state-of-the-art $P_k$ score of 0.2431 for English, our end-to-end model delivers a competitive $P_k$ score of 0.2564. When trained multilingually, these scores improve further to 0.1988 and 0.2370, respectively. To support further research, we release our model along with data preparation scripts, facilitating open research on multilingual spoken news topic segmentation.
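For readers unfamiliar with the reported metric, the following is a minimal sketch of the $P_k$ segmentation error (Beeferman et al., 1999) used for evaluation above. It assumes the reference and hypothesis are given as per-unit segment labels; the function name and input convention are illustrative, not the paper's actual evaluation code (libraries such as NLTK provide a reference implementation).

```python
def pk_score(ref, hyp, k=None):
    """P_k: probability that two units k apart are wrongly judged as
    being in the same vs. different segments. Lower is better; 0.0 is
    a perfect segmentation.

    ref, hyp: sequences of segment labels, one per unit, e.g.
    [0, 0, 0, 1, 1, 1] has one topic boundary after the third unit.
    """
    n = len(ref)
    if k is None:
        # Common convention: k = half the mean reference segment length.
        k = max(1, round(n / len(set(ref)) / 2))
    errors = 0
    for i in range(n - k):
        same_ref = ref[i] == ref[i + k]   # same segment in reference?
        same_hyp = hyp[i] == hyp[i + k]   # same segment in hypothesis?
        errors += same_ref != same_hyp    # count disagreements
    return errors / (n - k)
```

With this convention a perfect segmentation scores 0.0, and a near-miss boundary is penalized in proportion to how many probe windows straddle it.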