Speaker diarization has gained considerable attention within the speech processing research community. Mainstream speaker diarization systems rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Because speech signals also convey the content of what is said, we aim to fully exploit these semantic cues using language models. In this work, we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. First, we introduce spoken language understanding modules to extract speaker-related semantic information and use this information to construct pairwise constraints. Second, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on a public dataset demonstrate that our proposed approach consistently outperforms acoustic-only speaker diarization systems.
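To make the idea of pairwise constraints concrete, the following is a minimal sketch of how must-link and cannot-link pairs could be folded into a clustering-based diarization back end. It is an illustration under stated assumptions, not the framework proposed in this work: the function names (`cosine_affinity`, `apply_constraints`, `cluster`), the override scheme for constrained pairs, the average-linkage clustering, and the threshold value are all hypothetical, and the constraints are written by hand where a spoken language understanding module would supply them.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Pairwise cosine similarity between segment-level speaker embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def apply_constraints(affinity, must_link, cannot_link):
    """Override acoustic affinities for constrained pairs (assumed scheme)."""
    out = affinity.copy()
    for i, j in must_link:      # semantic cue: same speaker
        out[i, j] = out[j, i] = 1.0
    for i, j in cannot_link:    # semantic cue: different speakers
        out[i, j] = out[j, i] = -1.0
    return out


def cluster(affinity, threshold):
    """Greedy average-linkage clustering; stops when no pair reaches threshold."""
    clusters = [[i] for i in range(len(affinity))]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = affinity[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < threshold:
            break
        a, b = best_pair
        clusters[a] += clusters[b]
        del clusters[b]
    labels = np.empty(len(affinity), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Toy embeddings for three segments; segment 1 is acoustically closer to 0.
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
acoustic = cosine_affinity(emb)
labels_acoustic = cluster(acoustic, threshold=0.5)

# Hand-written constraints standing in for semantic evidence.
constrained = apply_constraints(acoustic, must_link=[(1, 2)], cannot_link=[(0, 1)])
labels_constrained = cluster(constrained, threshold=0.5)
```

In this toy case the acoustic-only run groups segments 0 and 1, while the constrained run groups segments 1 and 2. Hard overrides are the simplest possible choice; a softer scheme that only nudges affinities may be preferable when the semantic evidence is itself noisy.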