Speaker diarization (SD) is a classic task in speech processing and is crucial in multi-party scenarios such as meetings and conversations. Current mainstream speaker diarization approaches consider only acoustic information, which results in performance degradation under adverse acoustic conditions. In this paper, we propose methods to extract speaker-related information from the semantic content of multi-party meetings, which, as we will show, can further benefit speaker diarization. We introduce two sub-tasks, Dialogue Detection and Speaker-Turn Detection, through which we effectively extract speaker information from conversational semantics. We also propose a simple yet effective algorithm that jointly models acoustic and semantic information to obtain speaker-identified texts. Experiments on both the AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.