Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words. The conventional approach reconciles outputs from independently optimized ASR and SD systems, where the SD system typically uses only acoustic information to identify the speakers in the audio stream. This approach can lead to speaker errors, especially around speaker turns and in regions of speaker overlap. In this paper, we propose a novel second-pass speaker error correction system that uses lexical information and leverages modern language models (LMs). Our experiments across multiple telephony datasets show that our approach is both effective and robust. Trained and tuned only on the Fisher dataset, this error correction approach yields relative word-level diarization error rate (WDER) reductions of 15-30% on three telephony datasets: RT03-CTS, Callhome American English, and held-out portions of Fisher.
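To make the reported metric concrete, the following is a minimal sketch of how a word-level diarization error rate and its relative reduction could be computed. It assumes a simplified reading of WDER: the fraction of ASR words aligned to a reference word (correctly recognized or substituted) that carry the wrong speaker label. The function names `wder` and `relative_reduction` and the toy speaker labels are hypothetical illustrations, not part of the proposed system or its evaluation.

```python
from typing import List, Tuple

# Each pair is (reference_speaker, hypothesis_speaker) for one ASR word that
# was aligned to a reference word. This input format is an assumption.
AlignedWords = List[Tuple[str, str]]


def wder(aligned: AlignedWords) -> float:
    """Fraction of aligned words whose hypothesized speaker is wrong."""
    if not aligned:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for ref_spk, hyp_spk in aligned if ref_spk != hyp_spk)
    return wrong / len(aligned)


def relative_reduction(baseline: float, corrected: float) -> float:
    """Relative error reduction of `corrected` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - corrected) / baseline


# Toy data only: 10 aligned words, first-pass output versus second-pass output.
first_pass = [("A", "A")] * 8 + [("A", "B")] * 2
second_pass = [("A", "A")] * 9 + [("A", "B")] * 1

baseline_wder = wder(first_pass)
corrected_wder = wder(second_pass)
print(baseline_wder, corrected_wder,
      relative_reduction(baseline_wder, corrected_wder))
```

Under this reading, a relative reduction of 15-30% means the second pass removes between roughly one in seven and one in three of the word-level speaker errors left by the first-pass system.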