In this paper, we introduce DiarizationLM, a framework that leverages large language models (LLMs) to post-process the outputs of a speaker diarization system. The proposed framework can serve various goals, such as improving the readability of the diarized transcript or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented in a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can then be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization system without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by 55.5% relative on the Fisher telephone conversation dataset and by 44.9% relative on the Callhome English dataset.
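The abstract does not spell out the compact textual format or how WDER is computed, so the following is only an illustrative sketch under stated assumptions. It assumes a hypothetical `<spk:N>` tag placed before each speaker turn: consecutive words sharing a speaker label are merged into one turn, and the text can be parsed back into per-word labels once an LLM has rewritten it. The helper names `to_compact_text`, `from_compact_text`, and `simple_wder` are hypothetical. `simple_wder` is a deliberately simplified stand-in for WDER. It assumes the hypothesis contains exactly the reference words, which means no ASR alignment step is needed. It then counts words whose speaker is wrong under the best mapping of hypothesis labels to reference labels.

```python
from itertools import groupby, permutations


def to_compact_text(words, speakers):
    """Render per-word speaker labels as speaker-turn text (hypothetical <spk:N> format)."""
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    turns = []
    # Consecutive words with the same speaker label form a single turn.
    for spk, group in groupby(zip(words, speakers), key=lambda pair: pair[1]):
        turns.append(f"<spk:{spk}> " + " ".join(word for word, _ in group))
    return " ".join(turns)


def from_compact_text(text):
    """Parse speaker-turn text back into parallel word and speaker lists."""
    words, speakers, current = [], [], None
    for token in text.split():
        if token.startswith("<spk:") and token.endswith(">"):
            current = int(token[len("<spk:"):-1])
        elif current is None:
            raise ValueError("text must start with a speaker tag")
        else:
            words.append(token)
            speakers.append(current)
    return words, speakers


def simple_wder(ref_speakers, hyp_speakers):
    """Fraction of words assigned to the wrong speaker under the best label mapping.

    Simplifying assumption (not the full WDER definition): the hypothesis has
    exactly the reference words, so only speaker labels can be wrong.
    """
    if len(ref_speakers) != len(hyp_speakers) or not ref_speakers:
        raise ValueError("need two non-empty label sequences of equal length")
    ref_labels = sorted(set(ref_speakers))
    hyp_labels = sorted(set(hyp_speakers))
    # Pad with None so that extra hypothesis speakers can map to "no reference speaker".
    candidates = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))
    best_errors = len(ref_speakers)
    # Brute-force search over injective mappings; fine for a handful of speakers.
    for perm in permutations(candidates, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        errors = sum(mapping[h] != r for r, h in zip(ref_speakers, hyp_speakers))
        best_errors = min(best_errors, errors)
    return best_errors / len(ref_speakers)
```

For example, `to_compact_text(["hi", "there", "hello"], [1, 1, 2])` yields `"<spk:1> hi there <spk:2> hello"`. Because speaker labels are arbitrary, the brute-force mapping ensures that a hypothesis that merely swaps speaker IDs is not penalized.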