In this paper, we introduce DiarizationLM, a framework that leverages large language models (LLMs) to post-process the outputs of a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented in a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.
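To illustrate the kind of compact textual format described above, the sketch below merges per-word speaker labels from a diarization system into a single tagged transcript string suitable for inclusion in an LLM prompt. The exact tag syntax (`<speaker:N>` here) and the function name are illustrative assumptions, not necessarily the format used in the paper.

```python
def to_compact_text(words, speakers, tag="<speaker:{}>"):
    """Merge per-word speaker labels into a compact diarized transcript.

    Consecutive words from the same speaker are grouped under one tag, so
    ['hi', 'there', 'hello'] with speakers [1, 1, 2] becomes
    '<speaker:1> hi there <speaker:2> hello'.

    NOTE: the '<speaker:N>' tag format is an illustrative assumption.
    """
    parts = []
    prev = None  # speaker of the previous word, to detect turn changes
    for word, spk in zip(words, speakers):
        if spk != prev:
            parts.append(tag.format(spk))
            prev = spk
        parts.append(word)
    return " ".join(parts)
```

A string of this shape would be embedded in the prompt; the LLM's completion, in the same format, can then be parsed back into refined word-level speaker labels.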