In this paper, we introduce DiarizationLM, a framework that leverages large language models (LLMs) to post-process the outputs of a speaker diarization system. The framework can serve various goals, such as improving the readability of the diarized transcript or reducing the word diarization error rate (WDER). The outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented in a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can then be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model reduces the WDER by a relative 55.5% on the Fisher telephone conversation dataset and by a relative 44.9% on the Callhome English dataset.
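To make the idea of a compact textual format concrete, the following is a minimal sketch of how word-level ASR output and speaker labels could be serialized into a prompt-ready string and parsed back. The abstract does not specify the format, so the `<spk:N>` tag syntax and the helper names `to_compact_text`, `from_compact_text`, and `simple_wder` are illustrative assumptions. The error metric is also deliberately simplified: it assumes the reference and hypothesis cover the same word sequence and already share consistent speaker IDs, whereas a full WDER computation would first align ASR words to the reference and resolve the speaker mapping.

```python
from typing import List, Tuple


def to_compact_text(words: List[str], speakers: List[int]) -> str:
    """Serialize words and per-word speaker labels into one string.

    Assumed format (hypothetical): a "<spk:N>" tag is emitted only when
    the speaker changes, followed by that speaker's words.
    """
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    parts: List[str] = []
    previous = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            parts.append(f"<spk:{speaker}>")
            previous = speaker
        parts.append(word)
    return " ".join(parts)


def from_compact_text(text: str) -> Tuple[List[str], List[int]]:
    """Parse the assumed compact format back into words and speaker labels."""
    words: List[str] = []
    speakers: List[int] = []
    current = None
    for token in text.split():
        if token.startswith("<spk:") and token.endswith(">"):
            current = int(token[len("<spk:"):-1])
        else:
            if current is None:
                raise ValueError("word found before any speaker tag")
            words.append(token)
            speakers.append(current)
    return words, speakers


def simple_wder(ref_speakers: List[int], hyp_speakers: List[int]) -> float:
    """Fraction of words whose speaker label differs from the reference.

    Simplification: both sequences must label the same words in the same
    order, with speaker IDs already mapped consistently.
    """
    if len(ref_speakers) != len(hyp_speakers):
        raise ValueError("label sequences must have the same length")
    if not ref_speakers:
        return 0.0
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return wrong / len(ref_speakers)
```

Under these assumptions, the serialized string would be placed in the LLM prompt, and the LLM completion would be parsed with `from_compact_text` to recover the refined per-word speaker labels.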