In this paper, we introduce DiarizationLM, a framework that leverages large language models (LLMs) to post-process the outputs of a speaker diarization system. The framework can serve various goals, such as improving the readability of the diarized transcript or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented in a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can then be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model reduces the WDER by a relative 55.5% on the Fisher telephone conversation dataset and by a relative 44.9% on the Callhome English dataset.
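To make the idea of a compact textual representation concrete, the following is a minimal sketch, not the format or code used in this work. It assumes word-level ASR output paired with one speaker label per word, and it uses a hypothetical `<spk:N>` tag that is emitted only when the speaker changes. The function names `to_compact_text`, `from_compact_text`, and `simple_wder` are illustrative. The error measure is a simplification that assumes the hypothesis and reference contain the same words in the same order and that speaker labels are already mapped to each other; a full WDER computation would first align the two word sequences.

```python
from typing import List, Tuple

TAG_PREFIX = "<spk:"  # hypothetical tag syntax, chosen for illustration only


def to_compact_text(words: List[str], speakers: List[int]) -> str:
    """Serialize parallel word/speaker lists, tagging only speaker changes."""
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    parts: List[str] = []
    previous = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            parts.append(f"{TAG_PREFIX}{speaker}>")
            previous = speaker
        parts.append(word)
    return " ".join(parts)


def from_compact_text(text: str) -> Tuple[List[str], List[int]]:
    """Parse the compact text back into parallel word/speaker lists."""
    words: List[str] = []
    speakers: List[int] = []
    current = None
    for token in text.split():
        if token.startswith(TAG_PREFIX) and token.endswith(">"):
            current = int(token[len(TAG_PREFIX):-1])
        else:
            if current is None:
                raise ValueError("word found before any speaker tag")
            words.append(token)
            speakers.append(current)
    return words, speakers


def simple_wder(hyp_speakers: List[int], ref_speakers: List[int]) -> float:
    """Fraction of words with a wrong speaker label (simplified; see lead-in)."""
    if len(hyp_speakers) != len(ref_speakers):
        raise ValueError("sequences must be aligned word by word")
    if not ref_speakers:
        return 0.0
    wrong = sum(h != r for h, r in zip(hyp_speakers, ref_speakers))
    return wrong / len(ref_speakers)


# Toy example: the diarizer assigns "you" to the wrong speaker.
words = ["hello", "how", "are", "you", "good", "thanks"]
hyp = [1, 1, 1, 2, 2, 2]
ref = [1, 1, 1, 1, 2, 2]
prompt_text = to_compact_text(words, hyp)
print(prompt_text)
print(simple_wder(hyp, ref))
```

In a pipeline of this shape, `prompt_text` would be placed in the LLM prompt, and the model's completion would be parsed with `from_compact_text` to recover the refined word-level speaker labels.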