Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant advances in diarization methods, diarization accuracy remains a challenge. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations, and their ability to improve diarization accuracy was measured on a holdout dataset. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, performance gains are constrained to transcripts produced by the same ASR tool used to generate the fine-tuning data, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned on transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable, ASR-agnostic approach may be achievable. We hope to make these models accessible through public-facing APIs for use by third-party applications.
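The weight-combination step described above can be illustrated with a minimal sketch. This is not the paper's actual implementation; it assumes the simplest combination scheme, element-wise averaging of parameters across three fine-tuned models with identical architectures, and represents each model's parameters as a plain dict of name to list of floats (a stand-in for real tensors). All names here are hypothetical.

```python
def average_weights(state_dicts):
    """Element-wise average of parameter values across models.

    Assumes every state dict has the same parameter names and shapes,
    as when all models are fine-tuned from the same base LLM.
    """
    if not state_dicts:
        raise ValueError("need at least one state dict")
    merged = {}
    for key in state_dicts[0]:
        values = [sd[key] for sd in state_dicts]
        merged[key] = [sum(group) / len(values) for group in zip(*values)]
    return merged

# Toy example: three "models", each fine-tuned on transcripts from a
# different ASR tool, reduced here to two named parameters each.
model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [3.0]}
model_c = {"w": [5.0, 6.0], "b": [6.0]}

ensemble = average_weights([model_a, model_b, model_c])
print(ensemble)  # {'w': [3.0, 4.0], 'b': [3.0]}
```

The resulting merged weights define a single model, so the ensemble adds no inference-time cost relative to any one of the ASR-specific models.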