Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. We measured the models' ability to improve diarization accuracy on a holdout dataset from the Fisher corpus as well as on an independent dataset. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced by the same ASR tool that generated the fine-tuning transcripts, limiting generalizability. To address this constraint, we developed an ensemble model by combining the weights of three separate models, each fine-tuned on transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable, ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.