Multi-talker speech recognition is often addressed with a pipeline that combines automatic speech recognition (ASR) and speaker diarization. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to train an LLM-based system efficiently on limited real recorded data while maintaining high speaker-attribution accuracy. We propose four strategies: (1) a dual-encoder architecture that extracts semantic and speaker features, (2) a feature interleaving format that merges these features into the LLM input, (3) a length-aware speaker ID loss that strengthens diarization capability, and (4) an adaptive threshold strategy for ASR loss computation that mitigates hallucinations caused by overlapping speech. Together, these strategies balance training between the ASR and diarization tasks. Our system outperforms open-source baselines, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
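As a rough illustration of strategy (2), the sketch below alternates semantic and speaker frames along the time axis before they would be passed to the LLM. The abstract does not specify the interleaving granularity, so the frame-by-frame pattern, the assumption that both encoders emit the same number of frames, and the name `interleave_features` are all hypothetical choices made for this example, not the authors' implementation.

```python
from typing import List

Frame = List[float]


def interleave_features(semantic: List[Frame], speaker: List[Frame]) -> List[Frame]:
    """Alternate semantic and speaker frames: [sem_0, spk_0, sem_1, spk_1, ...].

    Assumption (not stated in the abstract): both encoders produce the same
    number of frames, already projected to the LLM embedding dimension.
    """
    if len(semantic) != len(speaker):
        raise ValueError("semantic and speaker sequences must have equal length")
    mixed: List[Frame] = []
    for sem_frame, spk_frame in zip(semantic, speaker):
        mixed.append(sem_frame)  # semantic frame at an even position
        mixed.append(spk_frame)  # matching speaker frame at the next odd position
    return mixed


# Toy inputs: 3 frames of dimension 2 from each (hypothetical) encoder.
semantic = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
speaker = [[1.0, 1.0], [1.1, 1.1], [1.2, 1.2]]
mixed = interleave_features(semantic, speaker)
```

With this layout, the sequence length seen by the LLM doubles, and each speaker frame sits next to the semantic frame for the same time step. Other arrangements, such as chunk-level interleaving, would be equally consistent with the abstract.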