Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing words and attributing them to individual speakers, particularly in overlapping speech. Recent advances have driven a shift from cascaded systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches to multi-speaker ASR, with a comparative analysis of recent advances. Specifically, we (1) examine the two architectural paradigms for pre-segmented audio, single-input multiple-output (SIMO) and single-input single-output (SISO), and their distinct characteristics and trade-offs; (2) review recent architectural and algorithmic improvements built on these two paradigms; (3) cover extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching; and (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions toward robust and scalable multi-speaker ASR.