Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing words and attributing them to individual speakers, particularly in overlapping speech. Recent advances have driven a shift from cascaded systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches to multi-speaker ASR, covering recent advances and a comparative analysis. Specifically, we analyze: (1) the two architectural paradigms for pre-segmented audio, single-input multiple-output (SIMO) and single-input single-output (SISO), along with their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements built on these two paradigms; and (3) extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching. We further (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions toward building robust and scalable multi-speaker ASR systems.