We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder. To the best of our knowledge, this is the first model that efficiently integrates ASR and speaker identification modules in a multichannel setting. On simulated mixtures of LibriSpeech data, our system achieves relative word error rate (WER) reductions of up to 12% and 16% compared to previously proposed single-channel and multichannel approaches, respectively. Furthermore, we investigate the impact of different input features, including multichannel magnitude and phase information, on ASR performance. Finally, our experiments on the AMI corpus confirm the effectiveness of our system for real-world multichannel meeting transcription.