With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, the multi-channel transformer (MCT) was proposed, demonstrating the transformer's ability to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from the mixed input audio because of inter-speaker interference. To address this limitation, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. Experiments on the SMS-WSJ benchmark show that M2Former outperforms end-to-end systems based on the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering by 9.2%, 14.3%, 24.9%, and 52.2%, respectively, in terms of relative word error rate reduction.