We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module and Conformer-based masking and ASR modules. These modules are jointly optimized to transcribe a target speaker while ignoring speech from other speakers. For training, we use the Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction loss to encourage the model to better separate the target speaker's spectrogram from the mixture. We obtain a state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report TS-WER for the first time on the WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%), and LibriSpeech3Mix (7.6%) datasets, establishing new benchmarks for TS-ASR. The proposed model will be open-sourced through the NVIDIA NeMo toolkit.
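To illustrate what a scale-invariant spectrogram reconstruction loss can look like, the following is a minimal sketch and not necessarily the formulation used in CONF-TSASR. It assumes an SI-SDR-style definition applied to magnitude spectrograms: the estimate is projected onto the target, and the loss is the negative ratio (in dB) of projected energy to residual energy. The function name `si_spectrogram_loss`, the `eps` parameter, and the plain nested-list representation are all hypothetical choices made for illustration; a practical implementation would operate on batched tensors.

```python
import math


def si_spectrogram_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant signal-to-residual ratio in dB (assumed form).

    `estimate` and `target` are magnitude spectrograms given as nested
    lists of shape [frequency_bins][frames]. Lower values mean the
    estimate is closer to a scaled copy of the target.
    """
    # Flatten both spectrograms into vectors.
    est = [v for row in estimate for v in row]
    tgt = [v for row in target for v in row]
    if len(est) != len(tgt):
        raise ValueError("estimate and target must have the same size")

    # Optimal scaling of the target that best explains the estimate.
    dot_et = sum(e * t for e, t in zip(est, tgt))
    dot_tt = sum(t * t for t in tgt)
    alpha = dot_et / (dot_tt + eps)

    # Energy of the projection and of what is left over.
    projection_energy = alpha * alpha * dot_tt
    residual_energy = sum((e - alpha * t) ** 2 for e, t in zip(est, tgt))

    ratio = (projection_energy + eps) / (residual_energy + eps)
    return -10.0 * math.log10(ratio)


# Toy 2x3 spectrograms (made-up values, for illustration only).
target = [[1.0, 2.0, 0.5], [0.0, 3.0, 1.0]]
close_estimate = [[1.1, 1.9, 0.6], [0.1, 2.9, 1.0]]
poor_estimate = [[2.0, 0.5, 1.5], [1.0, 1.0, 2.0]]
rescaled_estimate = [[3.0 * v for v in row] for row in close_estimate]

loss_close = si_spectrogram_loss(close_estimate, target)
loss_poor = si_spectrogram_loss(poor_estimate, target)
loss_rescaled = si_spectrogram_loss(rescaled_estimate, target)
```

Under this assumed definition, multiplying the estimate by a positive constant leaves the loss essentially unchanged, so the loss penalizes errors in spectral shape rather than in overall level. In a joint training setup such a term would typically be added to the CTC loss with a weighting factor, which is likewise an assumption here.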