We propose a novel approach to blind room impulse response (RIR) estimation in the context of a downstream application, far-field automatic speech recognition (ASR). We first establish the connection between improved RIR estimation and improved ASR performance, which provides a means of evaluating neural RIR estimators. We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features. The model is trained with a novel energy decay relief loss that optimizes for capturing the energy-based properties of the input reverberant speech. We show that our model outperforms state-of-the-art baselines on acoustic benchmarks (by 17% on the energy decay relief and 22% on an early-reflection energy metric) as well as on an ASR evaluation task (by 6.9% in word error rate).
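
To make the energy decay relief (EDR) concrete, the following is a minimal sketch of how an EDR and an EDR-based distance between two RIRs could be computed. It assumes the common definition of the EDR as the backward-integrated short-time spectral energy, and a mean absolute difference in decibels as the distance. The abstract does not specify the exact loss used in the model, so the function names `energy_decay_relief` and `edr_loss`, the frame and hop sizes, and the log-domain L1 distance are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def energy_decay_relief(rir, frame_len=256, hop=64):
    """Backward-integrated STFT energy, shape (num_frames, num_bins).

    frame_len and hop are illustrative choices, not values from the paper.
    """
    rir = np.asarray(rir, dtype=np.float64)
    if len(rir) < frame_len:
        # Zero-pad very short responses so at least one frame exists.
        rir = np.pad(rir, (0, frame_len - len(rir)))
    num_frames = 1 + (len(rir) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [rir[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    # Per-frame, per-frequency energy of the windowed signal.
    energy = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Backward integration: energy remaining from each frame to the end.
    return np.cumsum(energy[::-1], axis=0)[::-1]


def edr_loss(rir_est, rir_ref, eps=1e-10):
    """Mean absolute EDR difference in dB (assumed form of the loss).

    Both RIRs are expected to have the same length.
    """
    edr_est = 10.0 * np.log10(energy_decay_relief(rir_est) + eps)
    edr_ref = 10.0 * np.log10(energy_decay_relief(rir_ref) + eps)
    return float(np.mean(np.abs(edr_est - edr_ref)))


# Synthetic example: exponentially decaying noise as a stand-in for an RIR.
rng = np.random.default_rng(0)
t = np.arange(4000) / 16000.0
rir = rng.standard_normal(4000) * np.exp(-t / 0.05)
print(edr_loss(rir, rir))        # identical responses
print(edr_loss(rir, 0.5 * rir))  # same shape, lower energy
```

In a training setting the same computation would be written in a differentiable framework so that gradients flow back to the RIR generator; NumPy is used here only to keep the sketch self-contained.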