Modeling room acoustics in a field setting involves some degree of blind parameter estimation from noisy and reverberant audio. Modern approaches leverage convolutional neural networks (CNNs) in tandem with time-frequency representations. Spectrogram-like features derived from the short-time Fourier transform (STFT) have shown promising results, but using only their magnitude implicitly discards a significant amount of audio information carried in the phase. Inspired by recent work in speech enhancement, we propose novel phase-related features that extend recent approaches to blind estimation of the so-called "reverberation fingerprint" parameters, namely room volume and RT60 (the time for sound to decay by 60 dB). Models that include these features are shown to outperform existing methods that rely solely on magnitude-based spectral features across a wide range of acoustic spaces. We evaluate the effectiveness of these features in both single-parameter and multi-parameter estimation strategies, using a novel dataset that consists of publicly available room impulse responses (RIRs), synthesized RIRs, and in-house measurements of real acoustic spaces.
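To make the point about discarded phase concrete, the following is a minimal sketch of how an STFT can be split into a magnitude feature and phase-related features. It is an illustration under stated assumptions, not the feature set used in this work: the abstract does not specify the phase features, so the cosine/sine encoding of the phase, the function names `stft` and `magnitude_and_phase_features`, the frame parameters, and the white-noise input are all hypothetical choices made for the example.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)


def magnitude_and_phase_features(x, n_fft=512, hop=128):
    """Stack log-magnitude with a cos/sin encoding of the STFT phase.

    Assumption: the cos/sin encoding is only one possible phase-related
    feature, chosen here because it avoids the 2*pi wrap discontinuity.
    """
    spec = stft(x, n_fft, hop)
    log_mag = np.log(np.abs(spec) + 1e-8)  # what a magnitude-only model sees
    phase = np.angle(spec)                 # what a magnitude-only model discards
    return np.stack([log_mag, np.cos(phase), np.sin(phase)], axis=0)


# Placeholder input: 1 s of white noise at 16 kHz (not real reverberant audio).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
feats = magnitude_and_phase_features(x)
print(feats.shape)  # (3, 257, 122): channels x frequency bins x frames
```

The resulting array has the channel-first layout that a CNN typically consumes, so a magnitude-only baseline corresponds to keeping only the first channel.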