Estimating frequency-varying acoustic parameters is essential for enhancing immersive perception in realistic spatial audio creation. In this paper, we propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands using first-order Ambisonics (FOA) speech recordings as inputs. The proposed framework utilises a novel feature named the Spectro-Spatial Covariance Vector (SSCV), which efficiently represents the temporal, spectral, and spatial information of the FOA signal. Our models significantly outperform existing single-channel methods that use only spectral information, reducing estimation errors by more than half for all three acoustic parameters. Additionally, we introduce FOA-Conv3D, a novel backend network that exploits the SSCV feature with a 3D convolutional encoder. FOA-Conv3D outperforms both the convolutional neural network (CNN) and convolutional recurrent neural network (CRNN) backends, achieving lower estimation errors and accounting for a higher proportion of variance (PoV) for all three acoustic parameters.
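To make the SSCV idea concrete, the sketch below computes a per-frame, per-band spatial covariance vector from the four FOA channels. This is a minimal illustration under assumptions of my own (a linear frequency-band split, upper-triangular vectorisation of the 4x4 covariance, and simple frame-wise averaging); it is not the paper's exact SSCV definition, and the function name and parameters are hypothetical.

```python
import numpy as np
from scipy.signal import stft

def sscv_sketch(foa, fs, n_bands=10, n_fft=1024, hop=512):
    """Illustrative spectro-spatial covariance feature for a 4-channel
    FOA signal of shape [4, n_samples]. Band split, averaging, and
    vectorisation are assumptions, not the authors' exact SSCV."""
    # STFT of each FOA channel: X has shape [4, n_freq, n_frames]
    _, _, X = stft(foa, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    n_freq = X.shape[1]
    band_edges = np.linspace(0, n_freq, n_bands + 1, dtype=int)
    iu = np.triu_indices(4)  # upper triangle of the 4x4 covariance

    feats = []
    for t in range(X.shape[2]):                 # temporal dimension
        frame_feats = []
        for b in range(n_bands):                # spectral dimension
            Xb = X[:, band_edges[b]:band_edges[b + 1], t]   # [4, bins]
            # Spatial covariance across the 4 FOA channels: [4, 4]
            C = Xb @ Xb.conj().T / max(Xb.shape[1], 1)
            # Keep real and imaginary parts of the upper triangle
            frame_feats.append(np.concatenate([C[iu].real, C[iu].imag]))
        feats.append(np.stack(frame_feats))     # [n_bands, 20]
    return np.stack(feats)                      # [n_frames, n_bands, 20]
```

A feature tensor of this shape (frames x bands x covariance terms) is naturally suited to a 3D convolutional encoder such as the FOA-Conv3D backend described above, which can learn joint temporal, spectral, and spatial patterns rather than treating each axis independently.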