Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. The generator's performance, in particular, directly influences the overall estimation quality. In this context, we explore an alternative generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across the frequency, time, and quantization depth axes. In this way, we address both the standard blind estimation task and the additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system compares favorably with the baselines across various evaluation metrics.
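To make the residual quantization step concrete, the following is a minimal sketch of how a single latent vector (standing in for one time-frequency patch of the RIR) could be turned into a short stack of tokens, one per quantization depth. It is not the authors' implementation: the codebooks here are random rather than learned, and the names `rq_encode`, `rq_decode`, `dim`, `K`, and `depth` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rq_encode(x, codebooks):
    """Greedy residual quantization: pick one codeword index per depth level."""
    residual = x.astype(float).copy()
    tokens = []
    for cb in codebooks:  # each codebook has shape (K, dim)
        # Choose the codeword closest to what is still unexplained.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

def rq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected codewords over depth."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

# Assumed toy setup: random codebooks with a smaller scale at deeper levels.
rng = np.random.default_rng(0)
dim, K, depth = 8, 16, 4
codebooks = [rng.normal(scale=0.5 ** d, size=(K, dim)) for d in range(depth)]

patch = rng.normal(size=dim)          # stand-in for one encoded patch
tokens, residual = rq_encode(patch, codebooks)
recon = rq_decode(tokens, codebooks)
```

Under this view, each patch yields `depth` discrete indices, which is why a token generator would need to account for a quantization depth axis in addition to frequency and time.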