Room impulse response (RIR) generation remains a critical challenge in creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: full-band RIR datasets are scarce, and existing models cannot generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses both limitations. It combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz) with a conditional diffusion transformer, based on rectified flow matching, that generates RIRs from natural-language descriptions. Empirical evaluation shows that PromptReverb produces RIRs with better perceptual quality and acoustic accuracy than existing methods, achieving a mean RT60 error of 8.8% versus -37% for widely used baselines and yielding more realistic room-acoustic parameters. The method enables practical applications in virtual reality, architectural acoustics, and audio production, where flexible, high-quality RIR synthesis is essential.
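To make the RT60 error figures concrete, the following is a minimal sketch of how such a metric could be computed. It is not the evaluation code behind the reported numbers. It assumes RT60 is estimated by Schroeder backward integration with a line fit between -5 dB and -25 dB, and that the error is a signed percentage relative to a reference RT60 (the negative baseline figure suggests a signed quantity, but this is an assumption). The function names `estimate_rt60` and `rt60_percent_error` are hypothetical.

```python
import numpy as np


def estimate_rt60(rir, sample_rate, decay_start_db=-5.0, decay_end_db=-25.0):
    """Estimate RT60 in seconds from an impulse response.

    Assumed procedure (not taken from the paper): Schroeder backward
    integration, then a straight-line fit to the decay curve between
    decay_start_db and decay_end_db, extrapolated to 60 dB of decay.
    """
    energy = np.asarray(rir, dtype=np.float64) ** 2
    # Energy decay curve: energy remaining from each sample to the end.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-30)
    # Keep only the part of the curve inside the fitting range.
    mask = (edc_db <= decay_start_db) & (edc_db >= decay_end_db)
    times = np.arange(len(edc_db)) / sample_rate
    slope, _ = np.polyfit(times[mask], edc_db[mask], 1)  # slope in dB/s
    return -60.0 / slope


def rt60_percent_error(generated_rt60, reference_rt60):
    """Signed RT60 error in percent; negative means the decay is too short."""
    return 100.0 * (generated_rt60 - reference_rt60) / reference_rt60


if __name__ == "__main__":
    # Synthetic check: decaying noise whose energy falls 60 dB in 0.5 s.
    sample_rate = 48000
    t = np.arange(sample_rate) / sample_rate
    rng = np.random.default_rng(0)
    rir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / 0.5)
    estimated = estimate_rt60(rir, sample_rate)
    print(f"estimated RT60: {estimated:.3f} s")
    print(f"signed error vs 0.5 s: {rt60_percent_error(estimated, 0.5):.2f} %")
```

Under this signed convention, an error of -37% would mean the generated decay is about a third shorter than the reference, while 8.8% would mean it is slightly longer.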