Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties, yet existing methods often suffer from limited modeling capability and degrade under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex-spectrogram RIRs from reverberant speech. A structure-aware encoder leverages the isolation between early and late reflections to encode the input audio into a robust conditioning representation, from which the diffusion decoder generates diverse and perceptually realistic impulse responses. Gencho integrates as a module into standard speech processing pipelines for acoustic matching. Experiments show that Gencho generates richer RIRs than non-generative baselines while maintaining strong performance on standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
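To make the acoustic-matching use case concrete, the following is a minimal sketch of how an estimated RIR is commonly applied to a dry recording: by convolution. It is not Gencho's implementation. The function name `apply_rir` and the choice to trim the output to the dry signal's length are assumptions made for illustration, and the RIR here is simply taken as a given time-domain array.

```python
import numpy as np


def apply_rir(dry, rir):
    """Convolve a dry signal with a time-domain RIR (hypothetical helper).

    Uses FFT-based linear convolution and trims the result to the length
    of the dry signal, so the reverberant tail past the end is discarded.
    """
    dry = np.asarray(dry, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)

    # Full linear-convolution length, rounded up to a power of two for the FFT.
    n = len(dry) + len(rir) - 1
    n_fft = 1 << (n - 1).bit_length()

    # Multiplication in the frequency domain equals convolution in time.
    spectrum = np.fft.rfft(dry, n_fft) * np.fft.rfft(rir, n_fft)
    wet = np.fft.irfft(spectrum, n_fft)[:n]

    return wet[: len(dry)]


if __name__ == "__main__":
    # Toy example: an impulse followed by a single half-amplitude echo.
    dry = np.array([1.0, 0.0, 0.0, 0.0])
    rir = np.array([1.0, 0.5])
    print(apply_rir(dry, rir))
```

In a matching pipeline, `rir` would be the impulse response estimated from a reference recording, and `dry` the clean speech to be placed in the same acoustic environment.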