Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. In the one-shot setting, FLAC outperforms state-of-the-art eight-shot baselines on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
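The abstract does not spell out the flow-matching objective, so the following is only a minimal sketch of the commonly used linear-path variant, not FLAC's actual implementation. It assumes Gaussian noise as the source distribution, a straight-line path between noise and data, and a mean-squared-error loss on the velocity. The names `flow_matching_pair`, `flow_matching_loss`, and `euler_sample` are hypothetical, and the conditioning on spatial, geometric, and acoustic cues is omitted for brevity.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training batch for linear-path flow matching (assumed variant).

    x1: batch of target samples (e.g. RIR features), shape (batch, dim).
    Returns the interpolated point x_t, the time t, and the target velocity.
    """
    x0 = rng.standard_normal(x1.shape)       # noise sample, same shape as data
    t = rng.uniform(size=(x1.shape[0], 1))   # one time in [0, 1) per example
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight noise-to-data path
    v_target = x1 - x0                       # velocity of that path (constant in t)
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In training, a network (a diffusion transformer in FLAC's case) would predict the velocity from `x_t`, `t`, and the scene context, and `flow_matching_loss` would be minimized. At inference, drawing different noise samples `x0` and integrating with `euler_sample` yields different plausible outputs, which is what makes the approach probabilistic rather than deterministic.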