Higher-Order Ambisonics (HOA) encoding from sparse, irregular microphone arrays remains a critical challenge for consumer spatial audio capture in immersive communication and XR. We propose Flow-HOA, a generative framework that jointly optimizes time-domain, spectral, and spatial fidelity while producing a deployable, time-invariant bank of Finite Impulse Response (FIR) encoding filters. Using conditional flow matching, the model learns to map a simple prior distribution to the target distribution of FIR filter coefficients. Training is guided by a composite loss that balances time-domain waveform fidelity, multi-resolution spectral consistency, sub-band energy preservation, and spatial directivity constraints. Objective evaluations on simulated data show improved performance over strong model-based baselines on both signal fidelity and spatial accuracy metrics. Subjective listening tests on real microphone array recordings further confirm that Flow-HOA yields higher overall sound quality with fewer artifacts, indicating that the model generalizes from synthetic training data to real-world capture conditions.
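To make the deployment side of the abstract concrete, the following is a minimal sketch of how a time-invariant FIR filter bank could map microphone signals to HOA channels: each output channel is the sum, over microphones, of that microphone's signal convolved with its own filter. The function names `fir_filter` and `encode_hoa`, the `filters[c][m]` layout, and the toy coefficients are illustrative assumptions and are not taken from Flow-HOA; in the proposed framework the coefficients would come from the trained generative model.

```python
from typing import List

Signal = List[float]


def fir_filter(x: Signal, h: Signal) -> Signal:
    """Causal FIR filtering, y[n] = sum_k h[k] * x[n - k], truncated to len(x)."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = 0.0
        for k, tap in enumerate(h):
            if n - k < 0:
                break  # samples before the start of the signal are treated as zero
            acc += tap * x[n - k]
        y[n] = acc
    return y


def encode_hoa(mics: List[Signal], filters: List[List[Signal]]) -> List[Signal]:
    """Apply a fixed filter bank; filters[c][m] maps microphone m to HOA channel c.

    The layout is a hypothetical convention chosen for this sketch.
    """
    num_samples = len(mics[0])
    hoa = []
    for per_mic_filters in filters:
        channel = [0.0] * num_samples
        for x, h in zip(mics, per_mic_filters):
            for i, value in enumerate(fir_filter(x, h)):
                channel[i] += value
        hoa.append(channel)
    return hoa


# Toy example: 2 microphones, 2 output channels, made-up coefficients.
mics = [
    [1.0, 0.0, 0.0, 0.0],  # microphone 0: unit impulse at n = 0
    [0.0, 1.0, 0.0, 0.0],  # microphone 1: unit impulse at n = 1
]
filters = [
    [[1.0], [1.0]],        # channel 0: plain sum of both microphones
    [[0.0, 1.0], [0.5]],   # channel 1: mic 0 delayed by one sample + 0.5 * mic 1
]
hoa = encode_hoa(mics, filters)
```

Because the filters are fixed after training, run-time encoding of this kind reduces to convolutions and sums. A practical implementation would more likely use FFT-based or partitioned convolution than the direct loops shown here.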