This paper presents a method for virtually upmixing steering vectors captured by a spherical microphone array with few channels. This problem has conventionally been addressed by estimating the directions and signals of sound sources from first-order ambisonics (FOA) data and then rendering higher-order ambisonics (HOA) data with a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of the FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings conditioned on the FOA data. Experimental results showed that SIRUP achieved a significant improvement over FOA systems in steering vector upmixing, source localization, and speech denoising.
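To make the two-stage pipeline concrete, the following is a minimal NumPy sketch of the data flow only, not the SIRUP implementation. Everything in it is an illustrative assumption: the random linear map standing in for the VAE encoder, the real-valued stand-ins for steering vectors, the HOA order and latent size, the standard DDPM-style linear noise schedule, and conditioning by simple concatenation. The only fixed facts it relies on are that an order-N ambisonic signal has (N + 1)^2 channels and that, under ACN channel ordering, the first four channels are the order-0 and order-1 components.

```python
import numpy as np


def ambisonic_channels(order: int) -> int:
    """Number of ambisonic channels for a given order: (order + 1) ** 2."""
    return (order + 1) ** 2


def make_alpha_bar(num_steps: int, beta_start: float = 1e-4,
                   beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def forward_noise(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) and return it with the noise that was added."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps


def denoiser_input(zt, foa_cond):
    """Condition on FOA data by concatenation (one possible choice, assumed)."""
    return np.concatenate([zt, foa_cond], axis=-1)


rng = np.random.default_rng(0)
hoa_order, latent_dim = 4, 8  # hypothetical sizes

# Real-valued stand-ins for a batch of 2 HOA steering vectors (25 channels).
hoa = rng.standard_normal((2, ambisonic_channels(hoa_order)))
# Under ACN ordering, the first 4 channels are the FOA components.
foa = hoa[:, :ambisonic_channels(1)]

# Stage 1 (stand-in): a random linear map in place of a trained VAE encoder.
encoder = rng.standard_normal((ambisonic_channels(hoa_order), latent_dim))
z0 = hoa @ encoder

# Stage 2: noise the HOA latent; a denoiser would be trained to predict eps
# from the noisy latent together with the FOA condition.
alpha_bar = make_alpha_bar(1000)
zt, eps = forward_noise(z0, 500, alpha_bar, rng)
x = denoiser_input(zt, foa)
```

In a trained system, the denoiser would map `x` (and the step index) to an estimate of `eps`, and sampling would run the reverse process from noise to a latent that the VAE decoder turns back into HOA data.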