We present a neural network for rendering binaural speech from monaural audio given the position and orientation of the source. Most previous work has focused on synthesizing binaural speech by conditioning on position and orientation in the feature space of convolutional neural networks. These synthesis approaches are powerful at estimating the target binaural speech, even for in-the-wild data, but generalize poorly when rendering audio from out-of-distribution domains. To alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture that renders binaural speech in the Fourier space. Specifically, starting from a geometric time delay based on the distance between the source and the receiver, NFS is trained to predict the delays and scales of various early reflections. By design, NFS is efficient in both memory and computational cost, is interpretable, and operates independently of the source domain. Experimental results show that NFS performs comparably to previous studies on the benchmark dataset while using 25 times less memory and 6 times fewer computations.
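As a rough illustration of the two ingredients named above, a geometric time delay and a shift applied in the Fourier space, the following NumPy sketch delays and scales a mono signal for each ear. It is not the NFS architecture: NFS learns the delays and scales of early reflections, whereas this sketch only renders the direct path. The function names, the speed of sound, the ear offset, and the 1/distance gain are all assumptions made for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def geometric_delay(source_pos, ear_pos, speed=SPEED_OF_SOUND):
    """Propagation delay in seconds from the source to one ear."""
    diff = np.asarray(source_pos, dtype=float) - np.asarray(ear_pos, dtype=float)
    return float(np.linalg.norm(diff)) / speed


def fourier_shift(signal, delay_s, sample_rate, scale=1.0):
    """Delay a real signal by multiplying its spectrum with a linear phase.

    The delay may be fractional. The shift is circular, so a real system
    would pad the signal or process it in overlapping frames.
    """
    n = len(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)  # bin frequencies in Hz
    spectrum = np.fft.rfft(signal)
    shifted = spectrum * np.exp(-2j * np.pi * freqs * delay_s)
    return scale * np.fft.irfft(shifted, n=n)


def render_binaural(mono, source_pos, sample_rate, ear_offset=0.09):
    """Direct-path binaural sketch; returns an array of shape (2, len(mono)).

    Hypothetical setup: the listener sits at the origin with ears on the
    x-axis at -ear_offset (left) and +ear_offset (right), in metres.
    """
    ears = (np.array([-ear_offset, 0.0, 0.0]), np.array([ear_offset, 0.0, 0.0]))
    channels = []
    for ear in ears:
        tau = geometric_delay(source_pos, ear)
        distance = tau * SPEED_OF_SOUND
        gain = 1.0 / max(distance, 1e-3)  # assumed 1/distance attenuation
        channels.append(fourier_shift(mono, tau, sample_rate, scale=gain))
    return np.stack(channels)
```

For a source to the listener's right, for example `render_binaural(mono, (1.0, 0.0, 0.0), 48000)`, the right channel arrives earlier than the left. In the setting the abstract describes, a trained network would supply several such delay and scale pairs per ear in place of the single geometric pair used here.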