Estimating a speaker's direction and head orientation from binaural recordings can provide critical information for many real-world applications of emerging 'earable' devices, including smart headphones and AR/VR headsets. However, it requires predicting the mutual head orientations of both the speaker and the listener, which is challenging in practice. This paper presents a system for jointly predicting speaker-listener head orientations by leveraging the inherent directivity of the human voice and the listener's head-related transfer function (HRTF), as perceived by the ear-mounted microphones on the listener. We propose a convolutional neural network model that, given a binaural speech recording, predicts the orientation of both the speaker and the listener with respect to the line joining the two. The system builds on the core observation that the recordings from the left and right ears are differentially affected by the voice directivity as well as the HRTF. We also incorporate the fact that voice is more directional at higher frequencies than at lower frequencies.
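The observation that the two ear signals differ, and differ more at higher frequencies, can be illustrated with a per-frequency interaural level difference (ILD). The following is a minimal sketch and not the model described above: the function names, the probe frequencies, and the attenuation factors applied to the toy right-ear signal are all illustrative assumptions, and the Goertzel recurrence stands in for whatever spectral front end the actual system uses.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Return |X[k]|^2 at the DFT bin nearest freq_hz (Goertzel recurrence)."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def interaural_level_difference(left, right, sample_rate, freqs_hz):
    """ILD in dB (left relative to right) at each probe frequency."""
    eps = 1e-12  # guards against log of zero for silent bands
    return [
        10.0 * math.log10(
            (goertzel_power(left, sample_rate, f) + eps)
            / (goertzel_power(right, sample_rate, f) + eps)
        )
        for f in freqs_hz
    ]


# Toy binaural signal (hypothetical values, not measured data): the right
# ear receives the 4 kHz tone at half amplitude but the 500 Hz tone at 0.9,
# mimicking stronger head shadowing and directivity at high frequencies.
SAMPLE_RATE = 16000
N = 1600
LOW_HZ, HIGH_HZ = 500.0, 4000.0
t = [i / SAMPLE_RATE for i in range(N)]
left = [math.sin(2 * math.pi * LOW_HZ * ti) + math.sin(2 * math.pi * HIGH_HZ * ti)
        for ti in t]
right = [0.9 * math.sin(2 * math.pi * LOW_HZ * ti)
         + 0.5 * math.sin(2 * math.pi * HIGH_HZ * ti)
         for ti in t]

ild = interaural_level_difference(left, right, SAMPLE_RATE, [LOW_HZ, HIGH_HZ])
print(ild)
```

By construction, the ILD at 4 kHz comes out larger than at 500 Hz. A learned model would presumably consume richer left/right time-frequency representations than two probe tones, but the same left-right contrast across frequency is the kind of cue the abstract refers to.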