This article presents a method for estimating and reconstructing the spatial energy distribution pattern of natural speech, which is crucial for realistic vocal presence in virtual communication settings. The method comprises two stages. In the first stage, speech recorded by a real, static microphone array is used to construct an egocentric virtual array that tracks the speaker's movement over time. This virtual array measures and encodes the high-resolution directivity pattern of the speech signal as it evolves with natural speech and movement. In the second stage, the encoded directivity representation is used to train a machine learning model that estimates the full, dynamic directivity pattern from a limited set of speech signals, such as those recorded by the microphones on a head-mounted display. Our results show that neural networks can accurately estimate the full directivity pattern of natural, unconstrained speech from this limited information. Together with an evaluation of various machine learning models and training paradigms, the proposed method contributes to the development of realistic vocal presence in virtual communication settings.
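The first stage re-expresses fixed microphone positions relative to a moving speaker. The sketch below illustrates only that geometric idea and is not the authors' implementation. The function name `egocentric_direction`, the z-up coordinate convention, and the restriction to head yaw (rotation about the vertical axis) are all assumptions made for illustration. A real system would presumably use a full 3D head orientation.

```python
import math


def egocentric_direction(mic_xyz, head_xyz, head_yaw_rad):
    """Return (azimuth, elevation, distance) of a fixed microphone as seen from the head.

    Assumptions (hypothetical, not taken from the article): z points up,
    the head rotates only about z (yaw), and azimuth 0 is the direction
    the speaker is facing, with positive azimuth to the speaker's left.
    """
    # Vector from the head to the microphone in the room (world) frame.
    dx = mic_xyz[0] - head_xyz[0]
    dy = mic_xyz[1] - head_xyz[1]
    dz = mic_xyz[2] - head_xyz[2]

    # Undo the head yaw: rotate the vector by -yaw about the z axis.
    c, s = math.cos(head_yaw_rad), math.sin(head_yaw_rad)
    ex = c * dx + s * dy
    ey = -s * dx + c * dy

    dist = math.sqrt(ex * ex + ey * ey + dz * dz)
    if dist == 0.0:
        raise ValueError("microphone coincides with the head position")

    azimuth = math.atan2(ey, ex)
    elevation = math.asin(dz / dist)
    return azimuth, elevation, dist
```

Evaluating this for every microphone at every time step, using tracked head poses, gives the time-varying directions at which the static array samples the speaker's radiation pattern. This is one way to picture an array that moves with the speaker.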