Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
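As a rough illustration of the idea of fitting a tube model's frequency response by gradient descent, the following minimal sketch optimizes the log-areas of a piecewise-cylindrical tube so that its all-pole log-magnitude response matches a target. It is not the paper's implementation: the function names, the fixed lip reflection value, the sign convention of the reflection coefficients, and the number of sections are all assumptions made for the example. Central finite differences stand in for the backpropagated gradients described above, and the target is synthesized from the same toy model rather than obtained by inverse filtering a recording.

```python
import numpy as np

LIP_REFLECTION = 0.7  # hypothetical fixed termination coefficient (assumption)


def area_to_polynomial(log_areas):
    """Map tube log-areas to the denominator A(z) of an all-pole filter."""
    areas = np.exp(log_areas)  # log parameterization keeps areas positive
    # Junction reflection coefficients; the sign convention is an assumption.
    k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
    k = np.append(k, LIP_REFLECTION)
    a = np.array([1.0])
    for km in k:  # step-up recursion: reflection coefficients -> polynomial
        a_ext = np.append(a, 0.0)
        a = a_ext + km * a_ext[::-1]
    return a


def log_magnitude(log_areas, n_freqs=128):
    """Log-magnitude response of 1 / A(e^{jw}) on a grid over [0, pi]."""
    a = area_to_polynomial(log_areas)
    omega = np.linspace(0.0, np.pi, n_freqs)
    A = np.exp(-1j * np.outer(omega, np.arange(len(a)))) @ a
    return -np.log(np.abs(A))


def spectral_loss(log_areas, target_logmag):
    """Mean squared error between model and target log-magnitude spectra."""
    diff = log_magnitude(log_areas, len(target_logmag)) - target_logmag
    return np.mean(diff ** 2)


def numerical_gradient(f, x, eps=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


def fit_areas(target_logmag, n_sections=8, n_iters=300, lr=1.0):
    """Gradient descent on log-areas; the step is halved when it fails to help."""
    x = np.zeros(n_sections)  # start from a uniform tube
    f = lambda v: spectral_loss(v, target_logmag)
    history = [f(x)]
    for _ in range(n_iters):
        candidate = x - lr * numerical_gradient(f, x)
        candidate_loss = f(candidate)
        if candidate_loss < history[-1]:
            x = candidate
            history.append(candidate_loss)
        else:
            lr *= 0.5
    return x, history


# Synthetic target produced by the same toy model (made-up area values).
true_areas = np.array([0.6, 0.9, 1.5, 2.5, 3.0, 2.0, 1.2, 1.8])
target = log_magnitude(np.log(true_areas))
estimate, history = fit_areas(target)
print("initial loss:", history[0], "final loss:", history[-1])
```

Because the reflection coefficients depend only on ratios of adjacent areas, this toy model can recover the tube shape only up to an overall scale; a fuller model would add further constraints, for example by parameterizing the area function through articulatory controls.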