Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, how to assess the quality of the synthesized articulation remains poorly defined. Ranking generative models is generally difficult because judgments are subjective, and articulatory synthesis adds the further difficulty of requiring specialized knowledge of vocal tract anatomy and acoustics. To address this problem, this paper proposes evaluating speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features captures nuances in phoneme production, such as correct places of articulation, that traditional metrics (e.g., point-wise distance metrics) miss. We train a neural network on acoustic and articulatory features extracted from a single-speaker real-time MRI (RT-MRI) dataset. We then compare recognition performance when the model is tested with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps explore additional dimensions of speech articulation synthesis.
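As a rough illustration of the proxy evaluation described above, the sketch below scores hypothetical recognizer outputs for two synthetic articulatory feature sets against a reference phoneme sequence. It assumes that recognition performance is summarized as a phoneme error rate (PER) based on edit distance; the abstract does not name the metric, and the function names, system labels, and toy sequences are all illustrative rather than taken from the paper.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Total edits divided by total reference length (assumed metric)."""
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference sequences are empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_ref


# Hypothetical toy data: the reference phonemes and what a recognizer
# might output when fed two different synthetic articulatory feature sets.
references = [["b", "a", "t"]]
recognized: Dict[str, List[List[str]]] = {
    "synth_faithful": [["b", "a", "t"]],
    "synth_wrong_place": [["d", "a", "t"]],  # bilabial realized as alveolar
}

scores = {name: phoneme_error_rate(references, hyps)
          for name, hyps in recognized.items()}
for name, per in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: PER = {per:.3f}")
```

Under this assumption, a synthesizer that places the constriction at the wrong location is penalized through the recognized phoneme, even when its point-wise distance to the reference contour is small.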