Articulatory representation learning is fundamental to modeling the neural speech production system. Our previous work established a deep-learning paradigm that decomposes articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded in the human speech production mechanism, and their corresponding gestural scores. We continue this line of work by addressing two concerns: (1) the articulators are entangled in the original algorithm, so that some articulators do not exhibit effective movement patterns, which limits the interpretability of both the gestures and the gestural scores; (2) electromagnetic articulography (EMA) data is sparsely sampled from the articulators, which limits the intelligibility of the learned representations. In this work, we propose a novel articulatory representation decomposition algorithm that takes advantage of guided factor analysis to derive articulator-specific factors and factor scores. A neural convolutive matrix factorization algorithm is then applied to the factor scores to derive new gestures and gestural scores. We experiment with a real-time MRI (rtMRI) corpus that captures fine-grained vocal tract contours. Both subjective and objective evaluation results suggest that the proposed system delivers articulatory representations that are intelligible, generalizable, efficient, and interpretable.
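To make the convolutive matrix factorization step concrete, the following is a minimal sketch and not the neural algorithm proposed here. It swaps in the classical non-negative convolutive factorization with multiplicative updates under a Euclidean loss, and it omits the guided factor analysis stage entirely. The function names (`shift`, `reconstruct`, `conv_nmf`), the parameters (`n_gestures`, `win`), and the synthetic input are hypothetical. The input `X` stands in for a matrix of factor scores and is assumed to be non-negative, which real factor scores need not be. In this illustration `W` plays the role of the gestures (short spatiotemporal templates) and `H` the role of the gestural scores (their activations over time).

```python
import numpy as np


def shift(M, t):
    """Shift columns of M right by t (t > 0) or left by -t (t < 0), zero-filling."""
    out = np.zeros_like(M)
    if t == 0:
        out[:] = M
    elif t > 0:
        out[:, t:] = M[:, :-t]
    else:
        out[:, :t] = M[:, -t:]
    return out


def reconstruct(W, H):
    """Convolutive model: X_hat = sum_t W[t] @ shift(H, t)."""
    return sum(W[t] @ shift(H, t) for t in range(W.shape[0]))


def conv_nmf(X, n_gestures, win, n_iter=200, eps=1e-9, seed=0):
    """Classical convolutive NMF (assumption: X >= 0), Euclidean loss.

    W has shape (win, D, n_gestures): one D x K basis slice per time lag.
    H has shape (n_gestures, N): activation of each template over time.
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.random((win, D, n_gestures)) + eps
    H = rng.random((n_gestures, N)) + eps
    errors = []
    for _ in range(n_iter):
        # Update every lag slice of W against the same reconstruction.
        X_hat = reconstruct(W, H)
        for t in range(win):
            Ht = shift(H, t)
            W[t] *= (X @ Ht.T) / (X_hat @ Ht.T + eps)
        # Update H, pooling the contribution of every lag.
        X_hat = reconstruct(W, H)
        num = sum(W[t].T @ shift(X, -t) for t in range(win))
        den = sum(W[t].T @ shift(X_hat, -t) for t in range(win))
        H *= num / (den + eps)
        errors.append(float(np.linalg.norm(X - reconstruct(W, H))))
    return W, H, errors


# Synthetic stand-in for factor scores: 12 channels, 200 frames, 3 templates.
rng = np.random.default_rng(1)
W_true = rng.random((5, 12, 3))
H_true = rng.random((3, 200)) * (rng.random((3, 200)) < 0.1)  # sparse activations
X = reconstruct(W_true, H_true)

W, H, errors = conv_nmf(X, n_gestures=3, win=5)
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

The multiplicative form keeps `W` and `H` non-negative without an explicit projection step, which is one reason it is a common baseline for this kind of decomposition. A neural variant would instead learn the factors with a network trained by gradient descent.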