We present a novel speaker-independent acoustic-to-articulatory inversion (AAI) model that overcomes the limitations of conventional AAI models, which rely on acoustic features derived from restricted datasets. To address these challenges, we leverage representations from a pre-trained self-supervised learning (SSL) model to more effectively estimate the global, local, and kinematic-pattern information in electromagnetic articulography (EMA) signals during inversion. We train our model adversarially and introduce an attention-based multi-duration phoneme discriminator (MDPD) designed to fully capture the intricate relationships among multi-channel articulatory signals. Our method achieves a Pearson correlation coefficient of 0.847, state-of-the-art performance among speaker-independent AAI models. Implementation details and code are available online.
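To make the described pipeline concrete, below is a minimal PyTorch sketch of the two components the abstract names: an SSL-feature-to-EMA regressor and an attention-based discriminator that scores articulatory trajectories at several temporal spans. This is not the authors' implementation; the SSL feature dimension (768, as in a wav2vec 2.0 base encoder), the EMA channel count (12), the hidden sizes, and the window lengths are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class AAIRegressor(nn.Module):
    """Maps frame-level SSL features to multi-channel EMA trajectories.

    Assumes features from a frozen pre-trained SSL encoder
    (e.g., wav2vec 2.0 base, hidden size 768); the recurrent head
    is a generic stand-in for the paper's inversion network.
    """

    def __init__(self, ssl_dim=768, ema_channels=12, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(ssl_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, ema_channels)

    def forward(self, ssl_feats):               # (B, T, ssl_dim)
        h, _ = self.rnn(ssl_feats)
        return self.proj(h)                     # (B, T, ema_channels)


class MultiDurationDiscriminator(nn.Module):
    """Attention-based discriminator judging EMA realism over several
    hypothetical "multi-duration" windows (segment lengths in frames).
    """

    def __init__(self, ema_channels=12, dim=128, durations=(4, 8, 16)):
        super().__init__()
        self.durations = durations
        self.embed = nn.Conv1d(ema_channels, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, ema):                     # (B, T, ema_channels)
        x = self.embed(ema.transpose(1, 2)).transpose(1, 2)  # (B, T, dim)
        scores = []
        for d in self.durations:
            # Pool frames into non-overlapping segments of length d
            # (assumes T >= max duration).
            T = (x.size(1) // d) * d
            seg = x[:, :T].reshape(x.size(0), -1, d, x.size(-1)).mean(2)
            # Self-attention across segments models channel/time structure.
            a, _ = self.attn(seg, seg, seg)
            scores.append(self.score(a.mean(1)))  # one realism score per duration
        return torch.cat(scores, dim=-1)          # (B, num_durations)


# Usage with random stand-ins for frozen SSL features:
ssl_feats = torch.randn(2, 200, 768)
ema_hat = AAIRegressor()(ssl_feats)             # predicted trajectories
d_scores = MultiDurationDiscriminator()(ema_hat)
```

In an adversarial setup along these lines, the discriminator's per-duration scores would feed a standard GAN loss against ground-truth EMA, while the regressor is trained with that adversarial term plus a trajectory reconstruction loss; the specific losses and durations here are assumptions, not the paper's reported configuration.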