To build speech processing methods that can handle speech as naturally as humans do, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since it captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages self-supervision to generalize to unseen speakers. Our approach obtains a correlation of 0.784 on an electromagnetic articulography (EMA) dataset, a 12.5\% improvement over the previous state of the art. Additionally, we show the interpretability of these representations by directly comparing the behavior of the estimated representations with speech production behavior. Finally, we propose a resynthesis-based AAI evaluation metric that does not rely on articulatory labels, and we demonstrate its efficacy on an 18-speaker dataset.
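The abstract reports a correlation score between estimated and measured articulatory trajectories but does not spell out how it is computed. The following is a minimal sketch of one common way to score AAI output, assuming a Pearson correlation computed per articulatory channel and then averaged across channels. The function name `mean_channel_correlation`, the `(frames, channels)` array layout, and the use of NumPy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def mean_channel_correlation(pred, target):
    """Average Pearson correlation across articulatory channels.

    Assumed layout (hypothetical): both arrays have shape
    (frames, channels), one column per EMA sensor coordinate.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    # Center each channel over time.
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)

    # Per-channel Pearson correlation: covariance over product of norms.
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))

    # Average the per-channel scores into a single number.
    return float(np.mean(num / den))
```

Because Pearson correlation is invariant to per-channel scale and offset, a score like this rewards trajectories that follow the right shape over time even when their absolute positions differ, which may be one reason it is a popular choice for comparing articulatory estimates across speakers.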