Self-supervised learning (SSL) models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have largely remained black boxes, but many recent studies have begun "probing" models such as HuBERT to correlate their internal representations with different aspects of speech. In this paper, we show that "inference of articulatory kinematics" is a fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction largely overlaps across the languages of the data used to train the models, with a preference for languages with similar phonological systems. Furthermore, we show that, with simple affine transformations, acoustic-to-articulatory inversion (AAI) is transferable across speakers, even across genders, languages, and dialects, demonstrating the generalizability of this property. Together, these results shed new light on the internals of SSL models that are critical to their superior performance, and open up new avenues toward language-agnostic universal models for speech engineering that are interpretable and grounded in speech science.
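The cross-speaker transfer claim rests on an affine map between articulatory spaces, which can be illustrated with a minimal sketch. The abstract does not say how the transformation is estimated, so ordinary least squares is an assumption here, as are the helper names `fit_affine` and `apply_affine`, the choice of 12 articulatory channels, and the synthetic data standing in for real articulatory trajectories.

```python
import numpy as np


def fit_affine(X, Y):
    """Least-squares fit of Y ~ X @ W + b (hypothetical helper, not from the paper)."""
    # Append a column of ones so the bias is estimated jointly with W.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]


def apply_affine(X, W, b):
    """Map trajectories from the source space into the target space."""
    return X @ W + b


# Synthetic stand-in data: 500 frames of 12 articulatory channels (assumed sizes).
rng = np.random.default_rng(0)
n_frames, n_channels = 500, 12
src = rng.standard_normal((n_frames, n_channels))  # "speaker A" trajectories

# "Speaker B" is constructed as an affine image of speaker A plus small noise.
true_W = rng.standard_normal((n_channels, n_channels))
true_b = rng.standard_normal(n_channels)
tgt = src @ true_W + true_b + 0.01 * rng.standard_normal((n_frames, n_channels))

W, b = fit_affine(src, tgt)
pred = apply_affine(src, W, b)

# Mean per-channel Pearson correlation, a common score for inversion quality.
corr = np.mean(
    [np.corrcoef(pred[:, k], tgt[:, k])[0, 1] for k in range(n_channels)]
)
```

Because the target here is affine by construction, the fit recovers it almost exactly. On real data, the map would be fit on one set of utterances and scored on held-out ones, and the correlation would indicate how much of the speaker difference an affine map can absorb.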