Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates continuously with the degree of acoustic realization of their corresponding phonological features. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic.
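The [d] - [t] + [p] ≈ [b] example can be illustrated with a minimal sketch. It uses synthetic toy vectors rather than real S3M representations, and it is not the code from the repository above: the names `toy_embedding`, `nearest_phone`, and `voicing_axis`, the 16-dimensional space, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # assumed toy dimensionality, far smaller than a real S3M layer

# Assumption: each phone is a place-of-articulation vector plus an optional
# shared voicing direction. This is a toy construction, not S3M output.
place = {name: rng.normal(size=DIM) for name in ("labial", "alveolar")}
voicing_axis = rng.normal(size=DIM)


def toy_embedding(place_name, voiced):
    """Return a synthetic phone vector with a little per-phone noise."""
    noise = 0.05 * rng.normal(size=DIM)
    return place[place_name] + float(voiced) * voicing_axis + noise


phones = {
    "p": toy_embedding("labial", voiced=False),
    "b": toy_embedding("labial", voiced=True),
    "t": toy_embedding("alveolar", voiced=False),
    "d": toy_embedding("alveolar", voiced=True),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_phone(vec):
    """Return the phone whose toy vector is most cosine-similar to vec."""
    return max(phones, key=lambda name: cosine(vec, phones[name]))


# Voicing vector estimated from one minimal pair: [d] - [t].
voicing_vector = phones["d"] - phones["t"]

# Adding the voicing vector to [p] and looking up the nearest phone.
predicted = nearest_phone(phones["p"] + voicing_vector)

# Scaling the voicing vector moves [p] along a continuum; the projection
# onto the unit voicing direction serves as a simple voicing score here.
unit = voicing_vector / np.linalg.norm(voicing_vector)
scales = np.linspace(0.0, 1.0, 5)
voicing_scores = [float((phones["p"] + s * voicing_vector) @ unit) for s in scales]

print(predicted)
print(np.round(voicing_scores, 2))
```

In this toy setup the voicing score increases linearly with the scale by construction. Whether real S3M representations behave the same way is the empirical question the study addresses.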