Interpretability research has shown that self-supervised Spoken Language Models (SLMs) encode a wide variety of features of human speech, ranging from the acoustic, phonetic, phonological, syntactic, and semantic levels to speaker characteristics. The bulk of prior research on phonological representations has focused on segmental features such as phonemes; the encoding of suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet well understood. Tone is a suprasegmental feature present in more than half of the world's languages. This paper analyzes the tone encoding capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages. We further find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies, but they do not follow the same developmental trajectory.