State-of-the-art spoken dialogue models (Défossez et al. 2024; Schalkwyk et al. 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vectorial latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole et al. (2023) to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns. Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy (TATA): 0.31) and the five clusters of these tunes that are robust in human speech production and perception (TATA: 0.45). Greater accuracy (TATAs: 0.74-0.89) is attained for binary distinctions between classes of rising vs. falling tunes, respectively used for questions and assertions. Information about tunes is spread among all codebooks, which calls into question a distinction between 'semantic' and 'acoustic' codebooks found in the literature. Accuracies improve with nonlinear probes, but discrimination among the five clusters remains far from human performance, suggesting a fundamental limitation of current codecs.
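To make the probing setup concrete, here is a minimal sketch of a linear (softmax) probe of the kind the abstract describes, assuming only NumPy. It is not the authors' implementation: the latents are synthetic Gaussian vectors standing in for codec representations, and the function names (`train_linear_probe`, `probe_accuracy`, `make_synthetic_latents`), the latent dimension, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def train_linear_probe(X, y, n_classes, lr=0.1, epochs=300, l2=1e-3):
    """Fit a multinomial logistic-regression (softmax) probe by gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets, shape (n, n_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)  # softmax probabilities
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G + l2 * W)
        b -= lr * G.sum(axis=0)
    return W, b


def probe_accuracy(W, b, X, y):
    """Fraction of examples whose arg-max class matches the label."""
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))


def make_synthetic_latents(n_per_class=100, n_classes=8, dim=32, seed=0):
    """ASSUMPTION: Gaussian clusters stand in for real codec latents."""
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 1.0, size=(n_classes, dim))
    X = np.concatenate(
        [m + rng.normal(0.0, 1.0, size=(n_per_class, dim)) for m in means]
    )
    y = np.repeat(np.arange(n_classes), n_per_class)
    perm = rng.permutation(len(y))
    return X[perm], y[perm]


if __name__ == "__main__":
    X, y = make_synthetic_latents()
    split = int(0.8 * len(y))  # simple 80/20 train/test split
    W, b = train_linear_probe(X[:split], y[:split], n_classes=8)
    acc = probe_accuracy(W, b, X[split:], y[split:])
    print(f"test accuracy: {acc:.2f} (chance, 8 balanced classes: {1 / 8:.3f})")
```

The comparison against chance is the point of the exercise: with eight balanced classes, chance is 0.125, so a reported accuracy of 0.31 is above chance yet far from reliable classification. For the five clusters, the chance level depends on how many tunes fall into each cluster, which the abstract does not specify.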