Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by the loss of paralinguistic information during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the lost information, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that, through multi-layer fusion and acoustic feature integration, discrete tokens can close the performance gap with continuous representations in SER tasks.
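The attention-based multi-layer fusion described above can be illustrated with a minimal sketch: a softmax over one learnable logit per encoder layer produces layer weights, and the fused representation is the weighted sum of per-layer features. This is a hypothetical NumPy illustration of the general technique, not the paper's actual implementation; the function names (`softmax`, `fuse_layers`) and tensor shapes are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(layer_feats, attn_logits):
    """Weighted fusion of per-layer features.

    layer_feats: array of shape (num_layers, T, D) — one (T, D)
                 feature sequence per encoder layer.
    attn_logits: array of shape (num_layers,) — learnable logits,
                 one scalar weight per layer (trained jointly with
                 the downstream SER classifier).
    Returns a fused (T, D) sequence.
    """
    w = softmax(attn_logits)
    # Contract the layer axis: sum_l w[l] * layer_feats[l]
    return np.tensordot(w, layer_feats, axes=1)
```

With uniform logits this reduces to a plain layer average; training the logits lets the model upweight layers that retain more paralinguistic detail after quantization.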