Discrete speech tokens obtained from self-supervised learning (SSL) models compress speech data efficiently while maintaining strong performance, and are widely used as intermediate representations in various tasks. However, discretization inevitably loses information, which degrades performance relative to continuous SSL features. In this work, we propose applying soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both automatic speech recognition (ASR) and speech synthesis tasks, and generalizes particularly well to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows that they align more accurately with phonemes than those obtained with conventional hard assignment.
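The contrast between hard and soft token assignment can be illustrated with a minimal sketch. The abstract does not specify how the soft assignment is computed, so everything below is an assumption made for illustration: tokens are taken to come from a k-means-style codebook of centroids, soft weights are a softmax over negative squared distances with a hypothetical `temperature` parameter, and the downstream model is assumed to consume tokens through an embedding table. All function names are hypothetical.

```python
import numpy as np


def squared_distances(features, centroids):
    # features: (T, D) frame-level SSL features; centroids: (K, D) codebook.
    # Returns a (T, K) matrix of squared Euclidean distances.
    diff = features[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)


def hard_assign(features, centroids):
    # Conventional discretization: each frame maps to its nearest centroid.
    return squared_distances(features, centroids).argmin(axis=1)


def soft_assign(features, centroids, temperature=1.0):
    # Assumed soft assignment: softmax over negative squared distances.
    # A smaller temperature pushes the weights toward a one-hot vector.
    logits = -squared_distances(features, centroids) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def embed_hard(token_ids, embedding_table):
    # Training-time path: ordinary embedding lookup on discrete token ids.
    return embedding_table[token_ids]


def embed_soft(weights, embedding_table):
    # Inference-time path: weighted mixture of the same token embeddings.
    return weights @ embedding_table


# Toy data standing in for real SSL features and a trained embedding table.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(5, 8))    # 5 frames, 8-dim features
demo_centroids = rng.normal(size=(4, 8))   # codebook with 4 tokens
demo_table = rng.normal(size=(4, 16))      # 16-dim token embeddings

demo_hard = embed_hard(hard_assign(demo_features, demo_centroids), demo_table)
demo_soft = embed_soft(soft_assign(demo_features, demo_centroids), demo_table)
```

In this sketch the embedding table is shared by both paths, so a model trained on hard token ids can accept the soft mixture at inference without retraining. As the assumed temperature approaches zero, the soft embedding reduces to the hard lookup.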