In recent years, there has been growing interest in representing speech with discrete tokens, which serve as pseudo-text for speech language models (speechLMs) and as efficient intermediate representations for downstream tasks. These tokens are typically divided into acoustic tokens and phonetic tokens: the former retain detailed acoustic information for reconstruction, while the latter mainly capture linguistic content. In human speech communication, however, unnecessary acoustic details such as speaker information are abstracted away, while both linguistic and prosodic information are used for speech comprehension and production. Given this, neither type of token seems an ideal representation for prosody-sensitive applications such as speechLMs. In this study, we propose the Phonological Tokenizer, a method that fine-tunes phonetic tokens via differentiable k-means with a multi-task objective of automatic speech recognition (ASR) and speech resynthesis. Experiments on diverse tasks confirm that our tokens retain phonological (both linguistic and prosodic) information while appropriately discarding speaker identity.
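As a rough illustration of the idea of differentiable k-means with a multi-task objective, the sketch below shows one common relaxation: each frame is softly assigned to the centroids through a softmax over negative squared distances, and the two task losses are combined as a weighted sum. This is a minimal sketch under stated assumptions, not the formulation used in this study. The function names (`soft_assign`, `quantize`, `multitask_loss`), the `temperature` parameter, and the weight `alpha` are all hypothetical, and only the forward pass is shown; in practice an autodiff framework would propagate gradients through the soft weights.

```python
import math


def soft_assign(frame, centroids, temperature=1.0):
    """Softmax over negative squared distances to each centroid (assumed relaxation)."""
    logits = [
        -sum((x - c) ** 2 for x, c in zip(frame, centroid)) / temperature
        for centroid in centroids
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(frame, centroids, temperature=1.0):
    """Return the hard token index, the soft weights, and the soft-quantized vector."""
    weights = soft_assign(frame, centroids, temperature)
    token = max(range(len(weights)), key=weights.__getitem__)
    soft_vector = [
        sum(w * centroid[d] for w, centroid in zip(weights, centroids))
        for d in range(len(frame))
    ]
    return token, weights, soft_vector


def multitask_loss(asr_loss, resynthesis_loss, alpha=0.5):
    """Hypothetical weighted sum of the ASR loss and the resynthesis loss."""
    return alpha * asr_loss + (1.0 - alpha) * resynthesis_loss


# Toy example: two centroids in a 2-dimensional feature space.
centroids = [[0.0, 0.0], [1.0, 1.0]]
token, weights, soft_vector = quantize([0.9, 1.0], centroids, temperature=0.5)
```

Lowering `temperature` pushes the soft weights toward a one-hot assignment, so the soft-quantized vector approaches the nearest centroid while the assignment stays differentiable with respect to both the frame and the centroids.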