A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.