Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs.