Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.