Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where generation cost scales linearly with sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying degradation at low frame rates remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce the quality cliff at 6.25 Hz reported in prior work and evaluate two candidate explanations, phonemic collisions and codebook saturation; neither shows evidence of a fundamental barrier. The cliff is instead caused by a suboptimal training configuration: a fixed clip duration during training yields too few tokens per clip at low frame rates, starving the decoder of inter-token context. Once this is corrected, word error rate (WER) degrades smoothly with phonemic load at frame rates as low as 3.1 Hz and 1.6 Hz, suggesting that the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
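The token-starvation argument comes down to simple arithmetic: the number of tokens in a training clip is the clip duration multiplied by the frame rate. The sketch below illustrates this under stated assumptions. The 2-second clip length is a hypothetical value chosen for illustration, not a setting taken from the experiments, and 3.125 Hz and 1.5625 Hz are assumed to be the exact values behind the rounded 3.1 Hz and 1.6 Hz. The helper `clip_seconds_for_tokens` shows one possible way to hold the token count constant across frame rates; it is not claimed to be the correction actually used.

```python
import math


def tokens_per_clip(clip_seconds: float, frame_rate_hz: float) -> int:
    """Whole codec frames (tokens per codebook) in one training clip."""
    return math.floor(clip_seconds * frame_rate_hz)


def clip_seconds_for_tokens(target_tokens: int, frame_rate_hz: float) -> float:
    """Clip duration needed to obtain `target_tokens` frames at a given rate."""
    return target_tokens / frame_rate_hz


# Hypothetical fixed clip length, for illustration only.
CLIP_SECONDS = 2.0

# 12.5 and 6.25 Hz are named in the text; the last two are the assumed
# exact values behind the rounded "3.1 Hz" and "1.6 Hz".
FRAME_RATES_HZ = [12.5, 6.25, 3.125, 1.5625]

if __name__ == "__main__":
    # Token budget of the highest frame rate, used as the reference context.
    reference_tokens = tokens_per_clip(CLIP_SECONDS, FRAME_RATES_HZ[0])
    for rate in FRAME_RATES_HZ:
        fixed = tokens_per_clip(CLIP_SECONDS, rate)
        scaled = clip_seconds_for_tokens(reference_tokens, rate)
        print(
            f"{rate:>7} Hz: {fixed:>3} tokens in a {CLIP_SECONDS:.0f} s clip; "
            f"{scaled:.1f} s needed for {reference_tokens} tokens"
        )
```

With a fixed clip length, each halving of the frame rate halves the number of tokens the decoder sees per training example, whereas scaling the clip duration inversely with the frame rate keeps that count constant.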