Neural speech codecs enable low-bitrate speech communication, yet preserving perceptual quality and intelligibility at ultra-low bitrates (< 1000 bps) remains challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message under tight bitrate constraints. To address this, we propose ContextCodec, a codec that transmits content-focused context features to explicitly guide reconstruction. ContextCodec adopts a dual-branch encoder that decouples acoustic details from content-focused context. The context branch is trained with a CLIP-style contrastive loss that aligns context features with phoneme indices, reducing paralinguistic leakage. During decoding, these features are injected at each decoding stage for explicit guidance. In addition, we introduce a lightweight autoregressive latent refinement module. Experiments show a strong quality-intelligibility trade-off down to 500 bps, with a real-time factor (RTF) of 0.4886 on a typical mobile CPU.
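To make the CLIP-style objective mentioned in the abstract concrete, the following is a minimal sketch of a symmetric contrastive loss between context features and phoneme embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `phoneme_table` lookup, the temperature of 0.07, and the choice to treat the i-th feature and the i-th phoneme as the only positive pair (ignoring repeated phonemes within a batch) are all hypothetical and are not specified in the abstract.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(context_feats, phoneme_ids, phoneme_table, temperature=0.07):
    """Symmetric contrastive loss pairing context_feats[i] with phoneme_ids[i].

    Assumptions (not from the source): phoneme_table maps a phoneme index to
    an embedding vector, and only the diagonal pairs count as positives.
    """
    feats = [l2_normalize(f) for f in context_feats]
    phons = [l2_normalize(phoneme_table[p]) for p in phoneme_ids]
    # Cosine similarity between every feature and every phoneme embedding.
    logits = [
        [sum(a * b for a, b in zip(f, p)) / temperature for p in phons]
        for f in feats
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the feature-to-phoneme and phoneme-to-feature directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    table = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # toy phoneme embeddings
    feats = [[1.0, 0.0], [0.0, 1.0]]        # toy context features
    print(clip_style_loss(feats, [0, 1], table))  # matched pairs: small loss
    print(clip_style_loss(feats, [1, 0], table))  # mismatched pairs: large loss
```

In this toy setup the loss is near zero when each feature points in the direction of its own phoneme embedding and large when the pairing is swapped, which is the pressure that would push context features toward phonetic content and away from speaker or prosody cues.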