Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity. As a result, emotion-relevant cues are easily discarded during quantization, which limits the affective capacity of downstream models. We trace this degradation to two mechanisms: reconstruction-driven bit allocation under a limited bitrate, and cross-stream leakage in concatenation-based codecs, where acoustic gradients can overwrite dimensions nominally reserved for emotion. We propose AffectCodec, an emotion-preserving neural speech codec built on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By imposing block-diagonal input and output projections over the emotion and acoustic subspaces, BD-RFSQ makes bit allocation explicit and structurally guaranteed rather than implicit and loss-driven, while preserving a flat token interface for downstream speech language models. AffectCodec further combines this structurally constrained quantizer with multi-granularity emotion conditioning and multi-rate training, enabling robust affect preservation at low bitrates. Experiments across multiple emotional speech benchmarks show that AffectCodec substantially improves emotion preservation, especially in the low-bitrate regime, while maintaining competitive acoustic quality and intelligibility. These results suggest that structurally protected quantization is an effective principle for preserving emotion-relevant information and may provide a general route toward attribute-aware neural speech compression.
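
To make the block-diagonal idea concrete, the following is a minimal NumPy sketch, not the AffectCodec implementation. The dimensions, the number of levels and stages, the clip-and-round quantizer, and all names (`block_diag`, `fsq_round`, `residual_fsq`, `encode_decode`) are illustrative assumptions that do not come from the paper. It only shows that, when the input and output projections are block-diagonal, perturbing the acoustic part of the input leaves the emotion block of the quantized latent and of the output unchanged.

```python
import numpy as np

def block_diag(a, b):
    """Place matrices a and b on the diagonal; off-diagonal blocks are zero."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

def fsq_round(z, levels):
    """Scalar-quantize each dimension to `levels` uniform values in [-1, 1].
    Assumes an odd number of levels so that 0 is a valid code."""
    half = (levels - 1) / 2.0
    return np.round(np.clip(z, -1.0, 1.0) * half) / half

def residual_fsq(z, levels, num_stages):
    """Residual quantization: each stage quantizes what the previous stages missed."""
    out = np.zeros_like(z)
    residual = z
    scale = 1.0
    for _ in range(num_stages):
        q = fsq_round(residual / scale, levels) * scale
        out = out + q
        residual = residual - q
        scale /= (levels - 1)  # in-range residuals shrink by this factor per stage
    return out

# Hypothetical sizes: input feature dims and latent dims per subspace.
D_EMO, D_AC = 4, 12
K_EMO, K_AC = 2, 6

rng = np.random.default_rng(0)
W_in = block_diag(rng.normal(size=(K_EMO, D_EMO)), rng.normal(size=(K_AC, D_AC)))
W_out = block_diag(rng.normal(size=(D_EMO, K_EMO)), rng.normal(size=(D_AC, K_AC)))

def encode_decode(x):
    """Project, quantize, and project back; returns (quantized latent, output)."""
    z_q = residual_fsq(W_in @ x, levels=5, num_stages=2)
    return z_q, W_out @ z_q

x = rng.normal(size=D_EMO + D_AC)
x_perturbed = x.copy()
x_perturbed[D_EMO:] += rng.normal(size=D_AC)  # change only the acoustic part

zq_a, y_a = encode_decode(x)
zq_b, y_b = encode_decode(x_perturbed)

# The emotion block is untouched by the acoustic perturbation.
emotion_latent_unchanged = np.allclose(zq_a[:K_EMO], zq_b[:K_EMO])
emotion_output_unchanged = np.allclose(y_a[:D_EMO], y_b[:D_EMO])
```

Under these assumptions the per-frame budget of each subspace is fixed by its latent width rather than by the loss: `K_EMO` dimensions with 5 levels and 2 stages always receive `2 * K_EMO * log2(5)` bits, regardless of how the reconstruction objective weights the acoustic block.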