Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate-distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods, which are used for rate estimation during training and for arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate-distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: https://avery-xu.github.io/ECC-demo/
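
To make the entropy skip idea concrete, the following is a minimal sketch and not the ECC implementation. It assumes a zero-mean discretized Gaussian model for the quantized residuals and a fixed scale threshold; the function names (`symbol_bits`, `skip_mask`, `estimate_rate`, `reconstruct`) and the threshold value are hypothetical and chosen for illustration only. The point it illustrates is that the skip decision depends only on the predicted scales, which the decoder also has, so no skip mask needs to be transmitted.

```python
import math


def gaussian_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def symbol_bits(residual, scale):
    """Bits for an integer residual under a zero-mean discretized Gaussian.

    Assumption: the probability mass of an integer bin is the Gaussian
    mass on [residual - 0.5, residual + 0.5].
    """
    upper = gaussian_cdf((residual + 0.5) / scale)
    lower = gaussian_cdf((residual - 0.5) / scale)
    p = max(upper - lower, 1e-9)  # floor avoids log(0)
    return -math.log2(p)


def skip_mask(scales, threshold):
    """True where the predicted scale is small enough to skip the symbol.

    Computed from scales only, so encoder and decoder derive the same mask.
    """
    return [s < threshold for s in scales]


def estimate_rate(latents, means, scales, threshold=0.0):
    """Estimated bits for the residuals that are actually coded."""
    total = 0.0
    for y, mu, s, skipped in zip(latents, means, scales,
                                 skip_mask(scales, threshold)):
        if skipped:
            continue  # nothing is transmitted for this position
        total += symbol_bits(round(y - mu), s)
    return total


def reconstruct(coded_residuals, means, scales, threshold):
    """Decoder side: skipped positions fall back to the predicted mean."""
    it = iter(coded_residuals)
    return [mu if skipped else mu + next(it)
            for mu, skipped in zip(means, skip_mask(scales, threshold))]
```

With `threshold=0.0` nothing is skipped and `estimate_rate` reduces to a plain likelihood-based rate estimate; raising the threshold trades a small reconstruction error at highly predictable positions for fewer coded symbols.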
