In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
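To make the idea of quantisation as a stochastic policy concrete, the following is a minimal sketch and not the paper's implementation. It assumes a categorical policy given by a softmax over negative squared distances to a toy codebook, a reward equal to negative WER, and a plain score-function (REINFORCE) gradient; the function names, the temperature parameter, the codebook, and the example transcripts are all hypothetical. It also checks that the reported drop from 4.64% to 3.55% WER on test-clean is consistent with a relative reduction of about 23%.

```python
import math
import random


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def code_policy(latent, codebook, temperature=1.0):
    """Assumed policy: softmax over negative squared distances to each code."""
    logits = [-sum((x - c) ** 2 for x, c in zip(latent, code)) / temperature
              for code in codebook]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_code(probs, rng):
    """Sample a code index and return it with its log-probability."""
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, math.log(probs[idx])


def reinforce_logit_grad(probs, idx, reward, baseline=0.0):
    """Score-function gradient w.r.t. the logits: (R - b) * (onehot(idx) - p)."""
    adv = reward - baseline
    return [adv * ((1.0 if k == idx else 0.0) - p) for k, p in enumerate(probs)]


def relative_reduction(before, after):
    """Relative reduction of a metric, e.g. WER before and after fine-tuning."""
    return (before - after) / before


if __name__ == "__main__":
    rng = random.Random(0)
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy, hypothetical codebook
    probs = code_policy([0.9, 0.1], codebook)
    idx, logp = sample_code(probs, rng)
    # Hypothetical ASR transcript of the decoded audio; reward is negative WER.
    reward = -wer("the cat sat on the mat", "the cat sit on mat")
    grad = reinforce_logit_grad(probs, idx, reward)
    print("sampled code:", idx, "reward:", round(reward, 3))
    print("relative WER reduction:", round(relative_reduction(4.64, 3.55), 4))
```

In a real system the gradient with respect to the logits would be backpropagated into the encoder by an autodiff framework, while the decoder and ASR model used for scoring stay fixed; the sketch only shows the per-frame quantities involved.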