Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency: they often spend capacity on perceptually irrelevant information, such as background noise and recording artifacts, at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec that learns to encode only perceptually important features and to discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further show improved performance and up to 17x faster inference.
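To make the 12.5 tokens-per-second figure concrete, the sketch below computes how many tokens a clip of a given duration would occupy at that rate. It is an illustration only, not part of CleanCodec: the function name `token_count`, the choice to round partial frames up, and the 50 tokens-per-second comparison rate are all assumptions introduced here, not values from the paper.

```python
import math

# Token rate stated in the abstract.
CLEANCODEC_TOKENS_PER_SECOND = 12.5

# Hypothetical comparison rate, chosen only for illustration;
# it is not a figure reported for any specific codec.
HYPOTHETICAL_BASELINE_TOKENS_PER_SECOND = 50.0


def token_count(duration_seconds: float,
                tokens_per_second: float = CLEANCODEC_TOKENS_PER_SECOND) -> int:
    """Return the number of tokens needed to cover a clip.

    Assumes a partial final frame still costs one full token (rounds up).
    """
    if duration_seconds < 0 or tokens_per_second <= 0:
        raise ValueError("duration must be >= 0 and rate must be > 0")
    return math.ceil(duration_seconds * tokens_per_second)


if __name__ == "__main__":
    for duration in (1.0, 10.0, 60.0):
        at_stated_rate = token_count(duration)
        at_baseline = token_count(duration, HYPOTHETICAL_BASELINE_TOKENS_PER_SECOND)
        print(f"{duration:5.1f} s -> {at_stated_rate} tokens at 12.5/s, "
              f"{at_baseline} tokens at the hypothetical 50/s rate")
```

A shorter token sequence means fewer steps for a downstream autoregressive model, but this arithmetic alone does not predict the inference speedup reported above, which depends on the full system.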