Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques constrain the geometric structure of the latent space and make it harder to capture correlations between features, which leads to inefficiency in representation learning, codebook utilization, and token rate. In this paper, we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values. This yields an implicit codebook whose size is defined by the product of grid levels and is comparable to that of conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, achieving low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality. Specifically, in extensive experiments across the speech, audio, and music domains, Q2D2 achieves competitive or superior performance relative to state-of-the-art models on a range of objective and subjective reconstruction metrics. Comprehensive ablation studies further confirm the effectiveness of our design choices.
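To make the idea of quantizing feature pairs to a structured 2D grid concrete, the following is a minimal sketch in plain Python. It is an illustration only, not the Q2D2 implementation: the function names (`make_grid`, `quantize_pair`, `quantize_features`, `implicit_codebook_size`), the lattice basis vectors, the centring of the grid at the origin, the brute-force nearest-point search, and the reading of "product of grid levels" as (levels per pair) raised to the number of pairs are all assumptions made for this example. The learned projection and any gradient estimator used in training are omitted.

```python
import math
from itertools import product

# Hypothetical lattice bases (assumed for illustration): each grid point is
# u * b1 + v * b2 for centred integer-spaced coordinates u and v.
RECTANGULAR = ((1.0, 0.0), (0.0, 1.0))
HEXAGONAL = ((1.0, 0.0), (0.5, math.sqrt(3) / 2))


def make_grid(levels_u, levels_v, basis):
    """Return levels_u * levels_v lattice points, centred at the origin."""
    (b1x, b1y), (b2x, b2y) = basis
    points = []
    for i, j in product(range(levels_u), range(levels_v)):
        u = i - (levels_u - 1) / 2
        v = j - (levels_v - 1) / 2
        points.append((u * b1x + v * b2x, u * b1y + v * b2y))
    return points


def quantize_pair(x, y, grid):
    """Return (index, point) of the grid point nearest to (x, y).

    Brute-force search; adequate for the small grids used here.
    """
    index = min(
        range(len(grid)),
        key=lambda k: (grid[k][0] - x) ** 2 + (grid[k][1] - y) ** 2,
    )
    return index, grid[index]


def quantize_features(features, grid):
    """Split a feature vector into consecutive pairs and quantize each pair."""
    if len(features) % 2 != 0:
        raise ValueError("feature vector must have even length")
    return [
        quantize_pair(features[k], features[k + 1], grid)[0]
        for k in range(0, len(features), 2)
    ]


def implicit_codebook_size(levels_u, levels_v, num_pairs):
    """Number of distinct codes when each pair has its own levels_u x levels_v grid."""
    return (levels_u * levels_v) ** num_pairs


rect_grid = make_grid(3, 3, RECTANGULAR)
hex_grid = make_grid(3, 3, HEXAGONAL)
codes = quantize_features([0.9, -0.2, -1.4, 1.2], rect_grid)
```

In this sketch no codebook is stored or learned: the set of codes is determined entirely by the grid geometry and the number of levels per axis, which is the sense in which the codebook is implicit. Swapping `RECTANGULAR` for `HEXAGONAL` changes only the tiling, not the number of codes.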