Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques constrain the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization, and token rate. In this paper, we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values. This yields an implicit codebook defined by the product of grid levels, with codebook sizes comparable to those of conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, achieving low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive or superior performance on various objective and subjective reconstruction metrics in extensive experiments across the speech, audio, and music domains, compared with state-of-the-art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
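To make the geometric idea concrete, the following is a minimal sketch of snapping one feature pair to a bounded 2D lattice and reading off its implicit code index. It is an illustration under stated assumptions, not the authors' implementation: the basis vectors in `GRID_BASES`, the 75-degree rhombic angle, the clamping to a fixed number of levels per axis, and the function name `quantize_pair` are all hypothetical, since the abstract does not specify how the grids are parameterized or bounded.

```python
import math
from itertools import product

# Hypothetical basis vectors for three tilings (assumption: unit-length
# generators; the paper's actual grid parameterization may differ).
GRID_BASES = {
    "rectangular": ((1.0, 0.0), (0.0, 1.0)),
    "hexagonal": ((1.0, 0.0), (0.5, math.sqrt(3) / 2)),
    "rhombic": ((1.0, 0.0),
                (math.cos(math.radians(75)), math.sin(math.radians(75)))),
}


def quantize_pair(x, y, basis, levels):
    """Snap (x, y) to a grid point i*a + j*b with 0 <= i < levels[0]
    and 0 <= j < levels[1].

    Returns (quantized_point, code_index), where code_index lies in
    range(levels[0] * levels[1]), i.e. the implicit codebook.
    """
    (ax, ay), (bx, by) = basis
    det = ax * by - ay * bx
    # Fractional lattice coordinates (u, v) with (x, y) = u*a + v*b.
    u = (x * by - y * bx) / det
    v = (ax * y - ay * x) / det

    best = None
    # Check the four corners of the enclosing grid cell, clamped to the
    # allowed levels. For equal-length basis vectors at 60 to 120 degrees
    # this contains the nearest lattice point for inputs inside the grid;
    # for inputs outside it, clamping gives only an approximate choice.
    for di, dj in product((0, 1), repeat=2):
        i = min(max(math.floor(u) + di, 0), levels[0] - 1)
        j = min(max(math.floor(v) + dj, 0), levels[1] - 1)
        px, py = i * ax + j * bx, i * ay + j * by
        dist = (x - px) ** 2 + (y - py) ** 2
        if best is None or dist < best[0]:
            best = (dist, (px, py), i * levels[1] + j)
    return best[1], best[2]


# A 4 x 4 rectangular grid gives an implicit codebook of 16 entries.
print(quantize_pair(1.2, 2.7, GRID_BASES["rectangular"], (4, 4)))
# -> ((1.0, 3.0), 7)
```

No codebook table is stored in this sketch: the code index is computed directly from the two grid coordinates, which is one way to read "an implicit codebook defined by the product of grid levels". A trainable codec would additionally need a gradient path through the rounding step, which is omitted here.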