Finite Scalar Quantization (FSQ) offers simplified training but suffers from residual magnitude decay in multi-stage settings, where each subsequent stage receives an exponentially weaker signal. We propose Robust Residual Finite Scalar Quantization (RFSQ), addressing this fundamental limitation through two novel conditioning strategies: learnable scaling factors and invertible layer normalization. Our experiments across audio and image modalities demonstrate RFSQ's effectiveness and generalizability. In audio reconstruction at 24 bits/frame, RFSQ-LayerNorm achieves 3.646 DNSMOS, a 3.6% improvement over state-of-the-art RVQ (3.518). On ImageNet, RFSQ achieves 0.102 L1 loss and 0.100 perceptual loss, with LayerNorm providing a 9.7% L1 improvement and a 17.4% perceptual improvement over unconditioned variants. The LayerNorm strategy consistently outperforms the alternatives by maintaining normalized input statistics across stages, preventing the exponential magnitude decay that limits naive residual approaches. RFSQ combines FSQ's simplicity with multi-stage quantization's representational power, establishing a new standard for neural compression across diverse modalities.
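As a concrete illustration of the mechanism the abstract describes, the sketch below implements multi-stage residual FSQ with invertible layer-normalization conditioning in PyTorch. This is a minimal sketch, not the paper's exact configuration: the function names (`fsq_quantize`, `rfsq_layernorm`), the tanh bounding, the number of levels per channel, and the choice of per-sample, last-dimension statistics are all illustrative assumptions.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
    """One FSQ stage: bound each channel to (-1, 1) with tanh, round onto
    `levels` uniform points, and use a straight-through estimator so
    gradients pass through the rounding."""
    z = torch.tanh(z)
    half = (levels - 1) / 2.0
    z_q = torch.round(z * half) / half
    return z + (z_q - z).detach()  # straight-through gradient

def rfsq_layernorm(x: torch.Tensor, num_stages: int = 3,
                   levels: int = 8, eps: float = 1e-5) -> torch.Tensor:
    """Residual FSQ with invertible layer normalization: each stage
    normalizes its residual (keeping mean/std so the map can be inverted),
    quantizes in the normalized domain, then denormalizes. Every stage's
    quantizer thus sees inputs at comparable scale, instead of the
    exponentially shrinking residuals of a naive cascade."""
    residual = x
    reconstruction = torch.zeros_like(x)
    for _ in range(num_stages):
        mu = residual.mean(dim=-1, keepdim=True)
        sigma = residual.std(dim=-1, keepdim=True) + eps
        q = fsq_quantize((residual - mu) / sigma, levels)
        q = q * sigma + mu              # invert the normalization
        reconstruction = reconstruction + q
        residual = residual - q
    return reconstruction

# Example: quantize a batch of 16-dimensional latents in three stages.
x = torch.randn(4, 16)
x_hat = rfsq_layernorm(x)
print((x - x_hat).abs().mean())  # residual error after three stages
```

The point of the normalize-quantize-denormalize pattern is that the quantizer's effective input range stays roughly fixed at every stage; because the per-stage mean and standard deviation are reapplied after quantization, the conditioning itself introduces no information loss beyond the rounding step.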