Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
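To make the intuition behind shift-and-sum quantization concrete, the following is a minimal sketch on scalars. It is not the authors' implementation: the uniform quantizer, the shift of a quarter step, and the names `quantize` and `shift_and_sum` are all assumptions chosen for illustration. The abstract states only that quantized results from symmetrically shifted duplicates of value tokens are aggregated.

```python
import math

def quantize(x, step):
    """Uniform quantizer (assumed): snap x to the nearest multiple of step."""
    return math.floor(x / step + 0.5) * step

def shift_and_sum(x, step):
    """Average the quantized results of two symmetrically shifted copies of x.

    The shift of step / 4 is an illustrative assumption, not a value
    taken from the paper.
    """
    shift = step / 4
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))

# Compare worst-case reconstruction error on a dense grid in [-2, 2].
step = 1.0
values = [i / 100 - 2 for i in range(401)]
plain_err = max(abs(quantize(v, step) - v) for v in values)
shifted_err = max(abs(shift_and_sum(v, step) - v) for v in values)
print(f"plain: {plain_err:.3f}, shift-and-sum: {shifted_err:.3f}")
```

Under these assumptions, averaging the two shifted copies behaves like a quantizer with half the step size, so the worst-case error drops from half a step to a quarter of a step, at the cost of quantizing each value twice.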