Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, typically by optimizing a synthetic dataset in its place. Although several DFQ methods have been proposed for vision transformer (ViT) architectures, they remain ineffective in low-bit settings. Examining these methods, we identify that their synthetic data produce attention maps that are misaligned across heads, whereas those of real samples are highly aligned. Building on this observation, we find that aligning the attention maps of synthetic data improves the overall performance of quantized ViTs. Motivated by this finding, we devise \aname, a novel DFQ method designed for ViTs that focuses on inter-head attention similarity. First, we generate synthetic data by aligning head-wise attention responses with respect to spatial query patches. Then, we apply head-wise structural attention distillation to align the attention maps of the quantized network with those of the full-precision teacher. Experimental results show that the proposed method significantly outperforms baselines, setting a new state of the art for data-free ViT quantization.
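To make the two steps above concrete, the following is a minimal sketch (not the paper's exact objectives), assuming softmax attention maps of shape (batch, heads, queries, keys) taken from a single ViT layer. All function names are hypothetical, and a plain per-head KL term stands in for the paper's structural distillation loss.

```python
# Hypothetical sketch of (1) an inter-head attention-alignment loss used
# while optimizing synthetic images, and (2) a per-head attention
# distillation loss between the quantized student and FP teacher.
import torch
import torch.nn.functional as F


def inter_head_alignment_loss(attn: torch.Tensor) -> torch.Tensor:
    """Pull per-query attention rows toward agreement across heads.

    attn: (B, H, N, N) softmax attention maps from one ViT layer.
    Returns 1 minus the mean pairwise cosine similarity between heads,
    so minimizing it aligns head-wise responses for each query patch.
    """
    B, H, N, _ = attn.shape
    rows = F.normalize(attn, dim=-1)                       # unit-norm per-query rows
    sim = torch.einsum("bhnk,bgnk->bhg", rows, rows) / N   # (B, H, H) head similarities
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(-1)
    return (1.0 - off_diag / (H * (H - 1))).mean()


def headwise_attention_distillation(
    attn_q: torch.Tensor, attn_fp: torch.Tensor, eps: float = 1e-8
) -> torch.Tensor:
    """Match the quantized network's per-head attention to the teacher's.

    attn_q, attn_fp: (B, H, N, N) attention maps of the quantized student
    and full-precision teacher. A plain KL divergence is used here as a
    stand-in for the paper's structural distillation term.
    """
    return F.kl_div((attn_q + eps).log(), attn_fp, reduction="batchmean")
```

In this reading, the first loss shapes the synthetic data so that head-wise attention agrees per query patch, while the second transfers the teacher's attention structure to the quantized network during fine-tuning.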