We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We propose self-distilled quantization (SDQ), a method that minimizes accumulated quantization errors and outperforms baselines. We apply SDQ to the multilingual models XLM-R-Base and InfoXLM-Base and demonstrate that both can be reduced from 32-bit floating-point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.
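To make the reduction from 32-bit floating-point weights to 8-bit integer weights concrete, the following is a minimal sketch of generic symmetric per-tensor quantization. It is not SDQ and not necessarily the quantization scheme used in this work; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical, chosen only for illustration.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale.

    Assumption: symmetric per-tensor quantization, used here purely as an
    illustration; the abstract does not specify the actual scheme.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, not taken from any real model.
weights = [0.031, -0.127, 0.064, 0.0, -0.002]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Per-weight rounding error is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_err)
```

Each weight incurs a rounding error of at most half a quantization step (`scale / 2`). In a deep network such per-layer errors can compound from layer to layer, which is the kind of accumulated error the abstract says SDQ is designed to minimize.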