The term "emergent properties" has been widely adopted to describe behavior that is not present in smaller models but is observed in larger models. Recent work suggests that the trade-off incurred by quantization is also an emergent property, with sharp drops in performance in models over 6B parameters. In this work, we ask: "are quantization cliffs in performance solely a factor of scale?" Against a backdrop of increased research focus on why certain emergent properties surface at scale, this work provides a useful counter-example. We posit that it is possible to optimize for a quantization-friendly training recipe that suppresses large-magnitude activation outliers. We find that outlier dimensions are not an inherent product of scale, but are instead sensitive to the optimization conditions present during pre-training. This both opens up directions for more efficient quantization and poses the question of whether other emergent properties are inherent or can be altered and conditioned by optimization and architecture design choices. We successfully quantize models ranging in size from 410M to 52B parameters with minimal degradation in performance.
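To make concrete why large-magnitude activation outliers matter for quantization, the following is a minimal sketch, not the setup used in this work. It assumes simple symmetric per-tensor absmax int8 quantization, synthetic Gaussian activations, and a single hypothetical outlier dimension scaled by an arbitrary factor of 60. All function names are illustrative.

```python
import numpy as np


def absmax_quantize_int8(x):
    """Symmetric per-tensor int8 quantization (assumed scheme, for illustration)."""
    # One scale for the whole tensor, set by the largest absolute value.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float32) * scale


def relative_error(x, columns):
    """Quantize the full tensor, then measure relative error on the given columns."""
    q, scale = absmax_quantize_int8(x)
    x_hat = dequantize(q, scale)
    diff = x[:, columns] - x_hat[:, columns]
    return float(np.linalg.norm(diff) / np.linalg.norm(x[:, columns]))


rng = np.random.default_rng(0)

# Synthetic "activations": 64 tokens by 512 hidden dimensions, unit variance.
acts = rng.normal(0.0, 1.0, size=(64, 512)).astype(np.float32)

# Hypothetical outlier dimension: one column with much larger magnitude.
outlier_dim = 7
acts_outlier = acts.copy()
acts_outlier[:, outlier_dim] *= 60.0

# Compare the error on the ordinary (non-outlier) dimensions in both cases.
normal_dims = np.array([d for d in range(acts.shape[1]) if d != outlier_dim])
err_clean = relative_error(acts, normal_dims)
err_with_outlier = relative_error(acts_outlier, normal_dims)

print(f"relative error without outlier dimension: {err_clean:.4f}")
print(f"relative error with outlier dimension:    {err_with_outlier:.4f}")
```

Because the single scale is set by the largest value in the tensor, one outlier dimension stretches the quantization grid and leaves few int8 levels for the remaining dimensions, so their error grows sharply. Suppressing such outliers during training, as the abstract proposes, would sidestep this effect under the assumed scheme.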