We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge, the largest int8 training run to date. Our main focus is int8 because GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) Towards stable training, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid, which we call StableAdamW, because it avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping.
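To make the int8 idea concrete, the following is a minimal sketch of a quantized linear-layer forward pass, simulated in pure Python. It is not the SwitchBack kernel. The quantization scheme (one scale per input row, a single scale for the whole weight matrix) and the function names `quantize_rowwise`, `quantize_tensorwise`, `int8_linear`, and `float_linear` are assumptions made for illustration, since the abstract does not specify them.

```python
# Simulated int8 matrix multiply in pure Python (no GPU kernels involved).
# Assumption: row-wise scales for inputs and one tensor-wise scale for
# weights. All names here are hypothetical, chosen for this sketch.

def quantize_rowwise(x):
    """Quantize each row of x to integers in [-127, 127] with its own scale."""
    q, scales = [], []
    for row in x:
        absmax = max(abs(v) for v in row) or 1.0  # guard against all-zero rows
        scale = absmax / 127.0
        q.append([round(v / scale) for v in row])
        scales.append(scale)
    return q, scales


def quantize_tensorwise(w):
    """Quantize the whole matrix w to integers in [-127, 127] with one scale."""
    absmax = max(abs(v) for row in w for v in row) or 1.0
    scale = absmax / 127.0
    return [[round(v / scale) for v in row] for row in w], scale


def int8_linear(x, w):
    """Approximate y = x @ w^T with integer products, then dequantize."""
    xq, x_scales = quantize_rowwise(x)
    wq, w_scale = quantize_tensorwise(w)
    out = []
    for xrow, xs in zip(xq, x_scales):
        # Products are accumulated as integers and rescaled once per output.
        out.append([sum(a * b for a, b in zip(xrow, wrow)) * xs * w_scale
                    for wrow in wq])
    return out


def float_linear(x, w):
    """Full-precision reference for y = x @ w^T."""
    return [[sum(a * b for a, b in zip(xrow, wrow)) for wrow in w]
            for xrow in x]
```

Because each scale is set by the largest absolute value it covers, a single large-magnitude feature coarsens the grid for every other value sharing that scale. This is one way to see why discouraging large feature magnitudes would help low-precision training.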
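The observation that loss spikes follow an under-estimated second moment suggests a simple diagnostic: the root-mean-square of the squared gradient divided by its second moment estimate. The sketch below is one plausible way to combine an AdamW step with Adafactor-style update clipping driven by that ratio. It is an illustration under stated assumptions, not the authors' implementation: the function name `adamw_step_with_update_clipping`, the hyperparameter defaults, the use of bias-corrected moments in the ratio, and the per-tensor granularity are all assumed.

```python
import math


def adamw_step_with_update_clipping(p, g, m, v, t, lr=1e-3, beta1=0.9,
                                    beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One AdamW step on a single tensor (flat lists), with the learning
    rate scaled down when squared gradients exceed their running estimate.

    t is the 1-indexed step count. Returns (p, m, v, rms); rms can be
    logged as a spike diagnostic. Hypothetical helper for illustration.
    """
    # Standard Adam first and second moment updates.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction (assumed to be applied before computing the ratio).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # RMS of g^2 / v_hat: close to 1 when the estimator tracks the
    # gradients, well above 1 when it under-estimates them.
    rms = math.sqrt(
        sum(gi * gi / max(vh, eps ** 2) for gi, vh in zip(g, v_hat)) / len(g)
    )
    # Adafactor-style update clipping: shrink the step only when rms > 1.
    step_lr = lr / max(1.0, rms)
    p = [pi - step_lr * weight_decay * pi - step_lr * mh / (math.sqrt(vh) + eps)
         for pi, mh, vh in zip(p, m_hat, v_hat)]
    return p, m, v, rms
```

In this sketch, a run of small gradients followed by one much larger gradient produces a ratio far above 1, so the step for that iteration is scaled down. When the estimate is accurate, the update reduces to a plain AdamW step.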