It is well established that scaling up deep transformer networks improves quality and performance. However, larger scale typically comes with higher compute cost and inference latency. Methods that realize the benefits of scale without a corresponding increase in compute cost are therefore important. We introduce Alternating Updates (AltUp), a simple-to-implement method that increases a model's capacity without the accompanying computational burden. AltUp widens the learned representation without increasing computation time by operating on only a subblock of the representation at each layer. Experiments on various transformer models and language tasks demonstrate that AltUp is consistently effective across a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension and demonstrate how AltUp can be combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.
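The idea of operating on one subblock per layer can be illustrated with a minimal sketch. The abstract does not specify how the remaining subblocks are updated, so the sketch assumes a lightweight predict-then-correct rule with scalar mixing weights; the names `toy_layer`, `altup_step`, `run`, `p`, and `g` are hypothetical and chosen only for illustration, and `toy_layer` stands in for a full transformer layer.

```python
import math


def toy_layer(block):
    """Stand-in for an expensive transformer layer acting on one width-d block."""
    return [math.tanh(v) for v in block]


def altup_step(blocks, active, layer, p, g):
    """Update all K blocks while calling `layer` on only the `active` block.

    p[i][j]: scalar mixing weights for the cheap prediction of block i (assumed).
    g[i]:    scalar correction gain for block i (assumed).
    """
    k, d = len(blocks), len(blocks[0])
    # Cheap prediction: each block becomes a scalar-weighted mix of all blocks.
    predicted = [
        [sum(p[i][j] * blocks[j][t] for j in range(k)) for t in range(d)]
        for i in range(k)
    ]
    # Expensive computation: the layer runs once, on a single width-d block.
    computed = layer(blocks[active])
    # Correction: propagate the change observed on the active block to every block.
    delta = [c - ph for c, ph in zip(computed, predicted[active])]
    return [
        [predicted[i][t] + g[i] * delta[t] for t in range(d)]
        for i in range(k)
    ]


def run(blocks, num_layers, layer=toy_layer):
    """Apply `num_layers` steps, alternating which block the layer processes."""
    k = len(blocks)
    p = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]  # identity mix
    g = [1.0] * k
    calls = 0
    for n in range(num_layers):
        blocks = altup_step(blocks, n % k, layer, p, g)
        calls += 1  # one width-d layer call per step, independent of K
    return blocks, calls
```

In this sketch the representation has width K·d, yet each step invokes the layer on a single width-d block; the only additional work is the scalar mixing, which costs on the order of K²·d operations per step. In a trained model the weights `p` and `g` would presumably be learned rather than fixed as they are here.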