The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead, a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention mechanism can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchHead" Transformer model. Our code is public.
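To make the idea of MoE value and output projections concrete, the following is a minimal NumPy sketch of a single attention head in which only those two projections are mixtures of experts, so the head still computes just one attention matrix. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the routing rule, the number of experts, or which tokens drive the gates. The sigmoid top-k gate, the causal mask, and all names (`top_k_gate`, `moe_project`, `moe_attention_head`, `w_gate_v`, `w_gate_o`) are hypothetical choices made for this example.

```python
import numpy as np


def top_k_gate(x, w_gate, k):
    """Per-token expert weights of shape (T, E).

    Assumed routing rule: sigmoid scores, zeroed outside each token's top-k.
    """
    scores = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # (T, E)
    top = np.argsort(-scores, axis=-1)[:, :k]      # k best experts per token
    mask = np.zeros_like(scores)
    np.put_along_axis(mask, top, 1.0, axis=-1)
    return scores * mask


def moe_project(x, experts, gate):
    """Gated mixture of linear projections.

    x: (T, d_in), experts: (E, d_in, d_out), gate: (T, E) -> (T, d_out).
    Every expert is evaluated densely here for clarity; an efficient version
    would compute only the selected experts.
    """
    per_expert = np.einsum("td,edo->teo", x, experts)
    return np.einsum("te,teo->to", gate, per_expert)


def moe_attention_head(x, w_q, w_k, v_experts, o_experts, w_gate_v, w_gate_o, k=2):
    """One causal attention head with dense Q/K and MoE value/output projections."""
    T = x.shape[0]
    q = x @ w_q                                    # (T, d_head)
    key = x @ w_k                                  # (T, d_head)

    # A single (T, T) attention matrix is computed for this head.
    logits = q @ key.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Value projection: experts chosen per token.
    v = moe_project(x, v_experts, top_k_gate(x, w_gate_v, k))       # (T, d_head)
    ctx = attn @ v                                                   # (T, d_head)

    # Output projection: experts chosen per token (gated on the layer input).
    out = moe_project(ctx, o_experts, top_k_gate(x, w_gate_o, k))   # (T, d_model)
    return out, attn
```

In this sketch the number of attention matrices equals the number of heads, while the number of value and output experts can grow independently, which is one way to read the trade between fewer attention matrices and a fixed parameter budget described above.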