The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) on formal language tasks. Parameter sharing also makes UTs more parameter-efficient than VTs. Despite these advantages, scaling up a UT's parameters is much more compute- and memory-intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce the UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT matches the performance of strong baseline models on WMT'14 while using only half the computation and parameters, and achieves strong generalization results on formal language tasks (logical inference and CFQ). The new halting mechanism also enables a reduction of around 50% in computation during inference, with very little performance decrease on formal language tasks.
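The abstract names a stick-breaking-based halting mechanism without giving its details. As a rough illustration of the general stick-breaking idea only, and not necessarily the paper's exact formulation, the sketch below turns per-layer halting scores into a distribution over layers: each layer claims a fraction of whatever probability mass earlier layers left over. The function names, the `threshold` parameter, and the early-exit rule are hypothetical and chosen for illustration.

```python
def stick_breaking(halt_scores):
    """Convert per-layer halting scores in [0, 1] into halting probabilities.

    Layer l receives halt_scores[l] times the mass left over by earlier
    layers. Returns (per-layer probabilities, leftover mass).
    """
    remaining = 1.0  # length of the stick not yet broken off
    probs = []
    for score in halt_scores:
        probs.append(score * remaining)
        remaining *= 1.0 - score
    return probs, remaining


def layers_needed(halt_scores, threshold=0.99):
    """Hypothetical early-exit rule (an assumption, not from the paper):
    stop at the first layer where cumulative halting mass reaches threshold."""
    probs, _ = stick_breaking(halt_scores)
    total = 0.0
    for depth, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return depth
    return len(halt_scores)  # never confident enough: use every layer


if __name__ == "__main__":
    probs, leftover = stick_breaking([0.5, 0.5, 1.0])
    print(probs, leftover)
    print(layers_needed([0.5, 0.5, 1.0], threshold=0.7))
```

Under this kind of rule, a token whose cumulative halting mass passes the threshold early skips the remaining layers, which is one way the inference-time savings described above could arise.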