Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a pretrained transformer. Such an understanding could both yield better usage of existing models and motivate architectural improvements that produce new variants. We present a series of empirical studies on frozen models showing that the lower and final layers of pretrained transformers differ from the middle layers, but that the middle layers are surprisingly uniform. We further show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
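To make the interventions concrete, the following is a minimal sketch, not the paper's experimental code, of how layer skipping, reordering, and parallel execution might be applied to a frozen model. It assumes a Hugging Face GPT-2 checkpoint whose decoder blocks live in `model.transformer.h` (a `torch.nn.ModuleList`); the helper names `skip_layers`, `shuffle_layers`, and `parallelize_layers` are illustrative, not from the paper.

```python
# Sketch of layer-stack interventions on a frozen pretrained transformer.
# Assumes the Hugging Face GPT-2 layout (blocks in model.transformer.h).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # frozen: no training


def skip_layers(model, start, end):
    """Drop blocks [start, end) from the frozen layer stack."""
    keep = [b for i, b in enumerate(model.transformer.h) if not (start <= i < end)]
    model.transformer.h = torch.nn.ModuleList(keep)


def shuffle_layers(model, start, end, seed=0):
    """Run the middle blocks in a random order rather than the trained one."""
    blocks = list(model.transformer.h)
    mid = blocks[start:end]
    perm = torch.randperm(len(mid), generator=torch.Generator().manual_seed(seed))
    blocks[start:end] = [mid[i] for i in perm.tolist()]
    model.transformer.h = torch.nn.ModuleList(blocks)


class ParallelBlocks(torch.nn.Module):
    """Feed one hidden state to several blocks and average their outputs."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden_states, **kwargs):
        # Each GPT-2 block returns a tuple whose first element is the
        # hidden state; with use_cache=False only that element is consumed.
        outs = [b(hidden_states, **kwargs)[0] for b in self.blocks]
        return (torch.stack(outs).mean(dim=0),)


def parallelize_layers(model, start, end):
    """Replace blocks [start, end) with a single averaged parallel block."""
    blocks = list(model.transformer.h)
    blocks[start:end] = [ParallelBlocks(blocks[start:end])]
    model.transformer.h = torch.nn.ModuleList(blocks)


# Example: skip 4 of GPT-2's 12 blocks, then run inference as usual.
skip_layers(model, start=4, end=8)
tok = GPT2Tokenizer.from_pretrained("gpt2")
ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids, use_cache=False).logits  # degraded but often coherent
print(tok.decode(logits[0, -1].argmax()))
```

Mutating the `ModuleList` in place keeps the rest of the pipeline (embeddings, final layer norm, LM head) untouched, which matches the frozen-model setting: the interventions change only which blocks run and in what order, never any weights.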