Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a pretrained transformer. Such an understanding could both yield better usage of existing models and suggest architectural improvements for new variants. We present a series of empirical studies on frozen models showing that the lower and final layers of pretrained transformers differ from middle layers, but that the middle layers are surprisingly uniform. We further show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
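As a minimal sketch (not the paper's actual implementation), the three interventions described above — skipping layers, reordering them, and running them in parallel — can be expressed over a generic stack of layer functions. The function names and the averaging used to merge parallel outputs are illustrative assumptions, not details from this work:

```python
from typing import Callable, Sequence, Set

# A "layer" here is any function mapping a hidden state to a hidden state;
# real transformer layers would map tensors to tensors.
Layer = Callable[[float], float]

def run_sequential(layers: Sequence[Layer], x: float) -> float:
    # Standard forward pass: apply each frozen layer in training order.
    for layer in layers:
        x = layer(x)
    return x

def run_skipped(layers: Sequence[Layer], x: float, skip: Set[int]) -> float:
    # Skip the given layer indices entirely, trading accuracy for latency.
    for i, layer in enumerate(layers):
        if i not in skip:
            x = layer(x)
    return x

def run_reordered(layers: Sequence[Layer], x: float,
                  order: Sequence[int]) -> float:
    # Run layers in an order different from how they were trained.
    for i in order:
        x = layers[i](x)
    return x

def run_parallel(layers: Sequence[Layer], x: float) -> float:
    # Apply every layer to the same input and average the outputs —
    # one hypothetical way to merge layers executed in parallel.
    outputs = [layer(x) for layer in layers]
    return sum(outputs) / len(outputs)
```

For example, with toy layers `[lambda x: x + 1, lambda x: x * 2, lambda x: x + 3]`, `run_sequential` computes `((x + 1) * 2) + 3`, while `run_skipped` with `skip={1}` computes `(x + 1) + 3`.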