Recent work attributes the in-context learning (ICL) ability of large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., a linear model or a 2-layer MLP) during inference. However, such constructions incur a large memory overhead, which makes simulating more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (TinT for short), that allows a transformer to simulate and fine-tune complex models (e.g., pre-trained language models) internally during inference. In particular, we introduce innovative approximation techniques that allow a TinT model with fewer than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants, and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe that TinT for an OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, we provide a modular and extensible codebase for TinT.