Recent work attributes the in-context learning (ICL) capability of large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., a linear model or a 2-layer MLP) during inference. However, such constructions incur a large memory overhead, which makes simulating more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (TinT for short), that allows a transformer to simulate and fine-tune complex models (e.g., pre-trained language models) internally during inference. In particular, we introduce approximation techniques that allow a TinT model with fewer than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants, and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe that TinT for an OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TinT is included.
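To make the notion of "fine-tuning an internal model with a one-step budget" concrete, the following is a minimal sketch of the simplest case mentioned above: a linear internal model that takes one gradient step on the in-context examples before answering a query. It is an illustration only, not the TinT construction. The function names, the squared-error loss, the toy data, and the learning rate are all assumptions introduced here.

```python
# Hypothetical illustration: one gradient step on in-context examples
# for a linear internal model. Not the TinT construction itself.

def predict(w, x):
    """Linear prediction <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, examples):
    """Mean squared error of w over (x, y) pairs."""
    return sum((predict(w, x) - y) ** 2 for x, y in examples) / len(examples)


def one_step_update(w, examples, lr):
    """Return the weights after a single gradient-descent step on the MSE."""
    n = len(examples)
    grad = [0.0] * len(w)
    for x, y in examples:
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return [wj - lr * gj for wj, gj in zip(w, grad)]


# Toy in-context examples (assumed data): pairs of (input, target).
context = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

w0 = [0.0, 0.0]                               # internal model before the update
w1 = one_step_update(w0, context, lr=0.1)     # internal model after one step

query = (2.0, 1.0)
print("loss before:", mse(w0, context))
print("loss after: ", mse(w1, context))
print("query prediction after one step:", predict(w1, query))
```

In this picture, the context plays the role of a small training set and the query is answered by the updated model. The abstract's claim is that a transformer can carry out an analogous update within a single forward pass when the internal model is itself a pre-trained transformer.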