Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections: when architectural modifications (e.g., to channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue to grow, this strategy incurs ever-higher computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at \url{https://github.com/Haiyang-W/TokenFormer}.
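The core idea of replacing a linear projection with attention over learnable key-value parameter tokens can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a standard softmax over parameter tokens, whereas the actual model may use a different normalization, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def token_parameter_attention(x, key_params, value_params):
    """Hypothetical sketch of a token-parameter attention layer:
    input tokens act as queries; learnable parameter tokens act as
    keys and values, standing in for a fixed linear projection."""
    # x: (seq_len, d_in); key_params: (n_params, d_in); value_params: (n_params, d_out)
    scores = x @ key_params.T / np.sqrt(x.shape[-1])      # (seq_len, n_params)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over parameter tokens
    return weights @ value_params                         # (seq_len, d_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # 4 input tokens, channel dim 8
K = rng.standard_normal((16, 8))       # 16 key parameter tokens
V = rng.standard_normal((16, 8))       # 16 value parameter tokens
y = token_parameter_attention(x, K, V)

# Scaling: append new key-value parameter pairs; the input/output
# interface (and hence the rest of the model) is unchanged, so the
# existing parameters can be reused rather than retrained from scratch.
# (Exact output preservation at initialization depends on the paper's
# normalization and initialization scheme, not shown here.)
K_new = np.vstack([K, rng.standard_normal((8, 8))])
V_new = np.vstack([V, np.zeros((8, 8))])
y2 = token_parameter_attention(x, K_new, V_new)
```

Note that growing `n_params` changes only the number of parameter tokens attended over; the output dimensionality stays fixed, which is what makes incremental scaling possible without architectural surgery.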