Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these settings require practitioners to train foundation models such as PaLM 2, Llama, and ViTs as a series of models of varying sizes. Because training is expensive, only a few model sizes are trained and supported, which limits fine-grained control over the relevant tradeoffs, including latency, cost, and accuracy. This work introduces MatFormer, a nested Transformer architecture designed to offer elasticity under a variety of deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer model is jointly optimized with a few nested smaller FFN blocks. This training procedure allows for the Mix'n'Match of model granularities across layers: a trained universal MatFormer model enables the extraction of hundreds of accurate smaller models that were never explicitly optimized. We empirically demonstrate MatFormer's effectiveness across different model classes (decoders and encoders), modalities (language and vision), and scales (up to 2.6B parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B parameters, each with validation loss and one-shot downstream evaluation results comparable to those of its independently trained counterpart. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure needed for adaptive large-scale retrieval. Finally, we show that speculative decoding with the accurate and consistent submodels extracted from MatFormer can further reduce inference latency.
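The nesting and Mix'n'Match ideas described above can be illustrated with a minimal sketch. It assumes that a nested smaller FFN block reuses the first `m` hidden units of the full block, and it omits attention, normalization, and the joint training procedure entirely. All function names (`nested_ffn`, `mix_n_match`, `all_configs`, `make_layers`) and the toy dimensions are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import itertools
import random


def nested_ffn(x, w_in, w_out, m):
    """Apply a ReLU FFN using only the first m hidden units (assumed nesting).

    w_in  has one row per hidden unit: its input weights  (length d_model).
    w_out has one row per hidden unit: its output weights (length d_model).
    """
    # Hidden activations for the first m units only.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in[:m]]
    out = [0.0] * len(x)
    for h, row in zip(hidden, w_out[:m]):
        for j, w in enumerate(row):
            out[j] += h * w
    return out


def mix_n_match(x, layers, widths):
    """Run a stack of residual FFN layers, picking a hidden width per layer."""
    for (w_in, w_out), m in zip(layers, widths):
        y = nested_ffn(x, w_in, w_out, m)
        x = [a + b for a, b in zip(x, y)]  # residual connection
    return x


def all_configs(num_layers, granularities):
    """Enumerate every per-layer width assignment (g ** num_layers in total)."""
    return list(itertools.product(granularities, repeat=num_layers))


def make_layers(num_layers, d_model, d_ff, seed=0):
    """Random toy weights standing in for a trained model (illustration only)."""
    rng = random.Random(seed)
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return [(matrix(), matrix()) for _ in range(num_layers)]


if __name__ == "__main__":
    layers = make_layers(num_layers=3, d_model=4, d_ff=8)
    x = [1.0, -0.5, 0.25, 2.0]
    print(mix_n_match(x, layers, widths=[8, 8, 8]))  # full model
    print(mix_n_match(x, layers, widths=[2, 4, 8]))  # one mixed submodel
    print(len(all_configs(3, [2, 4, 8])))            # number of width assignments
```

Because every submodel reads a prefix of the same weight matrices, extracting one requires no additional parameters, and the number of available configurations grows exponentially with depth: with `g` granularities per layer and `L` layers, the sketch enumerates `g ** L` width assignments.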