Transformer models face challenges in scaling hidden dimensions efficiently: uniformly increasing them inflates computational and memory costs while failing to emphasize the features most relevant to each token. To investigate this, we study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of dimensions per token, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens, and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Specifically, MoHD employs shared sub-dimensions for common token features and a routing mechanism that dynamically activates specialized sub-dimensions. To mitigate potential information loss from sparsity, we design activation scaling and group fusion mechanisms that preserve activation flow. In this way, MoHD expands hidden dimensions with negligible increases in computation or parameter count, enabling efficient training and inference while maintaining performance. Evaluations across 10 NLP tasks show that MoHD surpasses vanilla Transformers in both parameter efficiency and task performance: it achieves 1.7% higher performance with 50% fewer activation parameters, and 3.7% higher performance with a 3x parameter expansion at constant activation cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden dimension sparsity to boost efficiency.
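The conditional activation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the split into shared dimensions plus router-selected groups of specialized dimensions follows the abstract, while the function names, the softmax-free top-k router, and the exact rescaling factor (restoring the dense activation magnitude after sparsification) are assumptions.

```python
import numpy as np

def mohd_activate(x, w_router, n_shared, group_size, top_k):
    """Sketch of MoHD-style sparse conditional dimension activation.

    x:        (seq_len, d) token hidden states.
    w_router: (d, n_groups) hypothetical router projection.

    The first n_shared dims are always active (shared sub-dimensions);
    the remaining dims are split into groups of `group_size`, and the
    router picks `top_k` groups per token (specialized sub-dimensions).
    Active values are rescaled by d / n_active so the expected activation
    magnitude matches the dense model -- a stand-in for the paper's
    "activation scaling", whose exact form is an assumption here.
    """
    seq_len, d = x.shape
    # Per-token router scores for each specialized group.
    logits = x @ w_router                            # (seq_len, n_groups)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k group indices

    mask = np.zeros_like(x)
    mask[:, :n_shared] = 1.0                         # shared dims always on
    for t in range(seq_len):
        for g in chosen[t]:
            lo = n_shared + g * group_size
            mask[t, lo:lo + group_size] = 1.0        # activate chosen group

    n_active = n_shared + top_k * group_size
    return x * mask * (d / n_active)                 # sparsify + rescale
```

With `d = 16`, `n_shared = 4`, `group_size = 4`, and `top_k = 1`, each token activates only 8 of 16 dimensions, so the activation cost is halved while the nominal hidden dimension is unchanged, mirroring the parameter-efficiency trade-off reported above.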