In this work, we systematically investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models. Although dynamic activation methods have the potential to reduce computation and increase speed in models that use the ReLU activation function, our empirical findings reveal several inherent pitfalls in current dynamic activation schemes. Through extensive experiments across a range of dynamic activation strategies, we demonstrate that LLaMA models usually underperform their ReLU counterparts, particularly in scenarios that demand high sparsity ratios. We attribute these deficiencies to a combination of factors: 1) the inherent complexity of dynamically predicting which heads and neurons are activated; 2) the inadequate sparsity induced by the activation functions; 3) the insufficient preservation of information caused by KV cache skipping. Our analysis not only sheds light on the limitations of dynamic activation in large-scale LLaMA models but also proposes roadmaps for improving the design of future sparsity schemes.