Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, their reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons that LLMs activate during autoregressive generation as the context changes. To address these issues, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly, context-based pruning. DART monitors shifts in attention score distributions to infer context changes and dynamically updates neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baselines, achieving accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity. On summarization tasks, DART achieves up to 3x higher ROUGE-L scores than static-mask pruning, with performance comparable to the original dense models. We demonstrate that the proposed framework adapts effectively to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while requiring less than 10 MB of additional memory for LLAMA-3.1-8B (16 GB) and incurring only 0.1% FLOPs overhead. The code is available at https://github.com/seeder-research/DART.
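The core loop described above can be illustrated with a minimal sketch. The abstract does not specify how DART quantifies attention-distribution shifts or scores neurons, so the choices below (Jensen-Shannon divergence as the shift metric, top-k selection by activation magnitude, and the `threshold` value) are illustrative assumptions, not the paper's actual criteria:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two attention score distributions
    # (assumed shift metric; the paper's exact measure is not given here).
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def update_mask(prev_attn, curr_attn, activations, sparsity=0.7,
                prev_mask=None, threshold=0.05):
    """Recompute a neuron-level FFN mask only when the attention
    distribution has shifted beyond `threshold`; otherwise reuse the
    existing mask. `threshold` is a hypothetical tuning knob."""
    if prev_mask is not None and js_divergence(prev_attn, curr_attn) < threshold:
        return prev_mask  # context stable: keep the current mask
    # Context shift detected: keep the top-(1 - sparsity) fraction of
    # neurons by activation magnitude, zero out the rest.
    k = int(round((1.0 - sparsity) * activations.size))
    mask = np.zeros_like(activations, dtype=bool)
    mask[np.argsort(np.abs(activations))[-k:]] = True
    return mask
```

In an actual decoding loop, `update_mask` would run per generation step with the layer's current attention scores and FFN activations; because the mask is only recomputed on detected context shifts, the steady-state cost stays small, consistent with the low runtime overhead reported above.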