Large Language Models (LLMs) excel across diverse domains but incur high energy costs due to the quadratic complexity of attention and dense Feed-Forward Network (FFN) operations. To address these issues, we propose Module-aware Architecture Refinement (MAR), a two-stage framework that integrates State Space Models (SSMs) for linear-time sequence modeling and applies activation sparsification to reduce FFN cost. In addition, to mitigate the low information density and temporal mismatch that arise when integrating Spiking Neural Networks (SNNs) with SSMs, we design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Extensive experiments demonstrate that MAR effectively restores the performance of its dense counterpart under constrained resources while substantially reducing inference energy consumption. Moreover, it outperforms efficient models of comparable or even larger scale, underscoring its potential for building efficient and practical LLMs.
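The abstract does not specify ATMN's internals, so the following is only a minimal sketch of what a ternary multi-step spiking neuron can look like: a leaky integrate-and-fire unit run over T timesteps that emits spikes in {-1, 0, +1} against a learnable threshold. The class name, the leak constant `tau`, the soft-reset rule, and the omission of a surrogate gradient are all assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: ATMN's actual design is not given in the abstract.
# Hypothetical names and hyperparameters throughout.
import torch


class TernaryMultiStepNeuron(torch.nn.Module):
    """Leaky integrate-and-fire neuron emitting ternary spikes {-1, 0, +1}
    over T timesteps, with a learnable (adaptive) firing threshold."""

    def __init__(self, tau: float = 2.0, init_threshold: float = 1.0):
        super().__init__()
        self.tau = tau  # membrane leak constant (assumed value)
        self.threshold = torch.nn.Parameter(torch.tensor(init_threshold))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, batch, features) -- a multi-step input sequence
        v = torch.zeros_like(x[0])  # membrane potential
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau          # leaky integration
            pos = (v >= self.threshold).float()    # positive spike
            neg = (v <= -self.threshold).float()   # negative spike
            s = pos - neg                          # ternary spike in {-1, 0, +1}
            v = v - s * self.threshold             # soft reset by subtraction
            spikes.append(s)
        # Note: the hard threshold blocks gradients; a real SNN would use a
        # surrogate gradient here, which this sketch deliberately omits.
        return torch.stack(spikes)


# Usage: 4 timesteps, batch of 2, 8 features
out = TernaryMultiStepNeuron()(torch.randn(4, 2, 8))
print(out.unique())  # values drawn from {-1., 0., 1.}
```

Compared with binary {0, 1} spikes, the extra negative level is one plausible way to raise information density per spike, which is the motivation the abstract attributes to ATMN.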