Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs and for facilitating their deployment in practical, resource-constrained scenarios.
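The incremental procedure described above (mask a few more operators, fine-tune, then check accuracy and latency) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the importance criterion, the step size, or the stopping rule, so `importance`, `step_ratio`, `max_accuracy_drop`, and the `fine_tune` and `evaluate` callbacks are all hypothetical names introduced here for illustration, and "operator" is treated as an abstract prunable unit.

```python
import time
from typing import Callable, List, Sequence, Tuple

Mask = List[bool]
History = List[Tuple[float, float, float]]  # (pruned fraction, accuracy, latency in s)


def incremental_prune(
    importance: Sequence[float],          # hypothetical per-operator importance scores
    fine_tune: Callable[[Mask], None],    # hypothetical: fine-tunes the masked model
    evaluate: Callable[[Mask], float],    # hypothetical: returns accuracy of the masked model
    target_ratio: float = 0.7,            # stop once this fraction of operators is pruned
    step_ratio: float = 0.1,              # assumed fraction of operators pruned per round
    max_accuracy_drop: float = 0.01,      # assumed tolerance relative to the unpruned model
) -> Tuple[Mask, History]:
    n = len(importance)
    mask: Mask = [True] * n               # True means the operator is kept
    baseline = evaluate(mask)
    # Assumption: prune the least important operators first.
    order = sorted(range(n), key=lambda i: importance[i])
    target = round(target_ratio * n)
    step = max(1, round(step_ratio * n))
    pruned = 0
    history: History = []

    while pruned < target:
        next_pruned = min(pruned + step, target)
        candidate = list(mask)
        for i in order[pruned:next_pruned]:
            candidate[i] = False          # structured masking of whole operators
        fine_tune(candidate)              # recover accuracy before measuring
        start = time.perf_counter()
        accuracy = evaluate(candidate)
        latency = time.perf_counter() - start  # crude proxy for inference latency
        if baseline - accuracy > max_accuracy_drop:
            break                         # keep the last mask that met the tolerance
        mask, pruned = candidate, next_pruned
        history.append((pruned / n, accuracy, latency))

    return mask, history


# Toy usage: a stand-in "model" whose accuracy collapses below 5 kept operators.
scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.05]
final_mask, log = incremental_prune(
    scores,
    fine_tune=lambda m: None,
    evaluate=lambda m: 1.0 if sum(m) >= 5 else 0.5,
)
```

In a real setting the callbacks would wrap an actual S4 or S4D model and a held-out dataset, and latency would be measured on the target hardware over repeated runs rather than around a single evaluation call.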