State space models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show that a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation (KD) from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 MB to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) with just 102M tokens (4 GPU-hours on a single H100), approaching Bi-Mamba's 48.4% (within the ±0.9 pp confidence interval). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies that are effective for Transformers fail for SSMs because errors accumulate through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.
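To make the terms "grouped ternary quantization" and "zero ratio" concrete, the following is a minimal sketch and not the method described above. It assumes an absmean scale per group, round-and-clip ternarization, and a zero ratio defined as the fraction of weights mapped to 0; the function names, the group size, and the scale rule are all hypothetical. It also illustrates one plausible reading of zero-ratio collapse: if a learnable scale drifts far above the weight magnitudes, every weight in the group rounds to zero.

```python
# Hypothetical sketch of grouped ternary quantization (pure Python).
# Assumptions (not taken from the abstract): absmean scale per group,
# round-and-clip ternarization, zero ratio = fraction of codes equal to 0.

def absmean_scale(weights, eps=1e-8):
    """Per-group scale: mean absolute weight (eps avoids division by zero)."""
    return sum(abs(w) for w in weights) / len(weights) + eps

def ternarize_group(weights, scale):
    """Map each weight to {-1, 0, +1} by rounding w / scale and clipping."""
    return [max(-1, min(1, round(w / scale))) for w in weights]

def quantize_grouped(weights, group_size):
    """Split a flat weight list into groups; return (codes, scale) per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = absmean_scale(group)
        groups.append((ternarize_group(group, scale), scale))
    return groups

def zero_ratio(groups):
    """Fraction of ternary codes that are exactly zero."""
    codes = [c for group_codes, _ in groups for c in group_codes]
    return codes.count(0) / len(codes)

# Toy weights, chosen only for illustration.
weights = [0.9, -0.1, 0.05, -1.2, 0.4, 0.0, -0.6, 0.3]
groups = quantize_grouped(weights, group_size=4)

# If a (learnable) scale grows much larger than the weights,
# every entry in the group rounds to 0.
collapsed = ternarize_group(weights[:4], scale=10.0)
```

With these toy weights, the two groups quantize to `[1, 0, 0, -1]` and `[1, 0, -1, 1]`, giving a zero ratio of 0.375, while the oversized scale of 10.0 yields `[0, 0, 0, 0]`. A real QAT setup would additionally need a straight-through estimator for gradients and the distillation loss, both omitted here.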