Mamba selective state space models (SSMs) provide linear-time sequence modeling but are often memory-bandwidth bound in practice, because selective state updates execute as fragmented kernels that repeatedly materialize intermediate tensors. We present COREY, a prototype scheduler that uses activation entropy, estimated with fixed-width histograms, as a runtime signal for chunk-size selection at the kernel-invocation level. COREY is positioned as a Concept and Feasibility contribution: a single-parameter runtime auto-tuner built on an existing Triton selective-scan kernel rather than a new fused implementation. Evidence is organized in three tiers. Tier 1 (Python cost model) shows that entropy-guided grouping reduces surrogate latency and DRAM traffic. Tier 2a (real-checkpoint inline hook) demonstrates that entropy computation and chunk selection can run on the critical path of model.generate(); on Mamba-370M (RTX 3070, n=5), measured overhead is 8.3 percent with full instrumentation and is estimated at about 2 percent with sparse sampling. Tier 2b (kernel-level scan benchmark) shows that, under a principled calibration where H_ref equals log(K), COREY selects the same chunk size as a one-time-profile oracle without offline sweeps and achieves up to 4.41x speedup over a static chunk size of 64. This work does not yet include a fully integrated end-to-end run connecting Tier 2a and Tier 2b; that integration remains key future work. Across 80 LongBench prompts, entropy distributions are stable, which supports COREY as a practical runtime auto-tuner within a single regime. Code and data: https://github.com/mabo1215/COREY_Transformer/.
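To make the entropy signal concrete, the following is a minimal sketch of a fixed-width histogram entropy estimate followed by a chunk-size lookup. It is not taken from the COREY repository. The function names, the candidate chunk sizes, the reading of K as the number of histogram bins, and the rule that higher normalized entropy maps to a larger chunk are all assumptions made for illustration. The sketch uses plain Python lists for readability, whereas a real hook would operate on GPU activation tensors.

```python
import math


def histogram_entropy(values, num_bins=256):
    """Shannon entropy (in nats) of a fixed-width histogram over `values`."""
    values = list(values)
    if not values:
        return 0.0
    lo, hi = min(values), max(values)
    if hi == lo:
        # All samples fall into a single bin, so the entropy is zero.
        return 0.0
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def select_chunk_size(entropy, num_bins=256, candidates=(32, 64, 128, 256, 512)):
    """Hypothetical selection rule: bucket H / H_ref onto a list of chunk sizes.

    Assumes H_ref = log(K) with K = num_bins, the maximum entropy of a
    K-bin histogram. The direction of the mapping is an assumption.
    """
    h_ref = math.log(num_bins)
    ratio = min(max(entropy / h_ref, 0.0), 1.0)
    index = min(int(ratio * len(candidates)), len(candidates) - 1)
    return candidates[index]


# Example: evenly spread activations give near-maximal entropy.
spread = [i / 25600 for i in range(25600)]
chunk = select_chunk_size(histogram_entropy(spread))
```

With H_ref set to the maximum entropy of the histogram, the normalized ratio lies in [0, 1] by construction, so a rule of this shape needs no offline sweep to set its scale. Only the candidate list would be hardware-specific.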