Existing memory management techniques severely hinder efficient Large Language Model (LLM) serving on accelerators with poor random-access bandwidth. Static pre-allocation preserves memory contiguity, but it incurs significant overhead because it provisions for the worst case. Conversely, fine-grained paging mitigates this overhead but relies on HBM's high tolerance for random access, making it unsuitable for LPDDR systems, where non-sequential access rapidly degrades bandwidth. Furthermore, prior work typically assumes static length distributions and HBM characteristics, and therefore fails to resolve the fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators such as the Cambricon MLU series. ODMA advances generation-length prediction by addressing two critical limitations in production workloads: (i) distribution drift that invalidates static bucket boundaries, and (ii) performance fragility under heavy-tailed request patterns. ODMA integrates a lightweight length predictor with adaptive bucket partitioning and a fallback safety pool. Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization, while the safety pool ensures robustness against prediction errors. On the Alpaca and Google-NQ benchmarks, ODMA improves on S3's prediction accuracy, from 98.60% to 99.55% and from 82.68% to 93.36%, respectively. Deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators demonstrates that ODMA increases KV-cache utilization by up to 19.25% (absolute) and throughput (TPS) by 23-27% over static baselines, validating the efficacy of predictor-driven contiguous allocation for LPDDR-class devices.
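The interplay of adaptive bucket boundaries and a fallback safety pool can be illustrated with a minimal Python sketch. It is not ODMA's implementation: the class name `AdaptiveBuckets`, the helper `needs_safety_pool`, all parameters, and the equal-mass quantile rule for recalibration are assumptions made for illustration, since the abstract states only that boundaries are recalibrated from online histograms.

```python
from bisect import bisect_left
from collections import deque


class AdaptiveBuckets:
    """Hypothetical sketch: bucket boundaries follow recent observed lengths."""

    def __init__(self, num_buckets=4, window=1000, max_len=2048):
        self.num_buckets = num_buckets
        self.max_len = max_len
        # Sliding window of true generation lengths (stands in for an online histogram).
        self.lengths = deque(maxlen=window)
        # Evenly spaced boundaries until enough data has been observed.
        step = max_len // num_buckets
        self.boundaries = [step * (i + 1) for i in range(num_buckets)]

    def observe(self, true_len):
        """Record the true generation length of a finished request."""
        self.lengths.append(min(true_len, self.max_len))

    def recalibrate(self):
        """Move boundaries to equal-mass quantiles of the window (assumed rule)."""
        if len(self.lengths) < self.num_buckets:
            return
        data = sorted(self.lengths)
        n = len(data)
        cuts = [data[(n * (i + 1)) // self.num_buckets - 1]
                for i in range(self.num_buckets - 1)]
        # The last boundary always covers the maximum supported length.
        self.boundaries = sorted(set(cuts + [self.max_len]))

    def reserve(self, predicted_len):
        """Return the contiguous reservation size, in tokens, for a predicted length."""
        idx = bisect_left(self.boundaries, predicted_len)
        return self.boundaries[min(idx, len(self.boundaries) - 1)]


def needs_safety_pool(reserved, true_len):
    """A request that outgrows its reservation falls back to the safety pool."""
    return true_len > reserved


buckets = AdaptiveBuckets(num_buckets=4, window=100, max_len=2048)
for length in range(1, 101):       # early traffic: short generations
    buckets.observe(length)
buckets.recalibrate()
before_drift = buckets.reserve(30)

for length in range(101, 201):     # drifted traffic: longer generations
    buckets.observe(length)
buckets.recalibrate()
after_drift = buckets.reserve(30)
```

In this sketch, a static scheme would keep reserving the initial evenly spaced sizes regardless of traffic, whereas the recalibrated boundaries track the observed distribution. Requests whose true length exceeds the reservation are the ones a safety pool would need to absorb.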