Existing memory management techniques severely hinder efficient Large Language Model serving on accelerators constrained by poor random-access bandwidth. While static pre-allocation preserves memory contiguity, it incurs significant overhead due to worst-case provisioning. Conversely, fine-grained paging mitigates this overhead but relies on HBM's high tolerance for random access, making it unsuitable for LPDDR systems, where non-sequential access rapidly degrades bandwidth. Furthermore, prior works typically assume static length distributions and HBM characteristics, and thereby fail to resolve the fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators such as the Cambricon MLU series. ODMA advances generation-length prediction by addressing two critical limitations in production workloads: (i) distribution drift that invalidates static bucket boundaries, and (ii) performance fragility under heavy-tailed request patterns. ODMA integrates a lightweight length predictor with adaptive bucket partitioning and a fallback safety pool. Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization, while the safety pool ensures robustness against prediction errors. On the Alpaca and Google-NQ benchmarks, ODMA improves on S3's prediction accuracy, from 98.60% to 99.55% and from 82.68% to 93.36%, respectively. Deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators demonstrates that ODMA increases KV-cache utilization by up to 19.25% (absolute) and throughput (TPS) by 23-27% over static baselines, validating the efficacy of predictor-driven contiguous allocation for LPDDR-class devices.
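The abstract names three components (a length predictor, adaptive bucket boundaries recalibrated from online histograms, and a fallback safety pool) without specifying their data structures. The following is a minimal sketch of how the last two could interact, not ODMA's actual implementation. It assumes quantile-based boundaries over a sliding window of observed generation lengths, a fixed-size pool measured in tokens, and a predicted length supplied by some external predictor. The names `AdaptiveBuckets`, `SafetyPool`, and `serve_request`, along with all parameter values, are hypothetical.

```python
from bisect import bisect_left
from collections import deque


class AdaptiveBuckets:
    """Hypothetical: bucket upper bounds re-derived from recent lengths."""

    def __init__(self, num_buckets=4, window=1000, max_len=2048):
        self.num_buckets = num_buckets
        self.max_len = max_len
        self.history = deque(maxlen=window)  # sliding window of observations
        # Evenly spaced boundaries until enough data has been observed.
        self.boundaries = [max_len * (i + 1) // num_buckets
                           for i in range(num_buckets)]

    def observe(self, actual_len):
        """Record a finished request's generation length (clamped)."""
        self.history.append(min(actual_len, self.max_len))

    def recalibrate(self):
        """Assumed policy: place boundaries at empirical quantiles."""
        n, k = len(self.history), self.num_buckets
        if n < k:
            return  # too few samples to estimate quantiles
        ordered = sorted(self.history)
        cuts = {ordered[(i * n + k - 1) // k - 1] for i in range(1, k)}
        inner = sorted(c for c in cuts if 0 < c < self.max_len)
        self.boundaries = inner + [self.max_len]

    def reserve(self, predicted_len):
        """Contiguous reservation: smallest boundary >= predicted length."""
        idx = bisect_left(self.boundaries, predicted_len)
        return self.boundaries[min(idx, len(self.boundaries) - 1)]


class SafetyPool:
    """Hypothetical fixed-size overflow pool, counted in tokens."""

    def __init__(self, capacity_tokens):
        self.free = capacity_tokens

    def borrow(self, tokens):
        if tokens <= self.free:
            self.free -= tokens
            return True
        return False

    def release(self, tokens):
        self.free += tokens


def serve_request(buckets, pool, predicted_len, actual_len):
    """Simulate one request; actual_len is only known after generation."""
    reserved = buckets.reserve(predicted_len)
    overflow = max(0, actual_len - reserved)  # under-prediction, in tokens
    ok = overflow == 0 or pool.borrow(overflow)
    buckets.observe(actual_len)
    return reserved, ok
```

In this sketch, calling `recalibrate()` periodically lets the boundaries follow distribution drift, while `SafetyPool` absorbs under-predictions until it runs out of free tokens. A real system would also need a policy for that exhausted case (for example, preemption or queuing), which the abstract does not describe.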