Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that lets a lightweight model reuse previously generated answers and retrieve relevant supporting information for low-cost inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings and supports answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost while maintaining answer quality comparable to the strong-model baseline.
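To make the serving flow described above concrete, the following is a minimal sketch of memory-backed routing and not MemBoost's actual implementation. The abstract does not specify how similarity, confidence, or escalation are computed, so everything here is an assumption chosen for illustration: the `MemoryRouter` class, token-level Jaccard similarity, the `reuse_threshold` and `confidence_threshold` values, and the toy stand-in models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for whatever retriever is really used."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class MemoryRouter:
    # Hypothetical interfaces: the small model returns (answer, confidence).
    small_model: Callable[[str, List[str]], Tuple[str, float]]
    large_model: Callable[[str], str]
    reuse_threshold: float = 0.9        # assumed value, not from the paper
    confidence_threshold: float = 0.7   # assumed value, not from the paper
    top_k: int = 2
    memory: List[Tuple[str, str]] = field(default_factory=list)
    large_calls: int = 0

    def answer(self, query: str) -> Tuple[str, str]:
        # Rank stored (query, answer) pairs by similarity to the new query.
        scored = sorted(
            ((jaccard(query, q), a) for q, a in self.memory),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # 1) Answer reuse: a near-duplicate query returns the stored answer.
        if scored and scored[0][0] >= self.reuse_threshold:
            return scored[0][1], "reuse"
        # 2) Cheap inference: the small model answers with retrieved context.
        context = [a for score, a in scored[: self.top_k] if score > 0]
        result, confidence = self.small_model(query, context)
        source = "small"
        # 3) Escalation: low confidence sends the query to the strong model.
        if confidence < self.confidence_threshold:
            self.large_calls += 1
            result, source = self.large_model(query), "large"
        # 4) Memory growth: store the new pair for future queries.
        self.memory.append((query, result))
        return result, source


def toy_small_model(query: str, context: List[str]) -> Tuple[str, float]:
    """Toy stand-in: confident only when some memory context is available."""
    return f"small answer using {len(context)} memory item(s)", 0.9 if context else 0.1


def toy_large_model(query: str) -> str:
    """Toy stand-in for the expensive strong model."""
    return f"large answer to: {query}"
```

With the toy models, a first query escalates to the large model, an exact repeat is served from memory, and a close paraphrase is handled by the small model using the stored answer as context. The `large_calls` counter tracks how often the expensive model is invoked, which is the quantity such a system aims to reduce.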