Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework in which a lightweight model reuses previously generated answers and retrieves relevant supporting information to serve queries at low cost, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings: it supports answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining answer quality comparable to the strong-model baseline.
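To make the routing idea concrete, the following is a minimal sketch of a reuse, retrieve, then escalate loop of the kind the abstract describes. It is not the MemBoost implementation: the class name `MemoryRouter`, the three thresholds, the token-overlap similarity (a stand-in for an embedding-based retriever), the assumption that the small model returns a confidence score, and the choice to write only escalated answers back to memory are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a simple stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class MemoryRouter:
    # Hypothetical interfaces: the small model sees the query plus retrieved
    # context and returns (answer, confidence); the large model sees the query.
    small_model: Callable[[str, List[str]], Tuple[str, float]]
    large_model: Callable[[str], str]
    reuse_threshold: float = 0.9       # assumed: near-duplicate cutoff
    retrieve_threshold: float = 0.3    # assumed: minimum relevance for context
    confidence_threshold: float = 0.5  # assumed: below this, escalate
    memory: List[Tuple[str, str]] = field(default_factory=list)
    large_calls: int = 0

    def answer(self, query: str) -> str:
        # Score every stored (query, answer) pair against the new query.
        scored = sorted(
            ((jaccard(query, q), a) for q, a in self.memory),
            key=lambda pair: pair[0],
            reverse=True,
        )

        # 1. Answer reuse: a near-duplicate query returns the stored answer.
        if scored and scored[0][0] >= self.reuse_threshold:
            return scored[0][1]

        # 2. Cheap path: the small model answers with retrieved context.
        context = [a for s, a in scored if s >= self.retrieve_threshold][:3]
        if context:
            draft, confidence = self.small_model(query, context)
            if confidence >= self.confidence_threshold:
                return draft

        # 3. Escalation: call the strong model and grow the memory.
        self.large_calls += 1
        result = self.large_model(query)
        self.memory.append((query, result))
        return result
```

With stub models, a repeated query is served from memory and a near-duplicate is handled by the small model, so only the first query and an unrelated one reach the large model:

```python
router = MemoryRouter(
    small_model=lambda q, ctx: ("small:" + ctx[0], 0.8),
    large_model=lambda q: "large:" + q,
)
router.answer("what is the capital of France")       # escalated, stored
router.answer("what is the capital of France")       # reused from memory
router.answer("what is the capital city of France")  # small model + context
router.answer("explain quantum entanglement")        # escalated, stored
print(router.large_calls)
```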