LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.
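The abstract's invocation-distance idea can be sketched as a simple memory-planning policy: agents with a smaller estimated invocation distance (expected to issue an LLM request sooner) are prefetched onto the GPU, while those with the largest distances are evicted first. The class and method names below are hypothetical illustrations, not ScaleSim's actual API, and the distance estimates are assumed to be supplied by the workload analysis the paper describes.

```python
class InvocationDistanceManager:
    """Hedged sketch of invocation-distance-based prefetch and eviction.

    A smaller distance means the agent is estimated to issue its next
    LLM request sooner, so its GPU-resident state (model, prefix cache,
    adapter) should be kept or prefetched; larger distances are evicted
    first under memory pressure.
    """

    def __init__(self, gpu_slots: int):
        self.gpu_slots = gpu_slots   # max agent states resident on GPU
        self.distance = {}           # agent_id -> estimated invocation distance
        self.resident = set()        # agent states currently on GPU

    def update_distance(self, agent_id: str, dist: float) -> None:
        """Record a new distance estimate for an agent."""
        self.distance[agent_id] = dist

    def plan(self):
        """Return (prefetch, evict) lists so that the gpu_slots
        nearest-distance agents end up resident on the GPU."""
        ranked = sorted(self.distance, key=self.distance.get)
        target = set(ranked[: self.gpu_slots])
        # Prefetch nearest missing agents first; evict farthest residents first.
        prefetch = sorted(target - self.resident, key=self.distance.get)
        evict = sorted(self.resident - target,
                       key=self.distance.get, reverse=True)
        self.resident = target
        return prefetch, evict
```

For example, with two GPU slots and distance estimates `a=1, c=2, b=5`, the planner would prefetch `a` and `c`; if `a`'s estimate later grows past `b`'s, `a` is evicted to make room for `b`.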