Realistic evaluation of LLM serving systems requires online workloads, dynamic arrivals, queueing, and the serving engine's local scheduling for execution batching, but running such experiments on GPUs is expensive. Existing simulators reduce this cost, but they often operate offline or in time-warped mode, re-implement serving-engine schedulers, or require accurate operator- or kernel-level latency models. We present LLM-Emu, a serving-native emulator for vLLM that preserves the production HTTP, scheduling, KV-cache, and output-processing paths while replacing only GPU forward execution with profile-sampled latency and synthetic output tokens. Evaluated on two GPUs, four model variants, two model families, two attention backends, and both Poisson and bursty ShareGPT workloads, LLM-Emu closely tracks real vLLM serving behavior: time per output token (TPOT) and inter-token latency (ITL) stay within $4.8\%$ absolute error, end-to-end (E2E) latency within $5.3\%$, and output throughput within $1.9\%$. Time to first token (TTFT) is less stable, with a maximum error of $10.4\%$, reflecting its sensitivity to admission and queue state. These results suggest that lightweight, serving-native emulation can support practical online experimentation for LLM serving systems. LLM-Emu is open-sourced at https://github.com/AKafakA/llm-emu.
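The core idea of replacing only GPU forward execution with profile-sampled latency and synthetic output tokens can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen for illustration: the names `LatencyProfile`, `EmulatedExecutor`, and `execute_step`, the bucketing of latencies by total tokens per step, and the vocabulary size are assumptions, not vLLM's API and not LLM-Emu's actual implementation.

```python
import bisect
import random
import time
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LatencyProfile:
    """Hypothetical profile: measured step latencies (seconds), bucketed by tokens per step."""
    bucket_edges: List[int]       # ascending upper bounds on tokens processed in one step
    samples: List[List[float]]    # samples[i] holds latencies recorded for bucket i

    def sample(self, num_tokens: int, rng: random.Random) -> float:
        # First bucket whose upper bound covers num_tokens; oversized steps use the last bucket.
        i = min(bisect.bisect_left(self.bucket_edges, num_tokens), len(self.samples) - 1)
        return rng.choice(self.samples[i])


@dataclass
class EmulatedExecutor:
    """Stand-in for the GPU forward pass; scheduling and batching stay with the caller."""
    profile: LatencyProfile
    vocab_size: int = 32000       # assumed vocabulary size for synthetic tokens
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    sleep: bool = True            # set False to skip real waiting, e.g. in unit tests

    def execute_step(self, tokens_per_request: Dict[str, int]) -> Tuple[Dict[str, int], float]:
        total_tokens = sum(tokens_per_request.values())
        latency = self.profile.sample(total_tokens, self.rng)
        if self.sleep:
            time.sleep(latency)   # occupy wall-clock time in place of the GPU step
        # One synthetic token id per request in the batch.
        outputs = {rid: self.rng.randrange(self.vocab_size) for rid in tokens_per_request}
        return outputs, latency


if __name__ == "__main__":
    profile = LatencyProfile(
        bucket_edges=[64, 256, 1024],
        samples=[[0.010, 0.012], [0.020, 0.025], [0.050]],
    )
    executor = EmulatedExecutor(profile)
    tokens, step_latency = executor.execute_step({"req-a": 30, "req-b": 20})
    print(sorted(tokens), step_latency)
```

Because the sketch sleeps for the sampled latency instead of advancing a virtual clock, request arrivals and queueing around it would still unfold in real time, which is the property the abstract contrasts with offline or time-warped simulation.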