Deploying LLMs efficiently requires testing hundreds of serving configurations, but evaluating each one on a GPU cluster takes hours and costs thousands of dollars. Discrete-event simulators are faster and cheaper, but they require re-implementing the serving system's control logic -- a burden that compounds as frameworks evolve. We present Revati, a time-warp emulator that enables performance modeling by directly executing real serving system code at simulation-like speed. The system intercepts CUDA API calls to virtualize device management, allowing serving frameworks to run without physical GPUs. Instead of executing GPU kernels, it performs time jumps -- fast-forwarding virtual time by predicted kernel durations. We propose a coordination protocol that synchronizes these jumps across distributed processes while preserving causality. On vLLM and SGLang, Revati achieves less than 5% prediction error across multiple models and parallelism configurations, while running 5-17x faster than real GPU execution.
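The core ideas above, fast-forwarding a virtual clock by predicted kernel durations and synchronizing jumps across processes without violating causality, can be illustrated with a toy sketch. This is a minimal illustration under assumed semantics, not Revati's actual API; the names `VirtualClock` and `synchronized_jump` are hypothetical:

```python
class VirtualClock:
    """Toy model of a time-warp emulator's per-process clock (hypothetical;
    not Revati's real interface)."""

    def __init__(self):
        self.now = 0.0  # virtual time in milliseconds

    def launch_kernel(self, predicted_ms):
        # Instead of executing the GPU kernel, fast-forward virtual time
        # by the kernel's predicted duration.
        self.now += predicted_ms


def synchronized_jump(clocks, proposals):
    """Toy coordination rule: each process proposes a target timestamp,
    but all clocks only advance to the earliest proposal, so no process
    can observe an event from another process's future."""
    safe_time = min(proposals)
    for clock in clocks:
        # Advance, never rewind: a clock already past safe_time stays put.
        clock.now = max(clock.now, safe_time)
    return safe_time


# Example: one process "runs" two kernels without a GPU.
clock = VirtualClock()
clock.launch_kernel(3.5)   # predicted 3.5 ms kernel
clock.launch_kernel(1.5)   # predicted 1.5 ms kernel
print(clock.now)           # virtual time is 5.0 ms; wall time is ~0

# Example: three distributed workers coordinate a jump.
workers = [VirtualClock() for _ in range(3)]
synchronized_jump(workers, proposals=[10.0, 7.0, 12.0])
print([w.now for w in workers])  # all advanced to 7.0, the safe bound
```

The min-of-proposals rule stands in for the paper's coordination protocol: it is the standard conservative way to ensure no process's virtual clock runs ahead of an event another process has yet to emit.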