Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging because of a complex design space spanning parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies, since it allows "what-if" hypotheses to be validated. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show that Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and under 3.74% for training on a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its real-world value.