LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments, featuring diverse software frameworks and XPU architectures combined with dynamic workloads, make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion, multi-platform latency sculpting system. It breaks down inference latency across the pipeline stages, proactively alerts on latency anomalies, and guarantees adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months, enabling low-overhead, real-time monitoring at batch level with alerts triggered within milliseconds. It distinguishes workload-driven latency variations from anomalies indicating underlying issues with an F1-score of 0.98. We further conduct extensive experiments and root cause analysis investigations to demonstrate LatencyPrism's capability. Finally, we introduce the first LLM anomaly simulation toolkit to facilitate future research on robust and predictable inference systems.