LLM serving frameworks are evolving quickly, with complex software stacks and a large number of optimizations. This rapid development can introduce silent errors, in which output quality degrades without any explicit error signal. Diagnosing silent errors is notoriously difficult because of the substantial semantic gap between high-level symptoms and low-level root causes. We observe that the diagnosis of silent errors can be framed effectively as a differential debugging problem by leveraging semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target framework and a reference framework. We construct a benchmark of real-world silent errors from popular serving frameworks, on which Ekka achieves 80% pass@1 and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors in serving frameworks, all of which have been confirmed by the developers.
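To make the differential debugging framing concrete, the following is a minimal sketch of comparing aligned intermediate states from a target and a reference run and reporting the first point where they diverge. It is an illustration only, not Ekka's implementation: the trace format, the names `relative_error` and `first_divergence`, the name-based alignment, and the tolerance `tol` are all assumptions made for this example.

```python
from math import sqrt
from typing import List, Optional, Sequence, Tuple

# Hypothetical trace format (an assumption for this sketch): an ordered list
# of (state_name, flattened_values) pairs captured during one forward pass.
Trace = List[Tuple[str, Sequence[float]]]


def relative_error(target: Sequence[float], reference: Sequence[float]) -> float:
    """Relative L2 distance between two flattened states."""
    if len(target) != len(reference):
        return float("inf")  # shape mismatch counts as a divergence
    diff = sqrt(sum((t - r) ** 2 for t, r in zip(target, reference)))
    norm = sqrt(sum(r * r for r in reference))
    return diff / (norm + 1e-12)  # small epsilon avoids division by zero


def first_divergence(target: Trace, reference: Trace, tol: float = 1e-3) -> Optional[str]:
    """Return the name of the first aligned state that differs by more than tol.

    States are aligned by name here, which is a simplification; states that
    exist only in the reference trace are skipped.
    """
    target_states = dict(target)
    for name, ref_values in reference:
        if name not in target_states:
            continue  # no aligned counterpart in the target trace
        if relative_error(target_states[name], ref_values) > tol:
            return name
    return None  # no divergence found among the aligned states


# Toy traces: the target matches at "embed" and first deviates at "attn".
reference_trace: Trace = [("embed", [1.0, 2.0]), ("attn", [0.5, 0.5]), ("mlp", [3.0, 4.0])]
target_trace: Trace = [("embed", [1.0, 2.0]), ("attn", [0.5, 0.9]), ("mlp", [3.0, 7.0])]
print(first_divergence(target_trace, reference_trace))  # prints: attn
```

Reporting the earliest divergent state rather than every mismatch reflects the idea that downstream differences are usually consequences of the first one. Real frameworks rarely expose identically named states, so the alignment step would need considerably more care than the name lookup used here.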