As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today's LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even a basic question -- is this workload memory-bound or compute-bound? -- often goes unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers -- without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.