Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a summary metric of boundedness that relates device-active execution time to host-visible orchestration time. Across representative dense and mixture-of-experts (MoE) workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or device-side workload execution.