Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound: each decode step streams the model weights and the active KV cache, so per-token latency should scale inversely with peak memory bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7B to 8B-class GQA transformers on four NVIDIA GPUs (H100 SXM5, A100-80GB SXM4, L40S and L4) at context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak memory bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 achieves roughly 81 percent of peak, measured against its analytic memory floor, while an H100 achieves only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test for the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs speeds up decode by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, yet common quantised paths do not recover the expected 4x weight-traffic reduction: against a 62.32 ms/step bf16 baseline, bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step. Only GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, comes close, at 17.36 ms/step.
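The two quantities the abstract leans on, the analytic memory floor and the quantised speedups, can be sketched in a few lines. This is an illustration only, not the authors' definition: it assumes the floor is simply the bytes streamed per step (weights plus KV cache) divided by peak bandwidth. The function names and the model and hardware constants (about 7.6e9 parameters, 28 layers, 4 KV heads, head dimension 128, 300 GB/s for L4) are assumptions introduced here, not values taken from the abstract. Only the four latencies are quoted from it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Bytes of keys plus values read per decode step (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def memory_floor_ms(weight_bytes, kv_bytes, peak_bw_gb_s):
    """Per-step latency if every byte were streamed exactly once at peak bandwidth."""
    return (weight_bytes + kv_bytes) / (peak_bw_gb_s * 1e9) * 1e3


def achieved_fraction(floor_ms, measured_ms):
    """Fraction of peak bandwidth achieved, taken as floor / measured latency."""
    return floor_ms / measured_ms


# Latencies quoted in the abstract for L4 (ms/step).
L4_BF16_MS = 62.32
L4_QUANT_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}

# Speedup of each quantised path over the bf16 baseline.
speedups = {name: L4_BF16_MS / ms for name, ms in L4_QUANT_MS.items()}

# ASSUMED values for illustration; none of these appear in the abstract.
ASSUMED_PARAMS = 7.6e9       # parameter count of a 7B-class model
ASSUMED_L4_BW_GB_S = 300.0   # nominal L4 peak memory bandwidth
assumed_kv = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=2048)
assumed_floor = memory_floor_ms(ASSUMED_PARAMS * 2, assumed_kv, ASSUMED_L4_BW_GB_S)

if __name__ == "__main__":
    for name, s in speedups.items():
        print(f"{name}: {s:.2f}x over bf16")
    print(f"assumed bf16 floor on L4: {assumed_floor:.1f} ms/step")
    print(f"assumed achieved fraction: {achieved_fraction(assumed_floor, L4_BF16_MS):.2f}")
```

From the quoted latencies alone, the speedups come out at about 1.05x for bnb-nf4, 1.38x for AutoAWQ+Marlin and 3.59x for GPTQ+ExLlamaV2. Only the last is near the nominal 4x reduction in weight traffic.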
