Beyond CPU-GPU Frequency: Memory-Clock and Tail Effects in Edge Inference Latency Estimation

arXiv preprint; 12 pages, 9 figures, 5 tables. Code and data: https://github.com/dankang21/jetson-latency-lab ; traces: https://doi.org/10.5281/zenodo.20694688

Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We present a measurement study on an NVIDIA Jetson Orin Nano showing three phenomena outside this modeling scope. (1) The memory clock is a missing axis: across the realistic upper EMC range (2133 to 3199 MHz) it shifts median latency by +11% to +48% depending on workload, and for a synthetic L2-resident kernel at the top GPU clock we observe a reproducible non-monotonic case (-9%). A GPU-frequency estimator profiled under one power profile and deployed under another consequently underestimates latency by up to 32%; tabulating the four lockable EMC points repairs most workloads, while a parametric 1/f_emc term does not. (2) Aggregate miss rates hide bursts: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliffs span ~1 ms, yet misses cluster far more than independence would predict: at a 0.1% aggregate miss rate, the cycle following a miss also misses with probability up to 74% (740x the independent baseline). Gaussian mu+3sigma margins overshoot a 0.1% miss target by 13x-29x, while out-of-sample generalized Pareto margins stay within ~2x of it across all eight configurations. (3) Frequency actuation is not free: per-domain transition stalls stay below 100 us, but the new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect, a substantial fraction of typical inference periods for per-inference governors. We release the full measurement harness and discuss implications for the next generation of frequency-aware estimators and governors.
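To make the burstiness comparison in (2) concrete, the following minimal sketch computes an aggregate miss rate, the probability that a cycle misses given that the previous one did, and the ratio of the two. Under independent misses that ratio would be about 1; the abstract's 74% at a 0.1% aggregate rate corresponds to 0.74 / 0.001 = 740. The function name `miss_burstiness`, its parameters, and the toy trace are hypothetical and for illustration only; this is not the released harness.

```python
from typing import Dict, Sequence


def miss_burstiness(latencies_ms: Sequence[float], deadline_ms: float) -> Dict[str, float]:
    """Return aggregate miss rate, P(miss at t+1 | miss at t), and their ratio.

    Hypothetical helper: a cycle counts as a miss when its latency
    exceeds the deadline.
    """
    misses = [lat > deadline_ms for lat in latencies_ms]
    if len(misses) < 2:
        raise ValueError("need at least two cycles")

    aggregate = sum(misses) / len(misses)

    # Consider consecutive pairs (t, t+1) in which cycle t missed.
    prior_misses = sum(misses[:-1])
    both_miss = sum(1 for a, b in zip(misses, misses[1:]) if a and b)

    conditional = both_miss / prior_misses if prior_misses else float("nan")
    ratio = conditional / aggregate if aggregate > 0 else float("nan")
    return {"aggregate": aggregate, "conditional": conditional, "ratio": ratio}


# Toy trace (made-up numbers): 1000 cycles with one burst of 4 misses.
toy_trace = [10.0] * 500 + [12.0] * 4 + [10.0] * 496
stats = miss_burstiness(toy_trace, deadline_ms=11.0)
print(stats)
```

In the toy trace the four misses are adjacent, so three of the four missed cycles are followed by another miss even though the aggregate rate is only 0.4%. A margin sized from the aggregate rate alone would not reveal that clustering.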
