Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We present a measurement study on an NVIDIA Jetson Orin Nano showing three phenomena outside this modeling scope. (1) The memory clock is a missing axis: across the realistic upper EMC range (2133 to 3199 MHz) it shifts median latency by +11% to +48% depending on workload, and for a synthetic L2-resident kernel at the top GPU clock we observe a reproducible non-monotonic case (-9%). A GPU-frequency estimator profiled under one power profile and deployed under another consequently underestimates latency by up to 32%; tabulating the four lockable EMC points repairs most workloads, while a parametric 1/f_emc term does not. (2) Aggregate miss rates hide bursts: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliffs span ~1 ms, yet misses cluster far more than independence would predict: at a 0.1% aggregate miss rate, the next cycle also misses with probability up to 74% (740x the independent baseline). Gaussian mu+3sigma margins overshoot a 0.1% miss target by 13x to 29x, while out-of-sample generalized Pareto margins stay within ~2x of it across all eight configurations. (3) Frequency actuation is not free: per-domain transition stalls stay below 100 us, but the new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect, a substantial fraction of typical inference periods for per-inference governors. We release the full measurement harness and discuss implications for the next generation of frequency-aware estimators and governors.
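The burst statistic in (2) compares the probability that a cycle misses its deadline given that the previous cycle missed against the aggregate miss rate; under independence the two would be equal, so 74% against 0.1% gives the 740x ratio. A minimal sketch of how such a statistic could be computed from a latency trace follows. The function name `burst_stats` and the synthetic trace are illustrative assumptions, not part of the released harness.

```python
from typing import Sequence, Tuple


def burst_stats(latencies_ms: Sequence[float], deadline_ms: float) -> Tuple[float, float, float]:
    """Return (aggregate miss rate, P(next misses | current missed), ratio).

    Hypothetical helper for illustration only; it is not taken from the
    paper's measurement harness.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two cycles")
    misses = [lat > deadline_ms for lat in latencies_ms]
    miss_rate = sum(misses) / len(misses)
    # Count cycles that missed and have a successor, and how often that successor missed too.
    missed_with_successor = sum(misses[:-1])
    back_to_back = sum(1 for cur, nxt in zip(misses, misses[1:]) if cur and nxt)
    cond = back_to_back / missed_with_successor if missed_with_successor else 0.0
    # Under independence cond equals miss_rate, so the ratio would be about 1.
    ratio = cond / miss_rate if miss_rate else 0.0
    return miss_rate, cond, ratio


if __name__ == "__main__":
    # Synthetic trace (assumed data): 1000 cycles with one burst of four consecutive misses.
    trace = [1.0] * 500 + [5.0] * 4 + [1.0] * 496
    rate, cond, ratio = burst_stats(trace, deadline_ms=2.0)
    print(f"miss rate={rate:.4f}  P(miss|prev miss)={cond:.2f}  ratio={ratio:.1f}x")
```

On a real trace the ratio is only as reliable as the number of observed misses, so at low aggregate miss rates long runs are needed, which is consistent with the 100k-cycle runs described above.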