Autoregressive inference is typically assumed to scale predictably with decoding length, and key-value (KV) caching is widely regarded as a universally beneficial optimization for accelerating decoding. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations. Using transformer models from multiple families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies are not explained by memory pressure or prefill cost, but are instead consistent with backend execution dynamics, while CPU and NVIDIA T4 (CUDA) exhibit smooth monotonic scaling under identical conditions. Our findings highlight the importance of hardware-aware evaluation for autoregressive inference and caution against relying on aggregated decoding-budget benchmarks, as performance can vary discontinuously across nearby configurations.
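The measurement idea summarized above, timing decoding at many nearby budgets and checking whether latency grows monotonically, can be illustrated with a minimal sketch. This is not the authors' benchmark harness: the names `sweep_latency` and `find_non_monotonic`, the `generate` callable, and the `ratio` threshold are all hypothetical, and treating the "decoding budget" as a maximum number of new tokens is an assumption.

```python
import time
from statistics import median
from typing import Callable, Dict, List, Sequence, Tuple


def sweep_latency(
    generate: Callable[[int], object],
    budgets: Sequence[int],
    repeats: int = 3,
    timer: Callable[[], float] = time.perf_counter,
) -> Dict[int, float]:
    """Return the median wall-clock latency of generate(budget) per budget.

    `generate` is a hypothetical callable that runs one full decode with the
    given budget and blocks until the device has finished its work.
    """
    results: Dict[int, float] = {}
    for budget in budgets:
        samples = []
        for _ in range(repeats):
            start = timer()
            generate(budget)
            samples.append(timer() - start)
        results[budget] = median(samples)
    return results


def find_non_monotonic(
    latency: Dict[int, float], ratio: float = 2.0
) -> List[Tuple[int, float]]:
    """Flag budgets that are at least `ratio` times slower than some LARGER budget.

    Under smooth monotonic scaling a larger budget should never be much
    faster than a smaller one, so every flagged budget marks an anomaly.
    Returns (budget, slowdown relative to the fastest larger budget).
    """
    flagged: List[Tuple[int, float]] = []
    fastest_later = float("inf")
    # Scan from the largest budget down, tracking the fastest latency seen.
    for budget in sorted(latency, reverse=True):
        current = latency[budget]
        if fastest_later > 0 and current >= ratio * fastest_later:
            flagged.append((budget, current / fastest_later))
        fastest_later = min(fastest_later, current)
    flagged.reverse()  # report in ascending budget order
    return flagged
```

When adapting this to a real model, `generate` would typically wrap something like a Hugging Face `model.generate(..., max_new_tokens=budget)` call. On asynchronous backends the wrapper should synchronize the device before returning (for example `torch.mps.synchronize()` or `torch.cuda.synchronize()` in PyTorch), otherwise the timer may stop before the work completes. Reporting per-budget medians rather than a single aggregate is what makes a localized spike visible.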