Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as the generated sequence grows. In this work, we identify unexpected non-monotonic latency behavior on the Apple MPS backend, where latency during transformer decoding changes abruptly between nearby decoding configurations. Across multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily in the decode phase rather than in prefill, are not explained by memory pressure alone, and are absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) caching interacts strongly with these pathological execution regimes: KV caching remains beneficial overall, but its practical speedup collapses sharply within anomalous configurations, while cache-disabled decoding still exhibits residual non-monotonic behavior. These findings suggest that autoregressive decoding on MPS enters discrete execution regimes that coarse-grained benchmarking does not capture, highlighting the importance of hardware-aware evaluation for long-context inference.
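As an illustration of what "non-monotonic" means operationally, the following minimal sketch flags decoding budgets whose measured latency is much higher than the latency at some larger budget, which should not happen if latency grows smoothly with decoding length. The function name `find_nonmonotonic_spikes`, the `ratio_threshold` parameter, and the sample numbers are assumptions made for this example; they are not the detection procedure or the measurements used in this work.

```python
def find_nonmonotonic_spikes(budgets, latencies, ratio_threshold=2.0):
    """Flag budgets that are slower than some larger budget by >= ratio_threshold.

    Under smooth scaling, latency should not decrease as the decoding budget
    grows, so a point far above a later (larger-budget) point is anomalous.
    Returns a list of (budget, slowdown_ratio) pairs in increasing budget order.
    """
    if len(budgets) != len(latencies):
        raise ValueError("budgets and latencies must have the same length")

    pairs = sorted(zip(budgets, latencies))  # order by decoding budget
    spikes = []
    future_min = float("inf")  # smallest latency seen at any larger budget

    # Walk from the largest budget down, tracking the running minimum.
    for budget, latency in reversed(pairs):
        if future_min > 0 and latency / future_min >= ratio_threshold:
            spikes.append((budget, latency / future_min))
        future_min = min(future_min, latency)

    spikes.reverse()
    return spikes


# Hypothetical latencies in seconds (synthetic, for illustration only).
sample_budgets = [64, 128, 192, 256, 320]
sample_latencies = [1.0, 2.0, 30.0, 28.0, 4.0]
print(find_nonmonotonic_spikes(sample_budgets, sample_latencies))
```

Comparing each point against the minimum over all larger budgets, rather than only against its immediate neighbors, lets the check flag an entire anomalous interval instead of a single isolated point.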