Distributed LLM serving systems optimize per-request latency and throughput. However, under long-context workloads, inference accuracy becomes more variable. When incorrect responses trigger retries, accuracy directly translates into cumulative user-visible delay that is not captured by single-shot latency metrics. In this work, we argue that under long-context serving, \textbf{accuracy becomes speed} through retry dynamics. We introduce \textit{Time-to-Correct-Answer (TTCA)}, a metric that measures the wall-clock time required to obtain the first correct response. Our measurement study shows that prompt characteristics such as length and language amplify accuracy variance, which inflates TTCA. We demonstrate \textit{Lightweight Accuracy-Aware Routing (LAAR)}, a capability-based routing design that reduces TTCA. Our results suggest that in long-context distributed serving, accuracy should be treated as a first-class systems objective.
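
As a back-of-the-envelope illustration of how retries couple accuracy to latency, consider a simplified model that is an assumption made here, not a model stated in the source: each attempt takes a fixed time $t$, succeeds independently with probability $p$, and is retried until the first correct response. The number of attempts is then geometric with mean $1/p$, so
\[
\mathbb{E}[\mathrm{TTCA}] \;=\; \sum_{k=1}^{\infty} k\,t\,p\,(1-p)^{k-1} \;=\; \frac{t}{p}.
\]
Under these hypothetical numbers, a backend with $t = 2$\,s and $p = 0.5$ has an expected TTCA of $4$\,s, while a slower but more accurate backend with $t = 3$\,s and $p = 0.9$ has an expected TTCA of about $3.3$\,s. In this sketch the backend with the worse single-shot latency reaches a correct answer sooner, which is the sense in which accuracy can act as speed.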