Deployed multi-turn LLM systems routinely switch models mid-interaction because of upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, which can induce silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for the early turns and a suffix model for the final turn, then comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. On the CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent, statistically significant, directional effects. It can swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and by ±4 absolute F1 points on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs. GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To support compact monitoring of handoff risk, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, which together account for roughly 70% of the variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
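The two statistical steps named above, the paired episode-level bootstrap and the prefix/suffix decomposition of drift, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `paired_bootstrap_ci` and `additive_decomposition` are hypothetical, the percentile bootstrap is an assumed choice of interval, and the unweighted least-squares additive fit is only one plausible reading of "decompose".

```python
import numpy as np


def paired_bootstrap_ci(switch_scores, baseline_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-episode difference (switch - baseline).

    Both arrays must be aligned by episode, so that resampling an index
    draws the switched and no-switch score of the same episode together.
    """
    diffs = (np.asarray(switch_scores, dtype=float)
             - np.asarray(baseline_scores, dtype=float))
    rng = np.random.default_rng(seed)
    n = diffs.size
    # Each row is one bootstrap resample of episode indices.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi


def additive_decomposition(drift):
    """Fit drift[i, j] ~ prefix[i] + suffix[j] over off-diagonal cells.

    Assumption: drift[i, j] is the score of suffix model j when given
    prefix model i's history, minus model j's own no-switch score.
    Diagonal (no-switch) cells are excluded from the fit.
    """
    drift = np.asarray(drift, dtype=float)
    k = drift.shape[0]
    rows, cols = np.where(~np.eye(k, dtype=bool))
    # One-hot design matrix: k prefix columns followed by k suffix columns.
    X = np.zeros((rows.size, 2 * k))
    X[np.arange(rows.size), rows] = 1.0
    X[np.arange(rows.size), k + cols] = 1.0
    y = drift[rows, cols]
    # The design is rank-deficient (a constant can move between the two
    # term sets); lstsq returns the minimum-norm solution, and the fitted
    # values are unique regardless.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot  # share of off-diagonal variance explained
    return coef[:k], coef[k:], r2
```

In this sketch, a confidence interval from `paired_bootstrap_ci` that excludes zero would flag a cell of the switch matrix as a significant directional effect, and the `r2` returned by `additive_decomposition` plays the role of the explained-variance figure quoted above.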