Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation: it induces a 26 percentage point (pp) collapse on one model while leaving another nearly unchanged (close to 0 pp). This variability shows that critic accuracy alone does not determine whether intervention is safe. We identify a disruption-recovery tradeoff: an intervention may recover a failing trajectory, but it may also disrupt a trajectory that would otherwise have succeeded. Building on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: on high-success tasks, intervention leaves performance flat or degrades it (0 to -26 pp), while on the high-failure ALFWorld benchmark it yields a modest improvement (+2.8 pp, p=0.014). The primary value of our framework is therefore in identifying when not to intervene, preventing severe regressions before deployment.
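The disruption-recovery tradeoff can be made concrete with a small sketch. The abstract does not specify how the pilot test is computed, so the decomposition below is an assumption chosen for illustration: with baseline success rate p, recovery rate r (failures turned into successes), and disruption rate d (successes turned into failures), the net change in success rate is (1 - p) * r - p * d. The names `PilotEstimate` and `estimate_pilot` are hypothetical, and the example data is invented, not taken from the reported experiments.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PilotEstimate:
    baseline_success: float  # p: success rate without intervention
    recovery_rate: float     # r: P(success with critic | failure without)
    disruption_rate: float   # d: P(failure with critic | success without)
    net_change_pp: float     # ((1 - p) * r - p * d) in percentage points


def estimate_pilot(baseline: Sequence[bool],
                   intervened: Sequence[bool]) -> PilotEstimate:
    """Estimate the net effect of intervention from paired pilot outcomes.

    Assumption (not from the source): each pilot task is run twice, once
    without and once with the critic, and outcomes are compared per task.
    """
    if len(baseline) != len(intervened) or len(baseline) == 0:
        raise ValueError("need two non-empty outcome lists of equal length")

    n = len(baseline)
    n_success = sum(baseline)
    n_fail = n - n_success

    # Failing trajectories that the intervention turned into successes.
    recovered = sum(1 for b, i in zip(baseline, intervened) if not b and i)
    # Succeeding trajectories that the intervention turned into failures.
    disrupted = sum(1 for b, i in zip(baseline, intervened) if b and not i)

    p = n_success / n
    r = recovered / n_fail if n_fail else 0.0
    d = disrupted / n_success if n_success else 0.0
    net = (1 - p) * r - p * d
    return PilotEstimate(p, r, d, 100 * net)


# Invented 50-task pilot: 40 baseline successes, 10 baseline failures.
# The critic disrupts 8 of the successes and recovers 3 of the failures.
baseline = [True] * 40 + [False] * 10
intervened = [False] * 8 + [True] * 32 + [True] * 3 + [False] * 7
estimate = estimate_pilot(baseline, intervened)
print(f"p={estimate.baseline_success:.2f}, r={estimate.recovery_rate:.2f}, "
      f"d={estimate.disruption_rate:.2f}, net={estimate.net_change_pp:+.1f} pp")
```

In this invented pilot, the high baseline success rate leaves many trajectories exposed to disruption and few available for recovery, so even a moderate disruption rate outweighs the recoveries. With only 50 tasks the estimated rates are noisy, so a sketch like this is better read as a coarse help-or-harm signal than as a precise forecast.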