interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification

Vishak K Bhat,Prateek Chanda,Vijval Ekbote,Ashmit Khandelwal,Maitreyi Swaroop,Vineeth N. Balasubramanian,Subbarao Kambhampati,Nagarajan Natarajan,Amit Sharma

from arxiv, 56 pages, 6 figures

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate reasoning traces. It addresses two key challenges. First, given a set of verifiers, obtaining verifiable states from the reasoning trace typically requires prompt engineering or external task decomposition into fixed steps. Instead, we propose a monitoring system that periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers are run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur. Second, beyond math and code, a central challenge for process verification is the scarcity of verifiers. interwhen addresses this through automatic verifier synthesis from natural-language policy documents. Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and z3. Together, these contributions yield a plug-and-play test-time verification system that can improve task completion and policy compliance of any reasoning agent. On reasoning benchmarks where policies encode mathematical or logical constraints, interwhen achieves near-perfect accuracy for reasoning models using a fraction of the tokens of baselines. On agentic benchmarks with policy-based verifier generation, it enables improvements in task quality for SLMs without any finetuning, e.g., task completion rate of Qwen3-30B jumps from 32% to 87% on the telecom domain in tau2-bench. Code at https://github.com/microsoft/interwhen.

翻译：推理模型会产生长序列的中间决策和工具调用，这使得测试时验证对确保正确性至关重要。现有方法要么仅验证最终答案（从而遗漏早期错误），要么依赖分支验证策略探索多条轨迹。我们提出interwhen——一种单轨迹验证框架，通过向中间推理过程提供反馈来引导模型行为。该框架解决两个核心挑战：第一，在给定验证器集合的情况下，从推理轨迹获取可验证状态通常需要提示工程或通过外部任务分解为固定步骤。为此，我们设计监控系统周期性轮询推理轨迹，并fork推理模型的推理过程以恢复中间状态。验证器与推理生成异步运行，在正确执行时仅增加极小开销，仅在发生违规时干预。第二，除数学与代码外，过程验证的核心挑战在于验证器的稀缺性。interwhen通过从自然语言政策文档自动合成验证器解决此问题。给定政策文档，它能生成基于代码的验证器（包括Lean和z3中可证明正确的验证器）。这些贡献共同构成即插即用的测试时验证系统，可提升任意推理代理的任务完成度与政策合规性。在编码数学或逻辑约束的推理基准测试中，interwhen使推理模型使用基线极少的令牌即可达到近完美准确率；在基于策略验证器生成的智能体基准中，它无需微调即可提升小型语言模型的任务质量——例如tau2-bench电信领域，Qwen3-30B任务完成率从32%跃升至87%。代码见https://github.com/microsoft/interwhen。