Process reward models (PRMs) require supervision that identifies not only whether a reasoning trajectory is correct, but also where the reasoning first becomes unsupported by its prefix. We frame this requirement as verifiable counterfactual process supervision: paired correct and erroneous trajectories in which the first invalid transition is known, the error mechanism is controlled, and the downstream continuation remains coherent under the corrupted state. Starting from a verified symbolic reasoning chain, our method injects a template-aware error at a selected intermediate step, recomputes all subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its original prefix. The resulting trajectories carry prefix-valid first-error annotations and are translated into aligned natural-language processes for PRM training and evaluation. In experiments, the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and show preliminary transfer to mathematical process evaluation.
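
The inject, recompute, and verify loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the symbolic chain is reduced to a toy arithmetic program, the "template" is a hypothetical wrong-operator substitution, and the names `Step`, `execute`, and `inject_error` are assumptions introduced only for this example.

```python
from dataclasses import dataclass
import operator

# Toy operator table; a stand-in for the symbolic inference rules (assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Step:
    op: str        # operator the step claims to apply
    operand: int   # right-hand operand
    result: int    # value the step claims to produce


def execute(start, program):
    """Run (op, operand) pairs from `start`; every step here is verified by construction."""
    steps, value = [], start
    for op, operand in program:
        value = OPS[op](value, operand)
        steps.append(Step(op, operand, value))
    return steps


def inject_error(start, program, k, wrong_op):
    """Corrupt step k with `wrong_op`, then recompute all later steps.

    Returns (clean, corrupted, first_error_index), or None when the
    "corrupted" value coincides with the value the prefix supports.
    """
    clean = execute(start, program)
    prev = clean[k - 1].result if k > 0 else start
    op, operand = program[k]
    bad_value = OPS[wrong_op](prev, operand)

    # Verification: the injected result must differ from the valid one,
    # otherwise step k would still be derivable from its prefix.
    if bad_value == clean[k].result:
        return None

    # Keep the valid prefix, insert the corrupted step, then continue
    # coherently from the corrupted state.
    corrupted = clean[:k] + [Step(op, operand, bad_value)]
    value = bad_value
    for later_op, later_operand in program[k + 1:]:
        value = OPS[later_op](value, later_operand)
        corrupted.append(Step(later_op, later_operand, value))
    return clean, corrupted, k


program = [("+", 3), ("*", 4), ("-", 5)]
clean, corrupted, first_error = inject_error(2, program, k=1, wrong_op="+")
print([s.result for s in clean])      # [5, 20, 15]
print([s.result for s in corrupted])  # [5, 9, 4]
print(first_error)                    # 1
```

In this toy setting the steps before index `first_error` are identical in both trajectories, the step at `first_error` is the first invalid transition, and later steps are computed correctly from the corrupted value. The `None` branch covers injections that happen to reproduce the valid result (for example `2 + 2` versus `2 * 2`), which would not constitute a real error.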