Large language models produce fluent but often incorrect multi-step reasoning, and naive correction methods risk degrading already-correct answers. We introduce Denoising Iterative Self-Correction (DISC), a test-time procedure that treats the outputs of verification questions as noisy measurements of where a solution may be corrupted. Using these signals, DISC progressively reduces errors across multiple verify-judge-correct passes, analogous to traditional iterative denoising. A binary judgment gate controls correction precision by blocking rewrites that would damage already-correct answers, while the verifier and corrector together repair errors. We evaluate this trade-off using two paired diagnostics: an improvement-to-degradation ratio (precision) and a repair rate (recall). Across three benchmarks (BIG-Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC dominates Chain-of-Verification and Self-Refine on the precision-recall trade-off. On BIG-Bench Mistake with Sonnet 4.5, it reaches 81.6% accuracy with 13x more improvements per degradation than Chain-of-Verification and 5x more than Self-Refine. On GPQA Diamond, we identify a capability floor below which judges acknowledge contradictions in evidence but cannot translate that recognition into a correction. We further show that cross-model role allocation, which assigns verification and judgment to a model different from the generator, mitigates self-confirmation bias.
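
As an illustration of the two paired diagnostics, the following minimal sketch computes them from per-example correctness flags recorded before and after correction. The definitions used here are assumptions, since the abstract does not spell them out: an improvement is taken to be an answer that goes from wrong to right, a degradation one that goes from right to wrong, and the repair rate the fraction of initially wrong answers that end up right. The function name `correction_diagnostics` and its return keys are hypothetical.

```python
from typing import Dict, Optional, Sequence


def correction_diagnostics(
    before: Sequence[bool], after: Sequence[bool]
) -> Dict[str, Optional[float]]:
    """Compute assumed versions of the two diagnostics.

    before[i] / after[i] say whether example i was answered correctly
    before / after the correction procedure was applied.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")

    # Assumed definitions (not confirmed by the source):
    #   improvement = wrong -> right, degradation = right -> wrong.
    improvements = sum(1 for b, a in zip(before, after) if not b and a)
    degradations = sum(1 for b, a in zip(before, after) if b and not a)
    initially_wrong = sum(1 for b in before if not b)

    # Precision-like diagnostic: improvements per degradation.
    # Left undefined (None) when no answer was degraded.
    ratio = improvements / degradations if degradations else None

    # Recall-like diagnostic: share of initially wrong answers repaired.
    # Left undefined (None) when nothing was wrong to begin with.
    repair_rate = improvements / initially_wrong if initially_wrong else None

    return {
        "improvements": float(improvements),
        "degradations": float(degradations),
        "improvement_to_degradation": ratio,
        "repair_rate": repair_rate,
    }


if __name__ == "__main__":
    # Toy data: six examples, three wrong at the start.
    before = [True, True, False, False, False, True]
    after = [True, False, True, True, False, True]
    print(correction_diagnostics(before, after))
```

Reporting the two numbers together matters under these assumed definitions: a method that never rewrites anything has no degradations but a repair rate of zero, while a method that rewrites everything may repair many errors at the cost of a low improvement-to-degradation ratio.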