Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.
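To make the control mechanism concrete, here is a minimal sketch of what a lightweight Conv1D controller over a full trace of per-token confidences might look like. All names, shapes, and the random (untrained) weights are our own illustrative assumptions, not the paper's actual architecture; the only fixed structure taken from the abstract is a Conv1D feature extractor atop frozen-LLM confidence scores and a three-way decision (halt, re-examine, or try a different approach).

```python
import numpy as np

# Hypothetical sketch of a CoRefine-style controller (names/shapes are ours).
# It maps a full trace of per-token confidences to one of three actions.
ACTIONS = ["halt", "re-examine", "try-different-approach"]

class Conv1DController:
    def __init__(self, n_filters=32, kernel=5, seed=0):
        rng = np.random.default_rng(seed)
        # One Conv1D layer over the confidence sequence, then global average
        # pooling and a linear head over three actions. The weights here are
        # random placeholders; the real ~211k-parameter controller is trained.
        self.w = rng.standard_normal((n_filters, kernel)) * 0.1
        self.b = np.zeros(n_filters)
        self.head = rng.standard_normal((3, n_filters)) * 0.1
        self.kernel = kernel

    def __call__(self, conf):
        conf = np.asarray(conf, dtype=float)
        # Valid 1D convolution: slide a window of size `kernel` over the trace.
        T = len(conf) - self.kernel + 1
        windows = np.stack([conf[t:t + self.kernel] for t in range(T)])  # (T, k)
        feats = np.maximum(windows @ self.w.T + self.b, 0.0)  # ReLU, (T, n_filters)
        pooled = feats.mean(axis=0)                           # global average pool
        logits = self.head @ pooled                           # (3,)
        return ACTIONS[int(np.argmax(logits))]

controller = Conv1DController()
trace = [0.9, 0.95, 0.6, 0.4, 0.85, 0.9, 0.97, 0.99]  # made-up confidences
action = controller(trace)  # one of the three actions above
```

In a refinement loop, the frozen LLM would generate a reasoning trace, the controller would read the trace's confidence sequence, and generation would either stop ("halt") or continue with a targeted revision, repeating until halt (on average 2.7 steps per problem, per the abstract).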