Large language models are increasingly deployed in high-stakes tasks, where confident yet incorrect inferences may cause severe real-world harm. This brings the previously overlooked issue of confidence faithfulness back to the forefront. A promising solution is to jointly optimize unsupervised Reinforcement Learning from Internal Feedback (RLIF) with reasoning-trace-guided Reasoning Distillation (RD), but this combination faces three persistent challenges: scarcity of high-quality training corpora, factually unwarranted overconfidence, and indiscriminate fusion that amplifies erroneous updates. Inspired by how human confidence accumulates from uncertainty to certainty, we propose Progressive Reasoning Gain (PRG), a metric that measures whether reasoning steps progressively strengthen support for the final answer. Furthermore, we introduce HyTuning, a hybrid post-training framework that adaptively reweights RD and RLIF via a PRG-style metric, using scarce supervised reasoning traces as a stable anchor while exploiting abundant unlabeled queries for scalability. Experiments on several domain-specific and general benchmarks demonstrate that HyTuning improves accuracy while achieving confidence faithfulness under limited supervision, supporting a practical "Less Approximates More" effect.