Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches typically apply a uniform optimization objective to all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty, estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning, using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with lower variance while requiring substantially less computation and wall-clock training time. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
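The routing decision described above can be sketched as follows: estimate consensus as the majority-vote agreement fraction over sampled trajectory answers, then send high-consensus inputs to the SFT branch (with the majority answer as pseudo-label) and low-consensus inputs to the RL branch. This is a minimal illustrative sketch; the function names and the threshold value are assumptions for exposition, not details taken from the paper.

```python
from collections import Counter

def consensus_score(answers):
    """Return (majority answer, fraction of sampled trajectories agreeing on it).

    `answers` holds the final answers extracted from sampled reasoning
    trajectories for a single input. Agreement fraction serves as a proxy
    for instance-level epistemic uncertainty (high agreement = low uncertainty).
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

def route(answers, threshold=0.7):
    """Allocate a test-time optimization strategy per input.

    High-consensus inputs are consolidated via SFT on the majority-agreed
    pseudo-label; low-consensus inputs go to the RL branch. The threshold
    of 0.7 is a hypothetical choice for illustration.
    """
    label, score = consensus_score(answers)
    if score >= threshold:
        return ("sft", label)  # consolidate: majority answer as pseudo-label
    return ("rl", None)        # explore: consensus-regularized RL objective
```

For example, `route(["42"] * 8 + ["7"] * 2)` routes to the SFT branch with pseudo-label `"42"`, while `route(["a", "b", "c", "d"])` routes to the RL branch.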