Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. With TRT, open-source models reach 100% accuracy on AIME-25/24, and closed-source models improve by 10.4-14.8 percentage points on LiveCodeBench's most difficult problems, all without external feedback.
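To make the loop concrete, below is a minimal Python sketch of the kind of iteration the abstract describes: generate diverse rollouts conditioned on per-rollout strategies and accumulated knowledge, score them with self-generated verification signals, and feed the survivors back into the next round. Everything here is an assumption for illustration; the names (`generate`, `verify`, `trt`), the strategy list, and the selection rule are hypothetical stand-ins for the paper's actual prompting and verification steps, with LLM calls replaced by stubs.

```python
from dataclasses import dataclass, field

@dataclass
class TRTState:
    """Context carried across rounds (hypothetical structure)."""
    knowledge: list[str] = field(default_factory=list)   # lessons from prior rounds
    candidates: list[str] = field(default_factory=list)  # surviving solutions

def generate(problem: str, strategy: str, state: TRTState) -> str:
    """Stub for an LLM call conditioned on a rollout-specific strategy
    and the knowledge accumulated so far."""
    return f"solution[{strategy}|known={len(state.knowledge)}]"

def verify(problem: str, candidate: str) -> float:
    """Stub for a self-generated verification signal, e.g. the model
    critiquing or re-deriving the candidate (no ground truth used)."""
    return 0.5

def trt(problem: str, strategies: list[str], rounds: int = 3, keep: int = 2) -> str:
    state = TRTState()
    for _ in range(rounds):
        # (i) diverse generation: one rollout per strategy
        rollouts = [generate(problem, s, state) for s in strategies]
        # (ii) selection without ground truth: rank by self-verification score
        ranked = sorted(rollouts, key=lambda c: verify(problem, c), reverse=True)
        state.candidates = ranked[:keep]
        # recursion: survivors become knowledge for the next round's generation
        state.knowledge.append(f"kept: {state.candidates}")
    return state.candidates[0]

print(trt("toy problem", ["direct", "case-split", "work-backwards"]))
```

The structural point this sketch tries to capture is that each round's survivors and lessons re-enter the next round's generation context, which is what makes the thinking recursive rather than a one-shot best-of-n sample.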