We study the reasoning behavior of large language models (LLMs) under limited computation budgets. In such settings, producing useful partial solutions quickly is often more practical than exhaustive reasoning, which incurs high inference costs. Many real-world tasks, such as trip planning, require models to deliver the best possible output within a fixed reasoning budget. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. To further enhance efficiency, we propose an inference-time self-improvement method using LLM-synthesized preference data, where models learn from their own reasoning comparisons to produce better intermediate solutions. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models, improving both reasoning quality and efficiency under budget constraints.
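To make the idea behind the Anytime Index concrete, the sketch below scores a run by the normalized area under its quality-versus-tokens curve. This is only one plausible reading of "how effectively solution quality improves as reasoning tokens increase"; the abstract does not give the formula, so the paper's actual definition may differ. The function name `anytime_index`, the `(tokens, quality)` checkpoint format, the step-wise curve, and the assumption that quality lies in [0, 1] are all illustrative choices, not taken from the source.

```python
from typing import Sequence, Tuple


def anytime_index(checkpoints: Sequence[Tuple[int, float]], budget: int) -> float:
    """Hypothetical anytime score: normalized area under a step-wise
    quality-vs-tokens curve.

    checkpoints: (tokens_used, quality) pairs, one per intermediate solution,
                 with quality assumed to lie in [0, 1].
    budget:      total reasoning-token budget (must be positive).

    Before the first checkpoint the quality is taken to be 0; after each
    checkpoint the quality of the latest intermediate solution holds until
    the next checkpoint or the end of the budget.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")

    area = 0.0
    prev_tokens = 0
    current_quality = 0.0
    for tokens, quality in sorted(checkpoints):
        if tokens > budget:
            break  # solutions produced past the budget do not count
        area += current_quality * (tokens - prev_tokens)
        prev_tokens = tokens
        current_quality = quality
    # Extend the last available quality to the end of the budget.
    area += current_quality * (budget - prev_tokens)
    return area / budget


# Two illustrative runs that reach the same final quality (0.8):
# one produces a good partial solution early, the other only at the end.
early = [(100, 0.8)]
late = [(400, 0.8)]
print(round(anytime_index(early, budget=400), 2))  # 0.6
print(round(anytime_index(late, budget=400), 2))   # 0.0
```

Under this reading, two models with identical final accuracy can receive very different scores, which is the behavior an anytime metric is meant to capture: the run that delivers a useful partial solution sooner is rewarded.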