Large language models (LLMs) can achieve strong reasoning performance with sufficient computation, but they do not inherently know how much computation a task requires. We study budgeted inference-time reasoning over multiple tasks under a strict global token constraint and formalize it as an Ordered Stochastic Multiple-Choice Knapsack Problem (OS-MCKP). This perspective highlights a meta-cognitive requirement -- anticipating task difficulty, estimating return on investment (ROI), and allocating computation strategically. We propose ROI-Reasoning, a two-stage framework that endows LLMs with intrinsic, budget-aware rationality. In the first stage, Meta-Cognitive Fine-Tuning teaches models to predict reasoning cost and expected utility before generation, enabling explicit solve-or-skip decisions. In the second stage, Rationality-Aware Reinforcement Learning optimizes sequential decision-making under a hard token budget, allowing models to learn long-horizon allocation strategies. Across budgeted mathematical reasoning benchmarks, ROI-Reasoning consistently improves overall score while substantially reducing regret under tight computation budgets.
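To make the solve-or-skip decision concrete, below is a minimal illustrative sketch, not the paper's learned policy: tasks arrive in order, each with a model-predicted token cost and expected utility, and a task is attempted only if its predicted ROI clears a threshold and its cost fits within the remaining global token budget. The names (`TaskEstimate`, `solve_or_skip`, `roi_threshold`) are hypothetical and chosen for illustration.

```python
# Illustrative sketch (not the ROI-Reasoning implementation): a greedy
# solve-or-skip policy for an ordered stream of tasks under a hard global
# token budget. Predicted cost and utility stand in for the model's own
# meta-cognitive estimates produced before generation.

from dataclasses import dataclass
from typing import List


@dataclass
class TaskEstimate:
    task_id: str
    predicted_cost: int       # expected reasoning tokens if the task is attempted
    predicted_utility: float  # expected score gain if the task is attempted (0..1)


def solve_or_skip(tasks: List[TaskEstimate],
                  token_budget: int,
                  roi_threshold: float = 1e-3) -> List[str]:
    """Return ids of tasks chosen to solve, processed in arrival order."""
    remaining = token_budget
    chosen: List[str] = []
    for task in tasks:
        if task.predicted_cost <= 0 or task.predicted_cost > remaining:
            continue  # skip: not affordable under the remaining budget
        roi = task.predicted_utility / task.predicted_cost
        if roi >= roi_threshold:
            chosen.append(task.task_id)
            remaining -= task.predicted_cost  # commit tokens to this task
    return chosen


if __name__ == "__main__":
    stream = [
        TaskEstimate("easy-1", predicted_cost=200, predicted_utility=0.9),
        TaskEstimate("hard-1", predicted_cost=4000, predicted_utility=0.2),
        TaskEstimate("easy-2", predicted_cost=300, predicted_utility=0.8),
    ]
    print(solve_or_skip(stream, token_budget=1000))  # -> ['easy-1', 'easy-2']
```

Unlike this myopic greedy rule, the paper's second stage uses reinforcement learning so the model can learn long-horizon allocation strategies rather than a fixed threshold.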