Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but they often spend tokens inefficiently and offer little control at inference time. Existing efficient-reasoning methods control thinking length by shortening, early-stopping, or compressing traces, which leaves how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process in which a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and the remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent on synthetic steering trajectories that we construct with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
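The inference-time loop described above can be illustrated with a minimal sketch. It is not the released ACTS implementation: the names `SteeringAction`, `State`, `steer`, the strategy labels `"decompose"` and `"conclude"`, the whitespace token counter, and the toy controller and reasoner are all assumptions made for illustration. The sketch only shows the control flow: the controller sees the trace and the remaining budget, emits a strategy plus a steering phrase, and the frozen reasoner continues from that phrase.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SteeringAction:
    strategy: str  # hypothetical strategy label, e.g. "decompose" or "conclude"
    phrase: str    # steering phrase that opens the next reasoner step


@dataclass
class State:
    trace: List[str]       # reasoning steps produced so far
    remaining_budget: int  # thinking tokens still available


# Assumed interfaces: the controller maps a state to an action; the frozen
# reasoner maps (question, trace, phrase, max_tokens) to the step continuation.
Controller = Callable[[State], SteeringAction]
Reasoner = Callable[[str, List[str], str, int], str]


def count_tokens(text: str) -> int:
    """Crude whitespace token count, standing in for a real tokenizer."""
    return len(text.split())


def steer(question: str, controller: Controller, reasoner: Reasoner,
          budget: int, stop_strategy: str = "conclude") -> List[str]:
    """Run the controller/reasoner loop until the budget runs out or the
    controller picks the (assumed) terminal strategy."""
    state = State(trace=[], remaining_budget=budget)
    while state.remaining_budget > 0:
        action = controller(state)
        continuation = reasoner(question, state.trace, action.phrase,
                                state.remaining_budget)
        step = action.phrase + continuation
        state.trace.append(step)
        # Charge at least one token so the loop always terminates.
        state.remaining_budget -= max(1, count_tokens(step))
        if action.strategy == stop_strategy:
            break
    return state.trace


def toy_controller(state: State) -> SteeringAction:
    """Hand-written stand-in for a learned controller policy."""
    if state.remaining_budget <= 8 or len(state.trace) >= 2:
        return SteeringAction("conclude", "Therefore, ")
    return SteeringAction("decompose", "First, ")


def toy_reasoner(question: str, trace: List[str], phrase: str,
                 max_tokens: int) -> str:
    """Stand-in for a frozen LLM; ignores its inputs and returns fixed text."""
    return "some reasoning text here"


if __name__ == "__main__":
    for step in steer("What is 2 + 2?", toy_controller, toy_reasoner, budget=20):
        print(step)
```

In this sketch the budget is only an observation for the controller and a loop bound; a real reasoner would also need to honor `max_tokens` when generating, and a learned controller would replace `toy_controller`.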