Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation for task selection. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT. Code is available at https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots.
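The core loop described above — maintaining a posterior over each task's difficulty and selecting tasks via Thompson sampling — can be sketched minimally as follows. This is an illustrative toy, not the BOTS implementation: the class name, the Beta-posterior parameterization, and the heuristic of preferring tasks whose sampled success rate is near 0.5 (neither trivial nor unsolvable) are all assumptions made here for exposition; BOTS additionally folds in implicit evidence for unselected tasks, which this sketch omits.

```python
import random


class ThompsonTaskSelector:
    """Toy Thompson-sampling task selector (illustrative only).

    Keeps a Beta(alpha, beta) posterior over each task's success
    rate and samples from it to balance exploration/exploitation.
    """

    def __init__(self, num_tasks, prior_alpha=1.0, prior_beta=1.0):
        # Uninformative Beta(1, 1) prior per task by default.
        self.alpha = [prior_alpha] * num_tasks
        self.beta = [prior_beta] * num_tasks

    def select(self, k, target=0.5):
        # Draw one plausible success rate per task from its posterior,
        # then pick the k tasks whose sampled rate is closest to
        # `target` (0.5 = informative: neither trivial nor unsolvable).
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(samples)),
                        key=lambda i: abs(samples[i] - target))
        return ranked[:k]

    def update(self, task_id, successes, failures):
        # Explicit evidence: observed rollout outcomes for a
        # selected task update its posterior counts.
        self.alpha[task_id] += successes
        self.beta[task_id] += failures
```

Because selection draws from the posterior rather than using its mean, rarely evaluated tasks (wide posteriors) still get sampled occasionally, which is what gives Thompson sampling its principled exploration/exploitation trade-off.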