Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce \textbf{BOTS}, a unified framework for \textbf{B}ayesian \textbf{O}nline \textbf{T}ask \textbf{S}election in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates \emph{explicit evidence} from direct evaluations of selected tasks and \emph{implicit evidence} inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
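The abstract describes maintaining Beta posteriors over task difficulty, selecting tasks via Thompson sampling, and updating with both explicit rollout outcomes and implicit interpolated estimates. A minimal illustrative sketch of this loop is below; all names, the target-difficulty selection rule, and the fractional pseudo-count update for implicit evidence are assumptions for exposition, not the paper's implementation.

```python
import random

class TaskSelector:
    """Sketch of Bayesian online task selection with Thompson sampling.

    Each task keeps a Beta(alpha, beta) posterior over its success rate.
    Tasks whose sampled rate is closest to a target (e.g. 0.5, neither
    trivial nor unsolvable) are treated as most informative to train on.
    """

    def __init__(self, n_tasks, target=0.5, seed=0):
        self.rng = random.Random(seed)
        self.alpha = [1.0] * n_tasks  # Beta prior pseudo-successes
        self.beta = [1.0] * n_tasks   # Beta prior pseudo-failures
        self.target = target

    def select(self, k):
        # Thompson sampling: draw one success rate per task from its
        # posterior, then pick the k tasks whose draw is nearest the target.
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)),
                        key=lambda i: abs(draws[i] - self.target))
        return ranked[:k]

    def update_explicit(self, task, successes, failures):
        # Explicit evidence: rollout outcomes for an evaluated task.
        self.alpha[task] += successes
        self.beta[task] += failures

    def update_implicit(self, task, est_rate, weight=0.5):
        # Implicit evidence: an interpolated success-rate estimate for an
        # unevaluated task, folded in as a down-weighted pseudo-count
        # (the interpolation itself is the plug-in the abstract mentions).
        self.alpha[task] += weight * est_rate
        self.beta[task] += weight * (1.0 - est_rate)


selector = TaskSelector(n_tasks=100)
batch = selector.select(k=8)  # tasks to roll out this training step
selector.update_explicit(batch[0], successes=3, failures=5)
```

Because sampling from the posterior (rather than using its mean) injects randomness proportional to uncertainty, under-evaluated tasks still get selected occasionally, which is how Thompson sampling balances exploration against exploitation.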