Reinforcement learning enhances the reasoning capabilities of large language models but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a promising remedy by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. The batch acquisition principle incorporates intermediate-difficulty prioritization and history-anchored diversity to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient allocation of computation. Experiments across diverse reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baselines.
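The batch acquisition principle described above (intermediate-difficulty prioritization combined with history-anchored diversity) could be sketched as follows. This is a minimal illustration, not the paper's implementation: the predicted success probabilities, prompt embeddings, weighting factor `lam`, and batch size are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder: predicted per-prompt success probabilities, standing in for a
# posterior estimate from the (assumed) lightweight difficulty model.
pred_success = rng.uniform(0.0, 1.0, size=100)

# Placeholder embeddings: prompts already seen in training (the "history")
# and the current candidate pool.
history_emb = rng.normal(size=(50, 8))
cand_emb = rng.normal(size=(100, 8))

def acquisition(p, cand, history, lam=0.5):
    """Score candidates: favor intermediate difficulty (p near 0.5) and
    distance from the training history as a simple diversity proxy."""
    difficulty_score = 1.0 - 2.0 * np.abs(p - 0.5)       # peaks at p = 0.5
    dists = np.linalg.norm(cand[:, None, :] - history[None, :, :], axis=-1)
    diversity_score = dists.min(axis=1)                   # nearest-history distance
    diversity_score = diversity_score / diversity_score.max()
    return difficulty_score + lam * diversity_score

scores = acquisition(pred_success, cand_emb, history_emb)
batch = np.argsort(scores)[-16:]   # pick the 16 highest-scoring prompts
```

The intermediate-difficulty term here assumes that prompts solved roughly half the time yield the most informative gradient signal; the diversity term penalizes candidates that closely resemble prompts already used, anchoring selection to the shared history.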