The development of Large Language Models (LLMs) often confronts two challenges: the heavy reliance on human annotators in the reinforcement learning from human feedback (RLHF) framework, and the frequent, costly external queries tied to the self-instruct paradigm. In this work, we pivot to Reinforcement Learning (RL), but with a twist. Diverging from typical RLHF, which refines LLMs after training on instruction data, we use RL to directly generate the foundational instruction dataset, which alone suffices for fine-tuning. Our method, TeaMs-RL, uses a suite of textual operations and rules and prioritizes the diversification of training datasets. It facilitates the generation of high-quality data without excessive reliance on external advanced models, paving the way for a single fine-tuning step and removing the need for subsequent RLHF stages. Our findings highlight key advantages of our approach: a reduced need for human involvement, fewer model queries (only $5.73\%$ of WizardLM's total), enhanced capabilities of LLMs in crafting and comprehending complex instructions compared to strong baselines, and substantially improved model privacy protection.