We show that iterative deployment of large language models (LLMs), in which each model is fine-tuned on data carefully curated by users from the previous model's deployment, can significantly change the properties of the resulting models. Testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide a theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer loop (i.e., outside of any intentional model training), with an implicit reward function. This connection to RL has two important implications. First, for the field of AI safety: because the reward function entailed by repeated deployment is never defined explicitly, it could have unexpected effects on the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, one that relies on data curation rather than explicit rewards.
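Viewed as an outer loop, the mechanism can be summarized in a few lines. The sketch below is illustrative only and uses hypothetical helpers (deploy_and_collect, user_curate, finetune) rather than the actual experimental pipeline; its point is that user curation acts as an implicit binary reward, so each fine-tuning round resembles a filtered (rejection-sampling) RL update.

```python
# Minimal sketch of iterative deployment as outer-loop RL via data curation.
# All helper callables are hypothetical placeholders, not the paper's code.
from typing import Callable, List


def iterative_deployment(
    base_model,
    deploy_and_collect: Callable[[object], List[str]],  # sample outputs from the deployed model
    user_curate: Callable[[str], bool],                  # implicit reward: users keep or discard each output
    finetune: Callable[[object, List[str]], object],     # supervised fine-tuning on the curated set
    num_generations: int = 5,
):
    """Each generation is fine-tuned on data curated from the previous deployment."""
    model = base_model
    for _ in range(num_generations):
        outputs = deploy_and_collect(model)               # deployment phase
        curated = [o for o in outputs if user_curate(o)]  # curation = implicit binary reward signal
        model = finetune(model, curated)                  # produce the next-generation model
    return model
```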