Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks due to the difficulty of designing comprehensive and precise reward functions. This difficulty limits the broader application of RL in game environments characterized by diverse constraints. Preference-based reinforcement learning (PbRL) offers a framework that uses human preferences as reward signals, circumventing the need for meticulous reward engineering. However, obtaining preference data from human experts is costly and inefficient, especially under complex constraints. To tackle this challenge, we propose an LLM-enabled automatic preference generation framework named LLM4PG, which harnesses large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions to optimize conditioned policies. Experiments on tasks with complex language constraints demonstrate the effectiveness of our LLM-enabled reward functions, which accelerate RL convergence and overcome stagnation caused by slow or absent progress under the original reward structures. This approach reduces reliance on specialized human knowledge and demonstrates the potential of LLMs to enhance the effectiveness of RL in complex environments in the wild.
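To make the abstract's pipeline concrete, the following is a minimal sketch of an LLM-driven preference-generation loop in the spirit of LLM4PG: trajectories are abstracted into short descriptions, an LLM-style ranker produces a preference under a language constraint, and a reward model is fit to those preferences with a Bradley-Terry objective. The function names (`summarize_trajectory`, `llm_rank_pair`), the network architecture, and the use of a Bradley-Terry loss are illustrative assumptions, not the paper's actual interface; the LLM calls are replaced with placeholders here.

```python
# Illustrative sketch only: hypothetical names and a stand-in for the LLM calls.
import random
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a state feature vector to a scalar reward estimate."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def summarize_trajectory(traj: torch.Tensor) -> str:
    """Placeholder for the LLM-based trajectory abstraction step:
    compress raw states into a short natural-language description."""
    return f"trajectory with mean state {traj.mean().item():.2f}"

def llm_rank_pair(summary_a: str, summary_b: str, constraint: str) -> int:
    """Placeholder for querying an LLM to pick the trajectory that better
    satisfies the language constraint. Returns 0 if A is preferred, 1 for B.
    A random choice stands in for the model's judgment here."""
    return random.randint(0, 1)

def preference_loss(reward_model, traj_a, traj_b, preferred: int) -> torch.Tensor:
    """Bradley-Terry style loss: the preferred trajectory should receive
    the higher predicted return."""
    ret_a = reward_model(traj_a).sum()
    ret_b = reward_model(traj_b).sum()
    logits = torch.stack([ret_a, ret_b]).unsqueeze(0)
    target = torch.tensor([preferred])
    return nn.functional.cross_entropy(logits, target)

if __name__ == "__main__":
    obs_dim = 8
    reward_model = RewardModel(obs_dim)
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
    constraint = "reach the goal without touching lava"

    for step in range(100):
        # Two rollouts from the current policy (random data stands in here).
        traj_a, traj_b = torch.randn(20, obs_dim), torch.randn(20, obs_dim)
        pref = llm_rank_pair(summarize_trajectory(traj_a),
                             summarize_trajectory(traj_b), constraint)
        loss = preference_loss(reward_model, traj_a, traj_b, pref)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # The learned reward_model would then replace the hand-crafted reward
    # when optimizing the conditioned policy with a standard RL algorithm.
```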