Designing and refining effective reward functions for reinforcement learning (RL) tasks with complex custom environments and multiple requirements is challenging. In this paper, we propose ERFSL, an efficient reward function searcher that uses LLMs as effective white-box searchers, leveraging their advanced semantic understanding capabilities. Specifically, we generate a reward component for each numerically explicit user requirement and employ a reward critic to identify the correct code form. LLMs then assign weights to the reward components to balance their magnitudes and iteratively adjust the weights, avoiding ambiguity and redundant adjustments, by flexibly applying directional mutation and crossover strategies, similar to genetic algorithms, based on context provided by a training log analyzer. We applied the framework to an underwater data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic corrects the reward code with only one feedback instance per requirement, effectively preventing unrectifiable errors. Weight initialization yields distinct reward functions within the Pareto solution set without a weight search. Even when a weight is off by a factor of 500, only 5.2 iterations are needed on average to meet user requirements. ERFSL also works well with most prompts when using GPT-4o mini, as decomposing the weight-search process reduces the demands on numerical and long-context understanding capabilities.
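The weight-search mechanism described above can be sketched as follows. This is a minimal illustrative example, not ERFSL's actual implementation: the component functions, the weighted-sum form, and the mutation factor are all assumptions introduced for illustration.

```python
import random

def total_reward(components, weights, state):
    # Weighted sum of per-requirement reward components,
    # one component per numerically explicit user requirement.
    return sum(w * c(state) for w, c in zip(weights, components))

def directional_mutation(weights, idx, factor):
    # Scale a single weight in the direction suggested by the
    # training log analysis (hypothetical: factor > 1 when the
    # corresponding requirement appears under-weighted).
    out = list(weights)
    out[idx] *= factor
    return out

def crossover(parent_a, parent_b):
    # Pick each weight from one of two candidate weight vectors,
    # as in a genetic algorithm's uniform crossover.
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]
```

In ERFSL, the choice between mutation and crossover is made by the LLM from the log analyzer's context rather than at random, which is what distinguishes it from a conventional genetic algorithm.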