Reinforcement learning (RL) shows promise for control problems, but its practical application is often hindered by the complexity of intricate reward functions with constraints. While the reward hypothesis suggests these competing demands can be encapsulated in a single scalar reward function, designing such functions remains challenging. Building on existing work, we start by formulating preferences over trajectories to derive a realistic reward function that balances goal achievement with constraint satisfaction for mobile robots navigating among dynamic obstacles. To mitigate reward exploitation in such complex settings, we propose a novel two-stage reward curriculum combined with a flexible replay buffer that adaptively samples experiences. Our approach first learns on a subset of the reward terms before transitioning to the full reward, allowing the agent to learn trade-offs between objectives and constraints. After transitioning to a new stage, our method continues to exploit past experiences by updating their rewards, enabling sample-efficient learning. We evaluate our approach on robot navigation tasks and demonstrate superior performance over baselines in terms of true reward achievement and task completion, underlining its effectiveness.
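The reward-relabeling idea behind the curriculum can be sketched as follows: if transitions store their raw reward components rather than a single scalar, past experiences can be re-scored under the new stage's weighting instead of being discarded. This is an illustrative sketch only, assuming a linear combination of components and hypothetical component names (`goal`, `collision`, `smoothness`); it is not the paper's implementation.

```python
import random
from dataclasses import dataclass

# Illustrative sketch of a two-stage reward curriculum with reward
# relabeling. Transitions keep their raw reward components so that,
# after the curriculum switches from the subset reward to the full
# reward, stored experiences are simply re-scored at sample time.
# All names and weights are hypothetical.

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple
    components: dict  # raw reward terms, e.g. {"goal": ..., "collision": ...}

class CurriculumReplayBuffer:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = []
        # Stage 1: learn on a subset of reward terms (goal term only).
        self.weights = {"goal": 1.0, "collision": 0.0, "smoothness": 0.0}

    def reward(self, t: Transition) -> float:
        # Scalar reward recomputed from components under current weights.
        return sum(self.weights[k] * v for k, v in t.components.items())

    def add(self, t: Transition):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # drop oldest experience
        self.buffer.append(t)

    def advance_stage(self):
        # Stage 2: switch to the full reward. No rewrite of the stored
        # transitions is needed; rewards are relabeled on sampling.
        self.weights = {"goal": 1.0, "collision": 1.0, "smoothness": 0.5}

    def sample(self, n):
        batch = random.sample(self.buffer, min(n, len(self.buffer)))
        return [(t, self.reward(t)) for t in batch]
```

Recomputing rewards lazily at sample time keeps the buffer contents stage-agnostic, which is what allows experiences gathered under the first stage to remain useful after the transition.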