Reward shaping is a critical component in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. While shaping rewards have been introduced to provide additional guidance, selecting effective shaping functions remains challenging and computationally expensive. This paper introduces Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames shaping reward selection as an online model selection problem. ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention, balancing exploration and exploitation with provable regret guarantees. We demonstrate ORSO's effectiveness across various continuous control tasks using the Isaac Gym simulator. Compared to traditional methods that fully evaluate each shaping reward function, ORSO significantly improves sample efficiency, reduces computational time, and consistently identifies high-quality reward functions that produce policies comparable to those generated by domain experts through hand-engineered rewards.
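The online model selection framing described in the abstract can be sketched as a bandit-style loop: each candidate shaping reward function is treated as an arm, and training rounds are allocated to the arm with the best upper confidence bound on task return. This is a minimal illustrative sketch only, not ORSO's actual algorithm; the function names, the UCB1-style rule, and the `evaluate` interface are all assumptions.

```python
import math

def ucb_select(counts, means, t, c=2.0):
    """UCB1-style index: pick the arm maximizing mean + exploration bonus.
    (Illustrative stand-in for a principled exploration strategy.)"""
    for k, n in enumerate(counts):
        if n == 0:          # pull every arm once before using the index
            return k
    return max(range(len(counts)),
               key=lambda k: means[k] + math.sqrt(c * math.log(t) / counts[k]))

def online_reward_selection(reward_fns, evaluate, iters=100):
    """Hypothetical sketch: interleave short training rounds across candidate
    shaping reward functions, allocating more rounds to the most promising one.
    `evaluate(r)` is assumed to run one short policy-optimization round under
    shaping reward r and return the resulting task (unshaped) return."""
    K = len(reward_fns)
    counts, means = [0] * K, [0.0] * K
    for t in range(1, iters + 1):
        k = ucb_select(counts, means, t)
        ret = evaluate(reward_fns[k])
        counts[k] += 1
        means[k] += (ret - means[k]) / counts[k]   # running mean of returns
    # report the shaping reward with the best estimated task return
    return max(range(K), key=lambda k: means[k])
```

Compared with fully training a policy under every candidate reward, a loop of this shape spends most of its sample budget on the few candidates that look promising early, which is the source of the sample-efficiency gains the abstract claims.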