Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.
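The bi-level structure described above can be illustrated with a minimal sketch. This is not the paper's implementation: the inner-loop policy training is replaced by a stand-in performance function, the heuristic combination is a simple weight vector, and all names (`rnd_error`, `task_performance`, the network shapes) are illustrative assumptions. The outer loop perturbs the reward weights with noise whose scale is guided by the prediction error of a fixed, randomly initialized network (an RND-style novelty signal), and keeps a candidate only when simulated task performance improves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, randomly initialized "target" network and a trainable predictor
# (RND-style novelty signal). All shapes and names are illustrative.
W_target = rng.normal(size=(4, 8))  # fixed random network, never trained
W_pred = np.zeros((4, 8))           # predictor, trained online

def rnd_error(x):
    """Prediction error of the fixed random network: high in novel regions."""
    return float(np.mean((x @ W_target - x @ W_pred) ** 2))

def task_performance(weights):
    # Stand-in for the inner loop (policy training + evaluation); peaked at
    # one particular weight mix. A real system would train a policy here.
    target = np.array([0.6, 0.3, 0.1])
    return -float(np.sum((weights - target) ** 2))

weights = np.ones(3) / 3  # uniform mix of three heuristic rewards
best_perf = -np.inf
for step in range(200):
    # Outer loop: perturb the reward weights; noise magnitude is guided by
    # the RND prediction error of the (toy) weight embedding.
    x = np.concatenate([weights, [1.0]]).reshape(1, 4)
    noise_scale = 0.05 + 0.1 * min(rnd_error(x), 1.0)
    candidate = weights + rng.normal(scale=noise_scale, size=3)
    candidate = np.clip(candidate, 0.0, None)
    candidate /= candidate.sum()  # keep weights on the simplex

    # Keep the candidate only if (simulated) task performance improves.
    perf = task_performance(candidate)
    if perf > best_perf:
        best_perf, weights = perf, candidate

    # Train the predictor toward the fixed network, shrinking novelty
    # (and hence the injected noise) in well-explored regions.
    grad = 2 * x.T @ (x @ W_pred - x @ W_target)
    W_pred -= 0.1 * grad

print(np.round(weights, 2), round(best_perf, 4))
```

Under these toy assumptions the weight vector drifts toward the performance-optimal mix; the RND term only modulates how aggressively the outer loop explores, mirroring the role the abstract assigns to the prediction-error signal.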