The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and to report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained with Soft-TAC successfully captured preference-specific objectives, resulting in policies whose behaviors were qualitatively more distinct than those obtained from reward models trained with a standard cross-entropy loss. This work demonstrates that TAC can serve both as a practical tool for guiding reward tuning and as a reward learning objective in complex domains.
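To make the Soft-TAC idea concrete, below is a minimal, hypothetical sketch of how a differentiable TAC surrogate could be used as a training loss over human preference data. It assumes TAC is a pairwise-agreement statistic between the reward model's trajectory ranking and the human labels, and that Soft-TAC smooths the hard ranking indicator with a temperature-scaled sigmoid; the function names, signature, and temperature parameter are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a Soft-TAC-style loss (assumed formulation, not the paper's code).
import torch

def soft_tac_loss(returns_a, returns_b, prefs, temperature=1.0):
    """Differentiable surrogate of a pairwise trajectory-alignment coefficient.

    returns_a, returns_b: predicted returns for each trajectory pair, shape (batch,).
    prefs: human labels, 1.0 if trajectory A was preferred, 0.0 if trajectory B.
    temperature: assumed smoothing parameter for the sigmoid relaxation.
    """
    # Smooth the hard indicator "A is ranked above B" with a sigmoid of the return gap.
    soft_agree_a = torch.sigmoid((returns_a - returns_b) / temperature)
    # Per-pair agreement with the human label, averaged over the batch:
    # this is the differentiable stand-in for the alignment coefficient.
    soft_tac = prefs * soft_agree_a + (1.0 - prefs) * (1.0 - soft_agree_a)
    # Maximizing Soft-TAC corresponds to minimizing its negation.
    return -soft_tac.mean()
```

Under these assumptions, the reward model's parameters would be updated by backpropagating through `soft_tac_loss` on batches of preference-labeled trajectory pairs, in the same place a cross-entropy preference loss would normally sit.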