Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about the trade-offs between them. Yet the "black-box" nature of RL models leaves the decision process behind the chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not provide explanations with respect to distinct objectives or user preferences. To address this gap, we propose TREX, a Trajectory-based Explainability framework that explains MORL policies through trajectory attribution. TREX generates trajectories directly from the learned expert policy across different user preferences, and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters and measuring the resulting relative deviation in observed rewards and actions compared to the original expert policy. Experiments on the multi-objective MuJoCo environments HalfCheetah, Ant, and Swimmer demonstrate the framework's ability to isolate specific behavioural patterns and quantify their influence.
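The abstract does not define the relative deviation measure, so the following is only a minimal sketch of one plausible form, assuming each policy is summarized by its mean per-objective return over a set of evaluation episodes. The function names `mean_returns` and `relative_deviation`, the `eps` stabilizer, and the example numbers are hypothetical illustrations, not part of TREX.

```python
from typing import List, Sequence


def mean_returns(episodes: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-objective returns over a set of evaluation episodes.

    Each episode is a vector with one return per objective.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    n_objectives = len(episodes[0])
    return [
        sum(episode[i] for episode in episodes) / len(episodes)
        for i in range(n_objectives)
    ]


def relative_deviation(
    expert: Sequence[float],
    complementary: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Per-objective relative change of a complementary policy vs. the expert.

    Assumed form (not taken from the paper): (c - e) / (|e| + eps),
    where eps guards against division by zero.
    """
    if len(expert) != len(complementary):
        raise ValueError("both policies must report the same objectives")
    return [(c - e) / (abs(e) + eps) for e, c in zip(expert, complementary)]


# Hypothetical two-objective example (e.g. forward speed, energy cost).
expert_returns = mean_returns([[9.0, -3.0], [11.0, -5.0]])
complementary_returns = mean_returns([[8.0, -5.0], [8.0, -5.0]])
deviation = relative_deviation(expert_returns, complementary_returns)
```

Under this assumed definition, a large deviation on one objective after excluding a cluster would suggest that the excluded behavioural segment mattered for that objective; an analogous comparison could be made on actions.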