Many decision-making problems feature multiple objectives. In such problems, a decision-maker's preferences over the different objectives are not always known. However, it is often possible to observe the decision-maker's behavior. In multi-objective decision-making, preference inference is the process of inferring a decision-maker's preferences over the different objectives. This research proposes a Dynamic Weight-based Preference Inference (DWPI) algorithm that infers the preferences of agents acting in multi-objective decision-making problems from observed behavior trajectories in the environment. The proposed method is evaluated on three multi-objective Markov decision processes: Deep Sea Treasure, Traffic, and Item Gathering. DWPI is compared to two existing preference inference methods from the literature, and empirical results demonstrate significant improvements over these baseline algorithms in terms of both time requirements and accuracy of the inferred preferences. DWPI also maintains its performance when inferring preferences from sub-optimal behavior demonstrations. In addition, DWPI does not require any interaction during training with the agent whose preferences are inferred; all that is required is a trajectory of observed behavior.