Many decision-making problems involve multiple objectives, and the preferences of a human or agent decision-maker over those objectives cannot always be known. Demonstrated behaviors from the decision-maker, however, are often available. This research proposes a dynamic weight-based preference inference (DWPI) algorithm that infers the preferences of agents acting in multi-objective decision-making problems from demonstrations. The proposed algorithm is evaluated on three multi-objective Markov decision processes (Deep Sea Treasure, Traffic, and Item Gathering) and is compared to two existing preference inference algorithms. Empirical results demonstrate significant improvements over the baseline algorithms in both time efficiency and inference accuracy. The DWPI algorithm maintains its performance when inferring preferences from sub-optimal demonstrations. Moreover, the DWPI algorithm requires no interaction with the user during inference; only demonstrations are needed. We provide a correctness proof and complexity analysis of the algorithm and statistically evaluate its performance under different representations of demonstrations.