Multi-objective optimization is a class of decision-making problems in which multiple conflicting objectives are optimized simultaneously. We study offline optimization of multi-objective policies from data collected by an existing policy. We propose a pessimistic estimator of multi-objective policy values that can be plugged directly into existing formulas for hypervolume computation and then optimized. The estimator is based on inverse propensity scores (IPS) and improves upon a naive IPS estimator in both theory and experiments. Our analysis is general and applies beyond our IPS estimators and the methods we use to optimize them. The pessimistic estimator can be optimized by policy gradients and performs well in all of our experiments.
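To make the pipeline concrete, the following is a minimal sketch of how a naive IPS estimate, a pessimistic variant, and a hypervolume computation might fit together. It is an illustration under stated assumptions, not the estimator proposed in this work: the function names, the penalty form (IPS mean minus `c` standard errors), the parameter `c`, and the restriction to two objectives with a user-chosen reference point are all hypothetical choices made for the example.

```python
import numpy as np

def ips_values(rewards, target_probs, logging_probs):
    """Naive IPS estimate of the per-objective policy value.

    rewards:       (n, m) array, one row of m objective rewards per logged sample
    target_probs:  (n,) probability of the logged action under the evaluated policy
    logging_probs: (n,) probability of the logged action under the logging policy
    """
    weights = target_probs / logging_probs            # importance weights
    return (weights[:, None] * rewards).mean(axis=0)  # one value per objective

def pessimistic_ips_values(rewards, target_probs, logging_probs, c=1.0):
    """Illustrative lower bound: IPS mean minus c standard errors.

    ASSUMPTION: this penalty is a stand-in chosen for the sketch; the
    source does not specify the form of its pessimistic estimator.
    """
    weights = target_probs / logging_probs
    weighted = weights[:, None] * rewards
    n = weighted.shape[0]
    std_err = weighted.std(axis=0, ddof=1) / np.sqrt(n)
    return weighted.mean(axis=0) - c * std_err

def hypervolume_2d(points, ref):
    """Hypervolume dominated by 2-D value vectors (maximization) above ref."""
    pts = np.asarray(points, dtype=float)
    ref = np.asarray(ref, dtype=float)
    pts = pts[np.all(pts > ref, axis=1)]   # keep points that dominate ref
    pts = pts[np.argsort(-pts[:, 0])]      # sweep in decreasing first objective
    volume, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:                     # point adds a new horizontal slab
            volume += (x - ref[0]) * (y - best_y)
            best_y = y
    return volume
```

In this sketch, each candidate policy yields one pessimistic value vector, and the hypervolume of the resulting set of vectors serves as the objective to maximize. Replacing `ips_values` with `pessimistic_ips_values` leaves the hypervolume code unchanged, which is the sense in which a pessimistic estimator can be plugged into an existing hypervolume formula.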