We study the problem of designing optimal learning and decision-making formulations when only historical data is available. Prior work typically commits to a particular class of data-driven formulation and subsequently tries to establish out-of-sample performance guarantees. Here we take the opposite approach: we first define a sensible yardstick with which to measure the quality of any data-driven formulation, and subsequently seek an optimal such formulation. Informally, any data-driven formulation balances the proximity of its estimated cost to the actual cost against the level of out-of-sample performance it guarantees. Given an acceptable level of out-of-sample performance, we explicitly construct a data-driven formulation that is uniformly closer to the true cost than any other formulation enjoying the same out-of-sample performance. We show the existence of three distinct out-of-sample performance regimes (a superexponential regime, an exponential regime and a subexponential regime) between which the nature of the optimal data-driven formulation undergoes a phase transition. The optimal data-driven formulation can be interpreted as a classically robust formulation in the superexponential regime, an entropic distributionally robust formulation in the exponential regime and a variance-penalized formulation in the subexponential regime. This final observation unveils a surprising connection between these three data-driven formulations, which at first glance seem unrelated, a connection that had remained hidden until now.
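To make the three named formulations concrete, the following is a minimal sketch of how each could estimate the cost of a fixed decision from observed cost samples. It is not the construction from this work. The function names, the `radius` parameter, the choice of a Kullback-Leibler ball around the empirical distribution (evaluated through its standard dual), and the specific variance penalty `sqrt(2 * radius * variance)` are all illustrative assumptions.

```python
import math
from statistics import fmean, pvariance


def robust_cost(costs):
    """Classically robust estimate: the worst cost observed in the data."""
    return max(costs)


def entropic_dro_cost(costs, radius, iters=200):
    """Worst-case expected cost over a KL ball around the empirical distribution.

    Assumption: the ball is {Q : KL(Q || P_n) <= radius}, evaluated through the
    dual  inf_{lam > 0} lam * radius + lam * log(mean(exp(c / lam))).
    """
    top = max(costs)

    def dual(log_lam):
        lam = math.exp(log_lam)
        # Shift by the maximum cost so the exponentials cannot overflow.
        shifted = [math.exp((c - top) / lam) for c in costs]
        return lam * radius + top + lam * math.log(sum(shifted) / len(shifted))

    # The dual is convex in lam, hence unimodal in log(lam): ternary search.
    lo, hi = math.log(1e-8), math.log(1e8)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if dual(m1) < dual(m2):
            hi = m2
        else:
            lo = m1
    return dual((lo + hi) / 2)


def variance_penalized_cost(costs, radius):
    """Empirical mean plus a penalty proportional to the standard deviation.

    Assumption: the penalty weight sqrt(2 * radius) is an illustrative choice.
    """
    return fmean(costs) + math.sqrt(2 * radius * pvariance(costs))


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0]  # hypothetical cost samples
    for r in (0.01, 0.5, 10.0):
        print(r, robust_cost(samples), entropic_dro_cost(samples, r),
              variance_penalized_cost(samples, r))
```

Under these assumptions, the sketch hints at the kind of connection the abstract describes: for a large radius the entropic estimate approaches the robust (worst observed) cost, while for a small radius it stays close to the variance-penalized estimate.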