Designing data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. In practice, overfitting typically cannot be attributed to a single cause but instead results from several factors acting at once. We consider three sources of overfitting: (i) statistical error, which results from working with finite sample data; (ii) data noise, which occurs when the data points are measured only with finite precision; and (iii) data misspecification, in which a small fraction of all data may be wholly corrupted. We argue that although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all of them simultaneously. We design a novel data-driven formulation that does guarantee such holistic protection and is furthermore computationally viable. Our distributionally robust optimization formulation can be interpreted as a novel combination of a Kullback-Leibler and a Lévy-Prokhorov robust optimization formulation, which is novel in its own right. Nevertheless, we show that, in the context of classification and regression problems, several popular regularized and robust formulations reduce to particular cases of our proposed formulation. Finally, we apply the proposed holistic robust (HR) formulation to a portfolio selection problem with real stock data and analyze its risk/return trade-off against several benchmark formulations. Our experiments show that our ambiguity set provides a significantly better risk/return trade-off than these benchmarks.