In this paper we consider optimization with relaxation, a broad paradigm for data-driven design. This approach was previously studied by the present authors in Garatti and Campi (2019), which revealed a deep-seated connection between two concepts: risk (the probability of not satisfying a new, out-of-sample constraint) and complexity (as defined in Garatti and Campi (2019)). This connection has profound implications in applications because it implies that the risk can be estimated from the complexity, a quantity that can be measured from the data without any knowledge of the data-generation mechanism. In the present work we establish new results. First, we extend the scope of Garatti and Campi (2019) to a more general setup that covers various algorithms in machine learning. Then, we study classical support vector methods, including SVM (Support Vector Machine), SVR (Support Vector Regression) and SVDD (Support Vector Data Description), and derive new results on the ability of these methods to generalize. All results are valid for any finite size of the data set. When the sample size tends to infinity, we establish the unprecedented result that the risk approaches the ratio between the complexity and the cardinality of the data sample, regardless of the value of the complexity.
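In symbols, the asymptotic statement can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $N$ is the cardinality of the data sample, $s_N^*$ the complexity measured from the data, and $V_N$ the risk of the resulting design. The mode of convergence is left unspecified, and the statement is meant to hold regardless of the value of $s_N^*$.

\[
V_N - \frac{s_N^*}{N} \longrightarrow 0 \quad \text{as } N \to \infty .
\]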