Cross-validation techniques for risk estimation and model selection are widely used in statistics and machine learning. However, the theoretical properties of learning via model selection with cross-validation risk estimation remain poorly understood despite this widespread use. In this context, this paper presents learning via model selection with cross-validation risk estimation as a general systematic learning framework within classical statistical learning theory and establishes distribution-free deviation bounds in terms of VC dimension, giving detailed proofs of the results and considering both bounded and unbounded loss functions. In particular, we investigate how the generalization of learning via model selection may be improved by modeling the collection of candidate models. We define Learning Spaces as a class of candidate models in which the partial order by inclusion reflects the models' complexities, and we formalize a way of constructing them from domain knowledge. We illustrate this modeling in a worst-case scenario of learning a classifier on a finite domain and in a typical scenario of linear regression. Through theoretical insights and concrete examples, we aim to provide guidance on selecting the family of candidate models based on domain knowledge to improve generalization.
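As a rough illustration of model selection with cross-validation risk estimation over candidate models ordered by inclusion, the following minimal sketch selects among nested linear regression models. It is not the construction used in the paper: the helper names `kfold_cv_risk` and `select_model`, the synthetic data, the squared-error loss, and the choice of a simple chain of nested feature subsets are all assumptions made for illustration.

```python
import numpy as np


def kfold_cv_risk(X, y, columns, k=5, seed=0):
    """Estimate the squared-error risk of least squares restricted to
    the features in `columns`, using k-fold cross-validation.
    (Hypothetical helper, not taken from the paper.)"""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    losses = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds using only the candidate's features.
        coef, *_ = np.linalg.lstsq(X[np.ix_(train, columns)], y[train], rcond=None)
        # Evaluate the empirical risk on the held-out fold.
        pred = X[np.ix_(val, columns)] @ coef
        losses.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(losses))


def select_model(X, y, candidates, k=5, seed=0):
    """Return the candidate with the smallest estimated risk, plus all estimates."""
    risks = {name: kfold_cv_risk(X, y, cols, k, seed) for name, cols in candidates.items()}
    best = min(risks, key=risks.get)
    return best, risks


# Synthetic data (an assumption): only the first two features matter.
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# A chain of nested candidate models, M1 ⊂ M2 ⊂ M3 ⊂ M4, so that
# inclusion tracks complexity (number of features used).
candidates = {
    "M1": [0],
    "M2": [0, 1],
    "M3": [0, 1, 2],
    "M4": [0, 1, 2, 3],
}

best, risks = select_model(X, y, candidates)
print(best, risks)
```

In this toy setting the too-small model `M1` has a clearly larger estimated risk than the models containing both relevant features, while the larger models differ only slightly. Domain knowledge enters through the choice of `candidates`: restricting the collection to plausible models is the kind of modeling decision the abstract refers to.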