Learnability, Sample Complexity, and Hypothesis Class Complexity for Regression Models

The goal of a learning algorithm is to take a training data set as input and return a hypothesis that generalizes to all possible data points in a domain set. The hypothesis is chosen from hypothesis classes of potentially different complexities. Linear regression modeling is an important category of learning algorithms. In practice, uncertainty in the target samples affects the generalization performance of the learned model. Failing to choose a proper model or hypothesis class can lead to serious issues such as underfitting or overfitting. These issues have been addressed by altering cost functions or by using cross-validation methods. Such approaches can introduce new hyperparameters, with their own challenges and uncertainties, or increase the computational complexity of the learning algorithm. The theory of probably approximately correct (PAC) learning, on the other hand, aims to define learnability in a probabilistic setting. Despite its theoretical value, PAC often does not address practical learning issues. This work is inspired by the foundations of PAC and motivated by existing issues in regression learning. The proposed approach, denoted ε-Confidence Approximately Correct (ε-CoAC), uses the Kullback-Leibler divergence (relative entropy) and proposes a new related typical set in the set of hyperparameters to tackle the learnability issue. Moreover, it enables the learner to compare hypothesis classes of different complexity orders and to choose among them the optimal one, namely the class with the minimum ε in the ε-CoAC framework. Not only does ε-CoAC learnability overcome the issues of overfitting and underfitting, but it also shows advantages over the well-known cross-validation method in terms of both computation time and accuracy.
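For orientation, the cross-validation baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of k-fold model-order selection for polynomial regression, not the ε-CoAC procedure itself, which the abstract does not specify. The function names `kfold_cv_mse` and `select_degree`, the synthetic data, and the choice of polynomial hypothesis classes are all assumptions made for illustration.

```python
import numpy as np


def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Mean validation MSE of a degree-`degree` polynomial fit over k folds.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Design matrices with columns 1, x, ..., x^degree.
        a_train = np.vander(x[train], degree + 1, increasing=True)
        a_val = np.vander(x[val], degree + 1, increasing=True)
        # Ordinary least-squares fit on the training folds.
        coef, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
        errors.append(np.mean((a_val @ coef - y[val]) ** 2))
    return float(np.mean(errors))


def select_degree(x, y, degrees, k=5, seed=0):
    """Return the degree with the lowest cross-validated MSE, plus all scores."""
    scores = {d: kfold_cv_mse(x, y, d, k, seed) for d in degrees}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Synthetic data (assumed for the demo): a cubic trend plus Gaussian noise.
    rng = np.random.default_rng(42)
    x = rng.uniform(-1.0, 1.0, 60)
    y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0.0, 0.1, 60)
    best, scores = select_degree(x, y, range(0, 10))
    for d, s in scores.items():
        print(f"degree {d}: CV MSE = {s:.5f}")
    print("selected degree:", best)
```

Each candidate complexity order requires k separate fits, and the number of folds k is itself a hyperparameter. This is the kind of added computational cost and extra tuning the abstract attributes to cross-validation.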