Statistical Agnostic Regression: a machine learning method to validate regression models

Regression analysis is a central topic in statistical modeling. It aims to estimate the relationship between a dependent variable, commonly called the response variable, and one or more independent (explanatory) variables. Linear regression is by far the most popular method for this task and is used across research fields for purposes such as prediction, forecasting, and causal inference. Classical approaches to linear regression, such as Ordinary Least Squares, Ridge, and Lasso regression, often serve as the foundation for more advanced machine learning (ML) techniques. These ML techniques have been applied successfully to regression problems, but without a formal definition of statistical significance. At most, permutation tests or classical analyses based on empirical measures (e.g., residuals or accuracy) have been used to reflect the greater detection ability of ML estimators. In this paper, we introduce a method, named Statistical Agnostic Regression (SAR), for evaluating the statistical significance of an ML-based linear regression. SAR relies on concentration inequalities for the actual risk, derived from a worst-case analysis. As in the classification problem, we define a threshold that establishes, with probability at least $1-\eta$, that there is sufficient evidence to conclude that a linear relationship exists in the population between the explanatory (feature) and response (label) variables. Simulations in only two dimensions show that the proposed agnostic test provides an analysis of variance similar to that of the classical $F$ test for the slope parameter.
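For orientation, the classical baselines mentioned in the abstract can be sketched in a few lines. The block below is a minimal illustration, not the SAR procedure itself, whose threshold and concentration bound are not specified in the abstract. It simulates two-dimensional data (one feature, one response), fits a least-squares line, computes the $F$ statistic for the slope, and estimates a permutation p-value. All function names, the sample size, the noise level, and the number of permutations are assumptions chosen for illustration.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def f_statistic(x, y):
    """F statistic for H0: slope = 0 in simple linear regression.

    F = SS_reg / (SS_res / (n - 2)), with 1 and n - 2 degrees of freedom.
    """
    n = len(x)
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / n
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_reg = ss_tot - ss_res
    return ss_reg / (ss_res / (n - 2))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Permutation p-value: shuffle y to break any x-y relationship."""
    rng = random.Random(seed)
    observed = f_statistic(x, y)
    y_perm = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if f_statistic(x, y_perm) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)


def simulate(n, slope, noise_sd, seed):
    """Hypothetical data generator: y = slope * x + Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y


if __name__ == "__main__":
    x, y = simulate(n=100, slope=2.0, noise_sd=0.5, seed=1)
    print("F statistic:", f_statistic(x, y))
    print("permutation p-value:", permutation_p_value(x, y))
```

The permutation test is used here because it needs no distributional tables. A parametric p-value would instead compare the statistic against the $F$ distribution with 1 and n - 2 degrees of freedom.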
