Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for this task across research fields, for instance in data integration and predictive modeling when combining information from multiple sources. Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regression, often form the foundation for more advanced machine learning (ML) techniques, which have been applied successfully, though without a formal definition of statistical significance. At most, permutation tests or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater detection sensitivity of ML estimations. In this paper, we introduce Statistical Agnostic Regression (SAR) to evaluate the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) under the worst-case scenario. To this end, we define a threshold that guarantees, with probability at least $1-\eta$, sufficient evidence to conclude that a linear relationship exists in the population between the explanatory (feature) and response (label) variables. Simulations demonstrate that the proposed agnostic (non-parametric) test provides an analysis of variance similar to the classical multivariate $F$-test for the slope parameter, without relying on the underlying assumptions of classical methods. Moreover, the residuals computed by this method represent a trade-off between those obtained from ML approaches and from classical OLS.
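The idea of bounding the actual risk and comparing it against a no-relationship baseline can be illustrated with a minimal sketch. This is not the paper's method: it assumes a simple Hoeffding-style concentration bound on a held-out empirical risk, uses the observed maximum squared residual as a crude stand-in for the loss bound $B$, and takes the mean-only model as the null. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a genuine linear relationship in the population
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)

eta = 0.05  # allowed probability of a false conclusion

# Fit a linear model on one half, evaluate on the held-out half
half = n // 2
x_tr, y_tr = x[:half], y[:half]
x_te, y_te = x[half:], y[half:]
slope, intercept = np.polyfit(x_tr, y_tr, 1)  # highest degree first
residuals = y_te - (slope * x_te + intercept)

# Squared losses; Hoeffding's inequality needs them bounded in [0, B].
# Using the observed maximum is an assumption of this sketch, not a
# valid a priori bound.
losses = residuals ** 2
B = losses.max()
m = losses.size

# Upper bound on the actual risk holding with probability >= 1 - eta
emp_risk = losses.mean()
bound = emp_risk + B * np.sqrt(np.log(1.0 / eta) / (2.0 * m))

# Null model: predict the training mean (no linear relationship)
null_risk = ((y_te - y_tr.mean()) ** 2).mean()

# Conclude a linear relationship only if even the worst-case risk of
# the linear model beats the null model's risk
significant = bound < null_risk
print(f"risk bound = {bound:.3f}, null risk = {null_risk:.3f}, "
      f"linear relation detected: {significant}")
```

With a true slope of 2 and unit noise, the bounded risk of the linear fit stays well below the null risk, so the test concludes a linear relationship; with a zero slope the bound typically exceeds the null risk and no conclusion is drawn.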