Regression analysis is a central topic in statistical modeling. It aims to estimate the relationship between a dependent variable, commonly called the response variable, and one or more independent (explanatory) variables. Linear regression is by far the most popular method for this task in several fields of research, with applications such as prediction, forecasting, and causal inference. Classical methods for solving linear regression problems, such as Ordinary Least Squares, Ridge, or Lasso regression, are often the foundation for more advanced machine learning (ML) techniques. The latter have been successfully applied in this scenario, but without a formal definition of statistical significance. At most, permutation analyses or classical analyses based on empirical measures (e.g., residuals or accuracy) have been conducted to reflect the greater detection ability of ML estimations. In this paper, we introduce a method, named Statistical Agnostic Regression (SAR), for evaluating the statistical significance of an ML-based linear regression, based on concentration inequalities of the actual risk and a worst-case analysis. To achieve this goal, similarly to the classification problem, we define a threshold to establish that there is sufficient evidence, with probability at least $1-\eta$, to conclude that there is a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations in only two dimensions demonstrate the ability of the proposed agnostic test to provide an analysis of variance similar to that given by the classical $F$ test for the slope parameter.
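For orientation, the two classical baselines mentioned above (the $F$ statistic for the slope and a permutation analysis) can be sketched for the two-dimensional case as follows. This is a minimal illustration under stated assumptions, not the SAR procedure itself: the function names `f_statistic`, `permutation_p_value`, and `simulate`, the Gaussian noise model, and all default parameters are hypothetical choices made for this example.

```python
import random


def f_statistic(x, y):
    """F statistic for H0: slope = 0 in the simple model y = a + b*x.

    Assumes the fit is not perfect (residual sum of squares > 0).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Regression sum of squares, with 1 degree of freedom.
    ss_reg = slope * sxy
    return ss_reg / (ss_res / (n - 2))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Permutation p-value: shuffle y to break any link with x."""
    rng = random.Random(seed)
    observed = f_statistic(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if f_statistic(x, y_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


def simulate(slope, n=50, noise=1.0, seed=1):
    """Hypothetical data generator: y = slope * x + Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, noise) for xi in x]
    return x, y


if __name__ == "__main__":
    x, y = simulate(slope=2.0)
    print("F statistic:", f_statistic(x, y))
    print("permutation p-value:", permutation_p_value(x, y))
```

A small permutation p-value here plays the same role as rejecting the null hypothesis of zero slope with the $F$ test; the agnostic test described in the abstract is instead based on a threshold derived from concentration inequalities, which this sketch does not attempt to reproduce.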