Studying the generalization abilities of linear models on real data is a central question in statistical learning. A small number of important prior works (Loureiro et al. 2021A, 2021B; Wei et al. 2022) do validate theoretical results with real data, but they are limited by technical assumptions. These include a well-conditioned covariance matrix and independent and identically distributed (i.i.d.) data, neither of which necessarily holds for real data. Additionally, prior works that address distribution shift usually make technical assumptions on the joint distribution of the training and test data (Tripuraneni et al. 2021; Wu and Xu 2020) and do not test on real data. To address these issues and better model real data, we study data that is not i.i.d. but has a low-rank structure. Further, we address distribution shift by decoupling the assumptions on the training and test distributions. We provide asymptotically exact analytical formulas for the generalization error of the denoising problem and use them to derive theoretical results for linear regression, data augmentation, principal component regression, and transfer learning. We validate all of our theoretical results on real data, obtaining a low relative mean squared error of around 1% between the empirical risk and our estimated risk.