Improving Linear Regression on Small Datasets via Gaussian Process and Extreme Value Theory-Based Data Augmentation

Small sample sizes pose significant challenges in regression analysis, often leading to violations of classical assumptions such as normality, homoscedasticity, and independence of residuals. These violations compromise parameter estimation accuracy, reduce statistical power, and limit the generalizability of findings. This study introduces the Gaussian Process-based Modified Extreme Value Theorem (GP-MEVT) method, a novel hybrid data augmentation approach that combines Gaussian processes with extreme value theory to address these limitations. The GP-MEVT method generates augmented observations that extend the predictor space beyond the observed range while preserving the underlying linear structure and introducing controlled variability based on residual variation. Through comprehensive simulation studies across three variance scenarios (sigma = 2, 5, 8) and three sample sizes (n = 10, 15, 20), we demonstrate that GP-MEVT achieves a higher rate of assumption satisfaction, substantially outperforming the standard bootstrap and bootstrap-with-noise methods. The proposed method also exhibits reasonable parameter estimation accuracy, with intercept and slope estimates consistently close to the true parameter values, and maintains competitive or superior model fit as measured by root mean square error. Application to a real-world dataset confirms these advantages, with GP-MEVT achieving a 67.1% assumption satisfaction rate compared to 17.3% and 21.2% for the two bootstrap alternatives. These findings establish GP-MEVT as a robust and reliable framework for fitting linear regression models to small datasets, offering practitioners a principled approach to statistical inference when sample size limitations are unavoidable.
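The baseline side of the simulation design described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it covers only the standard bootstrap and bootstrap-with-noise baselines (GP-MEVT itself is not specified in the abstract and is not reproduced here). The true parameters, the predictor range, the augmented sample size, the specific diagnostic tests (Shapiro-Wilk, a Breusch-Pagan-style LM test, a Durbin-Watson band), and all function names are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats


def fit_ols(x, y):
    """Fit a simple least-squares line; return (intercept, slope, residuals)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    return intercept, slope, residuals


def assumptions_hold(x, residuals, alpha=0.05):
    """Illustrative joint check of normality, homoscedasticity, independence."""
    # Normality: Shapiro-Wilk test on the residuals.
    _, p_normal = stats.shapiro(residuals)
    # Homoscedasticity: Breusch-Pagan-style LM statistic, n * R^2 from
    # regressing squared residuals on x (chi-square with 1 degree of freedom).
    r, _ = stats.pearsonr(x, residuals ** 2)
    p_homo = stats.chi2.sf(len(x) * r ** 2, df=1)
    # Independence: Durbin-Watson statistic inside an assumed band around 2.
    dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
    return bool(p_normal > alpha and p_homo > alpha and 1.5 < dw < 2.5)


def bootstrap(x, y, size, rng):
    """Standard bootstrap: resample (x, y) pairs with replacement."""
    idx = rng.integers(0, len(x), size=size)
    return x[idx], y[idx]


def bootstrap_with_noise(x, y, size, rng, scale):
    """Bootstrap, then add Gaussian noise to the resampled responses."""
    xb, yb = bootstrap(x, y, size, rng)
    return xb, yb + rng.normal(0.0, scale, size=size)


def satisfaction_rates(n, sigma, reps=200, size=100, seed=0):
    """Share of replications in which all three checks pass, per method."""
    rng = np.random.default_rng(seed)
    beta0, beta1 = 1.0, 2.0  # hypothetical true intercept and slope
    hits = {"bootstrap": 0, "bootstrap_noise": 0}
    for _ in range(reps):
        x = rng.uniform(0.0, 10.0, n)  # assumed predictor range
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
        _, _, res = fit_ols(x, y)
        noise_scale = res.std(ddof=2)  # residual spread of the small sample
        augmented = {
            "bootstrap": bootstrap(x, y, size, rng),
            "bootstrap_noise": bootstrap_with_noise(x, y, size, rng, noise_scale),
        }
        for name, (xa, ya) in augmented.items():
            _, _, res_aug = fit_ols(xa, ya)
            hits[name] += assumptions_hold(xa, res_aug)
    return {name: count / reps for name, count in hits.items()}


if __name__ == "__main__":
    # Grid from the abstract: sigma in {2, 5, 8}, n in {10, 15, 20}.
    for sigma in (2, 5, 8):
        for n in (10, 15, 20):
            print(sigma, n, satisfaction_rates(n, sigma, reps=50))
```

Running the script prints one dictionary of satisfaction rates per (sigma, n) cell. The numbers depend on the assumed tests, thresholds, and augmented sample size, so they should not be expected to match the rates reported in the study.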
