We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data.
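To make the idea of choosing weights by their effect on estimator variance concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the method as defined in the paper: it assumes a scalar linear coefficient, that the response and the covariate of interest have already been residualised on the nuisance functions, that the weight class is a one-parameter equicorrelated working covariance indexed by a hypothetical `rho`, and that the 'sandwich loss' can be approximated by the usual cluster-robust sandwich variance estimate of the weighted least squares estimator. All function names and the simulated data are hypothetical.

```python
import numpy as np

def equicorrelated_weights(n, rho):
    """Weight matrix: inverse of the working covariance (1 - rho) * I + rho * 11^T."""
    cov = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
    return np.linalg.inv(cov)

def wls_and_sandwich(x_groups, y_groups, rho):
    """Return the grouped WLS estimate of a scalar beta and its sandwich variance."""
    weights = [equicorrelated_weights(len(x), rho) for x in x_groups]
    # "Bread": sum over groups of x_i^T W_i x_i.
    bread = sum(x @ W @ x for x, W in zip(x_groups, weights))
    beta = sum(x @ W @ y for x, y, W in zip(x_groups, y_groups, weights)) / bread
    # "Meat": sum over groups of (x_i^T W_i r_i)^2, with r_i the group residuals.
    meat = sum((x @ W @ (y - beta * x)) ** 2
               for x, y, W in zip(x_groups, y_groups, weights))
    return beta, meat / bread ** 2

# Simulated grouped data (assumed already residualised on nuisance covariates).
rng = np.random.default_rng(0)
n_groups, group_size, true_beta = 200, 4, 2.0
x_groups, y_groups = [], []
for _ in range(n_groups):
    x = rng.normal(size=group_size)
    # Errors: shared group effect plus heteroscedastic noise, so the
    # equicorrelated working covariance is deliberately misspecified.
    eps = rng.normal() + (1.0 + np.abs(x)) * rng.normal(size=group_size)
    x_groups.append(x)
    y_groups.append(true_beta * x + eps)

# Pick the weight parameter that minimises the estimated sandwich variance.
rho_grid = np.linspace(0.0, 0.9, 10)
results = [wls_and_sandwich(x_groups, y_groups, rho) for rho in rho_grid]
losses = np.array([variance for _, variance in results])
best_index = int(np.argmin(losses))
best_rho = rho_grid[best_index]
beta_hat = results[best_index][0]
print(f"best rho = {best_rho:.1f}, beta_hat = {beta_hat:.3f}, "
      f"sandwich variance = {losses[best_index]:.5f}")
```

The grid search stands in for whatever optimiser one prefers. The point of the sketch is only that the selection criterion targets the variance of the coefficient estimate directly, rather than the fit of a covariance model, which is why it can still behave sensibly when the working covariance is misspecified.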