In the context of linear regression, we construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal asymptotic variance in the downstream estimation of the regression coefficients. Our semiparametric approach targets the best decreasing approximation of the derivative of the log-density of the noise distribution. At the population level, this fitting process is a nonparametric extension of score matching, corresponding to a log-concave projection of the noise distribution with respect to the Fisher divergence. The procedure is computationally efficient, and we prove that it attains the minimal asymptotic covariance among all convex $M$-estimators. As an example of a non-log-concave setting, for Cauchy errors, the optimal convex loss function is Huber-like, and our procedure yields an asymptotic efficiency greater than 0.87 relative to the oracle maximum likelihood estimator of the regression coefficients that uses knowledge of this error distribution; in this sense, we obtain robustness without sacrificing much efficiency. Numerical experiments confirm the practical merits of our proposal.
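The Cauchy figure can be sanity-checked with a short calculation. The sketch below is illustrative only and is not the data-driven procedure described in the abstract. It assumes that "Huber-like" means the Cauchy score $-2x/(1+x^2)$ kept on $[-c, c]$ and held constant outside, with $c \le 1$ so that the score stays decreasing. It also uses the standard sandwich formula for the asymptotic variance of an $M$-estimator, $\mathbb{E}[\psi^2]/(\mathbb{E}[\psi'])^2$, together with the Fisher information $1/2$ of the standard Cauchy location model. The names `clipped_score_efficiency` and `best_threshold` are hypothetical, and the closed-form integrals come from the substitution $x = \tan\theta$.

```python
import math

# Fisher information for location in the standard Cauchy model.
FISHER_INFO_CAUCHY = 0.5


def clipped_score_efficiency(c):
    """Asymptotic efficiency, relative to the Cauchy MLE, of the M-estimator
    whose score equals the Cauchy score -2x/(1+x^2) on [-c, c] and is held
    constant outside that interval (an assumed Huber-like form, 0 < c <= 1).
    """
    # With x = tan(theta), theta is uniform on (-pi/2, pi/2) and the Cauchy
    # score becomes -sin(2 * theta); the clipping point is a = arctan(c).
    a = math.atan(c)
    # Integral of sin^2(2 * theta) over [-a, a].
    inner = a - math.sin(4 * a) / 4
    # E[psi_c^2]: unclipped part plus the two constant tails.
    e_psi_sq = (inner + (math.pi - 2 * a) * math.sin(2 * a) ** 2) / math.pi
    # E[psi_c * score], which equals -E[psi_c'] by integration by parts.
    e_psi_score = (inner + math.sin(2 * a) * (1 + math.cos(2 * a))) / math.pi
    # Efficiency = (1 / Fisher information) / sandwich variance.
    return e_psi_score ** 2 / (FISHER_INFO_CAUCHY * e_psi_sq)


def best_threshold(grid_size=1000):
    """Grid search over c in (0, 1] for the most efficient clipped score."""
    grid = [k / grid_size for k in range(1, grid_size + 1)]
    c = max(grid, key=clipped_score_efficiency)
    return c, clipped_score_efficiency(c)


if __name__ == "__main__":
    c, eff = best_threshold()
    print(f"best clipping threshold: {c:.3f}, relative efficiency: {eff:.4f}")
```

Under these assumptions the grid search should settle on a threshold of roughly 0.43 with a relative efficiency just above 0.87, in line with the figure quoted above. As the threshold shrinks towards zero the clipped score behaves like a sign function, and the efficiency tends to that of the sample median, $8/\pi^2 \approx 0.81$.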