The history of the seemingly simple problem of straight line fitting in the presence of both $x$ and $y$ errors has been fraught with misadventure, with statistically ad hoc and poorly tested methods abounding in the literature. The problem stems from the emergence of latent variables describing the "true" values of the independent variables, the priors on which have a significant impact on the regression result. By analytic calculation of maximum a posteriori values and biases, and comprehensive numerical mock tests, we assess the quality of possible priors. In the presence of intrinsic scatter, the only prior that we find to give reliably unbiased results in general is a mixture of one or more Gaussians with means and variances determined as part of the inference. We find that a single Gaussian is typically sufficient and dub this model Marginalised Normal Regression (MNR). We illustrate the necessity for MNR by comparing it to alternative methods on an important linear relation in cosmology, and extend it to nonlinear regression and an arbitrary covariance matrix linking $x$ and $y$. We publicly release a Python/Jax implementation of MNR and its Gaussian mixture model extension that is coupled to Hamiltonian Monte Carlo for efficient sampling, which we call ROXY (Regression and Optimisation with X and Y errors).
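To make the marginalisation concrete, the following is a minimal sketch of a single-Gaussian marginalised likelihood for the straight-line case. It assumes a model $y = A x + B$ with independent Gaussian measurement errors on $x$ and $y$, Gaussian intrinsic scatter in $y$, and a Gaussian prior of mean `mu` and width `w` on the true $x$ values. Under those assumptions the latent variables integrate out analytically, leaving a bivariate Gaussian in the observed $(x, y)$. The function name and parameter names are illustrative only and are not taken from the ROXY interface.

```python
import math


def mnr_log_likelihood(x_obs, y_obs, x_err, y_err,
                       slope, intercept, sig_int, mu, w):
    """Illustrative log-likelihood for y = slope * x + intercept.

    The true x values are given a Gaussian prior N(mu, w^2) and integrated
    out analytically, so each observed (x, y) pair follows a bivariate
    Gaussian. All names here are hypothetical, chosen for this sketch.
    """
    total = 0.0
    for x, y, sx, sy in zip(x_obs, y_obs, x_err, y_err):
        # Covariance of (x_obs, y_obs) after marginalising over x_true
        sxx = w**2 + sx**2
        sxy = slope * w**2
        syy = slope**2 * w**2 + sy**2 + sig_int**2
        det = sxx * syy - sxy**2

        # Residuals from the marginal mean (mu, slope * mu + intercept)
        dx = x - mu
        dy = y - (slope * mu + intercept)

        # Quadratic form using the explicit inverse of the 2x2 covariance
        chi2 = (syy * dx**2 - 2.0 * sxy * dx * dy + sxx * dy**2) / det
        total += -0.5 * (chi2 + math.log(det)) - math.log(2.0 * math.pi)
    return total
```

In a full analysis `mu` and `w` would be sampled alongside the slope, intercept and intrinsic scatter rather than fixed, consistent with the abstract's statement that the prior's mean and variance are determined as part of the inference.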