The history of the seemingly simple problem of straight line fitting in the presence of both $x$ and $y$ errors has been fraught with misadventure, with statistically ad hoc and poorly tested methods abounding in the literature. The problem stems from the emergence of latent variables describing the "true" values of the independent variables, the priors on which have a significant impact on the regression result. By analytic calculation of maximum a posteriori values and biases, and comprehensive numerical mock tests, we assess the quality of possible priors. In the presence of intrinsic scatter, the only prior that we find to give reliably unbiased results in general is a mixture of one or more Gaussians with means and variances determined as part of the inference. We find that a single Gaussian is typically sufficient and dub this model Marginalised Normal Regression (MNR). We illustrate the necessity for MNR by comparing it to alternative methods on an important linear relation in cosmology, and extend it to nonlinear regression and an arbitrary covariance matrix linking $x$ and $y$. We publicly release a Python/Jax implementation of MNR and its Gaussian mixture model extension that is coupled to Hamiltonian Monte Carlo for efficient sampling, which we call ROXY (Regression and Optimisation with X and Y errors).
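As an illustration of the single-Gaussian case described above, the following is a minimal sketch of the marginalised likelihood, not the ROXY implementation or its API. It assumes a true value $x_t \sim \mathcal{N}(\mu, w^2)$, an observed $x = x_t + \mathcal{N}(0, \sigma_x^2)$, and an observed $y = A x_t + B + \mathcal{N}(0, \sigma_y^2 + \sigma_{\rm int}^2)$, with independent $x$ and $y$ errors. Integrating out $x_t$ under these assumptions leaves a bivariate Gaussian in $(x, y)$ for each data point. The function name and all argument names are hypothetical, chosen for this example only.

```python
import math


def mnr_log_likelihood(x_obs, y_obs, sig_x, sig_y, A, B, sig_int, mu, w):
    """Log-likelihood of a straight line y = A*x + B with x and y errors,
    after analytically marginalising the latent true x values under a
    Gaussian prior N(mu, w^2).

    All names are illustrative assumptions, not taken from any library.
    Assumes uncorrelated x and y measurement errors.
    """
    total = 0.0
    for x, y, sx, sy in zip(x_obs, y_obs, sig_x, sig_y):
        # Covariance of (x_obs, y_obs) once the true x is integrated out.
        cxx = w**2 + sx**2
        cxy = A * w**2
        cyy = A**2 * w**2 + sy**2 + sig_int**2
        det = cxx * cyy - cxy**2

        # Residuals about the marginal mean (mu, A*mu + B).
        dx = x - mu
        dy = y - (A * mu + B)

        # Quadratic form using the closed-form inverse of a 2x2 matrix.
        chi2 = (cyy * dx * dx - 2.0 * cxy * dx * dy + cxx * dy * dy) / det

        # Bivariate Gaussian log-density.
        total += -0.5 * chi2 - 0.5 * math.log(det) - math.log(2.0 * math.pi)
    return total
```

In this sketch $\mu$ and $w$ are free parameters alongside $A$, $B$ and $\sigma_{\rm int}$, so a sampler or optimiser would vary all five together; that is what "means and variances determined as part of the inference" amounts to in the single-Gaussian case.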