Reliably measuring the collinearity of bivariate data is crucial in statistics, particularly for time-series analysis or ongoing studies in which incoming observations can significantly impact current collinearity estimates. Leveraging identities from Welford's online algorithm for sample variance, we develop a rigorous theoretical framework for analyzing the maximal change to the Pearson correlation coefficient and its p-value that can be induced by additional data. Further, we show that the resulting optimization problems yield elegant closed-form solutions that can be accurately computed by linear- and constant-time algorithms. Our work not only creates new theoretical avenues for robust correlation measures, but also has broad practical implications for disciplines that span econometrics, operations research, clinical trials, climatology, differential privacy, and bioinformatics. Software implementations of our algorithms in Cython-wrapped C are made available at https://github.com/marc-harary/sensitivity for reproducibility, practical deployment, and future theoretical development.
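To make the setting concrete, the following is a minimal sketch of a Welford-style running update for the Pearson correlation coefficient, together with a helper that evaluates how one candidate observation would change it. It is not the closed-form solution described above and is not taken from the linked repository; the names `RunningCorrelation` and `r_after_adding` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, replace


@dataclass
class RunningCorrelation:
    """Running sufficient statistics for Pearson's r (illustrative sketch)."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0  # running sum of squared deviations of x
    m2_y: float = 0.0  # running sum of squared deviations of y
    c_xy: float = 0.0  # running sum of cross-deviations of x and y

    def update(self, x: float, y: float) -> None:
        """Fold one observation (x, y) into the statistics in O(1) time."""
        self.n += 1
        dx = x - self.mean_x  # deviation from the old mean of x
        dy = y - self.mean_y  # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Welford-style updates: old-mean deviation times new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def pearson_r(self) -> float:
        """Return Pearson's r, or NaN if it is undefined for the data so far."""
        denom = math.sqrt(self.m2_x * self.m2_y)
        if self.n < 2 or denom == 0.0:
            return math.nan
        return self.c_xy / denom


def r_after_adding(state: RunningCorrelation, x: float, y: float) -> float:
    """Return the r that would result from adding (x, y), leaving `state` unchanged."""
    trial = replace(state)  # shallow copy of the dataclass fields
    trial.update(x, y)
    return trial.pearson_r()


if __name__ == "__main__":
    rc = RunningCorrelation()
    for x, y in zip([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]):
        rc.update(x, y)
    print("current r:", rc.pearson_r())
    print("r if (5, 0) were observed next:", r_after_adding(rc, 5.0, 0.0))
```

Because each update touches only six running quantities, evaluating the effect of a single candidate point takes constant time. Finding the point that maximizes the change is the optimization problem the abstract refers to, and this sketch does not attempt to solve it.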