Interpretation of High-Dimensional Linear Regression: Effects of Nullspace and Regularization Demonstrated on Battery Data

From arXiv. Manuscript: 14 pages, 7 figures; Supplementary Information: 4 pages, 2 figures; Code available: https://github.com/JoachimSchaeffer/HDRegAnalytics

High-dimensional linear regression is important in many scientific fields. This article considers discretely measured data of underlying smooth latent processes, as are often obtained from chemical or biological systems. Interpretation in high dimensions is challenging because the nullspace and its interplay with regularization shape the regression coefficients. The data's nullspace contains all coefficient vectors $\mathbf{w}$ that satisfy $\mathbf{Xw}=\mathbf{0}$, so very different coefficients can yield identical predictions. We developed an optimization formulation that compares regression coefficients with coefficients obtained from physical engineering knowledge, in order to understand which parts of the coefficient differences are close to the nullspace. This nullspace method is tested on a synthetic example and on lithium-ion battery data. The case studies show that regularization and z-scoring are design choices that, if chosen in accordance with prior physical knowledge, lead to interpretable regression results. Otherwise, the combination of the nullspace and regularization hinders interpretability and can make it impossible to obtain regression coefficients close to the true coefficients, even when there is a true underlying linear model. Furthermore, we demonstrate that regression methods that do not produce coefficients orthogonal to the nullspace, such as fused lasso, can improve interpretability. In conclusion, the insights gained from the nullspace perspective help to make informed design choices when building regression models on high-dimensional data and when reasoning about potential underlying linear models, which are important for system optimization and for improving scientific understanding.
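The role of the nullspace can be illustrated with a minimal NumPy sketch. It is not taken from the HDRegAnalytics repository. The random Gaussian data, the dimensions, and all variable names are assumptions chosen for illustration only, and the minimum-norm least-squares solution stands in for a regularized regression whose coefficients are orthogonal to the nullspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: fewer samples than features, so X has a nontrivial nullspace.
n_samples, n_features = 5, 20
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)   # hypothetical "true" linear model
y = X @ w_true

# Orthonormal basis of the nullspace of X from the full SVD:
# the rows of Vt beyond the numerical rank span {w : Xw = 0}.
_, s, Vt = np.linalg.svd(X, full_matrices=True)
rank = int(np.sum(s > 1e-10))
N = Vt[rank:].T                        # shape (n_features, n_features - rank)

# Minimum-norm least-squares coefficients lie in the row space of X,
# i.e. they are orthogonal to the nullspace.
w_min = np.linalg.pinv(X) @ y

# Adding any nullspace vector changes the coefficients but not the predictions.
v = N @ rng.normal(size=N.shape[1])
w_alt = w_min + 10.0 * v

same_predictions = np.allclose(X @ w_min, X @ w_alt)
min_norm_orthogonal = np.allclose(N.T @ w_min, 0.0, atol=1e-8)
# The gap between the true and the minimum-norm coefficients lies in the nullspace.
gap_in_nullspace = np.allclose(X @ (w_true - w_min), 0.0, atol=1e-8)

print(same_predictions, min_norm_orthogonal, gap_in_nullspace)
print(np.linalg.norm(w_true - w_min))  # nonzero although both fit y exactly
```

In this toy setting the regression recovers only the component of `w_true` that lies in the row space of `X`. The remainder is invisible to the data, which is why the choice of regularization and scaling determines how the learned coefficients look.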
