Towards a data-scale independent regulariser for robust sparse identification of non-linear dynamics

Data normalisation, a common and often necessary preprocessing step in engineering and scientific applications, can severely distort the discovery of governing equations by magnitude-based sparse regression methods. This issue is particularly acute for the Sparse Identification of Nonlinear Dynamics (SINDy) framework, where the core assumption of sparsity is undermined by the interaction between data scaling and measurement noise. The resulting models can be dense, uninterpretable, and physically incorrect. To address this vulnerability, we introduce the Sequential Thresholding of Coefficient of Variation (STCV), a novel, computationally efficient sparse regression algorithm that is inherently robust to data scaling. STCV replaces conventional magnitude-based thresholding with a dimensionless statistical metric, the Coefficient Presence (CP), which assesses the statistical validity and consistency of candidate terms in the model library. This shift from magnitude to statistical significance makes the discovery process invariant to arbitrary data scaling. Through comprehensive benchmarking on canonical dynamical systems and practical engineering problems, including a physical mass-spring-damper experiment, we demonstrate that STCV consistently and significantly outperforms standard Sequential Thresholding Least Squares (STLSQ) and Ensemble-SINDy (E-SINDy) on normalised, noisy datasets. The results show that STCV-based methods can identify the correct, sparse physical laws even when other methods fail. By mitigating the distorting effects of normalisation, STCV makes sparse system identification a more reliable and automated tool for real-world applications, thereby enhancing model interpretability and trustworthiness.
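
The contrast between magnitude-based and dimensionless thresholding can be illustrated with a minimal sketch. It is not the STCV algorithm, and the score used here is not the authors' definition of CP, which the abstract does not give. As an assumption, the sketch uses a hypothetical stand-in score, |mean| / standard deviation of each least-squares coefficient over bootstrap resamples. The test system (dx/dt = -2x), the cubic polynomial library, the noise level, and the scale factor `s` are all illustrative choices. Rescaling the state by `s` multiplies the coefficient of x^k by s^(1-k), so noise-fitted higher-order terms grow in magnitude, while the dimensionless score is unchanged.

```python
import numpy as np

def library(x):
    """Candidate terms [1, x, x^2, x^3] for a scalar state x (illustrative choice)."""
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def bootstrap_coefficients(x, dx, n_boot=200, seed=0):
    """Ordinary least-squares coefficients fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    theta = library(x)
    n = len(x)
    coefs = np.empty((n_boot, theta.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs[b] = np.linalg.lstsq(theta[idx], dx[idx], rcond=None)[0]
    return coefs

def consistency_score(coefs):
    """Hypothetical stand-in score: |mean| / std of each coefficient (dimensionless)."""
    return np.abs(coefs.mean(axis=0)) / coefs.std(axis=0)

# Synthetic noisy measurements of dx/dt = -2 x (assumed test system).
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=400)
dx = -2.0 * x + 0.1 * rng.standard_normal(x.size)

s = 0.1  # assumed rescaling of the state, e.g. a change of units
coefs_raw = bootstrap_coefficients(x, dx)
coefs_scaled = bootstrap_coefficients(s * x, s * dx)  # same seed, same resamples

# Coefficient of x^k is multiplied by s^(1 - k) under the rescaling.
scale_factor = s ** (1 - np.arange(4))

score_raw = consistency_score(coefs_raw)
score_scaled = consistency_score(coefs_scaled)

print("mean coefficients (raw):   ", coefs_raw.mean(axis=0))
print("mean coefficients (scaled):", coefs_scaled.mean(axis=0))
print("score (raw):   ", score_raw)
print("score (scaled):", score_scaled)
```

In this sketch a fixed magnitude threshold would act differently on the raw and rescaled coefficients, whereas a threshold on the score would select the same terms in both cases. Normalisation that also shifts the data (for example mean subtraction) mixes library terms and is not covered by this simple scaling argument.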
