We propose KO-PDE-IDENT, a data-driven framework for identifying parsimonious partial differential equations (PDEs) with false discovery rate (FDR) control. PDE discovery from noisy observations is often hindered by extreme multicollinearity among candidate terms, which causes standard sparse-regression methods to select spurious terms. To address this problem, KO-PDE-IDENT first identifies a support set of candidate terms using model-X knockoff filters with finite-sample FDR control, and then refines and ranks the surviving candidate PDEs. The framework integrates three components. First, knockoff feature statistics are constructed by coupling $\ell_{0}$-constrained adaptive best-subset selection with SHapley Additive exPlanations (SHAP), yielding an effective and computationally efficient difference statistic. Second, a recursive feature elimination (RFE) procedure removes terms whose marginal contributions are dispensable and assesses the statistical necessity of each term through knockoff-perturbed hypothesis testing. Third, final model selection is formulated as a multi-criteria decision-making (MCDM) problem, in which the optimal governing equation is the candidate that best balances multiple criteria, such as predictive accuracy, model complexity, and coefficient uncertainty. We validate KO-PDE-IDENT on five canonical PDEs under severe noise corruption. Empirical results show that the framework can exactly recover the true PDE structure, eliminating false discoveries while retaining all true terms, with low coefficient estimation error.
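To make the knockoff selection step concrete, the following is a minimal sketch of the standard knockoff+ thresholding rule applied to a vector of difference statistics $W_j$, where a large positive $W_j$ indicates that an original candidate term is more important than its knockoff copy. It assumes the statistics have already been computed; how KO-PDE-IDENT builds them from best-subset selection and SHAP is not reproduced here. The function name `knockoff_plus_select` and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np


def knockoff_plus_select(W, q=0.1):
    """Select features with the knockoff+ threshold at target FDR level q.

    W is assumed to be a precomputed vector of knockoff difference
    statistics (hypothetical input; not the paper's SHAP-based statistic).
    Returns (selected_indices, threshold).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Estimated number of false discoveries, with the "+1" correction
        # that distinguishes knockoff+ from the plain knockoff rule.
        false_est = 1 + np.sum(W <= -t)
        # Number of features that would be selected at this threshold.
        n_selected = max(1, np.sum(W >= t))
        if false_est / n_selected <= q:
            return np.flatnonzero(W >= t), float(t)
    # No threshold meets the target level: select nothing.
    return np.array([], dtype=int), float("inf")


# Hypothetical statistics for ten candidate terms.
W_example = [5.0, 4.0, 3.0, 2.5, 2.0, -0.5, 0.3, -0.2, 1.5, 0.1]
selected, threshold = knockoff_plus_select(W_example, q=0.2)
print(selected, threshold)
```

With these illustrative values and `q=0.2`, the smallest threshold satisfying the bound is 1.5, so the six terms with statistics of at least 1.5 are retained; tightening the target to `q=0.1` selects nothing, which shows how the "+1" correction makes the rule conservative when few candidates are available.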