On sparse regression, Lp-regularization, and automated model discovery

Sparse regression and feature extraction are the cornerstones of knowledge discovery from massive data. Their goal is to discover interpretable and predictive models that provide simple relationships among scientific variables. While the statistical tools for model discovery are well established in the context of linear regression, their generalization to nonlinear regression in material modeling is highly problem-specific and insufficiently understood. Here we explore the potential of neural networks for automatic model discovery and induce sparsity by a hybrid approach that combines two strategies: regularization and physical constraints. We integrate the concept of Lp regularization for subset selection with constitutive neural networks that leverage our domain knowledge in kinematics and thermodynamics. We train our networks with both synthetic and real data, and perform several thousand discovery runs to infer common guidelines and trends: L2 regularization, or ridge regression, is unsuitable for model discovery; L1 regularization, or lasso, promotes sparsity but induces strong bias; only L0 regularization allows us to transparently fine-tune the trade-off between interpretability and predictability, simplicity and accuracy, and bias and variance. With these insights, we demonstrate that Lp regularized constitutive neural networks can simultaneously discover both interpretable models and physically meaningful parameters. We anticipate that our findings will generalize to alternative discovery techniques such as sparse and symbolic regression, and to other domains such as biology, chemistry, or medicine. Our ability to automatically discover material models from data could have tremendous applications in generative material design and open new opportunities to manipulate matter, alter properties of existing materials, and discover new materials with user-defined properties.
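The contrast among L2, L1, and L0 regularization can be illustrated with a minimal sketch. It is not the constitutive neural network setting of this work: it assumes a linear least-squares problem with an orthonormal design, for which the penalized minimizer of 0.5*(w - z)^2 + alpha*penalty(w) has a closed form per coefficient. The function names `lp_penalty` and `shrink`, the penalty weight `alpha`, and the sample coefficients `z` are hypothetical choices made for illustration only.

```python
import numpy as np

def lp_penalty(w, p):
    """Lp penalty: sum of |w_i|^p for p > 0, number of nonzero entries for p = 0."""
    w = np.asarray(w, dtype=float)
    if p == 0:
        return float(np.count_nonzero(w))
    return float(np.sum(np.abs(w) ** p))

def shrink(z, alpha, p):
    """Minimize 0.5*(w - z)^2 + alpha*penalty(w) per coefficient.

    z holds the unregularized least-squares coefficients
    (assumption: orthonormal design, so coefficients decouple).
    """
    z = np.asarray(z, dtype=float)
    if p == 2:
        # Ridge with penalty w^2: uniform scaling, never exactly zero for z != 0.
        return z / (1.0 + 2.0 * alpha)
    if p == 1:
        # Lasso: soft thresholding; survivors are shifted toward zero by alpha.
        return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)
    if p == 0:
        # L0: hard thresholding; survivors keep their unregularized value.
        return np.where(np.abs(z) > np.sqrt(2.0 * alpha), z, 0.0)
    raise ValueError("closed form implemented only for p in {0, 1, 2}")

z = np.array([3.0, 0.5, -2.0, 0.1])   # hypothetical unregularized coefficients
alpha = 0.5                            # hypothetical penalty weight
w_l2 = shrink(z, alpha, 2)
w_l1 = shrink(z, alpha, 1)
w_l0 = shrink(z, alpha, 0)
print("L2:", w_l2, "nonzero terms:", lp_penalty(w_l2, 0))
print("L1:", w_l1, "nonzero terms:", lp_penalty(w_l1, 0))
print("L0:", w_l0, "nonzero terms:", lp_penalty(w_l0, 0))
```

In this toy setting, the ridge solution keeps all four terms, the lasso solution drops the two small terms but shrinks the remaining ones, and the L0 solution drops the same two terms while leaving the remaining ones unchanged. This mirrors, in a much simpler model, the trends summarized above.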
