In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered by composing a set of elementary algebraic operators on primary variables. Standard practice analyzes the high-dimensional candidate predictor space directly in a linear model; statistical analysis is then substantially hampered by an astronomically large number of correlated predictors and a limited sample size. We formulate this problem as variable selection with operator-induced structure (OIS) and propose a new method that achieves unconventional dimension reduction by utilizing the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on ab initio primary variables, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. We introduce an OIS screening property for variable selection with OIS; interestingly, finite-sample assessment indicates that the employed variable selection method, based on Bayesian Additive Regression Trees (BART), enjoys this property under the simulation settings. Numerical studies show the superiority of the proposed method, which continues to exhibit robust performance when the dimension of the engineered features is out of reach of existing methods. Our application to single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights that guide the prevention of a notorious problem called sintering and aid catalysis design.
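To make the dimensionality problem concrete, the following is a minimal sketch of how an engineered feature space grows when operators are composed on primary variables. It is an illustration only, not the proposed method: the operator set (`sq`, `exp`, `add`, `mul`), the helper names `apply_operators` and `engineer`, and the synthetic data are all assumptions, since the abstract does not list the operators or data it uses.

```python
import itertools
import numpy as np

# Hypothetical operator set; the source does not specify which operators it uses.
UNARY = {
    "sq": lambda x: x ** 2,
    "exp": np.exp,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def apply_operators(features):
    """Apply every operator once to a dict of name -> column, keeping the inputs."""
    out = dict(features)
    for op, f in UNARY.items():
        for name, col in features.items():
            out[f"{op}({name})"] = f(col)
    for op, f in BINARY.items():
        # Both binary operators here are symmetric, so unordered pairs suffice.
        for (n1, c1), (n2, c2) in itertools.combinations(features.items(), 2):
            out[f"{op}({n1},{n2})"] = f(c1, c2)
    return out

def engineer(primary, depth):
    """Compose the operator set `depth` times, starting from the primary variables."""
    feats = dict(primary)
    for _ in range(depth):
        feats = apply_operators(feats)
    return feats

# Synthetic primary variables (assumed): 3 variables, 20 samples.
rng = np.random.default_rng(0)
primary = {f"x{j}": rng.uniform(0.5, 1.5, size=20) for j in range(3)}

level1 = engineer(primary, 1)
level2 = engineer(primary, 2)
print(len(primary), len(level1), len(level2))
```

Even with three primary variables and four operators, two rounds of composition yield hundreds of candidate features from only 20 samples, and many of them are strongly correlated because they share the same inputs. This is the regime in which selecting on the primary variables, rather than on the full engineered space, becomes attractive.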