Materials datasets typically contain many redundant (highly similar) materials, a consequence of the incremental, tinkering style of material design that has characterized the history of materials research. For example, the Materials Project database contains many cubic perovskite materials similar to SrTiO$_3$. This sample redundancy undermines random splitting for machine learning (ML) model evaluation: because test samples closely resemble training samples, ML models tend to achieve overestimated predictive performance, which misleads the materials science community. The issue is well known in bioinformatics, where protein function prediction routinely applies a redundancy reduction procedure (CD-HIT) to ensure that no pair of samples has a sequence similarity greater than a given threshold. This paper surveys overestimated ML performance reported in the literature for both composition-based and structure-based material property prediction. We then propose a material dataset redundancy reduction algorithm called MD-HIT and evaluate it with several composition-based and structure-based distance thresholds for reducing dataset sample redundancy. We show that with this control, the reported performance of ML models tends to better reflect their true prediction capability. Our MD-HIT code can be freely accessed at https://github.com/usccolumbia/MD-HIT
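The threshold-based redundancy control described above can be illustrated with a minimal greedy sketch. This is not the MD-HIT implementation from the repository; it only shows the general idea of keeping a sample when it lies at least a given distance from every sample already kept. The function names `euclidean` and `greedy_reduce`, the use of Euclidean distance on plain feature vectors, and the toy data are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_reduce(features, threshold):
    """Return indices of a subset whose members are pairwise >= threshold apart.

    Hypothetical helper: samples are visited in input order, and a sample is
    kept only if it is at least `threshold` away from every kept sample.
    """
    kept = []
    for i, vec in enumerate(features):
        if all(euclidean(vec, features[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Toy feature vectors (made up): two near-duplicate pairs plus one outlier.
features = [
    [0.0, 0.0],
    [0.05, 0.0],   # near-duplicate of sample 0
    [1.0, 1.0],
    [1.0, 1.02],   # near-duplicate of sample 2
    [3.0, 0.0],
]

kept = greedy_reduce(features, threshold=0.5)
print(kept)  # [0, 2, 4]
```

The result of such a greedy pass depends on the order in which samples are visited, and a larger threshold yields a smaller, less redundant dataset. In practice the distance would be computed on composition or structure descriptors rather than on raw toy vectors.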