It stands to reason that the amount and quality of big data are of key importance for building accurate AI-driven models. Nonetheless, we believe there are still critical roadblocks inherent in the generation of databases, which are often underestimated and poorly discussed in the literature. In our view, such issues can seriously hinder the AI-based discovery process, even when high-quality, sufficiently large and highly reputable data sources are available. Here, taking superconducting and thermoelectric materials as two representative case studies, we discuss three specific aspects: intrinsically biased sample selection, possible hidden variables, and disparate data age. Importantly, we propose and test what is, to our knowledge, the first strategy capable of detecting and quantifying the presence of intrinsic data bias.