Machine learning in materials science faces challenges due to limited experimental data, as generating synthesis data is costly and time-consuming, especially with in-house experiments. Mining data from existing literature introduces issues like mixed data quality, inconsistent formats, and variations in reporting experimental parameters, complicating the creation of consistent features for the learning algorithm. Additionally, combining continuous and discrete features can hinder the learning process with limited data. Here, we propose strategies that utilize large language models (LLMs) to enhance machine learning performance on a limited, heterogeneous dataset of graphene chemical vapor deposition synthesis compiled from existing literature. These strategies include prompting modalities for imputing missing data points and leveraging large language model embeddings to encode the complex nomenclature of substrates reported in chemical vapor deposition experiments. The proposed strategies enhance graphene layer classification using a support vector machine (SVM) model, increasing binary classification accuracy from 39% to 65% and ternary accuracy from 52% to 72%. We compare the performance of the SVM and a GPT-4 model, both trained and fine-tuned on the same data. Our results demonstrate that the numerical classifier, when combined with LLM-driven data enhancements, outperforms the standalone LLM predictor, highlighting that in data-scarce scenarios, improving predictive learning with LLM strategies requires more than simple fine-tuning on datasets. Instead, it necessitates sophisticated approaches for data imputation and feature space homogenization to achieve optimal performance. The proposed strategies emphasize data enhancement techniques, offering a broadly applicable framework for improving machine learning performance on scarce, inhomogeneous datasets.
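The first strategy above, prompting an LLM to impute missing data points, can be illustrated with a minimal prompt-construction sketch. All field names, the wording of the template, and the `build_imputation_prompt` helper are hypothetical assumptions for illustration, not the paper's actual prompt design; a real pipeline would send the resulting string to an LLM API and parse the reply.

```python
def build_imputation_prompt(record: dict, missing: list) -> str:
    """Assemble a prompt asking an LLM to estimate missing synthesis fields.

    Hypothetical illustration: the template wording and field names are
    assumptions, not the study's actual prompting modality.
    """
    # List the parameters that were reported for this literature record.
    known = "\n".join(f"- {k}: {v}" for k, v in record.items() if v is not None)
    wanted = ", ".join(missing)
    return (
        "The following graphene CVD synthesis record is incomplete.\n"
        f"Known parameters:\n{known}\n"
        f"Based on typical values reported in the literature, estimate: {wanted}.\n"
        "Reply with one 'field: value' pair per line."
    )

# Example: a record missing its chamber pressure.
record = {"substrate": "Cu foil", "temperature_C": 1000, "pressure_Torr": None}
prompt = build_imputation_prompt(record, ["pressure_Torr"])
```

The LLM's structured reply would then be parsed back into the feature table, replacing the missing entry before training.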
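The second strategy, encoding substrate nomenclature with LLM embeddings and feeding the result to an SVM, can be sketched as below. This is a minimal, self-contained illustration, not the paper's implementation: the records and labels are toy data, and the `embed_substrate` function is a deterministic hash-based stand-in for a real LLM embedding call so the sketch runs offline.

```python
import hashlib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def embed_substrate(name: str, dim: int = 8) -> np.ndarray:
    """Stand-in for an LLM text-embedding call.

    A real pipeline would send the substrate description (e.g. 'Cu foil,
    25 um, annealed') to an embedding model; here a deterministic hash
    keeps the sketch runnable without API access.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return np.frombuffer(digest, dtype=np.uint8)[:dim].astype(float) / 255.0

# Toy CVD records: (substrate description, temperature in C, CH4 flow in sccm)
records = [
    ("Cu foil, 25 um",           1000,  5),
    ("Cu foil, annealed",        1035, 10),
    ("Ni film on SiO2",           900, 30),
    ("Ni foil, polycrystalline",  950, 40),
    ("Cu foil, electropolished", 1020,  8),
    ("Ni film, 300 nm",           920, 35),
]
# Illustrative labels: 0 = monolayer graphene, 1 = multilayer.
y = np.array([0, 0, 1, 1, 0, 1])

# Concatenate the substrate embedding with the numeric process features,
# giving every record a homogeneous fixed-length feature vector.
X = np.array(
    [np.concatenate([embed_substrate(s), [t, f]]) for s, t, f in records]
)
# Standardize so embedding dimensions and raw process values share a scale.
X = StandardScaler().fit_transform(X)

clf = SVC(kernel="rbf", C=10.0).fit(X, y)
train_acc = clf.score(X, y)
```

Free-text substrate names that would otherwise need brittle one-hot categories become dense vectors, so the SVM sees one consistent numeric feature space across continuous and nominal inputs.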