Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments in in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently suffer from data scarcity, which hampers their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to leverage this pseudo data effectively. Experiments show that using pseudo data for domain adaptation outperforms all existing methods while requiring a smaller model scale, less data, and lower training cost, highlighting its efficiency. Furthermore, our method shows sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery. Our code and data are available at https://github.com/SCIR-HI/ArtificiallyR2R.