Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent work on in-silico molecule discovery has highlighted the promise of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently suffer from data scarcity, which hampers their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore how best to leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods while requiring a smaller model, less data, and lower training cost, highlighting its efficiency. Furthermore, our method shows sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data for advancing low-resource cross-modal molecule discovery. Our code and data are available at https://github.com/SCIR-HI/ArtificiallyR2R.