Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments in in-silico molecule discovery have highlighted the promise of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently suffer from data scarcity, which hampers their performance and application. In this paper, we address the low-resource challenge by utilizing "artificially-real" data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods while requiring a smaller model, less data, and lower training cost, highlighting its efficiency. Furthermore, our method shows sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery.