Molecule discovery is a pivotal research field, with impact ranging from medicine to materials science. Recently, Large Language Models (LLMs) have been widely adopted for molecular understanding and generation, serving as a bridge between the molecular space and the natural language space. However, aligning molecules with their corresponding captions remains a significant challenge. Previous approaches typically treat a molecule as a monolithic input, which leaves out any intermediate reasoning process and sacrifices explainability. In this work, we define fine-grained alignments as the precise correspondence between a molecule's sub-structures and the textual phrases that explain their properties. These alignments are crucial for LLMs to understand molecules accurately and explainably. Normally, such fine-grained alignments require expert annotation, which is both costly and time-consuming. To allow LLMs to label and learn the fine-grained alignments automatically, we propose MolReFlect, a novel teacher-student framework in which a teacher LLM first generates and refines mappings between caption phrases and SMILES substructures and then explicitly teaches these detailed alignments to a student LLM. Experimental results demonstrate that MolReFlect enables LLMs to significantly outperform previous baselines, achieving state-of-the-art performance on the molecule-caption translation task. Our code is available at: https://github.com/phenixace/MolReFlect.
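To make the notion of a fine-grained alignment concrete, the following is a minimal sketch of how a substructure-to-phrase mapping might be represented and passed on as context for a student model. It is an illustration only and not the MolReFlect implementation: the names `Alignment`, `keep_grounded`, and `format_for_student` are hypothetical, the example alignments are written by hand rather than produced by a teacher LLM, and the literal phrase check is a simple stand-in for a refinement step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One fine-grained alignment: a SMILES fragment and the phrase describing it."""
    substructure: str  # SMILES fragment (hypothetical, written by hand here)
    phrase: str        # caption phrase that explains the fragment


def keep_grounded(alignments, caption):
    """Keep only alignments whose phrase literally occurs in the caption.

    Assumption: a simple stand-in for refinement, not the paper's procedure.
    """
    lowered = caption.lower()
    return [a for a in alignments if a.phrase.lower() in lowered]


def format_for_student(smiles, alignments):
    """Render the molecule and its alignments as plain-text context."""
    lines = [f"Molecule: {smiles}"]
    lines += [f"- {a.substructure}: {a.phrase}" for a in alignments]
    return "\n".join(lines)


# Toy example: acetic acid with a hand-written caption.
smiles = "CC(=O)O"
caption = "Acetic acid is a simple carboxylic acid bearing a methyl group."
candidates = [
    Alignment("C(=O)O", "carboxylic acid"),
    Alignment("C", "methyl group"),
    Alignment("N", "amine"),  # not supported by the caption, so it is dropped
]
grounded = keep_grounded(candidates, caption)
print(format_for_student(smiles, grounded))
```

In this sketch, the alignments are kept as explicit (substructure, phrase) pairs so that each claim in a caption can be traced back to a fragment of the molecule, which is the kind of explainability that fine-grained alignments are meant to provide.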