GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning

Molecule property prediction has gained significant attention in recent years. The main bottleneck is label insufficiency caused by expensive lab experiments. To alleviate this issue and to better leverage textual knowledge for tasks, this study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. We discover that existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs. To overcome these issues, we propose GIMLET, which unifies language models for both graph and text data. By adopting generalized position embedding, our model is extended to encode both graph structures and instruction text without additional graph encoding modules. GIMLET also decouples the encoding of the graph from task instructions in the attention mechanism, enhancing the generalization of graph features across novel tasks. We construct a dataset consisting of more than two thousand molecule tasks with corresponding instructions derived from task descriptions. We pretrain GIMLET on the molecule tasks along with instructions, enabling the model to transfer effectively to a broad range of tasks. Experimental results demonstrate that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even achieving results close to those of supervised GNN models on tasks such as ToxCast and MUV.
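The abstract names two mechanisms without giving details: a generalized position embedding shared by graph nodes and text tokens, and an attention scheme that decouples graph encoding from the instruction. The following is a minimal sketch of one way these ideas could be realized, not the authors' implementation. It assumes that node-to-node relative position is the shortest-path hop distance, that token-to-token relative position is the sequence offset, that graph-text pairs share a single marker, and that decoupling means graph nodes never attend to instruction tokens. All names (`relative_positions`, `attention_mask`, `CROSS`) are hypothetical.

```python
from collections import deque

CROSS = "cross"  # hypothetical marker for any graph<->text pair


def shortest_path_lengths(num_nodes, edges):
    """All-pairs hop distance on an undirected graph via BFS (-1 if unreachable)."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for s in range(num_nodes):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] < 0:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def relative_positions(num_nodes, edges, num_text):
    """Relative-position table for a sequence laid out as [graph nodes | text tokens]."""
    n = num_nodes + num_text
    spd = shortest_path_lengths(num_nodes, edges)
    pos = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_graph, j_graph = i < num_nodes, j < num_nodes
            if i_graph and j_graph:
                pos[i][j] = spd[i][j]   # assumed: graph distance between atoms
            elif not i_graph and not j_graph:
                pos[i][j] = j - i       # assumed: ordinary sequence offset
            else:
                pos[i][j] = CROSS       # assumed: one shared graph-text bucket
    return pos


def attention_mask(num_nodes, num_text):
    """mask[i][j] is True when query i may attend to key j.

    Assumed decoupling: graph queries cannot see text keys, so node
    representations do not depend on the instruction.
    """
    n = num_nodes + num_text
    return [[not (i < num_nodes and j >= num_nodes) for j in range(n)]
            for i in range(n)]


# Toy input: a 3-atom chain 0-1-2 followed by 2 instruction tokens.
pos = relative_positions(3, [(0, 1), (1, 2)], 2)
mask = attention_mask(3, 2)
```

In this sketch the text tokens can still read the graph (`mask[3][0]` is true), while the graph side stays independent of the instruction, which is the property the abstract credits for better generalization to novel tasks.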

