This paper uses natural language descriptions to enhance predictive models in chemistry. Conventionally, chemoinformatics models are trained on extensive structured data manually extracted from the literature. We introduce TextReact, a novel method that directly augments predictive chemistry with text retrieved from the literature. TextReact retrieves text descriptions relevant to a given chemical reaction and then aligns them with the molecular representation of the reaction. This alignment is enhanced by an auxiliary masked language modeling (LM) objective incorporated into predictor training. We empirically validate the framework on two chemistry tasks: reaction condition recommendation and one-step retrosynthesis. By leveraging text retrieval, TextReact significantly outperforms state-of-the-art chemoinformatics models trained solely on molecular data.
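The retrieve-then-align pipeline described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not specify the retriever, the tokenizer, or the masking rate, so everything below is an assumption made for illustration: a toy character-bigram overlap scorer stands in for the retriever, the reaction SMILES is tokenized character by character, and the names `retrieve`, `build_input`, and `mask_text_tokens` are hypothetical. The sketch only shows how retrieved text could be concatenated with a reaction string and how text tokens could be masked to form targets for an auxiliary masked LM objective.

```python
import random


def bigrams(s):
    """Set of character bigrams; a crude, assumed similarity feature."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def retrieve(reaction, corpus, k=1):
    """Toy stand-in retriever (assumption, not the paper's retriever).

    `corpus` is a list of (reaction_smiles, text) pairs. Entries are ranked
    by Jaccard overlap of character bigrams with the query reaction, and the
    texts of the top-k entries are returned.
    """
    q = bigrams(reaction)

    def score(entry):
        d = bigrams(entry[0])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    ranked = sorted(corpus, key=score, reverse=True)
    return [text for _, text in ranked[:k]]


def build_input(reaction, texts):
    """Concatenate reaction tokens and retrieved text into one sequence.

    Returns the token list and the index where the text segment starts.
    Character-level SMILES tokens and whitespace text tokens are simplifications.
    """
    tokens = ["[CLS]"] + list(reaction) + ["[SEP]"]
    text_start = len(tokens)
    for text in texts:
        tokens += text.lower().split() + ["[SEP]"]
    return tokens, text_start


def mask_text_tokens(tokens, text_start, rate=0.15, seed=0):
    """Mask text tokens at random to create masked LM targets.

    Only positions in the text segment are candidates; separators are kept.
    `labels[i]` holds the original token where masked, else None.
    The default rate of 0.15 is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in range(text_start, len(tokens)):
        if tokens[i] != "[SEP]" and rng.random() < rate:
            labels[i] = tokens[i]
            masked[i] = "[MASK]"
    return masked, labels


# Usage with made-up example data (hypothetical corpus entries).
corpus = [
    ("CCO.CC(=O)O>>CC(=O)OCC",
     "ethanol and acetic acid were heated with sulfuric acid"),
    ("c1ccccc1Br.OB(O)c1ccccc1>>c1ccccc1-c1ccccc1",
     "the aryl bromide was coupled using a palladium catalyst in THF"),
]
reaction = "CCCO.CC(=O)O>>CC(=O)OCCC"
texts = retrieve(reaction, corpus, k=1)
tokens, text_start = build_input(reaction, texts)
masked, labels = mask_text_tokens(tokens, text_start, rate=0.15, seed=0)
```

In a full model, the masked sequence would be fed to an encoder, and a masked LM loss on the `labels` positions would be added to the main prediction loss (reaction conditions or reactants). How the two losses are weighted is not stated in the abstract and is left open here.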