Structured chemical reaction information is vital to chemists engaged in laboratory work and in advanced endeavors such as computer-aided drug design. Although extracting structured reactions from scientific literature is important, annotating data for this purpose is cost-prohibitive because it demands significant labor from domain experts. The resulting scarcity of training data hinders the progress of models in this domain. In this paper, we propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method uses frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions. Additionally, we adopt synthetic data from patent records as distant supervision to incorporate domain knowledge into the model. Experiments demonstrate that ReactIE achieves substantial improvements and outperforms all existing baselines.