Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. A key challenge in applying them to materials science and chemistry has been the lack of sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in data-driven molecular design. Because patents are filed to protect specific uses, the molecules they contain can be considered weakly labeled by application class. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we develop an automated pipeline that goes from USPTO patent digital files to the generation of novel candidates with minimal human intervention, and use it to train domain-specific generative models on patent data. We test the approach on two in-class extracted datasets, one on organic electronics and the other on tyrosine kinase inhibitors. We then evaluate generative models trained on these in-class datasets on two categories of tasks, distribution learning and property optimization; identify their strengths and limitations; and suggest possible explanations for the limitations, along with remedies that could overcome them in practice.