Recent advances in neural language models have also been successfully applied to chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, a specialized model is still typically required for each task, which demands problem-specific fine-tuning and neglects the relations between tasks. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, which complicates and limits human-machine interaction. Here, we propose a multi-domain, multi-task language model to solve a wide range of tasks in both the chemical and natural language domains. By leveraging multi-task learning, our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in the physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.