Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. The datasets released to date fall into three categories: 1) small datasets scraped from existing databases, 2) large but noisy datasets constructed by performing entity linking on the scientific literature, and 3) datasets built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which was created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.