Large language models (LLMs) have achieved great success in the general domain of natural language processing. Their emergent task generalization and free-form dialogue capabilities could greatly help in designing Chemical General Intelligence (CGI) to assist real-world research in chemistry. However, chemistry relies on specialized language and knowledge, such as the highly informative SMILES notation, which hinders the performance of general-domain LLMs on chemical tasks. To this end, we develop ChemDFM, the first LLM toward CGI. ChemDFM-13B is trained on 34B tokens from chemical literature, textbooks, and instructions, as well as various data from the general domain. As a result, it can store, understand, and reason over chemical knowledge and languages while retaining advanced free-form language comprehension capabilities. Extensive quantitative evaluation shows that ChemDFM significantly outperforms representative open-source LLMs. Moreover, ChemDFM surpasses GPT-4 on a large portion of chemical tasks, despite the significant difference in model size. Further qualitative evaluations demonstrate the efficiency and effectiveness of ChemDFM in real-world research scenarios. We will open-source the ChemDFM model soon.