Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, which hinders accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To address this absence of factual evaluation, we present MoleculeQA, a novel question answering (QA) dataset comprising 62K QA pairs over 23K molecules. Each QA pair consists of a manually written question, one positive option, and three negative options, and is semantically consistent with a molecular description from an authoritative molecular corpus. MoleculeQA is not only the first benchmark for molecular factual bias evaluation but also the largest QA dataset for molecular research. A comprehensive evaluation of existing molecular LLMs on MoleculeQA exposes their deficiencies in specific areas and pinpoints several factors that are particularly crucial for molecular understanding.
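To make the QA format concrete, the following is a minimal sketch of how an item with one positive and three negative options could be represented and scored by accuracy. The class name `QAItem`, its field names, and the `predict` callback are hypothetical and chosen for illustration; they are not the dataset's actual schema or the authors' evaluation code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    # Hypothetical field names; the released dataset schema may differ.
    molecule: str      # molecule representation, e.g. a SMILES string
    question: str      # manually written question
    positive: str      # the single correct option
    negatives: tuple   # exactly three incorrect options


def build_choices(item, rng):
    """Shuffle the four options and return (options, index_of_correct)."""
    if len(item.negatives) != 3:
        raise ValueError("expected exactly three negative options")
    options = [item.positive, *item.negatives]
    rng.shuffle(options)
    return options, options.index(item.positive)


def accuracy(items, predict, seed=0):
    """Fraction of items for which predict(...) returns the correct index."""
    rng = random.Random(seed)
    correct = 0
    for item in items:
        options, answer = build_choices(item, rng)
        if predict(item.molecule, item.question, options) == answer:
            correct += 1
    return correct / len(items) if items else 0.0


# Illustrative item written for this sketch, not taken from MoleculeQA.
example = QAItem(
    molecule="CCO",  # ethanol
    question="Which functional group does this molecule contain?",
    positive="A hydroxyl group",
    negatives=("A carboxyl group", "An amino group", "A nitro group"),
)
```

Here `predict` stands in for any model wrapper that takes the molecule, the question, and the shuffled options and returns the index of its chosen option. With four options per question, uniform random guessing gives an expected accuracy of 25%, which is a useful floor when reading reported scores.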