Recent breakthroughs in Large Language Models (LLMs) have transformed natural language understanding and generation, sparking a surge of interest in applying these technologies to scientific literature analysis. Existing benchmarks, however, do not adequately evaluate the proficiency of LLMs in this area, especially in scenarios involving complex comprehension and multimodal data. In response, we introduce SciAssess, a benchmark tailored for the in-depth analysis of scientific literature and designed to provide a thorough assessment of LLMs' efficacy. SciAssess evaluates LLMs' abilities in memorization, comprehension, and analysis within the context of scientific literature analysis. It includes representative tasks from diverse scientific fields, such as general chemistry, organic materials, and alloy materials. Rigorous quality control measures ensure its reliability in terms of correctness, anonymization, and copyright compliance. Using SciAssess, we evaluate leading LLMs, including GPT-4, GPT-3.5, and Gemini, identifying their strengths and areas for improvement and supporting the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at https://sci-assess.github.io, offering a tool for advancing LLM capabilities in scientific literature analysis.