We construct MaterialBENCH, a college-level benchmark dataset for large language models (LLMs) in the field of materials science. The dataset consists of problem-answer pairs based on university textbooks. There are two types of problems: free-response and multiple-choice. Each multiple-choice problem is constructed by adding three incorrect answers to the correct answer, so that an LLM chooses one of four options as its response. Most problems appear in both types and differ only in the format of the answer. We also conduct experiments with MaterialBENCH on several LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 accessed through the OpenAI API. The differences and similarities in LLM performance measured by MaterialBENCH are analyzed and discussed. We also study the performance differences between the free-response and multiple-choice types within the same model, and the influence of system messages on multiple-choice problems. We anticipate that MaterialBENCH will encourage further development of the reasoning abilities of LLMs, enabling them to solve more complicated problems and eventually contribute to materials research and discovery.
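As an illustration of the multiple-choice construction described above, the following is a minimal Python sketch, not the authors' implementation. The function names `make_multiple_choice` and `accuracy`, the dictionary layout, the A-D labels, and the shuffling of options are all assumptions made for this example, and the sample question is illustrative rather than taken from MaterialBENCH.

```python
import random


def make_multiple_choice(question, correct, distractors, seed=None):
    """Build a four-option item from one correct answer and three incorrect ones.

    Hypothetical helper: the item layout and the shuffling are assumptions.
    """
    options = [correct] + list(distractors)
    if len(distractors) != 3 or len(set(options)) != 4:
        raise ValueError("need exactly three distinct incorrect answers")
    rng = random.Random(seed)
    rng.shuffle(options)  # avoid always placing the correct answer first
    labels = "ABCD"
    return {
        "question": question,
        "choices": dict(zip(labels, options)),
        "answer": labels[options.index(correct)],
    }


def accuracy(items, predictions):
    """Fraction of items whose predicted label matches the stored answer label."""
    if not items:
        return 0.0
    hits = sum(1 for item, pred in zip(items, predictions) if pred == item["answer"])
    return hits / len(items)


# Illustrative item only; not drawn from the benchmark.
item = make_multiple_choice(
    "Which crystal structure does alpha-iron adopt at room temperature?",
    "body-centered cubic",
    ["face-centered cubic", "hexagonal close-packed", "simple cubic"],
    seed=0,
)
```

Scoring a model on such items reduces to comparing the chosen label with `item["answer"]`. Scoring free-response answers would need a separate comparison rule, which this sketch does not cover.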