Large language models (LLMs) such as ChatGPT have demonstrated impressive intelligence. Evaluating the question-solving abilities of LLMs and their degree of intelligence is an actively studied but challenging problem. First, question-solving abilities are interlaced across different ability branches, such as understanding, and across a large number of knowledge categories, such as mathematics. Second, question inputs are multimodal and may involve both text and images. Third, the response formats of LLMs are diverse, which makes result extraction and evaluation difficult. In this paper, we propose AGIBench, a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs. Instead of a collection of blended questions, AGIBench focuses on three typical ability branches and adopts a four-tuple <ability branch, knowledge, difficulty, modal> to label the attributes of each question. First, it supports multi-granularity benchmarking, e.g., at the per-question, per-ability branch, per-knowledge, per-modal, per-dataset, and per-difficulty level granularities. Second, it contains multimodal input, including text and images. Third, it classifies all questions into five degrees of difficulty according to the average accuracy rate of a large number of educated humans (human-referenced). Fourth, it adopts zero-shot learning to avoid introducing additional unpredictability and provides an auto-scoring method to extract and judge the result. Finally, it defines multi-dimensional metrics, including accuracy under the average, worst, best, and majority voting cases, and repeatability. AGIBench is publicly available from \url{https://www.benchcouncil.org/agibench}.
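As an illustration of how four-tuple labels and multi-dimensional metrics could fit together, the following is a minimal sketch and not the AGIBench implementation. It assumes each question is asked several times in a zero-shot setting and that the extracted answers are plain strings. The names `Question`, `question_metrics`, and `aggregate` are hypothetical, and the exact definitions of the worst-case, best-case, and repeatability metrics below are one plausible reading of the abstract rather than the benchmark's own definitions.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    # Hypothetical record: the four-tuple label plus a reference answer.
    qid: str
    ability: str     # ability branch, e.g. "understanding"
    knowledge: str   # knowledge category, e.g. "mathematics"
    difficulty: int  # 1..5, assumed to come from human accuracy rates
    modal: str       # "text" or "image"
    answer: str


def question_metrics(answer, responses):
    """Score repeated responses to one question (assumed definitions)."""
    hits = [r == answer for r in responses]
    # On a tie, most_common returns the answer that appeared first.
    top, top_count = Counter(responses).most_common(1)[0]
    return {
        "average": sum(hits) / len(hits),   # fraction of correct trials
        "worst": float(all(hits)),          # correct only if every trial is
        "best": float(any(hits)),           # correct if any trial is
        "majority": float(top == answer),   # most frequent answer is correct
        # Assumption: repeatability = share of trials giving the modal answer.
        "repeatability": top_count / len(responses),
    }


def aggregate(questions, responses_by_qid, key):
    """Average per-question metrics within groups defined by one label."""
    groups = defaultdict(list)
    for q in questions:
        scores = question_metrics(q.answer, responses_by_qid[q.qid])
        groups[getattr(q, key)].append(scores)
    return {
        group: {m: sum(s[m] for s in scores) / len(scores) for m in scores[0]}
        for group, scores in groups.items()
    }


# Made-up example data, for illustration only.
questions = [
    Question("q1", "understanding", "mathematics", 2, "text", "B"),
    Question("q2", "understanding", "mathematics", 4, "image", "C"),
]
responses = {"q1": ["B", "B", "A"], "q2": ["C", "C", "C"]}

by_ability = aggregate(questions, responses, "ability")
by_modal = aggregate(questions, responses, "modal")
```

Passing a different label name as `key` (for instance `"knowledge"` or `"difficulty"`) yields the corresponding granularity, while calling `question_metrics` directly gives the per-question view.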