Large language models (LLMs) such as ChatGPT have shown some human-like cognitive abilities. To compare these abilities across models, researchers typically adopt several benchmarks (i.e., sets of standard test questions) from different fields (e.g., Literature, Biology, and Psychology) and report results under traditional metrics such as accuracy, recall, and F1. From the perspective of cognitive science, however, this way of evaluating LLMs can be both inefficient and inaccurate. Inspired by Computerized Adaptive Testing (CAT) as used in psychometrics, we propose an adaptive testing framework for LLM evaluation. Rather than using a fixed test set and simply reporting accuracy, the framework dynamically adjusts the characteristics of the test questions, such as their difficulty, according to the model's performance so far. This yields a more accurate estimate of the model's abilities from fewer questions. More importantly, it allows LLMs to be compared easily with humans, which is essential for NLP models that aim for human-level ability. Our diagnostic reports find that ChatGPT often behaves like a ``careless student'', prone to slips and to occasionally guessing at answers. We conduct a fine-grained diagnosis and rank six of the latest instruction-tuned LLMs on three aspects: Subject Knowledge, Mathematical Reasoning, and Programming. GPT4 significantly outperforms the other models and reaches the cognitive ability of middle-level students. We believe that efficient adaptive testing, in which different models receive different tests, has the potential to become a new norm for evaluating large language models.
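To make the adaptive loop described above concrete, the following is a minimal sketch rather than the framework's actual implementation. The abstract does not name a psychometric model or a question-selection rule, so the sketch assumes a two-parameter logistic (2PL) item response model, selection by maximum Fisher information, and a grid-search maximum-likelihood ability estimate. All function names, the item bank, and the simulated examinee are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """Assumed 2PL model: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item (a, b) carries about ability theta under 2PL."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate_ability(responses, grid):
    """Maximum-likelihood ability estimate over a fixed grid of candidates."""
    def log_likelihood(theta):
        total = 0.0
        for (a, b), correct in responses:
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total
    return max(grid, key=log_likelihood)


def adaptive_test(answer_fn, item_bank, n_questions, grid):
    """Ask n_questions items, each chosen to be most informative at the
    current ability estimate, and re-estimate ability after every answer."""
    theta = 0.0  # assumed starting estimate
    remaining = list(item_bank)
    responses = []
    for _ in range(n_questions):
        item = max(remaining, key=lambda it: fisher_information(theta, *it))
        remaining.remove(item)
        responses.append((item, answer_fn(item)))
        theta = estimate_ability(responses, grid)
    return theta, responses


if __name__ == "__main__":
    # Hypothetical item bank: difficulties from -3 to 3, equal discrimination.
    bank = [(1.0, -3.0 + 0.25 * i) for i in range(25)]
    grid = [-4.0 + 0.1 * i for i in range(81)]
    # Hypothetical examinee that answers correctly iff difficulty <= 1.0.
    theta_hat, asked = adaptive_test(lambda item: item[1] <= 1.0, bank, 10, grid)
    print("estimated ability:", theta_hat)
    print("difficulties asked:", [b for (_, b), _ in asked])
```

In this sketch the examinee is a stand-in for an LLM being scored on each question; after the first few answers the selected difficulties cluster around the examinee's ability, which is the sense in which fewer questions are needed than with a fixed test set.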