Large language models (LLMs) such as ChatGPT have shown some human-like cognitive abilities. To compare these abilities across models, several benchmarks (i.e., sets of standard test questions) from different fields (e.g., Literature, Biology, and Psychology) are often adopted, and results are reported under traditional metrics such as accuracy, recall, and F1. However, from a cognitive science perspective, this way of evaluating LLMs can be inefficient and inaccurate. Inspired by Computerized Adaptive Testing (CAT) used in psychometrics, we propose an adaptive testing framework for LLM evaluation. Rather than using a standard test set and simply reporting accuracy, this approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance. This allows a more accurate estimation of the model's abilities using fewer questions. More importantly, it allows LLMs to be compared with humans easily, which is essential for NLP models that aim for human-level ability. Our diagnostic reports find that ChatGPT often behaves like a "careless student", prone to slips and occasionally guessing at questions. We conduct a fine-grained diagnosis and rank six of the latest instruction-tuned LLMs on three aspects: Subject Knowledge, Mathematical Reasoning, and Programming. GPT4 significantly outperforms the other models and reaches the cognitive ability of middle-level students. We believe that efficient adaptive testing, which gives different tests to different models, has the potential to become a new norm in evaluating large language models.
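To make the adaptive loop concrete, the following is a minimal sketch of a CAT procedure, not the framework's actual implementation. It assumes a three-parameter logistic (3PL) item response model, maximum-Fisher-information item selection, and a grid-search maximum-likelihood ability estimate; the abstract does not state which model or selection rule is used. All names (`p_correct`, `fisher_information`, `estimate_theta`, `adaptive_test`, `answer_fn`) are hypothetical.

```python
import math

def p_correct(theta, a, b, c):
    """3PL model (assumed): probability of a correct answer at ability theta.
    a = discrimination, b = difficulty, c = guessing floor."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return (a ** 2) * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def estimate_theta(responses, grid):
    """Maximum-likelihood ability estimate over a fixed grid of candidates."""
    def log_lik(theta):
        total = 0.0
        for (a, b, c), correct in responses:
            p = p_correct(theta, a, b, c)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total += math.log(p if correct else 1.0 - p)
        return total
    return max(grid, key=log_lik)

def adaptive_test(item_pool, answer_fn, n_items=20):
    """Run a short adaptive test.
    item_pool: list of (a, b, c) tuples with pre-calibrated parameters.
    answer_fn: hypothetical callback returning True if the examinee
               (e.g., an LLM whose output has been graded) answers correctly."""
    grid = [x / 10.0 for x in range(-40, 41)]  # candidate abilities in [-4, 4]
    theta = 0.0                                # start at average ability
    remaining = list(item_pool)
    responses = []
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda it: fisher_information(theta, *it))
        remaining.remove(item)
        responses.append((item, answer_fn(item)))
        # Re-estimate ability from all responses collected so far.
        theta = estimate_theta(responses, grid)
    return theta, responses
```

In this sketch, `answer_fn` would wrap a call to the model under test plus a grader, and the item parameters would have to be calibrated beforehand (for example, from human response data), which is what places models and humans on the same ability scale.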