Current benchmarks for evaluating large language models (LLMs) suffer from issues such as restricted evaluation content, untimely updates, and a lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. This paradigm shifts the "venue" of LLM evaluation from the "examination room" to the "hospital". By conducting a "physical examination" on LLMs, it uses the solving of specific tasks as the evaluation content, performs deep attribution of the problems present within LLMs, and provides recommendations for optimization.