Evaluation serves as the baton guiding the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which makes it difficult to discern whether a model genuinely possesses the required capability or merely memorizes or guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, thereby offering a comprehensive, robust, and consistent evaluation of LLMs. Experiments on three widely used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, yielding more trustworthy and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.