Evaluation serves as the baton that directs the development of large language models. Current evaluations typically adopt a single-item assessment paradigm for each atomic test objective, which makes it difficult to discern whether a model genuinely possesses the required capability or merely memorizes or guesses the answers to specific questions. To address this, we propose a novel evaluation framework called StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting structured assessments across multiple cognitive levels and critical concepts, thereby offering a comprehensive, robust, and consistent evaluation of LLMs. Experiments on three widely used benchmarks demonstrate that StructEval mitigates the risk of data contamination and reduces the interference of potential biases, yielding more reliable and consistent conclusions about model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
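To make the structured-assessment idea concrete, the following is a minimal Python sketch of how a single atomic test objective could be expanded into an item set spanning several cognitive levels and concepts, and then scored as a set rather than as one question. All names here (LEVELS, TestItem, build_item_set, structured_score) are illustrative assumptions for this sketch, not the StructEval implementation or API.

```python
# Minimal sketch of structured evaluation: expand one atomic test
# objective into items across multiple cognitive levels and concepts,
# then score the whole item set. Names are hypothetical, not StructEval's.

from dataclasses import dataclass
from typing import Callable

# Hypothetical cognitive levels (a Bloom-style ordering is assumed here
# purely for illustration).
LEVELS = ["remember", "understand", "apply", "analyze"]

@dataclass
class TestItem:
    level: str      # cognitive level this item probes
    question: str   # concrete question text
    answer: str     # gold answer

def build_item_set(objective: str, concepts: list[str]) -> list[TestItem]:
    """Expand one atomic objective into items across levels and concepts.

    In a real system each item would be authored or generated (e.g., by
    an LLM); here we only stub out the structure."""
    items = []
    for level in LEVELS:
        for concept in concepts:
            items.append(TestItem(
                level=level,
                question=f"[{level}] On '{concept}': {objective}",
                answer="<gold>",
            ))
    return items

def structured_score(model: Callable[[str], str], items: list[TestItem]) -> float:
    """Fraction of items answered correctly across the structured set."""
    correct = sum(model(it.question).strip() == it.answer for it in items)
    return correct / len(items)
```

The point of scoring the set rather than the seed question alone is that a model which has merely memorized the original answer will still fail the deepened (higher cognitive level) and broadened (related concept) variants, which is the robustness property the framework targets.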