As Large Language Models (LLMs) grow increasingly adept at handling complex tasks, evaluation sets must keep pace with these advancements to remain sufficiently discriminative. Item Discrimination (ID) theory, widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs, ensuring that the evaluation set can be continually updated and refined in line with evolving model abilities. Our data synthesis framework prioritizes both breadth and specificity: it generates prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, enabling effective discrimination of their relative strengths and weaknesses across various tasks and domains. To produce high-quality data, we incorporate a self-correction mechanism into our generation framework and develop two models that predict prompt discrimination and difficulty scores, contributing valuable tools to evaluation data synthesis research. We apply our generated data to evaluate five state-of-the-art (SOTA) models. On our data, these models achieve an average score of 51.92 with a variance of 10.06. By contrast, on data from previous works (i.e., SELF-INSTRUCT and WizardLM) they obtain an average score exceeding 67 with a variance below 3.2. These results demonstrate that the data generated by our framework is more challenging and more discriminative than that of previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research on LLMs.
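To make the Item Discrimination idea concrete, the classical discrimination index from educational assessment can be sketched as follows. This is an illustrative implementation of the textbook index (D = P_upper − P_lower over the top and bottom scoring groups, commonly 27%), not the paper's own predictor models; the function name and parameters are chosen here for illustration.

```python
# Illustrative sketch of the classical Item Discrimination (ID) index:
# for one test item, D = P_upper - P_lower, where P_upper and P_lower are
# the proportions of the top- and bottom-scoring groups (commonly the top
# and bottom 27% by total score) that answered the item correctly.
# This is NOT the paper's discrimination-prediction model, only the
# classical index that motivates it.

def item_discrimination(total_scores, item_correct, group_frac=0.27):
    """total_scores: overall test score per examinee.
    item_correct: 1/0 per examinee for this single item.
    Returns D in [-1, 1]; higher means the item better separates
    high and low performers."""
    n = len(total_scores)
    k = max(1, int(n * group_frac))
    # Rank examinees by total score, best first.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower
```

An item that only strong examinees answer correctly yields D near 1 (highly discriminative), while an item that everyone answers identically yields D = 0, mirroring the paper's goal of synthesizing prompts that spread model scores apart rather than saturating them.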