Large language models (LLMs) have achieved remarkable performance on various evaluation benchmarks. However, concerns have been raised about potential data contamination in their vast training corpora. Moreover, the static nature and fixed complexity of current benchmarks may not adequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on this protocol, we build graph-informed DyVal, which leverages the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexity. DyVal generates challenging evaluation sets for reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse on DyVal-generated evaluation samples of varying complexity, highlighting the significance of dynamic evaluation. We also analyze failure cases and the results of different prompting methods. Moreover, DyVal-generated samples serve not only as evaluation sets but also as useful fine-tuning data for improving the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future research on LLM evaluation. Code is available at: https://github.com/microsoft/promptbench.
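To make the idea of graph-informed generation concrete, the following is a minimal sketch of how a directed acyclic graph could yield an arithmetic evaluation sample with controllable complexity. It is an illustration under stated assumptions, not the DyVal implementation: the names `Node`, `generate_sample`, `evaluate`, and `describe`, the choice of operators, and the two complexity knobs (`n_leaves`, `n_internal`) are all hypothetical and do not come from the promptbench code base.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical operator set; the actual tasks may use different operations.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


@dataclass
class Node:
    name: str
    value: Optional[int] = None          # set only for leaf nodes
    op: Optional[str] = None             # set only for internal nodes
    operands: List["Node"] = field(default_factory=list)


def generate_sample(n_leaves: int, n_internal: int, seed: int) -> List[Node]:
    """Build a random DAG; the last node in the list is the one to query.

    Each internal node draws its operands only from nodes created earlier,
    so the graph is acyclic by construction. Larger n_leaves and n_internal
    give a larger, harder sample (assumed complexity knobs).
    """
    if n_leaves < 2:
        raise ValueError("need at least two leaf nodes")
    rng = random.Random(seed)
    nodes = [Node(name=f"v{i}", value=rng.randint(1, 9)) for i in range(n_leaves)]
    for i in range(n_internal):
        left, right = rng.sample(nodes, 2)
        op = rng.choice(sorted(OPS))
        nodes.append(Node(name=f"v{n_leaves + i}", op=op, operands=[left, right]))
    return nodes


def evaluate(node: Node, cache: Optional[Dict[str, int]] = None) -> int:
    """Compute the ground-truth value of a node with memoized recursion."""
    if cache is None:
        cache = {}
    if node.name not in cache:
        if node.op is None:
            cache[node.name] = node.value
        else:
            args = [evaluate(p, cache) for p in node.operands]
            cache[node.name] = OPS[node.op](*args)
    return cache[node.name]


def describe(nodes: List[Node], rng: random.Random) -> str:
    """Render the graph as a natural-language question, statements shuffled."""
    lines = []
    for n in nodes:
        if n.op is None:
            lines.append(f"The value of {n.name} is {n.value}.")
        else:
            a, b = n.operands
            lines.append(f"{n.name} equals {a.name} {n.op} {b.name}.")
    rng.shuffle(lines)
    return " ".join(lines) + f" What is the value of {nodes[-1].name}?"


if __name__ == "__main__":
    sample = generate_sample(n_leaves=3, n_internal=4, seed=0)
    print(describe(sample, random.Random(0)))
    print("answer:", evaluate(sample[-1]))
```

Because every sample is drawn fresh from a seed and the graph size is a parameter, a new test set can be produced on demand at a chosen difficulty, and the ground-truth answer comes from evaluating the graph rather than from a fixed answer key.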