Large language models (LLMs) have achieved remarkable performance on various evaluation benchmarks. However, the sheer volume of their training corpora raises concerns that this performance is inflated by data contamination. Moreover, the static nature and fixed complexity of current benchmarks may fail to adequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a novel, general, and flexible protocol for the dynamic evaluation of LLMs. Building on our proposed dynamic evaluation framework, we develop graph-informed DyVal, which leverages the structural advantages of directed acyclic graphs to dynamically generate evaluation samples with controllable complexity. DyVal generates challenging evaluation sets for reasoning tasks, including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs, ranging from Flan-T5-large to ChatGPT and GPT4. Experiments show that LLMs perform worse on DyVal-generated evaluation samples of varying complexity, underscoring the importance of dynamic evaluation. We also analyze the failure cases and the results of different prompting methods. Moreover, DyVal-generated samples serve not only as evaluation sets but also as useful fine-tuning data for improving the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future research on LLM evaluation.
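To make the idea of graph-based sample generation concrete, the following is a minimal sketch of how a directed acyclic graph could yield arithmetic questions whose difficulty is set by two parameters. It is an illustration under stated assumptions, not the DyVal implementation: the function names (`generate_arithmetic_dag`, `evaluate`, `make_sample`), the restriction to three arithmetic operators, the tree-shaped graph, and the use of `depth` and `num_children` as the complexity controls are all hypothetical choices made for this example.

```python
import random


def generate_arithmetic_dag(depth, num_children, rng):
    """Build a random tree-shaped DAG (a simplifying assumption).

    Returns (nodes, root), where nodes maps a node name to either
    ("leaf", value) or ("op", operator, [child names]).
    """
    nodes = {}

    def build(level):
        name = f"v{len(nodes)}"
        nodes[name] = None  # reserve the name before creating children
        if level == depth:
            nodes[name] = ("leaf", rng.randint(1, 10))
        else:
            children = [build(level + 1) for _ in range(num_children)]
            nodes[name] = ("op", rng.choice(["+", "-", "*"]), children)
        return name

    root = build(1)
    return nodes, root


def evaluate(nodes, name):
    """Compute the ground-truth value of a node by recursing on its children."""
    kind, *rest = nodes[name]
    if kind == "leaf":
        return rest[0]
    op, children = rest
    values = [evaluate(nodes, child) for child in children]
    result = values[0]
    for value in values[1:]:
        if op == "+":
            result += value
        elif op == "-":
            result -= value
        else:
            result *= value
    return result


def make_sample(depth, num_children=2, seed=None):
    """Generate one question/answer pair; larger depth or width is harder."""
    rng = random.Random(seed)
    nodes, root = generate_arithmetic_dag(depth, num_children, rng)
    names = list(nodes)
    rng.shuffle(names)  # randomize the order in which nodes are described
    lines = []
    for name in names:
        kind, *rest = nodes[name]
        if kind == "leaf":
            lines.append(f"The value of {name} is {rest[0]}.")
        else:
            op, children = rest
            expression = f" {op} ".join(children)
            lines.append(f"The value of {name} is {expression}.")
    lines.append(f"What is the value of {root}?")
    return {"question": "\n".join(lines), "answer": evaluate(nodes, root)}


if __name__ == "__main__":
    sample = make_sample(depth=3, num_children=2, seed=0)
    print(sample["question"])
    print("Ground truth:", sample["answer"])
```

Because every sample is drawn fresh from a random generator and its answer is computed from the graph itself, a new evaluation set can be produced on demand, and raising `depth` or `num_children` in this sketch increases the number of reasoning steps a model must chain together.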