Large language models (LLMs) have achieved remarkable performance on various evaluation benchmarks. However, potential data contamination in their vast training corpora raises concerns about whether this performance is genuine. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a novel, general, and flexible protocol for the dynamic evaluation of LLMs. Based on our proposed dynamic evaluation framework, we build graph-informed DyVal, which leverages the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexity. DyVal generates challenging evaluation sets for reasoning tasks including mathematics, logical reasoning, and algorithmic problems. We evaluate various LLMs, ranging from Flan-T5-large to ChatGPT and GPT4. Experiments demonstrate that LLMs perform worse on DyVal-generated evaluation samples of varying complexity, emphasizing the significance of dynamic evaluation. We also analyze failure cases and the results of different prompting methods. Moreover, DyVal-generated samples serve not only as evaluation sets but also as useful fine-tuning data for improving the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future research on LLM evaluation.
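To make the idea of graph-based sample generation concrete, the following is a minimal sketch of how a directed acyclic graph could yield an arithmetic question together with its ground-truth answer. It is an illustration under our own assumptions, not the DyVal implementation: the function names (`generate_dag`, `evaluate`, `describe`, `make_sample`), the operator set, the value range, and the two complexity knobs (`num_leaves`, `num_internal`) are all hypothetical.

```python
import operator
import random

# Hypothetical operator set; the operations used by DyVal itself may differ.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_dag(num_leaves, num_internal, rng):
    """Return a random arithmetic DAG as a list of nodes in topological order.

    Leaves carry integer values. Each internal node applies a binary operator
    to two distinct, previously created nodes, so every edge points backwards
    in the list and the graph is acyclic by construction.
    """
    if num_leaves < 2:
        raise ValueError("need at least two leaves to form a binary operation")
    nodes = [{"op": None, "inputs": [], "value": rng.randint(1, 10)}
             for _ in range(num_leaves)]
    for _ in range(num_internal):
        op = rng.choice(sorted(OPS))
        a, b = rng.sample(range(len(nodes)), 2)
        nodes.append({"op": op, "inputs": [a, b], "value": None})
    return nodes


def evaluate(nodes):
    """Compute the value of every node; the last node is the query target."""
    values = []
    for node in nodes:
        if node["op"] is None:
            values.append(node["value"])
        else:
            a, b = node["inputs"]
            values.append(OPS[node["op"]](values[a], values[b]))
    return values


def describe(nodes):
    """Render the DAG as a natural-language question."""
    lines = []
    for i, node in enumerate(nodes):
        if node["op"] is None:
            lines.append(f"The value of v{i} is {node['value']}.")
        else:
            a, b = node["inputs"]
            lines.append(f"v{i} = v{a} {node['op']} v{b}.")
    lines.append(f"What is the value of v{len(nodes) - 1}?")
    return "\n".join(lines)


def make_sample(num_leaves, num_internal, seed):
    """Generate one (question, answer) pair; the seed makes it reproducible."""
    nodes = generate_dag(num_leaves, num_internal, random.Random(seed))
    return describe(nodes), evaluate(nodes)[-1]


if __name__ == "__main__":
    question, answer = make_sample(num_leaves=3, num_internal=4, seed=0)
    print(question)
    print("Ground truth:", answer)
```

In this sketch, raising `num_internal` lengthens the chain of dependent operations, which is one simple way to scale difficulty, and drawing a fresh seed produces a new sample on demand rather than reusing a fixed test set.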