Recent work has found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This calls into question the popular practice of single-prompt evaluation. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE and report several findings: efficient methods for choosing well-performing prompts, the observation that few-shot examples reduce sensitivity, and instances that are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/