We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a technique often used in educational tests. We show that WiCkeD can be applied automatically to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. Model performance drops by 12.1 points on average relative to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop on the WiCkeD variant is similar to the one observed when prompting the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also reveals that some models are more sensitive to the extra reasoning required, providing information that the original benchmarks do not. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
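The core transformation described above can be illustrated with a minimal sketch. It is not the released implementation: the function name `wicked_variant`, the constant `NOTA`, and the choice to place "None of the above" as the last option are assumptions made for illustration. The sketch drops one randomly selected option, appends "None of the above", and makes that new option the gold answer whenever the dropped option was the correct one.

```python
import random

# Assumed wording of the replacement option (taken from the abstract).
NOTA = "None of the above"


def wicked_variant(choices, answer_idx, rng):
    """Return (new_choices, new_answer_idx) for one multiple-choice item.

    Hypothetical helper: one option is picked uniformly at random and
    removed, and NOTA is appended as the last option (an assumption; the
    abstract only says a choice is "replaced").
    """
    drop = rng.randrange(len(choices))
    kept = [c for i, c in enumerate(choices) if i != drop]
    new_choices = kept + [NOTA]

    if drop == answer_idx:
        # The correct option was removed, so NOTA becomes the gold answer.
        new_answer_idx = len(new_choices) - 1
    else:
        # The correct option survives; shift its index if an earlier
        # option was removed.
        new_answer_idx = answer_idx - (1 if drop < answer_idx else 0)
    return new_choices, new_answer_idx


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variant is reproducible
    options = ["Paris", "Rome", "Berlin", "Madrid"]
    new_options, gold = wicked_variant(options, 0, rng)
    print(new_options, "gold:", new_options[gold])
```

Under this scheme the number of options stays the same, and a model can no longer succeed by picking the most plausible listed option: it must also judge whether any listed option is actually correct.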