In this study, we take a closer look at how Winograd schema challenges can be used to evaluate commonsense reasoning in LLMs. Specifically, we evaluate generative models of different sizes on the popular WinoGrande benchmark. We release WinoWhat, a new corpus in which each instance of the WinoGrande validation set is paraphrased. Additionally, we evaluate performance on the challenge across five commonsense knowledge categories, giving more fine-grained insights into which types of knowledge are more challenging for LLMs. Surprisingly, all models perform significantly worse on WinoWhat, implying that LLM reasoning capabilities are overestimated on WinoGrande. To verify whether this is an effect of benchmark memorization, we match benchmark instances to LLM training data and create two test suites. We observe that memorization has a minimal effect on model performance on WinoGrande.