Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes whether LLMs over-rely on choices-only shortcuts in MCQA. While prior work builds contrast sets via expensive human annotation or model-generated data that can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We apply our method to UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choices-only shortcuts when given both the question and the choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards solely due to their ability to exploit choices-only shortcuts.