Recent findings raise concerns about whether Multiple-Choice Question Answering (MCQA) evaluation accurately reflects the comprehension abilities of large language models. This paper examines choice sensitivity: the tendency of model decisions to be driven more by the answer options than by genuine understanding of the question. We introduce a new scoring method, Normalized Probability Shift by the Question (NPSQ), designed to isolate the contribution of the question itself and thereby provide a more reliable assessment of comprehension. Through experiments across several input formats (cloze, symbol, and hybrid), we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when the answer options are modified.
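The abstract does not spell out the NPSQ formula, but the name suggests scoring each option by how much the question shifts its probability. The minimal Python sketch below illustrates one plausible reading, under the explicit assumption that the shift is the change in an option's log-probability when the question is added, softmax-normalized over the options; the function name and the exact normalization are hypothetical, not taken from the paper.

```python
import math

def npsq_sketch(logp_with_question, logp_options_only):
    """Hypothetical sketch of a question-isolating MCQA score.

    logp_with_question[i] -- log P(option_i | question + choices)
    logp_options_only[i]  -- log P(option_i | choices alone)
    """
    # Shift in each option's log-probability induced by the question.
    shifts = [w - o for w, o in zip(logp_with_question, logp_options_only)]
    # Softmax-normalize so the scores form a distribution over options;
    # the predicted answer is the option with the largest score.
    m = max(shifts)
    exps = [math.exp(s - m) for s in shifts]
    z = sum(exps)
    return [e / z for e in exps]
```

Under this assumed reading, subtracting the options-only baseline cancels surface features that are present whether or not the question is shown, which is one mechanism by which such a score could remain stable when the answer choices are modified.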