Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLMs' capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method calls for a careful analysis of MCQs' efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit order sensitivity in bilingual MCQs, favoring answers at a specific position, namely the first one. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the assumption that higher consistency implies greater accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is reflected not only in evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.
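To make the notion of order sensitivity concrete, the sketch below cyclically rotates the answer options of a single question and records which slot the model picks each time; a model free of positional bias should not systematically favor one slot. This is only a minimal illustration of the general idea, not the paper's evaluation pipeline: `ask_mcq` is a hypothetical callable standing in for whatever interface formats the prompt and queries the model.

```python
from collections import Counter
from typing import Callable, List


def positional_bias(ask_mcq: Callable[[str, List[str]], int],
                    question: str,
                    options: List[str]) -> Counter:
    """Rotate the options through every position and count which slot
    (0 = A, 1 = B, ...) the model chooses under each ordering.

    `ask_mcq(question, options)` is assumed to return the index of the
    option the model selects; it is a placeholder, not an actual API
    from the paper's codebase.
    """
    slot_counts = Counter()
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]  # cyclic permutation
        chosen_slot = ask_mcq(question, rotated)
        slot_counts[chosen_slot] += 1
    # e.g. Counter({0: 4}) on a 4-option question indicates the model
    # always picks slot A regardless of which answer occupies it.
    return slot_counts
```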
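For reference, the expected calibration error mentioned above is conventionally computed by binning predictions by confidence; the standard formulation (a general definition, not a quantity introduced by this paper) is

\[
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{\lvert B_m \rvert}{n}\,
\bigl\lvert \operatorname{acc}(B_m) - \operatorname{conf}(B_m) \bigr\rvert ,
\]

where the $n$ predictions are partitioned into $M$ confidence bins $B_m$, $\operatorname{acc}(B_m)$ is the accuracy within bin $B_m$, and $\operatorname{conf}(B_m)$ is the average predicted confidence in that bin. A lower ECE indicates that the model's stated confidence tracks its actual accuracy more closely.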