In the field of NLP, Large Language Models (LLMs) have markedly improved performance across a wide range of tasks. However, the comprehensive evaluation of LLMs remains a persistent challenge for the community. Recently, the adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction, yet concerns about the robustness of this evaluation method persist. Building upon previous discussions of the issue of \textit{variability}, we reveal an additional dimension of concern: LLMs may perform MCQA by selecting the least incorrect option rather than a distinctly correct one. This observation suggests that LLMs might regard multiple options as correct, which could undermine the reliability of MCQA as a metric for evaluating LLMs. To address this challenge, we introduce an enhanced dataset augmentation method for MCQA, termed MCQA+, that provides a more accurate reflection of model performance, thereby highlighting the necessity for more sophisticated evaluation mechanisms in assessing LLM capabilities.
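To make the ``least incorrect option'' concern concrete, the following minimal sketch (not the paper's method) probes how a causal LM distributes probability mass over the option letters; the model name, question, and options are illustrative assumptions.

\begin{verbatim}
# Minimal sketch: check whether an LLM spreads probability mass over
# several MCQA options instead of committing to a single one.
# Model name, question, and options here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}

prompt = (
    question + "\n"
    + "\n".join(f"{k}. {v}" for k, v in options.items())
    + "\nAnswer:"
)

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
probs = logits.softmax(-1)

# Probability assigned to each option letter as the next token.
for letter in options:
    tok_id = tok.encode(" " + letter)[0]  # leading space for BPE vocab
    print(letter, f"{probs[tok_id].item():.4f}")
# If several letters receive comparable mass, top-1 accuracy alone
# can hide that the model merely picked the "least incorrect" option.
\end{verbatim}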