In natural language processing (NLP), Large Language Models (LLMs) have precipitated a paradigm shift, markedly improving performance on natural language generation tasks. Despite these advances, the comprehensive evaluation of LLMs remains an unavoidable challenge for the community. Recently, Multiple Choice Question Answering (MCQA) has gained considerable traction as a benchmark format for LLMs. This study investigates whether MCQA is a sound evaluation method for LLMs. If LLMs genuinely understand the semantics of a question, their answers should remain consistent across the varied configurations derived from that same question. Contrary to this expectation, our empirical findings reveal notable inconsistency in LLM responses, a phenomenon we define as the REsponse VAriability Syndrome (REVAS) of LLMs. This indicates that current MCQA-based benchmarks may not adequately capture the true capabilities of LLMs and underscores the need for more robust evaluation mechanisms.
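The consistency expectation stated in the abstract can be illustrated with a minimal sketch. It assumes that one kind of "configuration" is a reordering of the answer options; the abstract does not specify which configurations the study uses, so this choice is illustrative only. The names `rotations`, `consistency`, `content_model`, and `first_option_model` are hypothetical, and the two toy answer functions stand in for a real LLM call.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical interface: an answer function takes the question and the
# options in presentation order, and returns the TEXT of the chosen option.
AnswerFn = Callable[[str, List[str]], str]


def rotations(options: Sequence[str]) -> List[List[str]]:
    """Return every cyclic shift of the options (one assumed kind of configuration)."""
    return [list(options[i:]) + list(options[:i]) for i in range(len(options))]


def consistency(question: str, options: Sequence[str], answer_fn: AnswerFn) -> float:
    """Fraction of configurations in which the model picks its most frequent answer."""
    picks = [answer_fn(question, config) for config in rotations(options)]
    top_count = Counter(picks).most_common(1)[0][1]
    return top_count / len(picks)


def content_model(question: str, options: List[str]) -> str:
    """Toy stand-in that answers by content, regardless of option position."""
    return "Paris" if "Paris" in options else options[0]


def first_option_model(question: str, options: List[str]) -> str:
    """Toy stand-in with a pure position bias: always picks the first option."""
    return options[0]


question = "What is the capital of France?"
options = ["Paris", "Rome", "Berlin", "Madrid"]

print(consistency(question, options, content_model))       # 1.0
print(consistency(question, options, first_option_model))  # 0.25
```

A score of 1.0 means the chosen answer does not depend on the configuration, while a lower score signals the kind of response variability the abstract describes. This measure captures consistency only, not correctness, so a model can be consistently wrong.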