Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of applications, fundamentally reshaping the landscape of natural language processing (NLP) research. However, recent evaluation frameworks often rely on the output probabilities of LLMs for prediction, primarily due to computational constraints, which diverges from how LLMs are used in real-world scenarios. Although widely employed, the efficacy of these probability-based evaluation strategies remains an open research question. This study scrutinizes the validity of probability-based evaluation in the context of using LLMs for Multiple Choice Questions (MCQs) and highlights its inherent limitations. Our empirical investigation reveals that the prevalent probability-based evaluation method aligns poorly with generation-based prediction. The outcomes of our study can enhance the understanding of LLM evaluation methodologies and provide insights for future research in this domain.
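To make the contrast concrete, the sketch below illustrates the two prediction modes the abstract refers to: probability-based prediction scores only the answer-option tokens and picks the most likely one, while generation-based prediction lets the model produce text and parses the chosen option from it. This is a minimal illustration, not the paper's actual evaluation code; the model name ("gpt2"), prompt template, and answer-parsing rule are assumptions introduced here for demonstration.

```python
# Minimal sketch contrasting probability-based and generation-based MCQ
# prediction with a causal LM. Model name and prompt format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the models evaluated in the paper may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
options = ["A", "B", "C", "D"]
inputs = tokenizer(prompt, return_tensors="pt")

# Probability-based prediction: compare the next-token probabilities of the
# option letters and pick the highest, without generating any text.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
option_ids = [tokenizer.encode(" " + o)[-1] for o in options]
prob_pred = options[int(torch.argmax(next_token_logits[option_ids]))]

# Generation-based prediction: greedily generate a short continuation and
# parse the first option letter that appears in it.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
gen_pred = next((o for o in options if o in completion), None)

print("probability-based:", prob_pred, "| generation-based:", gen_pred)
```

The two modes can disagree: the most probable single option token is not necessarily the option named in the model's free-form continuation, which is the mismatch the study examines.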