Natural Language Processing (NLP) is witnessing a remarkable breakthrough driven by the success of Large Language Models (LLMs). LLMs have gained significant attention across academia and industry for their versatile applications in text generation, question answering, and text summarization. As the NLP landscape evolves with a growing number of domain-specific LLMs that employ diverse techniques and are trained on various corpora, evaluating the performance of these models becomes paramount. Quantifying that performance requires a comprehensive grasp of existing metrics, which play a pivotal role in evaluation. This paper offers a comprehensive exploration of LLM evaluation from a metrics perspective, providing insights into the selection and interpretation of metrics currently in use. Our main goal is to elucidate their mathematical formulations and statistical interpretations. We illustrate the application of these metrics using recent biomedical LLMs. Additionally, we offer a succinct comparison of these metrics to aid researchers in selecting appropriate metrics for diverse tasks. The overarching goal is to furnish researchers with a pragmatic guide for effective LLM evaluation and metric selection, thereby advancing the understanding and application of these large language models.