The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. Existing evaluation methods, however, do not account for the variance in scores caused by differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed around specific prompts is inappropriate for instruction tuning, which aims to produce models that perform well with any prompt. It is therefore necessary to measure NLU performance fairly, taking into account the score variance across different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores across templates. A comprehensive analysis of English and Japanese LLMs reveals that high variance across templates has a significant impact on the fair evaluation of LLMs.
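The name suggests an analogy to the Sharpe ratio from finance, which divides excess return by its standard deviation so that high but unstable returns are penalized. Below is a minimal sketch of how such a metric could be computed over per-template scores; the function name, the risk_free parameter, and the use of the population standard deviation are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def sharpe_score(template_scores, risk_free=0.0):
    """Sharpe-style score over per-template results: mean divided by
    standard deviation, so a model that is both accurate and stable
    across instruction templates ranks higher than one that is
    accurate only under a favorable prompt.

    NOTE: hypothetical sketch -- the paper's exact formulation
    (e.g., its baseline term or variance estimator) may differ.
    """
    mean = statistics.mean(template_scores)
    std = statistics.pstdev(template_scores)
    if std == 0:
        # All templates yield the same score: no prompt sensitivity.
        return float("inf") if mean > risk_free else 0.0
    return (mean - risk_free) / std

# Example: accuracies of one model under five instruction templates.
print(sharpe_score([0.82, 0.79, 0.85, 0.80, 0.81]))  # high mean, low variance
print(sharpe_score([0.95, 0.55, 0.88, 0.60, 0.72]))  # similar mean, penalized
```

Under this reading, two models with the same average accuracy are separated by their prompt sensitivity, which is exactly the fairness concern the abstract raises.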