One core capability of Large Language Models (LLMs) is following natural language instructions. However, the evaluation of this ability is not standardized: human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval), a straightforward and easy-to-reproduce evaluation benchmark for LLMs. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of such verifiable instructions and constructed around 500 prompts, each containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
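To make the idea of a "verifiable instruction" concrete, the following is a minimal sketch of how the two example instructions quoted in the abstract could be checked programmatically. It is not the code from the IFEval repository: the function names (`has_more_than_n_words`, `mentions_keyword_at_least`, `follows_all`) are hypothetical, and the choices of whitespace-based word counting and case-insensitive whole-word keyword matching are assumptions made for illustration.

```python
import re
from typing import Callable, Iterable


def has_more_than_n_words(response: str, n: int) -> bool:
    """Check an instruction like "write in more than 400 words".

    Assumption: words are whitespace-separated tokens.
    """
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, times: int) -> bool:
    """Check an instruction like "mention the keyword of AI at least 3 times".

    Assumption: matching is case-insensitive and on whole words only,
    so "AIM" does not count as a mention of "AI".
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= times


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt may carry several instructions; require every check to pass."""
    return all(check(response) for check in checks)


# Hypothetical prompt carrying both instructions from the abstract.
checks = [
    lambda r: has_more_than_n_words(r, 400),
    lambda r: mentions_keyword_at_least(r, "AI", 3),
]

long_response = " ".join(["AI"] * 401)  # 401 words, each a mention of "AI"
short_response = "AI is useful."         # too short, one mention

print(follows_all(long_response, checks))
print(follows_all(short_response, checks))
```

Because each check is a deterministic function of the response text, the result can be reproduced exactly and needs neither a human rater nor an evaluator LLM, which is the property the abstract emphasizes.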