FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Evaluating Large Language Models (LLMs) is challenging because aligning to human values requires composing multiple skills, and the required set of skills varies with the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks and (2) human or machine-based evaluation that gives an overall score to the response. However, both settings are coarse-grained: they do not account for the fact that user instructions require instance-wise skill composition, which limits how well the true capabilities of LLMs can be interpreted. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol for both model-based and human-based evaluation that decomposes coarse-level scoring into instance-wise, skill set-level scoring. Specifically, we define 12 fine-grained skills that LLMs need in order to follow open-ended user instructions, and we construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domains and difficulty level of each instance, FLASK provides a holistic view of a model's performance broken down by skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to identify how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
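
To make the idea of instance-wise, skill set-level scoring concrete, the following is a minimal sketch of how per-skill scores could be aggregated by skill, domain, and difficulty. It is not taken from the released FLASK code: the `ScoredInstance` class, the `aggregate` helper, the skill and domain names, and the 1-5 score scale are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredInstance:
    """One evaluation instance with its annotations and per-skill scores.

    Hypothetical structure: only the skills allocated to this instance
    appear in `skill_scores`.
    """
    domain: str
    difficulty: str
    skill_scores: dict[str, float]


def aggregate(instances, key):
    """Average all per-skill scores, grouped by `key(instance, skill)`."""
    buckets = defaultdict(list)
    for inst in instances:
        for skill, score in inst.skill_scores.items():
            buckets[key(inst, skill)].append(score)
    return {group: mean(scores) for group, scores in buckets.items()}


# Placeholder data: skill names, domains, and the 1-5 scale are assumptions.
instances = [
    ScoredInstance("math", "hard", {"logical_correctness": 2, "conciseness": 4}),
    ScoredInstance("coding", "easy", {"logical_correctness": 5}),
    ScoredInstance("math", "easy", {"conciseness": 3, "factuality": 4}),
]

by_skill = aggregate(instances, lambda inst, skill: skill)
by_domain = aggregate(instances, lambda inst, skill: inst.domain)
by_difficulty = aggregate(instances, lambda inst, skill: inst.difficulty)

print(by_skill)
print(by_domain)
print(by_difficulty)
```

Because each instance carries only the skills allocated to it, a single overall score is replaced by three breakdowns that can be compared across models, which is the kind of view the protocol is meant to provide.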
