Evaluating Large Language Models (LLMs) is challenging because instruction following requires alignment with human values, and the set of skills required varies from one instruction to the next. However, previous studies have mainly focused on coarse-grained evaluation (i.e., overall preference-based evaluation), which limits interpretability because it does not account for the fact that user instructions require instance-wise skill composition. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation that decomposes coarse-level scoring into skill set-level scoring for each instruction. We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance and for increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations. We publicly release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
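To make the contrast between coarse-level and skill set-level scoring concrete, the following is a minimal sketch and not the released FLASK implementation. The record layout, the skill names, the 1-5 score scale, and the function names `coarse_score` and `per_skill_scores` are all assumptions made for illustration; the data are toy values, not results from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: each instruction is scored only on the skills
# it requires (instance-wise skill composition). Skill names and the 1-5
# scale are illustrative assumptions, not taken from the abstract.
evaluations = [
    {"instruction_id": 1, "scores": {"logical_correctness": 4, "conciseness": 5}},
    {"instruction_id": 2, "scores": {"factuality": 2, "conciseness": 3}},
    {"instruction_id": 3, "scores": {"logical_correctness": 2, "factuality": 4}},
]


def coarse_score(records):
    """Collapse everything into one overall number, ignoring skill labels."""
    all_scores = [s for r in records for s in r["scores"].values()]
    return mean(all_scores)


def per_skill_scores(records):
    """Average scores separately for each skill across instructions."""
    by_skill = defaultdict(list)
    for record in records:
        for skill, score in record["scores"].items():
            by_skill[skill].append(score)
    return {skill: mean(values) for skill, values in by_skill.items()}


if __name__ == "__main__":
    print("coarse:", round(coarse_score(evaluations), 2))
    for skill, score in per_skill_scores(evaluations).items():
        print(f"{skill}: {score}")
```

In this toy data the single coarse number hides that the hypothetical model scores higher on conciseness than on the other two skills, which is the kind of information a per-skill breakdown keeps visible.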