This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating the ability of Large Language Models (LLMs) to follow instructions. Addressing a gap in current evaluation methodologies, DRFR breaks complex instructions down into simpler criteria, enabling a detailed analysis of how well an LLM complies with each aspect of a task. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and examine three annotation sources: human experts, crowd-sourced workers, and GPT-4. The findings show that DRFR is more reliable than traditional scoring and that GPT-4 is an effective, cost-efficient annotator. Evaluating several advanced LLMs with this framework reveals their strengths and the areas needing improvement, particularly in following complex instructions. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
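To make the idea of a decomposed ratio concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each instruction has been split into yes/no criteria and that DRFR is the fraction of all criteria judged satisfied, pooled across instructions; the abstract does not state the exact formula, so that aggregation is an assumption. The function name `drfr` and the nested-list input format are hypothetical.

```python
from typing import Sequence


def drfr(judgments: Sequence[Sequence[bool]]) -> float:
    """Return the share of decomposed criteria that were satisfied.

    judgments[i][j] is True when the response to instruction i was judged
    to satisfy decomposed criterion j. Assumption: criteria are pooled
    across all instructions rather than averaged per instruction.
    """
    total = sum(len(row) for row in judgments)
    if total == 0:
        raise ValueError("no decomposed criteria to score")
    satisfied = sum(sum(row) for row in judgments)  # True counts as 1
    return satisfied / total


# Two hypothetical instructions, with 3 and 2 decomposed criteria.
example = [
    [True, True, False],  # 2 of 3 criteria satisfied
    [True, False],        # 1 of 2 criteria satisfied
]
score = drfr(example)  # 3 satisfied out of 5 criteria
```

Under this assumption, a response that meets most but not all criteria still earns partial credit, which is what distinguishes a decomposed score from a single pass/fail judgment per instruction.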