Instruction-tuned Large Language Models (LLMs) have achieved remarkable performance across various benchmark tasks. Although guiding an LLM's generations through instructions is user-friendly, how well these models actually follow instructions remains unclear because suitable evaluation metrics are lacking. In this paper, we focus on evaluating the instruction-following ability of LLMs in the context of story-ending generation, a task that requires diverse and context-specific instructions. We propose an automatic evaluation pipeline that uses a machine reading comprehension (MRC) model to determine whether a generated story ending reflects the given instruction. Our results show that the proposed metric aligns with human evaluation. Furthermore, our experiments confirm that recent open-source LLMs can achieve instruction-following performance close to that of GPT-3.5, as assessed through automatic evaluation.
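The MRC-based check described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how an instruction is turned into an MRC query, so the pairing of each instruction with a question and an expected answer is an assumption made here for illustration. The names `Example`, `follows_instruction`, `instruction_following_rate`, and `toy_mrc` are hypothetical, and `toy_mrc` is a word-overlap placeholder standing in for a real MRC model, not the model used in the paper.

```python
from typing import Callable, Iterable, NamedTuple


class Example(NamedTuple):
    ending: str    # story ending generated by the LLM under evaluation
    question: str  # assumption: a question derived from the instruction
    expected: str  # assumption: the answer the instruction implies


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so string comparison is lenient."""
    return " ".join(text.lower().split())


def follows_instruction(ex: Example, mrc: Callable[[str, str], str]) -> bool:
    """True if the answer the reader extracts from the ending contains the expected answer."""
    answer = normalize(mrc(ex.question, ex.ending))
    return bool(answer) and normalize(ex.expected) in answer


def instruction_following_rate(
    examples: Iterable[Example], mrc: Callable[[str, str], str]
) -> float:
    """Fraction of endings judged to reflect their instruction (0.0 for no examples)."""
    items = list(examples)
    if not items:
        return 0.0
    return sum(follows_instruction(ex, mrc) for ex in items) / len(items)


def toy_mrc(question: str, context: str) -> str:
    """Placeholder reader: returns the context sentence sharing the most words with the question."""

    def tokens(text: str) -> set:
        return {w.strip("?.,!").lower() for w in text.split()}

    q = tokens(question)
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    best = max(sentences, key=lambda s: len(q & tokens(s)), default="")
    return best if q & tokens(best) else ""


if __name__ == "__main__":
    data = [
        Example("Anna opened the box. She found a puppy inside.", "What was inside?", "a puppy"),
        Example("Anna opened the box. It was empty inside.", "What was inside?", "a puppy"),
    ]
    print(instruction_following_rate(data, toy_mrc))
```

Passing the reader as a callable keeps the scoring logic independent of any particular MRC model, so a trained extractive reader could replace `toy_mrc` without changing the rest of the sketch.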