Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we study the reliability of such methods across a broad range of tasks and in a cross-lingual setting. In contrast to previous findings, we observe considerable variability in correlations between automatic methods and human evaluators when scores are differentiated by task type. Specifically, the widely used ROUGE-L metric correlates strongly with human judgments for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual transfer. GPT-4's effectiveness as an evaluator depends on including reference answers in the assessment prompt, but doing so can lead to overly strict evaluations in free-form generation tasks. In summary, we find that, while automatic evaluation methods can approximate human judgments under specific conditions, their reliability is highly context-dependent. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
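To illustrate what a text-overlap metric measures, the following is a minimal sketch of a ROUGE-L style F1 score based on the longest common subsequence (LCS) of tokens. It is not the implementation used in this work: the function names `lcs_length` and `rouge_l_f1`, the lowercased whitespace tokenization, and the equal weighting of precision and recall are all assumptions made for illustration, and standard ROUGE toolkits may differ in tokenization, stemming, and weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L style F1 between a candidate and a reference string.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Short answer: an exact match scores 1.0.
short_score = rouge_l_f1("Paris", "Paris")

# Free-form answer: a paraphrase with no shared words scores 0.0,
# whatever a human rater would think of it.
free_form_score = rouge_l_f1("a cat sat", "dogs run")
```

Under these assumptions, the score depends only on shared token order, which is consistent with the metric tracking human judgments more closely when answers are short and closely match the reference than when many differently worded outputs are acceptable.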