Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks such as Question Answering, Summarization, and Classification. Using LLMs as evaluators, which rank or score the output of other models (usually LLMs), has become increasingly popular because of the limitations of current evaluation techniques, including the lack of appropriate benchmarks and metrics, cost, and limited access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks, and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit a bias towards higher scores. They should therefore be used with caution and always calibrated against a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.