Large Language Models (LLMs) excel at a wide range of Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top $20$, remains inadequate because of the limitations of existing benchmarks and metrics. Using LLMs as evaluators that rank or score other models' outputs has emerged as a viable alternative, since it sidesteps the constraints of human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4, to improve multilingual evaluation by calibrating them against $20$K human judgments spanning three text-generation tasks, five metrics, and eight languages. Our analysis reveals that GPT-4-based evaluators are biased towards higher scores, which underscores the need to calibrate them against native-speaker judgments, especially for low-resource and non-Latin-script languages, to ensure accurate evaluation of LLM performance across diverse languages.