Modern large language models (LLMs) should generally benefit individuals from various cultural backgrounds around the world. However, most recent advanced generative evaluation benchmarks tailored for LLMs focus mainly on English. To address this gap, we introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs, such as general knowledge and logical reasoning. Each question is rigorously verified by human annotators. Notably, to better reflect how well LLMs adapt to different cultural contexts, we localize the questions for each non-English language. The current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar). Following AlpacaEval, we employ GPT-4 as the adjudicator to automatically score different model outputs, a method that has been shown to correlate closely with human evaluation. We evaluate several representative multilingual LLMs on the proposed OMGEval, which we believe will provide a valuable reference for the community to further understand and improve the multilingual capability of LLMs. OMGEval is available at https://github.com/blcuicall/OMGEval.
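As a rough illustration of the AlpacaEval-style judging described above, the sketch below shows how GPT-4 can be prompted to compare a candidate answer against a reference answer and how a win rate can be aggregated. The prompt wording, model name, and helper functions are illustrative assumptions, not the exact prompts or scripts used by OMGEval.

```python
# A minimal sketch of AlpacaEval-style automatic judging with GPT-4,
# assuming the OpenAI Python client. Prompt text and function names
# are hypothetical and simplified for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Reply with a single letter: "A" if Answer A is better, "B" if Answer B is better."""


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of two answers is better; returns "A" or "B"."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def win_rate(questions, candidate_answers, reference_answers) -> float:
    """Fraction of questions on which the candidate answer is preferred."""
    wins = sum(
        judge_pair(q, cand, ref) == "A"
        for q, cand, ref in zip(questions, candidate_answers, reference_answers)
    )
    return wins / len(questions)
```

In practice, judging frameworks also randomize the order of the two answers and average over both orderings to reduce position bias; that step is omitted here for brevity.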