The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, most of these benchmarks currently focus only on performance in English, and few evaluations include Southeast Asian (SEA) languages. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR), (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we include only Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at present, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future.