BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models

The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, most of these benchmarks currently focus only on performance in English, and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR), (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we include only Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future. The repository for this paper can be found at: https://github.com/aisingapore/BHASA
