The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to widely used benchmark suites such as the H6 benchmark. However, these suites are primarily built for English, and comparable benchmarks are lacking for languages under-represented in LLM development, such as Thai. Moreover, developing LLMs for Thai should enhance cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multilingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we will make both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.