Despite the impressive performance of LLMs on English-language tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines, such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs struggle with reading comprehension and translation. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving a score of only 72.23%. Moreover, we find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing such model, SEA-LION v3 70B, achieving a score of only 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to drive progress in Filipino NLP and to increase the inclusion of Philippine languages in LLM development.