Large Language Models (LLMs) have emerged as one of the most important breakthroughs in NLP, owing to their impressive capabilities in language generation and other language-specific tasks. Although LLMs have been evaluated on a wide range of tasks, mostly in English, they have not yet been thoroughly evaluated in under-resourced languages such as Bengali (Bangla). To this end, this paper introduces BenLLM-Eval, a comprehensive evaluation that benchmarks the performance of LLMs in Bengali, a language with modest resources. We select a diverse set of important Bengali NLP tasks (text summarization, question answering, paraphrasing, natural language inference, transliteration, text classification, and sentiment analysis) for zero-shot evaluation of three popular LLMs: GPT-3.5, LLaMA-2-13b-chat, and Claude-2. Our experimental results show that, while zero-shot LLMs achieve performance on par with, or even better than, current SOTA fine-tuned models in some Bengali NLP tasks, their performance in most tasks is quite poor in comparison to the current SOTA results, with open-source LLMs such as LLaMA-2-13b-chat performing particularly poorly. These findings call for further efforts to develop a better understanding of LLMs in modestly resourced languages like Bengali.