The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, customizing them for specific tasks and datasets is often complex for users. In this study, we introduce the LLMeBench framework. Initially developed to evaluate Arabic NLP tasks using OpenAI's GPT and BLOOM models, it can be seamlessly customized for any NLP task and model, regardless of language. The framework also features zero- and few-shot learning settings. A new custom dataset can be added in less than 10 minutes, and users can use their own model API keys to evaluate the task at hand. The framework has already been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We plan to open-source the framework for the community (https://github.com/qcri/LLMeBench/). A video demonstrating the framework is available online (https://youtu.be/FkQn4UjYA0s).