Large language models (LLMs) such as GPT-3.5 and GPT-4, which underlie the chat-based model ChatGPT, have demonstrated impressive performance on various downstream tasks without requiring fine-tuning. Although other languages make up a smaller proportion of their training data than English does, these models also exhibit remarkable capabilities in those languages. In this study, we assess the performance of GPT-3.5 and GPT-4 on seven distinct Arabic NLP tasks: sentiment analysis, translation, transliteration, paraphrasing, part-of-speech tagging, summarization, and diacritization. Our findings reveal that GPT-4 outperforms GPT-3.5 on five of the seven tasks. Furthermore, we conduct an extensive analysis of the sentiment analysis task, providing insights into how LLMs achieve exceptional results on a challenging dialectal dataset. Additionally, we introduce a new Python interface (https://github.com/ARBML/Taqyim) that makes it easy to evaluate these tasks.