With large Foundation Models (FMs), language technologies (and AI in general) are entering a new paradigm: eliminating the need to develop large-scale task-specific datasets and supporting a variety of tasks through setups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires a systematic benchmarking effort that compares their performance with state-of-the-art (SOTA) task-specific models. With that goal, past work has focused on the English language and included a few efforts covering multiple languages. Our study contributes to this ongoing research by evaluating FMs' performance on standard Arabic NLP and speech processing, including a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, FMs perform on par with or exceed the SOTA models, but for the majority they underperform. Given the importance of the prompt to FM performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.