Large language models (LLMs) finetuned to follow human instructions have recently emerged as a breakthrough in AI. Models such as Google Bard and OpenAI ChatGPT, for example, are surprisingly powerful tools for question answering, code debugging, and dialogue generation. Despite the purported multilingual proficiency of these models, their linguistic inclusivity remains insufficiently explored. To address this gap, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) on machine translation across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties, including Classical Arabic, Modern Standard Arabic, and several dialectal variants. Furthermore, we conduct a human-centric study to examine how well the most recent model, Bard, follows human instructions during translation tasks. Our analysis indicates that LLMs may encounter challenges with certain Arabic dialects, particularly those for which minimal public data exists, such as Algerian and Mauritanian dialects. However, they perform satisfactorily on more prevalent dialects, albeit occasionally trailing behind established commercial systems like Google Translate. Additionally, our analysis reveals that Bard has only a limited ability to follow human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.