ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT 'knows', we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 24 language families spoken across five continents. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource. We then study the ability of ChatGPT (both GPT-3.5 and GPT-4) to (i) identify language names and language codes, (ii) under zero- and few-shot conditions, and (iii) with and without a provided label set. We find that ChatGPT lags behind smaller finetuned LID tools. For example, it performs poorly on African languages. We conclude that current large language models would benefit from further development before they can sufficiently serve diverse communities.