In this paper, we compare general-purpose pretrained models (GPT-4-Turbo and Llama-3-8b-Instruct) with special-purpose models fine-tuned on specific tasks (XLM-Roberta-large, mT5-large, and Llama-3-8b-Instruct). We focus on seven classification tasks and six generation tasks to evaluate the performance of these models on the Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite rapid advancements in Large Language Models (LLMs), their performance on low-resource languages, including Urdu, remains underexplored. We also conduct a human evaluation of the generation tasks and compare the results with evaluations performed by GPT-4-Turbo and Llama-3-8b-Instruct. We find that special-purpose models consistently outperform general-purpose models across the various tasks. We also find that, for generation tasks, the evaluations produced by GPT-4-Turbo align more closely with human evaluation than those produced by Llama-3-8b-Instruct. This paper contributes to the NLP community by providing insights into the effectiveness of general-purpose and special-purpose LLMs for low-resource languages.