In this paper, we compare general-purpose pretrained models (GPT-4-Turbo and Llama-3-8b-Instruct) with special-purpose models fine-tuned on specific tasks (XLM-Roberta-large, mT5-large, and Llama-3-8b-Instruct). We focus on seven classification tasks and six generation tasks to evaluate the performance of these models on the Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite rapid advancements in Large Language Models (LLMs), their performance on low-resource languages, including Urdu, remains underexplored. We also conduct a human evaluation of the generation tasks and compare the results with evaluations performed by GPT-4-Turbo and Llama-3-8b-Instruct. We find that special-purpose models consistently outperform general-purpose models across the various tasks. We also find that, for generation tasks, the evaluations produced by GPT-4-Turbo align more closely with human evaluation than those produced by Llama-3-8b-Instruct. This paper contributes to the NLP community by providing insights into the effectiveness of general-purpose and special-purpose LLMs for low-resource languages.