Sun-Shine: A Large Language Model for Tibetan Culture

Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu

Tibetan, a minority language in China, has a highly intricate grammatical structure: four verb tenses and a tense system with frequent irregularities contribute to its extensive inflectional diversity. Recent advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite this success, current LLMs often fall short of the needs of domain experts such as Tibetan speakers, and the potential of LLMs for Tibetan culture remains under-explored. The underlying reasons are the immense and intricate nature of Tibetan culture and the need for finer granularity and richer knowledge. At the same time, the complexity and uniqueness of Tibetan grammar, coupled with its status as a minority language, lead to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which specializes in a range of Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset of diverse Tibetan texts, including literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks such as language modeling, text classification, machine translation, and syntactic analysis. Moreover, it performs well in low-resource scenarios, showing strong generalization capabilities.

