Tibetan, a minority language in China, has a highly intricate grammatical structure: its four verb tenses and the frequent irregularities in its tense system give rise to extensive inflectional diversity. Recent advances in Large Language Models (LLMs) have transformed many domains. Despite this success, current LLMs often fall short of the needs of domain experts, including those working in Tibetan, and their potential for Tibetan culture remains under-explored. One underlying reason is the breadth and intricacy of Tibetan culture, which demands finer-grained and richer knowledge. Another is data scarcity, a fundamental challenge that stems from the complexity and uniqueness of Tibetan grammar and from its status as a minority language. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which performs expertly across a range of Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset of diverse Tibetan texts, including literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks such as language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showing strong generalization capabilities.