Large language models (LLMs) have achieved remarkable success in high-resource languages, yet progress for Tibetan remains severely constrained by the lack of large-scale, high-quality, and structured data. Existing Tibetan resources are fragmented, domain-limited, and insufficient to support modern LLM pipelines requiring pretraining, instruction tuning, safety alignment, and reasoning supervision. We introduce the \textbf{T}ibetan \textbf{F}oundation \textbf{D}ataset (\textbf{TFD}), the first comprehensive, large-scale, and expert-curated dataset explicitly designed for Tibetan large language modeling. \textit{TFD} comprises two complementary components: \textit{TIBSTC}, a unified corpus of over 11 billion tokens spanning literature, law, medicine, religion, and everyday communication; and \textit{TIBSTC-CoT}, the first large-scale Tibetan chain-of-thought dataset supporting explicit multi-step reasoning across diverse domains. Unlike prior Tibetan datasets, \textit{TFD} is structurally organized to support the full LLM development lifecycle, including pretraining, supervised fine-tuning, safety alignment, and preference optimization. We demonstrate its utility by training the \textit{Sun-Shine} family of Tibetan LLMs and evaluating them on understanding, safety, reasoning, and generation tasks. Results show consistent improvements over strong open-source and proprietary baselines, underscoring the importance of large-scale, structured data for low-resource language modeling. We release \textit{TFD} to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at \url{https://github.com/Vicentvankor/sun-shine}.