We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART models are compact LLMs built on proven ideas from BART and mBART and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that a pre-trained LLM dedicated to Turkish outperforms multilingual models up to 3x its size, improving on existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer, and the cleaned 135 GB vngrs-web-corpus are publicly available at huggingface.co/vngrs-ai.
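As an illustration of the tokenizer-efficiency claim above, the following is a minimal sketch of how one might compare a monolingual Turkish tokenizer against a multilingual one by measuring fertility (subword tokens produced per word). The checkpoint names and the sample sentence are assumptions for demonstration, not taken from the paper.

```python
# Sketch only: compare tokenizer fertility (tokens per word) on Turkish text.
# The repository identifiers below are assumed placeholders; substitute the
# actual tokenizer checkpoints you intend to evaluate.
from transformers import AutoTokenizer

text = "Türkçe doğal dil işleme için önceden eğitilmiş modeller büyük fayda sağlar."

monolingual = AutoTokenizer.from_pretrained("vngrs-ai/VBART-Large-Summarization")  # assumed repo id
multilingual = AutoTokenizer.from_pretrained("google/mt5-base")

def fertility(tokenizer, sentence: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    tokens = tokenizer.tokenize(sentence)
    return len(tokens) / len(sentence.split())

print(f"monolingual fertility:  {fertility(monolingual, text):.2f}")
print(f"multilingual fertility: {fertility(multilingual, text):.2f}")
```

A lower fertility means fewer tokens per word, which translates directly into shorter sequences and cheaper training and inference for the same text.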