We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART models are compact LLMs that build on ideas from BART and mBART and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that a pre-trained LLM dedicated to Turkish outperforms multilingual models by up to 3x, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is 7x more efficient than OpenAI's multilingual tokenizer. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned web corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.
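Because the fine-tuned models and tokenizer are released on the Hugging Face Hub, a usage sketch like the one below applies. It is a minimal illustration only: the repository id vngrs-ai/VBART-Large-Summarization and the generation settings are assumptions, not confirmed by the abstract, and it assumes the checkpoints load through the standard transformers sequence-to-sequence classes.

```python
# Minimal sketch of loading a fine-tuned VBART checkpoint from the Hub.
# NOTE: the repository id and generation parameters below are assumptions
# for illustration; consult huggingface.co/vngrs-ai for the actual names.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "vngrs-ai/VBART-Large-Summarization"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "..."  # Turkish input document to summarize
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```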