We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART models are compact LLMs that build on ideas from BART and mBART and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that a pre-trained LLM dedicated to Turkish outperforms multilingual models by up to 3x, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is 7x more efficient than OpenAI's multilingual tokenizer. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned web corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.
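Because the fine-tuned models and tokenizer are released on the Hugging Face Hub, a usage sketch like the one below applies. It is a minimal illustration only: the repository id vngrs-ai/VBART-Large-Summarization and the generation settings are assumptions, not confirmed by the abstract, and it assumes the checkpoints load through the standard transformers sequence-to-sequence classes.

```python
# Minimal sketch of loading a fine-tuned VBART checkpoint from the Hub.
# NOTE: the repository id and generation parameters below are assumptions
# for illustration; consult huggingface.co/vngrs-ai for the actual names.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "vngrs-ai/VBART-Large-Summarization"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "..."  # Turkish input document to summarize
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```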