We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART models are compact LLMs built on proven ideas from BART and mBART and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that a pre-trained LLM dedicated to Turkish outperforms multilingual models up to 3x its size, improving on existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer, and the cleaned 135 GB vngrs-web-corpus are publicly available at huggingface.co/vngrs-ai.
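As an illustration of the tokenizer-efficiency claim above, the following is a minimal sketch of how one might compare a monolingual Turkish tokenizer against a multilingual one by measuring fertility (subword tokens produced per word). The checkpoint names and the sample sentence are assumptions for demonstration, not taken from the paper.

```python
# Sketch only: compare tokenizer fertility (tokens per word) on Turkish text.
# The repository identifiers below are assumed placeholders; substitute the
# actual tokenizer checkpoints you intend to evaluate.
from transformers import AutoTokenizer

text = "Türkçe doğal dil işleme için önceden eğitilmiş modeller büyük fayda sağlar."

monolingual = AutoTokenizer.from_pretrained("vngrs-ai/VBART-Large-Summarization")  # assumed repo id
multilingual = AutoTokenizer.from_pretrained("google/mt5-base")

def fertility(tokenizer, sentence: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    tokens = tokenizer.tokenize(sentence)
    return len(tokens) / len(sentence.split())

print(f"monolingual fertility:  {fertility(monolingual, text):.2f}")
print(f"multilingual fertility: {fertility(multilingual, text):.2f}")
```

A lower fertility means fewer tokens per word, which translates directly into shorter sequences and cheaper training and inference for the same text.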