To build large language models for Ukrainian, we need to expand our corpora with a large amount of new algorithmic tasks expressed in natural language. Since examples of such tasks are abundant in English, a high-quality translation system would let our community curate datasets much faster. To this end, we introduce a recipe for building a translation system through supervised finetuning of a large pretrained language model: a first phase on a noisy parallel dataset of 3M Ukrainian-English sentence pairs, followed by a second phase on 17K examples selected from a higher-quality dataset by k-fold perplexity filtering. Our decoder-only model, Dragoman, outperforms previous state-of-the-art encoder-decoder models on the FLORES devtest set.
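To illustrate the k-fold perplexity-filtering idea mentioned above, here is a minimal toy sketch. It is not the paper's actual pipeline: the real recipe scores examples with a large pretrained language model, while this sketch substitutes an add-one-smoothed unigram model so it runs self-contained. The function names (`unigram_perplexity`, `kfold_perplexity_filter`) and the `keep_ratio` parameter are illustrative assumptions. The core pattern is the same: split the dataset into k folds, score each fold with a model that never saw it, and keep the lowest-perplexity examples.

```python
import math
from collections import Counter


def unigram_perplexity(tokens, counts, total, vocab_size):
    """Toy stand-in for an LM: add-one-smoothed unigram perplexity.

    The real recipe would use a pretrained language model here.
    """
    log_prob = 0.0
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(tokens), 1))


def kfold_perplexity_filter(examples, k=5, keep_ratio=0.5):
    """Score each fold with statistics gathered from the other k-1 folds,
    then keep the lowest-perplexity fraction of the whole dataset."""
    folds = [examples[i::k] for i in range(k)]
    scored = []
    for i, fold in enumerate(folds):
        # "Train" on every fold except the one being scored.
        rest = [tok for j, f in enumerate(folds) if j != i
                for ex in f for tok in ex.split()]
        counts = Counter(rest)
        vocab_size = len(counts) + 1  # +1 for unseen tokens
        total = len(rest)
        for ex in fold:
            ppl = unigram_perplexity(ex.split(), counts, total, vocab_size)
            scored.append((ppl, ex))
    scored.sort(key=lambda t: t[0])
    keep = max(1, int(len(scored) * keep_ratio))
    return [ex for _, ex in scored[:keep]]
```

Because each example is scored by a model that excludes its own fold, noisy outliers (which no other fold explains well) receive high perplexity and are dropped, while examples resembling the bulk of the data are kept.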