This paper presents the SmurfCat team's solution to the Multilingual Text Detoxification task at the PAN-2024 competition. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. With the obtained data, we fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on the text detoxification task. We then applied the ORPO alignment technique to the final model. Our final model has only 3.7 billion parameters and achieves state-of-the-art results for Ukrainian and near state-of-the-art results for the other languages. In the competition, our team achieved first place in the automatic evaluation with a score of 0.52 and second place in the final human evaluation with a score of 0.74.