Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the resulting synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM also outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
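To make the rewriting step concrete, the following is a minimal sketch of few-shot LLM-based detoxification using the Hugging Face transformers text-generation pipeline; the model name, prompt wording, and exemplar pairs are illustrative assumptions, not the exact configuration used to build SynthDetoxM.

```python
# A minimal sketch of the few-shot detoxification rewriting step.
# The model name, prompt wording, and exemplar pairs are illustrative
# assumptions, not the exact configuration used to build SynthDetoxM.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # stand-in for any open-source instruct LLM
)

# Few-shot exemplars: (toxic sentence, detoxified rewrite) pairs.
FEW_SHOT = [
    ("You are a complete idiot for thinking that.",
     "I strongly disagree with your reasoning."),
    ("This garbage product is a total scam.",
     "This product did not meet my expectations."),
]

INSTRUCTION = (
    "Rewrite the toxic sentence so that it is polite and non-toxic "
    "while preserving the original meaning. Reply with the rewritten "
    "sentence only."
)

def detoxify(text: str) -> str:
    """Build a few-shot prompt and return the model's neutral rewrite."""
    shots = "\n\n".join(f"Toxic: {t}\nNeutral: {n}" for t, n in FEW_SHOT)
    prompt = f"{INSTRUCTION}\n\n{shots}\n\nToxic: {text}\nNeutral:"
    out = generator(
        prompt,
        max_new_tokens=64,
        do_sample=False,          # greedy decoding for reproducible rewrites
        return_full_text=False,   # return only the generated continuation
    )
    return out[0]["generated_text"].strip()

print(detoxify("Shut up, nobody cares about your pathetic opinion."))
```

In a pipeline like the one described above, such a function would be run over sentences drawn from toxicity evaluation datasets in each target language, with language-appropriate few-shot exemplars swapped in per language.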