Text Detoxification as Style Transfer in English and Hindi

This paper focuses on text detoxification, i.e., automatically converting toxic text into non-toxic text. This task contributes to safer and more respectful online communication and can be considered a Text Style Transfer (TST) task, in which the style of a text changes while its content is preserved. We present three approaches: knowledge transfer from a similar task; a multi-task learning approach that combines sequence-to-sequence modeling with various toxicity classification tasks; and a delete-and-reconstruct approach. To support our research, we use a dataset provided by Dementieva et al. (2021), which contains multiple detoxified versions of each toxic text. In our experiments, expert human annotators selected the best variant for each text, yielding a dataset in which each toxic sentence is paired with a single, appropriate detoxified version. Additionally, we introduce a small Hindi parallel dataset, aligned with part of the English dataset, that is suitable for evaluation. Our results demonstrate that our approach effectively balances detoxification with content preservation and fluency.
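To make the delete-and-reconstruct idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract gives no details, so the lexicon `TOXIC_LEXICON`, the `<mask>` token, and the functions `delete_toxic` and `reconstruct` are all hypothetical names introduced for illustration. In particular, `reconstruct` is only a placeholder for whatever learned model would rewrite the masked sentence.

```python
import re

# Hypothetical toy lexicon (assumption); a real system would learn or curate it.
TOXIC_LEXICON = {"stupid", "idiot", "dumb"}
MASK = "<mask>"


def delete_toxic(text, lexicon=TOXIC_LEXICON, mask=MASK):
    """Delete step: replace lexicon words with a mask token, keep the rest."""
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [mask if tok.lower() in lexicon else tok for tok in tokens]


def reconstruct(tokens, fill=None, mask=MASK):
    """Reconstruct step (placeholder): fill or drop each mask, then rejoin.

    A learned model would normally generate the replacement; here the
    caller supplies an optional fixed filler word instead.
    """
    kept = []
    for tok in tokens:
        if tok == mask:
            if fill is not None:
                kept.append(fill)
        else:
            kept.append(tok)
    text = " ".join(kept)
    # Reattach punctuation to the preceding word.
    return re.sub(r"\s+([^\w\s])", r"\1", text)


masked = delete_toxic("That was a stupid idea, honestly.")
print(masked)
print(reconstruct(masked))
print(reconstruct(masked, fill="questionable"))
```

Simply dropping the masked word leaves "a idea", which is not fluent. This illustrates why a reconstruction step is needed after deletion, and why fluency is evaluated alongside content preservation.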

