A Text-to-Text Model for Multilingual Offensive Language Identification

The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNet, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g., hate speech, cyberbullying, and cyberaggression). However, most of these models are limited by their encoder-only architecture, which restricts the number and types of labels in downstream tasks. To address these limitations, this study presents the first pre-trained encoder-decoder model for offensive language identification, built on the text-to-text transformer (T5) and trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining the two datasets and of selecting an optimal threshold for the semi-supervised instances in SOLID during the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on datasets in six languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state of the art on all of these datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.
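As a rough illustration of the text-to-text framing and the SOLID threshold selection mentioned above, the sketch below turns semi-supervised instances into (input text, target text) pairs. It is a minimal sketch under stated assumptions, not the authors' pipeline: the field names `avg_conf` and `conf_std`, the task prefix, the label strings `OFF`/`NOT`, and the default cutoff values are all hypothetical placeholders.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SolidInstance(NamedTuple):
    """One semi-supervised instance (field names are assumptions)."""
    text: str
    avg_conf: float  # assumed aggregated offensiveness score in [0, 1]
    conf_std: float  # assumed disagreement between the scoring models


def to_text_pairs(
    instances: Iterable[SolidInstance],
    threshold: float = 0.5,   # placeholder, not a value from the paper
    max_std: float = 0.2,     # placeholder, not a value from the paper
    prefix: str = "offensive classification: ",  # hypothetical task prefix
) -> List[Tuple[str, str]]:
    """Build (source, target) string pairs for a text-to-text model."""
    pairs = []
    for inst in instances:
        # Skip instances whose scores disagree too much to trust the label.
        if inst.conf_std > max_std:
            continue
        # The label is emitted as plain text, so the label set is not
        # fixed by a classification head.
        label = "OFF" if inst.avg_conf >= threshold else "NOT"
        pairs.append((prefix + inst.text, label))
    return pairs


examples = [
    SolidInstance("you are awful", 0.9, 0.05),
    SolidInstance("nice day", 0.1, 0.05),
    SolidInstance("hmm", 0.5, 0.4),
]
pairs = to_text_pairs(examples)
```

Because the target is a string rather than a class index, the same pair format could in principle carry other label sets or several languages without changing the model's output layer.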

