OffLanDat: A Community Based Implicit Offensive Language Dataset Generated by Large Language Model Through Prompt Engineering

Offensive language is widespread on social media and harms societal well-being, so addressing it is a high priority. Offensive language occurs in both explicit and implicit forms, and the implicit form is harder to detect. Current research in this domain faces two main challenges. First, existing datasets are built mainly by collecting texts that contain explicit offensive keywords, so they rarely capture implicitly offensive content that lacks such keywords. Second, common methods focus solely on textual analysis and neglect the insights that community information can provide. In this paper, we introduce OffLanDat, a novel community-based implicit offensive language dataset generated by ChatGPT, containing data for 38 different target groups. Although ChatGPT's ethical constraints limit its ability to generate offensive text, we present a prompt-based approach that effectively generates implicit offensive language. To ensure data quality, we have human annotators evaluate the data. Additionally, we apply a prompt-based zero-shot method with ChatGPT and compare the detection results of human annotation and ChatGPT annotation. We also test existing state-of-the-art models to measure how effectively they detect such language. We will make our code and dataset public for other researchers.
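The comparison between human annotation and ChatGPT annotation mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the prompt wording, the label names, the function names, and the choice of Cohen's kappa as the agreement measure are not taken from the paper, and no model is actually called.

```python
from collections import Counter

# Hypothetical label set; the dataset's real label scheme may differ.
LABELS = ("offensive", "not offensive")


def build_zero_shot_prompt(text: str, target_group: str) -> str:
    """Build an illustrative zero-shot prompt (not the paper's actual wording)."""
    return (
        f"Target group: {target_group}\n"
        f"Text: {text}\n"
        "Is this text implicitly offensive toward the target group? "
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )


def percent_agreement(human: list[str], model: list[str]) -> float:
    """Fraction of items on which the two annotation sources agree."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohen_kappa(human: list[str], model: list[str]) -> float:
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(human)
    observed = percent_agreement(human, model)
    human_counts, model_counts = Counter(human), Counter(model)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(
        human_counts[label] * model_counts[label]
        for label in set(human_counts) | set(model_counts)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up labels for demonstration only; these are not results from the paper.
    human_labels = ["offensive", "offensive", "not offensive", "not offensive"]
    model_labels = ["offensive", "not offensive", "not offensive", "not offensive"]
    print(build_zero_shot_prompt("<text to classify>", "<target group>"))
    print("agreement:", percent_agreement(human_labels, model_labels))
    print("kappa:", cohen_kappa(human_labels, model_labels))
```

Raw agreement alone can look high when one label dominates, which is why the sketch also reports a chance-corrected score; whether the authors use such a measure is not stated in the abstract.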

