Automated Identification of Toxic Code Reviews Using ToxiCR

Toxic conversations during software development interactions may have serious repercussions for a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, become demotivated, and eventually leave the project. Automated filtering of toxic conversations may help a FOSS community maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) datasets, such as one curated from code review comments. To counter this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR includes a choice of ten supervised learning algorithms, an option to select text vectorization techniques, eight preprocessing steps, and a large-scale labeled dataset of 19,571 code review comments. Two of those eight preprocessing steps are specific to the SE domain. Through a rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset, which achieves 95.8% accuracy and an 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have made our dataset, pre-trained models, evaluation results, and source code publicly available at: https://github.com/WSU-SEAL/ToxiCR
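
The abstract reports both accuracy and an F1 score. As a minimal sketch of how these two metrics are conventionally computed for a binary toxic/non-toxic classifier, consider the following. The function name and the toy labels are hypothetical illustrations; they are not taken from the ToxiCR source code or dataset.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the toxic class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Hypothetical toy labels (1 = toxic, 0 = non-toxic); toxic comments are the minority.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

When the toxic class is a minority, accuracy is dominated by the many correctly classified non-toxic comments, whereas F1 depends only on how well the toxic class itself is detected. This is one common reason for an F1 score to sit below the accuracy figure.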

