Incivility in Open Source Projects: A Comprehensive Annotated Dataset of Locked GitHub Issue Threads

In the dynamic landscape of open source software (OSS) development, understanding and addressing incivility within issue discussions is crucial for fostering healthy and productive collaborations. This paper presents a curated dataset of 404 locked GitHub issue discussion threads and 5961 individual comments, collected from 213 OSS projects. We annotated the comments with various categories of incivility using Tone Bearing Discussion Features (TBDFs), and, for each issue thread, we annotated the triggers, targets, and consequences of incivility. We observed that Bitter frustration, Impatience, and Mocking are the most prevalent TBDFs in our dataset. The most common trigger, target, and consequence of incivility are Failed use of tool/code or error messages, People, and Discontinued further discussion, respectively. This dataset can serve as a valuable resource for analyzing incivility in OSS and improving automated tools to detect and mitigate such behavior.
