In an industry dominated by straight men, many developers of other gender identities and sexual orientations encounter hateful or discriminatory messages. Such communications pose barriers to participation for women and LGBTQ+ persons. Due to sheer volume, manually inspecting all communications for discriminatory content is infeasible in a large-scale Free/Libre Open-Source Software (FLOSS) community. To address this challenge, this study aims to develop an automated mechanism to identify Sexual orientation and Gender Identity Discriminatory (SGID) texts in software developers' communications. Toward this goal, we trained and evaluated SGID4SE (Sexual orientation and Gender Identity Discriminatory text identification for (4) Software Engineering texts), a supervised learning-based SGID detection tool. SGID4SE incorporates six preprocessing steps and ten state-of-the-art algorithms, and implements six different strategies to improve performance on the minority class. We empirically evaluated each strategy and identified an optimum configuration for each algorithm. In our ten-fold cross-validation-based evaluations, a BERT-based model achieves the best performance, with 85.9% precision, 80.0% recall, and 82.9% F1-score for the SGID class. This model also achieves 95.7% accuracy and an 80.4% Matthews Correlation Coefficient. Our dataset and tool establish a foundation for further research in this direction.
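The metrics reported above (precision, recall, and F1-score for the SGID class, plus accuracy and Matthews Correlation Coefficient) can all be derived from a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the SGID4SE implementation. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the study's dataset.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics for the positive (here: SGID) class.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # MCC uses all four cells, so it stays informative on imbalanced data,
    # where accuracy alone can look high even if the minority class is missed.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical, imbalanced toy counts (NOT the paper's data):
# 10 positive samples and 90 negative samples.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a small minority class, accuracy is dominated by the majority class, which is one reason to report per-class precision, recall, and F1 alongside MCC, as the abstract does.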