As AI-based systems become more widely used in our lives and society, there is a rising need to ensure that they are developed and used responsibly. Fairness is one of the socio-technical concerns that AI-based systems must address for this purpose. Unfair AI-based systems, and unfair AI-based mobile apps in particular, can pose difficulties for a significant proportion of the global populace. This paper analyzes in depth the fairness concerns raised in reviews of AI-based apps. We first manually constructed a ground-truth dataset containing a statistical sample of fairness and non-fairness reviews. Leveraging the ground-truth dataset, we then developed and evaluated a set of machine learning and deep learning classifiers that distinguish fairness reviews from non-fairness reviews. Our experiments show that our best-performing classifier detects fairness reviews with a precision of 94%. We then applied the best-performing classifier to approximately 9.5M reviews collected from 108 AI-based apps and identified around 92K fairness reviews. Although fairness reviews appear in 23 app categories, we found that the 'communication' and 'social' app categories have the highest percentage of fairness reviews. Next, applying the K-means clustering technique to the 92K fairness reviews, followed by manual analysis, led to the identification of six distinct types of fairness concerns (e.g., 'receiving different quality of features and services in different platforms and devices' and 'lack of transparency and fairness in dealing with user-generated content'). Finally, a manual analysis of 2,248 app owners' responses to the fairness reviews identified six root causes (e.g., 'copyright issues', 'external factors', 'development cost') that app owners cite to justify fairness concerns.
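To make the reported figures concrete, the following is a minimal sketch of how precision is computed and what the approximate numbers in the abstract imply. The toy labels and the `precision` helper are illustrative assumptions, not the study's data or code, and the estimate of true fairness reviews assumes that the 94% precision measured on the ground-truth dataset carries over to the full corpus, which the abstract does not state.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; precision is undefined, report 0.0
    return tp / (tp + fp)


# Toy labels (hypothetical, not from the paper): 1 = fairness review, 0 = non-fairness.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 0, 1, 1, 0]
toy_precision = precision(toy_true, toy_pred)

# Approximate figures taken from the abstract.
TOTAL_REVIEWS = 9_500_000    # "approximately 9.5M reviews"
FLAGGED_REVIEWS = 92_000     # "around 92K fairness reviews"
REPORTED_PRECISION = 0.94    # precision of the best-performing classifier

# Share of all collected reviews that the classifier flagged as fairness reviews.
flagged_share = FLAGGED_REVIEWS / TOTAL_REVIEWS

# Rough number of flagged reviews that are truly fairness reviews,
# ASSUMING the reported precision also holds on the full corpus.
expected_true = REPORTED_PRECISION * FLAGGED_REVIEWS

print(f"toy precision: {toy_precision:.2f}")
print(f"flagged share of corpus: {flagged_share:.2%}")
print(f"expected true fairness reviews (under the assumption): {expected_true:,.0f}")
```

Under these approximate figures, the flagged reviews make up just under 1% of the collected corpus. Precision alone says nothing about how many fairness reviews the classifier missed; that would require the recall, which is not given here.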