Personal informatics (PI) systems, powered by smartphones and wearables, enable people to lead healthier lifestyles by providing meaningful and actionable insights that break down barriers between users and their health information. Today, such systems are used by billions of users to monitor not only physical activity and sleep but also vital signs, women's health, and heart health, among others. Despite their widespread use, the processing of sensitive PI data may suffer from biases, which can have practical and ethical implications. In this work, we present the first comprehensive empirical and analytical study of bias in PI systems, covering biases in the raw data and across the entire machine learning life cycle. We use the most detailed framework to date for exploring the different sources of bias and find that biases exist in both the data generation stream and the model learning and implementation stream. According to our results, the most affected minority groups are users with health issues, such as diabetes, joint issues, and hypertension, and female users. Their data biases are propagated or even amplified by learning models, and intersectional biases can also be observed.