New data sources, and the artificial intelligence (AI) methods used to extract information from them, are becoming plentiful and increasingly relevant to decision-making in many societal applications. An important example is street view imagery, which is available in over 100 countries and has been considered for applications such as assessing aspects of the built environment in relation to community health outcomes. In such uses, bias can arise when data-driven decisions fail to account for the robustness of the data, or when predictions rest on spurious correlations. To study this risk, we use 2.02 million Google Street View (GSV) images along with health, demographic, and socioeconomic data from New York City. First, we demonstrate that built environment characteristics inferred from GSV labels at the intra-city level may align poorly with the ground truth. Second, we find that the average individual-level behavior of physical inactivity significantly mediates the impact of built environment features, as measured through GSV, at the census tract level. Finally, using a causal framework that accounts for these mediators of environmental impacts on health, we find that altering 10% of samples in the two lowest tertiles would produce a decrease in the prevalence of obesity or diabetes that is 4.17 (95% CI 3.84 to 4.55) or 17.2 (95% CI 14.4 to 21.3) times larger, respectively, than the same proportional intervention on the number of crosswalks by census tract. This work illustrates important issues of robustness and model specification for informing the effective allocation of interventions using new data sources.
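The mediation claim above can be made concrete with a small sketch. The code below is not the authors' causal framework. It is a minimal linear (product-of-coefficients) mediation decomposition on synthetic data, in which the variable names `crosswalks`, `inactivity`, and `obesity`, the helper functions `ols` and `mediation_effects`, and all effect sizes are assumptions chosen purely for illustration.

```python
import numpy as np


def ols(y, *cols):
    """Least-squares fit of y on an intercept plus the given columns.

    Returns the coefficient vector [intercept, b1, b2, ...].
    """
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mediation_effects(x, m, y):
    """Linear mediation decomposition for exposure x, mediator m, outcome y."""
    a = ols(m, x)[1]              # effect of the exposure on the mediator
    _, direct, b = ols(y, x, m)   # direct effect of x, and effect of m given x
    total = ols(y, x)[1]          # total effect of the exposure on the outcome
    return {"indirect": a * b, "direct": direct, "total": total}


# Synthetic, standardized data (hypothetical effect sizes, not from the paper).
rng = np.random.default_rng(0)
n = 5000
crosswalks = rng.normal(size=n)                        # built environment feature
inactivity = -0.5 * crosswalks + rng.normal(size=n)    # mediator
obesity = 0.8 * inactivity - 0.1 * crosswalks + rng.normal(size=n)

effects = mediation_effects(crosswalks, inactivity, obesity)
for name, value in effects.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup most of the association between the built environment feature and the outcome flows through the mediator, which is the kind of pattern that makes an intervention on the mediator more effective than the same proportional intervention on the feature itself. For ordinary least squares, the total effect equals the direct effect plus the indirect effect exactly.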