Machine learning (ML) models are increasingly used in various applications, from recommendation systems in e-commerce to diagnosis prediction in healthcare. In this paper, we present a novel dynamic framework for thinking about the deployment of ML models in a performative, human-ML collaborative system. In our framework, the introduction of ML recommendations changes the data-generating process of human decisions, which are only a proxy for the ground truth and which are then used to train future versions of the model. We show that this dynamic process can, in principle, converge to different stable points, i.e., points at which the ML model and the Human+ML system have the same performance. Some of these stable points are suboptimal with respect to the actual ground truth. As a proof of concept, we conduct an empirical user study with 1,408 participants. In the study, humans solve instances of the knapsack problem with the help of machine learning predictions of varying performance. This is an ideal setting because we can identify the actual ground truth and evaluate the performance of human decisions supported by ML recommendations. We find that for many levels of ML performance, humans can improve upon the ML predictions. We also find that the improvement could be even higher if humans rationally followed the ML recommendations. Finally, we test whether monetary incentives can increase the quality of human decisions, but we fail to find any positive effect. Using our empirical data to approximate our collaborative system suggests that the learning process would dynamically reach an equilibrium performance that is around 92% of the maximum knapsack value. Our results have practical implications for the deployment of ML models in contexts where human decisions may deviate from the indisputable ground truth.
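The retraining dynamics described above can be sketched as a simple fixed-point iteration: the next model is trained on the decisions the Human+ML system produces under the current model, so model quality at step t+1 equals Human+ML quality at step t. The human-response curve `human_plus_ml` below is a hypothetical placeholder, not the one estimated in the paper; it merely assumes human decisions pull performance toward 92% of the optimum, the equilibrium level the abstract reports, to illustrate how such a process settles at a stable point.

```python
def human_plus_ml(q, equilibrium=0.92, pull=0.5):
    """Hypothetical performance of human decisions given ML recommendations
    of quality q; humans partially correct the model toward `equilibrium`."""
    return q + pull * (equilibrium - q)

def iterate(q0, steps=50):
    """Repeatedly retrain on human decisions: the next model's quality is
    the current Human+ML system's quality."""
    q = q0
    history = [q]
    for _ in range(steps):
        q = human_plus_ml(q)  # model retrained on human-generated labels
        history.append(q)
    return history

traj = iterate(0.70)
print(round(traj[-1], 4))  # converges to the fixed point q* = 0.92
```

A stable point is any q* with human_plus_ml(q*) = q*; whether that point is optimal depends on the shape of the human-response curve, which is exactly what the user study measures empirically.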