Machine learning (ML) models are increasingly used in applications ranging from recommendation systems in e-commerce to diagnosis prediction in healthcare. In this paper, we present a novel dynamic framework for reasoning about the deployment of ML models in a performative, human-ML collaborative system. In our framework, the introduction of ML recommendations changes the data-generating process of human decisions, which are only a proxy for the ground truth and which are then used to train future versions of the model. We show that this dynamic process can, in principle, converge to different stable points, i.e., points at which the ML model and the Human+ML system achieve the same performance. Some of these stable points are suboptimal with respect to the actual ground truth. We conduct an empirical user study with 1,408 participants to showcase this process. In the study, humans solve instances of the knapsack problem with the help of machine learning predictions. This is an ideal setting because we can observe how ML models learn to imitate human decisions and how this learning process converges to a stable point. We find that, for many levels of ML performance, humans can improve on the ML predictions and dynamically reach an equilibrium performance that is around 92% of the maximum knapsack value. We also find that the equilibrium performance could be even higher if humans rationally followed the ML recommendations. Finally, we test whether monetary incentives can increase the quality of human decisions, but we find no positive effect. Our results have practical implications for the deployment of ML models in contexts where human decisions may deviate from the indisputable ground truth.
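The retraining dynamic described above can be illustrated with a toy simulation. This is only an illustrative sketch, not the paper's model: the `human_competence` ceiling (0.92, matching the reported equilibrium) and the `pull` rate are assumed parameters. Each round, humans adjust the Human+ML performance toward their own competence level, and the next model is trained to imitate those human decisions, so model performance tracks human performance until the two coincide at a stable point.

```python
# Toy simulation of the performative retraining loop (illustrative only;
# the dynamics and parameters are assumptions, not the paper's model).
# Performance is measured as a fraction of the maximum knapsack value,
# so the ground-truth optimum is 1.0.

def human_step(model_perf, human_competence=0.92, pull=0.5):
    """Humans working with the model move joint performance toward their
    own competence ceiling (both parameters are hypothetical)."""
    return model_perf + pull * (human_competence - model_perf)

def simulate(initial_model_perf, rounds=50):
    """Iterate: humans decide with ML help, then the next model is
    trained to imitate those human decisions."""
    perf = initial_model_perf
    trajectory = [perf]
    for _ in range(rounds):
        human_perf = human_step(perf)  # Human+ML decision quality
        perf = human_perf              # next model imitates human decisions
        trajectory.append(perf)
    return trajectory

# Whether the initial model is weak (0.5) or near-optimal (0.99), the
# process converges to the same suboptimal stable point (0.92 < 1.0).
low = simulate(0.5)
high = simulate(0.99)
```

Under these assumed dynamics, the stable point is exactly the humans' competence ceiling, which is why the equilibrium sits below the ground-truth optimum even when the initial model is nearly perfect.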