Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets (i.e., "slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation of whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) in which we show users 40 slices output by two state-of-the-art slice discovery algorithms and ask them to form hypotheses about where an object detection model underperforms. Our results offer positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges users face during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when designing and evaluating new tools for slice discovery.