Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (that is, for which groups) their model underperforms. We conduct a controlled user study (N = 15) in which we show users 40 slices output by two state-of-the-art slice discovery algorithms and ask them to form hypotheses about an object detection model. Our results offer positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when creating and evaluating new tools for slice discovery.