Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about an object detection model. Our results provide positive evidence that these tools offer some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when creating and evaluating new tools for slice discovery.

